Diff-Foley: Synchronized Video-to-Audio Synthesis with Latent Diffusion Models
The real gain comes not from diffusion alone but from aligned conditioning (CAVP) combined with double guidance, which together push alignment accuracy to 94.05% on VGGSound.
Reading guidance
- Verdict
- full-text draft · priority medium · confidence high
- Why it matters
- Strong neural Foley paper, outside SSI: its main contribution is aligned-feature conditioning for diffusion-based V2A, with unusually good synchronization numbers for the time.
- What to trust
- Basis: full text + structured benchmark + summary. Coverage: high. 4 evidence records back the review.
- What is weak
- Diffusion remains heavier than GAN baselines, and the paper explicitly says scalability to super-large datasets is still untested. Main quantitative evidence is on VGGSound, with downstream EPIC-Kitchens evidence mostly qualitative; the work is not evaluated for SSI or human communication outcomes. Promising for offline production assistance, but still too heavy and domain-specific for lightweight deployment claims. Video-to-audio Foley synthesis only; outside SSI scope. Overclaim risk: medium-low.
- Read before
- SSI review rubric
- Read next
- SSI archive
Axes
- Task
- video-to-audio generation
- Modality
- video
- Output
- audio
- Metrics
- Table 1: with double guidance the model reaches IS 62.37, FID 9.87, KL 6.43, and Align Acc 94.05%, with 0.38 s average inference time per sample using DPM-Solver at 25 steps. Table 5: scaling Stage 1 to VGGSound+AudioSet-V2A pushes Align Acc to 94.78% under DDIM evaluation.
- Evaluation mode
- VGGSound quantitative evaluation, ablations on guidance and pretraining scale, sampler-speed study, and downstream fine-tuning on EPIC-Kitchens
- Review confidence
- high
- Overclaim risk
- medium-low
Expert take
The full text shows a more complete system than the earlier draft suggested. CAVP makes the visual features carry audio-related timing information before generation starts, and double guidance (classifier-free guidance plus classifier guidance) then sharpens sampling quality. The result is strong on the paper's chosen benchmark: IS roughly doubles relative to SpecVQGAN (62.37 vs 30.01) and alignment accuracy reaches 94.05%, while using only 4 FPS video and fast DPM-Solver sampling. The gain is not uniform, though: in Table 1 SpecVQGAN still has the lower FID (8.93 vs 9.87) and Im2Wav the lower KL (5.20 vs 6.43). The downstream EPIC-Kitchens section also makes the generalization claim more credible. The main caveat is that this is still a compute-heavy neural Foley stack, not a silent-speech system, and the authors explicitly say that scaling to billion-scale datasets is untested.
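To make the double-guidance idea concrete, here is a minimal sketch of one common way to combine classifier-free guidance with classifier guidance in a single noise estimate. It is an illustration, not the paper's code: the function name `double_guided_noise`, the flat-list representation of the latent, and the `sigma_t` argument (standing for sqrt(1 - alpha_bar_t)) are assumptions made here. Only the default scales ω = 4.5 and γ = 50 come from the Table 1 caption.

```python
def double_guided_noise(eps_cond, eps_uncond, classifier_grad, sigma_t,
                        omega=4.5, gamma=50.0):
    """Combine classifier-free and classifier guidance for one denoising step.

    eps_cond        -- noise predicted with the visual condition (list of floats)
    eps_uncond      -- noise predicted with the condition dropped
    classifier_grad -- gradient of an alignment classifier's log-probability
                       with respect to the noisy latent (hypothetical input)
    sigma_t         -- sqrt(1 - alpha_bar_t) at the current timestep
    omega, gamma    -- CFG and classifier-guidance scales (Table 1 caption values)
    """
    guided = []
    for e_c, e_u, g in zip(eps_cond, eps_uncond, classifier_grad):
        # Classifier-free guidance: extrapolate from unconditional to conditional.
        cfg = e_u + omega * (e_c - e_u)
        # Classifier guidance: shift the estimate along the classifier gradient.
        guided.append(cfg - gamma * sigma_t * g)
    return guided


# Toy usage on a 3-element "latent".
example = double_guided_noise(
    eps_cond=[1.0, 0.5, -0.2],
    eps_uncond=[0.0, 0.4, -0.1],
    classifier_grad=[0.01, 0.0, -0.02],
    sigma_t=0.5,
)
```

Setting `gamma=0` reduces this to plain classifier-free guidance, which corresponds to the "CFG" row of Table 1 as opposed to the "Double" row.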
True value
Strong neural Foley paper, outside SSI: its main contribution is aligned-feature conditioning for diffusion-based V2A, with unusually good synchronization numbers for the time.
What changed
Canon before
Earlier V2A systems improved semantic relevance but struggled to make generated sounds temporally align with what the video was actually doing.
Delta from canon
The paper treats audio-visual alignment as a first-class pretraining objective and pairs it with latent diffusion plus double guidance to raise synchronization and quality together.
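As a rough illustration of what an audio-visual alignment pretraining objective can look like, the sketch below implements a generic symmetric InfoNCE loss over paired video and audio embeddings. This is an assumption-laden stand-in, not CAVP itself: the paper's exact objective, including how negatives are sampled across and within videos, is not reproduced here, and the function names and the `temperature` default are hypothetical.

```python
import math


def _dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def _normalize(v):
    # Assumes a non-zero vector.
    norm = math.sqrt(_dot(v, v))
    return [x / norm for x in v]


def _log_softmax_at(logits, idx):
    # Numerically stable log-softmax evaluated at one index.
    m = max(logits)
    lse = m + math.log(sum(math.exp(l - m) for l in logits))
    return logits[idx] - lse


def symmetric_info_nce(video_emb, audio_emb, temperature=0.07):
    """Symmetric contrastive loss; pair i of video_emb matches pair i of audio_emb."""
    v = [_normalize(x) for x in video_emb]
    a = [_normalize(x) for x in audio_emb]
    n = len(v)
    # Cosine similarity matrix scaled by the temperature.
    sim = [[_dot(v[i], a[j]) / temperature for j in range(n)] for i in range(n)]
    # Video-to-audio: each row should peak on its diagonal entry.
    loss_v2a = -sum(_log_softmax_at(sim[i], i) for i in range(n)) / n
    # Audio-to-video: each column should peak on its diagonal entry.
    loss_a2v = -sum(
        _log_softmax_at([sim[i][j] for i in range(n)], j) for j in range(n)
    ) / n
    return 0.5 * (loss_v2a + loss_a2v)


# Matched pairs give a much lower loss than mismatched pairs.
aligned_loss = symmetric_info_nce([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
shuffled_loss = symmetric_info_nce([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])
```

In a temporally aware variant, the batch would also contain audio and video segments from different time offsets of the same clip, so that the loss penalizes features that are semantically right but temporally wrong.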
Position in field
Competitive V2A foundation-style model for synchronized Foley generation.
Evidence
“ We adopt contrastive audio-visual pretraining (CAVP) to learn more temporally and semantically aligned features, then train an LDM with CAVP-aligned visual features on spectrogram latent space. ”
author_claim · Abstract · confidence 0.99
“ Table 1: Video-to-Audio generation evaluation results with CFG scale ω = 4.5, CG scale γ = 50, using DPM-Solver [28] Sampler with 25 inference steps.

| Model | Visual features | FPS | Guidance | IS ↑ | FID ↓ | KL ↓ | ACC (%) ↑ | Time ↓ |
| SpecVQGAN [19] | RGB + Flow | 21.5 | ✘ | 30.01 | 8.93 | 6.93 | 52.94 | 5.47s |
| SpecVQGAN [19] | ResNet50 | 21.5 | ✘ | 30.80 | 9.70 | 7.03 | 49.19 | 5.47s |
| Im2Wav [39] | CLIP | 30 | CFG (✔) | 39.30 | 11.44 | 5.20 | 67.40 | 6.41s |
| Diff-Foley (Ours) | CAVP | 4 | CFG (✔) | 53.34 | 11.22 | 6.36 | 92.67 | 0.38s |
| Diff-Foley (Ours) | CAVP | 4 | Double (✔✔) | 62.37 | 9.87 | 6.43 | 94.05 | 0.38s | ”
metric · Table 1 · confidence 0.99
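The Table 1 record above mixes higher-is-better and lower-is-better metrics, so a quick per-metric check helps when reading it. The snippet below is only a reading aid built from the quoted numbers; the dictionary layout and the short row labels are choices made here, not the paper's.

```python
# Values copied from the Table 1 record; times are seconds per sample.
table1 = {
    "SpecVQGAN (RGB+Flow)": {"IS": 30.01, "FID": 8.93, "KL": 6.93, "ACC": 52.94, "Time": 5.47},
    "SpecVQGAN (ResNet50)": {"IS": 30.80, "FID": 9.70, "KL": 7.03, "ACC": 49.19, "Time": 5.47},
    "Im2Wav": {"IS": 39.30, "FID": 11.44, "KL": 5.20, "ACC": 67.40, "Time": 6.41},
    "Diff-Foley (CFG)": {"IS": 53.34, "FID": 11.22, "KL": 6.36, "ACC": 92.67, "Time": 0.38},
    "Diff-Foley (Double)": {"IS": 62.37, "FID": 9.87, "KL": 6.43, "ACC": 94.05, "Time": 0.38},
}

# True means higher is better, False means lower is better (per the table arrows).
higher_is_better = {"IS": True, "FID": False, "KL": False, "ACC": True, "Time": False}

best = {}
for metric, higher in higher_is_better.items():
    pick = max if higher else min
    # On ties, the first row in table order is returned.
    best[metric] = pick(table1, key=lambda name: table1[name][metric])

# Inference-time ratio between SpecVQGAN and Diff-Foley with double guidance.
speedup = table1["SpecVQGAN (RGB+Flow)"]["Time"] / table1["Diff-Foley (Double)"]["Time"]
```

Diff-Foley with double guidance leads on IS and alignment accuracy, while the lowest FID belongs to SpecVQGAN (RGB + Flow) and the lowest KL to Im2Wav; the time ratio against SpecVQGAN works out to roughly 14x.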
“ The generated sounds closely match the ground truth, especially in terms of timing, such as knife cutting, water flow, and plate clinking (refer to Generated Audio and Ground Truth Audio). ”
validation_scope · 4.2 Downstream Finetuning · confidence 0.97
“ Limitations: Diff-Foley has shown great audio-visual synchronization on VGGSound and EPIC-Kitchens, however its scalability on super large (billion-scale) datasets remains untested due to limited … ”
limitation · 5 Limitations and Broader Impact · confidence 0.96
Limits
Technical limits
Diffusion remains heavier than GAN baselines, and the paper explicitly says scalability to super-large datasets is still untested.
Evaluation limits
Main quantitative evidence is on VGGSound, with downstream EPIC-Kitchens evidence mostly qualitative; the work is not evaluated for SSI or human communication outcomes.
Deployment limits
Promising for offline production assistance, but still too heavy and domain-specific for lightweight deployment claims.
Scope limits
Video-to-audio Foley synthesis only; outside SSI scope.