2023 · arXiv / imported corpus page · Field expert review · confidence high

Diff-Foley: Synchronized Video-to-Audio Synthesis with Latent Diffusion Models

Simian Luo, Chuanhao Yan, Chenxu Hu, Hang Zhao

arXiv

The real gain comes not from diffusion alone but from aligned conditioning plus guidance, which together push alignment accuracy to 94.05% on VGGSound.

Verdict: full-text draft · Priority: medium · Confidence: high · Basis: full text + structured benchmark + summary · Coverage: high

Reading guidance

Verdict: full-text draft · priority medium · confidence high
Why it matters: Strong neural Foley paper, outside SSI: its main contribution is aligned-feature conditioning for diffusion-based V2A, with unusually good synchronization numbers for the time.
What to trust: Basis: full text + structured benchmark + summary. Coverage: high. 4 evidence records back the review.
What is weak: Diffusion remains heavier than GAN baselines, and the paper explicitly says scalability to super-large datasets is still untested. Main quantitative evidence is on VGGSound, with downstream EPIC-Kitchens evidence mostly qualitative; the work is not evaluated for SSI or human communication outcomes. Promising for offline production assistance, but still too heavy and domain-specific for lightweight deployment claims. Video-to-audio Foley synthesis only; outside SSI scope. Overclaim risk: medium-low.
Read before: SSI review rubric
Read next: SSI archive

Axes

Task: video-to-audio generation
Modality: video
Output: audio
Metrics: Table 1: with double guidance the model reaches IS 62.37, FID 9.87, KL 6.43, and Align Acc 94.05%, at 0.38s average inference time per sample with DPM-Solver at 25 steps. Table 5: scaling Stage 1 to VGGSound+AudioSet-V2A pushes Align Acc to 94.78% under DDIM evaluation.
Evaluation mode: VGGSound quantitative evaluation, ablations on guidance and pretraining scale, sampler-speed study, and downstream fine-tuning on EPIC-Kitchens
Review confidence: high
Overclaim risk: medium-low

Expert take

The full text shows a more complete system than the earlier draft suggested. CAVP (contrastive audio-visual pretraining) makes the visual features carry audio-related timing information before generation starts, and double guidance (classifier-free plus classifier guidance, per the Table 1 caption) then sharpens sampling quality. The result is strong on the paper's chosen benchmarks: IS roughly doubles over SpecVQGAN (62.37 vs. 30.01) and alignment accuracy reaches 94.05%, using only 4 FPS video and DPM-Solver sampling at 0.38s per sample. The gains are not uniform, though: SpecVQGAN still has the lower FID (8.93 vs. 9.87) and Im2Wav the lower KL (5.20 vs. 6.43). The downstream EPIC-Kitchens section makes the generalization claim more credible, although that evidence is mostly qualitative. The main caveat is that this is still a compute-heavy neural Foley stack, not silent speech, and the authors explicitly say scaling to billion-scale datasets is untested.
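The size of these gaps can be checked directly from the Table 1 figures quoted in the Evidence section. The snippet below is only a reviewer-side arithmetic sketch; the dictionary layout and the `compare` helper are illustrative names, not anything from the paper.

```python
# Values copied from Table 1 as quoted in the Evidence section of this review.
table1 = {
    "SpecVQGAN (RGB + Flow)": {"IS": 30.01, "FID": 8.93, "KL": 6.93, "ACC": 52.94, "time_s": 5.47},
    "Im2Wav": {"IS": 39.30, "FID": 11.44, "KL": 5.20, "ACC": 67.40, "time_s": 6.41},
    "Diff-Foley (double guidance)": {"IS": 62.37, "FID": 9.87, "KL": 6.43, "ACC": 94.05, "time_s": 0.38},
}


def compare(ours, baseline):
    """Summarise how `ours` differs from `baseline` (hypothetical helper)."""
    return {
        "is_ratio": ours["IS"] / baseline["IS"],             # higher IS is better
        "acc_gain_points": ours["ACC"] - baseline["ACC"],    # percentage points
        "speedup": baseline["time_s"] / ours["time_s"],      # >1 means ours is faster
        "fid_delta": ours["FID"] - baseline["FID"],          # >0 means ours is worse
        "kl_delta": ours["KL"] - baseline["KL"],             # >0 means ours is worse
    }


ours = table1["Diff-Foley (double guidance)"]
vs_specvqgan = compare(ours, table1["SpecVQGAN (RGB + Flow)"])
vs_im2wav = compare(ours, table1["Im2Wav"])
```

Against SpecVQGAN this gives roughly a 2.1x IS ratio, a gain of about 41 alignment-accuracy points, and about a 14x speed-up, while FID is 0.94 higher (worse). Against Im2Wav, KL is 1.23 higher (worse), which is why the headline claim is best read as a synchronization and IS result rather than an across-the-board win.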

True value

Strong neural Foley paper, outside SSI: its main contribution is aligned-feature conditioning for diffusion-based V2A, with unusually good synchronization numbers for the time.

What changed

Canon before

Earlier V2A systems improved semantic relevance but struggled to make generated sounds temporally align with what the video was actually doing.

Delta from canon

The paper treats audio-visual alignment as a first-class pretraining objective and pairs it with latent diffusion plus double guidance to raise synchronization and quality together.

Position in field

Competitive V2A foundation-style model for synchronized Foley generation.

Evidence

“ We adopt contrastive audio-visual pretraining (CAVP) to learn more temporally and semantically aligned features, then train an LDM with CAVP-aligned visual features on spectrogram latent space. ”

author_claim · Abstract · confidence 0.99
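To make the idea of contrastive audio-visual pretraining concrete, here is a minimal sketch of a symmetric InfoNCE-style loss over paired audio and video embeddings. It is a generic contrastive objective written in plain Python, not the paper's CAVP implementation: the function names, the temperature of 0.07, and the toy embeddings are all assumptions made for illustration, and the paper's actual objective may be structured differently.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def diag_cross_entropy(logits):
    """Mean cross-entropy where the correct class for row i is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        m = max(row)  # subtract the max for a numerically stable log-sum-exp
        log_z = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_z - row[i]
    return total / len(logits)


def symmetric_info_nce(audio_emb, video_emb, temperature=0.07):
    """Generic symmetric contrastive loss; matched pairs share an index.

    The temperature value is an assumed default, not taken from the paper.
    """
    a = [l2_normalize(v) for v in audio_emb]
    v = [l2_normalize(x) for x in video_emb]
    # Cosine-similarity logits: rows are audio clips, columns are video clips.
    logits = [[dot(ai, vj) / temperature for vj in v] for ai in a]
    logits_t = [list(col) for col in zip(*logits)]
    # Average the audio-to-video and video-to-audio directions.
    return 0.5 * (diag_cross_entropy(logits) + diag_cross_entropy(logits_t))


# Toy check: matched pairs should score a lower loss than swapped pairs.
audio = [[1.0, 0.0], [0.0, 1.0]]
loss_matched = symmetric_info_nce(audio, [[1.0, 0.0], [0.0, 1.0]])
loss_swapped = symmetric_info_nce(audio, [[0.0, 1.0], [1.0, 0.0]])
```

The loss falls when each audio clip is most similar to its own video clip, which is the property that lets the downstream diffusion model condition on visual features that already encode audio-relevant content.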

“
Model | Visual features | FPS | Guidance | IS ↑ | FID ↓ | KL ↓ | ACC (%) ↑ | TIME ↓
SpecVQGAN [19] | RGB + Flow | 21.5 | ✘ | 30.01 | 8.93 | 6.93 | 52.94 | 5.47s
SpecVQGAN [19] | ResNet50 | 21.5 | ✘ | 30.80 | 9.70 | 7.03 | 49.19 | 5.47s
Im2Wav [39] | CLIP | 30 | CFG (✔) | 39.30 | 11.44 | 5.20 | 67.40 | 6.41s
DIFF-FOLEY (Ours) | CAVP | 4 | CFG (✔) | 53.34 | 11.22 | 6.36 | 92.67 | 0.38s
DIFF-FOLEY (Ours) | CAVP | 4 | Double (✔✔) | 62.37 | 9.87 | 6.43 | 94.05 | 0.38s
Table 1: Video-to-Audio generation evaluation results with CFG scale ω = 4.5, CG scale γ = 50, using DPM-Solver [28] Sampler with 25 inference steps. ”

metric · Table 1 · confidence 0.99
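The Table 1 caption reports a CFG scale ω = 4.5 and a CG scale γ = 50. As a rough illustration of how two guidance terms can be combined at one sampling step, the sketch below uses the standard textbook forms of classifier-free guidance and classifier guidance. It is an assumption that the paper's double guidance follows this shape; `double_guidance_eps`, `align_grad`, and `sigma_t` are hypothetical names, and the gradient is taken as given rather than computed by a real alignment classifier.

```python
def double_guidance_eps(eps_cond, eps_uncond, align_grad, sigma_t,
                        cfg_scale=4.5, cg_scale=50.0):
    """Combine classifier-free and classifier guidance on a noise prediction.

    eps_cond   : noise predicted with the video condition
    eps_uncond : noise predicted with the condition dropped
    align_grad : gradient of log p(aligned | z_t) w.r.t. z_t, assumed to come
                 from a separate alignment classifier (hypothetical here)
    sigma_t    : noise level at this step, sqrt(1 - alpha_bar_t) in DDPM notation
    The default scales mirror the values in the Table 1 caption.
    """
    return [
        # Classifier-free term extrapolates from unconditional toward conditional;
        # classifier term shifts the prediction along the alignment gradient.
        eu + cfg_scale * (ec - eu) - cg_scale * sigma_t * g
        for ec, eu, g in zip(eps_cond, eps_uncond, align_grad)
    ]


# One-dimensional toy step with made-up numbers.
guided = double_guidance_eps([1.0], [0.0], [0.01], sigma_t=0.5)

# With cfg_scale=1 and cg_scale=0 the guidance vanishes and eps_cond is returned.
plain = double_guidance_eps([0.3, -0.2], [0.1, 0.4], [5.0, 5.0], sigma_t=0.5,
                            cfg_scale=1.0, cg_scale=0.0)
```

In this form the two scales are independent knobs, which matches the way the table separates the CFG-only row from the double-guidance row: the second row adds the classifier term on top of the first.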

“ The generated sounds closely match the ground truth, especially in terms of timing, such as knife cutting, water flow, and plate clinking (refer to Generated Audio and Ground Truth Audio). ”

validation_scope · 4.2 Downstream Finetuning · confidence 0.97

“ 5 Limitations and Broader Impact. Limitations: DIFF-FOLEY has shown great audio-visual synchronization on VGGSound and EPIC-Kitchens, however its scalability on super large (billion-scale) datasets remains untested due to limited ”

limitation · 5 Limitations and Broader Impact · confidence 0.96

Limits

Technical limits

Diffusion remains heavier than GAN baselines, and the paper explicitly says scalability to super-large datasets is still untested.

Evaluation limits

Main quantitative evidence is on VGGSound, with downstream EPIC-Kitchens evidence mostly qualitative; the work is not evaluated for SSI or human communication outcomes.

Deployment limits

Promising for offline production assistance, but still too heavy and domain-specific for lightweight deployment claims.

Scope limits

Video-to-audio Foley synthesis only; outside SSI scope.