2023 · arXiv / imported corpus page · Field expert review · confidence high

Conditional Generation of Audio from Video via Foley Analogies

Yuexi Du, Ziyang Chen, Justin Salamon, Bryan Russell, Andrew Owens

arXiv

The paper matters because it gives V2A generation a controllable exemplar, not because it beats every timing baseline.

Verdict: full-text draft · Priority: medium · Confidence: high · Basis: full text + structured benchmark + summary · Coverage: high

Reading guidance

Verdict: full-text draft · priority medium · confidence high
Why it matters: The durable contribution is controlled neural Foley, not SSI: the paper makes exemplar-conditioned soundtrack generation plausible enough for assistive sound-design workflows, but timing fidelity remains modest.
What to trust: Basis: full text + structured benchmark + summary. Coverage: high. Four evidence records back the review.
What is weak: Synchronization is still fragile, generation quality depends on sampling and re-ranking, and the pipeline is aimed at soundtrack synthesis rather than communication. The strongest quantitative evidence comes from Greatest Hits; CountixAV and wild-video results are qualitative, and onset transfer still wins the synchronization subtask. This is an offline creative tool pipeline rather than a real-time SSI system. Conditional Foley generation only; outside SSI scope. Overclaim risk: medium.
Read before: SSI review rubric
Read next: SSI archive

Axes

Task: conditional foley generation
Modality: video + audio conditioning
Output: audio
Metrics: Table 1: Ours w/ re-rank reaches 44.0% overall material accuracy, 66.7% overall action accuracy, 25.3% onset-count accuracy, and 54.3 AP on onset synchronization, all on Greatest Hits. Table 2: in the human study, re-ranked outputs are preferred over the base model 54.3% of the time on material and 53.8% of the time on synchronization.
Evaluation mode: Greatest Hits quantitative evaluation, Amazon Mechanical Turk perceptual study, and qualitative transfer to CountixAV and in-the-wild videos
Review confidence: high
Overclaim risk: medium
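
The onset-count accuracy listed under Metrics can be made concrete with a minimal sketch. It assumes, as an illustration only, that a generated clip counts as correct when its number of detected onsets equals the reference clip's. The function name and the exact-match rule are hypothetical and are not taken from the paper's evaluation code.

```python
from typing import Sequence


def onset_count_accuracy(
    predicted_counts: Sequence[int],
    reference_counts: Sequence[int],
) -> float:
    """Fraction of clips whose predicted onset count equals the reference count.

    Assumption: exact-match scoring per clip. The paper's metric may differ.
    """
    if len(predicted_counts) != len(reference_counts):
        raise ValueError("predicted and reference lists must have equal length")
    if not predicted_counts:
        raise ValueError("need at least one clip")
    # A clip is a hit only when the two counts agree exactly.
    matches = sum(p == r for p, r in zip(predicted_counts, reference_counts))
    return matches / len(predicted_counts)


# Toy usage: two of four clips have the right number of onsets.
accuracy = onset_count_accuracy([3, 2, 5, 1], [3, 2, 4, 0])
```

Under this reading, a score of 25.3% would mean roughly one clip in four reproduces the reference number of impacts, which is consistent with the review's point that timing fidelity is modest.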

Expert take

The full text supports a more precise reading than the summary-only draft. The important move is the training formulation: two clips from the same source video create self-supervised conditioning pairs, and test-time re-ranking uses a separate sync model to choose from many generations. That yields real control over material/timbre cues, but the quantitative story is mixed. The re-ranked model improves material and action metrics over unconditional variants, yet the onset transfer baseline still wins on synchronization because it is engineered for that subproblem. So the paper is a meaningful controllable Foley result, but not a solved synchronization paper and not SSI work.
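
The test-time re-ranking step described above can be sketched as a best-of-N selection. This is a minimal illustration, not the authors' implementation: `rerank_by_sync` and `toy_sync_score` are hypothetical names, and the toy score (negative mean distance from each video onset to the nearest audio onset) merely stands in for the learned synchronization model.

```python
from typing import Callable, List, Sequence, Tuple, TypeVar

Audio = TypeVar("Audio")


def rerank_by_sync(
    candidates: Sequence[Audio],
    sync_score: Callable[[Audio], float],
) -> Tuple[Audio, List[float]]:
    """Score every sampled generation and return the best one plus all scores."""
    if not candidates:
        raise ValueError("need at least one candidate generation")
    scores = [sync_score(candidate) for candidate in candidates]
    best_index = max(range(len(scores)), key=scores.__getitem__)
    return candidates[best_index], scores


def toy_sync_score(audio_onsets: Sequence[float], video_onsets: Sequence[float]) -> float:
    """Hypothetical stand-in for a learned sync model: higher means better aligned."""
    if not audio_onsets or not video_onsets:
        return float("-inf")
    # Mean distance (seconds) from each visual event to the closest audio onset.
    total = sum(min(abs(v - a) for a in audio_onsets) for v in video_onsets)
    return -total / len(video_onsets)


# Toy usage: three sampled "soundtracks", represented only by their onset times.
video_onsets = [0.5, 1.0, 1.5]
candidates = [[0.1, 0.9], [0.5, 1.0, 1.6], [2.0]]
best, scores = rerank_by_sync(candidates, lambda a: toy_sync_score(a, video_onsets))
```

The design point is that the generator and the selector are decoupled: output quality then depends on how many candidates are sampled and on how reliable the scoring model is, which is why the review lists dependence on sampling and re-ranking as a technical limit.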

True value

The durable contribution is controlled neural Foley, not SSI: the paper makes exemplar-conditioned soundtrack generation plausible enough for assistive sound-design workflows, but timing fidelity remains modest.

What changed

Canon before

Prior video-to-audio systems predicted a video's co-occurring sound but gave little artist control over what the result should sound like.

Delta from canon

It reframes Foley as analogy-based conditional generation and adds a self-supervised training recipe plus sync-based re-ranking to make exemplar control usable at inference.
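
The self-supervised recipe can be illustrated by how a training pair might be drawn from a single video: one clip serves as the conditional example, the other as the target whose audio is predicted. The sketch below is an assumption-laden illustration. `sample_clip_pair`, the uniform sampling, and the non-overlap constraint are hypothetical choices, not details confirmed by the paper.

```python
import random
from typing import Optional, Tuple


def sample_clip_pair(
    video_duration: float,
    clip_duration: float,
    rng: Optional[random.Random] = None,
) -> Tuple[float, float]:
    """Return (conditional_start, target_start) in seconds for two clips of one video.

    Assumption: the clips must not overlap, so the target audio cannot be
    copied directly from the conditional example.
    """
    rng = rng or random.Random()
    if video_duration < 2 * clip_duration:
        raise ValueError("video too short for two non-overlapping clips")
    # Place the earlier clip so that a later clip still fits after it.
    first = rng.uniform(0.0, video_duration - 2 * clip_duration)
    second = rng.uniform(first + clip_duration, video_duration - clip_duration)
    # Randomly decide which clip plays the conditional role.
    if rng.random() < 0.5:
        return first, second
    return second, first


# Toy usage: a 10-second video and 2-second clips.
cond_start, target_start = sample_clip_pair(10.0, 2.0, random.Random(0))
```

Because both clips come from the same recording, they tend to share materials and acoustics, which is what lets the conditional example carry timbre information without any manual labels.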

Position in field

Early strong conditional Foley paper that adds user control to video-to-audio generation.

Evidence

“ We show through human studies and automated evaluation metrics that our model successfully generates sound from videos, while varying its output according to the content of a supplied example. ” “ […] soundtrack for an input silent video from a user-provided conditional audio-visual example that specifies what the input video should “sound like.” The generated soundtrack […] ”

author_claim · Abstract · confidence 0.99

“ […] 1, the accuracy in capturing the correct number of onsets drops to the same level as the onset transfer method if we remove conditional information. ” “ Additionally, we found that the synchronization re-ranking significantly boosted performance on both material […] ”

metric · Table 1. Automated evaluation metrics. · confidence 0.98

“ Interestingly, the onset transfer model performs quite well on the perceptual study, outperforming our base […] ” “ […] and input action synchronization. In Fig. […] ”

validation_scope · Table 2. Perceptual study results. · confidence 0.97

“ The model with no conditional example obtains poor performance on the material metric but obtains a relatively smaller decrease in synchronization performance. ” “ […] occurs when the audio events are cleanly separated in time, and we expect the model to fail when sounds are not easily divided into discrete onsets, or when onsets are ambiguous. ”

limitation · 5. Discussion · confidence 0.95

Limits

Technical limits

Synchronization is still fragile, generation quality depends on sampling and re-ranking, and the pipeline is aimed at soundtrack synthesis rather than communication.

Evaluation limits

The strongest quantitative evidence comes from Greatest Hits; CountixAV and wild-video results are qualitative, and onset transfer still wins the synchronization subtask.

Deployment limits

This is an offline creative tool pipeline rather than a real-time SSI system.

Scope limits

Conditional Foley generation only; outside SSI scope.