2020 · arXiv / imported corpus page · Field expert review · confidence high

Foley Music: Learning to Generate Music from Videos

Chuang Gan, Deng Huang, Peihao Chen, Joshua B. Tenenbaum, Antonio Torralba

arXiv

Strong video-to-music paper, not SSI.

Verdict: full-text draft · Priority: medium · Confidence: high · Basis: full text · Coverage: high

Reading guidance

Verdict: full-text draft · priority medium · confidence high
Why it matters: Strong multimodal music-generation paper, but outside SSI scope.
What to trust: Basis: full text. Coverage: high. 3 evidence records back the review.
What is weak: Scope is instrument-performance video; waveform realism still depends on an external synthesizer and future neural synthesis work. Human studies are preference-based and confined to the tested instrument/video distributions. No SSI deployment path; this is a multimedia generation system. Video-to-music generation only. Overclaim risk: low.
Read before: SSI review rubric
Read next: SSI archive

Axes

Modality: video
Hardware: camera
Output: audio
Vocabulary: MIDI event vocabulary
Metrics: Human preference rates in Table 1 favor the method in every instrument category, ranging from 56% to 72%; real-vs-fake success reaches 38% versus 8-12% for baselines; NDB (lower is better) is 20 versus 25-33 for baselines
Evaluation mode: human preference studies, real-vs-fake listening study, NDB diversity metric, and NLL ablations
Review confidence: high
Overclaim risk: low

Expert take

The full text supports a credible music-generation contribution: keypoints plus MIDI make synchronization and structure easier to learn than direct waveform targets, and the method beats the baselines in both the human preference studies and the automatic diversity metric (NDB). But none of that is silent-speech interface (SSI) work. It belongs in a broader multimodal archive only if non-SSI distractors are intentionally retained and clearly labeled as such.

True value

Strong multimodal music-generation paper, but outside SSI scope.

What changed

Canon before

Video-to-sound generation often worked in waveform or spectrogram space and struggled to align long-term musical structure with body motion.

Delta from canon

Uses body keypoints and MIDI as intermediate representations, turning video-to-music generation into a motion-to-MIDI translation problem.
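
Illustration (not from the paper): the motion-to-MIDI framing means the model's targets are discrete event tokens rather than audio samples. The sketch below shows one way a note list could be flattened into such tokens, assuming a Performance-RNN-style layout with note-on, note-off, and 10 ms time-shift events. The vocabulary sizes, the id layout, and the function name `encode_notes` are assumptions made for this example; the review does not record the paper's exact vocabulary.

```python
from typing import List, Tuple

# Hypothetical vocabulary layout (an assumption, not taken from the paper):
# ids 0-127 are note-on, 128-255 are note-off, 256-355 are time shifts.
NOTE_ON_BASE = 0
NOTE_OFF_BASE = 128
TIME_SHIFT_BASE = 256
MAX_SHIFT_STEPS = 100  # one time-shift token covers at most 100 steps
STEP_MS = 10           # each step is 10 ms, so one token spans up to 1 s


def encode_notes(notes: List[Tuple[int, int, int]]) -> List[int]:
    """Flatten (pitch, start_ms, end_ms) notes into a list of event ids."""
    boundaries = []
    for pitch, start_ms, end_ms in notes:
        # Order key 0/1 makes note-offs sort before note-ons at the same step.
        boundaries.append((start_ms // STEP_MS, 1, NOTE_ON_BASE + pitch))
        boundaries.append((end_ms // STEP_MS, 0, NOTE_OFF_BASE + pitch))
    boundaries.sort()

    tokens: List[int] = []
    clock = 0  # current position in 10 ms steps
    for step, _, event_id in boundaries:
        gap = step - clock
        # Long gaps are split into several time-shift tokens.
        while gap > 0:
            chunk = min(gap, MAX_SHIFT_STEPS)
            tokens.append(TIME_SHIFT_BASE + chunk - 1)
            gap -= chunk
        clock = step
        tokens.append(event_id)
    return tokens


if __name__ == "__main__":
    # Two notes: C4 for 0.5 s, then E4 for 1.2 s.
    print(encode_notes([(60, 0, 500), (64, 500, 1700)]))
```

With a representation like this, a sequence model can be trained to predict the next event id conditioned on per-frame keypoint features, and a standard MIDI synthesizer renders the result to audio.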

Position in field

Audio-visual generation paper that can distract an SSI corpus if not explicitly labeled out-of-scope.

Evidence

“ In this paper, we introduce Foley Music, a system that can synthesize plausible music for a silent video clip about people playing musical instruments. ”

author_claim · Abstract. In this paper, we introduce Foley Music, a system that can · confidence 0.99

“ Qualitative Evaluation with Human Study: Similar to the task of image or video generation, the quality of the generated sound can be very subjective. ”

metric · Table 1. Human evaluation on model comparisons. · confidence 0.97

“ Quantitative Evaluation with Automatic Metrics We adopt the Number of Statistically-Different Bins (NDB) [13] as automatic metrics to evaluate the diversity of generated sound. ”

metric · Table 3. Automatic metrics for different models. For NDB, lower is better. · confidence 0.97
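
Illustration (not from the paper): NDB compares how reference and generated samples are distributed over a fixed set of bins and counts the bins whose proportions differ significantly, which is why lower is better. The sketch below assumes each sample has already been assigned a bin index (for example, the nearest centroid from clustering the reference set) and applies a two-proportion z-test per bin. The function name `ndb`, its parameters, and the 1.96 threshold (two-sided 5% level) are assumptions for this example, not details recorded in the review.

```python
import numpy as np


def ndb(ref_bins, gen_bins, num_bins, z_threshold=1.96):
    """Count bins whose reference and generated proportions differ.

    ref_bins / gen_bins hold one integer bin index per sample. How samples
    are binned (e.g. k-means on reference features) is outside this sketch.
    """
    ref_bins = np.asarray(ref_bins)
    gen_bins = np.asarray(gen_bins)
    n_ref, n_gen = len(ref_bins), len(gen_bins)

    ref_counts = np.bincount(ref_bins, minlength=num_bins)
    gen_counts = np.bincount(gen_bins, minlength=num_bins)
    p_ref = ref_counts / n_ref
    p_gen = gen_counts / n_gen

    # Pooled proportion and standard error for the two-proportion z-test.
    pooled = (ref_counts + gen_counts) / (n_ref + n_gen)
    se = np.sqrt(pooled * (1 - pooled) * (1 / n_ref + 1 / n_gen))

    # Where se is 0 the two proportions are identical, so z stays 0.
    z = np.divide(p_ref - p_gen, se, out=np.zeros(num_bins), where=se > 0)
    return int(np.sum(np.abs(z) > z_threshold))


if __name__ == "__main__":
    reference = [0] * 50 + [1] * 50
    collapsed = [0] * 100  # a generator that only ever lands in bin 0
    print(ndb(reference, reference, num_bins=2))
    print(ndb(reference, collapsed, num_bins=2))
```

A generator that matches the reference distribution scores 0, while one that collapses onto a few bins is flagged in every bin it over- or under-fills.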

Limits

Technical limits

Scope is instrument-performance video; waveform realism still depends on an external synthesizer and future neural synthesis work.

Evaluation limits

Human studies are preference-based and confined to the tested instrument/video distributions.

Deployment limits

No SSI deployment path; this is a multimedia generation system.

Scope limits

Video-to-music generation only.