Foley Music: Learning to Generate Music from Videos
Strong video-to-music paper, but not silent-speech interface (SSI) work.
Reading guidance
- Verdict
- full-text draft · priority medium · confidence high
- Why it matters
- Strong multimodal music-generation paper, but outside SSI scope.
- What to trust
- Basis: full text. Coverage: high. 3 evidence records back the review.
- What is weak
- Scope is instrument-performance video; waveform realism still depends on an external synthesizer and future neural synthesis work. Human studies are preference-based and confined to the tested instrument/video distributions. No SSI deployment path; this is a multimedia generation system. Video-to-music generation only. Overclaim risk: low.
- Read before
- SSI review rubric
- Read next
- SSI archive
Axes
- Modality
- video
- Hardware
- camera
- Output
- audio
- Vocabulary
- MIDI event vocabulary
- Metrics
- Human preference rates in Table 1 favor the method in every instrument category (56% to 72%); real-vs-fake success reaches 38% versus 8-12% for baselines; NDB is 20 versus 25-33 for baselines (lower is better)
- Evaluation mode
- human preference studies, real-vs-fake listening study, NDB diversity metric, and NLL ablations
- Review confidence
- high
- Overclaim risk
- low
Expert take
The full text supports a credible music-generation contribution: keypoints plus MIDI make synchronization and structure easier to learn than direct waveform targets, and the method beats the baselines in both the human preference studies and the automatic diversity metric (NDB). But none of that is silent-speech interface work. It belongs in a broader multimodal archive only if non-SSI distractors are intentionally retained and clearly labeled as such.
True value
Strong multimodal music-generation paper, but outside SSI scope.
What changed
Canon before
Video-to-sound generation often worked in waveform or spectrogram space and struggled to align long-term musical structure with body motion.
Delta from canon
Uses body keypoints and MIDI as intermediate representations, turning video-to-music generation into a motion-to-MIDI translation problem.
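To make the MIDI side of that translation concrete, here is a minimal sketch of how notes can be flattened into a discrete event sequence of the note-on / note-off / time-shift kind. This review does not record the paper's exact vocabulary, so the layout, the sizes (`N_PITCH`, `N_SHIFT`), the step unit, and the `encode` helper are all assumptions made for illustration; velocity events are left out for brevity.

```python
from typing import List, Tuple

# Hypothetical vocabulary layout; sizes are illustrative, not the paper's.
N_PITCH = 128            # MIDI pitch range
N_SHIFT = 100            # a single time-shift token covers 1..100 steps

NOTE_ON = 0                          # tokens 0..127
NOTE_OFF = NOTE_ON + N_PITCH         # tokens 128..255
TIME_SHIFT = NOTE_OFF + N_PITCH      # tokens 256..355
VOCAB_SIZE = TIME_SHIFT + N_SHIFT

Note = Tuple[int, int, int]  # (pitch, start_step, end_step), times already quantized


def encode(notes: List[Note]) -> List[int]:
    """Flatten notes into a token sequence of note-on/off and time-shift events."""
    boundaries = []
    for pitch, start, end in notes:
        boundaries.append((start, 1, pitch))   # 1 marks a note-on
        boundaries.append((end, 0, pitch))     # 0 marks a note-off
    boundaries.sort()                          # at equal times, offs come before ons

    tokens: List[int] = []
    now = 0
    for t, is_on, pitch in boundaries:
        gap = t - now
        while gap > 0:                         # split long gaps into several shifts
            step = min(gap, N_SHIFT)
            tokens.append(TIME_SHIFT + step - 1)
            gap -= step
        now = t
        tokens.append((NOTE_ON if is_on else NOTE_OFF) + pitch)
    return tokens


if __name__ == "__main__":
    # Middle C for 50 steps, then E for 100 steps.
    print(encode([(60, 0, 50), (64, 50, 150)]))
```

With a target like this, a sequence model conditioned on pose keypoints only has to predict one token at a time from a small vocabulary, which is the sense in which the task becomes translation rather than waveform regression.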
Position in field
Audio-visual generation paper that can distract an SSI corpus if not explicitly labeled out-of-scope.
Evidence
“ In this paper, we introduce Foley Music, a system that can synthesize plausible music for a silent video clip about people playing musical instruments. ”
author_claim · Abstract. In this paper, we introduce Foley Music, a system that can · confidence 0.99
“ Qualitative Evaluation with Human Study: Similar to the task of image or video generation, the quality of the generated sound can be very subjective. ”
metric · Table 1. Human evaluation on model comparisons. · confidence 0.97
“ Quantitative Evaluation with Automatic Metrics We adopt the Number of Statistically-Different Bins (NDB) [13] as automatic metrics to evaluate the diversity of generated sound. ”
metric · Table 3. Automatic metrics for different models. For NDB, lower is better. · confidence 0.97
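For readers unfamiliar with NDB, the sketch below shows the general idea. Reference samples define a set of bins (nearest-centroid cells), and the score counts the bins whose share of generated samples differs significantly from their share of reference samples. The function name `ndb`, the choice to pass precomputed centroids (in practice typically obtained by k-means on the reference set), the feature representation, and the fixed 1.96 threshold for a two-sided 5% test are assumptions for illustration. This is not the paper's evaluation code.

```python
import numpy as np


def ndb(train: np.ndarray, gen: np.ndarray, centroids: np.ndarray,
        z_crit: float = 1.96) -> int:
    """Count bins whose occupancy differs significantly between two sample sets.

    train, gen: arrays of shape (n_samples, n_features)
    centroids:  array of shape (n_bins, n_features), assumed precomputed
    """
    def bin_counts(x: np.ndarray) -> np.ndarray:
        # Squared distance from every sample to every centroid, then nearest bin.
        d = ((x[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        return np.bincount(d.argmin(axis=1), minlength=len(centroids))

    n_p, n_q = len(train), len(gen)
    p = bin_counts(train) / n_p          # reference proportion per bin
    q = bin_counts(gen) / n_q            # generated proportion per bin

    # Two-proportion z-test per bin, using the pooled proportion.
    pooled = (p * n_p + q * n_q) / (n_p + n_q)
    se = np.sqrt(pooled * (1 - pooled) * (1 / n_p + 1 / n_q))
    with np.errstate(divide="ignore", invalid="ignore"):
        z = np.where(se > 0, (p - q) / se, 0.0)   # empty-in-both bins count as equal

    return int((np.abs(z) > z_crit).sum())


if __name__ == "__main__":
    train = np.array([[0.0]] * 500 + [[10.0]] * 500)
    cents = np.array([[0.0], [10.0]])
    print(ndb(train, train.copy(), cents))          # same distribution
    print(ndb(train, np.zeros((1000, 1)), cents))   # collapsed onto one mode
```

A generator that collapses onto a few modes leaves many bins over- or under-filled, so its count rises; this is why a lower NDB is read as better coverage of the reference distribution.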
Limits
Technical limits
Scope is instrument-performance video; waveform realism still depends on an external synthesizer and future neural synthesis work.
Evaluation limits
Human studies are preference-based and confined to the tested instrument/video distributions.
Deployment limits
No SSI deployment path; this is a multimedia generation system.
Scope limits
Video-to-music generation only.