2024 · arXiv / imported corpus page · Field expert review · confidence high

SonicVisionLM: Playing Sound with Vision Language Models

Zhifeng Xie, Shengye Yu, Qile He, Mengtian Li

A high-quality video-to-audio generation framework leveraging vision-language models for editable, temporally precise sound effect generation; strong experimental validation, but outside the standard silent speech interface (SSI) scope.

Verdict: full-text draft · Priority: medium · Confidence: high · Basis: full text + structured benchmark + summary · Coverage: high

Reading guidance

Verdict: full-text draft · priority medium · confidence high
Why it matters: Introduces a modular audiovisual sound synthesis pipeline that replaces monolithic video-to-audio regression with interpretable video-to-text event detection plus timestamped text-to-audio generation, enabling strong synchronization and user-editable sound design.
What to trust: Basis: full text + structured benchmark + summary. Coverage: high. 4 evidence records back the review.
What is weak: Visual understanding and timestamp detection require further refinement; broader control over audio generation and editability beyond the current focus remains an open challenge; the model is computationally intensive and not real-time. The benchmarks are zero-shot datasets with mixed or noisy ground-truth audio, which complicates metric interpretation; subjective metrics complement the objective evaluation, but broad unseen-word and generalization studies are missing. The system is designed for post-production workflows: it lacks a real-time interactive authoring system, requires GPU-level compute, relies on offline processing, and does not address mobile or embedded deployment. This is an audiovisual generation paper outside the silent speech interface domain; it focuses on sound effects for video post-production rather than silent speech communication devices or sensing. Overclaim risk: low-medium.
Read before: SSI review rubric
Read next: SSI archive

Axes

Task: audio-classification; dataset; multimodal_generation
Modality: video
Hardware: video camera frames only; no specialized silent speech sensors
Output: audio
Vocabulary: Open text prompts grounded by sound-effect categories and timestamped event descriptions
Metrics: Reported values for conditional generation include CLAP-top scores (36.8% and 42.8%), Onset Accuracy (27.6%), Onset AP (78.1%), Time Accuracy (43.8%), and Intersection over Union (IoU, 39.7%). Unconditional generation reaches IoU scores of 39.5 and 42.0 on the Greatest Hits and CountixAV datasets, respectively. Subjective scores are 75 for Overall Audio Quality, 69 for Relevance, and 87 for Time-Sync. The time-conditioned adapter ablation reports FID scores under 25 and MKL scores around 2.3.
Evaluation mode: Quantitative objective metrics (CLAP-top, Onset accuracy, IoU, FID, MKL), zero-shot on Greatest Hits and CountixAV datasets; 300-person large-scale subjective study evaluating audio quality, relevance, and synchronization; ablation on time-conditioned adapter.
Review confidence: high
Overclaim risk: low-medium
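
For readers weighing the timing metrics listed above, a temporal IoU between predicted and reference sound events can be sketched as follows. This is a minimal illustration, not the paper's evaluation code: it assumes events are given as (start, end) intervals in seconds and compared on a fixed time grid. The names `rasterize` and `temporal_iou`, the 0.01 s grid step, and the convention that two empty timelines score 1.0 are all assumptions made for this example.

```python
from typing import List, Tuple

Interval = Tuple[float, float]  # (start_s, end_s), hypothetical event format


def rasterize(intervals: List[Interval], duration_s: float,
              step_s: float = 0.01) -> List[bool]:
    """Mark every grid bin that is covered by at least one interval."""
    n_bins = int(round(duration_s / step_s))
    mask = [False] * n_bins
    for start, end in intervals:
        lo = max(0, int(round(start / step_s)))
        hi = min(n_bins, int(round(end / step_s)))
        for i in range(lo, hi):
            mask[i] = True
    return mask


def temporal_iou(pred: List[Interval], ref: List[Interval],
                 duration_s: float, step_s: float = 0.01) -> float:
    """Intersection over union of the sounding regions of two timelines."""
    p = rasterize(pred, duration_s, step_s)
    r = rasterize(ref, duration_s, step_s)
    inter = sum(a and b for a, b in zip(p, r))
    union = sum(a or b for a, b in zip(p, r))
    # Assumed convention: two silent timelines agree perfectly.
    return inter / union if union else 1.0


if __name__ == "__main__":
    # Prediction covers 0.0-1.0 s, reference covers 0.5-1.5 s of a 2 s clip.
    print(temporal_iou([(0.0, 1.0)], [(0.5, 1.5)], duration_s=2.0))
```

A grid-based comparison keeps the sketch short and handles overlapping events without interval arithmetic; the trade-off is that the result depends on the chosen step size.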

Expert take

SonicVisionLM advances video-to-audio generation by decomposing the task into three stages: a vision-language model proposes plausible sound event labels from silent video frames, a timestamp detection module localizes events in time, and a time-controllable latent diffusion adapter conditions generation on synchronized text and timing inputs. The CondPromptBank dataset, which focuses on single sound effects with precise timestamps and rich textual descriptions, supports training. Quantitatively, the model is reported to outperform the prior state-of-the-art CondFoleyGen and other baselines on all key metrics, including semantic match, timing accuracy, and subjective human ratings for alignment and quality. The system’s multi-soundtrack generation lets users add off-screen sounds, mirroring professional post-production workflows. Despite these advances, the authors acknowledge the need to further improve visual understanding and timestamp prediction, as well as the lack of real-time or mobile deployment readiness. This work is a substantial contribution to controllable audiovisual synthesis but falls outside the SSI core, since it focuses on video-derived sound effects rather than silent speech sensing.

True value

Introduces a modular audiovisual sound synthesis pipeline that replaces monolithic video-to-audio regression with interpretable video-to-text event detection plus timestamped text-to-audio generation, enabling strong synchronization and user-editable sound design.

What changed

Canon before

Prior video-sound generation methods aligned video features directly to audio as a monolithic regression task, often yielding poor semantic matching and timing synchronization, and offering no user-editable control.

Delta from canon

Decomposes video-sound generation into a pipeline of video-to-text event detection (via a VLM), timestamp localization of sound events, and time-conditioned text-to-audio generation with user-editable controls.
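
The data flow of such a decomposed pipeline can be sketched as below. This is an illustrative outline only, not the authors' implementation: `SoundEvent`, `plan_soundtrack`, `render`, and the three callables (`describe`, `localize`, `synthesize`) are hypothetical names standing in for the VLM, the timestamp detector, and the time-conditioned text-to-audio model. The sketch aims to show why the intermediate text-plus-timestamp representation makes the result editable.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence, Tuple

Span = Tuple[float, float]  # (start_s, end_s), hypothetical timestamp format


@dataclass
class SoundEvent:
    """Editable intermediate representation: a text prompt plus its timing."""
    text: str
    spans: List[Span]


def plan_soundtrack(
    frames: Sequence[object],
    describe: Callable[[Sequence[object]], List[str]],
    localize: Callable[[Sequence[object], str], List[Span]],
    user_edits: Sequence[SoundEvent] = (),
) -> List[SoundEvent]:
    """Stages 1 and 2: propose sound labels, then attach timestamps to each."""
    events = [SoundEvent(text=label, spans=localize(frames, label))
              for label in describe(frames)]
    # Users can append events (e.g. off-screen sounds) before any audio exists.
    events.extend(user_edits)
    return events


def render(
    events: Sequence[SoundEvent],
    synthesize: Callable[[str, List[Span]], object],
) -> List[object]:
    """Stage 3: one generated track per event, conditioned on text and timing."""
    return [synthesize(event.text, event.spans) for event in events]


if __name__ == "__main__":
    # Stub models for demonstration; real components would be learned networks.
    events = plan_soundtrack(
        frames=[],
        describe=lambda frames: ["dog bark"],
        localize=lambda frames, label: [(0.5, 1.0)],
        user_edits=[SoundEvent("off-screen thunder", [(2.0, 3.0)])],
    )
    print(render(events, lambda text, spans: f"{text} @ {spans}"))
```

Because each stage exchanges plain text and timestamps, any stage can be swapped or its output corrected by hand, which is the editability a monolithic video-to-audio regressor does not offer.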

Position in field

A strong audiovisual generation and dataset contribution that is adjacent to but not within core silent speech interface research.

Evidence

“ Current methods use datasets including sound effects, voices, and music, but practical applications use these elements separately. [...] We propose a novel framework called SonicVisionLM and collect a dataset CondPromptBank specifically for training a time-controllable adapter. ”

author_claim · Abstract · confidence 1.00

“ We attribute these shortcomings to the complexity of the sound sources and the poor audio quality of the audio-visual dataset used for training. To address this, we use text to bridge audio and video and then introduce time control in the T2A generation model. [...] [In]stead of using a hand-crafted approach to transfer sounds from the conditional audio, We use a ResNet(2+1)-D18 [35] visual network to capture timestamps as time-conditional inputs to the LDM, trained on paired video and timestamp. ”

actual_novelty · 3. Method · confidence 1.00

“ For the conditional generation task, we use the following five objective metrics to evaluate the performance of the model: CLAP-top, Onset Acc [9], Onset AP [9], Time Acc, and IoU. ”
[Interleaved figure residue, per-category values: Animals Insects 4.04; Instruments 3.68; Water Liquid 3.27; Technology 2.70; Horror 2.41; Emergency 2.20; Public Places 1.87; Sound Design Effects 1.69; Doors Windows 1.56; Fire Explosions 1.49; Nature Weather 1.02; Leisure 0.84; Multimedia 0.47; Bells 0.37]

metric · 4. Experiments · confidence 1.00

“ We attribute these shortcomings to the complexity of the sound sources and the poor audio quality of the audio-visual dataset used for training. To address this, we use text to bridge audio and video and then introduce time control in the T2A generation model. [...] [In]stead of using a hand-crafted approach to transfer sounds from the conditional audio, We use a ResNet(2+1)-D18 [35] visual network to capture timestamps as time-conditional inputs to the LDM, trained on paired video and timestamp. ”

limitation · 5. Conclusion · confidence 1.00

Limits

Technical limits

Visual understanding and timestamp detection require further refinement; broader control over audio generation and editability beyond the current focus remains an open challenge; the model is computationally intensive and not real-time.

Evaluation limits

The benchmarks are zero-shot datasets with mixed or noisy ground-truth audio, which complicates metric interpretation; subjective metrics complement the objective evaluation, but broad unseen-word and generalization studies are missing.

Deployment limits

The system is designed for post-production workflows: it lacks a real-time interactive authoring system, requires GPU-level compute, relies on offline processing, and does not address mobile or embedded deployment.

Scope limits

This is an audiovisual generation paper outside the silent speech interface domain; it focuses on sound effects for video post-production rather than silent speech communication devices or sensing.