Multimodal
This page groups the current SSI review database by the real `modality:` tag `modality:multimodal`.
The list below includes every paper page that currently carries this modality tag.
Papers
NasoVoce: A Nose-Mounted Low-Audibility Speech Interface for Always-Available Speech Interaction
A strong deployment-focused speech interface leveraging a novel nose-pad dual-sensor configuration and multimodal fusion to enable robust low-audibility speech interaction with AI under noise, backed by extensive evaluation.
SonicVisionLM: Playing Sound with Vision Language Models
A high-quality video-to-audio generation framework leveraging vision-language models for editable, temporally precise sound effect generation; strong experimental validation, but outside standard SSI scope.
Sound Source Localization is All about Cross-Modal Alignment
Proposes a novel multi-positive contrastive framework that improves semantic audio-visual alignment for sound source localization. Strong experimental evidence supports the claims, but the method is outside the SSI domain.
Audio-visual video-to-speech synthesis with synthesized input audio
The paper credibly shows that incorporating synthesized audio as an auxiliary input in a second-stage audiovisual synthesis model improves video-to-speech reconstruction quality and intelligibility on benchmarks, though gains depend on model variant and dataset.
Audio-aware Query-enhanced Transformer for Audio-Visual Segmentation
Strong AVS result, outside SSI: the useful idea is audio-conditioned decoder queries plus dynamic mask prediction.
Diff-Foley: Synchronized Video-to-Audio Synthesis with Latent Diffusion Models
The real gain is not 'diffusion' alone but aligned conditioning plus guidance that pushes synchronization very hard.
Conditional Generation of Audio from Video via Foley Analogies
The paper matters because it gives V2A generation a controllable exemplar, not because it beats every timing baseline.
Speech Reconstruction from Silent Tongue and Lip Articulation By Pseudo Target Generation and Domain Adversarial Training
Strong SSI paper that improves silent speech reconstruction by generating pseudo acoustic targets and using domain adversarial training to address domain mismatch; validated on the TaL dataset, showing substantial WER and MOS gains over TaLNet.
Multi-modality Associative Bridging through Memory: Speech Sound Recollected from Face Video
The key idea is not generic fusion; it is storing cross-modal correspondences so video-only decoding can recover some audio-side structure later.
Silent versus modal multi-speaker speech recognition from ultrasound and video
Large-corpus baseline with a real silent-mode gap.
Discriminative Sounding Objects Localization via Self-supervised Audiovisual Matching
Technically solid self-supervised class-aware audiovisual sounding object localization, but outside the core SSI domain.
Silent Speech Interfaces for Speech Restoration: A Review
Core SSI survey with concrete deployment constraints.
An Overview of Deep-Learning-Based Audio-Visual Speech Enhancement and Separation
Strong AV speech survey, not an SSI system paper.
Foley Music: Learning to Generate Music from Videos
Strong video-to-music paper, not SSI.
Cross-modal Embeddings for Video and Audio Retrieval
Useful multimodal retrieval baseline, not SSI.