modality:multimodal · 15 pages · 15 reviewed · 0 imported

Multimodal

This page groups the current SSI review database by the `modality:` tag `modality:multimodal`.

The list below includes every paper page that currently carries this tag.

Papers

reviewed · CHI '26 / arXiv · 2026

NasoVoce: A Nose-Mounted Low-Audibility Speech Interface for Always-Available Speech Interaction

Jun Rekimoto, Yu Nishimura, Bojian Yang

A strong deployment-focused speech interface leveraging a novel nose-pad dual-sensor configuration and multimodal fusion to enable robust low-audibility speech interaction with AI under noise, backed by extensive evaluation.

reviewed · arXiv / imported corpus page · 2024

SonicVisionLM: Playing Sound with Vision Language Models

Zhifeng Xie, Shengye Yu, Qile He, Mengtian Li

A high-quality video-to-audio generation framework leveraging vision-language models for editable, temporally precise sound effect generation; strong experimental validations but outside standard SSI scope.

reviewed · arXiv / imported corpus page · 2023

Sound Source Localization is All about Cross-Modal Alignment

Arda Senocak, Hyeonggon Ryu, Junsik Kim, Tae-Hyun Oh, Hanspeter Pfister, Joon Son Chung

Proposes a novel multi-positive contrastive framework that strengthens semantic audio-visual alignment for sound source localization. Strong experimental evidence supports the claims, but the method is outside the SSI domain.

reviewed · arXiv / imported corpus page · 2023

Audio-visual video-to-speech synthesis with synthesized input audio

Triantafyllos Kefalas, Yannis Panagakis, Maja Pantić

The paper credibly shows that incorporating synthesized audio as an auxiliary input in a second-stage audiovisual synthesis model improves video-to-speech reconstruction quality and intelligibility in benchmarks, though gains depend on model variant and dataset.

reviewed · arXiv / imported corpus page · 2023

Audio-aware Query-enhanced Transformer for Audio-Visual Segmentation

Jinxiang Liu, Chen Ju, Chaofan Ma, Yanfeng Wang, Yu Wang, Ya Zhang

Strong AVS result, outside SSI: the useful idea is audio-conditioned decoder queries plus dynamic mask prediction.

reviewed · arXiv / imported corpus page · 2023

Diff-Foley: Synchronized Video-to-Audio Synthesis with Latent Diffusion Models

Simian Luo, Chuanhao Yan, Chenxu Hu, Hang Zhao

The real gain is not 'diffusion' alone but aligned conditioning plus guidance that pushes synchronization very hard.

reviewed · arXiv / imported corpus page · 2023

Conditional Generation of Audio from Video via Foley Analogies

Yuexi Du, Ziyang Chen, Justin Salamon, Bryan Russell, Andrew Owens

The paper matters because it gives V2A generation a controllable exemplar, not because it beats every timing baseline.

reviewed · arXiv / imported corpus page · 2023

Speech Reconstruction from Silent Tongue and Lip Articulation By Pseudo Target Generation and Domain Adversarial Training

Rui-Chen Zheng, Yang Ai, Zhen-Hua Ling

Strong SSI paper that improves silent speech reconstruction by generating pseudo acoustic targets and using domain adversarial training to address domain mismatch. Validated on the TaL dataset, it shows substantial WER and MOS gains over TaLNet.

reviewed · arXiv / imported corpus page · 2022

Multi-modality Associative Bridging through Memory: Speech Sound Recollected from Face Video

Minsu Kim, Joanna Hong, Se Jin Park, Yong Man Ro

The key idea is not generic fusion; it is storing cross-modal correspondences so video-only decoding can recover some audio-side structure later.

reviewed · arXiv / imported corpus page · 2021

Silent versus modal multi-speaker speech recognition from ultrasound and video

Manuel Sam Ribeiro, Aciel Eshky, Korin Richmond, Steve Renals

Large-corpus baseline with real silent-mode gap.

reviewed · arXiv / imported corpus page · 2020

Discriminative Sounding Objects Localization via Self-supervised Audiovisual Matching

Di Hu, Rui Qian, Minyue Jiang, Xiao Tan, Shilei Wen, Errui Ding, Weiyao Lin, Dejing Dou

Technically solid self-supervised class-aware audiovisual sounding object localization, but outside the core SSI domain.

reviewed · arXiv / imported corpus page · 2020

Silent Speech Interfaces for Speech Restoration: A Review

José A. González, Alejandro Gomez-Alanis, Juan M. Martín-Doñas, José L. Pérez-Córdoba, Ángel M. Gómez

Core SSI survey with concrete deployment constraints.

reviewed · arXiv / imported corpus page · 2020

An Overview of Deep-Learning-Based Audio-Visual Speech Enhancement and Separation

Daniel Michelsanti, Zheng-Hua Tan, Shi-Xiong Zhang, Yong Xu, Meng Yu, Dong Yu, Jesper Jensen

Strong AV speech survey, not an SSI system paper.

reviewed · arXiv / imported corpus page · 2020

Foley Music: Learning to Generate Music from Videos

Chuang Gan, Deng Huang, Peihao Chen, Joshua B. Tenenbaum, Antonio Torralba

Strong video-to-music paper, not SSI.

reviewed · arXiv / imported corpus page · 2018

Cross-modal Embeddings for Video and Audio Retrieval

Dídac Surís, Amanda Duarte, Amaia Salvador, Jordi Torres, Xavier Giró-i-Nieto

Useful multimodal retrieval baseline, not SSI.