
modality:video 42 pages 42 reviewed 0 imported

Video

This page groups entries from the current SSI review database under the `modality:` tag `modality:video`.

The list below includes every paper page that currently carries this tag.

Papers

reviewed arXiv 2026

Cross-Modal Masking for Robust Silent Speech Synthesis Using sEMG and Lipreading

Eder del Blanco, David Gimeno-Gómez, Eva Navas, Carlos-D. Martínez-Hinarejos, Inma Hernáez

The paper advances silent speech synthesis by leveraging masked training to robustly fuse electromyography and lipreading, showing improved performance and resilience, but adaptation to laryngectomized users remains challenging.

reviewed arXiv / imported corpus page 2024

SonicVisionLM: Playing Sound with Vision Language Models

Zhifeng Xie, Shengye Yu, Qile He, Mengtian Li

A high-quality video-to-audio generation framework that uses vision-language models for editable, temporally precise sound-effect generation; strong experimental validation, but outside standard SSI scope.

reviewed arXiv / imported corpus page 2023

Let There Be Sound: Reconstructing High Quality Speech from Silent Videos

Ji-Hoon Kim, Jaehun Kim, Joon Son Chung

Strong lip-to-speech system that reduces ambiguity via SSL linguistic conditioning, variance predictors, and flow-based refinement, achieving near-vocoded naturalness and improved intelligibility on standard datasets.

reviewed arXiv / imported corpus page 2023

An Initial Exploration: Learning to Generate Realistic Audio for Silent Video

Matthew Martel, Jackson Wagner

Honest exploratory comparison showing that a transformer-based model outperforms deep-fusion CNN and WaveNet models at generating low-to-mid-frequency audio from silent video on a small curated dataset; not a speech or SSI paper.

reviewed arXiv / imported corpus page 2023

Audio Knowledge Empowered Visual Speech Recognition

Jeong Hun Yeo, Minsu Kim, Jeongsoo Choi, Dae Hoe Kim, Yong Man Ro

The paper advances visual speech recognition by selectively transferring refined linguistic audio knowledge via a learned compact memory and cross-attention injection, improving benchmark WERs over prior audio-assisted methods without requiring audio inputs during inference.

reviewed arXiv / imported corpus page 2023

Audio-visual video-to-speech synthesis with synthesized input audio

Triantafyllos Kefalas, Yannis Panagakis, Maja Pantić

The paper credibly shows that incorporating synthesized audio as an auxiliary input in a second-stage audiovisual synthesis model improves video-to-speech reconstruction quality and intelligibility in benchmarks, though gains depend on model variant and dataset.

reviewed arXiv / imported corpus page 2023

Audio-aware Query-enhanced Transformer for Audio-Visual Segmentation

Jinxiang Liu, Chen Ju, Chaofan Ma, Yanfeng Wang, Yu Wang, Ya Zhang

Strong AVS result, outside SSI: the useful idea is audio-conditioned decoder queries plus dynamic mask prediction.

reviewed arXiv / imported corpus page 2023

RobustL2S: Speaker-Specific Lip-to-Speech Synthesis exploiting Self-Supervised Representations

Neha Sahipjohn, Neil Shah, Vishal Tambrahalli, Vineet Gandhi

Strong modular SSL-based lip-to-speech synthesis paper that innovatively maps lip SSL features to disentangled speech embeddings before vocoder synthesis, demonstrating improved intelligibility and robustness across benchmark datasets.

reviewed arXiv / imported corpus page 2023

Diff-Foley: Synchronized Video-to-Audio Synthesis with Latent Diffusion Models

Simian Luo, Chuanhao Yan, Chenxu Hu, Hang Zhao

The real gain is not 'diffusion' alone but aligned conditioning plus guidance that pushes synchronization very hard.

reviewed arXiv / imported corpus page 2023

High-Quality Automatic Voice Over with Accurate Alignment: Supervision through Self-Supervised Discrete Speech Units

Junchen Lu, Berrak Şişman, Mingyang Zhang, Haizhou Li

This video-conditioned AVO system innovatively supervises alignment by predicting discrete speech units rather than reconstructing acoustic features, leading to better lip-sync and speech quality on a single-speaker dataset; however, it is not an SSI interface paper.

reviewed arXiv / imported corpus page 2023

Large-scale unsupervised audio pre-training for video-to-speech synthesis

Triantafyllos Kefalas, Yannis Panagakis, Maja Pantić

Good decoder-transfer pretraining improves video-to-speech quality on several benchmarks, but WER gains are not consistent. A useful methodological contribution with strong benchmark support, adjacent to SSI rather than a deployable system.

reviewed arXiv / imported corpus page 2023

LipVoicer: Generating Speech from Silent Videos Guided by Lip Reading

Yochai Yemini, Aviv Shamsian, Lior Bracha, Sharon Gannot, Ethan Fetaya

Strong full-text paper demonstrating that inference-time text guidance via an ASR classifier is key to significantly improved intelligibility in lip-to-speech synthesis on challenging in-the-wild video datasets, outperforming prior baselines.

reviewed arXiv / imported corpus page 2023

Intelligible Lip-to-Speech Synthesis with Speech Units

Jeongsoo Choi, Minsu Kim, Yong Man Ro

Speech units as a pseudo-text target enable strong content supervision that substantially cuts WER without text labels, and the multi-input vocoder improves speech quality from blurry mel outputs, yielding a state-of-the-art lip-to-speech system on LRS benchmarks.

reviewed arXiv / imported corpus page 2023

Zero-shot personalized lip-to-speech synthesis with face image based voice control

Zheng-Yan Sheng, Yang Ai, Zhen-Hua Ling

Demonstrates effective zero-shot voice control in Lip2Speech by leveraging face-image-based speaker embeddings, validated on the GRID corpus but constrained by the dataset's vocabulary and by speech naturalness.

reviewed arXiv / imported corpus page 2023

Improving the Gap in Visual Speech Recognition Between Normal and Silent Speech Based on Metric Learning

Sara Kashiwagi, Keitaro Tanaka, Feng Qi, Shigeo Morishima

Strong viseme-level metric learning approach reduces silent speech VSR errors on a small 10-phrase dataset, notably achieving parity with baselines using much less silent data.

reviewed arXiv / imported corpus page 2023

Conditional Generation of Audio from Video via Foley Analogies

Yuexi Du, Ziyang Chen, Justin Salamon, Bryan Russell, Andrew Owens

The paper matters because it gives V2A generation a controllable exemplar, not because it beats every timing baseline.

reviewed arXiv / imported corpus page 2023

Speech Reconstruction from Silent Tongue and Lip Articulation By Pseudo Target Generation and Domain Adversarial Training

Rui-Chen Zheng, Yang Ai, Zhen-Hua Ling

Strong SSI paper that improves silent speech reconstruction by generating pseudo acoustic targets and using domain adversarial training to address domain mismatch; validated on the TaL dataset, showing substantial WER and MOS gains over TaLNet.

reviewed arXiv / imported corpus page 2023

LipLearner: Customizable Silent Speech Interactions on Mobile Devices

Zixiong Su, Shitao Fang, Jun Rekimoto

LipLearner is a strong mobile silent speech system that uniquely closes the loop from few-shot lipreading model design to practical on-device customization and keyword spotting, demonstrated robustly in real-world conditions and a user study.

reviewed arXiv / imported corpus page 2022

Lip-to-Speech Synthesis for Arbitrary Speakers in the Wild

Sindhu B Hegde, K R Prajwal, Rudrabha Mukhopadhyay, Vinay P. Namboodiri, C. V. Jawahar

The real contribution is not just another VAE-GAN; it is turning lip-to-speech into an arbitrary-speaker problem with credible low-data adaptation.

reviewed arXiv / imported corpus page 2022

FastLTS: Non-Autoregressive End-to-End Unconstrained Lip-to-Speech Synthesis

Yongqi Wang, Zhou Zhao

This paper matters because it makes unconstrained lip-to-speech materially faster without obviously sacrificing quality.

reviewed arXiv / imported corpus page 2022

VisageSynTalk: Unseen Speaker Video-to-Speech Synthesis via Speech-Visage Feature Selection

Joanna Hong, Minsu Kim, Yong Man Ro

The paper is really about disentangling identity, and that is why the unseen-speaker results hold up.

reviewed arXiv / imported corpus page 2022

SVTS: Scalable Video-to-Speech Synthesis

Rodrigo Mira, Alexandros Haliassos, Stavros Petridis, Björn W. Schuller, Maja Pantić

A key scaling contribution demonstrating that a simple pipeline of spectrogram prediction plus a pretrained vocoder outperforms prior complex models on diverse datasets, marking foundational progress in large-scale video-to-speech synthesis.

reviewed arXiv / imported corpus page 2022

Multi-modality Associative Bridging through Memory: Speech Sound Recollected from Face Video

Minsu Kim, Joanna Hong, Se Jin Park, Yong Man Ro

The key idea is not generic fusion; it is storing cross-modal correspondences so video-only decoding can recover some audio-side structure later.

reviewed arXiv / imported corpus page 2022

VCVTS: Multi-speaker Video-to-Speech synthesis via cross-modal knowledge transfer from voice conversion

Disong Wang, Shan Yang, Dan Su, Xunying Liu, Dong Yu, Helen Meng

The real move is importing structure from voice conversion, not just adding another speaker embedding.

reviewed arXiv / imported corpus page 2022

VisualTTS: TTS with Accurate Lip-Speech Synchronization for Automatic Voice Over

Junchen Lu, Berrak Şişman, Rui Liu, Mingyang Zhang, Haizhou Li

VisualTTS effectively improves lip-speech synchronization in scripted voice over by conditioning TTS on lip video, but does not tackle silent speech decoding or unscripted scenarios.

reviewed arXiv / imported corpus page 2021

Advances and Challenges in Deep Lip Reading

Marzieh Oghbaie, Arian Sabaghi, Kooshan Hashemifard, Mohammad Kazem Akbari

Good survey, not a model result.

reviewed arXiv / imported corpus page 2021

Sub-word Level Lip Reading With Visual Attention

K R Prajwal, Triantafyllos Afouras, Andrew Zisserman

Major lip-reading gain, adjacent to SSI.

reviewed arXiv / imported corpus page 2021

Speaker disentanglement in video-to-speech conversion

Dan Oneaţă, Adriana Stan, Horia Cucu

The paper effectively makes speaker identity a controllable factor in multi-speaker video-to-speech synthesis by disentangling it from content, showing the trade-off between intelligibility and voice control on GRID corpus data.

reviewed arXiv / imported corpus page 2020

Speech Prediction in Silent Videos using Variational Autoencoders

Ravindra Yadav, Ashish Sardana, Vinay P. Namboodiri, Rajesh M. Hegde

Strong video-to-speech paper that models ambiguity explicitly.

reviewed arXiv / imported corpus page 2020

An Overview of Deep-Learning-Based Audio-Visual Speech Enhancement and Separation

Daniel Michelsanti, Zheng‐Hua Tan, Shi-Xiong Zhang, Yong Xu, Meng Yu, Dong Yu, Jesper Jensen

Strong AV speech survey, not an SSI system paper.

reviewed arXiv / imported corpus page 2020

Foley Music: Learning to Generate Music from Videos

Chuang Gan, Deng Huang, Peihao Chen, Joshua B. Tenenbaum, Antonio Torralba

Strong video-to-music paper, not SSI.

reviewed arXiv / imported corpus page 2020

Vocoder-Based Speech Synthesis from Silent Videos

Daniel Michelsanti, Olga Slizovskaia, Gloria Haro, Emília Gómez, Zheng‐Hua Tan, Jesper Jensen

A notable step forward in lip-to-speech synthesis by predicting full vocoder features and jointly training for recognition, achieving strong speaker-dependent results but lacking unseen speaker generalization.

reviewed arXiv / imported corpus page 2019

Lipper: Synthesizing Thy Speech using Multi-View Lipreading

Yaman Kumar, Rohit Jain, Khwaja Mohd. Salik, Rajiv Ratn Shah, Yifang Yin, Roger Zimmermann

Strong multi-view lip-to-speech baseline with honest quality limits.

reviewed arXiv / imported corpus page 2019

Video-Driven Speech Reconstruction using Generative Adversarial Networks

Konstantinos Vougioukas, Pingchuan Ma, Stavros Petridis, Maja Pantić

Foundational direct video-to-audio result with clear generalization limits.

reviewed arXiv / imported corpus page 2018

Harnessing AI for Speech Reconstruction using Multi-view Silent Video Feed

Yaman Kumar, Mayank Aggarwal, Pratham Nawal, Shin'ichi Satoh, Rajiv Ratn Shah, Roger Zimmermann

Multi-view silent video combined with CNN-LSTM models significantly improves speech audio reconstruction quality over single-view, highlighting the importance of optimal camera placement to address pose variance.

reviewed arXiv / imported corpus page 2018

Visual-Only Recognition of Normal, Whispered and Silent Speech

Stavros Petridis, Jie Shen, Doruk Cetin, Maja Pantić

Strong evidence that silent lipreading needs dedicated training.

reviewed arXiv / imported corpus page 2018

Cross-modal Embeddings for Video and Audio Retrieval

Dídac Surís, Amanda Duarte, Amaia Salvador, Jordi Torres, Xavier Giró-i-Nieto

Useful multimodal retrieval baseline, not SSI.

reviewed arXiv / imported corpus page 2017

Lip2AudSpec: Speech reconstruction from silent lip movements video

Hassan Akbari, Himani Arora, Liangliang Cao, Nima Mesgarani

The paper's auditory spectrogram autoencoder bottleneck target is a key innovation that produces more intelligible, natural reconstructed speech from lip videos than prior methods, as confirmed by objective and human evaluations.

reviewed arXiv / imported corpus page 2017

Updating the silent speech challenge benchmark with deep learning

Yan Ji, Licheng Liu, Hongcui Wang, Zhilei Liu, Zhibin Niu, B. Denby

Benchmark update with a real, reproducible WER gain.

reviewed arXiv / imported corpus page 2017

Seeing Through Noise: Visually Driven Speaker Separation and Enhancement

Aviv Gabbay, Ariel Ephrat, Tavi Halperin, Shmuel Peleg

Strong audiovisual speech separation and enhancement leveraging face video for speaker-dependent masking; not a silent speech interface paper.

reviewed arXiv / imported corpus page 2017

Improved Speech Reconstruction from Silent Video

Ariel Ephrat, Tavi Halperin, Shmuel Peleg

Strong, benchmark-setting speaker-dependent video-to-speech system that advances speech reconstruction from silent face video but remains limited to per-speaker training and constrained conditions.

reviewed arXiv / imported corpus page 2017

Vid2speech: Speech Reconstruction from Silent Video

Ariel Ephrat, Shmuel Peleg

Real lip-to-speech progress, still tightly benchmark-bounded.