Video
This page groups the current SSI (silent speech interface) review database by the `modality:` tag `modality:video`.
The list below includes every paper page that currently carries this modality tag.
Papers
Cross-Modal Masking for Robust Silent Speech Synthesis Using sEMG and Lipreading
The paper advances silent speech synthesis by leveraging masked training to robustly fuse electromyography and lipreading, showing improved performance and resilience, but adaptation to laryngectomized users remains challenging.
SonicVisionLM: Playing Sound with Vision Language Models
A high-quality video-to-audio generation framework leveraging vision-language models for editable, temporally precise sound effect generation; strong experimental validation, but outside standard SSI scope.
Let There Be Sound: Reconstructing High Quality Speech from Silent Videos
Strong lip-to-speech system that reduces ambiguity via SSL linguistic conditioning, variance predictors, and flow-based refinement, achieving near-vocoded naturalness and improved intelligibility on standard datasets.
An Initial Exploration: Learning to Generate Realistic Audio for Silent Video
Honest exploratory comparison showing that a transformer-based model outperforms a deep-fusion CNN and WaveNet at generating low-to-mid frequency audio from silent video on a small curated dataset; not a speech or SSI paper.
Audio Knowledge Empowered Visual Speech Recognition
The paper advances visual speech recognition by selectively transferring refined linguistic audio knowledge via a learned compact memory and cross-attention injection, improving benchmark WERs over prior audio-assisted methods without requiring audio inputs during inference.
Audio-visual video-to-speech synthesis with synthesized input audio
The paper credibly shows that incorporating synthesized audio as an auxiliary input in a second-stage audiovisual synthesis model improves video-to-speech reconstruction quality and intelligibility on benchmarks, though gains depend on model variant and dataset.
Audio-aware Query-enhanced Transformer for Audio-Visual Segmentation
Strong AVS result, outside SSI: the useful idea is audio-conditioned decoder queries plus dynamic mask prediction.
RobustL2S: Speaker-Specific Lip-to-Speech Synthesis exploiting Self-Supervised Representations
Strong modular SSL-based lip-to-speech synthesis paper that innovatively maps lip SSL features to disentangled speech embeddings before vocoder synthesis, demonstrating improved intelligibility and robustness across benchmark datasets.
Diff-Foley: Synchronized Video-to-Audio Synthesis with Latent Diffusion Models
The real gain is not 'diffusion' alone but aligned conditioning plus guidance that pushes synchronization very hard.
High-Quality Automatic Voice Over with Accurate Alignment: Supervision through Self-Supervised Discrete Speech Units
This video-conditioned automatic voice over (AVO) system innovatively supervises alignment by predicting discrete speech units rather than reconstructing acoustic features, leading to better lip-sync and speech quality on a single-speaker dataset; however, it is not an SSI paper.
Large-scale unsupervised audio pre-training for video-to-speech synthesis
Good decoder-transfer pretraining improves video-to-speech quality on several benchmarks, but WER gains are not consistent. A useful methodological contribution with strong benchmark support, adjacent to SSI rather than a deployable system.
LipVoicer: Generating Speech from Silent Videos Guided by Lip Reading
Strong full-text paper demonstrating that inference-time text guidance via an ASR classifier is key to significantly improved intelligibility in lip-to-speech synthesis on challenging in-the-wild video datasets, outperforming prior baselines.
Intelligible Lip-to-Speech Synthesis with Speech Units
Speech units as a pseudo-text target enable strong content supervision that substantially cuts WER without text labels, and the multi-input vocoder improves speech quality from blurry mel outputs, yielding a state-of-the-art lip-to-speech system on LRS benchmarks.
Zero-shot personalized lip-to-speech synthesis with face image based voice control
Demonstrates effective zero-shot voice control in Lip2Speech by leveraging face image-based speaker embeddings, validated on the GRID corpus but constrained by the dataset's vocabulary and by limited speech naturalness.
Improving the Gap in Visual Speech Recognition Between Normal and Silent Speech Based on Metric Learning
Strong viseme-level metric learning approach that reduces silent speech VSR errors on a small 10-phrase dataset, notably achieving parity with baselines while using much less silent-speech data.
Conditional Generation of Audio from Video via Foley Analogies
The paper matters because it gives V2A generation a controllable exemplar, not because it beats every timing baseline.
Speech Reconstruction from Silent Tongue and Lip Articulation By Pseudo Target Generation and Domain Adversarial Training
Strong SSI paper improving silent speech reconstruction by generating pseudo acoustic targets and using domain adversarial training to address domain mismatch; validated on the TaL dataset, showing substantial WER and MOS gains over TaLNet.
LipLearner: Customizable Silent Speech Interactions on Mobile Devices
LipLearner is a strong mobile silent speech system that uniquely closes the loop from few-shot lipreading model design to practical on-device customization and keyword spotting, demonstrated robustly in real-world conditions and a user study.
Lip-to-Speech Synthesis for Arbitrary Speakers in the Wild
The real contribution is not just another VAE-GAN; it is turning lip-to-speech into an arbitrary-speaker problem with credible low-data adaptation.
FastLTS: Non-Autoregressive End-to-End Unconstrained Lip-to-Speech Synthesis
This paper matters because it makes unconstrained lip-to-speech materially faster without obviously sacrificing quality.
VisageSynTalk: Unseen Speaker Video-to-Speech Synthesis via Speech-Visage Feature Selection
The paper is really about disentangling identity, and that is why the unseen-speaker results hold up.
SVTS: Scalable Video-to-Speech Synthesis
A key scaling contribution that demonstrates simple spectrogram prediction plus pretrained vocoder pipelines outperform prior complex models on diverse datasets, marking foundational progress in large-scale video-to-speech synthesis.
Multi-modality Associative Bridging through Memory: Speech Sound Recollected from Face Video
The key idea is not generic fusion; it is storing cross-modal correspondences so video-only decoding can recover some audio-side structure later.
VCVTS: Multi-speaker Video-to-Speech synthesis via cross-modal knowledge transfer from voice conversion
The real move is importing structure from voice conversion, not just adding another speaker embedding.
VisualTTS: TTS with Accurate Lip-Speech Synchronization for Automatic Voice Over
VisualTTS effectively improves lip-speech synchronization in scripted voice over by conditioning TTS on lip video, but does not tackle silent speech decoding or unscripted scenarios.
Advances and Challenges in Deep Lip Reading
Good survey, not a model result.
Sub-word Level Lip Reading With Visual Attention
Major lip-reading gain, adjacent to SSI.
Speaker disentanglement in video-to-speech conversion
The paper effectively makes speaker identity a controllable factor in multi-speaker video-to-speech synthesis by disentangling it from content, showing the trade-off between intelligibility and voice control on the GRID corpus.
Speech Prediction in Silent Videos using Variational Autoencoders
Strong video-to-speech paper that models ambiguity explicitly.
An Overview of Deep-Learning-Based Audio-Visual Speech Enhancement and Separation
Strong AV speech survey, not an SSI system paper.
Foley Music: Learning to Generate Music from Videos
Strong video-to-music paper, not SSI.
Vocoder-Based Speech Synthesis from Silent Videos
A notable step forward in lip-to-speech synthesis by predicting full vocoder features and jointly training for recognition, achieving strong speaker-dependent results but lacking unseen speaker generalization.
Lipper: Synthesizing Thy Speech using Multi-View Lipreading
Strong multi-view lip-to-speech baseline with honest quality limits.
Video-Driven Speech Reconstruction using Generative Adversarial Networks
Foundational direct video-to-audio result with clear generalization limits.
Harnessing AI for Speech Reconstruction using Multi-view Silent Video Feed
Multi-view silent video combined with CNN-LSTM models significantly improves speech audio reconstruction quality over single-view, highlighting the importance of optimal camera placement to address pose variance.
Visual-Only Recognition of Normal, Whispered and Silent Speech
Strong evidence that silent lipreading needs dedicated training.
Cross-modal Embeddings for Video and Audio Retrieval
Useful multimodal retrieval baseline, not SSI.
Lip2AudSpec: Speech reconstruction from silent lip movements video
The paper's auditory spectrogram autoencoder bottleneck target is a key innovation that produces more intelligible, natural reconstructed speech from lip videos than prior methods, as confirmed by objective and human evaluations.
Updating the silent speech challenge benchmark with deep learning
Benchmark update with a real, reproducible WER gain.
Seeing Through Noise: Visually Driven Speaker Separation and Enhancement
Strong audiovisual speech separation and enhancement leveraging face video for speaker-dependent masking; not a silent speech interface paper.
Improved Speech Reconstruction from Silent Video
Strong, benchmark-setting speaker-dependent video-to-speech system that advances speech reconstruction from silent face video but remains limited to per-speaker training and constrained conditions.
Vid2speech: Speech Reconstruction from Silent Video
Real lip-to-speech progress, still tightly benchmark-bounded.