2023 · arXiv / imported corpus page · Field expert review · confidence high

LipVoicer: Generating Speech from Silent Videos Guided by Lip Reading

Yochai Yemini, Aviv Shamsian, Lior Bracha, Sharon Gannot, Ethan Fetaya

Strong full-text paper showing that inference-time text guidance from an ASR classifier is key to the large intelligibility gains in lip-to-speech synthesis on challenging in-the-wild video datasets, where it outperforms prior baselines.

Verdict: full-text draft · Priority: high · Confidence: high · Basis: full text · Coverage: high

Reading guidance

Verdict: full-text draft · priority high · confidence high
Why it matters: Shows the crucial role of leveraging text inferred by lip-reading at inference as classifier guidance to overcome intrinsic ambiguity in lip motion, resulting in state-of-the-art intelligibility and naturalness in lip-to-speech synthesis for in-the-wild data.
What to trust: Basis: full text. Coverage: high. 8 evidence records back the review.
What is weak: Depends on the accuracy of large pretrained lip-reader and ASR models plus a heavy diffusion model, and quality degrades with less accurate lip-readers. Generation is offline and takes hundreds of inference steps, so real-time, mobile, and on-device use are out of scope. Evaluation is strong on the LRS2 and LRS3 benchmarks but covers silent lip video only, with no noisy audio or video inputs and no multimodal fusion beyond lip video and inferred text guidance. Injection of malicious text is a recognized misuse risk. Overclaim risk: medium.
Read before: SSI review rubric
Read next: SSI archive

Axes

Task: speech-reconstruction
Modality: silent lip video
Hardware: camera
Body site: lip
Output: speech-audio
Vocabulary: open
Metrics: Human MOS for intelligibility, naturalness, quality, and synchronization, reported as means with confidence intervals; objective ASR word error rate (WER); non-intrusive speech quality and intelligibility via DNSMOS and STOI-Net; SyncNet audio-visual synchronization metrics (LSE-C, LSE-D). On the LRS3 test set, WER rises from 21.4% with full LipVoicer to 86.2% without ASR guidance; LipVoicer outperforms the baselines Lip2Speech, VCA-GAN, and SVTS on both human and objective metrics.
Evaluation mode: Human mean opinion scores (MOS) on intelligibility, naturalness, quality, synchronization; objective metrics including WER, DNSMOS, STOI-Net, SyncNet metrics on LRS2 and LRS3 test sets.
Review confidence: high
Overclaim risk: medium
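
The WER values listed under Metrics are word-level edit distances between a transcript of the generated speech and the reference transcript, divided by the number of reference words. A minimal sketch of that computation follows, assuming lowercase whitespace tokenization. The function name `word_error_rate` and the example sentences are illustrative and not taken from the paper, whose text normalization may differ.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length.

    Assumption: lowercase + whitespace tokenization. Real evaluations
    usually also strip punctuation and normalize numbers.
    """
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between ref[:i-1] and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # deleting the first i reference words
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    # One substituted word out of six reference words.
    print(word_error_rate("the cat sat on the mat", "the cat sat on a mat"))
```

Because insertions count as errors, WER can exceed 100% when the hypothesis is much longer than the reference, which is worth keeping in mind when reading very high values such as the 86.2% ablation figure.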

Expert take

LipVoicer advances lip-to-speech synthesis by decoupling the task into lip-reading and guided diffusion generation: at inference time, an ASR model provides classifier guidance toward the text predicted by the lip-reader. This disentangles text content estimation from speech characteristic generation, markedly improving intelligibility and naturalness on the challenging in-the-wild datasets LRS2 and LRS3 compared to recent baselines. Extensive evaluation shows LipVoicer nearly matches ground-truth scores on human and machine assessments, and ablation studies demonstrate that ASR guidance at inference is critical: without it, WER on LRS3 rises from 21.4% to 86.2%. The system is modular, so the lip-reader and ASR can be swapped for stronger models. However, it relies on heavy models and offline generation, which limits deployment in mobile or real-time scenarios, and it poses a risk of speech manipulation if malicious text is injected. Overall, LipVoicer sets a new standard for intelligible and natural lip-to-speech synthesis on complex data and is useful as a reference architecture for future work focused on intelligibility rather than real-time constraints.

True value

Shows the crucial role of leveraging text inferred by lip-reading at inference as classifier guidance to overcome intrinsic ambiguity in lip motion, resulting in state-of-the-art intelligibility and naturalness in lip-to-speech synthesis for in-the-wild data.

What changed

Canon before

Prior lip-to-speech systems map silent lip video directly to audio or audio features, often producing ambiguous or unintelligible speech on in-the-wild datasets with diverse speakers and open vocabulary.

Delta from canon

LipVoicer incorporates inferred text from a lip-reading network at inference time as guidance for diffusion-based speech generation, reducing ambiguity and improving intelligibility and synchronization.
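
The guidance described above can be sketched as a single combination step inside a diffusion sampler. The sketch assumes the standard formulations of classifier-free guidance (weight `w1`, over the video conditioning) and classifier guidance (weight `w2`, using the gradient of the ASR log-likelihood of the lip-read text with respect to the noisy mel-spectrogram). This review does not reproduce the paper's exact update rule, so the function `guided_noise`, its arguments, and the flat-list representation are hypothetical, and any extra scaling or normalization the authors apply is not captured.

```python
import math
from typing import List, Sequence


def guided_noise(
    eps_cond: Sequence[float],    # noise predicted with video conditioning
    eps_uncond: Sequence[float],  # noise predicted without conditioning
    asr_grad: Sequence[float],    # d/dx_t log p(text | x_t) from the ASR
    w1: float,                    # classifier-free guidance weight
    w2: float,                    # ASR classifier guidance weight
    alpha_bar_t: float,           # cumulative noise-schedule product at step t
) -> List[float]:
    """Combine both guidance terms into one noise estimate (assumed form).

    eps_hat = (1 + w1) * eps_cond - w1 * eps_uncond
              - w2 * sqrt(1 - alpha_bar_t) * asr_grad
    """
    if not (len(eps_cond) == len(eps_uncond) == len(asr_grad)):
        raise ValueError("all inputs must have the same length")
    if not 0.0 <= alpha_bar_t <= 1.0:
        raise ValueError("alpha_bar_t must lie in [0, 1]")

    scale = w2 * math.sqrt(1.0 - alpha_bar_t)
    return [
        (1.0 + w1) * c - w1 * u - scale * g
        for c, u, g in zip(eps_cond, eps_uncond, asr_grad)
    ]


if __name__ == "__main__":
    # Toy two-bin "mel-spectrogram"; the values are made up for illustration.
    print(guided_noise([1.0, 2.0], [0.0, 1.0], [2.0, -2.0],
                       w1=0.0, w2=1.0, alpha_bar_t=0.75))
```

Setting `w2 = 0` removes the ASR term entirely, which corresponds to the ablation in which the review reports WER on LRS3 rising from 21.4% to 86.2%.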

Position in field

A top-tier recent lip-to-speech synthesis system that achieves near ground-truth intelligibility and naturalness on unconstrained in-the-wild datasets, setting a new state of the art for video-to-speech intelligibility over prior direct-regression and unit-based methods.

Evidence

“ An ASR steers MelGen, which generates the mel-spectrogram, in the direction of the extracted text using classifier guidance, such that the generated mel-spectrogram reflects the spoken text. (b) MelGen, our diffusion denoising model that generates mel-spectrograms conditioned on a face image and a mouth region video extracted from the full-face video using classifier-free guidance. ”

author_claim · Abstract · confidence 1.00

“ The architecture of LipVoicer requires several design choices: the values of w1, w2, the lip reading network and the ASR used for guidance. ”

actual_novelty · 4 · confidence 1.00

“ LipVoicer outperforms multiple lip-to-speech baselines on LRS2 and LRS3, which are in-the-wild datasets with hundreds of unique speakers in their test set and an unrestricted vocabulary. ”

validation_scope · 5 · confidence 1.00

“ It is also clear that ASR guidance is vital, as without it (w2 = 0) the WER plunges from 21.4% to 86.2% on LRS3. ”

metric · 5 · confidence 1.00

“ Implementation Details For predicting the text from the silent video at inference time, we use Ma et al. (2023) as our lip-reader for LRS2 and LRS3. ”

limitation · 6 · confidence 1.00

“ Figure 1: An illustration of LipVoicer, a dual-stage framework for lip-to-speech. (a) To generate the speech from a given silent video at inference time, a pre-trained lip-reader provides additional guidance by predicting the text from the video. ”

deployment_claim · 6 · confidence 1.00

“ Furthermore, we show through human evaluation that LipVoicer faithfully recovers the ground truth speech and surpasses recent baselines in intelligibility, naturalness, quality, and synchronization. ”

fact · 5 · confidence 1.00

“ Implementation Details For predicting the text from the silent video at inference time, we use Ma et al. (2023) as our lip-reader for LRS2 and LRS3. ”

fact · 5 · confidence 1.00

Limits

Technical limits

Depends on the quality of large pretrained lip-reader and ASR models plus a heavy diffusion model; generation is offline and needs hundreds of inference steps, which limits real-time use; quality degrades with less accurate lip-readers; risk of adversarial misuse via text injection.

Evaluation limits

Evaluations are strong on the benchmark datasets LRS2 and LRS3 but do not target real-time, mobile, or on-device use; results rely on a high-quality lip-reader and ASR.

Deployment limits

Heavy model stack including lip-reader, ASR, and diffusion; no focus on low-latency or on-device deployment; recognized risk of misuse from malicious text injection.

Scope limits

Lip-to-speech from silent lip video only; does not handle noisy audio or video inputs, or multimodal fusion beyond lip video and inferred text guidance.