2022 · arXiv / imported corpus page · Field expert review · confidence high

FastLTS: Non-Autoregressive End-to-End Unconstrained Lip-to-Speech Synthesis

Yongqi Wang, Zhou Zhao

This paper matters because it makes unconstrained lip-to-speech materially faster without obviously sacrificing quality.

Verdict: full-text draft · Priority: high · Confidence: high · Basis: full text · Coverage: high

Reading guidance

Verdict: full-text draft · priority high · confidence high
Why it matters: The full text supports a real systems gain: FastLTS keeps competitive perceptual quality while running waveform inference 19.76x faster than the autoregressive baseline at a 3-second input length.
What to trust: Basis: full text. Coverage: high. 4 evidence records back the review.
What is weak: Evidence is still limited to benchmark corpora with offline generation windows and three evaluated speakers per dataset slice. The main quality evidence is subjective MOS plus GRID PESQ; there is no in-the-wild user study or end-to-end conversational latency test. The paper supports lower-latency generation, but not a live camera-to-audio interactive deployment. Scope is limited to unconstrained lip-to-speech synthesis from face video only. Overclaim risk: The full text supports faster offline waveform generation on benchmark corpora, not a proven production-ready real-time SSI.
Read before: SSI review rubric
Read next: SSI archive

Axes

Task: speech-reconstruction
Modality: silent talking-face video
Body site: face; lip
Output: speech-audio
Vocabulary: large-vocabulary unconstrained speech
Metrics: MOS quality/intelligibility/naturalness; PESQ; waveform acceleration ratio; parameter count
Evaluation mode: subjective MOS on Lip2Wav and GRID, PESQ on GRID, plus mel and waveform inference-speed comparison
Review confidence: high
Overclaim risk: The full text supports faster offline waveform generation on benchmark corpora, not a proven production-ready real-time SSI.

Expert take

The strongest evidence is joint, not isolated. Table 3 shows FastLTS improving GRID MOS over Lip2Wav from 3.27/3.47/3.54 to 3.59/3.68/3.73 for quality, intelligibility, and naturalness, while Section 5.5 reports 19.76x waveform speedup at a 3-second window. Table 5 also shows the speedup is not bought with GlowLTS-scale bloat: FastLTS uses 50.09M parameters versus 39.87M for Lip2Wav and 85.92M for GlowLTS. The remaining caution is quality headroom: Table 4 gives FastLTS a GRID PESQ of 1.939, which is strong but not the top reported number in that comparison.
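
The joint reading above can be checked with simple arithmetic. The sketch below only re-derives gaps and ratios from the numbers quoted in this review (Tables 3 and 5); the dictionary and function names are illustrative, not from the paper, and nothing here tests statistical significance.

```python
# Values copied from the review text above (paper Tables 3 and 5).
GRID_MOS = {
    "Lip2Wav": {"quality": 3.27, "intelligibility": 3.47, "naturalness": 3.54},
    "FastLTS": {"quality": 3.59, "intelligibility": 3.68, "naturalness": 3.73},
}
PARAMS_M = {"Lip2Wav": 39.87, "FastLTS": 50.09, "GlowLTS": 85.92}


def mos_gains(baseline: str = "Lip2Wav", model: str = "FastLTS") -> dict:
    """Absolute MOS gain of `model` over `baseline` on each axis."""
    return {
        axis: round(GRID_MOS[model][axis] - GRID_MOS[baseline][axis], 2)
        for axis in GRID_MOS[baseline]
    }


def param_ratio(model: str, reference: str) -> float:
    """Parameter count of `model` relative to `reference`."""
    return round(PARAMS_M[model] / PARAMS_M[reference], 3)


if __name__ == "__main__":
    print(mos_gains())
    print(param_ratio("FastLTS", "Lip2Wav"))
    print(param_ratio("FastLTS", "GlowLTS"))
```

On these figures FastLTS gains 0.32, 0.21, and 0.19 MOS points on quality, intelligibility, and naturalness, at roughly 1.26x the parameters of Lip2Wav and roughly 0.58x those of GlowLTS.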

True value

The full text supports a real systems gain: FastLTS keeps competitive perceptual quality while running waveform inference 19.76x faster than the autoregressive baseline at a 3-second input length.

What changed

Canon before

Unconstrained lip-to-speech systems typically predicted mel-spectrograms first and then relied on slow autoregressive or heavy flow-based waveform generation.

Delta from canon

FastLTS removes the intermediate spectrogram bottleneck from the main inference path and uses a fully parallelized decoder plus GAN vocoder.
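
To make the delta concrete, here is a toy contrast between autoregressive and parallel decoding. It is a sketch of the general idea only, not the FastLTS architecture: the function names and the scalar "frames" are illustrative assumptions, and a real parallel decoder would batch the per-frame work on an accelerator rather than loop in Python.

```python
from typing import Callable, List, Sequence


def autoregressive_decode(
    features: Sequence[float],
    step: Callable[[float, float], float],
) -> List[float]:
    """Each output depends on the previous output, so steps must run in order."""
    outputs: List[float] = []
    prev = 0.0
    for x in features:
        prev = step(prev, x)  # cannot start until the previous step finishes
        outputs.append(prev)
    return outputs


def parallel_decode(
    features: Sequence[float],
    frame_fn: Callable[[float], float],
) -> List[float]:
    """No output depends on another output, so all frames could run at once."""
    return [frame_fn(x) for x in features]


if __name__ == "__main__":
    feats = [1.0, 2.0, 3.0]
    print(autoregressive_decode(feats, lambda prev, x: prev + x))
    print(parallel_decode(feats, lambda x: 2 * x))
```

The point of the contrast is the dependency structure: the first function has a chain of length equal to the number of frames, while the second has none, which is what lets a non-autoregressive model trade sequential steps for parallel compute.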

Position in field

Strong unconstrained lip-to-speech systems paper centered on latency reduction rather than a new sensing modality.

Evidence

“ To tackle these problems, we propose FastLTS, a non-autoregressive end-to-end model which can directly synthesize high-quality speech audios from unconstrained talking videos with low latency, and has a relatively small model size. ”

author_claim · ABSTRACT · confidence 0.99

“ The Lip2Wav dataset [24] is the largest and most commonly used dataset for unconstrained lip-to-speech synthesis. ”

validation_scope · 5.1 Datasets · confidence 0.98

“ Table 3: MOS on GRID Dataset. Method: Lip2Wav · Quality 3.27 ± 0.11 · Intelligibility 3.47 ± 0.13 · Naturalness 3.54 ± 0.12. [...] These algorithms measure the distortion of a noisy signal relative to the original one, while GANs may produce intelligible speeches with different intonation from the original speeches, which causes a relatively poor STOI value yet does no damage to intelligibility and naturalness. ”

metric · Table 3: MOS on GRID Dataset · confidence 0.99

“ The acceleration ratio of waveform synthesis reaches 19.76× when the input length is 3 seconds. It is worth noting that the acceleration ratio of our model exceeds that of GlowLTS [15]. ”

metric · 5.5 Inference Speedup · confidence 0.99
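
For readers reproducing the speed claim, a minimal timing sketch follows. It assumes the acceleration ratio is the baseline's mean latency divided by the model's mean latency on the same input length, which is the usual reading but is not spelled out in the excerpt above. The helper names are hypothetical and the workloads are placeholders, not the paper's models.

```python
import time
from typing import Callable


def mean_latency(fn: Callable[[], object], repeats: int = 5, warmup: int = 1) -> float:
    """Average wall-clock seconds per call, after a few untimed warmup calls."""
    for _ in range(warmup):
        fn()
    start = time.perf_counter()
    for _ in range(repeats):
        fn()
    return (time.perf_counter() - start) / repeats


def acceleration_ratio(baseline_seconds: float, model_seconds: float) -> float:
    """Assumed definition: how many times faster the model is than the baseline."""
    if baseline_seconds <= 0 or model_seconds <= 0:
        raise ValueError("latencies must be positive")
    return baseline_seconds / model_seconds


if __name__ == "__main__":
    # Placeholder workloads standing in for a slow and a fast synthesizer.
    slow = mean_latency(lambda: sum(range(200_000)))
    fast = mean_latency(lambda: sum(range(10_000)))
    print(f"acceleration ratio: {acceleration_ratio(slow, fast):.2f}x")
```

Both systems should be timed on the same hardware and the same input length, since the excerpt reports the ratio at one specific length (3 seconds).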

Limits

Technical limits

Evidence is still limited to benchmark corpora with offline generation windows and three evaluated speakers per dataset slice.

Evaluation limits

The main quality evidence is subjective MOS plus GRID PESQ; there is no in-the-wild user study or end-to-end conversational latency test.

Deployment limits

The paper supports lower-latency generation, but not a live camera-to-audio interactive deployment.
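
One way to see why a speedup ratio alone does not establish real-time operation is the real-time factor (synthesis time divided by audio duration). The sketch below is illustrative: the baseline values are hypothetical, chosen only to show that the same 19.76x ratio can land on either side of real time, and it covers synthesis only, not camera capture or face preprocessing.

```python
def real_time_factor(synthesis_seconds: float, audio_seconds: float) -> float:
    """RTF below 1.0 means audio is generated faster than it plays back."""
    if audio_seconds <= 0:
        raise ValueError("audio duration must be positive")
    return synthesis_seconds / audio_seconds


def rtf_after_speedup(baseline_rtf: float, acceleration_ratio: float) -> float:
    """RTF of the faster model, given the baseline RTF and a speedup ratio."""
    if acceleration_ratio <= 0:
        raise ValueError("acceleration ratio must be positive")
    return baseline_rtf / acceleration_ratio


if __name__ == "__main__":
    speedup = 19.76  # ratio reported in Section 5.5 at a 3-second input
    for hypothetical_baseline_rtf in (0.5, 40.0):  # made-up baselines
        rtf = rtf_after_speedup(hypothetical_baseline_rtf, speedup)
        print(f"baseline RTF {hypothetical_baseline_rtf}: model RTF {rtf:.3f}")
```

Without the baseline's absolute latency, the reported ratio cannot be converted into a real-time factor, which is why the deployment claim stays open.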

Scope limits

Unconstrained lip-to-speech synthesis from face video only.