Vid2speech: Speech Reconstruction from Silent Video
Real lip-to-speech progress, still tightly benchmark-bounded.
Reading guidance
- Verdict
- full-text draft · priority high · confidence high
- Why it matters
- The paper is an early but real lip-to-speech milestone: within GRID, silent video can drive intelligible reconstructed speech and even partial OOV recovery.
- What to trust
- Basis: full text. Coverage: high. 4 evidence records back the review.
- What is weak
- The system depends on speaker-specific training, a constrained GRID grammar, and an LPC-style resynthesis path that still sounds unnatural. Evidence is limited to GRID and a relatively small listening study rather than an open-vocabulary or in-the-wild evaluation. No live camera-to-audio system or user-facing deployment is shown. The scope is a constrained lip-to-speech reconstruction study. Overclaim risk: Medium; the paper proves benchmark intelligibility, not real-world or speaker-independent lip-to-speech.
- Read before
- SSI review rubric
- Read next
- SSI archive
Axes
- Task
- speech-reconstruction
- Modality
- silent face video
- Hardware
- 25 FPS, 720x576 video from the GRID audiovisual corpus
- Body site
- face / lips
- Output
- speech audio
- Vocabulary
- GRID sentence grammar
- Metrics
- Table 2 reports 82.6% audio-only intelligibility for S4 and 79.9% / 79.0% audio-visual intelligibility for S4 / S2, compared with 40.0% audio-only and 51.9% audio-visual in prior work [10]. Table 3 reports 51.6% OOV audio-visual intelligibility versus 10.0% chance and 93.4% when no digits are held out.
- Evaluation mode
- Human intelligibility studies on reconstructed audio-only, audio-visual, and out-of-vocabulary settings using MTurk listeners.
- Review confidence
- high
- Overclaim risk
- Medium; the paper proves benchmark intelligibility, not real-world or speaker-independent lip-to-speech.
Expert take
The full text supports a stronger claim than an abstract-only reading would. On the constrained GRID setup, reconstructed speech is far more intelligible than in prior work [10]: Table 2 shows 82.6% audio-only intelligibility on S4 and about 80% audio-visual intelligibility on S4 and S2, versus 40.0% audio-only and 51.9% audio-visual for the baseline. The OOV experiment is substantive rather than decorative: Table 3 reports 51.6% audio-visual intelligibility when two digits are held out of training, far above the 10% chance rate but well below the 93.4% reached when no digits are held out. The limits are equally clear: speaker dependence, LPC-like synthesis artifacts, and a fixed 51-word grammar keep this from being a practical open-world SSI.
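The comparisons in this take are simple percentage-point arithmetic, sketched below for convenience. The figures are the ones quoted in this review from Tables 2 and 3; the variable and function names are illustrative assumptions, not anything defined in the paper.

```python
# Intelligibility figures (percent) as quoted in this review.
# All names here are hypothetical and exist only for this sketch.
ours = {
    "audio_only_S4": 82.6,       # Table 2
    "audio_visual_S4": 79.9,     # Table 2
    "oov_audio_visual": 51.6,    # Table 3, digits held out of training
    "full_audio_visual": 93.4,   # Table 3, no digits held out
}
prior = {"audio_only": 40.0, "audio_visual": 51.9}  # prior work [10], Table 2
chance_oov = 10.0                                    # Table 3 chance rate


def point_gain(value, reference):
    """Difference in percentage points, rounded to one decimal place."""
    return round(value - reference, 1)


audio_only_gain = point_gain(ours["audio_only_S4"], prior["audio_only"])
audio_visual_gain = point_gain(ours["audio_visual_S4"], prior["audio_visual"])
oov_over_chance = point_gain(ours["oov_audio_visual"], chance_oov)
oov_penalty = point_gain(ours["full_audio_visual"], ours["oov_audio_visual"])
oov_chance_ratio = round(ours["oov_audio_visual"] / chance_oov, 2)

print(f"audio-only gain over [10] (S4):   {audio_only_gain} points")
print(f"audio-visual gain over [10] (S4): {audio_visual_gain} points")
print(f"OOV margin over chance:           {oov_over_chance} points")
print(f"cost of holding digits out:       {oov_penalty} points")
print(f"OOV / chance ratio:               {oov_chance_ratio}x")
```

Only the S4 figures are compared against [10] here, because the evidence records state that the baseline comparison was run on S4.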
True value
The paper is an early but real lip-to-speech milestone: within GRID, silent video can drive intelligible reconstructed speech and even partial OOV recovery.
What changed
Canon before
Most visual silent-speech work focused on recognition rather than direct speech reconstruction, and earlier reconstruction quality was substantially lower.
Delta from canon
The paper models automatic speechreading as regression to acoustic features and shows materially improved human intelligibility, including a held-out-digit OOV test.
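To illustrate what "regression to acoustic features" means in contrast to recognition, here is a toy sketch: visual inputs map to continuous feature vectors scored by squared error, not to word labels. Everything in it is an assumption made for illustration. The data are synthetic, the sizes are arbitrary, and ridge regression stands in for the paper's CNN, which this sketch does not attempt to reproduce.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes, chosen only for this sketch (not taken from the paper).
n_windows, visual_dim, acoustic_dim = 200, 64, 8

# Stand-in visual inputs: one flattened vector per short window of frames.
X = rng.normal(size=(n_windows, visual_dim))

# Stand-in acoustic targets: a hidden linear map plus a little noise.
W_true = rng.normal(size=(visual_dim, acoustic_dim))
Y = X @ W_true + 0.1 * rng.normal(size=(n_windows, acoustic_dim))

# 80/20 train/test split, mirroring the split ratio quoted in the evidence.
n_train = int(0.8 * n_windows)
X_train, X_test = X[:n_train], X[n_train:]
Y_train, Y_test = Y[:n_train], Y[n_train:]

# Ridge regression in closed form: W = (X^T X + lam * I)^-1 X^T Y.
lam = 1e-2
W_hat = np.linalg.solve(
    X_train.T @ X_train + lam * np.eye(visual_dim),
    X_train.T @ Y_train,
)

# The output is a continuous feature vector per window, not a class label.
Y_pred = X_test @ W_hat
test_mse = float(np.mean((Y_pred - Y_test) ** 2))

# Trivial reference: always predict the mean training target.
mean_mse = float(np.mean((Y_test - Y_train.mean(axis=0)) ** 2))

print(f"regression MSE: {test_mse:.4f}, mean-predictor MSE: {mean_mse:.4f}")
```

In a regression framing like this, a word absent from training is not excluded by construction, since the model emits features and not entries from a closed label set. That is consistent with the held-out-digit test being possible at all, though this toy example says nothing about how well it works in practice.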
Position in field
Important early visual speech-reconstruction paper in SSI-adjacent lip-to-speech research.
Evidence
“ In this paper we present an end-to-end model based on a convolutional neural network (CNN) for generating an intelligible acoustic speech signal from silent video frames of a speaking person. ”
author_claim · Abstract · confidence 1.00
“ For this task we trained our model on a random 80/20 train/test split of the 1000 videos of S4 and made sure that all 51 GRID words were represented in each set. ”
validation_scope · 4.2. Sound prediction tasks · confidence 1.00
“ In order to accurately compare our results with [10], we performed our experiments on the 1000 videos of speaker four (S4, female) as done there. ” [Table 2 row, audio-visual: 51.9% ([10]), 79.9% (S4), 79% (S2)]
metric · Table 2. Our reconstructed speech is significantly more intelligible than the results of [10]. · confidence 1.00
“ For this task we trained our model on a random 80/20 train/test split of the 1000 videos of S4 and made sure that all 51 GRID words were represented in each set. ” [Table 3 row, audio-visual: 51.6% (OOV), 93.4% (no digits held out), 10.0% (chance)]
metric · Table 3. Out-of-vocabulary (OOV) intelligibility results. · confidence 1.00
Limits
Technical limits
The system depends on speaker-specific training, a constrained GRID grammar, and an LPC-style resynthesis path that still sounds unnatural.
Evaluation limits
Evidence is limited to GRID and a relatively small listening study rather than an open-vocabulary or in-the-wild evaluation.
Deployment limits
No live camera-to-audio system or user-facing deployment is shown.
Scope limits
The study covers constrained lip-to-speech reconstruction only: a single corpus (GRID), a fixed 51-word grammar, and speaker-specific models.