2017 · arXiv / imported corpus page · Field expert review · confidence high

Vid2speech: Speech Reconstruction from Silent Video

Ariel Ephrat, Shmuel Peleg

arXiv

Real lip-to-speech progress, still tightly benchmark-bounded.

Verdict: full-text draft · Priority: high · Confidence: high · Basis: full text · Coverage: high

Reading guidance

Verdict: full-text draft · priority high · confidence high
Why it matters: The paper is an early but real lip-to-speech milestone: within GRID, silent video can drive intelligible reconstructed speech and even partial OOV recovery.
What to trust: Basis: full text. Coverage: high. 4 evidence records back the review.
What is weak: The system depends on speaker-specific training, a constrained GRID grammar, and an LPC-style resynthesis path that still sounds unnatural. Evidence is limited to GRID and a relatively small listening study rather than an open-vocabulary or in-the-wild evaluation. No live camera-to-audio system or user-facing deployment is shown. Scope: constrained lip-to-speech reconstruction study. Overclaim risk: Medium; the paper proves benchmark intelligibility, not real-world or speaker-independent lip-to-speech.
Read before: SSI review rubric
Read next: SSI archive

Axes

Task: speech-reconstruction
Modality: silent face video
Hardware: 25 FPS, 720x576 video from the GRID audiovisual corpus
Body site: face / lips
Output: speech audio
Vocabulary: GRID sentence grammar
Metrics: Table 2 reports 82.6% audio-only intelligibility for S4 and 79.9% / 79.0% audio-visual intelligibility for S4 / S2, compared with 40.0% audio-only and 51.9% audio-visual in prior work [10]. Table 3 reports 51.6% OOV audio-visual intelligibility versus 10.0% chance and 93.4% when no digits are held out.
Evaluation mode: Human intelligibility studies on reconstructed audio-only, audio-visual, and out-of-vocabulary settings using MTurk listeners.
Review confidence: high
Overclaim risk: Medium; the paper proves benchmark intelligibility, not real-world or speaker-independent lip-to-speech.

Expert take

The full text backs a stronger claim than the abstract-only version. On the constrained GRID setup, reconstructed speech becomes much more intelligible than prior work: Table 2 shows 82.6% audio-only intelligibility on S4 and about 80% audio-visual intelligibility on S4 and S2, versus 40.0% and 51.9% in the cited baseline [10]. The OOV experiment is also real rather than decorative: Table 3 reports 51.6% audio-visual intelligibility when two digits are held out of training, far above the 10% chance rate, though well below the 93.4% reached when no digits are held out. The limits are equally clear: speaker dependence, LPC-style synthesis artifacts, and a fixed 51-word grammar keep this from being a practical open-world SSI.
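The comparisons in this paragraph are simple arithmetic on the quoted table values, and a short sketch makes the size of each gap explicit. The figures below are copied from this review's summary of Tables 2 and 3; the dictionary keys and helper names are illustrative and do not come from the paper.

```python
# Intelligibility figures (percent) as quoted in this review from Tables 2 and 3.
RESULTS = {
    "audio_only_s4": 82.6,
    "audio_visual_s4": 79.9,
    "audio_visual_s2": 79.0,
    "oov_audio_visual": 51.6,
    "no_holdout_audio_visual": 93.4,
}
BASELINE = {"audio_only": 40.0, "audio_visual": 51.9}  # prior work [10]
CHANCE_OOV = 10.0  # chance rate reported alongside Table 3


def absolute_gain(ours, reference):
    """Difference in percentage points, rounded to one decimal."""
    return round(ours - reference, 1)


def multiple_of(ours, reference):
    """How many times larger `ours` is than `reference`, rounded to two decimals."""
    return round(ours / reference, 2)


audio_only_gain = absolute_gain(RESULTS["audio_only_s4"], BASELINE["audio_only"])
audio_visual_gain = absolute_gain(RESULTS["audio_visual_s4"], BASELINE["audio_visual"])
oov_over_chance = multiple_of(RESULTS["oov_audio_visual"], CHANCE_OOV)
oov_drop = absolute_gain(RESULTS["no_holdout_audio_visual"], RESULTS["oov_audio_visual"])

print("audio-only gain over [10] (points):", audio_only_gain)
print("audio-visual gain over [10] (points):", audio_visual_gain)
print("OOV result as a multiple of chance:", oov_over_chance)
print("drop from no-holdout to OOV (points):", oov_drop)
```

Read this way, the audio-only gain is roughly 43 points, the audio-visual gain roughly 28 points, and the OOV result sits at about five times chance while losing about 42 points relative to the no-holdout condition.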

True value

The paper is an early but real lip-to-speech milestone: within GRID, silent video can drive intelligible reconstructed speech and even partial OOV recovery.

What changed

Canon before

Most visual silent-speech work focused on recognition rather than direct speech reconstruction, and earlier reconstruction quality was substantially lower.

Delta from canon

The paper models automatic speechreading as regression to acoustic features and shows materially improved human intelligibility, including a held-out-digit OOV test.

Position in field

Important early visual speech-reconstruction paper in SSI-adjacent lip-to-speech research.

Evidence

“ In this paper we present an end-to-end model based on a convolutional neural network (CNN) for generating an intelligible acoustic speech signal from silent video frames of a speaking person. ”

author_claim · Abstract · confidence 1.00

“ For this task we trained our model on a random 80/20 train/test split of the 1000 videos of S4 and made sure that all 51 GRID words were represented in each set. ”

validation_scope · 4.2. Sound prediction tasks · confidence 1.00
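The quoted passage describes a random 80/20 split with a vocabulary-coverage constraint but not how that constraint was enforced. A minimal sketch of one way to do it is rejection sampling: reshuffle until both sets contain every word. The function names, the `(video_id, words)` sample layout, and the toy data are all assumptions for illustration.

```python
import random


def covers(subset, vocab):
    """True if every word in `vocab` appears in at least one sample of `subset`."""
    seen = set()
    for _, words in subset:
        seen.update(words)
    return set(vocab) <= seen


def split_with_vocab_coverage(samples, vocab, train_frac=0.8, seed=0, max_tries=1000):
    """Random train/test split, redrawn until both sides cover the vocabulary.

    `samples` is a list of (video_id, words) pairs. This is a hypothetical
    helper, not the authors' procedure.
    """
    rng = random.Random(seed)
    n_train = int(round(train_frac * len(samples)))
    pool = list(samples)
    for _ in range(max_tries):
        rng.shuffle(pool)
        train, test = pool[:n_train], pool[n_train:]
        if covers(train, vocab) and covers(test, vocab):
            return train, test
    raise RuntimeError("no split with full vocabulary coverage found")


# Toy stand-in for GRID: 20 clips, each containing two of four words.
toy_words = ["bin", "lay", "blue", "red"]
toy_vocab = set(toy_words)
toy_samples = [(i, (toy_words[i % 4], toy_words[(i + 1) % 4])) for i in range(20)]

train, test = split_with_vocab_coverage(toy_samples, toy_vocab)
print(len(train), "train clips,", len(test), "test clips")
```

With 1000 clips and a 51-word grammar, a plain random split would very likely satisfy the constraint on the first draw; the retry loop matters mainly for small or skewed datasets.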

“ Audio-visual 51.9% 79.9% 79% [...] In order to accurately compare our results with [10], we performed our experiments on the 1000 videos of speaker four (S4, female) as done there. ”

metric · Table 2. Our reconstructed speech is significantly more intelligible than the results of [10]. · confidence 1.00

“ Audio-visual 51.6% 93.4% 10.0% [...] For this task we trained our model on a random 80/20 train/test split of the 1000 videos of S4 and made sure that all 51 GRID words were represented in each set. ”

metric · Table 3. Out-of-vocabulary (OOV) intelligibility results. · confidence 1.00

Limits

Technical limits

The system depends on speaker-specific training, a constrained GRID grammar, and an LPC-style resynthesis path that still sounds unnatural.
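To make the LPC-style limit concrete, here is a rough sketch of linear-predictive analysis and resynthesis in pure Python. This review only states that the resynthesis path is LPC-style; the order, the single-frame setup, and the white-noise excitation below are illustrative assumptions, not the paper's configuration. A filter driven by noise keeps the spectral envelope but discards the original excitation, which is one plausible reason such output can sound unnatural.

```python
import random


def autocorrelation(x, max_lag):
    """Autocorrelation r[0..max_lag] of one frame (autocorrelation method)."""
    n = len(x)
    return [sum(x[i] * x[i + lag] for i in range(n - lag)) for lag in range(max_lag + 1)]


def levinson_durbin(r, order):
    """Solve for predictor coefficients a[1..order] with x[n] ~ sum_k a[k] * x[n-k].

    Returns (coefficients, residual_energy).
    """
    a = [0.0] * (order + 1)
    err = r[0]
    for i in range(1, order + 1):
        acc = r[i] - sum(a[j] * r[i - j] for j in range(1, i))
        k = acc / err  # reflection coefficient for this order
        new_a = a[:]
        new_a[i] = k
        for j in range(1, i):
            new_a[j] = a[j] - k * a[i - j]
        a = new_a
        err *= 1.0 - k * k
    return a[1:], err


def synthesize(coeffs, excitation):
    """All-pole filter: out[n] = excitation[n] + sum_k coeffs[k] * out[n-1-k]."""
    out = []
    for n, e in enumerate(excitation):
        pred = sum(c * out[n - 1 - k] for k, c in enumerate(coeffs) if n - 1 - k >= 0)
        out.append(e + pred)
    return out


# Toy frame: a decaying exponential, which a first-order predictor fits almost exactly.
frame = [0.9 ** n for n in range(200)]
coeffs, residual = levinson_durbin(autocorrelation(frame, 2), 2)

# Assumed excitation: Gaussian white noise in place of the true source signal.
rng = random.Random(0)
noise = [rng.gauss(0.0, 1.0) for _ in range(200)]
resynth = synthesize(coeffs, noise)
print("estimated coefficients:", [round(c, 4) for c in coeffs])
```

In a lip-to-speech pipeline of this kind, the network would predict the per-frame coefficients (or a transform of them) from video, and only the filter step would run at synthesis time.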

Evaluation limits

Evidence is limited to GRID and a relatively small listening study rather than an open-vocabulary or in-the-wild evaluation.

Deployment limits

No live camera-to-audio system or user-facing deployment is shown.

Scope limits

Constrained lip-to-speech reconstruction study.