
2017 · arXiv / imported corpus page · Field expert review · confidence high

Vid2speech: Speech Reconstruction from Silent Video

Ariel Ephrat, Shmuel Peleg

Real lip-to-speech progress, still tightly benchmark-bounded.

Verdict: full-text draft · Priority: high · Confidence: high · Basis: full text · Coverage: high

Reading guidance

Verdict
full-text draft · priority high · confidence high
Why it matters
The paper is an early but real lip-to-speech milestone: within GRID, silent video can drive intelligible reconstructed speech and even partial OOV recovery.
What to trust
Basis: full text. Coverage: high. 4 evidence records back the review.
What is weak
The system depends on speaker-specific training, a constrained GRID grammar, and an LPC-style resynthesis path that still sounds unnatural. Evidence is limited to GRID and a relatively small listening study rather than an open-vocabulary or in-the-wild evaluation. No live camera-to-audio system or user-facing deployment is shown. Scope: constrained lip-to-speech reconstruction study. Overclaim risk: medium; the paper proves benchmark intelligibility, not real-world or speaker-independent lip-to-speech.
Read before
SSI review rubric
Read next
SSI archive

Axes

Task
speech-reconstruction
Modality
silent face video
Hardware
25 FPS, 720x576 video from the GRID audiovisual corpus
Body site
face / lips
Output
speech audio
Vocabulary
GRID sentence grammar
Metrics
Table 2 reports 82.6% audio-only intelligibility for S4 and 79.9% / 79.0% audio-visual intelligibility for S4 / S2, compared with 40.0% audio-only and 51.9% audio-visual in prior work [10]. Table 3 reports 51.6% OOV audio-visual intelligibility versus 10.0% chance and 93.4% when no digits are held out.
Evaluation mode
Human intelligibility studies on reconstructed audio-only, audio-visual, and out-of-vocabulary settings using MTurk listeners.
Review confidence
high
Overclaim risk
Medium; the paper proves benchmark intelligibility, not real-world or speaker-independent lip-to-speech.

Expert take

The full text backs a stronger claim than the abstract-only version. On the constrained GRID setup, reconstructed speech is much more intelligible than in prior work: Table 2 shows 82.6% audio-only intelligibility on S4 and 79.9% / 79.0% audio-visual intelligibility on S4 / S2, versus 40.0% (audio-only) and 51.9% (audio-visual) for the cited baseline [10]. The OOV experiment is also real rather than decorative: Table 3 reports 51.6% audio-visual intelligibility when two digits are held out of training, far above the 10.0% chance rate but well below the 93.4% reached when no digits are held out. The limits are equally clear: speaker dependence, LPC-like synthesis artifacts, and a fixed 51-word grammar keep this from being a practical open-world SSI.
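The margins behind this take are simple arithmetic on the figures quoted above. The sketch below only recomputes them; the variable names are ours, and it assumes the percentages are directly comparable across conditions, which the listening-study design may not fully guarantee. It works out to gains of 42.6 points (audio-only) and 28.0 points (audio-visual) over [10] on S4, an OOV score of about 5.2 times chance, and a 41.8-point cost for holding digits out.

```python
# Intelligibility figures (percent) copied from the Table 2 and Table 3
# values quoted in this review. Names are illustrative, not from the paper.
OURS_AUDIO_ONLY_S4 = 82.6
OURS_AUDIO_VISUAL_S4 = 79.9
BASELINE_AUDIO_ONLY = 40.0       # prior work [10]
BASELINE_AUDIO_VISUAL = 51.9     # prior work [10]
OOV_AUDIO_VISUAL = 51.6          # two digits held out of training
NO_HOLDOUT_AUDIO_VISUAL = 93.4   # no digits held out
CHANCE = 10.0                    # one of ten digits


def margin(a, b):
    """Difference in percentage points, rounded to one decimal place."""
    return round(a - b, 1)


gain_audio_only = margin(OURS_AUDIO_ONLY_S4, BASELINE_AUDIO_ONLY)
gain_audio_visual = margin(OURS_AUDIO_VISUAL_S4, BASELINE_AUDIO_VISUAL)
oov_over_chance = round(OOV_AUDIO_VISUAL / CHANCE, 2)
oov_cost = margin(NO_HOLDOUT_AUDIO_VISUAL, OOV_AUDIO_VISUAL)

print("audio-only gain over [10] (points):", gain_audio_only)
print("audio-visual gain over [10] (points):", gain_audio_visual)
print("OOV score as a multiple of chance:", oov_over_chance)
print("cost of holding digits out (points):", oov_cost)
```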

True value

The paper is an early but real lip-to-speech milestone: within GRID, silent video can drive intelligible reconstructed speech and even partial OOV recovery.

What changed

Canon before

Most visual silent-speech work focused on recognition rather than direct speech reconstruction, and earlier reconstruction quality was substantially lower.

Delta from canon

The paper models automatic speechreading as regression to acoustic features and shows materially improved human intelligibility, including a held-out-digit OOV test.
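To make the "regression to acoustic features" framing concrete, here is a minimal sketch that maps per-clip visual inputs to continuous acoustic feature vectors. It is not the paper's model: the paper uses a CNN, while this stand-in uses closed-form ridge regression on synthetic data, and every size and name below is a hypothetical choice for illustration.

```python
import numpy as np


def fit_ridge(X, Y, lam=1e-2):
    """Closed-form ridge regression: W = (X^T X + lam * I)^-1 X^T Y."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ Y)


# Hypothetical sizes: 200 clips, 64 visual inputs, 9 acoustic features.
rng = np.random.default_rng(0)
n_clips, n_visual, n_acoustic = 200, 64, 9

# Synthetic stand-ins for flattened mouth-region frames and acoustic targets.
X = rng.normal(size=(n_clips, n_visual))
W_true = rng.normal(size=(n_visual, n_acoustic))
Y = X @ W_true + 0.01 * rng.normal(size=(n_clips, n_acoustic))

# Fit on the first 160 clips, predict continuous features for the last 40.
W = fit_ridge(X[:160], Y[:160])
pred = X[160:] @ W
mse = float(np.mean((pred - Y[160:]) ** 2))
print("held-out mean squared error:", mse)
```

The point of the framing is the output type: the model predicts continuous acoustic features that a synthesizer can turn into audio, rather than a class label from a closed word list, which is what makes the held-out-digit test possible at all.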

Position in field

Important early visual speech-reconstruction paper in SSI-adjacent lip-to-speech research.

Evidence

“ In this paper we present an end-to-end model based on a convolutional neural network (CNN) for generating an intelligible acoustic speech signal from silent video frames of a speaking person. ”

author_claim · Abstract · confidence 1.00

“ For this task we trained our model on a random 80/20 train/test split of the 1000 videos of S4 and made sure that all 51 GRID words were represented in each set. ”

validation_scope · 4.2. Sound prediction tasks · confidence 1.00

“ In order to accurately compare our results with [10], we performed our experiments on the 1000 videos of speaker four (S4, female) as done there. ” [Table 2 row, audio-visual: 51.9% ([10]), 79.9% (S4), 79% (S2)]

metric · Table 2. Our reconstructed speech is significantly more intelligible than the results of [10]. · confidence 1.00

“ For this task we trained our model on a random 80/20 train/test split of the 1000 videos of S4 and made sure that all 51 GRID words were represented in each set. ” [Table 3 row, audio-visual: 51.6% (OOV), 93.4% (no digits held out), 10.0% (chance)]

metric · Table 3. Out-of-vocabulary (OOV) intelligibility results. · confidence 1.00
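The two data protocols behind these records can be sketched as follows: a random 80/20 split that is accepted only if every vocabulary word appears in both sets, and a held-out-digit split for the OOV setting. This is an illustration under assumptions, not the authors' code. The word lists follow the standard GRID corpus description rather than this page, the sentences are synthetic, and the choice of held-out digits and the composition of the OOV test set are hypothetical.

```python
import random

# GRID grammar (from the corpus description): 4 + 4 + 4 + 25 + 10 + 4 = 51 words.
GRID = {
    "command": ["bin", "lay", "place", "set"],
    "color": ["blue", "green", "red", "white"],
    "preposition": ["at", "by", "in", "with"],
    "letter": list("abcdefghijklmnopqrstuvxyz"),  # 25 letters, no "w"
    "digit": ["zero", "one", "two", "three", "four",
              "five", "six", "seven", "eight", "nine"],
    "adverb": ["again", "now", "please", "soon"],
}
DIGIT_SLOT = list(GRID).index("digit")


def sample_sentence(rng):
    """Draw one synthetic sentence: one word per grammar slot, in order."""
    return [rng.choice(words) for words in GRID.values()]


def covered_split(sentences, test_frac=0.2, seed=0, max_tries=100):
    """Random split, retried until train and test both contain every word."""
    rng = random.Random(seed)
    vocab = {w for s in sentences for w in s}
    n_test = int(len(sentences) * test_frac)
    for _ in range(max_tries):
        order = sentences[:]
        rng.shuffle(order)
        test, train = order[:n_test], order[n_test:]
        if all({w for s in part for w in s} == vocab for part in (train, test)):
            return train, test
    raise RuntimeError("no split with full vocabulary coverage found")


def digit_holdout_split(sentences, held_out):
    """OOV setting: sentences with a held-out digit never reach training."""
    train = [s for s in sentences if s[DIGIT_SLOT] not in held_out]
    test = [s for s in sentences if s[DIGIT_SLOT] in held_out]
    return train, test


rng = random.Random(0)
sentences = [sample_sentence(rng) for _ in range(1000)]
train, test = covered_split(sentences)
oov_train, oov_test = digit_holdout_split(sentences, {"three", "seven"})
print(len(train), len(test), len(oov_train), len(oov_test))
```

The 10.0% chance rate in Table 3 is consistent with a listener guessing one of the ten digits uniformly.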

Limits

Technical limits

The system depends on speaker-specific training, a constrained GRID grammar, and an LPC-style resynthesis path that still sounds unnatural.
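For readers unfamiliar with LPC-style resynthesis, the sketch below shows the general source-filter idea: an excitation signal is passed through an all-pole filter whose coefficients shape the spectrum. It is a generic illustration, not the paper's vocoder. The filter coefficients, the gain, the frame length, and the use of white noise as excitation are assumptions made here; a noise-only excitation carries no pitch, which is one plausible reason output of this kind can sound unnatural.

```python
import random


def all_pole_filter(excitation, lpc_coeffs, gain=1.0):
    """Apply y[n] = gain * e[n] - sum_k a_k * y[n - k], with lpc_coeffs = [a_1..a_p]."""
    out = []
    for n, e in enumerate(excitation):
        y = gain * e
        for k, a in enumerate(lpc_coeffs, start=1):
            if n - k >= 0:
                y -= a * out[n - k]
        out.append(y)
    return out


# Sanity check: a single pole at 0.5 gives a decaying impulse response.
impulse_response = all_pole_filter([1.0, 0.0, 0.0, 0.0], [-0.5])

# Hypothetical second-order filter (poles inside the unit circle, so stable),
# excited by Gaussian white noise to produce one synthetic frame.
rng = random.Random(0)
noise = [rng.gauss(0.0, 1.0) for _ in range(320)]
frame = all_pole_filter(noise, [-1.2, 0.8], gain=0.1)
print(impulse_response, len(frame))
```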

Evaluation limits

Evidence is limited to GRID and a relatively small listening study rather than an open-vocabulary or in-the-wild evaluation.

Deployment limits

No live camera-to-audio system or user-facing deployment is shown.

Scope limits

The work is a constrained lip-to-speech reconstruction study: one corpus (GRID), a fixed 51-word grammar, and speaker-specific models.