2021 · arXiv / imported corpus page · Field expert review · confidence high

Sub-word Level Lip Reading With Visual Attention

K R Prajwal, Triantafyllos Afouras, Andrew Zisserman

arXiv

Major lip-reading gain, adjacent to SSI.

Verdict: full-text draft · Priority: high · Confidence: high · Basis: full text · Coverage: high

Reading guidance

Verdict: full-text draft · priority high · confidence high
Why it matters: This is a strong visual speech recognition paper with real benchmark gains, but it is camera-only lip reading rather than an articulatory SSI modality.
What to trust: Basis: full text. Coverage: high. 3 evidence records back the review.
What is weak: The system still depends on face video quality and benchmark-style training corpora. All results are offline benchmarks; no live silent dictation study is reported. There is no discussion of on-device latency, privacy, or in-the-wild robustness beyond benchmarks. Scope is limited to camera-only visual speech recognition and detection. Overclaim risk: treating lip-reading benchmark gains as solved general SSI.
Read before: SSI review rubric
Read next: SSI archive

Axes

Task: lip reading
Modality: silent video
Hardware: camera
Body site: face; lip
Output: text
Vocabulary: WordPiece sub-word units
Metrics: WER (%), lower is better. Public-data training reaches 28.9 on LRS2 and 40.6 on LRS3; extended training reaches 22.6 and 30.7. In the LRS2 ablation, WordPiece reduces WER from 41.0 to 37.2 and visual transformer pooling (VTP) further to 30.9.
Evaluation mode: LRS2 and LRS3 WER benchmarks with ablations, plus AVA ActiveSpeaker visual speech detection transfer
Review confidence: high
Overclaim risk: Treating lip-reading benchmark gains as evidence that general SSI is solved.
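
Since every headline number on this card is a word error rate, a minimal sketch of how WER is commonly computed may help when reading them. This is the standard word-level edit-distance definition, not code from the paper; the function name `word_error_rate` and the example sentences are illustrative assumptions.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by reference length, in percent."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix processed
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion of a reference word
                curr[j - 1] + 1,     # insertion of a hypothesis word
                prev[j - 1] + cost,  # substitution or exact match
            )
        prev = curr
    return 100.0 * prev[-1] / len(ref)


# Illustrative sentences only; they are not drawn from LRS2 or LRS3.
print(word_error_rate("the cat sat on the mat", "the cat sat on a mat"))
```

Benchmark WER is usually aggregated over a whole test set (total errors over total reference words) rather than averaged per utterance; this sketch scores a single pair.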

Expert take

The paper is materially stronger than prior public-data lip-reading systems. The final model reaches 28.9% WER on LRS2 using only public data, and 22.6% with additional data. The ablation table shows where the gains come from: WordPiece decoding and visual transformer pooling each buy a substantial error reduction. The scope caveat is straightforward. This is a camera-based VSR system, not a tongue, EMG, or ultrasound SSI device.
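
To put "substantial" in proportion, the sketch below turns the WER figures quoted on this card into absolute and relative reductions. The numbers are the ones listed under Metrics; the helper name `relative_reduction` and the step labels are assumptions made for illustration.

```python
def relative_reduction(before: float, after: float) -> float:
    """Relative error reduction in percent, going from `before` to `after`."""
    if before <= 0:
        raise ValueError("baseline error must be positive")
    return 100.0 * (before - after) / before


# (label, WER before, WER after), all values as quoted on this card.
steps = [
    ("LRS2 ablation: WordPiece decoding", 41.0, 37.2),
    ("LRS2 ablation: adding VTP", 37.2, 30.9),
    ("LRS2: public data vs. extended training", 28.9, 22.6),
    ("LRS3: public data vs. extended training", 40.6, 30.7),
]

for label, before, after in steps:
    print(
        f"{label}: {before} to {after} WER, "
        f"{before - after:.1f} points absolute, "
        f"{relative_reduction(before, after):.1f}% relative"
    )
```

On these figures, the WordPiece step is roughly a 9% relative reduction and the VTP step roughly 17%, while extended training gives roughly 22% on LRS2 and 24% on LRS3.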

True value

This is a strong visual speech recognition paper with real benchmark gains, but it is camera-only lip reading rather than an articulatory SSI modality.

What changed

Canon before

Lip-reading systems often used character-level decoding and average pooling over face features, leaving performance and data efficiency on the table.
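
One way to see what character-level decoding costs is to compare target lengths for the same text. The sketch below uses greedy longest-match-first splitting with a `##` continuation prefix, a common WordPiece-style convention. The toy vocabulary, the function name `wordpiece_tokenize`, and the example sentence are all invented for illustration and are not the paper's tokenizer or vocabulary.

```python
def wordpiece_tokenize(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first split of one word into sub-word units."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink it.
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # marks a word-internal piece
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return [unk]  # no piece fits; the whole word becomes unknown
        pieces.append(match)
        start = end
    return pieces


# Hypothetical toy vocabulary, for illustration only.
TOY_VOCAB = {"lip", "##s", "read", "##ing", "speak", "##er"}

sentence = "lips reading speaker"
subword_targets = [
    piece for word in sentence.split()
    for piece in wordpiece_tokenize(word, TOY_VOCAB)
]
character_targets = list(sentence)  # character decoders also emit spaces

print(subword_targets, len(subword_targets))
print(len(character_targets))
```

With this toy vocabulary the decoder predicts 6 sub-word tokens instead of 20 characters for the same sentence, so each output step carries more linguistic content.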

Delta from canon

Introduces visual transformer pooling and WordPiece decoding, then reuses the encoder for visual speech detection.
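
The difference between average pooling and attention-based pooling over a spatial feature map can be sketched in a few lines. This is a deliberately simplified stand-in: a single query vector scores each spatial position and a softmax turns the scores into weights. The paper's actual visual transformer pooling module is not reproduced here, and the function names, feature values, and query are assumptions.

```python
import math


def average_pool(features):
    """Mean over spatial positions; every location gets equal weight."""
    count = len(features)
    dim = len(features[0])
    return [sum(f[d] for f in features) / count for d in range(dim)]


def attention_pool(features, query):
    """Softmax-weighted sum over spatial positions, scored against a query."""
    dim = len(query)
    scores = [
        sum(q * x for q, x in zip(query, f)) / math.sqrt(dim)
        for f in features
    ]
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    pooled = [
        sum(w * f[d] for w, f in zip(weights, features))
        for d in range(dim)
    ]
    return pooled, weights


# Hypothetical flattened 2x2 feature map with 2-dimensional features;
# position 2 stands in for a strongly activated region such as the mouth.
feature_map = [[0.1, 0.0], [0.0, 0.1], [2.0, 2.0], [0.1, 0.1]]
query = [1.0, 1.0]  # in a trained model this vector would be learned

pooled, weights = attention_pool(feature_map, query)
print("attention weights:", [round(w, 3) for w in weights])
print("attention pooled: ", [round(v, 3) for v in pooled])
print("average pooled:   ", [round(v, 3) for v in average_pool(feature_map)])
```

Average pooling dilutes the informative position with the other three, whereas the attention weights concentrate on it. With an all-zero query the two methods coincide, which makes average pooling a special case of this scheme.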

Position in field

Top-tier lip-reading paper adjacent to SSI.

Evidence

“ […] pooling on the spatial feature map; (ii) the use of sub-word units, rather than characters for the language tokens; and (iii) a strong Visual Speech Detection model, directly trained on top of the lip reading encoder. […] The videos included in datasets like LRS2 and LRS3 are commonly pre-processed with a face detection and tracking pipeline which outputs clips roughly centered around the speaker’s face. ”

author_claim · Abstract · confidence 0.99

“ Comparison of different lip reading models on the test sets of the LRS2 and LRS3 datasets in terms of Word Error Rate % (WER, lower is better), along with the datasets and the aggregate number of hours used for training each model. ”

metric · Table 1 · confidence 0.99

“ […] prior work trained on public data, on both the LRS2 and LRS3 benchmarks. […] This is evident in Table 3, where pooling after conv2,3 at a spatial resolution of 24 × 24 is […] ”

actual_novelty · Table 2. Ablation on the design improvements proposed in this · confidence 0.98

Limits

Technical limits

The system still depends on face video quality and benchmark-style training corpora.

Evaluation limits

All results are offline benchmarks; no live silent dictation study is reported.

Deployment limits

No discussion of on-device latency, privacy, or in-the-wild robustness beyond benchmarks.

Scope limits

Covers camera-only visual speech recognition and detection; no tongue, EMG, or ultrasound sensing is involved.