2018 · arXiv / imported corpus page · Field expert review · confidence high

Visual-Only Recognition of Normal, Whispered and Silent Speech

Stavros Petridis, Jie Shen, Doruk Cetin, Maja Pantić

arXiv

Strong evidence that silent lipreading needs dedicated training.

Verdict: full-text draft · Priority: high · Confidence: high · Basis: full text · Coverage: high

Reading guidance

Verdict: full-text draft · priority high · confidence high
Why it matters: This is a core SSI result because the full text shows silent visual speech is a distinct regime, not just a slightly harder version of normal lipreading.
What to trust: Basis: full text. Coverage: high. 4 evidence records back the review.
What is weak: The task remains closed-vocabulary and laboratory-recorded, and the reported recognition rates are still far from practical open-ended silent communication. Only digits and fixed phrases are evaluated; there is no open-vocabulary or in-the-wild test. No real deployment or live camera interface is shown. The scope is a visual-only, closed-vocabulary SSI recognition study. Overclaim risk: low for the transfer and dataset claims, medium if generalized to open-vocabulary silent speech deployment.
Read before: SSI review rubric
Read next: SSI archive

Axes

Task: speech recognition
Modality: visual speech video across normal, whispered, and silent modes
Hardware: Three cameras capturing frontal, 45-degree, and profile views at 1280x780 and 30 fps
Body site: face / lips
Output: text
Vocabulary: digits and short phrases
Metrics: Digits matched-condition classification rates are 68.0% for normal, 70.5% for whispered, and 62.2% for silent speech. Phrases matched-condition rates are 69.7%, 70.8%, and 64.4%, while training on normal and testing on silent drops to 59.7% for digits and 61.2% for phrases.
Evaluation mode: Repeated subject-independent train / validation / test experiments for digits and phrases, with matched and mismatched train-test speech modes.
Review confidence: high
Overclaim risk: Low for the transfer and dataset claims, medium if generalized to open-vocabulary silent speech deployment.
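
The evaluation mode above combines two ideas: subject-independent splits and a grid of matched and mismatched train-test speech modes. The following is a minimal sketch of that protocol, not the authors' code. The function names, the 20% validation and test fractions, the seed, and the subject IDs are all assumptions made for illustration; the review does not state the paper's actual split sizes.

```python
import random
from itertools import product

# The three speech modes named in the review.
MODES = ("normal", "whispered", "silent")

def subject_independent_split(subject_ids, val_frac=0.2, test_frac=0.2, seed=0):
    """Split subjects (not utterances) so no speaker appears in two sets.

    val_frac and test_frac are assumed values, not taken from the paper.
    """
    ids = sorted(set(subject_ids))
    rng = random.Random(seed)
    rng.shuffle(ids)
    n_test = max(1, round(len(ids) * test_frac))
    n_val = max(1, round(len(ids) * val_frac))
    test = ids[:n_test]
    val = ids[n_test:n_test + n_val]
    train = ids[n_test + n_val:]
    return train, val, test

def train_test_conditions(modes=MODES):
    """Enumerate every (train_mode, test_mode, is_matched) combination."""
    return [(tr, te, tr == te) for tr, te in product(modes, repeat=2)]

# Hypothetical subject IDs; repeating with different seeds mimics
# the "repeated" experiments described above.
subjects = [f"s{i:02d}" for i in range(39)]
train, val, test = subject_independent_split(subjects, seed=1)
conditions = train_test_conditions()
```

Splitting by speaker rather than by utterance is what makes the reported rates subject-independent: a model never sees test speakers during training or validation.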

Expert take

The value of the paper is empirical clarity. It records a reasonably sized three-mode database and shows that silent speech underperforms both normal and whispered speech even in matched conditions, while mismatched training makes things worse. In the digits experiment, silent matched performance is 62.2%, and a normal-trained model tested on silent speech drops to 59.7%, a further loss of 2.5 points. In the phrases experiment, silent matched performance is 64.4%, and a normal-trained model tested on silent speech drops to 61.2%, a further loss of 3.2 points. Measured against the normal matched rates (68.0% and 69.7%), the same mismatched results sit 8.3 and 8.5 points lower. That makes the main lesson hard to ignore: silent visual speech should be treated as its own training regime rather than borrowed from vocalized lipreading.
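
The gaps discussed above can be recomputed directly from the reported rates. This is a small arithmetic sketch using only the numbers quoted in this review; the dictionary layout and the names `rates`, `transfer_gaps`, and `normal_to_silent` are illustrative assumptions.

```python
# Mean classification rates (%) as quoted in this review.
rates = {
    "digits": {"normal": 68.0, "whispered": 70.5, "silent": 62.2,
               "normal_to_silent": 59.7},
    "phrases": {"normal": 69.7, "whispered": 70.8, "silent": 64.4,
                "normal_to_silent": 61.2},
}

def transfer_gaps(r):
    """Return the normal-to-silent result's shortfall, in percentage points."""
    return {
        # Loss relative to a model trained and tested on silent speech.
        "vs_silent_matched": round(r["silent"] - r["normal_to_silent"], 1),
        # Loss relative to a model trained and tested on normal speech.
        "vs_normal_matched": round(r["normal"] - r["normal_to_silent"], 1),
    }

gaps = {task: transfer_gaps(r) for task, r in rates.items()}
for task, g in gaps.items():
    print(task, g)
```

The two baselines answer different questions: the silent-matched gap shows what dedicated silent training buys, while the normal-matched gap shows how much a vocalized-only model degrades when moved to silent speech.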

True value

This is a core SSI result because the full text shows silent visual speech is a distinct regime, not just a slightly harder version of normal lipreading.

What changed

Canon before

Visual speech systems often assumed that vocalized data would transfer acceptably to whispered or silent speech.

Delta from canon

The paper tests that assumption directly and shows silent speech is consistently worse and not well served by vocalized-only training.

Position in field

Important visual-only SSI benchmark and transfer-analysis paper.

Evidence

“ DATABASE DESCRIPTION For the purposes of this study we have recorded a new audio-visual database which contains normal, whispered and silent speech. [...] participants (32 males and 7 females) were recorded for this part with a mean age and standard deviation of 26.3 and 3.8 years, respectively. The database was recorded in a lab environment using 3 [...] ”

validation_scope · 2. DATABASE DESCRIPTION · confidence 1.00

“ For example, the performance of a model trained on normal speech drops by 3.3% and 8.3% when tested on whispered and silent speech examples, respectively. [...]ments, as it has already been reported in the phonetics literature, which affect the performance of models when the train[...] ”

metric · Table 1. Mean classification rate (and standard deviation) for the digits experiment. · confidence 1.00

“ For example, the performance of a model trained on normal speech drops by 3.3% and 8.3% when tested on whispered and silent speech examples, respectively. [...]ments, as it has already been reported in the phonetics literature, which affect the performance of models when the train[...] ”

metric · Table 2. Mean classification rate (and standard deviation) for the phrases experiment. · confidence 1.00

“ This reveals that there are indeed visual differences between the 3 speech modes and the common assumption that vocalized training data can be used directly to train a silent speech recognition [...] In other words, the lip movements in vocalised and silent speech are different and this may degrade the performance of models trained on vocalised speech and tested on silent speech. ”

actual_novelty · 6. CONCLUSION · confidence 1.00

Limits

Technical limits

The task remains closed-vocabulary and laboratory-recorded, and the reported recognition rates are still far from practical open-ended silent communication.

Evaluation limits

Only digits and fixed phrases are evaluated; there is no open-vocabulary or in-the-wild test.

Deployment limits

No real deployment or live camera interface is shown.

Scope limits

Visual-only closed-vocabulary SSI recognition study.