Visual-Only Recognition of Normal, Whispered and Silent Speech
Strong evidence that silent lipreading needs dedicated training.
Reading guidance
- Verdict
- full-text draft · priority high · confidence high
- Why it matters
- This is a core silent speech interface (SSI) result: the full text shows that silent visual speech is a distinct regime, not just a slightly harder version of normal lipreading.
- What to trust
- Basis: full text. Coverage: high. 4 evidence records back the review.
- What is weak
- The task remains closed-vocabulary and laboratory-recorded, and the reported recognition rates are still far from practical open-ended silent communication. Only digits and fixed phrases are evaluated; there is no open-vocabulary or in-the-wild test. No real deployment or live camera interface is shown. Scope: visual-only, closed-vocabulary SSI recognition. Overclaim risk: low for the transfer and dataset claims, medium if generalized to open-vocabulary silent speech deployment.
- Read before
- SSI review rubric
- Read next
- SSI archive
Axes
- Task
- speech recognition
- Modality
- visual speech video across normal, whispered, and silent modes
- Hardware
- Three cameras capturing frontal, 45-degree, and profile views at 1280x780 and 30 fps
- Body site
- face / lips
- Output
- text
- Vocabulary
- digits and short phrases
- Metrics
- For digits, matched-condition classification rates are 68.0% for normal, 70.5% for whispered, and 62.2% for silent speech. For phrases, the matched-condition rates are 69.7%, 70.8%, and 64.4% in the same order. Training on normal speech and testing on silent speech lowers the rate to 59.7% for digits and 61.2% for phrases.
- Evaluation mode
- Repeated subject-independent train / validation / test experiments for digits and phrases, with matched and mismatched train-test speech modes.
- Review confidence
- high
- Overclaim risk
- Low for the transfer and dataset claims, medium if generalized to open-vocabulary silent speech deployment.
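The evaluation mode listed above (subject-independent splits crossed with matched and mismatched speech modes) can be sketched as follows. This is a minimal illustration, not the paper's protocol: the split fractions, the seed, the subject IDs, and the function names `subject_independent_split` and `conditions` are all assumptions introduced here.

```python
import random
from itertools import product

# The three speech modes named in this review.
MODES = ("normal", "whispered", "silent")


def subject_independent_split(subject_ids, val_frac=0.2, test_frac=0.2, seed=0):
    """Partition subjects so that no speaker appears in more than one set.

    The fractions and seed are hypothetical defaults, not values from the paper.
    """
    ids = sorted(subject_ids)
    random.Random(seed).shuffle(ids)
    n_test = max(1, round(len(ids) * test_frac))
    n_val = max(1, round(len(ids) * val_frac))
    test = ids[:n_test]
    val = ids[n_test:n_test + n_val]
    train = ids[n_test + n_val:]
    return train, val, test


def conditions():
    """List every (train_mode, test_mode, is_matched) combination."""
    return [(tr, te, tr == te) for tr, te in product(MODES, MODES)]


if __name__ == "__main__":
    # Hypothetical subject IDs; repeating with different seeds mimics repeated runs.
    for seed in range(3):
        train, val, test = subject_independent_split(range(1, 11), seed=seed)
        print(seed, train, val, test)
    for train_mode, test_mode, matched in conditions():
        print(train_mode, "->", test_mode, "matched" if matched else "mismatched")
```

Splitting by subject rather than by utterance is what makes the reported rates speaker-independent: a model never sees test speakers during training, in any speech mode.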
Expert take
The value of the paper is empirical clarity. It records a reasonably sized three-mode database and shows that silent speech underperforms both normal and whispered speech even in matched conditions, while mismatched training makes things worse. In the digits experiment, matched silent performance is 62.2%, and a model trained on normal speech reaches only 59.7% when tested on silent speech. In the phrases experiment, matched silent performance is 64.4%, and the normal-trained model reaches only 61.2% on silent speech. That makes the main lesson hard to ignore: silent visual speech should be treated as its own training regime rather than borrowed from vocalized lipreading.
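The gaps behind this argument can be worked out directly from the rates quoted in this review. The sketch below is only arithmetic on those numbers; the dictionary layout and the helper name `silent_gaps` are illustrative assumptions, and all differences are in percentage points.

```python
# Classification rates (%) quoted in this review; keys are (train_mode, test_mode).
RATES = {
    "digits": {
        ("normal", "normal"): 68.0,
        ("whispered", "whispered"): 70.5,
        ("silent", "silent"): 62.2,
        ("normal", "silent"): 59.7,
    },
    "phrases": {
        ("normal", "normal"): 69.7,
        ("whispered", "whispered"): 70.8,
        ("silent", "silent"): 64.4,
        ("normal", "silent"): 61.2,
    },
}


def silent_gaps(task_rates):
    """Return percentage-point gaps for silent test data, rounded to one decimal."""
    normal = task_rates[("normal", "normal")]
    silent = task_rates[("silent", "silent")]
    transfer = task_rates[("normal", "silent")]
    return {
        # Loss when a normal-trained model is moved to silent test data.
        "transfer_drop": round(normal - transfer, 1),
        # Gap between matched normal and matched silent recognition.
        "matched_gap": round(normal - silent, 1),
        # What training on silent data recovers over normal-only training.
        "gain_from_silent_training": round(silent - transfer, 1),
    }


for task, task_rates in RATES.items():
    print(task, silent_gaps(task_rates))
```

For digits this gives a transfer drop of 8.3 points, of which 2.5 points are recovered by training on silent data; for phrases the figures are 8.5 and 3.2 points. Dedicated silent training helps, but it does not close the gap to normal speech.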
True value
This is a core SSI result because the full text shows silent visual speech is a distinct regime, not just a slightly harder version of normal lipreading.
What changed
Canon before
Visual speech systems often assumed that vocalized data would transfer acceptably to whispered or silent speech.
Delta from canon
The paper tests that assumption directly and shows silent speech is consistently worse and not well served by vocalized-only training.
Position in field
Important visual-only SSI benchmark and transfer-analysis paper.
Evidence
“ DATABASE DESCRIPTION: For the purposes of this study we have recorded a new audio-visual database which contains normal, whispered and silent speech. […] participants (32 males and 7 females) were recorded for this part with a mean age and standard deviation of 26.3 and 3.8 years, respectively. The database was recorded in a lab environment using 3 […] ”
validation_scope · 2. DATABASE DESCRIPTION · confidence 1.00
“ For example, the performance of a model trained on normal speech drops by 3.3% and 8.3% when tested on whispered and silent speech examples, respectively. […] ments, as it has already been reported in the phonetics literature, which affect the performance of models when the train […] ”
metric · Table 1. Mean classification rate (and standard deviation) for the digits experiment. · confidence 1.00
“ For example, the performance of a model trained on normal speech drops by 3.3% and 8.3% when tested on whispered and silent speech examples, respectively. […] ments, as it has already been reported in the phonetics literature, which affect the performance of models when the train […] ”
metric · Table 2. Mean classification rate (and standard deviation) for the phrases experiment. · confidence 1.00
“ This reveals that there are indeed visual differences between the 3 speech modes and the common assumption that vocalized training data can be used directly to train a silent speech recognition […] In other words, the lip movements in vocalised and silent speech are different and this may degrade the performance of models trained on vocalised speech and tested on silent speech. ”
actual_novelty · 6. CONCLUSION · confidence 1.00
Limits
Technical limits
The task remains closed-vocabulary and laboratory-recorded, and the reported recognition rates are still far from practical open-ended silent communication.
Evaluation limits
Only digits and fixed phrases are evaluated; there is no open-vocabulary or in-the-wild test.
Deployment limits
No real deployment or live camera interface is shown.
Scope limits
Visual-only closed-vocabulary SSI recognition study.