2021 · arXiv / imported corpus page · Field expert review · confidence high

Silent versus modal multi-speaker speech recognition from ultrasound and video

Manuel Sam Ribeiro, Aciel Eshky, Korin Richmond, Steve Renals

arXiv

Large-corpus baseline with real silent-mode gap.

Verdict: full-text draft · Priority: high · Confidence: high · Basis: full text · Coverage: high

Reading guidance

Verdict: full-text draft · priority high · confidence high
Why it matters: The strongest contribution is not headline WER but the diagnosis that silent speech is slower, occupies a smaller articulatory space, and benefits from adaptation but remains far harder than modal speech.
What to trust: Basis: full text. Coverage: high. 3 evidence records back the review.
What is weak: Performance is still far from usable for many silent conditions, and the pipeline relies on careful ultrasound capture plus additional adaptation steps (fMLLR and unsupervised adaptation). No live interface evaluation or end-user task is reported. Portable deployment remains unclear because the system still depends on probe placement and corpus-style recording conditions. Scope is limited to recognition from ultrasound and video under controlled corpus recording. Overclaim risk: The paper should be read as a hard benchmark and adaptation study, not as solved multi-speaker SSI.
Read before: SSI review rubric
Read next: SSI archive

Axes

Task: speech recognition from ultrasound and lip video
Modality: ultrasound tongue imaging and lip video
Hardware: ultrasound probe and camera
Body site: lip; tongue
Output: text
Vocabulary: open-vocabulary ASR
Metrics: On TaL80 multi-speaker, silent WER drops from 77.79 raw to 69.84 with fMLLR plus unsupervised adaptation, while modal WER is 39.34 raw; on TaL1 speaker-dependent, silent WER falls from 52.64 raw to 37.94
Evaluation mode: TaL80 multi-speaker and TaL1 speaker-dependent WER across modal, silent, and whispered speech with articulatory-space analysis
Review confidence: high
Overclaim risk: The paper should be read as a hard benchmark and adaptation study, not as solved multi-speaker SSI.
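
To make the size of the silent-mode gap concrete, the reported figures can be turned into absolute and relative changes. This is a minimal sketch, not part of the paper: it assumes the WER values are percentages, copies the numbers from the Metrics line, and uses a hypothetical helper name, `wer_change`.

```python
# Reported WERs copied from the Metrics line (assumed to be percentages).
# Each entry is (raw WER, best reported WER after adaptation).
REPORTED = {
    "TaL80 silent (multi-speaker)": (77.79, 69.84),
    "TaL1 silent (speaker-dependent)": (52.64, 37.94),
}

# Raw modal WER on TaL80, used as the reference point for the gap.
TAL80_MODAL_RAW = 39.34


def wer_change(before, after):
    """Return (absolute drop in WER points, relative drop in percent)."""
    absolute = before - after
    relative = 100.0 * absolute / before
    return round(absolute, 2), round(relative, 1)


def remaining_gap(silent_wer, modal_wer=TAL80_MODAL_RAW):
    """WER points still separating silent speech from raw modal speech."""
    return round(silent_wer - modal_wer, 2)


if __name__ == "__main__":
    for name, (raw, adapted) in REPORTED.items():
        abs_drop, rel_drop = wer_change(raw, adapted)
        print(f"{name}: {abs_drop} points lower ({rel_drop}% relative)")
    print("TaL80 silent vs modal gap after adaptation:",
          remaining_gap(69.84), "points")
```

Under these assumptions the TaL80 silent improvement is about 8 WER points, roughly a tenth of the raw error, and about 30 points still separate adapted silent speech from raw modal speech.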

Expert take

The paper is valuable because it does not hide the hard part. Even on the larger TaL80 setup, silent WER remains much worse than modal WER (69.84 after adaptation versus 39.34 for raw modal), and adaptation only partly closes the gap. That is a credible result, not a failure: it shows exactly where multi-speaker ultrasound SSI breaks. The secondary analysis matters too, because it confirms that silent speech is slower and occupies a smaller articulatory space. That is a plausible reason modal-trained recognizers transfer poorly, although the paper itself notes that these differences do not directly correlate with WER.

True value

The strongest contribution is not headline WER but the diagnosis that silent speech is slower, occupies a smaller articulatory space, and benefits from adaptation but remains far harder than modal speech.

What changed

Canon before

Ultrasound-plus-video SSI work was often small-scale, single-speaker, and weak on speaking-mode mismatch.

Delta from canon

Moves to an 82-speaker corpus and treats silent-versus-modal mismatch as a domain adaptation problem with articulatory analysis alongside recognition.

Position in field

Core multi-speaker ultrasound SSI recognition paper.

Evidence

“ TaL contains two subsets: TaL1 has six recording sessions from a professional voice talent and male native English speaker; and TaL80 has single session recordings from 81 native English speakers without voice talent experience. [...] The training data consists of modal speech, so we exploit the audio stream to bootstrap a simple supervised feature extractor. The Kaldi speech recognition toolkit [26] is used to force-align [...] ”

validation_scope · 2. The TaL corpus · confidence 0.99

“ Table 1, TaL80 modal test set: multi-speaker Raw 39.34, fMLLR 39.79; + unsupervised adapt Raw 41.76 (+2.42), fMLLR 41.37 (+1.58). Figure 2: Syllable rate for modal and silent speech utterances in the TaL80 test sets (axes: modal speech syllable rate, syllable rate difference). ”

metric · Table 1: Word error rate on modal, silent, and whispered speech · confidence 0.99
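
For readers less familiar with the metric in Table 1, WER is conventionally the word-level edit distance between reference and hypothesis, divided by the number of reference words. The sketch below illustrates that standard definition only; it is not the paper's scoring pipeline, and the function name `word_error_rate` and the example sentences are hypothetical.

```python
def word_error_rate(reference, hypothesis):
    """WER in percent: word-level Levenshtein distance / reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # deleting all i reference words matches an empty hypothesis
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion of a reference word
                curr[j - 1] + 1,     # insertion of a hypothesis word
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return 100.0 * prev[-1] / len(ref)


if __name__ == "__main__":
    # One substitution (sat -> sit) and one deletion (the) over 6 words.
    print(word_error_rate("the cat sat on the mat", "the cat sit on mat"))
```

Because insertions count as errors, WER by this definition can exceed 100, which is worth keeping in mind when reading very high silent-speech scores.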

“ Finally, we compare utterance duration and size of articulatory space with the results obtained from the speech recognition systems. [...] Although there are significant differences in terms of duration and articulatory space, they do not directly correlate with WER. ”

fact · 4. Analysis · confidence 0.98

Limits

Technical limits

Performance is still far from usable for many silent conditions, and the pipeline relies on careful ultrasound capture plus additional adaptation steps (fMLLR and unsupervised adaptation).

Evaluation limits

No live interface evaluation or end-user task is reported.

Deployment limits

Portable deployment remains unclear because the system still depends on probe placement and corpus-style recording conditions.

Scope limits

Results cover only recognition from ultrasound and video under controlled corpus recording.