2017 · arXiv / imported corpus page · Field expert review · confidence high

Updating the silent speech challenge benchmark with deep learning

Yan Ji, Licheng Liu, Hongcui Wang, Zhilei Liu, Zhibin Niu, B. Denby

arXiv

Benchmark update with a real, reproducible WER gain.

Verdict: full-text draft · Priority: high · Confidence: high · Basis: full text · Coverage: high

Reading guidance

Verdict: full-text draft · priority high · confidence high
Why it matters: This is a benchmark-anchored SSI paper with a real methodological payoff: on the same archive and decoding framing, DNN-HMM recognition cuts WER from 17.4% to 6.45%, a relative reduction of about 63% (a factor of roughly 2.7) against the original benchmark value.
What to trust: Basis: full text. Coverage: high. 4 evidence records back the review.
What is weak: This is a speaker-dependent ultrasound-plus-lip benchmark study. Everything remains single-speaker, controlled, and benchmark-bounded; there is no evidence of speaker independence or real-world robustness. The archive is small, so large gains on this benchmark do not automatically transfer to broader SSI deployment. No live deployment, calibration burden analysis, or user-facing interface is reported. Overclaim risk: low for the benchmark update claim, medium if the gain is generalized to realistic multi-speaker SSI.
Read before: SSI review rubric
Read next: SSI archive

Axes

Task: speech recognition
Modality: ultrasound tongue imaging and lip video
Hardware: Ultrasound transducer placed under the chin plus a small lip video camera
Body site: tongue / lips
Output: text
Vocabulary: continuous-speech benchmark vocabulary
Metrics: Table 2 reports 17.4% WER for the original HTK benchmark, 17.4% for a Kaldi GMM-HMM reproduction, and 6.45% for the Kaldi DNN-HMM with 30-element DCT features. Table 3 shows DNN WER of 11.44% with the WSJ LM and 6.45% with the task-specific CSR LM.
Evaluation mode: Direct comparison to the original Silent Speech Challenge benchmark plus language-model and feature sweeps on the same single-speaker archive.
Review confidence: high
Overclaim risk: Low for the benchmark update claim, medium if the gain is generalized to realistic multi-speaker SSI.

Expert take

The strongest part of the paper is methodological discipline. It does not introduce a new sensing modality; it keeps the Silent Speech Challenge framing intact long enough to ask what deep learning changes. Table 2 is the headline: the original 17.4% WER benchmark remains 17.4% under a Kaldi GMM-HMM reproduction, then drops to 6.45% with a DNN-HMM. Table 3 also shows that language-model choice matters, with DNN WER moving from 11.44% under the WSJ LM to 6.45% under the task-specific CSR LM. That makes this a benchmark-updating paper with genuine value, but still within a single-speaker controlled archive rather than a broadly deployable SSI.

True value

This is a benchmark-anchored SSI paper with a real methodological payoff: on the same archive and decoding framing, DNN-HMM recognition cuts WER from 17.4% to 6.45%, a relative reduction of about 63% (a factor of roughly 2.7) against the original benchmark value.
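
The size of the gain can be expressed either as a relative WER reduction or as a ratio of error rates. A short worked calculation, assuming only the two Table 2 WER values of 17.4% (original HTK benchmark) and 6.45% (Kaldi DNN-HMM); the subscript labels are shorthand introduced here, not notation from the paper:

```latex
\[
\frac{\mathrm{WER}_{\mathrm{HTK}} - \mathrm{WER}_{\mathrm{DNN}}}{\mathrm{WER}_{\mathrm{HTK}}}
  = \frac{17.4 - 6.45}{17.4} \approx 0.629,
\qquad
\frac{\mathrm{WER}_{\mathrm{HTK}}}{\mathrm{WER}_{\mathrm{DNN}}}
  = \frac{17.4}{6.45} \approx 2.70 .
\]
```

Read this way, the error rate falls by a factor of about 2.7, which is a statement about errors removed rather than about accuracy tripling.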

What changed

Canon before

The Silent Speech Challenge was a fixed ultrasound-plus-lip benchmark whose published reference point was 17.4% WER with an HTK GMM-HMM system.

Delta from canon

The paper reruns the benchmark with matched decoding in Kaldi, then shows a DNN-HMM can cut WER to 6.45% and that language-model choice matters materially.

Position in field

Canonical benchmark update for ultrasound-plus-lip silent speech recognition.

Evidence

“ The 2010 Silent Speech Challenge benchmark is updated with new results obtained in a Deep Learning strategy, using the same input features and decoding strategy as in the original article. ”

author_claim · Abstract · confidence 1.00

“ In 2010, an US + lip video SSI trained on the well-known TIMIT corpus achieved, with the aid of a language model (LM), a single speaker WER of 17.4% (84.2% “correct” word rate) on an independent test corpus [52], representing a promising early SSI result on a benchmark continuous speech recognition task. ”

validation_scope · 1.2. The Silent Speech Challenge benchmark · confidence 1.00

“ Comparison with original HTK result of [52], using 30-element DCT features
Error             HTK SSC Benchmark   Kaldi GMM-HMM   Kaldi DNN-HMM
Insertion         17                  41              6
Deletion          23                  17              8
Substitution      138                 120             52
Number of words   1023                1023            1023
Correct words     862                 886             963
Correct rate      84.26%              86.61%          94.13%
WER               17.4%               17.4%           6.45% ”

metric · Table 2. Comparison with original HTK result of [52], using 30-element DCT features · confidence 1.00
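
The WER and correct-rate rows of Table 2 can be recomputed from the error counts. The sketch below assumes the conventional definitions, WER = (I + D + S) / N and correct rate = (N - D - S) / N; the paper is not quoted here as stating them, and the class and function names are illustrative rather than taken from the paper or from Kaldi.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ErrorCounts:
    insertions: int
    deletions: int
    substitutions: int
    words: int  # number of reference words in the test set

    def wer(self) -> float:
        """Word error rate in percent: (I + D + S) / N."""
        errors = self.insertions + self.deletions + self.substitutions
        return 100.0 * errors / self.words

    def correct_rate(self) -> float:
        """Percent of words correct: (N - D - S) / N (insertions not counted)."""
        return 100.0 * (self.words - self.deletions - self.substitutions) / self.words


# Counts copied from the Table 2 evidence record above.
TABLE_2 = {
    "HTK SSC benchmark": ErrorCounts(17, 23, 138, 1023),
    "Kaldi GMM-HMM": ErrorCounts(41, 17, 120, 1023),
    "Kaldi DNN-HMM": ErrorCounts(6, 8, 52, 1023),
}


def summarize(table):
    """Return {system: (WER %, correct rate %)}, rounded to two decimals."""
    return {
        name: (round(c.wer(), 2), round(c.correct_rate(), 2))
        for name, c in table.items()
    }


if __name__ == "__main__":
    for name, (wer, correct) in summarize(TABLE_2).items():
        print(f"{name}: WER {wer:.2f}%, correct {correct:.2f}%")
```

Under these definitions the HTK and Kaldi GMM-HMM systems both make 178 errors in total, which is why their WER matches even though the Kaldi run trades deletions and substitutions for insertions and so shows a higher correct rate.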

“ Comparing results for the 2 different LM, for 30-element feature vectors of both types
WER (%)
                   LM: lm_csr_5k_   lm_wsj_5_
                       nvp_2gram    nvp_2gram
DCT   monophone        45.55        40.47
DCT   Triphone2b       17.40        12.71
DCT   DNN              11.44        6.45
DAE   monophone        58.55        59.92
DAE   Triphone2b       21.70        14.76
DAE   DNN              13.98        7.72 ”

metric · Table 3. Comparing results for the 2 different LM, for 30-element feature vectors of both types · confidence 1.00
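
To see how much the language-model column matters for each acoustic model, the Table 3 values can be turned into relative WER changes. This is a minimal sketch: the numbers are copied from the evidence record above, while the joined column names and the left-to-right column order are an assumption based on how the header fragments appear in that record, and the function names are hypothetical.

```python
# Assumption: the two-line header fragments join into these names, left to right.
LM_COLUMNS = ("lm_csr_5k_nvp_2gram", "lm_wsj_5_nvp_2gram")

# WER (%) per (features, model) row, in the column order of the quoted record.
TABLE_3 = {
    ("DCT", "monophone"): (45.55, 40.47),
    ("DCT", "Triphone2b"): (17.40, 12.71),
    ("DCT", "DNN"): (11.44, 6.45),
    ("DAE", "monophone"): (58.55, 59.92),
    ("DAE", "Triphone2b"): (21.70, 14.76),
    ("DAE", "DNN"): (13.98, 7.72),
}


def relative_change(first: float, second: float) -> float:
    """Relative WER change in percent from the first column to the second.

    Negative values mean the second column has the lower WER.
    """
    return 100.0 * (second - first) / first


def lm_effect(table):
    """Map each (features, model) row to its relative change across LM columns."""
    return {row: relative_change(*wers) for row, wers in table.items()}


if __name__ == "__main__":
    print(f"change from {LM_COLUMNS[0]} to {LM_COLUMNS[1]}")
    for (features, model), change in lm_effect(TABLE_3).items():
        print(f"{features:>3} {model:<10} {change:+6.1f}%")
```

Under that column reading, the gap between the two columns widens as the acoustic model gets stronger: roughly 11% relative for the DCT monophone system against roughly 44% for the DCT DNN. The DAE monophone row is the only one where the second column is worse.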

Limits

Technical limits

Everything remains single-speaker, controlled, and benchmark-bounded; there is no evidence of speaker independence or real-world robustness.

Evaluation limits

The archive is small and single-speaker, so large gains on this benchmark do not automatically transfer to broader SSI deployment.

Deployment limits

No live deployment, calibration burden analysis, or user-facing interface is reported.

Scope limits

The study is limited to a speaker-dependent ultrasound-plus-lip benchmark.