2021 · arXiv / imported corpus page · Field expert review · confidence high

Voice Activity Detection for Ultrasound-based Silent Speech Interfaces using Convolutional Neural Networks

Amin Honarmandi Shandiz, László Tóth

Preprocessing paper, narrow but legitimate.

Verdict: full-text draft · Priority: medium-high · Confidence: high · Basis: full text · Coverage: high

Reading guidance

Verdict
full-text draft · priority medium-high · confidence high
Why it matters
The result is modest but real: ultrasound-based speech-versus-silence detection works reasonably well on one speaker, and silence trimming slightly helps downstream SSI reconstruction.
What to trust
Basis: full text. Coverage: high. Three evidence records back the review.
What is weak
Single-speaker data and speech-derived labels limit the result. No cross-speaker validation or live interactive test is provided. Practical robustness to probe shift, silence styles, and new users is unknown. The scope is speech-silence preprocessing for one ultrasound SSI setup. Overclaim risk: reading modest preprocessing gains as large SSI quality gains.
Read before
SSI review rubric
Read next
SSI archive

Axes

Task
speech/silence detection for ultrasound-based SSI
Modality
ultrasound tongue images
Hardware
ultrasound imaging system
Body site
tongue
Output
speech/silence labels
Metrics
Ultrasound VAD reaches 85.2% test accuracy, F1 0.9, and ROC AUC 0.859; with Conv3D+BiLSTM SSI, removing silence yields test MCD 3.05 versus 3.12 when keeping 180 ms silence
Evaluation mode
single-speaker TaL1 classification accuracy/AUC plus SSI reconstruction with and without silence removal
Review confidence
high
Overclaim risk
Overclaim begins if modest preprocessing gains are read as large SSI quality gains.

Expert take

The paper does not solve ultrasound SSI, but it does close a real preprocessing gap. The classifier reaches 85.2% test accuracy with 0.859 ROC AUC on the ultrasound speech/silence task, and the downstream experiment shows that retaining 180 ms of silence worsens MCD (3.12 versus 3.05 with Conv3D+BiLSTM), although MSE in the same table is lower when silence is retained. The scope remains narrow because everything is single-speaker TaL1 and the gains are incremental rather than transformative.

True value

The result is modest but real: ultrasound-based speech-versus-silence detection works reasonably well on one speaker, and silence trimming slightly helps downstream SSI reconstruction.

What changed

Canon before

Ultrasound SSI systems usually assumed speech frames or used speech-side VAD labels without testing ultrasound-only VAD itself.

Delta from canon

Adds an explicit ultrasound VAD stage and checks how silence removal affects articulatory-to-acoustic synthesis metrics.
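
To make the silence-removal step concrete, here is a minimal sketch of trimming a frame sequence using per-frame speech/silence labels such as a VAD would produce. The function name `trim_silence`, the parameter `keep_silence_frames`, and the symmetric padding rule are illustrative assumptions, not the paper's procedure; this review does not record how the 180 ms of retained silence is positioned, nor the frame rate needed to convert milliseconds to frames.

```python
def trim_silence(frames, is_speech, keep_silence_frames=0):
    """Keep speech frames plus up to `keep_silence_frames` silent frames
    on each side of any speech frame (hypothetical padding rule)."""
    n = len(frames)
    keep = [False] * n
    for i, speech in enumerate(is_speech):
        if speech:
            lo = max(0, i - keep_silence_frames)
            hi = min(n, i + keep_silence_frames + 1)
            for j in range(lo, hi):
                keep[j] = True
    return [frame for frame, k in zip(frames, keep) if k]


# Toy example: frame indices stand in for ultrasound images.
frames = list(range(10))
labels = [0, 0, 0, 1, 1, 0, 0, 0, 0, 1]  # 1 = speech, 0 = silence

print(trim_silence(frames, labels))                         # silence fully removed
print(trim_silence(frames, labels, keep_silence_frames=1))  # one padding frame kept
```

In an articulatory-to-acoustic setup the same mask would presumably be applied to the paired acoustic targets so that inputs and outputs stay aligned; passing (image, target) pairs as `frames` is one way to do that.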

Position in field

Core ultrasound SSI preprocessing paper.

Evidence

“Then we implement a CNN to separate silence and speech frames based on the ultrasound tongue images, so we basically create a VAD algorithm that works with ultrasound images.”

author_claim · Abstract · confidence 0.99

“
                 Dev set   Test set
Accuracy         0.87      0.852
recall           0.94      0.95
precision        0.877     0.864
F1               0.91      0.9
ROC AUC          0.894     0.859
Cohen’s Kappa    0.672     0.57
”

metric · Table 5. · confidence 0.99
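
As a sanity check on the Table 5 figures quoted above, the sketch below recomputes F1 from the reported precision and recall, and solves Cohen's kappa for the chance-agreement rate it implies. The numbers are copied from the quote; the helper names are illustrative, and treating reported accuracy as the observed agreement is an assumption about how kappa was computed.

```python
# Figures copied from the Table 5 quote above.
table5 = {
    "dev":  {"precision": 0.877, "recall": 0.94, "f1": 0.91, "accuracy": 0.87,  "kappa": 0.672},
    "test": {"precision": 0.864, "recall": 0.95, "f1": 0.90, "accuracy": 0.852, "kappa": 0.57},
}


def f1_from(precision, recall):
    """F1 is the harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


def implied_chance_agreement(accuracy, kappa):
    """Solve kappa = (p_o - p_e) / (1 - p_e) for p_e, taking p_o = accuracy."""
    return (accuracy - kappa) / (1 - kappa)


for split, row in table5.items():
    f1 = f1_from(row["precision"], row["recall"])
    p_e = implied_chance_agreement(row["accuracy"], row["kappa"])
    print(f"{split}: recomputed F1 = {f1:.3f} (reported {row['f1']}), "
          f"implied chance agreement = {p_e:.3f}")
```

The recomputed F1 values agree with the reported ones to two decimals. The implied chance agreement comes out well above 0.5 on both splits, which is what one would expect if one class dominates the frames; that would explain why kappa looks much weaker than accuracy and F1, and it is a reason to read the 85.2% accuracy conservatively.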

“
                 Removing silence by VAD          VAD + keeping 180ms silence
                 MSE(dev)  MSE(test)  MCD         MSE(dev)  MSE(test)  MCD
conv3D           0.436     0.428      3.15        0.38      0.27       3.28
Conv3D+BiLSTM    0.393     0.41       3.05        0.35      0.26       3.12
”

metric · Table 7. Training the SSI system with removing or retaining silence from the data · confidence 0.98
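
To put the size of the downstream effect in perspective, this short sketch computes the absolute and relative MCD differences between the two conditions, using only the MCD values in the Table 7 quote above. The dictionary keys and the helper name are illustrative.

```python
# MCD values copied from the Table 7 quote above.
mcd = {
    "conv3D":        {"silence_removed": 3.15, "silence_kept_180ms": 3.28},
    "Conv3D+BiLSTM": {"silence_removed": 3.05, "silence_kept_180ms": 3.12},
}


def relative_reduction(kept, removed):
    """Fractional MCD reduction when silence is removed rather than kept."""
    return (kept - removed) / kept


for model, row in mcd.items():
    delta = row["silence_kept_180ms"] - row["silence_removed"]
    rel = relative_reduction(row["silence_kept_180ms"], row["silence_removed"])
    print(f"{model}: MCD lower by {delta:.2f} ({rel:.1%}) with silence removed")
```

Both reductions are a few percent of the baseline MCD, which supports reading the gain as incremental. The MSE columns are left out of this comparison on purpose: they move in the opposite direction, and the two conditions presumably score different sets of frames, so the MSE values may not be directly comparable.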

Limits

Technical limits

Single-speaker data and speech-derived labels limit the result.

Evaluation limits

No cross-speaker validation or live interactive test is provided.

Deployment limits

Practical robustness to probe shift, silence styles, and new users is unknown.

Scope limits

Speech-silence preprocessing for one ultrasound SSI setup.