2021 · arXiv / imported corpus page · Field expert review · confidence high

Voice Activity Detection for Ultrasound-based Silent Speech Interfaces using Convolutional Neural Networks

Amin Honarmandi Shandiz, László Tóth


Preprocessing paper, narrow but legitimate.

Verdict: full-text draft · Priority: medium-high · Confidence: high · Basis: full text · Coverage: high

Reading guidance

Verdict: full-text draft · priority medium-high · confidence high
Why it matters: The result is modest but real: ultrasound-based speech-versus-silence detection works reasonably well on one speaker, and silence trimming slightly helps downstream SSI reconstruction.
What to trust: Basis: full text. Coverage: high. 3 evidence records back the review.
What is weak: Single-speaker data and speech-derived labels limit the result. No cross-speaker validation or live interactive test is provided. Practical robustness to probe shift, silence styles, and new users is unknown. The scope is speech-silence preprocessing for one ultrasound SSI setup. Overclaim risk: reading modest preprocessing gains as large SSI quality gains.
Read before: SSI review rubric
Read next: SSI archive

Axes

Task: speech/silence detection for ultrasound-based SSI
Modality: ultrasound tongue images
Hardware: ultrasound imaging system
Body site: tongue
Output: speech/silence labels
Metrics: Ultrasound VAD reaches 85.2% test accuracy, F1 0.9, and ROC AUC 0.859; with Conv3D+BiLSTM SSI, removing silence yields test MCD 3.05 versus 3.12 when keeping 180 ms silence
Evaluation mode: single-speaker TaL1 classification accuracy/AUC plus SSI reconstruction with and without silence removal
Review confidence: high
Overclaim risk: Overclaim begins if modest preprocessing gains are read as large SSI quality gains.

Expert take

The paper does not solve ultrasound SSI, but it does close a real preprocessing gap. The classifier reaches 85.2% test accuracy with 0.859 ROC AUC on the ultrasound speech/silence task. In the downstream experiment, retaining 180 ms of silence raises MCD from 3.05 to 3.12 for Conv3D+BiLSTM and from 3.15 to 3.28 for Conv3D (lower MCD is better). The scope remains narrow because everything is single-speaker TaL1 and the gains are incremental rather than transformative.
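For readers unfamiliar with the MCD figures quoted here, the following is a minimal sketch of the commonly used mel-cepstral distortion formula, MCD = (10 / ln 10) · sqrt(2 · Σ (c_d − ĉ_d)²), averaged over frames. It is an assumption that the paper uses this variant; the function name `mcd_db`, the `skip_c0` flag, and the choice to drop the energy coefficient are illustrative and not taken from the source.

```python
import math

def mcd_db(ref_frames, syn_frames, skip_c0=True):
    """Mean mel-cepstral distortion in dB between two aligned sequences.

    ref_frames, syn_frames: lists of equal length, each item a list of
    mel-cepstral coefficients for one frame. Alignment is assumed to be
    done already (hypothetical precondition, not from the paper).
    """
    if len(ref_frames) != len(syn_frames) or not ref_frames:
        raise ValueError("sequences must be non-empty and equally long")
    start = 1 if skip_c0 else 0          # commonly the 0th (energy) term is excluded
    scale = 10.0 / math.log(10.0)
    total = 0.0
    for ref, syn in zip(ref_frames, syn_frames):
        sq_diff = sum((r - s) ** 2 for r, s in zip(ref[start:], syn[start:]))
        total += scale * math.sqrt(2.0 * sq_diff)
    return total / len(ref_frames)
```

Because the measure is a distance in dB, a gap such as 3.05 versus 3.12 is small on this scale, which matches the "incremental" reading above.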

True value

The result is modest but real: ultrasound-based speech-versus-silence detection works reasonably well on one speaker, and silence trimming slightly helps downstream SSI reconstruction.

What changed

Canon before

Ultrasound SSI systems usually assumed speech frames or used speech-side VAD labels without testing ultrasound-only VAD itself.

Delta from canon

Adds an explicit ultrasound VAD stage and checks how silence removal affects articulatory-to-acoustic synthesis metrics.

Position in field

Core ultrasound SSI preprocessing paper.

Evidence

“Then we implement a CNN to separate silence and speech frames based on the ultrasound tongue images, so we basically create a VAD algorithm that works with ultrasound images.”

author_claim · Abstract · confidence 0.99

Metric          Dev set   Test set
Accuracy        0.87      0.852
Recall          0.94      0.95
Precision       0.877     0.864
F1              0.91      0.9
ROC AUC         0.894     0.859
Cohen’s Kappa   0.672     0.57

metric · Table 5. · confidence 0.99
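The Table 5 figures can be cross-checked against each other. The sketch below assumes binary frame-level metrics with speech as the positive class (an assumption, not stated in this record) and back-solves the confusion matrix, as fractions of all frames, from the reported test accuracy, precision, and recall. The function names are illustrative.

```python
def implied_confusion(accuracy, precision, recall):
    """Return (tp, fp, fn, tn) as fractions of all frames.

    Assumes a binary task with speech as the positive class.
    """
    fp_per_pos = recall * (1.0 - precision) / precision  # FP relative to prevalence
    fn_per_pos = 1.0 - recall                            # FN relative to prevalence
    prevalence = (1.0 - accuracy) / (fp_per_pos + fn_per_pos)
    tp = recall * prevalence
    fp = fp_per_pos * prevalence
    fn = fn_per_pos * prevalence
    tn = 1.0 - tp - fp - fn
    return tp, fp, fn, tn

def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2.0 * precision * recall / (precision + recall)

def cohens_kappa(tp, fp, fn, tn):
    """Cohen's kappa from confusion counts or fractions."""
    n = tp + fp + fn + tn
    observed = (tp + tn) / n
    expected = ((tp + fp) * (tp + fn) + (fn + tn) * (fp + tn)) / (n * n)
    return (observed - expected) / (1.0 - expected)

# Reported test-set values from Table 5.
tp, fp, fn, tn = implied_confusion(accuracy=0.852, precision=0.864, recall=0.95)
test_f1 = f1_score(0.864, 0.95)
test_kappa = cohens_kappa(tp, fp, fn, tn)
speech_fraction = tp + fn  # implied share of speech frames, derived not reported
```

Under these assumptions the recomputed test F1 rounds to 0.90 and the recomputed kappa is close to the reported 0.57, with an implied speech share of roughly three quarters of the frames. That imbalance would explain why F1 looks strong while kappa stays moderate; the speech share is a derived estimate, not a number given in the source.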

                 Removing silence by VAD        VAD + keeping 180 ms silence
Model            MSE(dev)  MSE(test)  MCD       MSE(dev)  MSE(test)  MCD
Conv3D           0.436     0.428      3.15      0.38      0.27       3.28
Conv3D+BiLSTM    0.393     0.41       3.05      0.35      0.26       3.12

metric · Table 7. Training the SSI system with removing or retaining silence from the data · confidence 0.98
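The two Table 7 conditions differ only in how much silence survives the VAD stage. The sketch below shows one plausible way to express that as a frame mask: keep every speech frame, plus silence frames within a fixed time margin of the nearest speech frame. The exact retention rule in the paper is not given in this record, so the rule, the function name `keep_mask`, and the `fps` parameter are all assumptions.

```python
def keep_mask(labels, fps, keep_ms):
    """Return a list of booleans marking frames to keep.

    labels:  per-frame VAD output, truthy for speech (hypothetical input)
    fps:     ultrasound frame rate, to be read from the recording metadata
    keep_ms: silence retained around speech, e.g. 0 or 180
    """
    margin = int(round(keep_ms * fps / 1000.0))  # margin expressed in frames
    n = len(labels)
    far = n + margin + 1                         # sentinel: "no speech seen"
    dist = [far] * n

    # Forward pass: distance to the most recent speech frame.
    last = None
    for i in range(n):
        if labels[i]:
            last = i
        if last is not None:
            dist[i] = i - last

    # Backward pass: distance to the next speech frame, keep the smaller.
    last = None
    for i in range(n - 1, -1, -1):
        if labels[i]:
            last = i
        if last is not None:
            dist[i] = min(dist[i], last - i)

    return [d <= margin for d in dist]
```

With `keep_ms=0` the mask reduces to the VAD labels themselves, which corresponds to the "removing silence" condition; a non-zero margin corresponds to the "keeping silence" condition under the stated assumption.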

Limits

Technical limits

Single-speaker data and speech-derived labels limit the result.
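To make "speech-derived labels" concrete: the training targets come from the audio side and must be projected onto ultrasound frames. The sketch below is one simple way to do that, assuming speech intervals in seconds are already available from an audio VAD and that audio and ultrasound share a time origin. The function name, the centre-of-frame rule, and the inputs are hypothetical; the paper's own labelling procedure is not described in this record.

```python
def frame_labels_from_segments(segments, n_frames, fps):
    """Label each ultrasound frame 1 (speech) or 0 (silence).

    segments: list of (start_s, end_s) speech intervals from an audio VAD
    n_frames: number of ultrasound frames in the recording
    fps:      ultrasound frame rate
    A frame counts as speech when its centre time falls inside an interval.
    """
    labels = []
    for i in range(n_frames):
        centre = (i + 0.5) / fps
        is_speech = any(start <= centre < end for start, end in segments)
        labels.append(1 if is_speech else 0)
    return labels
```

Any error in the audio VAD or in audio-ultrasound synchronisation passes straight into these targets, which is why the label source counts as a technical limit.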

Evaluation limits

No cross-speaker validation or live interactive test is provided.

Deployment limits

Practical robustness to probe shift, silence styles, and new users is unknown.

Scope limits

Speech-silence preprocessing for one ultrasound SSI setup.