Voice Activity Detection for Ultrasound-based Silent Speech Interfaces using Convolutional Neural Networks
Preprocessing paper, narrow but legitimate.
Reading guidance
- Verdict
- full-text draft · priority medium-high · confidence high
- Why it matters
- The result is modest but real: ultrasound-based speech-versus-silence detection works reasonably well on one speaker, and silence trimming slightly helps downstream SSI reconstruction.
- What to trust
- Basis: full text. Coverage: high. 3 evidence records back the review.
- What is weak
- Single-speaker data and speech-derived labels limit the result. No cross-speaker validation or live interactive test is provided. Practical robustness to probe shift, silence styles, and new users is unknown. Scope is speech-silence preprocessing for one ultrasound SSI setup. Overclaim begins if modest preprocessing gains are read as large SSI quality gains.
- Read before
- SSI review rubric
- Read next
- SSI archive
Axes
- Task
- speech/silence detection for ultrasound-based SSI
- Modality
- ultrasound tongue images
- Hardware
- ultrasound imaging system
- Body site
- tongue
- Output
- speech/silence labels
- Metrics
- Ultrasound VAD reaches 85.2% test accuracy, F1 0.9, and ROC AUC 0.859; with Conv3D+BiLSTM SSI, removing silence yields test MCD 3.05 versus 3.12 when keeping 180 ms silence
- Evaluation mode
- single-speaker TaL1 classification accuracy/AUC plus SSI reconstruction with and without silence removal
- Review confidence
- high
- Overclaim risk
- Overclaim begins if modest preprocessing gains are read as large SSI quality gains.
Expert take
The paper does not solve ultrasound SSI, but it does close a real preprocessing gap. The classifier reaches 85.2% test accuracy and 0.859 ROC AUC on the ultrasound speech/silence task, though Cohen's kappa drops from 0.672 on the dev set to 0.57 on the test set. In the downstream experiment, retaining 180 ms of silence raises MCD from 3.05 to 3.12 for Conv3D+BiLSTM and from 3.15 to 3.28 for Conv3D. The scope remains narrow because everything is single-speaker TaL1, and the gains are incremental rather than transformative.
True value
The result is modest but real: ultrasound-based speech-versus-silence detection works reasonably well on one speaker, and silence trimming slightly helps downstream SSI reconstruction.
What changed
Canon before
Ultrasound SSI systems usually assumed that every input frame contained speech, or used VAD labels taken from the speech signal, without testing VAD on the ultrasound images themselves.
Delta from canon
Adds an explicit ultrasound VAD stage and checks how silence removal affects articulatory-to-acoustic synthesis metrics.
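The silence-removal step can be sketched as a filter over per-frame VAD labels. This is a minimal illustration, not the paper's procedure: the function name `frames_to_keep`, the parameters `fps` and `keep_silence_ms`, and the rule of keeping silence frames within a fixed margin on either side of a speech frame are all assumptions. The toy frame rate is chosen only to make the arithmetic easy.

```python
def frames_to_keep(labels, fps, keep_silence_ms=0.0):
    """Return indices of frames to retain.

    labels: per-frame VAD output, 1 = speech, 0 = silence.
    fps: frame rate of the image stream (hypothetical parameter).
    keep_silence_ms: silence retained on each side of a speech frame
        (assumed rule; the source does not describe the exact one).
    """
    margin = round(keep_silence_ms * fps / 1000.0)  # margin in frames
    n = len(labels)
    keep = [False] * n
    for i, label in enumerate(labels):
        if label == 1:
            # Mark the speech frame and its neighbours inside the margin.
            for j in range(max(0, i - margin), min(n, i + margin + 1)):
                keep[j] = True
    return [i for i, flag in enumerate(keep) if flag]


# Toy example: 12 frames at a made-up 100 fps.
labels = [0, 0, 0, 1, 1, 0, 0, 0, 0, 1, 0, 0]
kept_trim = frames_to_keep(labels, fps=100)                        # drop all silence
kept_margin = frames_to_keep(labels, fps=100, keep_silence_ms=10)  # keep 1 frame each side
print(kept_trim)
print(kept_margin)
```

With a zero margin only the speech frames survive. A non-zero margin keeps short pauses around speech, which roughly corresponds to the "keeping 180 ms silence" condition compared in the paper.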
Position in field
Core ultrasound SSI preprocessing paper.
Evidence
“Then we implement a CNN to separate silence and speech frames based on the ultrasound tongue images, so we basically create a VAD algorithm that works with ultrasound images.”
author_claim · Abstract · confidence 0.99
“Dev set / Test set: Accuracy 0.87 / 0.852; recall 0.94 / 0.95; precision 0.877 / 0.864; F1 0.91 / 0.9; ROC AUC 0.894 / 0.859; Cohen’s Kappa 0.672 / 0.57”
metric · Table 5. · confidence 0.99
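As a quick consistency check on Table 5, F1 can be recomputed from the reported precision and recall using the standard harmonic-mean definition. The helper name `f1_score` is illustrative; only the four input values come from the table.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Precision and recall as quoted from Table 5.
dev_f1 = f1_score(precision=0.877, recall=0.94)
test_f1 = f1_score(precision=0.864, recall=0.95)
print(round(dev_f1, 2), round(test_f1, 2))
```

Both values round to the F1 figures in the table (0.91 dev, 0.9 test), so the reported metrics are internally consistent.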
“Removing silence by VAD (MSE dev / MSE test / MCD): conv3D 0.436 / 0.428 / 3.15; Conv3D+BiLSTM 0.393 / 0.41 / 3.05. VAD + keeping 180ms silence (MSE dev / MSE test / MCD): conv3D 0.38 / 0.27 / 3.28; Conv3D+BiLSTM 0.35 / 0.26 / 3.12”
metric · Table 7. Training the SSI system with removing or retaining silence from the data · confidence 0.98
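For readers unfamiliar with the MCD column, the sketch below computes mel-cepstral distortion using the commonly cited definition: (10 / ln 10) · sqrt(2 · Σ (c_d − ĉ_d)²) per frame, averaged over frames. Whether the paper uses exactly this variant, including whether it skips the 0th (energy) coefficient, is an assumption. The function name, the `skip_c0` flag, and the toy data are hypothetical.

```python
import math


def mel_cepstral_distortion(ref, pred, skip_c0=True):
    """Mean MCD in dB over time-aligned frames of mel-cepstral coefficients.

    ref, pred: lists of frames, each frame a list of coefficients.
    skip_c0: drop the 0th coefficient (assumed convention).
    """
    if len(ref) != len(pred) or not ref:
        raise ValueError("need the same non-zero number of frames")
    start = 1 if skip_c0 else 0
    scale = 10.0 / math.log(10.0) * math.sqrt(2.0)
    total = 0.0
    for r, p in zip(ref, pred):
        squared_error = sum((a - b) ** 2 for a, b in zip(r[start:], p[start:]))
        total += scale * math.sqrt(squared_error)
    return total / len(ref)


# Toy frames with three coefficients each (made-up numbers).
ref = [[1.0, 0.5, 0.2], [1.1, 0.4, 0.1]]
pred = [[0.9, 0.5, 0.2], [1.0, 0.3, 0.1]]
mcd = mel_cepstral_distortion(ref, pred)
print(mcd)
```

Lower MCD means the synthesized spectrum is closer to the reference, which is why 3.05 versus 3.12 is read as a small improvement from removing silence.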
Limits
Technical limits
Single-speaker data and speech-derived labels limit the result.
Evaluation limits
No cross-speaker validation or live interactive test is provided.
Deployment limits
Practical robustness to probe shift, silence styles, and new users is unknown.
Scope limits
Speech-silence preprocessing for one ultrasound SSI setup.