2021 · arXiv / imported corpus page · Field expert review · confidence high

3D Convolutional Neural Networks for Ultrasound-Based Silent Speech Interfaces

László Tóth, Amin Honarmandi Shandiz


Temporal context helps, but the evidence is a single-speaker vocoder-parameter study.

Verdict: full-text draft · Priority: medium-high · Confidence: high · Basis: full text · Coverage: high

Reading guidance

Verdict: full-text draft · priority medium-high · confidence high
Why it matters: Demonstrates that a compact (2+1)D 3D CNN outperforms 2D CNN and CNN+LSTM models at regressing speech vocoder parameters from ultrasound video on a standard dataset, improving regression metrics while reducing model complexity and training time.
What to trust: Basis: full text. Coverage: high. 7 evidence records back the review.
What is weak: Single-speaker data; models 13 vocoder coefficients without pitch. Only MSE and R2 are reported, with no listening tests or intelligibility evaluations. No silent articulation scenario, no real-time study, and no robustness-to-probe-shift analysis. Scope is speaker-dependent direct speech synthesis from read-aloud ultrasound, not general SSI deployment. Overclaim risk: medium-low.
Read before: SSI review rubric
Read next: SSI archive

Axes

Task: speech-reconstruction
Modality: ultrasound
Hardware: Micro ultrasound system by Articulate Instruments Ltd. with 2-4 MHz 64-element convex transducer at 82 fps
Body site: tongue
Output: speech-audio
Metrics: Mean squared error (MSE) and mean R2 over 13 predicted vocoder parameters: 12 Mel-Generalized Cepstral Coefficients in Line Spectral Pair (LSP) form plus gain, without pitch (F0).
Evaluation mode: Development/test objective comparison across FCN, 2D CNN, 3D CNN with various temporal strides, and re-trained CNN+LSTM under matched parameter count.
Review confidence: high
Overclaim risk: medium-low
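
The two metrics listed under Axes can be sketched in a few lines. This is a minimal illustration, not the authors' evaluation code: it assumes R2 means the coefficient of determination computed per vocoder parameter and then averaged (the paper may define it differently), and the function names and toy values are hypothetical.

```python
def mse(y_true, y_pred):
    """Mean squared error over all frames and all output parameters."""
    n_values = sum(len(row) for row in y_true)
    total = sum((t - p) ** 2
                for row_t, row_p in zip(y_true, y_pred)
                for t, p in zip(row_t, row_p))
    return total / n_values


def mean_r2(y_true, y_pred):
    """R2 per output parameter, averaged across parameters.

    Assumption: R2 is the coefficient of determination, 1 - SS_res / SS_tot.
    """
    n_params = len(y_true[0])
    scores = []
    for k in range(n_params):
        t = [row[k] for row in y_true]
        p = [row[k] for row in y_pred]
        mean_t = sum(t) / len(t)
        ss_res = sum((a - b) ** 2 for a, b in zip(t, p))
        ss_tot = sum((a - mean_t) ** 2 for a in t)
        scores.append(1.0 - ss_res / ss_tot)
    return sum(scores) / n_params


# Toy example: 3 frames, 2 parameters (the paper predicts 13 per frame).
targets = [[0.0, 1.0], [1.0, 3.0], [2.0, 5.0]]
predictions = [[0.1, 1.2], [0.9, 2.9], [2.2, 4.8]]
print(mse(targets, predictions), mean_r2(targets, predictions))
```

Averaging R2 per parameter keeps a high-variance coefficient from dominating the score, whereas the pooled MSE does not normalize by each parameter's variance.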

Expert take

The paper makes a solid but narrow contribution. On a single-speaker ultrasound dataset, a (2+1)D 3D CNN that processes spaced frames (stride s=6, about 300 ms of context) outperforms both a 2D CNN and a more complex CNN+LSTM model at regressing 13 vocoder parameters (12 MGC-LSP coefficients plus gain, without F0). The 3D CNN reaches a test MSE of 0.315 and a mean R2 of 0.683, compared with 0.366/0.633 for the 2D CNN and 0.336/0.661 for the CNN+LSTM. The study is limited to speaker-dependent regression on one female Hungarian speaker, with no pitch modeling and no listening tests. The method shows promise as a simpler temporal front-end for ultrasound SSI, but it needs more robust evaluation, including perceptual quality, before deployment.
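
The "about 300 ms" figure can be checked against the 82 fps frame rate. The sketch below assumes the network sees 5 frames per input window; that count is an assumption made here for illustration and is not stated in this review, though it is consistent with the quoted context length.

```python
FPS = 82  # frame rate reported for the ultrasound recordings


def context_span_ms(n_frames, stride, fps=FPS):
    """Time between the first and last frame of a strided window, in ms."""
    return (n_frames - 1) * stride / fps * 1000.0


frame_period_ms = 1000.0 / FPS                # about 12.2 ms per frame
span = context_span_ms(n_frames=5, stride=6)  # n_frames=5 is an assumption
print(f"frame period: {frame_period_ms:.1f} ms, window span: {span:.0f} ms")
```

Under that assumption the window spans roughly 293 ms, which matches the context length quoted above.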

True value

Demonstrates that a compact (2+1)D 3D CNN outperforms 2D CNN and CNN+LSTM models at regressing speech vocoder parameters from ultrasound video on a standard dataset, improving regression metrics while reducing model complexity and training time.

What changed

Canon before

Ultrasound SSI systems often processed frames independently or used heavier recurrent stacks to add temporal context.

Delta from canon

Uses spaced temporal context inside a compact 3D CNN instead of a recurrent sequence model.
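
Spaced temporal context amounts to picking every s-th frame around the frame of interest rather than a block of adjacent frames. A minimal sketch of that index selection follows; the centred window, the clamping at recording boundaries, and the function name are assumptions for illustration, not details confirmed by the paper.

```python
def strided_window(center, n_frames, stride, n_total):
    """Indices of n_frames frames spaced `stride` apart, centred on `center`.

    Assumes an odd n_frames. Indices that fall outside the recording are
    clamped to the nearest valid frame (one of several possible policies).
    """
    half = n_frames // 2
    indices = [center + (k - half) * stride for k in range(n_frames)]
    return [min(max(i, 0), n_total - 1) for i in indices]


# Mid-recording: frames 88, 94, 100, 106, 112 feed the network together.
print(strided_window(center=100, n_frames=5, stride=6, n_total=1000))
# Near the start: out-of-range indices are clamped to frame 0.
print(strided_window(center=3, n_frames=5, stride=6, n_total=1000))
```

Spacing the frames widens the time span covered without increasing the number of frames the network has to process.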

Position in field

Method paper for speaker-dependent ultrasound-to-acoustic mapping.

Evidence

“ Here, we follow the latter approach, and we investigate the applicability of a special 3D CNN model called the (2+1)D CNN [30] for ultrasound-based direct speech synthesis, and compare the results with those of a CNN+LSTM model. ”

author_claim · Abstract · confidence 1.00

“ Altogether 438 sentences (approximately half an hour) were recorded from the subject, which was divided into train, development and test sets in a 310-41-87 ratio. ”

fact · 3 Data Acquisition · confidence 1.00

“ The vocoder represented the speech signals by 12 Mel-Generalized Cepstral Coefficients (MGCC) converted to a Line Spectral Pair representation (LSP), with the signal’s gain being the 13th parameter. ”

fact · 3 Data Acquisition · confidence 1.00

“ 3D Convolutional Network (3D CNN): To enable the processing of video frames sequences, we changed the 2D convolution to 3D convolution in our CNN. ”

fact · 4 Experimental Set-Up · confidence 1.00
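
To make the (2+1)D idea concrete: a full t x d x d 3D kernel is factorized into a 1 x d x d spatial convolution followed by a t x 1 x 1 temporal convolution. The sketch below only counts weights (biases ignored). The channel sizes are hypothetical, and the parameter-matching rule for the intermediate channel count comes from the original (2+1)D formulation; whether this paper applies it is not confirmed here.

```python
def conv3d_params(c_in, c_out, t, d):
    """Weights in a full 3D convolution with a t x d x d kernel."""
    return t * d * d * c_in * c_out


def conv2plus1d_params(c_in, c_out, t, d, c_mid):
    """Weights in a (2+1)D block: spatial 1 x d x d, then temporal t x 1 x 1."""
    spatial = d * d * c_in * c_mid
    temporal = t * c_mid * c_out
    return spatial + temporal


def matched_mid_channels(c_in, c_out, t, d):
    """Intermediate channels that keep the block within the 3D weight budget."""
    return (t * d * d * c_in * c_out) // (d * d * c_in + t * c_out)


# Hypothetical layer sizes, chosen only for illustration.
c_in, c_out, t, d = 16, 32, 3, 3
c_mid = matched_mid_channels(c_in, c_out, t, d)
print(conv3d_params(c_in, c_out, t, d))                # full 3D kernel
print(conv2plus1d_params(c_in, c_out, t, d, c_mid))    # matched (2+1)D block
print(conv2plus1d_params(c_in, c_out, t, d, c_out))    # c_mid = c_out
```

With a matched intermediate width the factorized block has about the same weight count but an extra nonlinearity between the two convolutions; with a smaller width it is simply cheaper.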

“
Network Type       Dev MSE   Dev Mean R2   Test MSE   Test Mean R2
FCN                0.408     0.599         0.400      0.598
2D CNN             0.377     0.630         0.366      0.633
3D CNN (s=6)       0.321     0.684         0.315      0.683
FCN [5]            0.384     0.619         n/a        n/a
CNN + LSTM [25]    0.345     0.653         0.336      0.661
”

metric · 5 Results · confidence 1.00
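
The size of the gain is easier to judge as a relative change. The sketch below recomputes it from the test-set figures in the quoted results; the dictionary layout and function names are illustrative only.

```python
# Test-set (MSE, mean R2) pairs copied from the quoted results table.
TEST_RESULTS = {
    "FCN": (0.400, 0.598),
    "2D CNN": (0.366, 0.633),
    "3D CNN (s=6)": (0.315, 0.683),
    "CNN + LSTM": (0.336, 0.661),
}


def relative_mse_reduction(baseline, model="3D CNN (s=6)"):
    """Fraction by which `model` lowers test MSE relative to `baseline`."""
    base_mse = TEST_RESULTS[baseline][0]
    return (base_mse - TEST_RESULTS[model][0]) / base_mse


def r2_gain(baseline, model="3D CNN (s=6)"):
    """Absolute difference in mean R2 between `model` and `baseline`."""
    return TEST_RESULTS[model][1] - TEST_RESULTS[baseline][1]


for name in ("FCN", "2D CNN", "CNN + LSTM"):
    print(f"{name}: MSE reduction {relative_mse_reduction(name):.3f}, "
          f"R2 gain {r2_gain(name):.3f}")
```

On these numbers the 3D CNN lowers test MSE by roughly 14% relative to the 2D CNN and by roughly 6% relative to the CNN+LSTM, a consistent but modest margin given the single-speaker test set.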

“ The vocoder represented the speech signals by 12 Mel-Generalized Cepstral Coefficients (MGCC) converted to a Line Spectral Pair representation (LSP), with the signal’s gain being the 13th parameter. ”

limitation · 6 Conclusions · confidence 1.00

“ Here, we follow the latter approach, and we investigate the applicability of a special 3D CNN model called the (2+1)D CNN [30] for ultrasound-based direct speech synthesis, and compare the results with those of a CNN+LSTM model. ”

deployment_claim · 6 Conclusions · confidence 1.00

Limits

Technical limits

Single-speaker data; models 13 vocoder coefficients without pitch; no silent articulation or real-time deployment tested; no subjective quality evaluation.

Evaluation limits

Only MSE and R2 are reported; there are no listening tests or intelligibility evaluations.

Deployment limits

No silent articulation scenario, no real-time study, and no robustness-to-probe-shift analysis.

Scope limits

Speaker-dependent direct speech synthesis from read-aloud ultrasound, not general SSI deployment.