SottoVoce: An Ultrasound Imaging-Based Silent Speech Interaction Using Deep Neural Networks
A solid proof of concept that reconstructs speech audio from ultrasound images to control unmodified smart speakers. It offers an important system-design insight despite prototype limitations in latency, hardware bulk, and speaker dependency.
Reading guidance
- Verdict
- full-text draft · priority high · confidence high
- Why it matters
- This paper's key value lies in reframing silent speech interaction as a speech regeneration and ecosystem reuse problem, deploying a two-stage DNN approach to produce audio from ultrasonic tongue and jaw imaging. It demonstrates practical integration with existing voice agents, highlighting a promising architectural direction distinct from direct command recognition methods.
- What to trust
- Basis: full text + existing expert seed. Coverage: high. 7 evidence records back the review.
- What is weak
- Training is speaker-dependent, and only two participants were used for training and testing. End-to-end testing covers a small fixed vocabulary (four Alexa commands, each repeated five times), with no speaker-independent or open-vocabulary evaluation. Silent and voiced articulation differ, so users must adapt. The hardware is a bulky 3.5 MHz convex probe attached under the jaw with digitized display capture; it is not wearable or miniaturized, and the safety of continuous ultrasonic emission is not evaluated. Total processing time (~2.61 s per command) is too slow for real-time or continuous interaction. Overclaim risk: medium.
- Read before
- SilentSpeller: Towards mobile, hands-free, silent speech text entry using electropalatography
- Read next
- NasoVoce: A Nose-Mounted Low-Audibility Speech Interface for Always-Available Speech Interaction
Axes
- Task
- speech-reconstruction
- Modality
- ultrasound
- Hardware
- 3.5 MHz convex ultrasound probe attached under the jaw; the ultrasound images are captured from the display monitor and stored as digitized video
- Body site
- jaw; oral-cavity
- Output
- speech-audio
- Vocabulary
- Command-level
- Metrics
- Network 1 alone achieved an average 42.5% smart speaker command recognition success; Network 1 plus Network 2 achieved 65.0%; ground-truth audio reached 90.0%. Google speech-to-text word error rates were 41.03% for Network 1 outputs and 33.56% for Network 2 outputs, versus 20.61% for ground truth audio.
- Evaluation mode
- Quantitative smart speaker success rates, word error rates with Google speech-to-text, and qualitative user adaptation observations.
- Review confidence
- high
- Overclaim risk
- medium
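The word error rates listed under Metrics compare a speech-to-text transcript against a reference transcript. As a minimal sketch of how such a figure is commonly computed (word-level edit distance divided by reference length), the function below may help when reading the numbers. The function name and the example phrases are hypothetical and are not taken from the paper, and this is not claimed to be the authors' scoring script.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen so far
    # (before the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion of a reference word
                curr[j - 1] + 1,     # insertion of a hypothesis word
                prev[j - 1] + cost,  # substitution or exact match
            )
        prev = curr
    return prev[-1] / len(ref)


# Hypothetical transcripts: one substituted word out of four reference words.
example = word_error_rate("alexa play some music", "alexa play sam music")
```

Because the denominator is the reference length, a WER can exceed 100% when the hypothesis contains many inserted words; rates such as 33.56% versus 20.61% are therefore best read as relative comparisons under the same recognizer.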
Expert take
Kimura et al. present a well-engineered proof of concept for silent speech interaction using ultrasonic imaging and deep neural networks. Their primary contribution is a two-stage neural pipeline converting ultrasonic images captured below the jaw into Mel-spectrogram features and refining those into audio signals, which can then control unmodified commercial smart speakers like Amazon Alexa. This architectural decision to reconstruct audio, rather than directly classify commands, is a significant reframing within SSI research. The study provides modest yet concrete quantitative results: a 65% command recognition success rate with the two-network pipeline, 33.56% word error rate on Google STT, and demonstration of system use with limited command sets. They explicitly discuss limitations such as speaker dependence, system latency (~2.61 s), bulky hardware, and user adaptation requirements for silent speech without vocal fold vibration. These aspects clarify that while the prototype is not ready for real-world deployment or continuous real-time interaction, it is a strong conceptual and technical foundation for future research on SSI architectures that reuse existing voice ecosystems via speech regeneration.
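The data flow described above (ultrasound frames, then Network 1, then Network 2, then audio synthesis) can be sketched as a simple composition of stages. Everything in the block is an illustrative assumption: the type aliases, the function names, the placeholder vector size, and the stub stages stand in for trained models and are not the authors' architecture.

```python
from typing import Callable, List, Sequence

Image = List[List[float]]  # one ultrasound frame as rows of pixel intensities
MelVector = List[float]    # one Mel-scale spectrum vector
Audio = List[float]        # mono waveform samples

N_MELS = 4  # placeholder vector size for the stubs, not the paper's value


def regenerate_speech(
    frames: Sequence[Image],
    network1: Callable[[Sequence[Image]], List[MelVector]],
    network2: Callable[[List[MelVector]], List[MelVector]],
    mel_to_audio: Callable[[List[MelVector]], Audio],
) -> Audio:
    """Chain the stages: images -> Mel vectors -> refined vectors -> audio."""
    coarse = network1(frames)      # stage 1: per-frame spectrum estimate
    refined = network2(coarse)     # stage 2: sequence-level refinement
    return mel_to_audio(refined)   # final step: spectrum to waveform


def stub_network1(frames: Sequence[Image]) -> List[MelVector]:
    # Placeholder: one vector per frame, filled with the frame's mean intensity.
    vectors = []
    for frame in frames:
        pixels = [p for row in frame for p in row]
        vectors.append([sum(pixels) / len(pixels)] * N_MELS)
    return vectors


def stub_network2(mels: List[MelVector]) -> List[MelVector]:
    # Placeholder "refinement": 3-point moving average along the time axis.
    refined = []
    for i in range(len(mels)):
        window = mels[max(0, i - 1): i + 2]
        refined.append([sum(col) / len(window) for col in zip(*window)])
    return refined


def stub_mel_to_audio(mels: List[MelVector]) -> Audio:
    # Placeholder synthesis: one sample per vector (a real system would
    # invert the Mel-scale spectrum into a full waveform).
    return [sum(v) / len(v) for v in mels]
```

Calling `regenerate_speech(frames, stub_network1, stub_network2, stub_mel_to_audio)` returns a waveform-like list that could then be played to a smart speaker. The point of the sketch is the interface: because the output is audio rather than a command label, the downstream recognizer needs no modification.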
True value
This paper's key value lies in reframing silent speech interaction as a speech regeneration and ecosystem reuse problem, deploying a two-stage DNN approach to produce audio from ultrasonic tongue and jaw imaging. It demonstrates practical integration with existing voice agents, highlighting a promising architectural direction distinct from direct command recognition methods.
What changed
Canon before
Most prior silent speech interfaces recognized commands directly or relied on cameras or other sensors, and did not integrate with unmodified smart speaker ecosystems.
Delta from canon
This work shifts SSI from direct command recognition to speech audio regeneration that can be fed to standard speech recognition engines and smart speakers without modification.
Position in field
An early and influential demonstration of ultrasound-based speech regeneration SSI with system-level insights stronger than its present prototype performance.
Evidence
“ Recent researchers have challenged to use deep neural networks with ultrasound imaging for silent speech [7, 51]; however, they are not based on convolutional neural networks and are [...] [...]ing a method known as lip reading, images of the mouth of the speaker or the entire face are captured by a camera, and the content of the utterance is estimated from those images [52]. ”
author_claim · Abstract · confidence 1.00
“ Approximately 500 speech commands were collected from each collaborator (Table ??). For each command, as well as the voice utterance, a video of the ultrasonic images was recorded. [...] and Network 2 was unclear, we observed that the sound generated by Network 2 was better than that generated by Network 1 (Examples of the output audio signals are given [...] ”
metric · 4 RESULTS · confidence 0.95
“ Approximately 500 speech commands were collected from each collaborator (Table ??). For each command, as well as the voice utterance, a video of the ultrasonic images was recorded. [...] and Network 2 was unclear, we observed that the sound generated by Network 2 was better than that generated by Network 1 (Examples of the output audio signals are given [...] ”
validation_scope · 4 RESULTS · confidence 0.98
“ For this test, the participants spoke the following four commands, five times [...] As our model is speaker dependent, both Network 1 and Network 2 are trained for each speaker. ”
limitation · 5 END-TO-END EVALUATION · confidence 0.97
“ To increase the number of test sets, data augmentation by applying Gaussian noise to the input Mel-scale spectrum vectors [...] speaker (Amazon Echo and Amazon Echo Show), and this test confirmed that the generated sounds can control smart speakers. ”
deployment_claim · 4 RESULTS · confidence 0.95
“ Approximately 500 speech commands were collected from each collaborator (Table ??). For each command, as well as the voice utterance, a video of the ultrasonic images was recorded. [...] and Network 2 was unclear, we observed that the sound generated by Network 2 was better than that generated by Network 1 (Examples of the output audio signals are given [...] ”
actual_novelty · 3 SYSTEM ARCHITECTURE OF SOTTOVOCE · confidence 0.90
“ The total processing time (including video processing, neural networks processing, and conversion of the Mel-scale spectrum to an audio wave) was 2.61 s. ”
deployment_claim · 3 SYSTEM ARCHITECTURE OF SOTTOVOCE · confidence 0.90
Limits
Technical limits
Speaker-dependent training; latency unsuitable for real-time use (2.61 s per utterance); differences in silent versus voiced articulation require user adaptation; bulky hardware; potential unknown safety issues with continuous ultrasound emission; small vocabulary size.
Evaluation limits
Only two participants were used for training and testing; the command vocabulary is small (four Alexa commands) in end-to-end testing, repeated five times each; and no speaker-independent or open vocabulary evaluations were performed.
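To make the small-sample concern concrete, the sketch below computes 95% Wilson score intervals for the reported success rates. It assumes the end-to-end trials were pooled as 2 participants x 4 commands x 5 repetitions = 40 trials. That denominator is an assumption, not a figure stated in this review, although the reported rates (42.5%, 65.0%, 90.0%) are consistent with 17, 26, and 36 successes out of 40. The function name is hypothetical.

```python
import math
from typing import Tuple


def wilson_interval(successes: int, trials: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a binomial proportion (z = 1.96 for ~95%)."""
    if trials <= 0 or not 0 <= successes <= trials:
        raise ValueError("need trials > 0 and 0 <= successes <= trials")
    p = successes / trials
    z2 = z * z
    denom = 1 + z2 / trials
    centre = (p + z2 / (2 * trials)) / denom
    half = z * math.sqrt(p * (1 - p) / trials + z2 / (4 * trials * trials)) / denom
    return centre - half, centre + half


# Assumed pooling: participants x commands x repetitions.
TRIALS = 2 * 4 * 5

intervals = {}
for label, rate in [("Network 1", 0.425),
                    ("Network 1 + 2", 0.650),
                    ("Ground truth", 0.900)]:
    successes = round(rate * TRIALS)  # e.g. 0.650 * 40 -> 26
    intervals[label] = wilson_interval(successes, TRIALS)
    low, high = intervals[label]
    print(f"{label}: {successes}/{TRIALS}, 95% CI {low:.3f} to {high:.3f}")
```

With only 40 trials the intervals are wide, so the gap between Network 1 alone and the two-network pipeline should be read as indicative rather than conclusive under this assumption.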
Deployment limits
The device requires a bulky 3.5 MHz convex probe attached under the jaw and digitized display capture; continuous ultrasonic emission safety is not evaluated; the system is not wearable or miniaturized; and its latency (~2.61 s per command) is too slow for real-time use.
Scope limits
Prototype supports only a fixed small command vocabulary in speaker-dependent training; no demonstration of open vocabulary or continuous real-time interaction.