2021 · arXiv / imported corpus page · Field expert review · confidence high

Speech Synthesis from Text and Ultrasound Tongue Image-based Articulatory Input

Tamás Gábor Csapó, László Tóth, Gábor Gosztolya, Alexandra Markó

arXiv

Helpful side information, not standalone SSI.

Verdict: full-text draft · Priority: medium-high · Confidence: high · Basis: full text · Coverage: high

Reading guidance

Verdict: full-text draft · priority medium-high · confidence high
Why it matters: The full text supports a narrower claim than the title suggests: the best system is not a standalone SSI but a text-to-speech system helped by ultrasound side information under limited data.
What to trust: Basis: full text. Coverage: high. 3 evidence records back the review.
What is weak: Speaker-dependent training, limited data, and sensitivity to probe misalignment constrain the result. All evidence consists of speaker-dependent offline synthesis metrics, with no listening tests. The best system still requires text input and a probe-mounted headset. The work is articulatory-augmented TTS, not pure silent-speech reconstruction. Overclaim risk: describing the fused system as a standalone SSI rather than text-conditioned synthesis.
Read before: SSI review rubric
Read next: SSI archive

Axes

Task: speech synthesis from text plus ultrasound articulatory input
Modality: text and ultrasound tongue images
Hardware: Articulate Instruments Micro ultrasound system with probe-fixing headset
Body site: tongue
Output: speech audio
Metrics: Combined text plus ultrasound yields the lowest test MCD for all 8 speakers. For example, speakers 03mn and 06fe score 5.442 and 5.236 with the combined input, versus 5.652 and 5.447 with text only; ultrasound-only is far worse at 7.153 and 7.050.
Evaluation mode: 8-speaker dev/test evaluation with MCD, BAP, F0-RMSE, F0-CORR, F0-VUV and probe-misalignment analysis
Review confidence: high
Overclaim risk: Overclaim begins if the fused system is described as a standalone SSI rather than text-conditioned synthesis.
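
For readers unfamiliar with the headline metric, the sketch below shows the commonly used definition of mel-cepstral distortion (MCD) over time-aligned frames. It is a minimal illustration, not the paper's evaluation code: the function name, the `skip_c0` option, and the choice to average per frame are assumptions, and the review does not state whether the energy coefficient was excluded.

```python
import math


def mel_cepstral_distortion(ref_frames, syn_frames, skip_c0=True):
    """Mean MCD over time-aligned mel-cepstral frames (lower is better).

    Assumption: frames are already aligned one-to-one; skip_c0 drops the
    energy coefficient, a common but not universal convention.
    """
    if len(ref_frames) != len(syn_frames):
        raise ValueError("frame sequences must be aligned and equal in length")
    if not ref_frames:
        raise ValueError("need at least one frame")
    start = 1 if skip_c0 else 0
    # Standard scaling constant: (10 / ln 10) * sqrt(2)
    scale = 10.0 / math.log(10.0) * math.sqrt(2.0)
    total = 0.0
    for ref, syn in zip(ref_frames, syn_frames):
        squared = sum((r - s) ** 2 for r, s in zip(ref[start:], syn[start:]))
        total += scale * math.sqrt(squared)
    return total / len(ref_frames)
```

Identical reference and synthesized frames give a distortion of zero, which is a quick sanity check before comparing systems.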

Expert take

The paper earns its improvement claim. For every speaker in Table 1, the combined text-plus-ultrasound system beats text-only MCD by a small but consistent margin, which is exactly what you want from side information in a limited-data setup. But the same full text also states the system is not suitable for direct SSI because the best pipeline needs both text and articulatory input, and probe misalignment remains a serious failure mode.
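
To make "small but consistent" concrete, the arithmetic below uses only the two speakers' test MCD values quoted under Metrics. The helper name `relative_reduction` is illustrative; the other six speakers are not reproduced here because their values are not listed in this review.

```python
# Test-set MCD values quoted in this review (lower is better).
text_only = {"03mn": 5.652, "06fe": 5.447}
combined = {"03mn": 5.442, "06fe": 5.236}
ultrasound_only = {"03mn": 7.153, "06fe": 7.050}


def relative_reduction(baseline, system):
    """Percentage by which `system` lowers the error of `baseline`."""
    return 100.0 * (baseline - system) / baseline


for speaker in text_only:
    gain = text_only[speaker] - combined[speaker]
    rel = relative_reduction(text_only[speaker], combined[speaker])
    print(f"{speaker}: absolute gain {gain:.3f}, relative gain {rel:.1f}%")
```

For both speakers the absolute gain over text-only is about 0.21 MCD points, a relative reduction of just under 4%, while ultrasound-only trails text-only by roughly 1.5 to 1.6 points.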

True value

The full text supports a narrower claim than the title suggests: the best system is not a standalone SSI but a text-to-speech system helped by ultrasound side information under limited data.

What changed

Canon before

Articulatory-to-speech work usually used ultrasound alone, while DNN-TTS usually used text alone.

Delta from canon

Combines conventional text-side linguistic features with ultrasound-derived articulatory features inside one DNN-TTS pipeline.
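
One simple way to picture this combination is frame-wise concatenation of the two feature streams before they enter the network. The sketch below is an assumption for illustration only: the review does not specify how the paper fuses the inputs, and the function name and toy dimensions are hypothetical.

```python
def fuse_frames(linguistic_frames, ultrasound_frames):
    """Concatenate per-frame linguistic and articulatory feature vectors.

    Assumption: both streams are already resampled to the same frame rate,
    so frame i of one stream corresponds to frame i of the other.
    """
    if len(linguistic_frames) != len(ultrasound_frames):
        raise ValueError("streams must have the same number of frames")
    return [list(ling) + list(ult)
            for ling, ult in zip(linguistic_frames, ultrasound_frames)]


# Toy example with hypothetical sizes: 4 linguistic and 3 ultrasound dims.
linguistic = [[1, 0, 0, 1], [0, 1, 0, 1]]
ultrasound = [[0.12, 0.40, 0.05], [0.10, 0.42, 0.07]]
fused = fuse_frames(linguistic, ultrasound)
```

Under this reading, dropping the ultrasound columns recovers the text-only baseline, which is why the fused system is best described as text-conditioned synthesis with side information.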

Position in field

Adjacent articulatory-synthesis paper relevant to SSI but not itself a complete silent-speech interface.

Evidence

“Articulatory features derived from medical imaging data (e.g. ultrasound or MRI) have not been used before for additional input of HMM-TTS or DNN-TTS.” “[…]tation features as output for synthesizing speech, at a 5 ms frame step with the WORLD vocoder (60-dimensional MGC, 5-dimensional BAP, and 1-dimensional LF0, with delta and delta-delta features).”

author_claim · Abstract · confidence 0.99
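
The vocoder parameterization in the quote above implies a per-frame output size that can be worked out directly. The tally below assumes delta and delta-delta features are applied to all three streams and that no extra voicing flag is counted; neither detail is confirmed by the quoted text.

```python
# Static stream sizes as quoted: MGC, BAP, LF0.
STATIC_DIMS = {"mgc": 60, "bap": 5, "lf0": 1}
DYNAMIC_ORDERS = 3      # static + delta + delta-delta (assumed for every stream)
FRAME_STEP_MS = 5       # frame step quoted for the WORLD vocoder features

static_total = sum(STATIC_DIMS.values())
with_dynamics = static_total * DYNAMIC_ORDERS
frames_per_second = 1000 // FRAME_STEP_MS

print(f"static dims per frame: {static_total}")
print(f"dims with delta and delta-delta: {with_dynamics}")
print(f"frames per second: {frames_per_second}")
```

Under these assumptions the network predicts 66 static values (198 with dynamics) for each of 200 frames per second of speech.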

“We trained the networks for 25 epochs with a warm-up of 10 epochs, applying early stopping, and a learning rate of 0.002 after that with exponential decay.”

metric · Table 1: MCD errors on the dev/test set. · confidence 0.99
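
The training schedule in the quote above can be read in more than one way. The sketch below shows one plausible reading: a constant rate during the 10 warm-up epochs, then exponential decay starting from 0.002. The constant warm-up rate, the 0-indexed epochs, and the decay factor of 0.5 are assumptions, not values stated in the quote, and early stopping is left out.

```python
def learning_rate(epoch, base_lr=0.002, warmup_epochs=10, decay=0.5):
    """Learning rate for a 0-indexed epoch under the assumed schedule.

    Assumptions: the rate is held at base_lr during warm-up, and `decay`
    (hypothetical value) is applied once per epoch afterwards.
    """
    if epoch < 0:
        raise ValueError("epoch must be non-negative")
    if epoch < warmup_epochs:
        return base_lr
    return base_lr * decay ** (epoch - warmup_epochs)


TOTAL_EPOCHS = 25  # as quoted
schedule = [learning_rate(e) for e in range(TOTAL_EPOCHS)]
```

With any decay factor below 1, the last 15 epochs contribute progressively smaller updates, so most of the fitting happens during and shortly after warm-up.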

“Besides, we analyze the ultrasound tongue recordings of several speakers, and show that misalignments in the ultrasound transducer positioning can have a negative effect on the final synthesis performance.” “[…] research fields investigating such relationship is articulatory-to-acoustic (forward) mapping, when the input is a speech-related biosignal (e.g. tongue or lip movement), and the target is synthesized speech.”

limitation · 4. Effect of ultrasound transducer position · confidence 0.98

Limits

Technical limits

Speaker-dependent training, limited data, and sensitivity to probe misalignment constrain the result.
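
As a rough illustration of how probe misalignment could be flagged, the sketch below compares the pixel-wise mean image of two recording sessions using mean squared error. This is an illustrative proxy only: the review does not state which misalignment measure the paper uses, and the function names and flat-list image layout are assumptions.

```python
def mean_image(frames):
    """Pixel-wise mean over a session's frames (each frame a flat list)."""
    if not frames:
        raise ValueError("session has no frames")
    count = len(frames)
    return [sum(pixels) / count for pixels in zip(*frames)]


def mse(image_a, image_b):
    """Mean squared error between two equally sized flat images."""
    if len(image_a) != len(image_b):
        raise ValueError("images must have the same number of pixels")
    return sum((a - b) ** 2 for a, b in zip(image_a, image_b)) / len(image_a)


def session_misalignment(session_a, session_b):
    """Larger values suggest the probe sat differently in the two sessions."""
    return mse(mean_image(session_a), mean_image(session_b))


# Toy data: the bright "tongue" pixel moves one position between sessions.
reference = [[0, 10, 0], [0, 12, 0]]
shifted = [[0, 0, 10], [0, 0, 12]]
score = session_misalignment(reference, shifted)
```

A score near zero between sessions of the same speaker is what a stable probe position would look like under this proxy; a clear jump would be a reason to inspect the recordings before training on them.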

Evaluation limits

All evidence consists of speaker-dependent offline synthesis metrics, with no listening tests.

Deployment limits

The best system still requires text input and a probe-mounted headset.

Scope limits

The work is articulatory-augmented TTS, not pure silent-speech reconstruction.