WESPER: Zero-shot and Realtime Whisper to Normal Voice Conversion for Whisper-based Speech Interactions
Strong whisper-conversion paper, but it remains whisper-based rather than truly silent SSI.
Reading guidance
- Verdict
- full-text draft · priority medium · confidence high
- Why it matters
- The practical contribution is a low-friction whispered-speech conversion stack that can run in real time without per-user paired corpora.
- What to trust
- Basis: full text. Coverage: high. 4 evidence records back the review.
- What is weak
- The method still depends on whispered speech rather than truly silent articulation. The core evidence is on whispered conversion datasets and listening studies, not on silent-speech benchmarks. Although real time is shown, the paper does not establish a full silent interface or broad public-use robustness. Overclaim risk: medium.
- Read before
- SSI review rubric
- Read next
- SSI archive
Axes
- Task
- whisper-to-normal speech conversion
- Modality
- whispered speech from an ordinary microphone
- Hardware
- ordinary microphone
- Output
- speech-audio
- Metrics
- On wTIMIT whispers, Google ASR gives WER 44.70 and CER 28.38 on the unconverted input; on WESPER-converted whispers these improve to WER 26.68 and CER 12.70. The HuBERT-base setup pretrained on LibriSpeech+wTIMIT reaches WER 13.75 and CER 5.47.
- Evaluation mode
- MOS, MUSHRA, and speech-recognition evaluation on whispered and converted speech
- Review confidence
- high
- Overclaim risk
- medium
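To make the Metrics entry easier to sanity-check, here is a minimal sketch of how WER and CER are conventionally computed (edit distance over words or characters, divided by reference length) and how the reported before/after scores translate into a relative reduction. It assumes the reported figures are percentages; the function names are illustrative and are not taken from the paper or its code.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (strings or lists)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def wer(reference, hypothesis):
    """Word error rate in percent, relative to the reference word count."""
    ref_words = reference.split()
    return 100.0 * edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference, hypothesis):
    """Character error rate in percent, relative to the reference length."""
    return 100.0 * edit_distance(reference, hypothesis) / len(reference)


def relative_reduction(before, after):
    """Relative error reduction in percent."""
    return 100.0 * (before - after) / before


# Scores quoted in the Metrics entry (Google ASR on wTIMIT whispers,
# before and after WESPER conversion).
wer_drop = relative_reduction(44.70, 26.68)
cer_drop = relative_reduction(28.38, 12.70)
print(f"relative WER reduction: {wer_drop:.1f}%")
print(f"relative CER reduction: {cer_drop:.1f}%")
```

Under that assumption, conversion cuts WER by roughly 40% and CER by roughly 55% relative to the unconverted whispers, which is the sense in which recognition "improves substantially".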
Expert take
WESPER is strong on its own terms. The paper backs the important claims: common speech units reduce the whisper-normal mismatch, the non-autoregressive stack runs in real time, and recognition improves substantially after conversion. The boundary condition is scope, not evidence quality. This is adjacent to SSI because whispered speech still produces audible input, so it should not be sold as a full silent-speech result.
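A common way to check a real-time claim like the one above is the real-time factor (RTF): processing time divided by audio duration, with values below 1 meaning the converter keeps up with the input. The sketch below is only an illustration of that check. The review does not record how the paper measured timing, and `convert` is a hypothetical stand-in for any whisper-to-normal converter.

```python
import time


def real_time_factor(processing_seconds, audio_seconds):
    """RTF = processing time / audio duration; below 1.0 is faster than real time."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return processing_seconds / audio_seconds


def measure_rtf(convert, samples, sample_rate):
    """Time one call to `convert` (hypothetical converter) on a buffer of samples."""
    start = time.perf_counter()
    convert(samples)
    elapsed = time.perf_counter() - start
    return real_time_factor(elapsed, len(samples) / sample_rate)


# Example with a placeholder converter that returns its input unchanged.
one_second = [0.0] * 16000  # one second of audio at an assumed 16 kHz rate
rtf = measure_rtf(lambda x: x, one_second, 16000)
print(f"RTF: {rtf:.6f}")
```

An RTF below 1 on a single buffer is necessary for real-time use but not sufficient for a deployed interface, which is the gap the deployment limit below points at.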
True value
The practical contribution is a low-friction whispered-speech conversion stack that can run in real time without per-user paired corpora.
What changed
Canon before
Whisper-to-normal conversion typically required paired whisper-normal corpora or speaker-dependent training, making discreet speech interfaces hard to deploy.
Delta from canon
WESPER removes the paired-data requirement by learning common speech units from unpaired whisper and normal speech.
Position in field
A strong adjacent paper on discreet speech interaction, but it sits next to SSI rather than inside fully silent speech.
Evidence
“ Figure 1: WESPER is a real-time whisper-to-normal speech conversion mechanism consisting of a speech-to-unit (STU) encoder that generates common speech units for whispered and normal utterances using self-supervised pre-training, and a unit-to-speech (UTS) decoder that recovers speech from the speech units. ”
author_claim · ABSTRACT · confidence 0.97
“ It requires a microphone-like device placed very close to the mouth, and training is required for users to speak correctly with ingressive speech. [...] to the person’s normal voice, or even to another person’s voice. Because the encoder and decoder operate in a non-autoregressive [...] ”
deployment_claim · 1 INTRODUCTION · confidence 0.96
“ People with Speech Disorders: An important goal of WESPER is the reconstruction of atypical speech of people with speech disorders or hearing impairments. [...] The results are summarized in Table 2 in terms of word error rate (WER) and character error rate (CER), as well as bilingual evaluation understudy (BLEU). ”
metric · Table 2 · confidence 0.97
“ To overcome these problems, various silent speech input techniques have been developed [3, 31, 32, 50, 51, 54]; however, these methods require special sensors and have not achieved high accuracy in speech recognition, remaining instead at the level of recognizing predefined commands. [...] • We propose a real-time, speaker-independent, vocabulary-free whisper-to-normal speech conversion method that can be trained only on unpaired whispers and normal speech. ”
limitation · 1 INTRODUCTION · confidence 0.94
Limits
Technical limits
The method still depends on whispered speech rather than truly silent articulation.
Evaluation limits
The core evidence is on whispered conversion datasets and listening studies, not on silent-speech benchmarks.
Deployment limits
Although real time is shown, the paper does not establish a full silent interface or broad public-use robustness.
Scope limits
Whisper-to-normal conversion only; this is not fully silent speech.