2023 · Proceedings of the 2023 CHI Conference on Human Factors in Computing Systems (CHI '23), April 23-28, 2023 · Field expert review · confidence high

WESPER: Zero-shot and Realtime Whisper to Normal Voice Conversion for Whisper-based Speech Interactions

Jun Rekimoto


Strong whisper-conversion paper, but it remains whisper-based rather than truly silent SSI.

Verdict: full-text draft · Priority: medium · Confidence: high · Basis: full text · Coverage: high

Reading guidance

Verdict: full-text draft · priority medium · confidence high
Why it matters: The practical contribution is a low-friction whispered-speech conversion stack that can run in real time without per-user paired corpora.
What to trust: Basis: full text. Coverage: high. 4 evidence records back the review.
What is weak: The method depends on whispered speech rather than truly silent articulation, so it is whisper-to-normal conversion only. The core evidence comes from whispered-conversion datasets and listening studies, not from silent-speech benchmarks. Real-time operation is shown, but the paper does not establish a full silent interface or broad public-use robustness. Overclaim risk: medium.
Read before: SSI review rubric
Read next: SSI archive

Axes

Task: whisper-to-normal speech conversion
Modality: whispered speech from an ordinary microphone
Hardware: ordinary microphone
Output: speech-audio
Metrics: On unconverted wTIMIT whispers, Google ASR yields WER 44.70 and CER 28.38; on WESPER-converted whispers these improve to WER 26.68 and CER 12.70. The HuBERT-base setup pretrained on Librispeech+wTIMIT reaches WER 13.75 and CER 5.47.
Evaluation mode: MOS, MUSHRA, and speech-recognition evaluation on whispered and converted speech
Review confidence: high
Overclaim risk: medium
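
For readers who want to sanity-check the Metrics entry, here is a minimal sketch of how WER and CER are conventionally computed (edit distance divided by reference length) and how the size of the improvement can be expressed as a relative reduction. It assumes the reported figures are percentages. The function names and the example sentences are illustrative and do not come from the paper, whose exact scoring and text normalization are not described in this review.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def wer(reference, hypothesis):
    """Word error rate in percent (assumes a non-empty reference)."""
    ref_words = reference.split()
    return 100.0 * edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference, hypothesis):
    """Character error rate in percent (assumes a non-empty reference)."""
    return 100.0 * edit_distance(list(reference), list(hypothesis)) / len(reference)


def relative_reduction(before, after):
    """Relative error reduction in percent."""
    return 100.0 * (before - after) / before


# Illustrative sentences (not from wTIMIT): one dropped word out of six.
print(round(wer("the cat sat on the mat", "the cat sat on mat"), 2))

# Figures quoted in the Metrics entry above, assumed to be percentages.
print(round(relative_reduction(44.70, 26.68), 1))  # WER, raw vs. converted
print(round(relative_reduction(28.38, 12.70), 1))  # CER, raw vs. converted
```

On the quoted figures this works out to a relative reduction of roughly 40% in WER and 55% in CER after conversion, which is the basis for calling the recognition gain substantial.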

Expert take

WESPER is strong on its own terms. The paper backs the important claims: common speech units reduce the whisper-normal mismatch, the non-autoregressive stack runs in real time, and recognition improves substantially after conversion. The boundary condition is scope, not evidence quality. This is adjacent to SSI because whispered speech still produces audible input, so it should not be sold as a full silent-speech result.
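
One conventional way to check a real-time claim like this is the real-time factor (RTF): processing time divided by audio duration, with values below 1 meaning the converter keeps up with incoming speech. The sketch below is a hedged illustration of chunked conversion plus an RTF measurement. The `convert_chunk` callable is a hypothetical stand-in for any converter, not the paper's model, and the review does not quote the paper's own latency figures.

```python
import time


def real_time_factor(processing_seconds, audio_seconds):
    """Processing time over audio duration; below 1.0 keeps up with real time."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return processing_seconds / audio_seconds


def chunk_samples(samples, chunk_size):
    """Yield consecutive fixed-size chunks; the last one may be shorter."""
    for start in range(0, len(samples), chunk_size):
        yield samples[start:start + chunk_size]


def stream_convert(samples, chunk_size, convert_chunk):
    """Convert audio chunk by chunk with a caller-supplied (hypothetical) converter."""
    output = []
    for chunk in chunk_samples(samples, chunk_size):
        output.extend(convert_chunk(chunk))
    return output


def measure_rtf(samples, sample_rate, chunk_size, convert_chunk):
    """Time a full streaming pass and report its real-time factor."""
    start = time.perf_counter()
    stream_convert(samples, chunk_size, convert_chunk)
    elapsed = time.perf_counter() - start
    return real_time_factor(elapsed, len(samples) / sample_rate)


# Usage with a placeholder converter that returns each chunk unchanged.
one_second_of_audio = [0.0] * 16000
print(measure_rtf(one_second_of_audio, 16000, 320, lambda chunk: chunk) < 1.0)
```

The chunk size and sample rate here are arbitrary choices for the example; a meaningful measurement needs the actual model, hardware, and buffering settings.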

True value

The practical contribution is a low-friction whispered-speech conversion stack that can run in real time without per-user paired corpora.

What changed

Canon before

Whisper-to-normal conversion typically required paired whisper-normal corpora or speaker-dependent training, making discreet speech interfaces hard to deploy.

Delta from canon

WESPER removes the paired-data requirement by learning common speech units from unpaired whisper and normal speech.

Position in field

A strong adjacent paper on discreet speech interaction, but it sits next to SSI rather than inside fully silent speech.

Evidence

“ Figure 1: WESPER is a real-time whisper-to-normal speech conversion mechanism consisting of a speech-to-unit (STU) encoder that generates common speech units for whispered and normal utterances using self-supervised pre-training, and a unit-to-speech (UTS) decoder that recovers speech from the speech units. ”

author_claim · ABSTRACT · confidence 0.97

“ It requires a microphone-like device placed very close to the mouth, and training is required for users to speak correctly with ingressive speech. ” “ […] the person’s normal voice, or even to another person’s voice. Because the encoder and decoder operate in a non-autoregressive […] ”

deployment_claim · 1 INTRODUCTION · confidence 0.96

“ The results are summarized in Table 2 in terms of word error rate (WER) and character error rate (CER), as well as bilingual evaluation understudy (BLEU). ” “ People with Speech Disorders: An important goal of WESPER is the reconstruction of atypical speech of people with speech disorders or hearing impairments. ”

metric · Table 2 · confidence 0.97

“ To overcome these problems, various silent speech input techniques have been developed [3, 31, 32, 50, 51, 54]; however, these methods require special sensors and have not achieved high accuracy in speech recognition, remaining instead at the level of recognizing predefined commands. ” “ We propose a real-time, speaker-independent, vocabulary-free whisper-to-normal speech conversion method that can be trained only on unpaired whispers and normal speech. ”

limitation · 1 INTRODUCTION · confidence 0.94

Limits

Technical limits

The method still depends on whispered speech rather than truly silent articulation.

Evaluation limits

The core evidence is on whispered conversion datasets and listening studies, not on silent-speech benchmarks.

Deployment limits

Although real time is shown, the paper does not establish a full silent interface or broad public-use robustness.

Scope limits

Whisper-to-normal conversion only; this is not fully silent speech.