
2023 · Proceedings of the 2023 CHI Conference on Human Factors in Computing Systems (CHI '23), April 23–28, 2023 · Field expert review · confidence high

WESPER: Zero-shot and Realtime Whisper to Normal Voice Conversion for Whisper-based Speech Interactions

Jun Rekimoto

Strong whisper-conversion paper, but it remains whisper-based rather than truly silent SSI.

Verdict: full-text draft · Priority: medium · Confidence: high · Basis: full text · Coverage: high

Reading guidance

Verdict
full-text draft · priority medium · confidence high
Why it matters
The practical contribution is a low-friction whispered-speech conversion stack that can run in real time without per-user paired corpora.
What to trust
Basis: full text. Coverage: high. 4 evidence records back the review.
What is weak
The method depends on whispered speech rather than truly silent articulation, so this is whisper-to-normal conversion only. The core evidence comes from whispered-conversion datasets and listening studies, not from silent-speech benchmarks. Real-time operation is shown, but the paper does not establish a full silent interface or robustness in broad public use. Overclaim risk: medium.

Axes

Task
whisper-to-normal speech conversion
Modality
whispered speech from an ordinary microphone
Hardware
ordinary microphone
Output
speech-audio
Metrics
On wTIMIT whispers, Google ASR scores WER 44.70 and CER 28.38 on the unconverted input; after WESPER conversion these improve to WER 26.68 and CER 12.70. The HuBERT-base setup pretrained on LibriSpeech+wTIMIT reaches WER 13.75 and CER 5.47.
Evaluation mode
MOS, MUSHRA, and speech-recognition evaluation on whispered and converted speech
Review confidence
high
Overclaim risk
medium
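
To put the Metrics row in proportion, the relative error reduction can be worked out from the reported figures alone. The snippet below is a minimal sketch: the helper name `relative_reduction` is ours rather than the paper's, and only the Google ASR before/after pair is used, because the review does not state which baseline the HuBERT-base figures should be compared against.

```python
def relative_reduction(before: float, after: float) -> float:
    """Return the fractional drop from `before` to `after` (0.25 means 25% lower)."""
    if before <= 0:
        raise ValueError("baseline error must be positive")
    return (before - after) / before


# Figures quoted in the Metrics row (wTIMIT whispers, Google ASR).
raw = {"WER": 44.70, "CER": 28.38}        # unconverted whispers
converted = {"WER": 26.68, "CER": 12.70}  # after WESPER conversion

# The ratio is unit-independent, so it does not matter whether the
# scores are read as percentages or as plain numbers.
reductions = {name: relative_reduction(raw[name], converted[name]) for name in raw}

for name, value in reductions.items():
    print(f"{name}: {raw[name]:.2f} -> {converted[name]:.2f} "
          f"({value:.1%} relative reduction)")
```

On these numbers, conversion removes roughly 40% of the word errors and roughly 55% of the character errors. That supports the "improves substantially" reading while still leaving about a quarter of words misrecognized.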

Expert take

WESPER is strong on its own terms. The paper backs the important claims: common speech units reduce the whisper-normal mismatch, the non-autoregressive stack runs in real time, and recognition improves substantially after conversion. The boundary condition is scope, not evidence quality. This is adjacent to SSI because whispered speech still produces audible input, so it should not be sold as a full silent-speech result.

What changed

Canon before

Whisper-to-normal conversion typically required paired whisper-normal corpora or speaker-dependent training, making discreet speech interfaces hard to deploy.

Delta from canon

WESPER removes the paired-data requirement by learning common speech units from unpaired whisper and normal speech.

Position in field

A strong adjacent paper on discreet speech interaction, but it sits next to SSI rather than inside fully silent speech.

Evidence

“Figure 1: WESPER is a real-time whisper-to-normal speech conversion mechanism consisting of a speech-to-unit (STU) encoder that generates common speech units for whispered and normal utterances using self-supervised pre-training, and a unit-to-speech (UTS) decoder that recovers speech from the speech units.”

author_claim · ABSTRACT · confidence 0.97

“It requires a microphone-like device placed very close to the mouth, and training is required for users to speak correctly with ingressive speech. [...] to the person’s normal voice, or even to another person’s voice. Because the encoder and decoder operate in a non-autoregressive [...]”

deployment_claim · 1 INTRODUCTION · confidence 0.96

“The results are summarized in Table 2 in terms of word error rate (WER) and character error rate (CER), as well as bilingual evaluation understudy (BLEU). [...] People with Speech Disorders: An important goal of WESPER is the reconstruction of atypical speech of people with speech disorders or hearing impairments.”

metric · Table 2 · confidence 0.97
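
The WER and CER figures in this record are edit-distance ratios: the number of substitutions, insertions, and deletions needed to turn the recognizer output into the reference, divided by the reference length. The sketch below shows the standard computation. The example sentences are invented for illustration, and counting spaces as characters in CER is one convention among several; the paper's exact text normalization is not stated in this review.

```python
from typing import Sequence


def edit_distance(ref: Sequence[str], hyp: Sequence[str]) -> int:
    """Levenshtein distance between two token sequences (row-by-row DP)."""
    prev = list(range(len(hyp) + 1))  # distance from empty ref prefix
    for i, r in enumerate(ref, start=1):
        curr = [i]  # deleting the first i reference tokens
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance over reference word count."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference: str, hypothesis: str) -> float:
    """Character error rate; spaces are counted as characters (an assumption)."""
    if not reference:
        raise ValueError("reference must not be empty")
    return edit_distance(list(reference), list(hypothesis)) / len(reference)


# Hypothetical transcript pair, not taken from the paper.
reference = "turn on the kitchen light"
hypothesis = "turn of the kitchen lights"
print(f"WER = {wer(reference, hypothesis):.2f}")
print(f"CER = {cer(reference, hypothesis):.2f}")
```

Two wrong words out of five give a WER of 0.40, while the same two single-character slips give a CER of only 0.08. This is why CER is consistently lower than WER in the reported results.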

“To overcome these problems, various silent speech input techniques have been developed [3, 31, 32, 50, 51, 54]; however, these methods require special sensors and have not achieved high accuracy in speech recognition, remaining instead at the level of recognizing predefined commands. [...] • We propose a real-time, speaker-independent, vocabulary-free whisper-to-normal speech conversion method that can be trained only on unpaired whispers and normal speech.”

limitation · 1 INTRODUCTION · confidence 0.94

Limits

Technical limits

The method still depends on whispered speech rather than truly silent articulation.

Evaluation limits

The core evidence comes from whispered-conversion datasets and listening studies, not from silent-speech benchmarks.

Deployment limits

Although real-time operation is shown, the paper does not establish a full silent interface or robustness in broad public use.

Scope limits

Whisper-to-normal conversion only; this is not fully silent speech.