2026 · CHI '26 / arXiv · Field expert review · confidence high

NasoVoce: A Nose-Mounted Low-Audibility Speech Interface for Always-Available Speech Interaction

Jun Rekimoto, Yu Nishimura, Bojian Yang

A strong deployment-focused speech interface leveraging a novel nose-pad dual-sensor configuration and multimodal fusion to enable robust low-audibility speech interaction with AI under noise, backed by extensive evaluation.

Verdict: full-text draft · Priority: high · Confidence: high · Basis: full text + existing expert seed · Coverage: high

Reading guidance

Verdict: full-text draft · priority high · confidence high
Why it matters: Demonstrates a feasible, discreet, and noise-robust wearable speech interface: whispered input is captured by a nose-pad microphone and vibration sensor, fused by a dual-input enhancement model, and evaluated on both enhancement quality and ASR. It moves beyond small-vocabulary silent speech towards open-vocabulary AI conversation.
What to trust: Basis: full text + existing expert seed. Coverage: high. Four evidence records back the review.
What is weak: The fusion model is not fully streaming; vibration signals from whispered speech remain weak, which limits enhancement quality; the vibration sensor alone is the better input only at very low SNR. ASR and objective metrics were evaluated on synthetically noise-corrupted audio; in-the-wild testing was qualitative and covered few environments; generalization to unseen words was not tested. Fully streaming continuous operation, smartphone integration, adaptive sensor gating based on SNR, and calibration for physiological factors such as nasal patency remain future work. The system targets low-audibility whispered speech, not fully silent speech without any acoustic leakage, and assumes the user covers the mouth with a hand for privacy. Overclaim risk: medium.
Read before: SottoVoce: An Ultrasound Imaging-Based Silent Speech Interaction Using Deep Neural Networks
Read next: SSI archive

Axes

Task: speech-enhancement; speech-recognition for whispered and low-audibility speech to support always-available AI voice interaction
Modality: acoustic; vibration; multimodal
Hardware: MEMS microphone (Syntiant SPH0141LM4H-1) and MEMS vibration sensor (Syntiant V2S200D) integrated in smart glasses nose pads providing synchronized PDM output.
Body site: face; nose; oral-cavity
Output: speech-audio
Vocabulary: Open-vocabulary speech
Metrics: Recognition accuracy expressed in word error rate (WER) and character error rate (CER) under varying noise conditions; PESQ and STOI perceptual audio quality; MUSHRA subjective audio quality scores.
Evaluation mode: Quantitative ASR accuracy (WER, CER) on held-out data, objective perceptual quality metrics (PESQ, STOI), MUSHRA subjective ratings with 50 evaluators, and qualitative in-the-wild recordings in four real-world environments.
Review confidence: high
Overclaim risk: medium
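
The WER and CER figures listed under Metrics are both edit-distance ratios. A minimal sketch of how they are commonly computed follows; the function names are illustrative, and the choice to strip whitespace before computing CER is an assumption here, not something the paper is known to specify.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # deletion: reference item missing from hypothesis
                curr[j - 1] + 1,     # insertion: extra item in hypothesis
                prev[j - 1] + cost,  # substitution, or match when cost is 0
            ))
        prev = curr
    return prev[-1]


def wer(reference, hypothesis):
    """Word error rate: word-level edit distance / number of reference words."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference, hypothesis):
    """Character error rate, with whitespace removed (an assumed convention)."""
    ref_chars = list("".join(reference.split()))
    hyp_chars = list("".join(hypothesis.split()))
    if not ref_chars:
        raise ValueError("reference must contain at least one character")
    return edit_distance(ref_chars, hyp_chars) / len(ref_chars)


if __name__ == "__main__":
    print(wer("turn on the light", "turn on a light"))
    print(cer("turn on the light", "turn on a light"))
```

Both rates can exceed 1.0 when the hypothesis contains many insertions, which is worth remembering when reading results at very low SNR.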

Expert take

NasoVoce is a serious, well-constructed system that houses a MEMS microphone and a MEMS vibration sensor in the nose pads of smart glasses, keeping the form factor discreet. It fuses these complementary modalities with a novel dual-input D-DCCRN enhancement model trained with Whisper-based knowledge distillation, which lets it capture low-volume and whispered speech robustly. Its practical feasibility rests on an extensive dataset (104 hours from 45 participants) and a rigorous evaluation spanning ASR metrics, objective speech quality, a large-scale subjective study, and qualitative real-world trials. The core method is validated, but deployment aspects need further work: continuous streaming, context-adaptive sensor fusion, smartphone integration, and accommodation of physiological variability such as nasal patency.

True value

Demonstrates a feasible, discreet, and noise-robust wearable speech interface: whispered input is captured by a nose-pad microphone and vibration sensor, fused by a dual-input enhancement model, and evaluated on both enhancement quality and ASR. It moves beyond small-vocabulary silent speech towards open-vocabulary AI conversation.

What changed

Canon before

Wearable silent speech and whispered speech interfaces typically struggled to balance wearability, vocabulary size, noise robustness, and social acceptability, often limited to small command sets or requiring obtrusive sensors.

Delta from canon

NasoVoce innovates by mounting a MEMS microphone and vibration sensor at the smart glasses nose pad, capturing complementary air- and skin-conducted signals for robust low-volume and whispered speech capture, combined with a dual-input enhancement model.

Position in field

A strong modern SSI-adjacent wearable speech paper that refocuses from fully silent recognition to discreet, robust AI voice interaction with open vocabulary.

Evidence

“ Lip-reading techniques have an extended vocabulary size, but camera-based systems covering the face impose high wearability costs, limiting their suitability for daily use. As a method related to silent speech, “whispered speech” has also been proposed. [...] • A whisper input mechanism by integrating a microphone and vibration sensor into nose pads; This design reduces ambient noise for both normal and whispered speech while maintaining the smart glasses’ appearance and wearability. ”

author_claim · Abstract · confidence 1.00

“ Lip-reading techniques have an extended vocabulary size, but camera-based systems covering the face impose high wearability costs, limiting their suitability for daily use. As a method related to silent speech, “whispered speech” has also been proposed. [...] • A whisper input mechanism by integrating a microphone and vibration sensor into nose pads; This design reduces ambient noise for both normal and whispered speech while maintaining the smart glasses’ appearance and wearability. ”

actual_novelty · 3 NasoVoce · confidence 1.00

“ [...] recorded MEMS vibration sensor (Vib) channel, microphone signals obtained by mixing the clean speech with various noises (Mic conditions), and signals enhanced from Vib+Mic by the proposed audio-enhancement model. [Figure 7: Speech recognition accuracy (WER, CER) for normal and whispered speech: MEMS microphone (Mic), MEMS vibration sensor (Vib), audio enhancement by D-DCCRN. Both panels: x-axis signal-to-noise ratio [dB], ticks 20, 10, 0, -10.] 5.1 ASR (WER/CER) Fig. ”

metric · 5 Evaluation · confidence 1.00

“ Lip-reading techniques have an extended vocabulary size, but camera-based systems covering the face impose high wearability costs, limiting their suitability for daily use. As a method related to silent speech, “whispered speech” has also been proposed. [...] • A whisper input mechanism by integrating a microphone and vibration sensor into nose pads; This design reduces ambient noise for both normal and whispered speech while maintaining the smart glasses’ appearance and wearability. ”

limitation · 6 Discussions · confidence 1.00

Limits

Technical limits

The fusion model is not fully streaming; vibration signals from whispered speech remain weak, which limits enhancement quality; the vibration sensor alone is the better input only at very low SNR.

Evaluation limits

ASR and objective metrics were evaluated on synthetically noise-corrupted audio; in-the-wild testing was qualitative and covered few environments; generalization to unseen words was not tested.
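
Synthetic noise corruption of this kind is typically done by scaling a noise recording so that the mixture reaches a target SNR. A minimal sketch follows, assuming equal-length sample lists and mean-square power as the level measure; `mean_power` and `mix_at_snr` are illustrative names, and the paper's actual mixing pipeline may differ.

```python
import math


def mean_power(samples):
    """Mean-square power of a sequence of samples."""
    return sum(v * v for v in samples) / len(samples)


def mix_at_snr(clean, noise, snr_db):
    """Add `noise` to `clean`, scaled so the clean-to-noise power ratio is `snr_db`."""
    if len(clean) != len(noise):
        raise ValueError("clean and noise must have the same length")
    p_clean = mean_power(clean)
    p_noise = mean_power(noise)
    if p_clean == 0 or p_noise == 0:
        raise ValueError("clean and noise must both have non-zero power")
    # Solve 10 * log10(p_clean / (gain**2 * p_noise)) == snr_db for gain.
    gain = math.sqrt(p_clean / (p_noise * 10 ** (snr_db / 10)))
    return [c + gain * n for c, n in zip(clean, noise)]


if __name__ == "__main__":
    # Toy signals standing in for real recordings (illustrative only).
    clean = [math.sin(0.1 * i) for i in range(1000)]
    noise = [((i * 37) % 11 - 5) / 5.0 for i in range(1000)]
    noisy = mix_at_snr(clean, noise, snr_db=0)
    print(len(noisy))
```

Because the noise is added after recording, a mixture like this does not capture effects such as speakers raising their voice in loud surroundings, which is one reason synthetic-noise results can differ from in-the-wild behaviour.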

Deployment limits

Fully streaming continuous operation, smartphone integration, adaptive sensor gating based on SNR, and calibration for physiological factors such as nasal patency remain future work.
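
To make "adaptive sensor gating based on SNR" concrete, here is a minimal sketch of the idea. It is not part of the reviewed system: the function name, the path labels, and the -5 dB threshold are all hypothetical, and a real design would also need an SNR estimator, which is not shown.

```python
def select_input_path(estimated_snr_db, vib_only_below_db=-5.0):
    """Pick which signal path to use for the current audio segment.

    Returns "vib" (vibration sensor alone) when the estimated SNR is below
    the threshold, otherwise "fused" (microphone + vibration enhancement).
    The default threshold is a placeholder, not a value from the paper.
    """
    if estimated_snr_db < vib_only_below_db:
        return "vib"
    return "fused"


if __name__ == "__main__":
    for snr in (20.0, 0.0, -10.0):
        print(snr, select_input_path(snr))
```

A hard threshold like this would switch back and forth when the SNR hovers near the boundary, so a deployed version would probably add hysteresis or smoothing.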

Scope limits

Targets low-audibility whispered speech, not fully silent speech without any acoustic leakage; assumes the user covers the mouth with a hand for privacy.