Silent speech papers by publication date.
Browse the SSI review database by year, citation count, title, author, or page type.
Paper pages are expert evaluations, not abstract reposts. Citation counts come from OpenAlex when available.
Browse papers
106 papers shown
Cross-Modal Masking for Robust Silent Speech Synthesis Using sEMG and Lipreading
The paper advances silent speech synthesis by leveraging masked training to robustly fuse electromyography and lipreading, showing improved performance and resilience, but adaptation to laryngectomized users remains challenging.
A 1000-hour EEG-EMG-audio dataset of Japanese speech production
A 1020-hour multimodal EEG-EMG-audio dataset for Japanese overt speech vastly expands data resources, enabling diverse speech decoding and EEG research, though generalization is limited by three participants and no decoding benchmarks are presented.
Zero-Shot Imagined Speech Decoding via Imagined-to-Listened MEG Mapping
The study convincingly shows zero-shot imagined speech decoding by mapping MEG imagery to listened responses and decoding with a listened-trained contrastive model, marking a promising data-efficient advance despite limited vocabulary and hardware constraints.
NasoVoce: A Nose-Mounted Low-Audibility Speech Interface for Always-Available Speech Interaction
A strong deployment-focused speech interface leveraging a novel nose-pad dual-sensor configuration and multimodal fusion to enable robust low-audibility speech interaction with AI under noise, backed by extensive evaluation.
A cross-species neural foundation model for end-to-end speech decoding
Introduces a cross-species pretrained transformer encoder enabling state-of-the-art end-to-end neural speech decoding with audio-LLMs, improving accuracy and enabling imagined speech decoding, but latency and real-time deployment remain challenges.
SonicVisionLM: Playing Sound with Vision Language Models
A high-quality video-to-audio generation framework leveraging vision-language models for editable, temporally precise sound effect generation; strong experimental validations but outside standard SSI scope.
IR-UWB Radar-Based Contactless Silent Speech Recognition of Vowels, Consonants, Words, and Phrases
This paper introduces FERASEC, a novel radar feature extraction enabling the first contactless IR-UWB radar phoneme-level silent speech recognition with 86% vowel and 81% consonant accuracy, surpassing raw signal baselines and signifying a key advance in practical silent speech interfaces.
Ultrasensitive Textile Strain Sensors Redefine Wearable Silent Speech Interfaces with High Machine Learning Efficiency
Strong SSI system combining a novel ultrasensitive throat textile strain sensor with an efficient 1D residual CNN, achieving high word classification accuracy with low computational cost and promising few-shot transfer to new users and words on small vocabularies.
Distributed pressure matching strategy using diffusion adaptation
Presents distributed rootless pressure matching for personal sound zones, validated in simulation only; not an SSI paper.
Advancing Test-Time Adaptation for Acoustic Foundation Models in Open-World Shifts
Strong acoustic ASR paper proposing confidence-weighted frame adaptation plus temporal consistency regularization for stable test-time adaptation under wild acoustic conditions, yielding substantial WER improvements across noise, accents, and singing datasets.
Sound Source Localization is All about Cross-Modal Alignment
Provides a novel multi-positive contrastive framework enhancing semantic audio-visual alignment for sound source localization. Strong experimental evidence supports claims. Method is outside the SSI domain.
Let There Be Sound: Reconstructing High Quality Speech from Silent Videos
Strong lip-to-speech system that reduces ambiguity via SSL linguistic conditioning, variance predictors, and flow-based refinement, achieving near-vocoded naturalness and improved intelligibility on standard datasets.
An Initial Exploration: Learning to Generate Realistic Audio for Silent Video
Honest exploratory comparison showing transformer-based model outperforms deep-fusion CNN and Wavenet for generating low-to-mid frequency audio from silent video in a small curated dataset; not a speech or SSI paper.
Audio Knowledge Empowered Visual Speech Recognition
The paper advances visual speech recognition by selectively transferring refined linguistic audio knowledge via a learned compact memory and cross-attention injection, improving benchmark WERs over prior audio-assisted methods without requiring audio inputs during inference.
Knowledge Distilled Ensemble Model for sEMG-based Silent Speech Interface
This paper delivers a practical spelling-focused sEMG silent speech system by compressing a ResNet ensemble into a lightweight model achieving 85.9% accuracy on the NATO alphabet with portable hardware, but remains limited to 5 young male subjects and speaker-dependent scenarios.
Automatically measuring speech fluency in people with aphasia: first achievements using read-speech data
Strong clinical fluency regression method validated on noisy read speech from aphasia patients; outside core SSI modalities and use-cases.
Exploring how a Generative AI interprets music
A thorough interpretability analysis reveals that MusicVAE uses only a few dozen latent dimensions to encode music with pitch and rhythm strongly represented in the first two, but the work has no direct relevance to silent speech interfaces.
Audio-visual video-to-speech synthesis with synthesized input audio
The paper credibly shows that incorporating synthesized audio as an auxiliary input in a second-stage audiovisual synthesis model improves video-to-speech reconstruction quality and intelligibility in benchmarks, though gains depend on model variant and dataset.
Audio-aware Query-enhanced Transformer for Audio-Visual Segmentation
Strong AVS result, outside SSI: the useful idea is audio-conditioned decoder queries plus dynamic mask prediction.
RobustL2S: Speaker-Specific Lip-to-Speech Synthesis exploiting Self-Supervised Representations
Strong modular SSL-based lip-to-speech synthesis paper that innovatively maps lip SSL features to disentangled speech embeddings before vocoder synthesis, demonstrating improved intelligibility and robustness across benchmark datasets.
Diff-Foley: Synchronized Video-to-Audio Synthesis with Latent Diffusion Models
The real gain is not 'diffusion' alone but aligned conditioning plus guidance that pushes synchronization very hard.
High-Quality Automatic Voice Over with Accurate Alignment: Supervision through Self-Supervised Discrete Speech Units
This video-conditioned AVO system innovatively supervises alignment by predicting discrete speech units rather than reconstructing acoustic features, leading to better lip-sync and speech quality on a single-speaker dataset; however, it is not an SSI interface paper.
Large-scale unsupervised audio pre-training for video-to-speech synthesis
Good decoder-transfer pretraining improves video-to-speech quality on several benchmarks, but WER gains are not consistent. A useful methodological contribution with strong benchmark support, adjacent to SSI rather than a deployable system.
LipVoicer: Generating Speech from Silent Videos Guided by Lip Reading
Strong full-text paper demonstrating that inference-time text guidance via ASR classifier is key to significantly improved intelligibility in lip-to-speech synthesis on challenging in-the-wild video datasets, outperforming prior baselines.
Intelligible Lip-to-Speech Synthesis with Speech Units
Speech units as a pseudo-text target enable strong content supervision that substantially cuts WER without text labels, and the multi-input vocoder improves speech quality from blurry mel outputs, yielding a state-of-the-art lip-to-speech system on LRS benchmarks.
Adaptation of Tongue Ultrasound-Based Silent Speech Interfaces Using Spatial Transformer Networks
Strong full-text-backed evidence that most of the gain comes from fast input alignment, not from inventing a new SSI stack.
Zero-shot personalized lip-to-speech synthesis with face image based voice control
Demonstrates effective zero-shot voice control in Lip2Speech by leveraging face image-based speaker embeddings, validated on GRID corpus but constrained by dataset vocabulary and speech naturalness.
Improving the Gap in Visual Speech Recognition Between Normal and Silent Speech Based on Metric Learning
Strong viseme-level metric learning approach reduces silent speech VSR errors on a small 10-phrase dataset, notably achieving parity with baselines using much less silent data.
Conditional Generation of Audio from Video via Foley Analogies
The paper matters because it gives V2A generation a controllable exemplar, not because it beats every timing baseline.
Speech Reconstruction from Silent Tongue and Lip Articulation By Pseudo Target Generation and Domain Adversarial Training
Strong SSI paper improving silent speech reconstruction by generating pseudo acoustic targets and using domain adversarial training to address domain mismatch; validated with TaL dataset showing substantial WER and MOS gains over TaLNet.
WESPER: Zero-shot and Realtime Whisper to Normal Voice Conversion for Whisper-based Speech Interactions
Strong whisper-conversion paper, but it remains whisper-based rather than truly silent SSI.
Duration-aware pause insertion using pre-trained language model for multi-speaker text-to-speech
The paper presents a strong multi-speaker TTS phrasing approach leveraging speaker-conditioned BERT embeddings and pause duration categories to improve pause insertion precision and synthetic speech rhythm; however, it is out-of-scope for SSI as it focuses on audible speech synthesis only.
LipLearner: Customizable Silent Speech Interactions on Mobile Devices
LipLearner is a strong mobile silent speech system that uniquely closes the loop from few-shot lipreading model design to practical on-device customization and keyword spotting, demonstrated robustly in real-world conditions and a user study.
Towards Neural Decoding of Imagined Speech based on Spoken Speech
Transfer of CSP+SVM models trained on spoken speech EEG to imagined speech achieves comparable, though slightly lower, accuracy within a limited 5-class, 7-subject offline EEG setup, with visual imagery control supporting specificity.
Breaking the trade-off in personalized speech enhancement with cross-task knowledge distillation
Strong causal PSE paper, not SSI. The pVAD-guided loss is the part that holds up under full-text reading.
Movement Detection of Tongue and Related Body Parts Using IR-UWB Radar
Good sensing primitive, very small task.
Lip-to-Speech Synthesis for Arbitrary Speakers in the Wild
The real contribution is not just another VAE-GAN; it is turning lip-to-speech into an arbitrary-speaker problem with credible low-data adaptation.
An Anchor-Free Detector for Continuous Speech Keyword Spotting
Strong CSKWS paper, not SSI. The detection framing and unknown class are the points that hold up in full text.
FastLTS: Non-Autoregressive End-to-End Unconstrained Lip-to-Speech Synthesis
This paper matters because it makes unconstrained lip-to-speech materially faster without obviously sacrificing quality.
Improved Processing of Ultrasound Tongue Videos by Combining ConvLSTM and 3D Convolutional Networks
An empirically supported, incremental advancement showing that hybrid 3D-CNN plus ConvLSTM models modestly outperform prior ultrasound tongue video SSI architectures in mel-spectrogram regression accuracy and model efficiency on single-speaker data.
VisageSynTalk: Unseen Speaker Video-to-Speech Synthesis via Speech-Visage Feature Selection
The paper is really about disentangling identity, and that is why the unseen-speaker results hold up.
Silence is Sweeter Than Speech: Self-Supervised Model Using Silence to Store Speaker Information
Strong evidence that silence segments in HuBERT representations uniquely store speaker information, improving SID accuracy when silence is augmented; analytical SSL probing paper outside silent speech interface field.
SVTS: Scalable Video-to-Speech Synthesis
A key scaling contribution that demonstrates simple spectrogram prediction plus pretrained vocoder pipelines outperform prior complex models on diverse datasets, marking foundational progress in large-scale video-to-speech synthesis.
Listen only to me! How well can target speech extraction handle false alarms?
Strong paper for false-alarm handling in TSE, wrong domain if someone tries to count it as SSI progress.
Multi-modality Associative Bridging through Memory: Speech Sound Recollected from Face Video
The key idea is not generic fusion; it is storing cross-modal correspondences so video-only decoding can recover some audio-side structure later.
VCVTS: Multi-speaker Video-to-Speech synthesis via cross-modal knowledge transfer from voice conversion
The real move is importing structure from voice conversion, not just adding another speaker embedding.
Supervised and Self-supervised Pretraining Based COVID-19 Detection Using Acoustic Breathing/Cough/Speech Signals
Sound classification paper, not SSI.
VisualTTS: TTS with Accurate Lip-Speech Synchronization for Automatic Voice Over
VisualTTS effectively improves lip-speech synchronization in scripted voice over by conditioning TTS on lip video, but does not tackle silent speech decoding or unscripted scenarios.
Sequence-to-Sequence Voice Reconstruction for Silent Speech in a Tonal Language
SSRNet applies duration-aware Seq2Seq modeling and tonal multitask learning to reconstruct intelligible Mandarin speech from facial sEMG signals, markedly improving performance over prior methods, but it remains speaker-dependent with limited deployment evaluation.
SilentSpeller: Towards mobile, hands-free, silent speech text entry using electropalatography
SilentSpeller is a strong, rigorously tested SSI system that reframes silent speech as silent spelling, enabling large vocabulary, live text entry, and walking robustness with in-mouth electropalatography sensors.
SA-SDR: A novel loss function for separation of meeting style data
Elegant loss fix, not SSI.
Advances and Challenges in Deep Lip Reading
Good survey, not a model result.
Sub-word Level Lip Reading With Visual Attention
Major lip-reading gain, adjacent to SSI.
Speech Synthesis from Text and Ultrasound Tongue Image-based Articulatory Input
Helpful side information, not standalone SSI.
Sparsely Overlapped Speech Training in the Time Domain: Joint Learning of Target Speech Separation and Personal VAD Benefits
Useful separation engineering, not silent speech.
Silent Speech and Emotion Recognition from Vocal Tract Shape Dynamics in Real-Time MRI
Strong rtMRI recognition result, weak deployment story.
Neural Speaker Embeddings for Ultrasound-based Silent Speech Interfaces
The ultrasound-based x-vector speaker embedding is highly effective for speaker recognition, achieving under 1% error on unseen speakers, but its integration yields only a marginal improvement in multi-speaker ultrasound-to-speech synthesis accuracy.
An Improved Model for Voicing Silent Speech
This paper substantially improves open-vocabulary silent speech voicing using learned convolutional EMG features, Transformer modeling, and phoneme supervision, reducing WER from 68.0% to 42.2% automatic and 32.3% human in a single-speaker lab setting.
Voice Activity Detection for Ultrasound-based Silent Speech Interfaces using Convolutional Neural Networks
Preprocessing paper, narrow but legitimate.
Speaker disentanglement in video-to-speech conversion
The paper effectively makes speaker identity a controllable factor in multi-speaker video-to-speech synthesis by disentangling it from content, showing the trade-off between intelligibility and voice control on GRID corpus data.
Improving Neural Silent Speech Interface Models by Adversarial Training
A clean, well-executed incremental advance using GAN loss to modestly improve articulatory-to-acoustic mapping from ultrasound, validated objectively on two single-speaker corpora.
3D Convolutional Neural Networks for Ultrasound-Based Silent Speech Interfaces
Temporal context helps, but the evidence is a single-speaker vocoder-parameter study.
HTMD-Net: A Hybrid Masking-Denoising Approach to Time-Domain Monaural Singing Voice Separation
Solid time-domain music vocal separation paper with a novel hybrid masking-denoising design showing improved silent-segment suppression; not relevant to SSI applications.
Silent versus modal multi-speaker speech recognition from ultrasound and video
Large-corpus baseline with real silent-mode gap.
EMA2S: An End-to-End Multimodal Articulatory-to-Speech System
EMA2S achieves consistent quality improvements over prior EMA-to-speech baselines by combining multimodal joint loss training with a neural vocoder, though gains remain confined to lab EMA conditions.
Convolutional Neural Network-Based Age Estimation Using B-Mode Ultrasound Tongue Image
Real signal, wrong target for SSI.
End-to-end Silent Speech Recognition with Acoustic Sensing
Strong mobile-friendly acoustic SSI paper.
Speech Prediction in Silent Videos using Variational Autoencoders
Strong video-to-speech paper that models ambiguity explicitly.
X-TaSNet: Robust and Accurate Time-Domain Speaker Extraction Network
Strong time-domain target-speaker extraction using speaker verification and innovative training; improves robustness to absent target but remains speech extraction, not silent speech.
Listening to Sounds of Silence for Speech Denoising
Strong denoising work, not SSI.
Discriminative Sounding Objects Localization via Self-supervised Audiovisual Matching
Technically solid self-supervised class-aware audiovisual sounding object localization, but outside the core SSI domain.
Digital Voicing of Silent Speech
Core EMG SSI paper with real gains from target transfer.
End-to-End Speaker-Dependent Voice Activity Detection
Strong target-speaker VAD paper, not SSI.
A comparison of oscillatory characteristics in covert speech and speech perception
Strong covert-speech EEG analysis, not an SSI system.
Silent Speech Interfaces for Speech Restoration: A Review
Core SSI survey with concrete deployment constraints.
An Overview of Deep-Learning-Based Audio-Visual Speech Enhancement and Separation
Strong AV speech survey, not an SSI system paper.
CITISEN: A Deep Learning-Based Speech Signal-Processing Mobile Application
Strong mobile speech-processing app paper, not SSI.
Foley Music: Learning to Generate Music from Videos
Strong video-to-music paper, not SSI.
Learning Frame Level Attention for Environmental Sound Classification
Strong ESC paper, but outside SSI.
Ultra2Speech -- A Deep Learning Framework for Formant Frequency Estimation and Tracking from Ultrasound Tongue Images
Strong ultrasound SSI paper with unusually clear quantitative gains.
Application of Just-Noticeable Difference in Quality as Environment Suitability Test for Crowdsourcing Speech Quality Assessment Task
Strong crowdsourcing methodology paper, not SSI.
Vocoder-Based Speech Synthesis from Silent Videos
A notable step forward in lip-to-speech synthesis by predicting full vocoder features and jointly training for recognition, achieving strong speaker-dependent results but lacking unseen speaker generalization.
Continuous Silent Speech Recognition using EEG
Real EEG sentence-level silent speech recognition is demonstrated but at very high WER, confirming feasibility only and underscoring the immature state of current EEG silent speech technology.
Brain2Char: A Deep Architecture for Decoding Text from Brain Recordings
Brain2Char establishes a new state-of-the-art for continuous character decoding from invasive ECoG with competitive WER on large vocabularies and silent speech, demonstrating feasibility for communication BCIs.
Demucs: Deep Extractor for Music Sources with extra unlabeled data remixed
This work delivers an improved waveform source separation model combined with a novel remix-based semi-supervised learning scheme using unlabeled music. Though not related to silent speech, it advances music separation benchmarks by closing gaps to spectrogram methods.
Attention based Convolutional Recurrent Neural Network for Environmental Sound Classification
The proposed frame-level attention integrated within a convolutional recurrent network effectively improves environmental sound classification accuracy on ESC benchmarks by focusing on informative temporal frames while suppressing irrelevant or silent ones.
Lipper: Synthesizing Thy Speech using Multi-View Lipreading
Strong multi-view lip-to-speech baseline with honest quality limits.
Ultrasound-based Silent Speech Interface Built on a Continuous Vocoder
The key advancement is continuous F0 tracking via CNNs yielding lower pitch error and slight naturalness improvement over discontinuous F0 pipelines in ultrasound SSI.
Video-Driven Speech Reconstruction using Generative Adversarial Networks
Foundational direct video-to-audio result with clear generalization limits.
A Novel Task-Oriented Text Corpus in Silent Speech Recognition and its Natural Language Generation Construction Method
Useful EEG-SSR corpus framing paper, but evidence is lighter than a full benchmark paper.
Autoencoder-Based Articulatory-to-Acoustic Mapping for Ultrasound Silent Speech Interfaces
The paper advances ultrasound silent speech interfaces by compressing ultrasound images using an autoencoder bottleneck prior to spectral parameter prediction, resulting in improved accuracy and more natural synthesized speech with smaller models.
Denoising convolutional autoencoder based B-mode ultrasound tongue image feature extraction
DCAE provides cleaner, more robust ultrasound tongue features leading to improved silent speech recognition, outperforming prior feature extraction strategies.
All-neural online source separation, counting, and diarization for meeting analysis
Strong online diarization/separation paper, but outside SSI.
SottoVoce: An Ultrasound Imaging-Based Silent Speech Interaction Using Deep Neural Networks
A solid proof of concept that reconstructs speech audio from ultrasound for controlling unmodified smart speakers, showcasing important system design insight despite prototype limitations in latency, hardware bulk, and speaker dependency.
Audio Spectrogram Factorization for Classification of Telephony Signals below the Auditory Threshold
Strong telephony anti-SPAM paper, not SSI.
Proactive Security: Embedded AI Solution for Violent and Abusive Speech Recognition
An embedded smartphone NLP classifier detects violent speech with ~87.5% accuracy using known methods but is unrelated to silent speech interfaces; strong practical application in safety alerting.
Harnessing AI for Speech Reconstruction using Multi-view Silent Video Feed
Multi-view silent video combined with CNN-LSTM models significantly improves speech audio reconstruction quality over single-view, highlighting the importance of optimal camera placement to address pose variance.
Visual-Only Recognition of Normal, Whispered and Silent Speech
Strong evidence that silent lipreading needs dedicated training.
Cross-modal Embeddings for Video and Audio Retrieval
Useful multimodal retrieval baseline, not SSI.
Lip2AudSpec: Speech reconstruction from silent lip movements video
The paper's auditory spectrogram autoencoder bottleneck target is a key innovation that produces more intelligible, natural reconstructed speech from lip videos than prior methods, as confirmed by objective and human evaluations.
Updating the silent speech challenge benchmark with deep learning
Benchmark update with a real, reproducible WER gain.
Seeing Through Noise: Visually Driven Speaker Separation and Enhancement
Strong audiovisual speech separation and enhancement leveraging face video for speaker-dependent masking; not a silent speech interface paper.
Improved Speech Reconstruction from Silent Video
Strong, benchmark-setting speaker-dependent video-to-speech system that advances speech reconstruction from silent face video but remains limited to per-speaker training and constrained conditions.
Vid2speech: Speech Reconstruction from Silent Video
Real lip-to-speech progress, still tightly benchmark-bounded.
Contour-based 3d tongue motion visualization using ultrasound image sequences
Useful tongue-modeling tool, not a recognizer.
Optimal Power Control for Analog Bidirectional Relaying with Long-Term Relay Power Constraint
A rigorous relay power control theory paper optimizing outage under long-term average power constraints for bidirectional AF relaying; solid mathematical contribution but outside SSI relevance.
Approach comparison
Compare SilentSpeller, SottoVoce, and NasoVoce side by side without treating them as a required reading order.
Major silent speech approaches compared
SilentSpeller, SottoVoce, and NasoVoce compared on sensing, evaluation, practicality, and open questions.
Research agenda
The reviewed papers keep returning to wearability, vocabulary, latency, and generalization. The page below turns those recurring themes into a short, grounded agenda.
Open problems and research agenda
Wearability, open vocabulary, real-time use, and generalization keep reappearing in the current review set.
Technique taxonomy
These pages group the current database by real `modality:` tags from the expert records.
Video
42 reviewed pages · 0 imported pages
Acoustic
30 reviewed pages · 0 imported pages
Ultrasound
16 reviewed pages · 0 imported pages
Multimodal
15 reviewed pages · 0 imported pages
Microphone
7 reviewed pages · 0 imported pages
EMG
6 reviewed pages · 0 imported pages
EEG
5 reviewed pages · 0 imported pages
Magnetic
3 reviewed pages · 0 imported pages
Radar
2 reviewed pages · 0 imported pages
Vibration
2 reviewed pages · 0 imported pages
Camera
1 reviewed page · 0 imported pages
Electropalatography
1 reviewed page · 0 imported pages
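As a rough illustration of how per-modality counts like these could be derived, the sketch below tallies tags over a small list of records. The record layout, the field names `title` and `modality`, and the sample entries are all hypothetical; the database's actual schema is not shown on this page. It also assumes a record may carry more than one tag, in which case the per-tag counts can add up to more than the number of papers.

```python
from collections import Counter

# Hypothetical records: the field names "title" and "modality" and the
# sample entries are assumptions, not the database's confirmed schema.
records = [
    {"title": "Paper A", "modality": ["Video"]},
    {"title": "Paper B", "modality": ["EMG", "Video"]},
    {"title": "Paper C", "modality": ["Ultrasound"]},
]


def count_by_modality(records):
    """Count how many records carry each modality tag."""
    counts = Counter()
    for record in records:
        # A set avoids double-counting a tag repeated within one record.
        for tag in set(record.get("modality", [])):
            counts[tag] += 1
    return counts


counts = count_by_modality(records)
for tag, n in counts.most_common():
    label = "page" if n == 1 else "pages"
    print(f"{tag}: {n} reviewed {label}")
```

With multi-tag records, the sum of the per-tag counts (four in this toy example) exceeds the number of records (three), so a taxonomy total is not expected to match a paper total.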
Machine-readable exports
These files are generated from repository inputs during build.
SSI review export
Snapshot JSON of the current SSI review records built from repository inputs.
SSI review feed
Snapshot feed of the current SSI review records with source-updated timestamps.
Reference and citation
Use the canonical citation page when you need the database name, maintainer, or last-updated date.
How to cite this database
Canonical citation page for the SSI review database. Last updated 2026-06-09.
Datasets and code resources
Verified links are grouped on a dedicated page so the current corpus can point to code, datasets, and paper pages without inventing any new metadata.
Datasets and code resources
Verified links already present in repository data, with paper pages attached wherever the archive has a local review page.