Silent speech research: 106 reviewed pages, 0 imported corpus pages, 102 citation-linked pages

Silent speech papers by publication date.

Browse the SSI review database by year, citation count, title, author, or page type.

Paper pages are expert evaluations, not abstract reposts. Citation counts come from OpenAlex when available.

Browse papers

106 papers shown

arXiv reviewed unknown citations

Cross-Modal Masking for Robust Silent Speech Synthesis Using sEMG and Lipreading

Eder del Blanco, David Gimeno-Gómez, Eva Navas, Carlos-D. Martínez-Hinarejos, Inma Hernáez

The paper advances silent speech synthesis by leveraging masked training to robustly fuse electromyography and lipreading, showing improved performance and resilience, but adaptation to laryngectomized users remains challenging.

arXiv reviewed unknown citations

A 1000-hour EEG-EMG-audio dataset of Japanese speech production

Motoshige Sato, Ilya Horiguchi, Masakazu Inoue, Kenichi Tomeoka, Eri Hatakeyama, Yuya Kita, Atsushi Yamamoto, Ippei Fujisawa, Shuntaro Sasai

A 1020-hour multimodal EEG-EMG-audio dataset for Japanese overt speech vastly expands data resources, enabling diverse speech decoding and EEG research, though generalization is limited by three participants and no decoding benchmarks are presented.

arXiv reviewed unknown citations

Zero-Shot Imagined Speech Decoding via Imagined-to-Listened MEG Mapping

Maryam Maghsoudi, Shihab Shamma

The study convincingly shows zero-shot imagined speech decoding by mapping MEG imagery to listened responses and decoding with a listened-trained contrastive model, marking a promising data-efficient advance despite limited vocabulary and hardware constraints.

CHI 2026 reviewed 1 citation

NasoVoce: A Nose-Mounted Low-Audibility Speech Interface for Always-Available Speech Interaction

Jun Rekimoto, Yu Nishimura, Bojian Yang

A strong deployment-focused speech interface leveraging a novel nose-pad dual-sensor configuration and multimodal fusion to enable robust low-audibility speech interaction with AI under noise, backed by extensive evaluation.

arXiv reviewed unknown citations

A cross-species neural foundation model for end-to-end speech decoding

Yizi Zhang, Linyang He, Chaofei Fan, Tingkai Liu, Han Yu, Trung Le, Jingyuan Li, Scott Linderman, Lea Duncker, Francis R Willett, Nima Mesgarani, Liam Paninski

Introduces a cross-species pretrained transformer encoder enabling state-of-the-art end-to-end neural speech decoding with audio-LLMs, improving accuracy and enabling imagined speech decoding, but latency and real-time deployment remain challenges.

arXiv reviewed 0 citations

SonicVisionLM: Playing Sound with Vision Language Models

Zhifeng Xie, Shengye Yu, Qile He, Mengtian Li

A high-quality video-to-audio generation framework leveraging vision-language models for editable, temporally precise sound effect generation; strong experimental validations but outside standard SSI scope.

arXiv reviewed 7 citations

IR-UWB Radar-Based Contactless Silent Speech Recognition of Vowels, Consonants, Words, and Phrases

Sunghwa Lee, Younghoon Shin, Myungjong Kim, Jiwon Seo

This paper introduces FERASEC, a novel radar feature extraction enabling the first contactless IR-UWB radar phoneme-level silent speech recognition with 86% vowel and 81% consonant accuracy, surpassing raw signal baselines and signifying a key advance in practical silent speech interfaces.

arXiv reviewed 67 citations

Ultrasensitive Textile Strain Sensors Redefine Wearable Silent Speech Interfaces with High Machine Learning Efficiency

Chenyu Tang, Muzi Xu, Wentian Yi, Zibo Zhang, Edoardo Occhipinti, Chaoqun Dong, Dafydd Ravenscroft, Sung‐Min Jung, Sanghyo Lee, Shuo Gao, Jong Min Kim, Luigi G. Occhipinti

Strong SSI system combining a novel ultrasensitive throat textile strain sensor with an efficient 1D residual CNN, achieving high word classification accuracy with low computational cost and promising few-shot transfer to new users and words on small vocabularies.

arXiv reviewed 1 citation

Distributed pressure matching strategy using diffusion adaptation

Mengfei Zhang, Junqing Zhang, Jie Chen, Cédric Richard

Distributed rootless pressure matching for personal sound zones is presented and validated in simulation; not an SSI paper.

arXiv reviewed 5739 citations

Advancing Test-Time Adaptation for Acoustic Foundation Models in Open-World Shifts

Andy Clark

Strong acoustic ASR paper proposing confidence-weighted frame adaptation plus temporal consistency regularization for stable test-time adaptation under wild acoustic conditions, yielding substantial WER improvements across noise, accents, and singing datasets.

arXiv reviewed 19 citations

Sound Source Localization is All about Cross-Modal Alignment

Arda Senocak, Hyeonggon Ryu, Junsik Kim, Tae-Hyun Oh, Hanspeter Pfister, Joon Son Chung

Provides a novel multi-positive contrastive framework enhancing semantic audio-visual alignment for sound source localization. Strong experimental evidence supports claims. Method is outside the SSI domain.

arXiv reviewed 4 citations

Let There Be Sound: Reconstructing High Quality Speech from Silent Videos

Ji-Hoon Kim, Jaehun Kim, Joon Son Chung

Strong lip-to-speech system that reduces ambiguity via SSL linguistic conditioning, variance predictors, and flow-based refinement, achieving near-vocoded naturalness and improved intelligibility on standard datasets.

arXiv reviewed 0 citations

An Initial Exploration: Learning to Generate Realistic Audio for Silent Video

Matthew Martel, Jackson Wagner

Honest exploratory comparison showing that a transformer-based model outperforms a deep-fusion CNN and WaveNet for generating low-to-mid frequency audio from silent video in a small curated dataset; not a speech or SSI paper.

arXiv reviewed 12 citations

Audio Knowledge Empowered Visual Speech Recognition

Jeong Hun Yeo, Minsu Kim, Jeongsoo Choi, Dae Hoe Kim, Yong Man Ro

The paper advances visual speech recognition by selectively transferring refined linguistic audio knowledge via a learned compact memory and cross-attention injection, improving benchmark WERs over prior audio-assisted methods without requiring audio inputs during inference.

arXiv reviewed 3 citations

Knowledge Distilled Ensemble Model for sEMG-based Silent Speech Interface

Wenqiang Lai, Qihan Yang, Mao Ye, Endong Sun, Jiangnan Ye

This paper delivers a practical spelling-focused sEMG silent speech system by compressing a ResNet ensemble into a lightweight model achieving 85.9% accuracy on the NATO alphabet with portable hardware, but remains limited to 5 young male subjects and speaker-dependent scenarios.

arXiv reviewed 6 citations

Automatically measuring speech fluency in people with aphasia: first achievements using read-speech data

Lionel Fontan, Typhanie Prince, Aleksandra Nowakowska, Halima Sahraoui, Silvia Martínez‐Ferreiro

Strong clinical fluency regression method validated on noisy read speech from aphasia patients; outside core SSI modalities and use-cases.

arXiv reviewed 7 citations

Exploring how a Generative AI interprets music

Gabriela Barenboim, Luigi Del Debbio, Johannes Hirn, Verónica Sanz

A thorough interpretability analysis reveals that MusicVAE uses only a few dozen latent dimensions to encode music with pitch and rhythm strongly represented in the first two, but the work has no direct relevance to silent speech interfaces.

arXiv reviewed 2 citations

Audio-visual video-to-speech synthesis with synthesized input audio

Triantafyllos Kefalas, Yannis Panagakis, Maja Pantić

The paper credibly shows that incorporating synthesized audio as an auxiliary input in a second-stage audiovisual synthesis model improves video-to-speech reconstruction quality and intelligibility in benchmarks, though gains depend on model variant and dataset.

arXiv reviewed 5 citations

Audio-aware Query-enhanced Transformer for Audio-Visual Segmentation

Jinxiang Liu, Chen Ju, Chaofan Ma, Yanfeng Wang, Yu Wang, Ya Zhang

Strong AVS result, outside SSI: the useful idea is audio-conditioned decoder queries plus dynamic mask prediction.

arXiv reviewed 3 citations

RobustL2S: Speaker-Specific Lip-to-Speech Synthesis exploiting Self-Supervised Representations

Neha Sahipjohn, Neil Shah, Vishal Tambrahalli, Vineet Gandhi

Strong modular SSL-based lip-to-speech synthesis paper that innovatively maps lip SSL features to disentangled speech embeddings before vocoder synthesis, demonstrating improved intelligibility and robustness across benchmark datasets.

arXiv reviewed 7 citations

Diff-Foley: Synchronized Video-to-Audio Synthesis with Latent Diffusion Models

Simian Luo, Chuanhao Yan, Chenxu Hu, Hang Zhao

The real gain is not 'diffusion' alone but aligned conditioning plus guidance that pushes synchronization very hard.

arXiv reviewed 7 citations

High-Quality Automatic Voice Over with Accurate Alignment: Supervision through Self-Supervised Discrete Speech Units

Junchen Lu, Berrak Şişman, Mingyang Zhang, Haizhou Li

This video-conditioned AVO system innovatively supervises alignment by predicting discrete speech units rather than reconstructing acoustic features, leading to better lip-sync and speech quality on a single-speaker dataset; however, it is not an SSI interface paper.

arXiv reviewed 3 citations

Large-scale unsupervised audio pre-training for video-to-speech synthesis

Triantafyllos Kefalas, Yannis Panagakis, Maja Pantić

Good decoder-transfer pretraining improves video-to-speech quality on several benchmarks, but WER gains are not consistent. A useful methodological contribution with strong benchmark support, adjacent to SSI rather than a deployable system.

arXiv reviewed 1 citation

LipVoicer: Generating Speech from Silent Videos Guided by Lip Reading

Yochai Yemini, Aviv Shamsian, Lior Bracha, Sharon Gannot, Ethan Fetaya

Strong full-text paper demonstrating that inference-time text guidance via ASR classifier is key to significantly improved intelligibility in lip-to-speech synthesis on challenging in-the-wild video datasets, outperforming prior baselines.

arXiv reviewed 18 citations

Intelligible Lip-to-Speech Synthesis with Speech Units

Jeongsoo Choi, Minsu Kim, Yong Man Ro

Speech units as a pseudo-text target enable strong content supervision that substantially cuts WER without text labels, and the multi-input vocoder improves speech quality from blurry mel outputs, yielding a state-of-the-art lip-to-speech system on LRS benchmarks.

INTERSPEECH 2023 reviewed 5 citations

Adaptation of Tongue Ultrasound-Based Silent Speech Interfaces Using Spatial Transformer Networks

László Tóth, Amin Honarmandi Shandiz, Gábor Gosztolya, Tamás Gábor Csapó

Strong full-text-backed evidence that most of the gain comes from fast input alignment, not from inventing a new SSI stack.

arXiv reviewed 5 citations

Zero-shot personalized lip-to-speech synthesis with face image based voice control

Zheng-Yan Sheng, Yang Ai, Zhen-Hua Ling

Demonstrates effective zero-shot voice control in Lip2Speech by leveraging face image-based speaker embeddings, validated on GRID corpus but constrained by dataset vocabulary and speech naturalness.

arXiv reviewed 0 citations

Improving the Gap in Visual Speech Recognition Between Normal and Silent Speech Based on Metric Learning

Sara Kashiwagi, Keitaro Tanaka, Feng Qi, Shigeo Morishima

Strong viseme-level metric learning approach reduces silent speech VSR errors on a small 10-phrase dataset, notably achieving parity with baselines using much less silent data.

arXiv reviewed 33 citations

Conditional Generation of Audio from Video via Foley Analogies

Yuexi Du, Ziyang Chen, Justin Salamon, Bryan Russell, Andrew Owens

The paper matters because it gives V2A generation a controllable exemplar, not because it beats every timing baseline.

arXiv reviewed 5 citations

Speech Reconstruction from Silent Tongue and Lip Articulation By Pseudo Target Generation and Domain Adversarial Training

Rui-Chen Zheng, Yang Ai, Zhen-Hua Ling

Strong SSI paper improving silent speech reconstruction by generating pseudo acoustic targets and using domain adversarial training to address domain mismatch; validated with TaL dataset showing substantial WER and MOS gains over TaLNet.

CHI 2023 reviewed 22 citations

WESPER: Zero-shot and Realtime Whisper to Normal Voice Conversion for Whisper-based Speech Interactions

Jun Rekimoto

Strong whisper-conversion paper, but it remains whisper-based rather than truly silent SSI.

arXiv reviewed 5 citations

Duration-aware pause insertion using pre-trained language model for multi-speaker text-to-speech

Dong Yang, Tomoki Koriyama, Yuki Saito, Takaaki Saeki, Detai Xin, Hiroshi Saruwatari

The paper presents a strong multi-speaker TTS phrasing approach leveraging speaker-conditioned BERT embeddings and pause duration categories to improve pause insertion precision and synthetic speech rhythm; however, it is out-of-scope for SSI as it focuses on audible speech synthesis only.

arXiv reviewed 35 citations

LipLearner: Customizable Silent Speech Interactions on Mobile Devices

Zixiong Su, Shitao Fang, Jun Rekimoto

LipLearner is a strong mobile silent speech system that uniquely closes the loop from few-shot lipreading model design to practical on-device customization and keyword spotting, demonstrated robustly in real-world conditions and a user study.

arXiv reviewed 1 citation

Towards Neural Decoding of Imagined Speech based on Spoken Speech

Seo‐Hyun Lee, Young-Eun Lee, Soo-Won Kim, Byung-Kwan Ko, Seong‐Whan Lee

Transfer of CSP+SVM models trained on spoken speech EEG to imagined speech achieves comparable, though slightly lower, accuracy within a limited 5-class, 7-subject offline EEG setup, with visual imagery control supporting specificity.

arXiv reviewed 1 citation

Breaking the trade-off in personalized speech enhancement with cross-task knowledge distillation

Hassan Taherian, Şefik Emre Eskimez, Takuya Yoshioka

Strong causal PSE paper, not SSI. The pVAD-guided loss is the part that holds up under full-text reading.

arXiv reviewed 2 citations

Movement Detection of Tongue and Related Body Parts Using IR-UWB Radar

Sunghwa Lee, Younghoon Shin

Good sensing primitive, very small task.

arXiv reviewed 14 citations

Lip-to-Speech Synthesis for Arbitrary Speakers in the Wild

Sindhu B Hegde, K R Prajwal, Rudrabha Mukhopadhyay, Vinay P. Namboodiri, C. V. Jawahar

The real contribution is not just another VAE-GAN; it is turning lip-to-speech into an arbitrary-speaker problem with credible low-data adaptation.

arXiv reviewed 2 citations

An Anchor-Free Detector for Continuous Speech Keyword Spotting

Zhiyuan Zhao, Chuanxin Tang, Chengdong Yao, Chong Luo

Strong CSKWS paper, not SSI. The detection framing and unknown class are the points that hold up in full text.

arXiv reviewed 9 citations

FastLTS: Non-Autoregressive End-to-End Unconstrained Lip-to-Speech Synthesis

Yongqi Wang, Zhou Zhao

This paper matters because it makes unconstrained lip-to-speech materially faster without obviously sacrificing quality.

arXiv reviewed 7 citations

Improved Processing of Ultrasound Tongue Videos by Combining ConvLSTM and 3D Convolutional Networks

Amin Honarmandi Shandiz, László Tóth

An empirically supported, incremental advancement showing that hybrid 3D-CNN plus ConvLSTM models modestly outperform prior ultrasound tongue video SSI architectures in mel-spectrogram regression accuracy and model efficiency on single-speaker data.

arXiv reviewed 6 citations

VisageSynTalk: Unseen Speaker Video-to-Speech Synthesis via Speech-Visage Feature Selection

Joanna Hong, Minsu Kim, Yong Man Ro

The paper is really about disentangling identity, and that is why the unseen-speaker results hold up.

arXiv reviewed 6 citations

Silence is Sweeter Than Speech: Self-Supervised Model Using Silence to Store Speaker Information

Chi-Luen Feng, Po‐Chun Hsu, Hung-yi Lee

Strong evidence that silence segments in HuBERT representations uniquely store speaker information, improving SID accuracy when silence is augmented; analytical SSL probing paper outside silent speech interface field.

arXiv reviewed 17 citations

SVTS: Scalable Video-to-Speech Synthesis

Rodrigo Mira, Alexandros Haliassos, Stavros Petridis, Björn W. Schuller, Maja Pantić

A key scaling contribution that demonstrates simple spectrogram prediction plus pretrained vocoder pipelines outperform prior complex models on diverse datasets, marking foundational progress in large-scale video-to-speech synthesis.

arXiv reviewed 22 citations

Listen only to me! How well can target speech extraction handle false alarms?

Marc Delcroix, Keisuke Kinoshita, Tsubasa Ochiai, Kateřina Žmolíková, Hiroshi Satō, Tomohiro Nakatani

Strong paper for false-alarm handling in TSE, wrong domain if someone tries to count it as SSI progress.

arXiv reviewed 46 citations

Multi-modality Associative Bridging through Memory: Speech Sound Recollected from Face Video

Minsu Kim, Joanna Hong, Se Jin Park, Yong Man Ro

The key idea is not generic fusion; it is storing cross-modal correspondences so video-only decoding can recover some audio-side structure later.

arXiv reviewed 11 citations

VCVTS: Multi-speaker Video-to-Speech synthesis via cross-modal knowledge transfer from voice conversion

Disong Wang, Shan Yang, Dan Su, Xunying Liu, Dong Yu, Helen Meng

The real move is importing structure from voice conversion, not just adding another speaker embedding.

ICASSP 2022 reviewed 18 citations

Supervised and Self-supervised Pretraining Based COVID-19 Detection Using Acoustic Breathing/Cough/Speech Signals

Xingyu Chen, Qiushi Zhu, Jie Zhang, Li-Rong Dai

Sound classification paper, not SSI.

arXiv reviewed 16 citations

VisualTTS: TTS with Accurate Lip-Speech Synchronization for Automatic Voice Over

Junchen Lu, Berrak Şişman, Rui Liu, Mingyang Zhang, Haizhou Li

VisualTTS effectively improves lip-speech synchronization in scripted voice over by conditioning TTS on lip video, but does not tackle silent speech decoding or unscripted scenarios.

arXiv reviewed 15 citations

Sequence-to-Sequence Voice Reconstruction for Silent Speech in a Tonal Language

Huiyan Li, Haohong Lin, You Wang, Hengyang Wang, Ming Zhang, Han Gao, Qing Ai, Zhiyuan Luo, Guang Li

SSRNet innovatively applies duration-aware Seq2Seq modeling and tonal multitask learning to reconstruct intelligible Mandarin speech from facial sEMG signals, markedly improving performance over prior methods, but it remains speaker-dependent with limited deployment evaluation.

CHI 2022 reviewed 43 citations

SilentSpeller: Towards mobile, hands-free, silent speech text entry using electropalatography

Naoki Kimura, Tan Gemicioglu, Jonathan Womack, Richard Li, Yuhui Zhao, Abdelkareem Bedri, Zixiong Su, Alex Olwal, Jun Rekimoto, Thad Starner

SilentSpeller is a strong, rigorously tested SSI system that reframes silent speech as silent spelling, enabling large vocabulary, live text entry, and walking robustness with in-mouth electropalatography sensors.

arXiv reviewed 20 citations

SA-SDR: A novel loss function for separation of meeting style data

Thilo von Neumann, Keisuke Kinoshita, Christoph Boeddeker, Marc Delcroix, Reinhold Haeb‐Umbach

Elegant loss fix, not SSI.

arXiv reviewed 2 citations

Advances and Challenges in Deep Lip Reading

Marzieh Oghbaie, Arian Sabaghi, Kooshan Hashemifard, Mohammad Kazem Akbari

Good survey, not a model result.

arXiv reviewed 102 citations

Sub-word Level Lip Reading With Visual Attention

K R Prajwal, Triantafyllos Afouras, Andrew Zisserman

Major lip-reading gain, adjacent to SSI.

arXiv reviewed 2 citations

Speech Synthesis from Text and Ultrasound Tongue Image-based Articulatory Input

Tamás Gábor Csapó, László Tóth, Gábor Gosztolya, Alexandra Markó

Helpful side information, not standalone SSI.

arXiv reviewed 7 citations

Sparsely Overlapped Speech Training in the Time Domain: Joint Learning of Target Speech Separation and Personal VAD Benefits

Qingjian Lin, Lin Yang, Xuyang Wang, Luyuan Xie, Jia Chen, Junjie Wang

Useful separation engineering, not silent speech.

arXiv reviewed 0 citations

Silent Speech and Emotion Recognition from Vocal Tract Shape Dynamics in Real-Time MRI

Laxmi Pandey, Ahmed Sabbir Arif

Strong rtMRI recognition result, weak deployment story.

arXiv reviewed 5 citations

Neural Speaker Embeddings for Ultrasound-based Silent Speech Interfaces

Amin Honarmandi Shandiz, László Tóth, Gábor Gosztolya, Alexandra Markó, Tamás Gábor Csapó

The ultrasound-based x-vector speaker embedding is highly effective for speaker recognition, achieving under 1% error on unseen speakers, but its integration yields only a marginal improvement in multi-speaker ultrasound-to-speech synthesis accuracy.

arXiv reviewed 22 citations

An Improved Model for Voicing Silent Speech

David Gaddy, Dan Klein

This paper substantially improves open-vocabulary silent speech voicing using learned convolutional EMG features, Transformer modeling, and phoneme supervision, reducing WER from 68.0% to 42.2% automatic and 32.3% human in a single-speaker lab setting.

arXiv reviewed 2 citations

Voice Activity Detection for Ultrasound-based Silent Speech Interfaces using Convolutional Neural Networks

Amin Honarmandi Shandiz, László Tóth

Preprocessing paper, narrow but legitimate.

arXiv reviewed 8 citations

Speaker disentanglement in video-to-speech conversion

Dan Oneaţă, Adriana Stan, Horia Cucu

The paper effectively makes speaker identity a controllable factor in multi-speaker video-to-speech synthesis by disentangling it from content, showing the trade-off between intelligibility and voice control on GRID corpus data.

arXiv reviewed 12 citations

Improving Neural Silent Speech Interface Models by Adversarial Training

Amin Honarmandi Shandiz, László Tóth, Gábor Gosztolya, Alexandra Markó, Tamás Gábor Csapó

A clean, well-executed incremental advance using GAN loss to modestly improve articulatory-to-acoustic mapping from ultrasound, validated objectively on two single-speaker corpora.

arXiv reviewed 12 citations

3D Convolutional Neural Networks for Ultrasound-Based Silent Speech Interfaces

László Tóth, Amin Honarmandi Shandiz

Temporal context helps, but the evidence is a single-speaker vocoder-parameter study.

arXiv reviewed 1 citation

HTMD-Net: A Hybrid Masking-Denoising Approach to Time-Domain Monaural Singing Voice Separation

Christos Garoufis, Athanasia Zlatintsi, Petros Maragos

Solid time-domain music vocal separation paper with a novel hybrid masking-denoising design showing improved silent-segment suppression; not relevant to SSI applications.

arXiv reviewed 2 citations

Silent versus modal multi-speaker speech recognition from ultrasound and video

Manuel Sam Ribeiro, Aciel Eshky, Korin Richmond, Steve Renals

Large-corpus baseline with real silent-mode gap.

arXiv reviewed 7 citations

EMA2S: An End-to-End Multimodal Articulatory-to-Speech System

Yu‐Wen Chen, Kuo-Hsuan Hung, Shang-Yi Chuang, Jonathan H. Sherman, Wen-Chin Huang, Xugang Lu, Yu Tsao

EMA2S achieves consistent quality improvements over prior EMA-to-speech baselines by combining multimodal joint loss training with a neural vocoder, though gains remain confined to lab EMA conditions.

arXiv reviewed 2 citations

Convolutional Neural Network-Based Age Estimation Using B-Mode Ultrasound Tongue Image

Kele Xu, Tamás Gábor Csapó, Ming Feng

Real signal, wrong target for SSI.

arXiv reviewed 12 citations

End-to-end Silent Speech Recognition with Acoustic Sensing

Jian Luo, Jianzong Wang, Ning Cheng, Guilin Jiang, Jing Xiao

Strong mobile-friendly acoustic SSI paper.

arXiv reviewed 1 citation

Speech Prediction in Silent Videos using Variational Autoencoders

Ravindra Yadav, Ashish Sardana, Vinay P. Namboodiri, Rajesh M. Hegde

Strong video-to-speech paper that models ambiguity explicitly.

arXiv reviewed 31 citations

X-TaSNet: Robust and Accurate Time-Domain Speaker Extraction Network

Zining Zhang, Bingsheng He, Zhenjie Zhang

Strong time-domain target-speaker extraction using speaker verification and innovative training; improves robustness to absent target but remains speech extraction, not silent speech.

arXiv reviewed 23 citations

Listening to Sounds of Silence for Speech Denoising

Ruilin Xu, Rundi Wu, Yuko Ishiwaka, Carl Vondrick, Changxi Zheng

Strong denoising work, not SSI.

arXiv reviewed 51 citations

Discriminative Sounding Objects Localization via Self-supervised Audiovisual Matching

Di Hu, Rui Qian, Minyue Jiang, Xiao Tan, Shilei Wen, Errui Ding, Weiyao Lin, Dejing Dou

Technically solid self-supervised class-aware audiovisual sounding object localization, but outside the core SSI domain.

arXiv reviewed 55 citations

Digital Voicing of Silent Speech

David Gaddy, Dan Klein

Core EMG SSI paper with real gains from target transfer.

arXiv reviewed 2 citations

End-to-End Speaker-Dependent Voice Activity Detection

Yefei Chen, Shuai Wang, Yanmin Qian, Kai Yu

Strong target-speaker VAD paper, not SSI.

arXiv reviewed 16 citations

A comparison of oscillatory characteristics in covert speech and speech perception

Jae Moon, Silvia Orlandi, Tom Chau

Strong covert-speech EEG analysis, not an SSI system.

arXiv reviewed 132 citations

Silent Speech Interfaces for Speech Restoration: A Review

José A. González, Alejandro Gomez-Alanis, Juan M. Martín-Doñas, José L. Pérez-Córdoba, Ángel M. Gómez

Core SSI survey with concrete deployment constraints.

arXiv reviewed 27 citations

An Overview of Deep-Learning-Based Audio-Visual Speech Enhancement and Separation

Daniel Michelsanti, Zheng‐Hua Tan, Shi-Xiong Zhang, Yong Xu, Meng Yu, Dong Yu, Jesper Jensen

Strong AV speech survey, not an SSI system paper.

arXiv reviewed 7 citations

CITISEN: A Deep Learning-Based Speech Signal-Processing Mobile Application

Yu-Wen Chen, Kuo-Hsuan Hung, You-Jin Li, Alexander Kang, Ya‐Hsin Lai, Kai-Chun Liu, Szu‐Wei Fu, Syu‐Siang Wang, Yu Tsao

Strong mobile speech-processing app paper, not SSI.

arXiv reviewed 114 citations

Foley Music: Learning to Generate Music from Videos

Chuang Gan, Deng Huang, Peihao Chen, Joshua B. Tenenbaum, Antonio Torralba

Strong video-to-music paper, not SSI.

arXiv reviewed 3 citations

Learning Frame Level Attention for Environmental Sound Classification

Zhichao Zhang, Shugong Xu, Shunqing Zhang, Tianhao Qiao, Shan Cao

Strong ESC paper, but outside SSI.

arXiv reviewed 3 citations

Ultra2Speech -- A Deep Learning Framework for Formant Frequency Estimation and Tracking from Ultrasound Tongue Images

Pramit Saha, Yadong Liu, Bryan Gick, Sidney Fels

Strong ultrasound SSI paper with unusually clear quantitative gains.

arXiv reviewed 16 citations

Application of Just-Noticeable Difference in Quality as Environment Suitability Test for Crowdsourcing Speech Quality Assessment Task

Babak Naderi, Sebastian Möller

Strong crowdsourcing methodology paper, not SSI.

arXiv reviewed 1 citation

Vocoder-Based Speech Synthesis from Silent Videos

Daniel Michelsanti, Olga Slizovskaia, Gloria Haro, Emília Gómez, Zheng‐Hua Tan, Jesper Jensen

A notable step forward in lip-to-speech synthesis by predicting full vocoder features and jointly training for recognition, achieving strong speaker-dependent results but lacking unseen speaker generalization.

arXiv reviewed 1 citation

Continuous Silent Speech Recognition using EEG

Gautam Krishna, Co Tran, Mason Carnahan, Ahmed H. Tewfik

Real EEG sentence-level silent speech recognition is demonstrated but at very high WER, confirming feasibility only and underscoring the immature state of current EEG silent speech technology.

arXiv reviewed 81 citations

Brain2Char: A Deep Architecture for Decoding Text from Brain Recordings

Pengfei Sun, Gopala K. Anumanchipalli, Edward F. Chang

Brain2Char establishes a new state-of-the-art for continuous character decoding from invasive ECoG with competitive WER on large vocabularies and silent speech, demonstrating feasibility for communication BCIs.

arXiv reviewed 57 citations

Demucs: Deep Extractor for Music Sources with extra unlabeled data remixed

Alexandre Défossez, Nicolas Usunier, Léon Bottou, Francis R. Bach

This work delivers an improved waveform source separation model combined with a novel remix-based semi-supervised learning scheme using unlabeled music. Though not related to silent speech, it advances music separation benchmarks by closing gaps to spectrogram methods.

arXiv reviewed 125 citations

Attention based Convolutional Recurrent Neural Network for Environmental Sound Classification

Zhichao Zhang, Shugong Xu, Shunqing Zhang, Tianhao Qiao, Shan Cao

The proposed frame-level attention integrated within a convolutional recurrent network effectively improves environmental sound classification accuracy on ESC benchmarks by focusing on informative temporal frames while suppressing irrelevant or silent ones.

arXiv reviewed 7 citations

Lipper: Synthesizing Thy Speech using Multi-View Lipreading

Yaman Kumar, Rohit Jain, Khwaja Mohd. Salik, Rajiv Ratn Shah, Yifang Yin, Roger Zimmermann

Strong multi-view lip-to-speech baseline with honest quality limits.

arXiv reviewed 5 citations

Ultrasound-based Silent Speech Interface Built on a Continuous Vocoder

Tamás Gábor Csapó, Mohammed Salah Al-Radhi, Géza Németh, Gábor Gosztolya, Tamás Grósz, László Tóth, Alexandra Markó

The key advancement is continuous F0 tracking via CNNs yielding lower pitch error and slight naturalness improvement over discontinuous F0 pipelines in ultrasound SSI.

arXiv reviewed 1 citation

Video-Driven Speech Reconstruction using Generative Adversarial Networks

Konstantinos Vougioukas, Pingchuan Ma, Stavros Petridis, Maja Pantić

Foundational direct video-to-audio result with clear generalization limits.

arXiv reviewed 1 citation

A Novel Task-Oriented Text Corpus in Silent Speech Recognition and its Natural Language Generation Construction Method

Dong Cao, Dongdong Zhang, Haibo Chen

Useful EEG-SSR corpus framing paper, but evidence is lighter than a full benchmark paper.

arXiv reviewed 3 citations

Autoencoder-Based Articulatory-to-Acoustic Mapping for Ultrasound Silent Speech Interfaces

Gábor Gosztolya, Ádám Pintér, László Tóth, Tamás Grósz, Alexandra Markó, Tamás Gábor Csapó

The paper advances ultrasound silent speech interfaces by compressing ultrasound images using an autoencoder bottleneck prior to spectral parameter prediction, resulting in improved accuracy and more natural synthesized speech with smaller models.

arXiv reviewed 27 citations

Denoising convolutional autoencoder based B-mode ultrasound tongue image feature extraction

Bo Li, Kele Xu, Dawei Feng, Haibo Mi, Huaimin Wang, Jian Zhu

DCAE provides cleaner, more robust ultrasound tongue features leading to improved silent speech recognition, outperforming prior feature extraction strategies.

arXiv reviewed 13 citations

All-neural online source separation, counting, and diarization for meeting analysis

Thilo von Neumann, Keisuke Kinoshita, Marc Delcroix, Shoko Araki, Tomohiro Nakatani, Reinhold Haeb‐Umbach

Strong online diarization/separation paper, but outside SSI.

CHI 2019 reviewed 118 citations

SottoVoce: An Ultrasound Imaging-Based Silent Speech Interaction Using Deep Neural Networks

Naoki Kimura, Michinari Kono, Jun Rekimoto

A solid proof of concept that reconstructs speech audio from ultrasound for controlling unmodified smart speakers, showcasing important system design insight despite prototype limitations in latency, hardware bulk, and speaker dependency.

arXiv reviewed 0 citations

Audio Spectrogram Factorization for Classification of Telephony Signals below the Auditory Threshold

Iroro Orife, Shane Walker, Jason Flaks

Strong telephony anti-SPAM paper, not SSI.

arXiv reviewed 1 citation

Proactive Security: Embedded AI Solution for Violent and Abusive Speech Recognition

Christopher Shulby, Leonardo Pombal, Vitor Jordão, Guilherme Ziolle, Bruno Martho, Antônio Postal, Thiago Prochnow

An embedded smartphone NLP classifier detects violent speech with ~87.5% accuracy using known methods but is unrelated to silent speech interfaces; strong practical application in safety alerting.

arXiv reviewed 16 citations

Harnessing AI for Speech Reconstruction using Multi-view Silent Video Feed

Yaman Kumar, Mayank Aggarwal, Pratham Nawal, Shin'ichi Satoh, Rajiv Ratn Shah, Roger Zimmermann

Multi-view silent video combined with CNN-LSTM models significantly improves speech audio reconstruction quality over single-view, highlighting the importance of optimal camera placement to address pose variance.

arXiv reviewed 1 citations

Visual-Only Recognition of Normal, Whispered and Silent Speech

Stavros Petridis, Jie Shen, Doruk Cetin, Maja Pantić

Strong evidence that silent lipreading needs dedicated training.

arXiv reviewed 43 citations

Cross-modal Embeddings for Video and Audio Retrieval

Dídac Surís, Amanda Duarte, Amaia Salvador, Jordi Torres, Xavier Giró-i-Nieto

Useful multimodal retrieval baseline, not SSI.

arXiv reviewed 9 citations

Lip2AudSpec: Speech reconstruction from silent lip movements video

Hassan Akbari, Himani Arora, Liangliang Cao, Nima Mesgarani

The paper's auditory spectrogram autoencoder bottleneck target is a key innovation that produces more intelligible, natural reconstructed speech from lip videos than prior methods, as confirmed by objective and human evaluations.

arXiv reviewed 59 citations

Updating the silent speech challenge benchmark with deep learning

Yan Ji, Licheng Liu, Hongcui Wang, Zhilei Liu, Zhibin Niu, B. Denby

Benchmark update with a real, reproducible WER gain.

arXiv reviewed 6 citations

Seeing Through Noise: Visually Driven Speaker Separation and Enhancement

Aviv Gabbay, Ariel Ephrat, Tavi Halperin, Shmuel Peleg

Strong audiovisual speech separation and enhancement leveraging face video for speaker-dependent masking; not a silent speech interface paper.

arXiv reviewed 8 citations

Improved Speech Reconstruction from Silent Video

Ariel Ephrat, Tavi Halperin, Shmuel Peleg

Strong, benchmark-setting speaker-dependent video-to-speech system that advances speech reconstruction from silent face video but remains limited to per-speaker training and constrained conditions.

arXiv reviewed 112 citations

Vid2speech: Speech Reconstruction from Silent Video

Ariel Ephrat, Shmuel Peleg

Real lip-to-speech progress, still tightly benchmark-bounded.

arXiv reviewed 6 citations

Contour-based 3d tongue motion visualization using ultrasound image sequences

Kele Xu, Yin Yang, Clémence Leboullenger, Pierre Roussel, B. Denby

Useful tongue-modeling tool, not a recognizer.

arXiv reviewed 1 citations

Optimal Power Control for Analog Bidirectional Relaying with Long-Term Relay Power Constraint

Zoran Hadzi-Velkov, Nikola Zlatanov, Robert Schober

A rigorous relay power control theory paper optimizing outage under long-term average power constraints for bidirectional AF relaying; solid mathematical contribution but outside SSI relevance.

Approach comparison

Compare SilentSpeller, SottoVoce, and NasoVoce side by side without treating them as a required reading order.

comparison 3 reviewed pages

Major silent speech approaches compared

SilentSpeller, SottoVoce, and NasoVoce compared on sensing, evaluation, practicality, and open questions.

Research agenda

The reviewed papers keep returning to wearability, vocabulary, latency, and generalization. The page below turns those recurring gaps into a short, grounded agenda.

agenda 4 recurring gaps

Open problems and research agenda

Wearability, open vocabulary, real-time use, and generalization keep reappearing in the current review set.

Technique taxonomy

These pages group the current database by real `modality:` tags from the expert records.

modality:video 42 pages

Video

42 reviewed pages · 0 imported pages

modality:acoustic 30 pages

Acoustic

30 reviewed pages · 0 imported pages

modality:ultrasound 16 pages

Ultrasound

16 reviewed pages · 0 imported pages

modality:multimodal 15 pages

Multimodal

15 reviewed pages · 0 imported pages

modality:microphone 7 pages

Microphone

7 reviewed pages · 0 imported pages

modality:emg 6 pages

EMG

6 reviewed pages · 0 imported pages

modality:eeg 5 pages

EEG

5 reviewed pages · 0 imported pages

modality:magnetic 3 pages

Magnetic

3 reviewed pages · 0 imported pages

modality:radar 2 pages

Radar

2 reviewed pages · 0 imported pages

modality:vibration 2 pages

Vibration

2 reviewed pages · 0 imported pages

modality:camera 1 page

Camera

1 reviewed page · 0 imported pages

modality:electropalatography 1 page

Electropalatography

1 reviewed page · 0 imported pages
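
The grouping behind these taxonomy pages can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the record shape (a `title` plus a `tags` list), the placeholder titles, and the helper name `group_by_modality` are all hypothetical, since the actual schema of the expert records is not shown on this page.

```python
from collections import defaultdict

# Hypothetical records: placeholder titles and an assumed "tags" field.
# The real expert-record schema is not documented on this page.
records = [
    {"title": "Paper A", "tags": ["modality:ultrasound"]},
    {"title": "Paper B", "tags": ["modality:video"]},
    {"title": "Paper C", "tags": ["modality:video", "modality:emg"]},
    {"title": "Paper D", "tags": []},  # no modality tag, so it joins no group
]


def group_by_modality(records):
    """Map each modality name to the titles of the records tagged with it."""
    groups = defaultdict(list)
    for record in records:
        for tag in record.get("tags", []):
            prefix, _, value = tag.partition(":")
            # Keep only well-formed "modality:<name>" tags.
            if prefix == "modality" and value:
                groups[value].append(record["title"])
    return dict(groups)


groups = group_by_modality(records)

# List modalities by descending page count, then alphabetically.
for name, titles in sorted(groups.items(), key=lambda kv: (-len(kv[1]), kv[0])):
    print(f"modality:{name} {len(titles)} page(s)")
```

Under this assumed shape, a record carrying more than one `modality:` tag is counted once under each of them, so per-modality counts can add up to more than the total number of records.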

Machine-readable exports

These files are generated from repository inputs during build.

JSON export machine-readable

SSI review export

Snapshot JSON of the current SSI review records built from repository inputs.

Feed

JSON feed machine-readable

SSI review feed

Snapshot feed of the current SSI review records with source-updated timestamps.

Export

Reference and citation

Use the canonical citation page when you need the database name, maintainer, or last-updated date.

reference canonical citation

How to cite this database

Canonical citation page for the SSI review database. Last updated 2026-06-09.

Datasets and code resources

Verified links are grouped on a dedicated page so the current corpus can point to code, datasets, and paper pages without inventing any new metadata.

resources 6 code-linked 6 dataset-linked

Datasets and code resources

Verified links already present in repository data, with paper pages attached wherever the archive has a local review page.