Silent speech research · 106 reviewed pages · 0 imported corpus pages · 102 citation-linked pages

Silent speech papers by publication data.

Browse the SSI review database by year, citation count, title, author, or page type.

Paper pages are expert evaluations, not abstract reposts. Citation counts come from OpenAlex when available.

Browse papers

106 papers shown

arXiv reviewed unknown citations

Cross-Modal Masking for Robust Silent Speech Synthesis Using sEMG and Lipreading

Eder del Blanco, David Gimeno-Gómez, Eva Navas, Carlos-D. Martínez-Hinarejos, Inma Hernáez

The paper advances silent speech synthesis by leveraging masked training to robustly fuse electromyography and lipreading, showing improved performance and resilience, but adaptation to laryngectomized users remains challenging.

arXiv reviewed unknown citations

A 1000-hour EEG-EMG-audio dataset of Japanese speech production

Motoshige Sato, Ilya Horiguchi, Masakazu Inoue, Kenichi Tomeoka, Eri Hatakeyama, Yuya Kita, Atsushi Yamamoto, Ippei Fujisawa, Shuntaro Sasai

A 1020-hour multimodal EEG-EMG-audio dataset for Japanese overt speech vastly expands data resources, enabling diverse speech decoding and EEG research, though generalization is limited by the three-participant sample and no decoding benchmarks are presented.

arXiv reviewed unknown citations

Zero-Shot Imagined Speech Decoding via Imagined-to-Listened MEG Mapping

Maryam Maghsoudi, Shihab Shamma

The study convincingly shows zero-shot imagined speech decoding by mapping MEG imagery to listened responses and decoding with a listened-trained contrastive model, marking a promising data-efficient advance despite limited vocabulary and hardware constraints.

arXiv reviewed unknown citations

A cross-species neural foundation model for end-to-end speech decoding

Yizi Zhang, Linyang He, Chaofei Fan, Tingkai Liu, Han Yu, Trung Le, Jingyuan Li, Scott Linderman, Lea Duncker, Francis R Willett, Nima Mesgarani, Liam Paninski

Introduces a cross-species pretrained transformer encoder enabling state-of-the-art end-to-end neural speech decoding with audio-LLMs, improving accuracy and enabling imagined speech decoding, but latency and real-time deployment remain challenges.

arXiv reviewed 0 citations

SonicVisionLM: Playing Sound with Vision Language Models

Zhifeng Xie, Shengye Yu, Qile He, Mengtian Li

A high-quality video-to-audio generation framework leveraging vision-language models for editable, temporally precise sound effect generation; strong experimental validations but outside standard SSI scope.

arXiv reviewed 67 citations

Ultrasensitive Textile Strain Sensors Redefine Wearable Silent Speech Interfaces with High Machine Learning Efficiency

Chenyu Tang, Muzi Xu, Wentian Yi, Zibo Zhang, Edoardo Occhipinti, Chaoqun Dong, Dafydd Ravenscroft, Sung‐Min Jung, Sanghyo Lee, Shuo Gao, Jong Min Kim, Luigi G. Occhipinti

Strong SSI system combining a novel ultrasensitive throat textile strain sensor with an efficient 1D residual CNN, achieving high word classification accuracy with low computational cost and promising few-shot transfer to new users and words on small vocabularies.

arXiv reviewed 19 citations

Sound Source Localization is All about Cross-Modal Alignment

Arda Senocak, Hyeonggon Ryu, Junsik Kim, Tae-Hyun Oh, Hanspeter Pfister, Joon Son Chung

Provides a novel multi-positive contrastive framework enhancing semantic audio-visual alignment for sound source localization. Strong experimental evidence supports claims. Method is outside the SSI domain.

arXiv reviewed 12 citations

Audio Knowledge Empowered Visual Speech Recognition

Jeong Hun Yeo, Minsu Kim, Jeongsoo Choi, Dae Hoe Kim, Yong Man Ro

The paper advances visual speech recognition by selectively transferring refined linguistic audio knowledge via a learned compact memory and cross-attention injection, improving benchmark WERs over prior audio-assisted methods without requiring audio inputs during inference.

arXiv reviewed 3 citations

Knowledge Distilled Ensemble Model for sEMG-based Silent Speech Interface

Wenqiang Lai, Qihan Yang, Mao Ye, Endong Sun, Jiangnan Ye

This paper delivers a practical spelling-focused sEMG silent speech system by compressing a ResNet ensemble into a lightweight model achieving 85.9% accuracy on the NATO alphabet with portable hardware, but remains limited to 5 young male subjects and speaker-dependent scenarios.

arXiv reviewed 7 citations

Exploring how a Generative AI interprets music

Gabriela Barenboim, Luigi Del Debbio, Johannes Hirn, Verónica Sanz

A thorough interpretability analysis reveals that MusicVAE uses only a few dozen latent dimensions to encode music with pitch and rhythm strongly represented in the first two, but the work has no direct relevance to silent speech interfaces.

arXiv reviewed 2 citations

Audio-visual video-to-speech synthesis with synthesized input audio

Triantafyllos Kefalas, Yannis Panagakis, Maja Pantić

The paper credibly shows that incorporating synthesized audio as an auxiliary input in a second-stage audiovisual synthesis model improves video-to-speech reconstruction quality and intelligibility in benchmarks, though gains depend on model variant and dataset.

arXiv reviewed 3 citations

Large-scale unsupervised audio pre-training for video-to-speech synthesis

Triantafyllos Kefalas, Yannis Panagakis, Maja Pantić

Good decoder-transfer pretraining improves video-to-speech quality on several benchmarks, but WER gains are not consistent. A useful methodological contribution with strong benchmark support, adjacent to SSI rather than a deployable system.

arXiv reviewed 1 citation

LipVoicer: Generating Speech from Silent Videos Guided by Lip Reading

Yochai Yemini, Aviv Shamsian, Lior Bracha, Sharon Gannot, Ethan Fetaya

Strong full-text paper demonstrating that inference-time text guidance via an ASR classifier is key to significantly improved intelligibility in lip-to-speech synthesis on challenging in-the-wild video datasets, outperforming prior baselines.

arXiv reviewed 18 citations

Intelligible Lip-to-Speech Synthesis with Speech Units

Jeongsoo Choi, Minsu Kim, Yong Man Ro

Speech units as a pseudo-text target enable strong content supervision that substantially cuts WER without text labels, and the multi-input vocoder improves speech quality from blurry mel outputs, yielding a state-of-the-art lip-to-speech system on LRS benchmarks.

arXiv reviewed 5 citations

Duration-aware pause insertion using pre-trained language model for multi-speaker text-to-speech

Dong Yang, Tomoki Koriyama, Yuki Saito, Takaaki Saeki, Detai Xin, Hiroshi Saruwatari

The paper presents a strong multi-speaker TTS phrasing approach leveraging speaker-conditioned BERT embeddings and pause duration categories to improve pause insertion precision and synthetic speech rhythm; however, it is out-of-scope for SSI as it focuses on audible speech synthesis only.

arXiv reviewed 35 citations

LipLearner: Customizable Silent Speech Interactions on Mobile Devices

Zixiong Su, Shitao Fang, Jun Rekimoto

LipLearner is a strong mobile silent speech system that uniquely closes the loop from few-shot lipreading model design to practical on-device customization and keyword spotting, demonstrated robustly in real-world conditions and a user study.

arXiv reviewed 1 citation

Towards Neural Decoding of Imagined Speech based on Spoken Speech

Seo‐Hyun Lee, Young-Eun Lee, Soo-Won Kim, Byung-Kwan Ko, Seong‐Whan Lee

Transfer of CSP+SVM models trained on spoken speech EEG to imagined speech achieves comparable, though slightly lower, accuracy within a limited 5-class, 7-subject offline EEG setup, with visual imagery control supporting specificity.

arXiv reviewed 14 citations

Lip-to-Speech Synthesis for Arbitrary Speakers in the Wild

Sindhu B Hegde, K R Prajwal, Rudrabha Mukhopadhyay, Vinay P. Namboodiri, C. V. Jawahar

The real contribution is not just another VAE-GAN; it is turning lip-to-speech into an arbitrary-speaker problem with credible low-data adaptation.

arXiv reviewed 17 citations

SVTS: Scalable Video-to-Speech Synthesis

Rodrigo Mira, Alexandros Haliassos, Stavros Petridis, Björn W. Schuller, Maja Pantić

A key scaling contribution that demonstrates simple spectrogram prediction plus pretrained vocoder pipelines outperform prior complex models on diverse datasets, marking foundational progress in large-scale video-to-speech synthesis.

arXiv reviewed 15 citations

Sequence-to-Sequence Voice Reconstruction for Silent Speech in a Tonal Language

Huiyan Li, Haohong Lin, You Wang, Hengyang Wang, Ming Zhang, Han Gao, Qing Ai, Zhiyuan Luo, Guang Li

SSRNet applies duration-aware Seq2Seq modeling and tonal multitask learning to reconstruct intelligible Mandarin speech from facial sEMG signals, markedly improving performance over prior methods, but it remains speaker-dependent with limited deployment evaluation.

CHI 2022 reviewed 43 citations

SilentSpeller: Towards mobile, hands-free, silent speech text entry using electropalatography

Naoki Kimura, Tan Gemicioglu, Jonathan Womack, Richard Li, Yuhui Zhao, Abdelkareem Bedri, Zixiong Su, Alex Olwal, Jun Rekimoto, Thad Starner

SilentSpeller is a strong, rigorously tested SSI system that reframes silent speech as silent spelling, enabling large vocabulary, live text entry, and walking robustness with in-mouth electropalatography sensors.

arXiv reviewed 5 citations

Neural Speaker Embeddings for Ultrasound-based Silent Speech Interfaces

Amin Honarmandi Shandiz, László Tóth, Gábor Gosztolya, Alexandra Markó, Tamás Gábor Csapó

The ultrasound-based x-vector speaker embedding is highly effective for speaker recognition, achieving under 1% error on unseen speakers, but its integration yields only a marginal improvement in multi-speaker ultrasound-to-speech synthesis accuracy.

arXiv reviewed 22 citations

An Improved Model for Voicing Silent Speech

David Gaddy, Dan Klein

This paper substantially improves open-vocabulary silent speech voicing using learned convolutional EMG features, Transformer modeling, and phoneme supervision, reducing WER from 68.0% to 42.2% automatic and 32.3% human in a single-speaker lab setting.

arXiv reviewed 8 citations

Speaker disentanglement in video-to-speech conversion

Dan Oneaţă, Adriana Stan, Horia Cucu

The paper effectively makes speaker identity a controllable factor in multi-speaker video-to-speech synthesis by disentangling it from content, showing the trade-off between intelligibility and voice control on GRID corpus data.

arXiv reviewed 12 citations

Improving Neural Silent Speech Interface Models by Adversarial Training

Amin Honarmandi Shandiz, László Tóth, Gábor Gosztolya, Alexandra Markó, Tamás Gábor Csapó

A clean, well-executed incremental advance using GAN loss to modestly improve articulatory-to-acoustic mapping from ultrasound, validated objectively on two single-speaker corpora.

arXiv reviewed 7 citations

EMA2S: An End-to-End Multimodal Articulatory-to-Speech System

Yu‐Wen Chen, Kuo-Hsuan Hung, Shang-Yi Chuang, Jonathan H. Sherman, Wen-Chin Huang, Xugang Lu, Yu Tsao

EMA2S achieves consistent quality improvements over prior EMA-to-speech baselines by combining multimodal joint loss training with a neural vocoder, though gains remain confined to lab EMA conditions.

arXiv reviewed 1 citation

Vocoder-Based Speech Synthesis from Silent Videos

Daniel Michelsanti, Olga Slizovskaia, Gloria Haro, Emília Gómez, Zheng‐Hua Tan, Jesper Jensen

A notable step forward in lip-to-speech synthesis by predicting full vocoder features and jointly training for recognition, achieving strong speaker-dependent results but lacking unseen speaker generalization.

arXiv reviewed 1 citation

Continuous Silent Speech Recognition using EEG

Gautam Krishna, Co Tran, Mason Carnahan, Ahmed H. Tewfik

Real EEG sentence-level silent speech recognition is demonstrated but at very high WER, confirming feasibility only and underscoring the immature state of current EEG silent speech technology.

arXiv reviewed 81 citations

Brain2Char: A Deep Architecture for Decoding Text from Brain Recordings

Pengfei Sun, Gopala K. Anumanchipalli, Edward F. Chang

Brain2Char establishes a new state-of-the-art for continuous character decoding from invasive ECoG with competitive WER on large vocabularies and silent speech, demonstrating feasibility for communication BCIs.

arXiv reviewed 57 citations

Demucs: Deep Extractor for Music Sources with extra unlabeled data remixed

Alexandre Défossez, Nicolas Usunier, Léon Bottou, Francis R. Bach

This work delivers an improved waveform source separation model combined with a novel remix-based semi-supervised learning scheme using unlabeled music. Though not related to silent speech, it advances music separation benchmarks by closing gaps to spectrogram methods.

arXiv reviewed 5 citations

Ultrasound-based Silent Speech Interface Built on a Continuous Vocoder

Tamás Gábor Csapó, Mohammed Salah Al-Radhi, Géza Németh, Gábor Gosztolya, Tamás Grósz, László Tóth, Alexandra Markó

The key advancement is continuous F0 tracking via CNNs yielding lower pitch error and slight naturalness improvement over discontinuous F0 pipelines in ultrasound SSI.

arXiv reviewed 3 citations

Autoencoder-Based Articulatory-to-Acoustic Mapping for Ultrasound Silent Speech Interfaces

Gábor Gosztolya, Ádám Pintér, László Tóth, Tamás Grósz, Alexandra Markó, Tamás Gábor Csapó

The paper advances ultrasound silent speech interfaces by compressing ultrasound images using an autoencoder bottleneck prior to spectral parameter prediction, resulting in improved accuracy and more natural synthesized speech with smaller models.

arXiv reviewed 1 citation

Proactive Security: Embedded AI Solution for Violent and Abusive Speech Recognition

Christopher Shulby, Leonardo Pombal, Vitor Jordão, Guilherme Ziolle, Bruno Martho, Antônio Postal, Thiago Prochnow

An embedded smartphone NLP classifier detects violent speech with ~87.5% accuracy using known methods but is unrelated to silent speech interfaces; strong practical application in safety alerting.

arXiv reviewed 16 citations

Harnessing AI for Speech Reconstruction using Multi-view Silent Video Feed

Yaman Kumar, Mayank Aggarwal, Pratham Nawal, Shin'ichi Satoh, Rajiv Ratn Shah, Roger Zimmermann

Multi-view silent video combined with CNN-LSTM models significantly improves speech audio reconstruction quality over single-view, highlighting the importance of optimal camera placement to address pose variance.

arXiv reviewed 9 citations

Lip2AudSpec: Speech reconstruction from silent lip movements video

Hassan Akbari, Himani Arora, Liangliang Cao, Nima Mesgarani

The paper's auditory spectrogram autoencoder bottleneck target is a key innovation that produces more intelligible, natural reconstructed speech from lip videos than prior methods, as confirmed by objective and human evaluations.

arXiv reviewed 8 citations

Improved Speech Reconstruction from Silent Video

Ariel Ephrat, Tavi Halperin, Shmuel Peleg

Strong, benchmark-setting speaker-dependent video-to-speech system that advances speech reconstruction from silent face video but remains limited to per-speaker training and constrained conditions.

Approach comparison

Compare SilentSpeller, SottoVoce, and NasoVoce side by side without treating them as a required reading order.

Research agenda

The reviewed papers keep returning to the same four gaps: wearability, vocabulary, latency, and generalization. The page below turns those gaps into a short, grounded agenda.

agenda · 4 recurring gaps

Open problems and research agenda

Wearability, open vocabulary, real-time use, and generalization keep reappearing in the current review set.

Technique taxonomy

These pages group the current database by real `modality:` tags from the expert records.

modality:video · 42 pages

Video

42 reviewed pages · 0 imported pages

modality:acoustic · 30 pages

Acoustic

30 reviewed pages · 0 imported pages

modality:ultrasound · 16 pages

Ultrasound

16 reviewed pages · 0 imported pages

modality:multimodal · 15 pages

Multimodal

15 reviewed pages · 0 imported pages

modality:microphone · 7 pages

Microphone

7 reviewed pages · 0 imported pages

modality:emg · 6 pages

EMG

6 reviewed pages · 0 imported pages

modality:eeg · 5 pages

EEG

5 reviewed pages · 0 imported pages

modality:magnetic · 3 pages

Magnetic

3 reviewed pages · 0 imported pages

modality:radar · 2 pages

Radar

2 reviewed pages · 0 imported pages

modality:vibration · 2 pages

Vibration

2 reviewed pages · 0 imported pages

modality:camera · 1 page

Camera

1 reviewed page · 0 imported pages

Machine-readable exports

These files are generated from repository inputs during build.

JSON export · machine-readable

SSI review export

Snapshot JSON of the current SSI review records built from repository inputs.

JSON feed · machine-readable

SSI review feed

Snapshot feed of the current SSI review records with source-updated timestamps.
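
For readers who want to work with the export programmatically, the following is a minimal sketch of counting pages per `modality:` tag and ranking records by citation count. The export's actual schema is not described on this page, so the field names `title`, `citations`, and `tags`, the placeholder records, and the helper functions `modality_counts` and `sort_by_citations` are all assumptions made for illustration. Adjust them to match the real file.

```python
import json
from collections import Counter

# Hypothetical sample standing in for the JSON export. The field names
# ("title", "citations", "tags") are assumed, not taken from the real file.
SAMPLE_EXPORT = """
[
  {"title": "Placeholder paper A", "citations": 22, "tags": ["modality:emg"]},
  {"title": "Placeholder paper B", "citations": null,
   "tags": ["modality:video", "modality:acoustic"]},
  {"title": "Placeholder paper C", "citations": 5, "tags": ["modality:video"]}
]
"""


def modality_counts(records):
    """Count how many records carry each `modality:` tag."""
    counts = Counter()
    for record in records:
        for tag in record.get("tags", []):
            prefix, _, value = tag.partition(":")
            if prefix == "modality" and value:
                counts[value] += 1
    return counts


def sort_by_citations(records):
    """Sort by citation count, highest first; unknown counts (None) go last."""
    return sorted(
        records,
        key=lambda r: (r.get("citations") is None, -(r.get("citations") or 0)),
    )


records = json.loads(SAMPLE_EXPORT)
counts = modality_counts(records)
ranked_titles = [r["title"] for r in sort_by_citations(records)]

print(counts.most_common())
print(ranked_titles)
```

Treating a missing citation count as `None` rather than `0` keeps "unknown citations" distinct from "0 citations", which the listing above also shows as separate states.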

Reference and citation

Use the canonical citation page when you need the database name, maintainer, or last-updated date.

reference · canonical citation

How to cite this database

Canonical citation page for the SSI review database. Last updated 2026-06-09.

Datasets and code resources

Verified links are grouped on a dedicated page so the current corpus can point to code, datasets, and paper pages without inventing any new metadata.

resources · 6 code-linked · 6 dataset-linked

Datasets and code resources

Verified links already present in repository data, with paper pages attached wherever the archive has a local review page.