2026 · arXiv · Field expert review · confidence high

A 1000-hour EEG-EMG-audio dataset of Japanese speech production

Motoshige Sato, Ilya Horiguchi, Masakazu Inoue, Kenichi Tomeoka, Eri Hatakeyama, Yuya Kita, Atsushi Yamamoto, Ippei Fujisawa, Shuntaro Sasai

A 1020-hour multimodal EEG-EMG-audio dataset for Japanese overt speech vastly expands data resources, enabling diverse speech decoding and EEG research, though generalization is limited by three participants and no decoding benchmarks are presented.

Verdict: full-text draft · Priority: high · Confidence: high · Basis: full text + summary · Coverage: high

Reading guidance

Verdict: full-text draft · priority high · confidence high
Why it matters: A large multimodal dataset of high-quality EEG, EMG, and audio for overt Japanese speech with strong technical validation, advancing data availability for non-invasive speech decoding research.
What to trust: Basis: full text + summary. Coverage: high. 5 evidence records back the review.
What is weak: Only three participants (all Japanese males), so generalization and participant diversity are limited, and the longitudinal recordings come from those same few individuals. Evaluation covers basic signal-quality and physiological validation only; no downstream decoding experiments or benchmarks are reported. Wearable or mobile suitability was not tested, and the dataset is not a deployed system: no deployment readiness or specific application is described. Overclaim risk: low.

Axes

Task: dataset
Modality: multimodal
Hardware: Three scalp EEG systems: g.Pangolin (128 ch), g.SCARABEO (62 ch), and eego™sports (63 ch); three bipolar facial EMG channels on the upper lip, lower lip, and eye regions; lavalier microphone for audio recording.
Body site: brain
Output: audio
Vocabulary: open vocabulary
Evaluation mode: quantitative
Review confidence: high
Overclaim risk: low

Expert take

This paper releases the JapanEEG dataset, a large-scale multimodal collection of 1020 hours of scalp EEG, facial EMG, and audio recorded during open-vocabulary overt Japanese speech from three participants using three different EEG systems. The dataset substantially exceeds prior public speech EEG datasets in scale and diversity of recording hardware. Rigorous technical validation demonstrates expected physiological EEG spectral profiles and event-related potentials across devices and participants, confirming high data quality. The work's principal contribution is the dataset itself, positioned to enable diverse speech decoding studies and to support broader EEG research on artifact modeling, representation learning, and cross-session/device adaptation. Limitations include a small participant pool limiting generalization and the lack of downstream decoding benchmarks. Overall, this dataset fills a notable gap in publicly available speech EEG resources, particularly for overt speech in Japanese, and offers valuable opportunities for the SSI community to develop and evaluate decoding approaches with large longitudinal multimodal EEG data.

True value

A large multimodal dataset of high-quality EEG, EMG, and audio for overt Japanese speech with strong technical validation, advancing data availability for non-invasive speech decoding research.

What changed

Canon before

Prior public EEG datasets for speech decoding were smaller in scale (on the order of hours, recorded with single- or few-channel systems), often limited to imagined speech or a single device, and predominantly in languages other than Japanese.

Delta from canon

A significantly larger, multimodal dataset with cross-device, longitudinal recordings of overt Japanese speech, supporting research areas beyond speech decoding alone.

Position in field

Significant dataset contribution that expands scale and modality diversity for EEG-based decoding of overt Japanese speech.

Evidence

“We present a multimodal dataset of 1020 hours of simultaneously recorded scalp electroencephalography (EEG), facial electromyography (EMG), and speech audio from three healthy native Japanese speakers during open-vocabulary overt speech.”

author_claim · Abstract · confidence 1.00

“Three scalp EEG systems were employed across the dataset: g.Pangolin (128 channels), g.SCARABEO (62 channels), and eego™sports (63 channels). Facial EMG was recorded simultaneously with three bipolar channels placed on the upper lip, lower lip, and eye regions.”

fact · Methods section · confidence 1.00
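
For cross-device work, the differing channel counts are the first thing loading code has to account for. A minimal sketch of a device registry follows. The EEG channel counts and the three EMG channels come from the record above; the class and function names are hypothetical, and the assumption that the EMG channels are stored as extra rows alongside the EEG channels is not confirmed by the source.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RecordingSystem:
    """One scalp EEG system and its EEG channel count, as listed in the review."""
    name: str
    eeg_channels: int


# Three bipolar facial EMG channels: upper lip, lower lip, eye region.
EMG_CHANNELS = 3

# Registry keyed by device name (hypothetical helper structure).
SYSTEMS = {
    system.name: system
    for system in (
        RecordingSystem("g.Pangolin", 128),
        RecordingSystem("g.SCARABEO", 62),
        RecordingSystem("eego™sports", 63),
    )
}


def expected_rows(system_name: str, include_emg: bool = True) -> int:
    """Expected number of signal rows for one recording.

    Assumption: EMG channels are stored in addition to the EEG channels.
    """
    system = SYSTEMS[system_name]
    return system.eeg_channels + (EMG_CHANNELS if include_emg else 0)


def check_shape(system_name: str, n_rows: int, include_emg: bool = True) -> bool:
    """Return True if a loaded array has the row count expected for the device."""
    return n_rows == expected_rows(system_name, include_emg)


for name in SYSTEMS:
    print(name, expected_rows(name, include_emg=False), expected_rows(name))
```

A check like this catches recordings that were matched to the wrong device before any cross-device model sees them; the actual channel layout should be read from the dataset's own metadata.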

“The dataset is publicly available via OpenNeuro in Brain Imaging Data Structure (BIDS) format under a CC0 waiver with approximately 955 GB in size.”

fact · Data Records · confidence 1.00
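
The reported size and duration allow a quick back-of-the-envelope estimate for planning storage and download. The sketch below uses only the two figures in the review (about 955 GB, 1020 hours). The 100 Mbit/s link speed and the decimal reading of GB (10^9 bytes) are illustrative assumptions, not values from the source.

```python
TOTAL_GB = 955.0       # approximate dataset size reported in the review
TOTAL_HOURS = 1020.0   # total recording time reported in the review

# Assumptions for illustration only (not from the source):
BYTES_PER_GB = 1e9     # decimal gigabytes
LINK_MBIT_PER_S = 100  # hypothetical sustained download speed

# Average storage cost of one recorded hour across all modalities.
gb_per_hour = TOTAL_GB / TOTAL_HOURS

# Time to download the full dataset at the assumed link speed.
total_bits = TOTAL_GB * BYTES_PER_GB * 8
download_seconds = total_bits / (LINK_MBIT_PER_S * 1e6)
download_hours = download_seconds / 3600

print(f"{gb_per_hour:.3f} GB per recorded hour")
print(f"{download_hours:.1f} h to download at {LINK_MBIT_PER_S} Mbit/s")
```

Under these assumptions the dataset averages a little under 1 GB per recorded hour and takes most of a day to fetch in full, which suggests downloading per participant or per session when only a subset is needed.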

“Technical validation comprised power spectral density and event-related potential analyses across participants, devices, and tasks, showing the expected 1/f spectral profile, task-related alpha-band attenuation, and time-locked evoked responses consistent with speech-related cortical activity.”

validation_scope · Technical Validation section · confidence 1.00
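
To make "task-related alpha-band attenuation" concrete, here is a minimal sketch of how such a check can be computed. It is not the authors' pipeline. It uses synthetic sinusoids in place of EEG, a naive discrete Fourier transform from the standard library, and assumed values for the sampling rate (128 Hz), the alpha band (8 to 13 Hz), and the signal amplitudes.

```python
import cmath
import math


def band_power(signal, fs, f_lo, f_hi):
    """Sum periodogram power over DFT bins with f_lo <= f <= f_hi.

    Uses a naive O(N^2) DFT, which is adequate for short illustrative signals.
    """
    n = len(signal)
    total = 0.0
    for k in range(n // 2 + 1):
        freq = k * fs / n
        if f_lo <= freq <= f_hi:
            coeff = sum(
                x * cmath.exp(-2j * math.pi * k * t / n)
                for t, x in enumerate(signal)
            )
            total += abs(coeff) ** 2 / n
    return total


def synth(fs, seconds, alpha_amp):
    """Synthetic trace: a 10 Hz 'alpha' sinusoid plus a fixed 25 Hz component."""
    n = int(fs * seconds)
    return [
        alpha_amp * math.sin(2 * math.pi * 10 * t / fs)
        + 0.5 * math.sin(2 * math.pi * 25 * t / fs)
        for t in range(n)
    ]


FS = 128  # assumed sampling rate in Hz, for illustration only

rest = synth(FS, 2.0, alpha_amp=2.0)    # stronger alpha at rest
speech = synth(FS, 2.0, alpha_amp=1.0)  # alpha amplitude halved during the task

alpha_rest = band_power(rest, FS, 8.0, 13.0)
alpha_speech = band_power(speech, FS, 8.0, 13.0)

# Negative values mean alpha power dropped during the task.
attenuation_db = 10 * math.log10(alpha_speech / alpha_rest)
print(f"alpha attenuation: {attenuation_db:.2f} dB")
```

Halving the alpha amplitude quarters the band power, so the synthetic example yields roughly -6 dB. On real recordings one would typically use a windowed estimator such as Welch's method and average over many epochs rather than a single periodogram.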

“The dataset involves only three participants, limiting generalizability across larger populations.”

limitation · Methods section · confidence 1.00

Limits

Technical limits

Small sample size (n=3) limits generalization; lack of downstream decoding evaluations; longitudinal recordings from few participants only; no testing of wearable or mobile suitability reported.

Evaluation limits

Evaluation is limited to basic signal quality and physiological validation analyses, not downstream decoding performance or benchmarking.

Deployment limits

The dataset itself is not a deployed system; no deployment readiness or specific application deployment is described.

Scope limits

Data were collected from only three Japanese male participants, which limits participant diversity; evaluation is limited to signal-quality checks, with no decoding experiments.