2019 · arXiv / imported corpus page · Field expert review · confidence high

All-neural online source separation, counting, and diarization for meeting analysis

Thilo von Neumann, Keisuke Kinoshita, Marc Delcroix, Shoko Araki, Tomohiro Nakatani, Reinhold Haeb‐Umbach

arXiv

Strong online diarization/separation paper, but outside silent-speech interaction (SSI).

Verdict: full-text draft · Priority: medium · Confidence: high · Basis: full text · Coverage: high

Reading guidance

Verdict: full-text draft · priority medium · confidence high
Why it matters: Tracking speakers through silent blocks in a single neural online system is the real contribution.
What to trust: Basis: full text. Coverage: high. 4 evidence records back the review.
What is weak: The work is outside SSI: it targets single-channel online meeting analysis, with no SSI use case or human-interaction deployment. Results are on simulated meeting mixtures and remain well below the ideal-ratio-mask upper bound. Overclaim risk: low.
Read before: SSI review rubric
Read next: SSI archive

Axes

Task: online source separation and diarization
Modality: single-channel meeting audio
Output: speech-audio
Metrics: In the 12-block conversation-like condition, proposed model (2) reports SDR 11.7 dB, DER 6.6%, and SCER 4.9%; source-counting accuracy exceeds 98% in that condition and is above 99% in most others.
Evaluation mode: SDR, DER, SCER, and source-counting accuracy across block-online meeting scenarios
Review confidence: high
Overclaim risk: low
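
A minimal sketch of how the SDR, DER, and SCER figures above can be read, assuming a plain energy-ratio SDR and frame-level error counting. The function names, the frame representation (one set of active speaker labels per frame), and the assumption that hypothesis labels are already mapped to reference labels are illustrative; the paper's exact scoring tools may differ.

```python
import math


def sdr_db(reference, estimate):
    """Plain signal-to-distortion ratio in dB: 10*log10(|s|^2 / |s - s_hat|^2).

    Assumption: a simple energy ratio, not a full BSS-Eval decomposition.
    """
    signal = sum(r * r for r in reference)
    distortion = sum((r - e) ** 2 for r, e in zip(reference, estimate))
    if distortion == 0:
        return math.inf
    return 10.0 * math.log10(signal / distortion)


def frame_error_rates(ref_frames, hyp_frames):
    """Return (DER, SCER) from per-frame sets of active speaker labels.

    Assumption: hypothesis labels are already mapped onto reference labels.
    """
    total = missed = false_alarm = confused = 0
    for ref, hyp in zip(ref_frames, hyp_frames):
        total += len(ref)
        correct = len(ref & hyp)
        # Speakers present in both counts but with the wrong identity.
        confused += min(len(ref), len(hyp)) - correct
        missed += max(0, len(ref) - len(hyp))
        false_alarm += max(0, len(hyp) - len(ref))
    if total == 0:
        return 0.0, 0.0
    der = (missed + false_alarm + confused) / total
    scer = confused / total
    return der, scer


# Toy usage: five frames, two speakers "A" and "B".
ref = [{"A"}, {"A"}, {"A", "B"}, set(), {"B"}]
hyp = [{"A"}, {"B"}, {"A"}, {"B"}, {"B"}]
der, scer = frame_error_rates(ref, hyp)
```

Under this counting, SCER is the confusion component of DER, which is why it is always the smaller of the two numbers in the reported conditions.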

Expert take

This is not an SSI paper, but it is technically solid for online meeting analysis. The key idea is the block-online neural estimator that keeps speaker identity stable through silent blocks while adapting the number of output masks. The reported results are strong relative to online baselines: in the 12-block conversation-like setup, the gated proposed model reaches 11.7 dB SDR, 6.6% DER, and 4.9% SCER, with source counting above 98%. Keep it marked as adjacent diarization/separation, not silent speech.
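
The bookkeeping behind "same speaker, same output slot, even across silent blocks" can be illustrated with a toy tracker. This is a sketch under stated assumptions, not the paper's neural estimator: the `SlotTracker` class, the per-block speaker embeddings, and the cosine-similarity matching with a fixed threshold are all hypothetical stand-ins chosen for illustration.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0


class SlotTracker:
    """Toy block-online tracker: one persistent output slot per speaker."""

    def __init__(self, threshold=0.8):
        self.threshold = threshold  # hypothetical matching threshold
        self.slots = []  # stored embedding for each output slot

    def process_block(self, block_embeddings):
        """Return per-slot activity for this block (True = speaker observed)."""
        active = [False] * len(self.slots)
        for emb in block_embeddings:
            best, best_sim = None, self.threshold
            for i, stored in enumerate(self.slots):
                if active[i]:
                    continue  # each slot is used at most once per block
                sim = cosine(emb, stored)
                if sim >= best_sim:
                    best, best_sim = i, sim
            if best is None:
                # Unseen speaker: grow the number of output slots.
                self.slots.append(list(emb))
                active.append(True)
            else:
                active[best] = True
        return active


tracker = SlotTracker()
block1 = tracker.process_block([[1.0, 0.0]])              # first speaker appears
block2 = tracker.process_block([])                        # silent block
block3 = tracker.process_block([[0.0, 1.0], [0.9, 0.1]])  # new speaker + return
block4 = tracker.process_block([[0.0, 1.0]])              # only second speaker
```

The point of the sketch is that a slot is never released when its speaker goes silent, so the first speaker returns to slot 0 in the third block while the slot count grows only when an unmatched speaker appears.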

True value

Tracking speakers through silent blocks in a single neural online system is the real contribution.

What changed

Canon before

Meeting diarization pipelines often stitched together separation and clustering and handled long silent gaps poorly in online mode.

Delta from canon

The paper unifies separation, source counting, and diarization in one block-online neural estimator.

Position in field

Relevant adjacent work for online separation/diarization, not for silent-speech interaction.

Evidence

“ Recently, many promising neural network (NN)-based single-channel approaches have been proposed to solve the problem of source separation, such as Deep Clustering (DC) [4], Deep Attractor Network (DAN) [5] and Permutation Invariant Train- [...] Here, we also consider separation and diarization jointly, however proposing a novel all-neural block-online approach that performs source separation, source number counting and diarization all together. ”

author_claim · ABSTRACT · confidence 0.94

“ It comprises the tasks: (a) diarization, i.e., determining who is speaking when, (b) source counting, i.e., estimating the number of speakers in a meeting, (c) separating overlapped speech, i.e., carrying out (blind) source sep- [...] the same output node, even if he/she remains silent for some time. Most conventional meeting diarization approaches perform block-offline or block-online processing by carrying out the following two steps sequentially [1, 10–13]. ”

actual_novelty · 1. INTRODUCTION · confidence 0.90

“ SDR improvement, speaker diarization and speaker confusion error rates
Model 1 (spk loss: -, gate: -)
(a) 4-block: SDR 19.4 dB, DER 4.2 %, SCER 3.1 %
(b) 12-block homogeneous: SDR 7.5 dB, DER 5.5 %, SCER 5.3 %
(c) 12-block conv.-like: SDR 11.5 dB, DER 7.8 %, SCER 6.5 % ”

metric · 3.4. Results · confidence 0.96

“ ABSTRACT Automatic meeting analysis comprises the tasks of speaker [...] vectors are clustered to obtain masks, from which the sources can be recovered by applying the masks to the speech mixture. ”

limitation · 5. CONCLUSIONS · confidence 0.89

Limits

Technical limits

Outside SSI. Results come from simulated meeting mixtures and remain well below the ideal-ratio-mask upper bound.
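
For context on that upper bound, here is a minimal sketch of an ideal ratio mask, assuming the common magnitude-ratio form in which each source's mask is its magnitude divided by the summed magnitudes of all sources in a time-frequency bin. The function name and the flattened-bin input layout are illustrative, and the paper may use a different variant.

```python
def ideal_ratio_masks(source_mags):
    """Compute one mask per source from oracle source magnitudes.

    source_mags: list of per-source magnitude lists over flattened
    time-frequency bins (assumed layout). Bins with zero total energy
    get a mask of 0.0 for every source.
    """
    n_bins = len(source_mags[0])
    masks = [[0.0] * n_bins for _ in source_mags]
    for b in range(n_bins):
        total = sum(src[b] for src in source_mags)
        if total > 0:
            for k, src in enumerate(source_mags):
                masks[k][b] = src[b] / total
    return masks


# Toy usage: two sources, three time-frequency bins.
masks = ideal_ratio_masks([[1.0, 0.0, 3.0], [1.0, 2.0, 1.0]])
```

Because these masks are built from the clean sources themselves, they are an oracle reference rather than something an online system can compute, which is why a gap to them is expected.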

Evaluation limits

Meeting-analysis benchmarks only, with no SSI use case or human-interaction deployment.

Deployment limits

It targets single-channel meeting analysis rather than silent-speech interaction, and results are on simulated meeting mixtures.

Scope limits

Scope is limited to single-channel online meeting analysis on simulated mixtures.