2019 · arXiv / imported corpus page · Field expert review · confidence high

All-neural online source separation, counting, and diarization for meeting analysis

Thilo von Neumann, Keisuke Kinoshita, Marc Delcroix, Shoko Araki, Tomohiro Nakatani, Reinhold Haeb‐Umbach

Strong online diarization/separation paper, but outside SSI.

Verdict: full-text draft · Priority: medium · Confidence: high · Basis: full text · Coverage: high

Reading guidance

Verdict
full-text draft · priority medium · confidence high
Why it matters
Tracking speakers through silent blocks in a single neural online system is the real contribution.
What to trust
Basis: full text. Coverage: high. 4 evidence records back the review.
What is weak
Outside SSI: the paper targets single-channel online meeting analysis rather than silent-speech interaction, with no SSI use case or human-interaction deployment. Results come from simulated meeting mixtures only and remain well below the ideal-ratio-mask upper bound. Overclaim risk: low.

Axes

Task
online source separation and diarization
Modality
single-channel meeting audio
Output
speech-audio
Metrics
In the 12-block conversation-like condition, proposed model (2) reports SDR 11.7 dB, DER 6.6%, and SCER 4.9%; source-counting accuracy exceeds 98% in that condition and is above 99% in most other conditions.
Evaluation mode
SDR, DER, SCER, and source-counting accuracy across block-online meeting scenarios
Review confidence
high
Overclaim risk
low
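
The three error metrics listed under Evaluation mode can be sketched as follows. This is a minimal illustration and not the paper's scoring code. It assumes a plain energy-ratio SDR (not the BSS-Eval variant), frame-level scoring with no collar, hypothesis labels already mapped to reference speakers, and SCER taken as the confusion share of DER. The function names `sdr_db` and `der_and_scer` are hypothetical.

```python
import math

def sdr_db(reference, estimate):
    """Plain SDR in dB: 10 * log10(||s||^2 / ||s - s_hat||^2)."""
    signal = sum(s * s for s in reference)
    distortion = sum((s - e) ** 2 for s, e in zip(reference, estimate))
    return 10.0 * math.log10(signal / distortion)

def der_and_scer(ref_frames, hyp_frames):
    """Frame-level DER and speaker-confusion rate.

    Each frame is a set of active speaker labels. Labels in hyp_frames are
    assumed to be already mapped onto the reference speakers.
    """
    missed = false_alarm = confusion = total_ref = 0
    for ref, hyp in zip(ref_frames, hyp_frames):
        correct = len(ref & hyp)
        missed += max(0, len(ref) - len(hyp))        # reference speech with no output
        false_alarm += max(0, len(hyp) - len(ref))   # output with no reference speech
        confusion += min(len(ref), len(hyp)) - correct  # speech given to the wrong speaker
        total_ref += len(ref)
    der = (missed + false_alarm + confusion) / total_ref
    scer = confusion / total_ref
    return der, scer

# Toy example: four frames, two speakers, one silent frame at the end.
ref = [{"A"}, {"A", "B"}, {"B"}, set()]
hyp = [{"A"}, {"A"}, {"A"}, {"B"}]
der, scer = der_and_scer(ref, hyp)
print(f"DER={der:.2f} SCER={scer:.2f}")
print(f"SDR={sdr_db([1.0, 0.0, -1.0, 0.0], [0.9, 0.1, -0.9, -0.1]):.2f} dB")
```

In the toy example one frame is missed, one is a false alarm, and one is assigned to the wrong speaker, so confusion accounts for a third of the total error. The same split explains why the card reports SCER alongside DER.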

Expert take

This is not an SSI paper, but it is technically solid for online meeting analysis. The key idea is the block-online neural estimator that keeps speaker identity stable through silent blocks while adapting the number of output masks. The reported results are strong relative to online baselines: in the 12-block conversation-like setup, the gated proposed model reaches 11.7 dB SDR, 6.6% DER, and 4.9% SCER, with source counting above 98%. Keep it marked as adjacent diarization/separation, not silent speech.
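
The property praised above, a speaker keeping the same output slot across silent blocks while the number of slots grows as needed, can be pictured with a toy tracker. This is a sketch under stated assumptions, not the paper's network: it substitutes greedy cosine matching of per-block speaker embeddings for the neural estimator, and the class `SlotTracker`, its `threshold` parameter, and the 2-D embeddings are all hypothetical.

```python
import math

def cosine(a, b):
    """Cosine similarity of two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

class SlotTracker:
    """Toy block-online tracker: one persistent output slot per speaker."""

    def __init__(self, threshold=0.8):
        self.slots = []            # stored embedding for each output slot
        self.threshold = threshold

    def process_block(self, block_embeddings):
        """Return one activity flag per slot for the current block."""
        active = [False] * len(self.slots)
        for emb in block_embeddings:
            best, best_sim = None, self.threshold
            for i, slot in enumerate(self.slots):
                if not active[i]:
                    sim = cosine(emb, slot)
                    if sim >= best_sim:
                        best, best_sim = i, sim
            if best is None:
                # Unseen speaker: open a new slot (the slot count adapts).
                self.slots.append(emb)
                active.append(True)
            else:
                active[best] = True
        return active

# Speaker A talks, B joins, A falls silent for a block, then A returns.
blocks = [
    [(1.0, 0.0)],
    [(1.0, 0.0), (0.0, 1.0)],
    [(0.0, 1.0)],
    [(0.9, 0.1)],
]
tracker = SlotTracker()
history = [tracker.process_block(b) for b in blocks]
print(history)
```

Slot 0 stays reserved for the first speaker during the block in which that speaker is silent, so the returning speaker lands on the same output. That continuity is what the diarization output relies on.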

True value

Tracking speakers through silent blocks in a single neural online system is the real contribution.

What changed

Canon before

Meeting diarization pipelines often stitched together separation and clustering and handled long silent gaps poorly in online mode.

Delta from canon

The paper unifies separation, source counting, and diarization in one block-online neural estimator.

Position in field

Relevant adjacent work for online separation/diarization, not for silent-speech interaction.

Evidence

“ Recently, many promising neural network (NN)-based single-channel approaches have been proposed to solve the problem of source separation, such as Deep Clustering (DC) [4], Deep Attractor Network (DAN) [5] and Permutation Invariant Train- […] Here, we also consider separation and diarization jointly, however proposing a novel all-neural block-online approach that performs source separation, source number counting and diarization all together. ”

author_claim · ABSTRACT · confidence 0.94

“ It comprises the tasks: (a) diarization, i.e., determining who is speaking when, (b) source counting, i.e., estimating the number of speakers in a meeting, (c) separating overlapped speech, i.e., carrying out (blind) source sep- […] the same output node, even if he/she remains silent for some time. Most conventional meeting diarization approaches perform block-offline or block-online processing by carrying out the following two steps sequentially [1, 10–13]. ”

actual_novelty · 1. INTRODUCTION · confidence 0.90

“ SDR improvement, speaker diarization and speaker confusion error rates. Model 1 (spk loss: none; gate: none): (a) 4-block: SDR 19.4 dB, DER 4.2%, SCER 3.1%; (b) 12-block homogeneous: SDR 7.5 dB, DER 5.5%, SCER 5.3%; (c) 12-block conv.-like: SDR 11.5 dB, DER 7.8%, SCER 6.5%. ”

metric · 3.4. Results · confidence 0.96

“ ABSTRACT: Automatic meeting analysis comprises the tasks of speaker […] vectors are clustered to obtain masks, from which the sources can be recovered by applying the masks to the speech mixture. ”

limitation · 5. CONCLUSIONS · confidence 0.89

Limits

Technical limits

Outside SSI, simulated meeting mixtures, and still well below the ideal-ratio-mask upper bound.
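
For context on that upper bound, an ideal ratio mask is an oracle computed from the ground-truth sources, which is why no blind system is expected to match it. The sketch below uses one common magnitude-ratio definition. Whether the paper uses this exact variant is an assumption, and the function name `ideal_ratio_masks` and the `eps` parameter are hypothetical.

```python
def ideal_ratio_masks(source_mags, eps=1e-8):
    """Oracle masks from ground-truth source magnitudes.

    source_mags[k][i] is the magnitude of source k in time-frequency bin i
    (bins flattened into one list). Each mask value is that source's share
    of the summed magnitude in the bin.
    """
    n_bins = len(source_mags[0])
    totals = [sum(src[i] for src in source_mags) + eps for i in range(n_bins)]
    return [[src[i] / totals[i] for i in range(n_bins)] for src in source_mags]

# Two sources, three time-frequency bins.
masks = ideal_ratio_masks([[3.0, 0.0, 1.0], [1.0, 2.0, 1.0]])
for k, mask in enumerate(masks):
    print(k, [round(m, 2) for m in mask])
```

Multiplying each mask with the mixture magnitude recovers an estimate of the corresponding source. A learned mask estimator has to approximate these values without access to the clean sources.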

Evaluation limits

Meeting-analysis benchmarks only, with no SSI use case or human-interaction deployment.

Deployment limits

It targets single-channel meeting analysis rather than silent-speech interaction, and results are on simulated meeting mixtures.

Scope limits

Single-channel online meeting analysis with simulated mixtures.