2022 · arXiv / imported corpus page · Field expert review · confidence high

Listen only to me! How well can target speech extraction handle false alarms?

Marc Delcroix, Keisuke Kinoshita, Tsubasa Ochiai, Kateřina Žmolíková, Hiroshi Satō, Tomohiro Nakatani

arXiv

Strong paper for false-alarm handling in target speech extraction (TSE), but the wrong domain if someone tries to count it as SSI progress.

Verdict: full-text draft · Priority: medium · Confidence: high · Basis: full text · Coverage: high

Reading guidance

Verdict: full-text draft · priority medium · confidence high
Why it matters: As a TSE paper, the full text is strong: the verification route keeps extraction quality closer to baseline while materially reducing false alarms, but it is still not an SSI paper.
What to trust: Basis: full text. Coverage: high. 4 evidence records back the review.
What is weak: Everything is benchmarked on LibriMix mixtures; there is no silent-speech modality or multimodal cue. The evidence is specific to the SpeakerBeam family and the LibriMix training recipe reported here. TSE-V needs an extra verification stage and enrollment audio, and none of it addresses SSI interaction. The scope is target speech extraction with inactive-speaker handling only. Overclaim risk: any claim that this is an SSI advance would be unsupported by the full text.
Read before: SSI review rubric
Read next: SSI archive

Axes

Task: speech-enhancement
Modality: speech mixture plus enrollment speech
Output: speech-audio
Metrics: SDRi before/after detection; fail rate; EER; fail-and-miss rate; attenuation; enrollment-duration curves
Evaluation mode: LibriMix extraction and active/inactive detection study using SDRi, fail rate, EER, and enrollment-duration sweeps
Review confidence: high
Overclaim risk: Any claim that this is an SSI advance would be unsupported by the full text.

Expert take

Table 2 shows the tradeoff clearly. TSE-IS can directly output zeros, but at 10-second enrollment it drops to 10.8 dB SDRi before detection, 8.6% failure, 11.6% EER, and 13.4% fail-and-miss. TSE-V(360) is the better operating point: 13.6 dB SDRi before detection, 1.7% failure, 6.3% EER, and 7.1% fail-and-miss. Figure 4 sharpens the practical lesson: longer enrollment helps TSE-V approach roughly 5% EER around 15 to 20 seconds. That is a credible deployment result for TSE, but it has no silent-speech sensing or articulatory interface component.
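
The EER figures above mark the operating point where the miss rate on active-speaker (AS) samples equals the false-alarm rate on inactive-speaker (IS) samples. As a rough illustration only, the sketch below sweeps a threshold over detection scores and reports the point where the two error rates are closest. The function name, the toy scores, and the simple threshold sweep are assumptions for illustration; the paper's exact EER procedure is not reproduced here.

```python
from typing import Sequence, Tuple


def equal_error_rate(
    active_scores: Sequence[float],
    inactive_scores: Sequence[float],
) -> Tuple[float, float]:
    """Approximate the EER by sweeping thresholds over the observed scores.

    A sample is declared "active" when its score is strictly above the
    threshold. Returns (approximate EER, threshold at that point).
    """
    if not active_scores or not inactive_scores:
        raise ValueError("both score lists must be non-empty")

    best_gap = None
    best_eer = 0.0
    best_threshold = 0.0
    for threshold in sorted(set(active_scores) | set(inactive_scores)):
        # Miss: target speaker was active but the score did not exceed the threshold.
        miss_rate = sum(s <= threshold for s in active_scores) / len(active_scores)
        # False alarm: target speaker was inactive but the score exceeded the threshold.
        fa_rate = sum(s > threshold for s in inactive_scores) / len(inactive_scores)
        gap = abs(miss_rate - fa_rate)
        if best_gap is None or gap < best_gap:
            best_gap = gap
            best_eer = (miss_rate + fa_rate) / 2
            best_threshold = threshold
    return best_eer, best_threshold


# Hypothetical toy scores, not taken from the paper.
eer, thr = equal_error_rate([0.9, 0.8, 0.7, 0.3], [0.1, 0.2, 0.4, 0.6])
print(f"EER ~ {eer:.2%} at threshold {thr}")
```

With real systems the score would be whatever the detector produces, for example a similarity between enrollment and extracted-speech embeddings, and a finer threshold grid or interpolation would give a smoother estimate.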

True value

As a TSE paper, the full text is strong: the verification route keeps extraction quality closer to baseline while materially reducing false alarms, but it is still not an SSI paper.

What changed

Canon before

Target speech extraction papers usually assumed the enrolled speaker was always active, which hides false alarms at deployment time.

Delta from canon

This paper makes inactive-speaker failure a first-class evaluation target and shows that verification-based handling is stronger than direct zero-output training.

Position in field

Useful deployment-failure analysis for target speech extraction, outside the core SSI modality set.

Evidence

“ […] systems in terms of extraction performance and have mostly ignored the impact of false alarms when the target speaker is inactive. […] Overall TSE-V achieves higher extraction and detection performance than TSE-IS. ”

author_claim · Abstract · confidence 0.99

“ […] between the embeddings computed from the enrollment and from the extracted speech as, $c^{\mathrm{Cos}} = 1$ if $C(e_{\hat{x}}, e_s) > \eta^{\mathrm{Cos}}$, and $c^{\mathrm{Cos}} = 0$ if $C(e_{\hat{x}}, e_s) \le \eta^{\mathrm{Cos}}$. (8) […] At test time, we re-used the auxiliary NN to compute the embedding vector for the extracted speech and performed AS/IS detection with Eq. (8). TSE-V(360) consists of the TSE module of the above TSE-V system retrained on AS samples of the train-360k dataset for 100 epochs. ”

validation_scope · Table 1: Description of the dataset · confidence 0.98
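
The decision rule quoted as Eq. (8) compares the enrollment embedding with the embedding of the extracted speech and declares the target active only when their similarity exceeds a threshold. A minimal sketch of that rule follows, assuming C is cosine similarity (as the "Cos" label suggests) and using plain Python lists as stand-ins for the embedding vectors. The function names and example values are hypothetical, and the threshold would have to be tuned on held-out data.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("embeddings must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        # Assumption: treat an all-zero embedding as "no similarity".
        return 0.0
    return dot / (norm_a * norm_b)


def is_target_active(
    e_extracted: Sequence[float],
    e_enrollment: Sequence[float],
    threshold: float,
) -> int:
    """Return 1 (active speaker) if similarity > threshold, else 0 (inactive)."""
    return 1 if cosine_similarity(e_extracted, e_enrollment) > threshold else 0


# Hypothetical 3-dimensional embeddings, for illustration only.
print(is_target_active([0.9, 0.1, 0.0], [1.0, 0.0, 0.0], threshold=0.5))
print(is_target_active([0.0, 1.0, 0.0], [1.0, 0.0, 0.0], threshold=0.5))
```

The comparison is strict, matching the quoted rule: a similarity exactly equal to the threshold is classified as inactive.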

“ Experimental settings: We used the same NN architecture for all experiments, which consists of the SpeakerBeam system provided in [29], except […] Table 2 shows the extraction and AS/IS detection results for the different systems using enrollment of average duration of 10 sec. ”

metric · Table 2: Extraction and detection performance with enrollment of average duration of 10 sec. · confidence 0.99

“ […] 2, the attenuation values remain in a similar range for AS and IS samples, meaning that it always outputs some signal even for IS cases, causing many false alarms. […] Figure 4: Extraction and AS/IS detection performance as a function of the enrollment duration. ”

limitation · Figure 4: Extraction and AS/IS detection performance as a function of the enrollment duration. · confidence 0.97
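
The limitation quoted above rests on attenuation: if a system outputs roughly as much energy for inactive-speaker (IS) mixtures as for active-speaker (AS) ones, it cannot be silencing IS cases. The sketch below assumes attenuation is the mixture-to-output power ratio in dB, which is a common definition but may differ from the paper's exact formula. The function names and toy signals are hypothetical.

```python
import math
from typing import Sequence


def mean_power(signal: Sequence[float]) -> float:
    """Average power of a discrete-time signal."""
    if not signal:
        raise ValueError("signal must be non-empty")
    return sum(v * v for v in signal) / len(signal)


def attenuation_db(
    mixture: Sequence[float],
    output: Sequence[float],
    eps: float = 1e-12,
) -> float:
    """Attenuation of the output relative to the input mixture, in dB.

    Larger values mean the system suppressed more of the input. `eps`
    guards against division by zero when the output is all zeros.
    """
    return 10.0 * math.log10((mean_power(mixture) + eps) / (mean_power(output) + eps))


# Toy signals, for illustration only.
mixture = [1.0, -1.0, 1.0, -1.0]
leaky_output = [0.1, -0.1, 0.1, -0.1]   # still emits a scaled signal
silent_output = [0.0, 0.0, 0.0, 0.0]    # ideal behaviour for an IS sample
print(f"{attenuation_db(mixture, leaky_output):.1f} dB")
print(f"{attenuation_db(mixture, silent_output):.1f} dB")
```

Under this definition, a system that handles IS samples well would show much higher attenuation on IS than on AS samples; similar ranges for both indicate it is still emitting signal when the target is absent.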

Limits

Technical limits

Everything is benchmarked on LibriMix mixtures; there is no silent-speech modality or multimodal cue.

Evaluation limits

The evidence is specific to the SpeakerBeam family and the LibriMix training recipe reported here.

Deployment limits

TSE-V needs an extra verification stage and enrollment audio, and none of it addresses SSI interaction.

Scope limits

Target speech extraction with inactive-speaker handling only.