2022 · arXiv / imported corpus page · Field expert review · confidence high

Breaking the trade-off in personalized speech enhancement with cross-task knowledge distillation

Hassan Taherian, Şefik Emre Eskimez, Takuya Yoshioka

arXiv

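Strong causal personalized speech enhancement (PSE) paper, not silent-speech interaction (SSI). The loss guided by a personalized voice activity detector (pVAD) is the part that holds up under full-text reading.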

Verdict: full-text draft · Priority: medium · Confidence: high · Basis: full text · Coverage: high

Reading guidance

Verdict: full-text draft · priority medium · confidence high
Why it matters: The useful contribution is training-time trade-off control for causal PSE, not a new interface or SSI method.
What to trust: Basis: full text. Coverage: high. 4 evidence records back the review.
What is weak: The work remains acoustic, causal PSE only and does not extend to silent-speech interaction. The evidence is scenario-specific to the TS1/TS2/TS3 simulations. No user-facing interaction or SSI deployment is claimed. Overclaim risk: low-medium.
Read before: SSI review rubric
Read next: SSI archive

Axes

Task: personalized speech enhancement
Modality: mixed speech audio plus target-speaker conditioning
Hardware: microphone
Output: speech-audio
Metrics: The proposed S1 model keeps TS1/TS2 over-suppression close to B1 while raising the TS3 leakage measure (∆N) from 46.5 dB to 148.5 dB; in TS2, WER rises from 16.8 to 17.8 and TSOS falls from 0.45 to 0.37.
Evaluation mode: WER, DEL, DNSMOS, STOI, TSOS, and leakage-energy evaluation on TS1/TS2/TS3 scenarios
Review confidence: high
Overclaim risk: low-medium
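
The size of the trade-off in the Metrics line is easier to judge as deltas. The sketch below only re-derives absolute and relative changes from the figures quoted in this review; it does not re-read Table 1, and the helper name `delta` is hypothetical.

```python
def delta(before, after):
    """Return (absolute change, relative change in percent)."""
    absolute = after - before
    relative = 100.0 * absolute / before
    return absolute, relative


# Figures as quoted in the Metrics line of this review (not re-checked against Table 1).
wer_abs, wer_rel = delta(16.8, 17.8)      # TS2 word error rate
tsos_abs, tsos_rel = delta(0.45, 0.37)    # TS2 over-suppression score
leak_abs, _ = delta(46.5, 148.5)          # TS3 leakage measure in dB; only the difference is meaningful

print(f"TS2 WER:  {wer_abs:+.1f} points ({wer_rel:+.1f}% relative)")
print(f"TS2 TSOS: {tsos_abs:+.2f} ({tsos_rel:+.1f}% relative)")
print(f"TS3 leakage measure: {leak_abs:+.1f} dB")
```

Read this way, the quoted TS2 cost is 1.0 WER point (about 6% relative) against a TSOS drop of 0.08 (about 18% relative), and the TS3 figure moves by 102 dB.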

Expert take

This paper is best read as a careful training-method paper for causal personalized speech enhancement. The full text shows the trade-off explicitly in Table 1: adding inactive-target-speaker (ITS) samples to the training data suppresses leakage in TS3 but worsens over-suppression, and the proposed pVAD-guided losses recover much of that damage. That is a meaningful result for speech enhancement, but it remains adjacent to SSI rather than part of it.

True value

The useful contribution is training-time trade-off control for causal PSE, not a new interface or SSI method.

What changed

Canon before

Causal personalized speech enhancement usually reduced either over-suppression or interference leakage, but not both at once.

Delta from canon

The pVAD task is used during training to suppress misleading frames from inactive-target scenarios rather than treating every frame equally.

Position in field

A solid adjacent speech-enhancement paper, clearly outside core SSI.

Evidence

“ Specifically, we utilize a personalized voice activity detector (pVAD) during training to exclude the non-target speech frames that are wrongly identified as containing the target speaker with hard or soft classification. […] leakage by addressing only one problem at the expense of the other. We propose a cross-task knowledge distillation approach to reduce both speech over-suppression and interference leakage and thus […] ”

author_claim · Abstract · confidence 0.97

“ Fig. 1: Schematic diagram of E3Net training with cross-task knowledge distillation. (a) Misclassified frames are excluded from PSE loss. (b) Noisy signal Y is used as the reference signal for misclassified frames. (c) Active target speaker probabilities are used as weights in PSE loss. ”

actual_novelty · 3.2. PSE Training with Cross-task Knowledge Distillation · confidence 0.95
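
The three variants in the caption can be sketched as per-frame loss rules. This is a minimal illustration, not the authors' implementation: the squared-error frame loss, the 0.5 threshold, the `1 - p` weighting in variant (c), the normalization by total weight, and all function names are assumptions made here. It covers an inactive-target sample, where the default reference is silence and a frame counts as misclassified when the pVAD reports the target speaker as active.

```python
def frame_loss(est, ref):
    """Mean squared error between one estimated frame and its reference (assumed loss)."""
    return sum((e - r) ** 2 for e, r in zip(est, ref)) / len(est)


def pvad_guided_loss(est_frames, noisy_frames, p_active, variant, threshold=0.5):
    """Loss for one inactive-target sample under variant "a", "b", or "c".

    p_active[t] is the pVAD probability that the target speaker is active in
    frame t. Frames with p_active[t] > threshold are treated as misclassified
    (the threshold value is an assumption).
    """
    total, weight_sum = 0.0, 0.0
    for est, noisy, p in zip(est_frames, noisy_frames, p_active):
        zero = [0.0] * len(est)          # silence reference for inactive-target frames
        misclassified = p > threshold
        if variant == "a":               # (a) drop misclassified frames from the loss
            w, ref = (0.0 if misclassified else 1.0), zero
        elif variant == "b":             # (b) misclassified frames use the noisy input Y as reference
            w, ref = 1.0, (noisy if misclassified else zero)
        elif variant == "c":             # (c) soft weighting; the 1 - p form is an assumption
            w, ref = 1.0 - p, zero
        else:
            raise ValueError(f"unknown variant: {variant}")
        total += w * frame_loss(est, ref)
        weight_sum += w
    # Normalizing by the total weight is also an assumption of this sketch.
    return total / weight_sum if weight_sum > 0 else 0.0


# Toy sample: frame 0 is correctly silent, frame 1 passes the input through
# where the pVAD (wrongly) reports the target as active.
est = [[0.0, 0.0], [1.0, 1.0]]
noisy = [[0.5, 0.5], [1.0, 1.0]]
p = [0.1, 0.9]
for v in ("a", "b", "c"):
    print(v, pvad_guided_loss(est, noisy, p, v))
```

In this toy case variants (a) and (b) do not penalize the second frame at all, while (c) penalizes it only lightly. That matches the review's reading that the pVAD is used to keep misleading inactive-target frames from dominating training.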

“ […] DEL and TSOS values that are close to the results of B1 for both TS1 and TS2 while achieving almost the same ∆N value as B3 for TS3. […] In this work, we introduced a new causal PSE model training method […] ”

metric · 4.4. Results and Discussions · confidence 0.97

“ Unlike unconditional speech enhancement, causal PSE models may occasionally remove the target speech by mistake. […]age would be to add inactive target speaker (ITS) samples in the training data [6] and train the PSE model to generate zero signals […] ”

limitation · 5. CONCLUSION · confidence 0.94

Limits

Technical limits

The work remains acoustic causal PSE and does not extend to silent-speech interaction.

Evaluation limits

The evidence is scenario-specific to TS1/TS2/TS3 simulations.

Deployment limits

No user-facing interaction or SSI deployment is claimed.

Scope limits

Causal personalized speech enhancement only.