2020 · arXiv / imported corpus page · Field expert review · confidence high

End-to-End Speaker-Dependent Voice Activity Detection

Yefei Chen, Shuai Wang, Yanmin Qian, Kai Yu

arXiv

Strong target-speaker VAD paper, not SSI.

Verdict: full-text draft · Priority: medium · Confidence: high · Basis: full text · Coverage: high

Reading guidance

Verdict: full-text draft · priority medium · confidence high
Why it matters: Competent target-speaker VAD paper, but not an SSI contribution.
What to trust: Basis: full text. Coverage: high. 3 evidence records back the review.
What is weak: The task is speaker-dependent only, and predictions remain vulnerable to fragmentation without feature binning or post-processing. Segment-level quality lags frame-level gains, so the best headline metrics overstate temporal cleanliness. Useful for speech pipelines, but unrelated to silent-speech interface deployment. Scope is target-speaker voice activity detection only. Overclaim risk: medium.
Read before: SSI review rubric
Read next: SSI archive

Axes

Task: audio-classification
Modality: speech audio
Hardware: microphone
Output: labels
Metrics: Best LSTM SDVAD+binning+post reaches 94.62% ACC and 93.47% F-score; segment-level J-VAD for LSTM SDVAD+binning is 73.66% versus 76.68% for the LSTM VAD/SV baseline due to fragmentation effects
Evaluation mode: frame-level ACC/F-score and segment-level J-VAD analysis
Review confidence: high
Overclaim risk: medium

Expert take

The full text supports the claimed speech-processing advance: end-to-end speaker-aware VAD improves frame accuracy and F-score over the two-stage baseline and can run online with negligible latency. The main caution is scope. This is target-speaker activity detection for audible speech, not silent-speech sensing, and even inside its own task the paper shows segment-level fragmentation problems that need feature binning/post-processing to stabilize boundaries.

True value

Competent target-speaker VAD paper, but not an SSI contribution.

What changed

Canon before

Speaker-dependent VAD was usually implemented as a two-stage VAD plus speaker-verification cascade with added latency.

Delta from canon

Moves target-speaker conditioning inside the model and shows large frame-level gains from feature binning and end-to-end training.
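As a rough illustration of what conditioning inside the model can look like, the sketch below appends one fixed target-speaker embedding to every frame's acoustic feature vector before the frames reach a classifier. This record does not state how the paper injects speaker information, so the concatenation scheme, the function name `condition_on_speaker`, and the toy dimensions are all assumptions for illustration.

```python
def condition_on_speaker(frames, speaker_embedding):
    """Append the same speaker embedding to every frame's feature vector.

    Hypothetical helper: concatenation is one common conditioning scheme,
    assumed here for illustration, not confirmed as the paper's design.
    """
    return [list(frame) + list(speaker_embedding) for frame in frames]


# Toy input: 2 frames of 3-dimensional acoustic features (made-up values).
frames = [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]]
# Toy 2-dimensional target-speaker embedding (made-up values).
embedding = [1.0, -1.0]

conditioned = condition_on_speaker(frames, embedding)
# Each frame now carries 3 acoustic + 2 speaker dimensions.
print(len(conditioned), len(conditioned[0]))
```

Because every frame carries the speaker information, a frame-wise model can emit a target-speaker decision per frame without waiting for a separate verification pass over the whole utterance.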

Position in field

Speech-processing paper adjacent to SSI only through target-speaker filtering, not silent speech.

Evidence

“ 3.4 End-to-end speaker-dependent VAD system (SDVAD): According to the baseline system, the speaker verification stage is after obtaining VAD prediction results of the whole utterance, which increase system latency. Moreover, it does not directly optimize the ultimate […] ”

“ 3.5 Post-processing and Feature Binning: VAD is different from common binary classification problems since the audio signal is characterized by continuity which means adjacent frames are highly correlated. ”

author_claim · 5. Conclusion · confidence 0.98

“ For the LSTM model, we use feature binning to keep the continuity of speech and reduce the computation cost. […] Results for the frame-level evaluation are reported in terms of accuracy (ACC) and F-score (F1, harmonic mean of precision and recall), which are listed in Table 1. ”

metric · Table 1: ACC(%) and F-score(%) of different systems. VAD / SV · confidence 0.98
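The frame-level metrics named in this record, accuracy and F-score as the harmonic mean of precision and recall, can be sketched as below. This is a minimal illustration under stated assumptions: labels are binary per frame, 1 is assumed to mean target-speaker speech, and the function name `frame_metrics` and the toy label sequences are hypothetical and not taken from the paper.

```python
def frame_metrics(pred, ref):
    """Return (accuracy, precision, recall, f1) for binary frame labels.

    Assumption: 1 marks a positive frame (e.g. target-speaker speech),
    0 marks everything else.
    """
    if len(pred) != len(ref):
        raise ValueError("pred and ref must have the same length")
    tp = sum(1 for p, r in zip(pred, ref) if p and r)
    fp = sum(1 for p, r in zip(pred, ref) if p and not r)
    fn = sum(1 for p, r in zip(pred, ref) if not p and r)
    tn = sum(1 for p, r in zip(pred, ref) if not p and not r)
    total = tp + fp + fn + tn
    acc = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    # F1 is the harmonic mean of precision and recall.
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return acc, precision, recall, f1


# Toy sequences (made-up): 10 frames, one false alarm and one miss.
ref = [0, 0, 1, 1, 1, 1, 0, 0, 1, 1]
pred = [0, 1, 1, 1, 0, 1, 0, 0, 1, 1]
acc, precision, recall, f1 = frame_metrics(pred, ref)
print(f"ACC={acc:.3f} P={precision:.3f} R={recall:.3f} F1={f1:.3f}")
```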

“ For the LSTM model, we use feature binning to keep the continuity of speech and reduce the computation cost. […] Results for the frame-level evaluation are reported in terms of accuracy (ACC) and F-score (F1, harmonic mean of precision and recall), which are listed in Table 1. ”

limitation · 4.4 Segment level Evaluation · confidence 0.96

Limits

Technical limits

Speaker-dependent task only and still vulnerable to fragmentation without feature binning or post-processing.

Evaluation limits

Segment-level quality lags frame-level gains, so the best headline metrics overstate temporal cleanliness.
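A toy example can make the gap between frame-level and segment-level quality concrete. The sketch below shows a prediction that gets most frames right yet splits one speech segment into three, and a simple gap-filling step that repairs it. It is not the paper's J-VAD metric or its post-processing: the helpers `count_segments`, `frame_accuracy`, and `fill_short_gaps`, the `max_gap` parameter, and the label sequences are all hypothetical.

```python
def count_segments(labels):
    """Count maximal runs of consecutive 1s."""
    count, prev = 0, 0
    for x in labels:
        if x and not prev:
            count += 1
        prev = x
    return count


def frame_accuracy(pred, ref):
    """Fraction of frames where prediction and reference agree."""
    return sum(p == r for p, r in zip(pred, ref)) / len(ref)


def fill_short_gaps(labels, max_gap):
    """Hypothetical smoothing: fill interior runs of 0s of length <= max_gap."""
    out = list(labels)
    n = len(out)
    i = 0
    while i < n:
        if out[i] == 0:
            j = i
            while j < n and out[j] == 0:
                j += 1
            # Fill only gaps bounded by speech on both sides.
            if i > 0 and j < n and (j - i) <= max_gap:
                for k in range(i, j):
                    out[k] = 1
            i = j
        else:
            i += 1
    return out


# Toy data (made-up): one 10-frame speech segment, two single-frame dropouts.
ref = [0] * 3 + [1] * 10 + [0] * 3
pred = [0, 0, 0, 1, 1, 0, 1, 1, 1, 0, 1, 1, 1, 0, 0, 0]

print(frame_accuracy(pred, ref), count_segments(pred))          # 0.875 3
smoothed = fill_short_gaps(pred, max_gap=1)
print(frame_accuracy(smoothed, ref), count_segments(smoothed))  # 1.0 1
```

Only two of sixteen frames are wrong, so frame accuracy stays high, but the single reference segment is reported as three. That is the kind of mismatch a segment-level evaluation exposes and a frame-level score hides.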

Deployment limits

Useful for speech pipelines, but unrelated to silent-speech interface deployment.

Scope limits

Target-speaker voice activity detection only.