2020 · arXiv / imported corpus page · Field expert review · confidence high

An Overview of Deep-Learning-Based Audio-Visual Speech Enhancement and Separation

Daniel Michelsanti, Zheng‐Hua Tan, Shi-Xiong Zhang, Yong Xu, Meng Yu, Dong Yu, Jesper Jensen

arXiv

Strong AV speech survey, not an SSI system paper.

Verdict: full-text draft · Priority: medium · Confidence: high · Basis: full text · Coverage: high

Reading guidance

Verdict: full-text draft · priority medium · confidence high
Why it matters: Good map of AV speech enhancement/separation design choices and evaluation gaps, but not a primary SSI contribution.
What to trust: Basis: full text. Coverage: high. 3 evidence records back the review.
What is weak: No new benchmark, no unified re-evaluation, and no direct empirical comparison across systems. The paper explicitly states that the lack of standardized AV evaluation makes broad performance rankings hard to interpret. Not a deployable system; its conclusions depend on the underlying papers it surveys. Scope is limited to AV speech enhancement and separation, plus adjacent silent-video and non-speech AV source separation work. Overclaim risk: low.
Read before: SSI review rubric
Read next: SSI archive

Axes

Task: survey
Modality: audio + video
Hardware: microphone + camera
Body site: face; lip
Output: speech-audio
Metrics: Survey identifies PESQ, STOI/ESTOI, SDR/SI-SDR, and WER as common metrics, while noting the lack of standardized audio-visual evaluation procedures
Evaluation mode: literature review
Review confidence: high
Overclaim risk: low
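
Of the metrics named in the Metrics axis above, SI-SDR is the simplest to state in closed form. The following is a minimal sketch of the commonly used definition: project the estimate onto the reference, then compare the energy of the projected target with the energy of the residual, in dB. It is not taken from the surveyed paper; the function name `si_sdr`, the mean removal step, and the `eps` stabilizer are assumptions made for illustration.

```python
import math


def si_sdr(estimate, reference, eps=1e-12):
    """Scale-invariant SDR in dB between two equal-length sample sequences.

    Illustrative sketch only: `eps` is an assumed small constant that
    guards against division by zero and log of zero.
    """
    if len(estimate) != len(reference) or len(reference) == 0:
        raise ValueError("estimate and reference must be non-empty and equal length")

    n = len(reference)
    # Remove the mean of each signal (a common, but assumed, preprocessing step).
    mean_est = sum(estimate) / n
    mean_ref = sum(reference) / n
    est = [x - mean_est for x in estimate]
    ref = [x - mean_ref for x in reference]

    # Optimal scaling of the reference toward the estimate.
    dot = sum(e * r for e, r in zip(est, ref))
    ref_energy = sum(r * r for r in ref)
    alpha = dot / (ref_energy + eps)

    # Split the estimate into a scaled-reference part and a residual.
    target = [alpha * r for r in ref]
    residual = [e - t for e, t in zip(est, target)]

    target_energy = sum(t * t for t in target)
    residual_energy = sum(x * x for x in residual)
    return 10.0 * math.log10((target_energy + eps) / (residual_energy + eps))
```

Because the reference is rescaled before the comparison, multiplying the estimate by a constant gain leaves the score essentially unchanged. That is the property distinguishing SI-SDR from plain SDR. It does not address the card's point that evaluation procedures differ across audio-visual papers.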

Expert take

The full text is strong as a survey: it synthesizes how AV speech enhancement and separation systems are built, where visual input helps most, what fusion choices dominate, and why evaluation remains hard to compare across papers. Its practical value is field-mapping rather than new algorithmic evidence. For SSI work, it is adjacent because it touches silent-video speech reconstruction and multimodal speech processing, but it does not introduce a new silent-speech interface or benchmark.

True value

Good map of AV speech enhancement/separation design choices and evaluation gaps, but not a primary SSI contribution.

What changed

Canon before

AV speech enhancement and separation knowledge was dispersed across modality choices, fusion strategies, datasets, and evaluation practices.

Delta from canon

Organizes the area into acoustic features, visual features, deep learning methods, fusion techniques, training targets, datasets, and evaluation gaps.

Position in field

Survey paper adjacent to SSI via audio-visual speech processing rather than silent-speech interaction itself.

Evidence

“ In the second case, Zhu et al. [293] provided a bird’s-eye view of several AV tasks, to which deep learning has been applied. Although AV-SE and AV-SS are discussed, the presentation covers only five approaches. ”

“ […] of several elements, described and discussed in the following sections, specifically: acoustic features (in Section IV); visual features (in Section V); deep learning methods (in Section VI); fusion techniques (in Section VII); training targets and objective functions (in Section VIII). ”

author_claim · XII. CONCLUSION · confidence 0.98

“ Especially when the receiver of an enhanced speech signal is a human, SE systems are […] ”

“ […] more or less directly applied to audio-visual speech enhancement and separation. ”

fact · XII. CONCLUSION · confidence 0.95

“ This measure assumes that the receiver of the signals is a machine, not a human, and it provides additional performance information for specific applications, e.g. video captioning for teleconferences or augmented reality. ”

“ In this paper, we presented an overview of deep-learning-based approaches for audio-visual speech enhancement (AV- […] ”

limitation · XII. CONCLUSION · confidence 0.97

Limits

Technical limits

No new benchmark, no unified re-evaluation, and no direct empirical comparison across systems.

Evaluation limits

The paper explicitly states that the lack of standardized AV evaluation makes broad performance rankings hard to interpret.

Deployment limits

Not a deployable system; conclusions remain dependent on the underlying papers it surveys.

Scope limits

Survey of AV speech enhancement and separation, plus adjacent silent-video and non-speech AV source separation work.