An Overview of Deep-Learning-Based Audio-Visual Speech Enhancement and Separation
Strong AV speech survey, not an SSI system paper.
Reading guidance
- Verdict
- full-text draft · priority medium · confidence high
- Why it matters
- Good map of AV speech enhancement/separation design choices and evaluation gaps, but not a primary SSI contribution.
- What to trust
- Basis: full text. Coverage: high. 3 evidence records back the review.
- What is weak
- No new benchmark, no unified re-evaluation, and no direct empirical comparison across systems. The paper itself states that the lack of standardized AV evaluation makes performance rankings across papers hard to interpret. It is not a deployable system, and its conclusions depend on the papers it surveys. Scope is limited to AV speech enhancement and separation, plus adjacent silent-video and non-speech AV source separation work. Overclaim risk: low.
- Read before
- SSI review rubric
- Read next
- SSI archive
Axes
- Task
- survey
- Modality
- audio + video
- Hardware
- microphone + camera
- Body site
- face; lip
- Output
- speech-audio
- Metrics
- Survey identifies PESQ, STOI/ESTOI, SDR/SI-SDR, and WER as common metrics, while noting the lack of standardized audio-visual evaluation procedures
- Evaluation mode
- literature review
- Review confidence
- high
- Overclaim risk
- low
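Two of the metrics named under Metrics have compact standard definitions, sketched below for orientation. This is a minimal illustration using the commonly cited formulas for SI-SDR and WER, not the survey's own evaluation code; the function names `si_sdr` and `word_error_rate` are hypothetical, and the SI-SDR sketch assumes zero-mean, time-aligned signals of equal length. PESQ and STOI/ESTOI are omitted because they rely on perceptual models that do not reduce to a few lines.

```python
import math


def si_sdr(estimate, reference):
    """Scale-invariant SDR in dB (assumes zero-mean, aligned, equal-length signals)."""
    if len(estimate) != len(reference):
        raise ValueError("signals must have the same length")
    ref_energy = sum(r * r for r in reference)
    if ref_energy == 0:
        raise ValueError("reference signal is silent")
    # Project the estimate onto the reference to get the scaled target.
    alpha = sum(e * r for e, r in zip(estimate, reference)) / ref_energy
    target = [alpha * r for r in reference]
    # Everything not explained by the scaled reference counts as error.
    noise = [e - t for e, t in zip(estimate, target)]
    target_energy = sum(t * t for t in target)
    noise_energy = sum(n * n for n in noise)
    if noise_energy == 0:
        return math.inf
    if target_energy == 0:
        return -math.inf
    return 10 * math.log10(target_energy / noise_energy)


def word_error_rate(reference, hypothesis):
    """WER = (substitutions + deletions + insertions) / reference word count."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference transcript is empty")
    # Word-level Levenshtein distance, keeping only the previous row.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, 1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, 1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1] / len(ref)
```

SI-SDR is unchanged when the estimate is rescaled, which is the property that separates it from plain SDR. WER treats the receiver as a machine (a speech recognizer) rather than a human listener, so the two metrics answer different questions and are not interchangeable when comparing papers.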
Expert take
The full text is strong as a survey: it synthesizes how AV speech enhancement and separation systems are built, where visual input helps most, what fusion choices dominate, and why evaluation remains hard to compare across papers. Its practical value is field-mapping rather than new algorithmic evidence. For SSI work, it is adjacent because it touches silent-video speech reconstruction and multimodal speech processing, but it does not introduce a new silent-speech interface or benchmark.
True value
Good map of AV speech enhancement/separation design choices and evaluation gaps, but not a primary SSI contribution.
What changed
Canon before
AV speech enhancement and separation knowledge was dispersed across modality choices, fusion strategies, datasets, and evaluation practices.
Delta from canon
Organizes the area into acoustic features, visual features, deep learning methods, fusion techniques, training targets, datasets, and evaluation gaps.
Position in field
Survey paper adjacent to SSI via audio-visual speech processing rather than silent-speech interaction itself.
Evidence
“ In the second case, Zhu et al. [293] provided a bird’s-eye view of several AV tasks, to which deep learning has been applied. Although AV-SE and AV-SS are discussed, the presentation covers only five approaches. […] of several elements, described and discussed in the following sections, specifically: acoustic features (in Section IV); visual features (in Section V); deep learning methods (in Section VI); fusion techniques (in Section VII); training targets and objective functions (in Section VIII). ”
author_claim · XII. CONCLUSION · confidence 0.98
“ Especially when the receiver of an enhanced speech signal is a human, SE systems are […] more or less directly applied to audio-visual speech enhancement and separation. ”
fact · XII. CONCLUSION · confidence 0.95
“ This measure assumes that the receiver of the signals is a machine, not a human, and it provides additional performance information for specific applications, e.g. video captioning for teleconferences or augmented reality. […] In this paper, we presented an overview of deep-learning-based approaches for audio-visual speech enhancement (AV- […] ”
limitation · XII. CONCLUSION · confidence 0.97
Limits
Technical limits
No new benchmark, no unified re-evaluation, and no direct empirical comparison across systems.
Evaluation limits
The paper explicitly states that lack of standardized AV evaluation makes broad performance ranking hard to interpret.
Deployment limits
Not a deployable system; conclusions remain dependent on the underlying papers it surveys.
Scope limits
Survey of AV speech enhancement and separation, plus adjacent silent-video and non-speech AV source separation work.