High-Quality Automatic Voice Over with Accurate Alignment: Supervision through Self-Supervised Discrete Speech Units
This video-conditioned AVO system supervises alignment by predicting discrete speech units rather than reconstructing acoustic features, which improves lip-sync and speech quality on a single-speaker dataset; however, it is not a silent speech interface (SSI) paper.
Reading guidance
- Verdict
- full-text draft · priority medium-high · confidence high
- Why it matters
- The key advance is reframing alignment supervision from acoustic regression to discrete speech unit prediction using self-supervised speech representations, delivering more direct and effective training for lip-synchronized speech generation in automatic voice over.
- What to trust
- Basis: full text. Coverage: high. 6 evidence records back the review.
- What is weak
- Validated only on the single-speaker Chem dataset, with no multi-speaker, unseen-word, moving-condition (walking), or in-the-wild tests. Depends on a pretrained unit tokenizer and vocoder, and requires both video frames and text scripts as input. WER comes from an ASR model pretrained on LibriSpeech only and not adapted. Contextual mismatch remains an open issue outside the studied domain. Specific to automatic voice over with text and video inputs; not a generic silent speech interface solution. Overclaim risk: low.
- Read before
- SSI review rubric
- Read next
- SSI archive
Axes
- Task
- automatic voice over
- Modality
- text + video (lip image frames)
- Hardware
- camera
- Body site
- lip
- Output
- speech-audio
- Metrics
- Lip Sync Error Confidence (LSE-C) 6.81, Lip Sync Error Distance (LSE-D) 7.56, Frame Disturbance (FD) 3.23, Word Error Rate (WER) 24.7%, Mean Opinion Score (MOS) 3.98 ± 0.08, Best-Worst Scaling (BWS) best 84.0% / worst 1.3%
- Evaluation mode
- Objective metrics (LSE-C, LSE-D, FD, WER) and subjective listening tests (MOS, BWS)
- Review confidence
- high
- Overclaim risk
- low
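For readers less familiar with the WER figure in the Metrics row: it is conventionally the word-level edit distance between the ASR transcript and the reference script, divided by the number of reference words. The sketch below shows only that generic definition. The function name and the whitespace tokenization are assumptions for illustration; the text normalization and ASR pipeline used in the paper are not described in this review.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] = edit distance between the reference prefix seen so far and hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


# One substitution (b -> x) plus one deletion (d): 2 errors / 4 reference words.
example_wer = word_error_rate("a b c d", "a x c")
```

Because the denominator is the reference length, WER can exceed 100% when the hypothesis contains many insertions.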
Expert take
This paper presents DSU-AVO, an automatic voice over system that supervises multimodal alignment learning through the prediction of discrete speech units, obtained by clustering self-supervised HuBERT representations. Replacing acoustic feature reconstruction with classification of discrete units at the context level gives a more direct learning signal, which yields better lip-sync accuracy and speech naturalness. A pretrained unit vocoder synthesizes speech from the predicted units, alleviating the context-acoustic mismatch typical of prior acoustic decoding. On the single-speaker Chem dataset, the system shows clear gains over baselines in synchronization (LSE-C 6.81 vs. 6.11 for Neural Dubber), duration deviation (FD 3.23 vs. 9.39), intelligibility (WER 24.7% vs. 75.8%), and subjective quality (MOS 3.98 vs. 2.43). The work is nevertheless limited to a single-speaker setting, with no multi-speaker or in-the-wild testing, and it depends on pretrained components, which limits its immediate deployment scope. It is not a silent speech interface per se; it advances alignment supervision in video-conditioned speech synthesis.
True value
The key advance is reframing alignment supervision from acoustic regression to discrete speech unit prediction using self-supervised speech representations, delivering more direct and effective training for lip-synchronized speech generation in automatic voice over.
What changed
Canon before
Prior AVO systems used acoustic feature (mel-spectrogram) reconstruction as a learning objective, which provides indirect supervision for alignment and suffers from mismatch between context and acoustic features.
Delta from canon
Replaces acoustic feature regression with prediction of discrete speech units, derived via HuBERT and k-means clustering, for direct alignment supervision, and uses a Unit HiFi-GAN vocoder to synthesize speech from the predicted units, reducing mismatch and improving synchronization and naturalness.
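The shift from regression to unit classification can be illustrated with a small sketch. It is not the authors' implementation: random vectors stand in for HuBERT frame features, random logits stand in for the unit predictor's output, and the function names, cluster count, and dimensions are all hypothetical. It shows only the two ingredients named above, k-means tokenization of frame features into discrete units and a cross-entropy loss on predicted units.

```python
import numpy as np


def kmeans_fit(features, k, iters=20, seed=0):
    """Plain Lloyd's k-means; returns centroids of shape (k, dim)."""
    rng = np.random.default_rng(seed)
    centroids = features[rng.choice(len(features), size=k, replace=False)].copy()
    for _ in range(iters):
        # Squared distance from every frame to every centroid: (frames, k)
        dists = ((features[:, None, :] - centroids[None, :, :]) ** 2).sum(-1)
        labels = dists.argmin(axis=1)
        for c in range(k):
            members = features[labels == c]
            if len(members):  # keep the old centroid if a cluster empties
                centroids[c] = members.mean(axis=0)
    return centroids


def tokenize(features, centroids):
    """Map each frame to the index of its nearest centroid (its discrete unit)."""
    dists = ((features[:, None, :] - centroids[None, :, :]) ** 2).sum(-1)
    return dists.argmin(axis=1)


def unit_cross_entropy(logits, target_units):
    """Mean cross-entropy between per-frame logits (frames, k) and unit ids."""
    shifted = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    return float(-log_probs[np.arange(len(target_units)), target_units].mean())


# Hypothetical stand-in for self-supervised frame features (not real HuBERT output).
rng = np.random.default_rng(0)
frame_features = rng.normal(size=(200, 16))
centroids = kmeans_fit(frame_features, k=8)
units = tokenize(frame_features, centroids)  # one integer unit id per frame

# Hypothetical stand-in for a unit predictor's output over the same frames.
logits = rng.normal(size=(200, 8))
loss = unit_cross_entropy(logits, units)
```

The design point is that the target is a single categorical label per frame, so the training signal concerns which unit is spoken at each frame, not how closely a continuous spectrogram is reproduced.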
Position in field
Advances automatic voice over and video-conditioned TTS; adjacent but not focused on silent speech interfaces.
Evidence
“ To this end, we propose a novel AVO method leveraging the learning objective of self-supervised discrete speech unit prediction, which not only provides more direct supervision for the alignment learning, but also alleviates the mismatch between the text-video context and acoustic features. ”
author_claim · Abstract · confidence 1.00
“ We propose to guide the context modeling and alignment learning of AVO more directly by imposing discrete speech unit prediction as the supervision at the context representation level, given that discrete speech units are closely correlated with speech content. [...] Du et al. [24] propose to reduce the complexity of the acoustic model in TTS [...] ”
actual_novelty · We propose to guide the context modeling · confidence 1.00
“ Chem dataset is a single-speaker audio-visual English speech dataset with official tran[...] The accuracy of the prediction has a direct impact on the content correctness and intelligibility of the synthetic speech. ”
validation_scope · 4.1.1. Dataset · confidence 1.00
“ As shown in Table 1, our proposed DSU-AVO produces speech with a higher level of naturalness than both baselines by achieving a MOS score of 3.98 ± 0.08. We also conduct a Best-Worst Scaling (BWS) test [39] on [...] [...]sual feature extractor in both 4) and 5), we use the same AV-HuBERT + Self-Training model [33] pretrained on 1,758h of unlabeled Voxceleb2 data [34] and finetuned on 433h of labeled LRS3 data [35] for visual speech recognition. ”
metric · 4.2. Experimental results · confidence 1.00
“ Unit vocoder: We utilize Unit HiFi-GAN [16], which is pretrained on ground-truth <units, waveform> pairs without the speaker encoder and the F0 encoder in a single-speaker setting [31], as the unit vocoder. [...] We utilize a text-video aligner [1, 10] to temporally align textual and visual representations by scaled dot-product attention [18], and produce text-video context with length Tv: [...] ”
limitation · 4.1.1. Dataset · confidence 1.00
“ As demonstrated in Figure 2, our proposed DSU-AVO consists of unit tokenizer, video encoder, text encoder, video-text aligner, unit predictor, and unit vocoder. ”
deployment_claim · 3.3. DSU · confidence 0.90
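The evidence above mentions a text-video aligner based on scaled dot-product attention that produces a context of video length Tv. A minimal sketch of that mechanism follows, assuming video-frame features act as queries and text features as keys and values, which is what an output of length Tv implies. The function name and feature sizes are hypothetical, and the learned projections a real model would include are omitted.

```python
import numpy as np


def align_text_to_video(video_feats, text_feats):
    """Scaled dot-product attention with video frames as queries.

    video_feats: (Tv, d) queries; text_feats: (Tt, d) keys and values.
    Returns the text-video context (Tv, d) and attention weights (Tv, Tt).
    """
    d = video_feats.shape[-1]
    scores = video_feats @ text_feats.T / np.sqrt(d)    # (Tv, Tt)
    scores = scores - scores.max(axis=1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=1, keepdims=True)  # softmax over text
    return weights @ text_feats, weights


rng = np.random.default_rng(0)
video_feats = rng.normal(size=(50, 32))  # Tv = 50 video frames (hypothetical)
text_feats = rng.normal(size=(12, 32))   # Tt = 12 text tokens (hypothetical)
context, weights = align_text_to_video(video_feats, text_feats)
```

Each row of `weights` is a distribution over text tokens for one video frame, so the context sequence inherits the video's length and timing, which is what ties the generated speech to the lip movements.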
Limits
Technical limits
Single-speaker Chem dataset only; reliant on pretrained unit tokenizer and vocoder; no tests on multi-speaker, unseen words, or moving conditions; contextual mismatch remains an open issue outside studied domain.
Evaluation limits
Evaluation limited to the single-speaker Chem dataset; WER comes from an ASR model pretrained on LibriSpeech only and not adapted; no unseen-word or walking tests reported.
Deployment limits
Requires input of video frames and text scripts; depends on pretrained unit tokenizer and vocoder; only validated on single-speaker Chem dataset; no multi-speaker or in-the-wild testing yet.
Scope limits
Specific to automatic voice over with text and video inputs; not a generic silent speech interface solution.