2018 · arXiv / imported corpus page · Field expert review · confidence high

Cross-modal Embeddings for Video and Audio Retrieval

Dídac Surís, Amanda Duarte, Amaia Salvador, Jordi Torres, Xavier Giró-i-Nieto

arXiv

Useful multimodal retrieval baseline, not SSI.

Verdict: full-text draft · Priority: low · Confidence: high · Basis: full text · Coverage: high

Reading guidance

Verdict: full-text draft · priority low · confidence high
Why it matters: The paper is a competent audiovisual retrieval result, but it is not an SSI paper and does not decode or synthesize speech for communication.
What to trust: Basis: full text. Coverage: high. 4 evidence records back the review.
What is weak: The model does not solve fine temporal alignment and ignores much of the temporal structure available in the underlying videos. Evaluation uses a 6,000-clip subset with precomputed features rather than end-to-end raw audiovisual processing or a speech-oriented benchmark. The work supports offline retrieval, not an interactive SSI system. It is outside SSI scope. Overclaim risk: Low for retrieval claims, but the paper should not be interpreted as an SSI advance.
Read before: SSI review rubric
Read next: SSI archive

Axes

Task: cross-modal retrieval
Modality: video and audio
Hardware: YouTube-8M precomputed audio windows and visual features sampled at 1 Hz
Output: retrieval ranking
Metrics: For 256 candidates, Table 1 reports audio-to-video Recall@1/5/10 of 21.5/52.0/63.1 and Table 2 reports video-to-audio Recall@1/5/10 of 22.3/51.7/64.4. Audio-to-video Recall@1 falls to 15.2% at 512 candidates and 9.8% at 1024 candidates.
Evaluation mode: Audio-to-video and video-to-audio retrieval on a 6,000-clip YouTube-8M subset using Recall@1, Recall@5, and Recall@10 at multiple candidate-pool sizes.
Review confidence: high
Overclaim risk: Low for retrieval claims, but the paper should not be interpreted as an SSI advance.
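
The Recall@K figures in the Metrics line are easier to judge against a chance baseline. The following is a minimal sketch, not the authors' evaluation code. It assumes that each query's true match sits at the same index in the candidate pool and that candidates are ranked by cosine similarity; the names `cosine`, `recall_at_k`, and `chance_recall` are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def recall_at_k(queries, gallery, ks=(1, 5, 10)):
    """Fraction of queries whose true match (same index) ranks in the top k.

    Assumption: queries[i] and gallery[i] come from the same clip.
    Ties are resolved in favour of the true match.
    """
    hits = {k: 0 for k in ks}
    for i, query in enumerate(queries):
        sims = [cosine(query, candidate) for candidate in gallery]
        # Zero-based rank = number of candidates scoring strictly higher.
        rank = sum(1 for s in sims if s > sims[i])
        for k in ks:
            if rank < k:
                hits[k] += 1
    return {k: hits[k] / len(queries) for k in ks}


def chance_recall(k, pool_size):
    """Expected Recall@k when the ranking is uniformly random."""
    return min(k, pool_size) / pool_size
```

Under these assumptions, `chance_recall(1, 256)` is 1/256, about 0.4%, so the reported 21.5% Recall@1 is roughly 55 times chance. At 1024 candidates chance falls to about 0.1%, which puts the reported 9.8% in context.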

Expert take

The full text supports a narrow but clear claim. The method learns a shared embedding for precomputed YouTube-8M audio and video features and evaluates retrieval in both directions. The best reported setting reaches roughly 22% Recall@1 and about 52% Recall@5 for 256 candidates, then degrades as the gallery grows. That makes this a reasonable lightweight retrieval paper, but the authors explicitly say they do not address exact cross-modal alignment, and the task is retrieval rather than silent speech recognition or reconstruction. It belongs in the archive only as an adjacent multimodal distractor.
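
The shared-embedding idea can be illustrated with a toy example. This is a sketch under stated assumptions, not the paper's architecture: it uses one untrained linear map per modality, L2 normalisation, and dot-product ranking, and the weights `W_VIDEO` and `W_AUDIO`, the feature sizes, and the helper names are all hypothetical.

```python
import math


def project(weights, features):
    """Linear map into the shared space, followed by L2 normalisation."""
    out = [sum(w * x for w, x in zip(row, features)) for row in weights]
    norm = math.sqrt(sum(v * v for v in out))
    return [v / norm for v in out] if norm else out


def rank(query, gallery):
    """Gallery indices ordered from most to least similar to the query."""
    scores = [sum(a * b for a, b in zip(query, g)) for g in gallery]
    return sorted(range(len(gallery)), key=lambda j: -scores[j])


# Hypothetical hand-picked weights: 3-d "video" and 2-d "audio" features
# are both mapped into a 2-d joint space. A real model would learn these.
W_VIDEO = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
W_AUDIO = [[1.0, 0.0], [0.0, 1.0]]

video_clips = [[2.0, 0.1, 5.0], [0.1, 3.0, 5.0]]
audio_clips = [[1.0, 0.0], [0.0, 1.0]]

video_emb = [project(W_VIDEO, v) for v in video_clips]
audio_emb = [project(W_AUDIO, a) for a in audio_clips]

# Once both modalities live in one space, retrieval works in either direction.
audio_to_video = [rank(a, video_emb) for a in audio_emb]
video_to_audio = [rank(v, audio_emb) for v in video_emb]
```

Because both modalities end up in the same space, one similarity function serves both retrieval directions, which is why the paper can report audio-to-video and video-to-audio results from a single model.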

True value

The paper is a competent audiovisual retrieval result, but it is not an SSI paper and does not decode or synthesize speech for communication.

What changed

Canon before

Cross-modal retrieval usually paired images with text or relied on narrower audiovisual domains such as music videos.

Delta from canon

This paper uses synchronized web video to learn a lightweight joint embedding for retrieving audio from video and video from audio at YouTube-8M scale.

Position in field

Outside SSI scope; relevant only as an adjacent multimodal representation-learning paper.

Evidence

“ [...]dings for both scales and assess their quality in a retrieval problem, formulated as using the feature extracted from one modality to retrieve the most similar videos based on the fea[...] ”

“ As depicted in Figure 1, whether a video or an audio clip can be used as a query to search its matching pair in a large collection of videos. ”

author_claim · Abstract · confidence 1.00

“ Evaluation of Recall from audio to video

Number of elements   Recall@1   Recall@5   Recall@10
256                  21.5%      52.0%      63.1%
512                  15.2%      39.5%      52.0%
1024                 9.8%       30.4%      39.6% ”

“ $L_{class}(p^i, p^a, c^i, c^a) = -\sum_k \left( p^i_k \log(c^i_k) + p^a_k \log(c^a_k) \right)$ (4) Finally, the loss function to be optimized is: ”

metric · Table 1. Evaluation of Recall from audio to video · confidence 1.00
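
The excerpt above also carries a fragment of the paper's Eq. (4), a classification term summed over classes k for the visual and audio branches. The sketch below only evaluates that expression as printed. The excerpt does not define which of p and c is the prediction and which is the label, so the code does not assume either; the function name `class_loss` and the `eps` guard against log(0) are additions for illustration.

```python
import math


def class_loss(p_img, p_aud, c_img, c_aud, eps=1e-12):
    """Evaluate -sum_k (p_img[k] * log(c_img[k]) + p_aud[k] * log(c_aud[k])).

    eps is a hypothetical numerical guard and is not part of the excerpt.
    """
    return -sum(
        pi * math.log(ci + eps) + pa * math.log(ca + eps)
        for pi, pa, ci, ca in zip(p_img, p_aud, c_img, c_aud)
    )


# Two classes, with uniform values inside the logarithm for both branches.
example = class_loss([1.0, 0.0], [0.0, 1.0], [0.5, 0.5], [0.5, 0.5])
```

With these toy inputs each branch contributes log 2, so `example` is approximately 2 · log 2.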

“ Evaluation of Recall from video to audio

Number of elements   Recall@1   Recall@5   Recall@10
256                  22.3%      51.7%      64.4% ”

metric · Table 2. Evaluation of Recall from video to audio · confidence 1.00

“ [...] not address an exact alignment between the two modalities that would require a much higher computation effort. ”

limitation · 1. INTRODUCTION · confidence 1.00

Limits

Technical limits

The model does not solve fine temporal alignment and ignores much of the temporal structure available in the underlying videos.

Evaluation limits

Evaluation uses a 6,000-clip subset with precomputed features rather than end-to-end raw audiovisual processing or a speech-oriented benchmark.

Deployment limits

The work supports offline retrieval, not an interactive SSI system.

Scope limits

Outside SSI scope.