CHI '26 · Best paper · full-paper review · confidence high

µCap: Instrumental Music Captions for Deaf and Hard-of-Hearing Individuals

SooYeon Ahn, In-Chang Baek, KyungJoong Kim, Khai N. Truong, Jin-Hyuk Hong

µCap is compelling because it reframes captions for instrumental music as sound-mimetic, time-aligned renderings rather than lyric substitutes or generic labels. The paper contributes a real system and user evidence, but its claims should stay bounded by Korean-language design choices, mainly classical evaluation material, and acknowledged mismatches between generated captions and perceived sound.

Axes Lens

Rare contribution shape, typical evidence profile. The point here is not a score. It is to show what kind of claim the paper makes, and whether the evidence pattern is unusual or baseline in this 268-review set.

Contribution shape

Knowledge form: generative knowledge · typical · 35/268
Novelty type: system architecture · typical · 19/268
Abstraction level: artifact · typical · 19/268
Generalization target: user population · typical · 75/268
Validation mode: mixed methods · typical · 136/268

Evidence profile

Evidence strength: strong · typical · 158/268
Claim alignment: strong · typical · 231/268
Overclaim risk: low · typical · 53/268

Review Summary

µCap stands out because it tackles a neglected accessibility problem with a concrete technical and design response rather than only arguing that the problem exists. The central contribution is not merely “captions for music,” but a specific reframing of what captions can be for instrumental audio: phonetic-like, non-lexical, time-synchronized text that tries to preserve texture, rhythm, and expressive contour instead of translating sound into ordinary semantic description. That is a meaningful departure from standard caption assumptions and gives the paper a clear conceptual identity.

The implementation also matters. The authors do not stop at a design sketch; they build a pipeline that combines expert-derived heuristics, audio feature extraction, retrieval-augmented generation, and visual rendering choices informed by preliminary participant input. The validation story is reasonably strong for a CHI systems-and-accessibility paper because it includes formative work, expert discussion, and two user evaluations with DHH participants showing gains in appreciation, immersion, and perceived acoustic detail.

At the same time, the paper does not justify reading µCap as a universal solution. Its own limitations are important and credible: the system was applied only to Korean, it was evaluated primarily with classical music, and the authors explicitly acknowledge that generated captions were not always fully appropriate to how users perceived musical sounds. Those caveats substantially shape how the contribution should be interpreted. In my view, this is best understood as a foundational artifact and method contribution that opens a new design space for instrumental music accessibility. It demonstrates feasibility and user value, and it gives future researchers a concrete starting point for broader multilingual, multi-genre, and real-world deployment work, but it should not yet be treated as a settled standard for music captioning.

What Changed

Canon before

Instrumental music is often inaccessible to Deaf and Hard-of-Hearing (DHH) individuals because existing captioning techniques focus mainly on vocal music with clear lyrics, and there are no established standards for textually representing instrumental music's rich acoustic and emotional content.

Departure from common sense

Instead of treating music captions as lyric transcription or ordinary descriptive labeling, the paper proposes a phonetic-like, sound-mimetic textual representation for instrumental passages. That is a notable shift because it asks captions to imitate musical texture and timing rather than translate music into semantic prose.

Actual novelty

The paper’s novelty is the µCap system architecture: it combines expert-derived phonetic heuristics, a retrieval-augmented generation pipeline, and audio-feature-driven caption rendering to produce time-aligned non-lexical captions for instrumental music. The contribution is not just a new caption style, but a working pipeline evaluated with DHH users.
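The shape of such a pipeline can be illustrated with a minimal Python sketch. It is not the authors' implementation: the feature names (pitch, loudness), the exemplar table, the romanized placeholder syllables, and the font-scale rule are all hypothetical, and the generation step is replaced here by a plain nearest-neighbour lookup so the example stays self-contained.

```python
import math
from dataclasses import dataclass


@dataclass
class Segment:
    """One time-aligned slice of audio with hypothetical normalized features."""
    start: float     # seconds
    end: float       # seconds
    pitch: float     # 0..1, assumed feature
    loudness: float  # 0..1, assumed feature


@dataclass
class Exemplar:
    """A heuristic entry mapping a feature point to a non-lexical syllable."""
    pitch: float
    loudness: float
    syllable: str    # illustrative placeholder, not taken from the paper


# Hypothetical stand-in for an expert-derived heuristic table.
EXEMPLARS = [
    Exemplar(0.2, 0.8, "dum"),
    Exemplar(0.8, 0.3, "ting"),
    Exemplar(0.5, 0.5, "la"),
]


def retrieve(seg: Segment, exemplars: list[Exemplar]) -> Exemplar:
    """Retrieval step: pick the exemplar closest to the segment's features."""
    return min(
        exemplars,
        key=lambda e: math.dist((seg.pitch, seg.loudness), (e.pitch, e.loudness)),
    )


def caption(segments: list[Segment]) -> list[dict]:
    """Build time-aligned cues; louder segments get a larger display scale."""
    cues = []
    for seg in segments:
        match = retrieve(seg, EXEMPLARS)
        cues.append({
            "start": seg.start,
            "end": seg.end,
            "text": match.syllable,
            "scale": round(1.0 + seg.loudness, 2),  # assumed rendering rule
        })
    return cues


if __name__ == "__main__":
    demo = [Segment(0.0, 0.5, 0.25, 0.9), Segment(0.5, 1.0, 0.75, 0.2)]
    for cue in caption(demo):
        print(cue)
```

The sketch only separates the three roles named above (heuristics, retrieval, feature-driven rendering); a system of the kind the paper describes would pass the retrieved heuristics to a generative model rather than emitting the exemplar syllable directly.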

Evidence

The paper grounds its design in a preliminary survey with DHH participants and expert discussions, then implements µCap as an automated captioning pipeline using audio features, retrieval, and generation. Validation includes two user evaluations with DHH participants and reports improved appreciation, immersion, and perceived acoustic detail, while the limitations section narrows claims to Korean-language captions, mainly classical evaluation material, and imperfect sound-to-text alignment.

“We propose µCap (Music Captions), an automatic instrumental music captioning system that transforms instrumental audio into time-aligned, non-lexical textual renderings enhanced with simple visuals”

actual novelty · Abstract · confidence 0.98

“To the best of our knowledge, this is a new approach for enhancing instrumental music accessibility. (2) Through expert group discussions, we derived a heuristic guideline that interprets and converts musical sounds as text. (3) To address the challenge of generati”

departure from common sense · 1 Introduction · confidence 0.96

“ure work. First, µCap was applied only to the Korean language. Although Korean is sound-based, its alphabetic system includes symbols that effectively represent some auditory expressions while lacking direct equivalents for others. To generalize the approach and extend it to a broader DHH audience, future research should expand the system to addi”

limitation · 9 Limitations · confidence 0.98

“Two user evaluations with DHH participants (n=20 and n=15) showed that µCap enhanced music appreciation, immersion, and perceived presence of acoustic detail.”

validation scope · Abstract · confidence 0.95

Limits

Method limits

The approach is limited by language specificity and by the difficulty of mapping musical sounds into natural text. The paper states that µCap was applied only to Korean, that captions were not always fully appropriate to how musical sounds are perceived, and that broader training data and better feature-to-language mapping are needed.

Deployment limits

The evaluations were run in controlled online study settings with short 15-second clips and relatively small participant samples. The paper also notes that future work must validate the approach across more genres and real-world listening contexts before broader deployment.

Boundary conditions

The reported results are tied mainly to Korean-language captions and to the evaluated instrumental music clips, especially classical music in the main evaluation. The approach may behave differently for other languages, genres, instruments, and listener communities.

Position in field

This work is a pioneering accessibility contribution in an underexplored area: automatic captioning for instrumental music. It extends CHI accessibility work beyond speech and lyric captions toward a multimodal, non-lexical representation strategy for DHH audiences.

Abstract

Instrumental music conveys rich affective experiences through acoustic cues, yet instrumental passages often remain inaccessible to Deaf and Hard-of-Hearing (DHH) audiences. Although captioning practices for vocal songs have expanded, instrumental music remains largely uncaptioned, with no established criteria for representing musical content in text. We propose µCap (Music Captions), an automatic instrumental music captioning system that transforms instrumental audio into time-aligned, non-lexical textual renderings enhanced with simple visuals. Drawing on preliminary surveys with DHH individuals and expert group discussions, we developed a phonetic-like captioning schema grounded in music sound analysis and linguistics. We then implemented µCap using audio feature extraction and a Retrieval-Augmented Generation (RAG) pipeline to produce expressive, sound-mimetic captions. Two user evaluations with DHH participants (n=20 and n=15) showed that µCap enhanced music appreciation, immersion, and perceived presence of acoustic detail. This work contributes empirical evidence and insights for designing caption-based visual representations that make instrumental music more accessible.