CHI '26 · Honorable mention · full-paper review · confidence medium-high

Like, Comment & Caption: A Decade of Social Media Video Caption Research (2015–2025)

Huong Nguyen, Emma J McDonnell, Lloyd May, Alexander Druzenko, Zoobia Saifullah Syeda, Mark Cartwright, Sooyeon Lee


This is a strong CHI synthesis paper: its main contribution is not a new interface, but a field-level reframing of social media video captions as participatory infrastructure. The review scope is clear, the evidence base is explicit, and the limitations are appropriately bounded by language and search coverage.

Axes Lens

Rare contribution shape, typical evidence profile. The point here is not a score. It is to show what kind of claim the paper makes, and whether the evidence pattern is unusual or baseline in this 268-review set.

Contribution shape

Knowledge form: normative knowledge · typical · 31/268
Novelty type: framework · typical · 59/268
Abstraction level: practice · typical · 85/268
Generalization target: field argument · typical · 55/268
Validation mode: survey synthesis · typical · 10/268

Evidence profile

Evidence strength: strong · typical · 158/268
Claim alignment: strong · typical · 231/268
Overclaim risk: low · typical · 53/268
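The counts in the two lists above are easier to compare as shares of the 268-review set. The following is a minimal sketch of that arithmetic. The counts are copied from this page, but the dictionary labels and the `share` helper are illustrative names, and the page does not state how the "typical" and "rare" labels are assigned, so no threshold is assumed here.

```python
# Counts copied from the Axes Lens lists on this page; 268 is the review-set size.
# The dictionary keys and the helper name are illustrative, not part of the page.
TOTAL_REVIEWS = 268

axis_counts = {
    "Knowledge form: normative knowledge": 31,
    "Novelty type: framework": 59,
    "Abstraction level: practice": 85,
    "Generalization target: field argument": 55,
    "Validation mode: survey synthesis": 10,
    "Evidence strength: strong": 158,
    "Claim alignment: strong": 231,
    "Overclaim risk: low": 53,
}


def share(count: int, total: int = TOTAL_REVIEWS) -> float:
    """Return count as a percentage of total, rounded to one decimal place."""
    return round(100 * count / total, 1)


shares = {label: share(n) for label, n in axis_counts.items()}

# Print from the least common axis value to the most common one.
for label, pct in sorted(shares.items(), key=lambda item: item[1]):
    print(f"{label}: {pct}%")
```

Read this way, survey synthesis is the least common value in the set (10 of 268 reviews), while strong claim alignment is the most common (231 of 268).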

Review Summary

This paper’s value lies in synthesis and reframing. Rather than presenting captioning as a narrow accessibility feature or a creator-only workflow, it argues that captions are collectively produced and maintained across viewers, creators, and platforms. That is a meaningful departure from the common-sense view of captions as a one-way output attached to video. The novelty is not in a new technical artifact, but in the Participatory Captioning framework, which organizes the literature into a more coherent socio-technical account and then uses that account to motivate design implications and future research directions.

The validation mode is appropriately a survey synthesis: the paper reviews 36 peer-reviewed studies from 2015–2025 and explicitly positions itself as a systematic review rather than an empirical test of the framework. That makes the evidence fit the claim well. The paper is also careful about limits. It acknowledges English-only inclusion and the dependence on Google Scholar and SerpAPI rather than a broader multi-database search strategy, which means the synthesis may underrepresent region-specific or non-English practices.

For CHI, this is a solid honorable-mention-level contribution because it clarifies a fragmented area, names an important conceptual shift, and does so with transparent scope and bounded claims. The main caution is that the framework should be read as a field argument and design lens, not as a validated causal theory or a universally generalizable model. The review is especially useful because it does not stop at describing the literature; it also surfaces tensions around labor, feedback, moderation, and platform responsibility, which makes the framework actionable for future design and policy work. The strongest reading is that the paper contributes an organizing vocabulary and a research agenda for a fast-moving area where platform practices, accessibility needs, and creator labor are tightly entangled.

What Changed

Canon before

Prior work treated social media video captions mainly as an accessibility aid, creator-side feature, or platform affordance, with separate studies across communities and caption types rather than a unified synthesis. The literature was fragmented across HCI, accessibility, education, and media studies, so the baseline expectation would be a descriptive review that catalogs studies without materially changing how the field understands captioning as a socio-technical practice.

Departure from common sense

The paper argues that captioning is not just a static feature added by creators or platforms; instead, it is a collective infrastructure co-produced by viewers, creators, and platforms. That move shifts the unit of analysis away from isolated caption artifacts and toward an ongoing participatory system in which feedback, moderation, labor, and platform governance all shape what captions become in practice.

Actual novelty

The paper’s Participatory Captioning framework reframes social media video captioning as a participatory, collective process and uses that synthesis to motivate design implications and future research directions. Its novelty is field-organizing rather than technical: it names a decade of scattered work, integrates viewer, creator, and platform perspectives, and turns that synthesis into a coherent socio-technical lens for accessibility, engagement, and governance. The contribution is strongest as a conceptual framework and literature synthesis rather than as a new system or causal claim. It is still substantively novel because it consolidates a fragmented area into a reusable design argument for future social media video caption (SMVC) research.

Evidence

The paper synthesizes 36 peer-reviewed studies published from 2015 to 2025 using a two-stage Google Scholar plus SerpAPI search and PRISMA-style screening. It reports that captions function as collective infrastructure and introduces Participatory Captioning as the organizing framework for the review’s implications. The evidence base is broad for a synthesis paper, and the authors also explicitly acknowledge English-only scope and incomplete database coverage, which keeps the claims appropriately bounded.
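The eligibility criteria named in this review (a 2015–2025 window, English-language publication, peer-reviewed status) can be sketched as a simple screening filter. This is an illustration only, not the authors' pipeline: the `Record` fields, the order of the checks, and the placeholder entries are all assumptions, and a full PRISMA-style process also covers identification and deduplication steps that are not shown.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Record:
    """Hypothetical search-result record; the field names are assumptions."""
    title: str
    year: int
    language: str        # ISO 639-1 code, e.g. "en"
    peer_reviewed: bool


def exclusion_reason(rec: Record,
                     first_year: int = 2015,
                     last_year: int = 2025) -> Optional[str]:
    """Return why a record is excluded, or None if it passes every check."""
    if not first_year <= rec.year <= last_year:
        return "outside 2015-2025"
    if rec.language != "en":
        return "not English"
    if not rec.peer_reviewed:
        return "not peer-reviewed"
    return None


def screen(records):
    """Split records into an included list and per-reason exclusion counts."""
    included = []
    excluded = {}
    for rec in records:
        reason = exclusion_reason(rec)
        if reason is None:
            included.append(rec)
        else:
            excluded[reason] = excluded.get(reason, 0) + 1
    return included, excluded


# Placeholder records for illustration; none of these are real papers.
candidates = [
    Record("placeholder A", 2019, "en", True),
    Record("placeholder B", 2013, "en", True),
    Record("placeholder C", 2021, "vi", True),
    Record("placeholder D", 2024, "en", False),
]

included, excluded = screen(candidates)
print([rec.title for rec in included])
print(excluded)
```

Counting exclusions by reason mirrors how a PRISMA-style flow reports what was dropped at each step, and it makes the effect of the English-only criterion visible rather than implicit.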

“Deaf and Hard of Hearing (DHH), neurodivergent, and multilingual viewers depend on captions and increasingly expect mechanisms for feedback, while creators face inadequate tool support. Building on these insights, we propose the framework of Participatory Captioning and suggest design implications, highlighting future directions for social media video caption research”

actual novelty · confidence 0.90

“We note that captions operate as collective infrastructure co-produced by viewers, creators, and platforms”

departure from common sense · confidence 0.86

“Although work from diverse communities exists [46, 69, 142, 144], our review is limited to studies published in English.”

limitation · 6.3 Limitations · confidence 0.96

“This paper reviews 36 peer-reviewed papers published between 2015 and 2025 across fields such as Human-Computer Interaction (HCI), accessibility, media studies, education, and language learning”

validation scope · confidence 0.84

Limits

Method limits

The review’s evidence base is bounded by its search and screening choices, including reliance on Google Scholar and SerpAPI plus English-language studies. That means relevant work outside those channels, languages, or the selected HCI-oriented venue sweep may be underrepresented, and the corpus should be read as scoped rather than exhaustive.

Deployment limits

The framework is most directly applicable to social media video captioning research and design. Transfer to other captioning contexts should be cautious because platform constraints, creator practices, moderation regimes, and viewer expectations differ across domains and may change the meaning of participatory captioning.

Boundary conditions

Findings are grounded in a 2015–2025 corpus of 36 peer-reviewed papers and are shaped by the review’s inclusion criteria, database coverage, and English-only scope. The framework is therefore best treated as a field argument for SMVC rather than a universal model of captioning across all media or languages.

Position in field

This is a synthesis contribution that consolidates a fragmented literature and repositions captioning as participatory infrastructure, offering a field-level framing rather than a new system or experimental result. It sits between accessibility review and design theory, with its main value in organizing prior work, clarifying tensions among viewers, creators, and platforms, and pointing to future research and governance questions.

Abstract

As video has become the dominant mode of content on platforms such as YouTube, TikTok, and Instagram, captioning has emerged as a critical factor for accessibility, engagement, and visibility. While prior studies have examined different types of social media video captions or communities' captioning usage, a systematic synthesis has not been undertaken, leading to the risk of proposing interventions that overlook core platform constraints or miss critical accessibility needs. This paper reviews 36 peer-reviewed papers published between 2015 and 2025 across fields such as Human-Computer Interaction (HCI), accessibility, media studies, education, and language learning. We note that captions operate as collective infrastructure co-produced by viewers, creators, and platforms. Deaf and Hard of Hearing (DHH), neurodivergent, and multilingual viewers depend on captions and increasingly expect mechanisms for feedback, while creators face inadequate tool support. Building on these insights, we propose the framework of Participatory Captioning and suggest design implications, highlighting future directions for social media video caption research.