CHI '26 · Honorable mention · full-paper review · confidence medium-high

"It's trained by non-disabled people": Evaluating How Image Quality Affects Product Captioning with Vision-Language Models

Kapil Garg, Xinru Tang, Jimin Heo, Dwayne R Morgan, Darren Gergle, Erik Sudderth, Anne Marie Piper

This is a strong CHI paper because it turns a familiar VLM evaluation question into a disability-centered one and backs it with concrete empirical evidence. The novelty is not a new model, but a new lens and dataset that expose how image quality and BLV capture conditions change how reliable these models actually are.

Axes Lens

Rare contribution shape, typical evidence profile. The point here is not a score. It is to show what kind of claim the paper makes, and whether the evidence pattern is unusual or baseline in this 268-review set.

Contribution shape

Knowledge form: descriptive knowledge · typical · 92/268
Novelty type: empirical finding · typical · 68/268
Abstraction level: task · typical · 36/268
Generalization target: task class · typical · 63/268
Validation mode: mixed methods · typical · 136/268

Evidence profile

Evidence strength: strong · typical · 158/268
Claim alignment: strong · typical · 231/268
Overclaim risk: low · typical · 53/268
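To make the `n/268` figures easier to compare, the sketch below converts them to percentages. It assumes each count is the number of reviews in the set that share the listed label; the page does not state this explicitly, so treat that reading, and the names `axis_counts` and `share_percent`, as illustrative.

```python
# Counts copied from the axis lines above; 268 is the size of the review set.
# Assumption (not stated on the page): each count is the number of reviews
# in the set that carry the same label on that axis.
TOTAL_REVIEWS = 268

axis_counts = {
    "Knowledge form": 92,
    "Novelty type": 68,
    "Abstraction level": 36,
    "Generalization target": 63,
    "Validation mode": 136,
    "Evidence strength": 158,
    "Claim alignment": 231,
    "Overclaim risk": 53,
}


def share_percent(count: int, total: int = TOTAL_REVIEWS) -> float:
    """Return count/total as a percentage rounded to one decimal place."""
    return round(100 * count / total, 1)


shares = {axis: share_percent(n) for axis, n in axis_counts.items()}

for axis, pct in shares.items():
    print(f"{axis}: {pct}% of reviews")
```

Under that reading, the labels range from roughly one in eight reviews (abstraction level) to most of the set (claim alignment).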

Review Summary

This paper’s main contribution is best understood as an empirical and methodological reframing of VLM evaluation for accessibility. Rather than treating product captioning as a generic benchmark problem, it asks what happens when the images come from BLV people and contain the kinds of blur, misframing, and rotation that arise in everyday use. That shift is important because it challenges the common assumption that a model’s apparent competence on clean or standard images transfers to assistive settings.

The paper supports that argument with a survey of 86 BLV participants, an annotated dataset of 1,859 product images, and an evaluation of four VLMs. The reported pattern is clear: high-quality images yield very strong performance, but quality issues substantially reduce accuracy, and the degradation worsens as issues compound. The paper also explicitly claims to be the first systematic examination of image quality effects on product recognition, which makes the novelty credible even though the underlying technical machinery is not a new algorithm. I would classify the contribution as an empirical finding plus a dataset/measurement contribution, with the broader field-level value coming from the normative argument that evaluation should center disabled people’s experiences throughout the process.

The limitations are appropriately stated and matter for interpretation: the study focuses on product identification rather than full caption quality, collapses image quality into a binary treatment, and is shaped by U.S./English-speaking data. Those constraints do not weaken the core result, but they do bound the generality of the claims. Overall, this is a convincing CHI honorable-mention-level paper because it combines a socially important problem framing with a solid evaluation design and a result that is both practically relevant and likely to influence how accessibility-oriented VLM work is assessed.

What Changed

Canon before

Prior work on VLMs for BLV assistance typically evaluates captioning or recognition performance on general image sets, but does not center disability-specific capture conditions or systematically isolate common image-quality failures in product photos.

Departure from common sense

The paper argues that VLM product-captioning reliability cannot be treated as a generic vision problem: image-quality failures and disability-centered capture conditions materially change performance for BLV users, so evaluation must be grounded in disabled people's lived image-taking practices rather than assuming sighted-user norms transfer.

Actual novelty

The paper’s novelty is an empirical and methodological reframing: it combines a BLV-sourced product-image dataset, disability-centered annotation, and multi-model evaluation to show how blur, framing, and rotation degrade product captioning in ways that matter for BLV users. It also contributes a structured annotation scheme for product, brand, and variety.

Evidence

The paper combines a survey of 86 BLV participants with an annotated dataset of 1,859 product images and evaluates four VLMs on product identification under image-quality issues. Reported results show strong performance on high-quality images and substantial degradation when quality issues are present, with compounding effects across issues. The paper also explicitly describes itself as the first systematic examination of this kind and names several limitations that constrain generalization.
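The compounding pattern described above can be checked by stratifying identification accuracy by the number of co-occurring quality issues. The following is a minimal sketch of that analysis, not the authors' pipeline: the records are hypothetical toy values, and the function name `accuracy_by_issue_count` is invented for illustration.

```python
from collections import defaultdict

# Hypothetical toy records: (coded quality issues, was the product identified?).
# These values are illustrative only and are NOT drawn from the paper's dataset.
records = [
    (frozenset(), True),
    (frozenset(), True),
    (frozenset({"blur"}), True),
    (frozenset({"blur"}), False),
    (frozenset({"blur", "rotation"}), False),
    (frozenset({"misframing", "rotation"}), True),
    (frozenset({"blur", "misframing", "rotation"}), False),
]


def accuracy_by_issue_count(rows):
    """Return {number of co-occurring issues: fraction correctly identified}."""
    totals = defaultdict(int)
    correct = defaultdict(int)
    for issues, is_correct in rows:
        n_issues = len(issues)
        totals[n_issues] += 1
        correct[n_issues] += int(is_correct)
    return {n: correct[n] / totals[n] for n in sorted(totals)}


result = accuracy_by_issue_count(records)
for n_issues, acc in result.items():
    print(f"{n_issues} issue(s): accuracy {acc:.2f}")
```

Grouping by issue count rather than by a single "any issue" flag is what makes a compounding effect visible; a per-issue breakdown (blur, misframing, rotation) would follow the same shape with a different grouping key.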

“4 Study 2: Evaluating VLM Caption Accuracy for Product Understanding Given the pervasive challenges with using VLMs to identify products, we systematically examine how image quality issues affect a VLM’s ability to identify them correctly and in d”

actual novelty · Discussion · confidence 0.96

“3 Study 1: Understanding BLV People’s Preferences, Experiences, and Challenges with AI-based Captioning of Product Images To understand how image quality issues relate to errors during captioning, we first study BLV people’s experiences using VLM-based tools to identify and understand products, such as household goods and foods. We”

departure from common sense · Introduction / Discussion framing · confidence 0.95

“In Proceedings of the AAAI Conference on Human Computation and Crowdsourcing (2018), Vol. 6. 184–192. https://ojs.aaai.org/index.php/HCOMP/article/view/13341 [136] Le Yang, Ziwei Zheng, Boxu Chen, Zhengyu Zhao, Chenhao Li”

limitation · Limitations and Future Work · confidence 0.98

“Based on a survey of 86 BLV participants, we develop an annotated dataset of 1,859 product images from BLV people to systematically evaluate how image quality issues affect VLM-generated captions”

validation scope · Abstract / Study 2 method and conclusion · confidence 0.97

Limits

Method limits

The evaluation centers on product identification accuracy rather than holistic caption quality. Image-quality issues are also reduced to a binary variable, which simplifies the underlying phenomenon and may miss gradations of severity or interactions beyond the coded categories.
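A small sketch shows what a binary coding can hide. The graded severity scale here (0 = none to 3 = severe) is a hypothetical alternative introduced only for contrast; it is not a scheme the paper uses, and the image records are invented.

```python
# Hypothetical graded blur codes (0 = none ... 3 = severe), invented for
# illustration. The paper's coding is binary, as noted above.
images = [
    {"id": "a", "blur": 0},
    {"id": "b", "blur": 1},  # mild blur
    {"id": "c", "blur": 3},  # severe blur
]


def binarize(severity: int) -> bool:
    """Collapse a graded severity code into an issue present/absent flag."""
    return severity > 0


binary = {img["id"]: binarize(img["blur"]) for img in images}

# Images "b" and "c" receive the same flag although their severity differs,
# so any accuracy difference between them is invisible after binarization.
print(binary)
```

A graded coding would cost more annotation effort, which is one plausible reason to start with a binary variable; the trade-off is that dose-response effects within a single issue type cannot be estimated.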

Deployment limits

The findings are most directly applicable to BLV product-photo captioning workflows using the evaluated VLMs and similar image conditions; they do not by themselves establish performance for broader captioning tasks, other assistive contexts, or future model versions.

Boundary conditions

The paper acknowledges a U.S. and English-speaking bias in its data and evaluation. It also emphasizes that performance worsens as image-quality issues compound, so the results are bounded by the specific product-photo setting and the coded quality issues.

Position in field

This work sits at the intersection of accessible AI and VLM evaluation, shifting the unit of analysis from generic benchmark accuracy to disability-centered product understanding. Its contribution is primarily empirical and methodological: it supplies evidence that image quality is a key failure mode for BLV-oriented product captioning and argues for evaluation practices that reflect disabled users’ capture conditions.

Abstract

Vision-Language Models (VLMs) are increasingly used by blind and low-vision (BLV) people to identify and understand products in their everyday lives, such as food, personal care items, and household goods. Despite their prevalence, we lack an empirical understanding of how common image quality issues — such as blur, misframing, and rotation — affect the accuracy of VLM-generated captions and whether the resulting captions meet BLV people's information needs. Based on a survey of 86 BLV participants, we develop an annotated dataset of 1,859 product images from BLV people to systematically evaluate how image quality issues affect VLM-generated captions. While the best VLM achieves 98% accuracy on images with no quality issues, accuracy drops to 75% overall when quality issues are present, worsening considerably as issues compound. We discuss the need for model evaluations that center on disabled people's experiences throughout the process and offer concrete recommendations for HCI and ML researchers to make VLMs more reliable for BLV people.