"It's trained by non-disabled people": Evaluating How Image Quality Affects Product Captioning with Vision-Language Models
This is a strong CHI paper because it turns a familiar VLM evaluation question into a disability-centered one and backs it with concrete empirical evidence. The novelty is not a new model but a new lens and dataset that expose how image quality and the capture conditions of blind and low-vision (BLV) people affect how reliable these models are.
Axes Lens
Rare contribution shape, typical evidence profile. The point here is not to assign a score but to show what kind of claim the paper makes and whether its evidence pattern is unusual or baseline in this 268-review set.
Contribution shape
- Knowledge form: descriptive knowledge (typical · 92/268)
- Novelty type: empirical finding (typical · 68/268)
- Abstraction level: task (typical · 36/268)
- Generalization target: task class (typical · 63/268)
- Validation mode: mixed methods (typical · 136/268)
Evidence profile
- Evidence strength: strong (typical · 158/268)
- Claim alignment: strong (typical · 231/268)
- Overclaim risk: low (typical · 53/268)
Review Summary
This paper’s main contribution is best understood as an empirical and methodological reframing of VLM evaluation for accessibility. Rather than treating product captioning as a generic benchmark problem, it asks what happens when the images come from BLV people and contain the kinds of blur, misframing, and rotation that arise in everyday use. That shift matters because it challenges the common assumption that a model’s apparent competence on clean or standard images transfers to assistive settings.
The paper supports that argument with a survey of 86 BLV participants, an annotated dataset of 1,859 product images, and an evaluation of four VLMs. The reported pattern is clear: high-quality images yield very strong performance, quality issues substantially reduce accuracy, and the degradation worsens as issues compound. The paper also explicitly claims to be a first-of-its-kind systematic examination of image quality effects on product recognition, which supports the novelty claim even though the underlying technical machinery is not a new algorithm.
I would classify the contribution as an empirical finding plus a dataset and measurement contribution, with the broader field-level value coming from the normative argument that evaluation should center disabled people’s experiences throughout the process. The limitations are appropriately stated and matter for interpretation: the study focuses on product identification rather than full caption quality, collapses image quality into a binary treatment, and is shaped by U.S. and English-speaking data. Those constraints do not weaken the core result, but they do bound the generality of the claims. Overall, this is a convincing CHI honorable-mention-level paper because it combines a socially important problem framing with a solid evaluation design and a result that is both practically relevant and likely to influence how accessibility-oriented VLM work is assessed.
What Changed
Canon before
Prior work on VLMs for BLV assistance typically evaluates captioning or recognition performance on general image sets, but does not center disability-specific capture conditions or systematically isolate common image-quality failures in product photos.
Departure from common sense
The paper argues that VLM product-captioning reliability cannot be treated as a generic vision problem: image-quality failures and disability-centered capture conditions materially change performance for BLV users, so evaluation must be grounded in disabled people's lived image-taking practices rather than assuming sighted-user norms transfer.
Actual novelty
The paper’s novelty is an empirical and methodological reframing: it combines a BLV-sourced product-image dataset, disability-centered annotation, and multi-model evaluation to show how blur, framing, and rotation degrade product captioning in ways that matter for BLV users. It also contributes a structured annotation scheme for product, brand, and variety.
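To make the annotation scheme concrete, here is a minimal sketch of what one annotated record and a per-field match check might look like. The class name `ProductAnnotation`, the helper `field_matches`, and the case-insensitive substring rule are illustrative assumptions, not the paper's actual schema or scoring procedure. Only the three field names (product, brand, variety) come from the text above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ProductAnnotation:
    """Hypothetical ground-truth label for one product photo."""
    product: str  # generic product type
    brand: str
    variety: str  # e.g. flavor or scent


def field_matches(caption: str, annotation: ProductAnnotation) -> dict[str, bool]:
    """Report which annotated fields appear in a caption.

    Assumption: a case-insensitive substring test stands in for whatever
    matching procedure the paper actually used.
    """
    text = caption.lower()
    return {
        "product": annotation.product.lower() in text,
        "brand": annotation.brand.lower() in text,
        "variety": annotation.variety.lower() in text,
    }


# Toy example with made-up values (not taken from the dataset).
label = ProductAnnotation(product="soup", brand="ExampleBrand", variety="tomato")
result = field_matches("A can of ExampleBrand soup on a counter.", label)
print(result)
```

Scoring the three fields separately keeps partial successes visible: a caption can name the product and brand correctly and still miss the variety.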
Evidence
The paper combines a survey of 86 BLV participants with an annotated dataset of 1,859 product images and evaluates four VLMs on product identification under image-quality issues. Reported results show strong performance on high-quality images and substantial degradation when quality issues are present, with compounding effects across issues. The paper also explicitly claims to offer a first-of-its-kind systematic examination and names several limitations that constrain generalization.
“4 Study 2: Evaluating VLM Caption Accuracy for Product Understanding Given the pervasive challenges with using VLMs to identify products, we systematically examine how image quality issues affect a VLM’s ability to identify them correctly and in d”
actual novelty · Discussion · confidence 0.96
“3 Study 1: Understanding BLV People’s Preferences, Experiences, and Challenges with AI-based Captioning of Product Images To understand how image quality issues relate to errors during captioning, we first study BLV people’s experiences using VLM-based tools to identify and understand products, such as household goods and foods. We”
departure from common sense · Introduction / Discussion framing · confidence 0.95
“In Proceedings of the AAAI Conference on Human Computation and Crowdsourcing (2018), Vol. 6. 184–192. https://ojs.aaai.org/index.php/HCOMP/article/view/13341 [136] Le Yang, Ziwei Zheng, Boxu Chen, Zhengyu Zhao, Chenhao Li”
limitation · Limitations and Future Work · confidence 0.98
“Based on a survey of 86 BLV participants, we develop an annotated dataset of 1,859 product images from BLV people to systematically evaluate how image quality issues affect VLM-generated captions”
validation scope · Abstract / Study 2 method and conclusion · confidence 0.97
Limits
Method limits
The evaluation is centered on product identification accuracy rather than holistic caption quality, and image-quality issues are reduced to a binary variable, which simplifies the underlying phenomenon and may miss gradations or interactions beyond the coded categories.
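A minimal sketch of what a binary coding implies for analysis: each image carries 0/1 flags per issue, and accuracy is grouped by how many flags are set. The record layout, the `ISSUES` tuple, the function name, and every value below are made up for illustration. None of it is the paper's data or code.

```python
from collections import defaultdict

# Hypothetical per-image records: binary quality flags plus whether the
# model identified the product correctly. All values are invented.
records = [
    {"blur": 0, "framing": 0, "rotation": 0, "correct": True},
    {"blur": 0, "framing": 0, "rotation": 0, "correct": True},
    {"blur": 1, "framing": 0, "rotation": 0, "correct": True},
    {"blur": 1, "framing": 0, "rotation": 0, "correct": False},
    {"blur": 1, "framing": 1, "rotation": 0, "correct": False},
    {"blur": 1, "framing": 1, "rotation": 1, "correct": False},
]

ISSUES = ("blur", "framing", "rotation")


def accuracy_by_issue_count(rows):
    """Group images by number of flagged issues and return accuracy per group."""
    totals = defaultdict(int)
    hits = defaultdict(int)
    for row in rows:
        n_issues = sum(row[name] for name in ISSUES)
        totals[n_issues] += 1
        hits[n_issues] += int(row["correct"])
    return {n: hits[n] / totals[n] for n in sorted(totals)}


summary = accuracy_by_issue_count(records)
print(summary)
```

Under this coding, a slightly blurry photo and a severely blurry one both get `blur = 1`, so any effect of severity is invisible to the analysis. That is the gradation the limitation refers to.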
Deployment limits
The findings are most directly applicable to BLV product-photo captioning workflows using the evaluated VLMs and similar image conditions; they do not by themselves establish performance for broader captioning tasks, other assistive contexts, or future model versions.
Boundary conditions
The paper’s own framing acknowledges a U.S. and English-speaking bias in the data and evaluation. It also emphasizes that performance worsens as image-quality issues compound, so the results hold only within the specific product-photo setting and the quality issues that were coded.
Position in field
This work sits at the intersection of accessible AI and VLM evaluation, shifting the unit of analysis from generic benchmark accuracy to disability-centered product understanding. Its contribution is primarily empirical and methodological: it supplies evidence that image quality is a key failure mode for BLV-oriented product captioning and argues for evaluation practices that reflect disabled users’ capture conditions.