CHI '26 · Honorable mention · full-paper review · confidence medium-high

How Multimodal Large Language Models Support Access to Visual Information: A Diary Study With Blind and Low Vision People

Ricardo E. Gonzalez Penuela, Crescentia Jung, Sharon Y Lin, Ruiying Hu, Shiri Azenkot

This is a strong CHI paper because it identifies a real mismatch between perceived quality and actual answer reliability, then turns that mismatch into a useful design frame. The “visual assistant” idea is a meaningful contribution for accessible AI, especially because it is grounded in diary evidence rather than speculation.

Axes Lens

Rare contribution shape, typical evidence profile. The point here is not a score. It is to show what kind of claim the paper makes, and whether the evidence pattern is unusual or baseline in this 268-review set.

Contribution shape

Knowledge form: normative knowledge · typical · 31/268
Novelty type: framework · typical · 59/268
Abstraction level: practice · typical · 85/268
Generalization target: user population · typical · 75/268
Validation mode: qualitative study · typical · 63/268

Evidence profile

Evidence strength: strong · typical · 158/268
Claim alignment: strong · typical · 231/268
Overclaim risk: medium · typical · 210/268
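
The counts above are easier to compare as shares of the review set. The following is a minimal sketch that only redoes that arithmetic. The counts and the 268 denominator are copied from the lines above; the helper name `share` and the dictionary layout are illustrative assumptions, and the sketch does not reproduce whatever rule assigns the "typical" label.

```python
# Counts copied from the Axes Lens lines; 268 is the size of the review set.
REVIEW_SET_SIZE = 268

axis_counts = {
    "Knowledge form: normative knowledge": 31,
    "Novelty type: framework": 59,
    "Abstraction level: practice": 85,
    "Generalization target: user population": 75,
    "Validation mode: qualitative study": 63,
    "Evidence strength: strong": 158,
    "Claim alignment: strong": 231,
    "Overclaim risk: medium": 210,
}


def share(count, total=REVIEW_SET_SIZE):
    """Fraction of the review set that carries the same axis value."""
    return count / total


for label, count in axis_counts.items():
    print(f"{label}: {count}/{REVIEW_SET_SIZE} = {share(count):.1%}")
```

Read this way, the evidence-profile values are shared by well over half of the set, while each contribution-shape value is shared by roughly a tenth to a third of it.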

Review Summary

This paper’s value is that it does not stop at showing that multimodal LLMs can help Blind and Low Vision people access visual information; it asks what kind of help actually matters in daily life. The diary study evidence supports a subtle but important claim: users may experience the system as trustworthy and somewhat satisfying even while it still fails in concrete ways, including incorrect answers and abstentions. That is a useful corrective to evaluation practices that over-focus on descriptive accuracy alone. The proposed “visual assistant” skill is the paper’s main conceptual move, and it is well aligned with the empirical material because it reframes success as goal-directed, reliable assistance rather than just caption-like output.

The study scope is credible for a CHI paper: two weeks, 20 participants, 446 diary entries, 375 conversations, and 549 answerable questions provide a substantial qualitative and descriptive basis for the claims. At the same time, the paper is appropriately bounded. It is not a controlled benchmark of model performance, and it does not justify broad causal claims about all BLV users or all MLLM systems. The authors also acknowledge meaningful limitations: possible observer effects from recording, a sample of expert users rather than beginners, uneven diary participation, and a holiday-season collection window that may have shaped use patterns.

Overall, the paper’s contribution is strongest as a field-level reframing of evaluation and design for accessible multimodal AI, with a clear normative implication for future systems.

What Changed

Canon before

Prior CHI work on BLV visual interpretation tools largely emphasized descriptive outputs and task-specific assistance; this paper extends that canon by examining conversational MLLM support in everyday use and by naming a distinct assistant capability beyond description.

Departure from common sense

The paper’s core counterintuitive point is that users can still judge the system as trustworthy and somewhat satisfying even when it often gives wrong answers (22.2%, per the abstract) or declines to answer (10.8%). That separates perceived usefulness from raw answer correctness and makes the evaluation of BLV visual access tools more nuanced than a simple accuracy metric.

Actual novelty

The paper’s main novelty is the proposed “visual assistant” skill: a set of behaviors for goal-directed, reliable conversational help, paired with guidelines for MLLM-enabled visual interpretation applications. The contribution is not just a new app study, but a reframing of what these systems must do beyond description.

Evidence

A two-week diary study with 20 BLV participants using an MLLM-enabled visual interpretation application produced 446 complete diary entries and 375 conversations covering 549 answerable questions. Participants rated interpretations as trustworthy and somewhat satisfying, yet the abstract reports that the system still produced incorrect answers (22.2%) and abstentions (10.8%). The paper then proposes the “visual assistant” skill and guidelines, while also acknowledging several study limitations.

“We conclude by proposing the "visual assistant" skill and guidelines to help MLLM-enabled visual interpretation applications better support BLV people’s access to visual information”

actual novelty · Abstract / Discussion / Conclusion framing · confidence 0.80

“We found that participants were “somewhat satisfied” with the information generated by the AI and found it “trustworthy,” with mean satisfaction and trust ratings of ”

departure from common sense · Abstract · confidence 0.77

“As these months constitute a busy holiday season in the USA, some participants were traveling during the diary study—leading to possibly atypical use cases for our tool which may not be as prevalent outside of this time frame”

limitation · Limitations section · confidence 0.83

“To address this, we conducted a two-week diary study, where we captured 20 BLV participants’ use of an MLLM-enabled visual interpretation application”

validation scope · Abstract / Method · confidence 0.86

Limits

Method limits

The evidence comes from a two-week diary study rather than a controlled comparison, so the findings are grounded in real-world use but not in causal attribution or broad comparative performance claims. The participant pool was also limited to expert users of visual interpretation systems, which narrows interpretive scope.

Deployment limits

The paper itself notes that observer effect and awareness of recording may have influenced behavior, and that the sample excluded beginners who may face different access barriers. The October–December collection window may also have introduced atypical holiday-travel use cases, limiting direct transfer to other periods and populations.

Boundary conditions

Findings are most applicable to BLV expert users of visual interpretation systems using conversational MLLM tools in everyday contexts. The reported trust/satisfaction pattern and the proposed assistant skill should be read in light of diary-study self-report, system recording, and the specific GPT-4o-based application context.

Position in field

This paper sits at the intersection of accessible AI, BLV assistive technology, and HCI studies of LLM-mediated interaction. Its field contribution is to shift evaluation from description quality alone toward goal-directed assistance quality in everyday visual access.

Abstract

Multimodal large language models (MLLMs) are changing how Blind and Low Vision (BLV) people access visual information. Unlike traditional visual interpretation tools that only provide descriptions, MLLM-enabled applications offer conversational assistance, where users can ask questions to obtain goal-relevant details. However, evidence about their performance in the real world and implications for BLV people's daily lives remains limited. To address this, we conducted a two-week diary study, where we captured 20 BLV participants' use of an MLLM-enabled visual interpretation application. Although participants rated the visual interpretations of the application as "trustworthy" (mean=3.76 out of 5, max=extremely trustworthy) and "somewhat satisfying" (mean=4.13 out of 5, max=very satisfying), the AI often produced incorrect answers (22.2%) or abstained (10.8%) from responding to users' requests. Our findings show that while MLLMs can improve visual interpretations' descriptive accuracy, supporting everyday use also depends on the "visual assistant" skill: behaviors for providing goal-directed, reliable assistance. We conclude by proposing the "visual assistant" skill and guidelines to help MLLM-enabled visual interpretation applications better support BLV people's access to visual information.
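
To make the gap between ratings and reliability concrete, the sketch below combines the two failure rates from the abstract and converts them to approximate counts. It rests on two assumptions that this page does not confirm: that both percentages use the 549 answerable questions as their denominator, and that incorrect answers and abstentions are mutually exclusive categories. The function names are illustrative, and the printed counts are rough estimates, not figures reported by the authors.

```python
# Figures reported in the abstract and review summary.
ANSWERABLE_QUESTIONS = 549
INCORRECT_RATE = 0.222  # share of incorrect answers
ABSTAIN_RATE = 0.108    # share of abstentions


def combined_rate(incorrect, abstained):
    """Share of requests without a correct answer.

    Assumption: the two categories do not overlap, so the rates add.
    """
    return incorrect + abstained


def approx_count(rate, total):
    """Approximate number of questions implied by a rate.

    Assumption: the rate is computed over `total` questions.
    """
    return round(rate * total)


unresolved = combined_rate(INCORRECT_RATE, ABSTAIN_RATE)
print(f"Not correctly answered: {unresolved:.1%}")
print("Approx. incorrect:", approx_count(INCORRECT_RATE, ANSWERABLE_QUESTIONS))
print("Approx. abstained:", approx_count(ABSTAIN_RATE, ANSWERABLE_QUESTIONS))
```

Under these assumptions, about one request in three did not receive a correct answer, even though the mean trust and satisfaction ratings sit in the upper half of their 5-point scales.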