CHI '26 · Best paper · full-paper review · confidence high

Sensemaking in User-Driven Algorithm Auditing: A Case Study on Gender Bias in an Image Captioning Model

Behnoosh Mohammadzadeh, Jules Françoise, Michele Gouiffes, Baptiste Caramiaux

This best paper makes a strong HCI contribution by showing that non-experts can do meaningful algorithm auditing when interfaces are designed around sensemaking. Its most important insight is that interface affordances actively shape which harms become visible, not just how efficiently users inspect outputs.

Axes Lens

Rare contribution shape, typical evidence profile. The point here is not a score. It is to show what kind of claim the paper makes, and whether the evidence pattern is unusual or baseline in this 268-review set.

Contribution shape

Knowledge form: descriptive knowledge (typical · 92/268)
Novelty type: framework (typical · 59/268)
Abstraction level: practice (typical · 85/268)
Generalization target: task class (typical · 63/268)
Validation mode: mixed methods (typical · 136/268)

Evidence profile

Evidence strength: strong (typical · 158/268)
Claim alignment: strong (typical · 231/268)
Overclaim risk: medium (typical · 210/268)

Review Summary

This paper stands out because it reframes algorithm auditing as an interpretive, interface-mediated practice rather than a purely technical evaluation task reserved for experts. The authors do not simply claim that non-experts matter; they build three concrete interface conditions grounded in sensemaking and then show, through a 60-participant between-subjects study, that these designs lead users toward different forms of evidence and different kinds of bias hypotheses. That is the real contribution: the work demonstrates that auditing outcomes are partly produced by the structure of the interface itself. The masking tool supports localized probing of visual cues and context, while the filtering tool helps users compare outputs at a broader linguistic level and notice asymmetries in how men and women are described. The thematic analysis further shows that participants surfaced multiple intertwined forms of gender bias rather than a single simplistic stereotype effect. At the same time, the paper is appropriately bounded. Its dataset is fixed and pre-selected around occupational gender bias, the participant pool was not intentionally constructed to foreground marginalized perspectives, and the study analyzes final audit products more than the temporal process of sensemaking. Those limitations matter because they constrain claims about broader harms, collaborative auditing, and real-world deployment. Even so, the paper makes a substantial contribution to HCI and responsible AI by showing that interface design is epistemically consequential in auditing: it changes what users can see, compare, hypothesize, and ultimately report as harm.

What Changed

Canon before

Prior work largely treats algorithm auditing as an expert-driven process centered on performance metrics or guided explanations. Audit tools typically constrain user agency and target isolated bias patterns rather than supporting open-ended, non-expert sensemaking. Gender bias in image captioning has been studied through benchmarks, but not through user audits that reveal how biases interact or how they are interpreted socially.

Departure from common sense

The paper argues against the idea that meaningful algorithm auditing must remain expert-led. With sensemaking-oriented interfaces, non-expert participants were able to surface multiple interrelated forms of gender bias and produce structured hypotheses and evidence. It also shows that interface affordances do not merely assist auditing efficiency; they materially shape which bias patterns users notice and how they interpret them.

Actual novelty

The main contribution is a sensemaking-grounded design and empirical comparison of three user-driven auditing interfaces for image captioning bias: a baseline interface, an image masking tool, and a text filtering tool. The novelty is not just the tools themselves, but the demonstration that different interface supports systematically steer non-expert auditors toward different kinds of bias findings, spanning visual-cue testing, broader linguistic asymmetries, and confidence formation.
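
To make the text filtering idea concrete, here is a minimal sketch of querying captions by a word and tallying what else appears alongside it. It is not the authors' implementation: the captions, the image ids, and the function names `filter_captions` and `co_occurring_words` are all hypothetical and chosen only for illustration.

```python
import re
from collections import Counter

# Hypothetical captions invented for illustration; these are NOT study data.
CAPTIONS = {
    "img_01": "a man in a suit standing in an office",
    "img_02": "a woman in a kitchen holding a pan",
    "img_03": "a woman in a suit talking on a phone",
    "img_04": "a man cooking food in a kitchen",
}


def tokenize(caption):
    """Split a caption into lowercase word tokens."""
    return re.findall(r"[a-z']+", caption.lower())


def filter_captions(captions, word):
    """Return sorted ids of captions containing `word` as a whole token."""
    target = word.lower()
    return sorted(
        image_id
        for image_id, caption in captions.items()
        if target in tokenize(caption)
    )


def co_occurring_words(captions, word):
    """Count the other words that appear in captions matching `word`."""
    target = word.lower()
    counts = Counter()
    for image_id in filter_captions(captions, word):
        counts.update(t for t in tokenize(captions[image_id]) if t != target)
    return counts
```

Matching on whole tokens matters here: a substring search for "man" would also return every caption containing "woman", which would blur exactly the comparison an auditor is trying to make.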

Evidence

The paper is grounded in a between-participants study with 60 participants across three interface conditions, combined with thematic analysis of participants’ bias cards and quantitative comparisons across conditions. Evidence supports both the existence of multiple discovered bias themes and the claim that interface design shaped audit outcomes and confidence.

“We designed a web-based interface that scaffolds the key stages of the sensemaking process as described in Cabrera et al. [4] (and depicted in Figure 1), including interpreting instances and outputs, aggregating them into schemas and generating hypotheses, and organizing evidence.”

actual novelty · 3.3 Auditing interfaces · confidence 0.93

“A Text Filtering Tool enabled participants to query the dataset based on words present in the output captions, supporting comparisons across similar instances. We performed a mixed-methods analysis of the audit outcomes. Our findings show that non-expert participants uncovered a range of gender biases in the model’s behavior, including th”

departure from common sense · 1 Introduction · confidence 0.96

“While this study offers empirical insights into how non-expert users audit algorithmic behavior, it suffers several limitations. First, participants audited a fixed set of 80 images pre-selected for gender bias in occupational contexts, which oriented sensemaking toward occupational stereotypes while excluding other forms of harm, such as intersectional dynamics of race, age, or ability”

limitation · 5.4 Limitations and Future Directions · confidence 0.97

“4 Study Design. The study follows a between-participants design with three conditions: baseline: auditing with the Baseline interface (original interface design, Figure 2); masking: auditing with the baseline interface augmented by the Image Masking Tool (Figure 3); filtering: auditing with the baseline interface augmented by the Text Filtering Tool (Figure 4).”

validation scope · 3.4 Study Design · confidence 0.95
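
As a rough illustration of what a between-participants allocation over these three conditions could look like, the sketch below shuffles participant ids and deals them out in turn. The equal group sizes, the fixed seed, and the function name `assign_conditions` are assumptions made for the example; the review text reports only 60 participants and three conditions, not how they were allocated.

```python
import random

# Condition labels taken from the quoted study design.
CONDITIONS = ("baseline", "masking", "filtering")


def assign_conditions(participant_ids, conditions=CONDITIONS, seed=0):
    """Randomly assign each participant to exactly one condition.

    Assumption: groups are balanced, so the number of participants
    must be divisible by the number of conditions.
    """
    ids = list(participant_ids)
    if len(ids) % len(conditions) != 0:
        raise ValueError("participant count must divide evenly across conditions")
    rng = random.Random(seed)  # fixed seed so the allocation is reproducible
    rng.shuffle(ids)
    # Deal shuffled ids round-robin so every condition gets the same count.
    return {pid: conditions[i % len(conditions)] for i, pid in enumerate(ids)}


# Hypothetical participant ids 1..60, matching the reported sample size.
assignment = assign_conditions(range(1, 61))
```

Because each participant sees only one interface, any difference in audit outcomes is a comparison between groups, which is why balanced random allocation is a common default for this kind of design.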

Limits

Method limits

The study used a fixed set of 80 pre-selected occupational images centered on gender bias, which narrows the harms participants could surface. The sample did not intentionally include marginalized communities, and the analysis emphasized final audit artifacts rather than detailed trajectories through sensemaking.

Deployment limits

The evaluation was conducted in a controlled, individual study setting rather than embedded in everyday AI use. The interfaces were not tested in collaborative or longitudinal auditing contexts, so practical performance in situated real-world auditing remains uncertain.

Boundary conditions

Findings are bounded to non-expert auditing of gender bias in an image captioning model using curated occupational images and three specific interface conditions. Generalization to other harms, domains, datasets, or naturally occurring audit settings should be made cautiously.

Position in field

This paper extends user-driven algorithm auditing by operationalizing sensemaking as an interface design principle and showing how concrete affordances shape what non-experts can discover. It connects HCI theories of sensemaking with practical AI auditing support, moving the field beyond expert-only audits and beyond tools focused mainly on explanation or performance inspection.

Abstract

Non-experts increasingly engage in user-driven algorithm auditing, interacting directly with AI systems to probe, document, and reflect on biased behavior. Yet, auditing remains challenging due to model opacity and limited support for navigating and interpreting outputs. This paper explores the design and evaluation of interfaces grounded in the sensemaking framework to support non-experts in auditing gender bias in image captioning. In a between-subjects study, 60 participants audited an image captioning model using one of three interface conditions: a Baseline interface, a Masking Tool for image manipulation, or a Filtering Tool for organizing captions. Our findings show that interface design shaped what participants noticed, how they interpreted model behavior, and supported their hypotheses. The Image Masking Tool enabled fine-grained testing of visual cues and context, while the Text Filtering Tool revealed broader asymmetries in gendered language. We argue that incorporating sensemaking into auditing practices can advance accountability and transparency in machine learning systems.