Sensemaking in User-Driven Algorithm Auditing: A Case Study on Gender Bias in an Image Captioning Model
This best paper makes a strong HCI contribution by showing that non-experts can do meaningful algorithm auditing when interfaces are designed around sensemaking. Its most important insight is that interface affordances actively shape which harms become visible, not just how efficiently users inspect outputs.
Axes Lens
Rare contribution shape, typical evidence profile. The point here is not a score. It is to show what kind of claim the paper makes, and whether the evidence pattern is unusual or baseline in this 268-review set.
Contribution shape
- Knowledge form: descriptive knowledge (typical · 92/268)
- Novelty type: framework (typical · 59/268)
- Abstraction level: practice (typical · 85/268)
- Generalization target: task class (typical · 63/268)
- Validation mode: mixed methods (typical · 136/268)
Evidence profile
- Evidence strength: strong (typical · 158/268)
- Claim alignment: strong (typical · 231/268)
- Overclaim risk: medium (typical · 210/268)
Review Summary
This paper stands out because it reframes algorithm auditing as an interpretive, interface-mediated practice rather than a purely technical evaluation task reserved for experts. The authors do not simply claim that non-experts matter; they build three concrete interface conditions grounded in sensemaking and then show, through a 60-participant between-subjects study, that these designs lead users toward different forms of evidence and different kinds of bias hypotheses. That is the real contribution: the work demonstrates that auditing outcomes are partly produced by the structure of the interface itself. The masking tool supports localized probing of visual cues and context, while the filtering tool helps users compare outputs at a broader linguistic level and notice asymmetries in how men and women are described. The thematic analysis further shows that participants surfaced multiple intertwined forms of gender bias rather than a single simplistic stereotype effect. At the same time, the paper is appropriately bounded. Its dataset is fixed and pre-selected around occupational gender bias, the participant pool was not intentionally constructed to foreground marginalized perspectives, and the study analyzes final audit products more than the temporal process of sensemaking. Those limitations matter because they constrain claims about broader harms, collaborative auditing, and real-world deployment. Even so, the paper makes a substantial contribution to HCI and responsible AI by showing that interface design is epistemically consequential in auditing: it changes what users can see, compare, hypothesize, and ultimately report as harm.
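The localized probing that the masking tool supports can be pictured as "hide a region, re-caption, compare". The sketch below is a minimal illustration under stated assumptions, not the paper's implementation: the nested-list image representation, `mask_region`, `probe_with_mask`, and `toy_captioner` are all hypothetical names, and the toy captioner is a stand-in that bears no relation to the audited model.

```python
from typing import Callable, List, Tuple

# Assumption: a grayscale image stored as rows of pixel intensities.
Image = List[List[int]]


def mask_region(image: Image, top: int, left: int,
                height: int, width: int, fill: int = 0) -> Image:
    """Return a copy of `image` with a rectangle overwritten by `fill`."""
    masked = [row[:] for row in image]  # copy so the original stays intact
    for r in range(max(top, 0), min(top + height, len(masked))):
        row = masked[r]
        for c in range(max(left, 0), min(left + width, len(row))):
            row[c] = fill
    return masked


def probe_with_mask(caption_fn: Callable[[Image], str], image: Image,
                    top: int, left: int, height: int,
                    width: int) -> Tuple[str, str, bool]:
    """Caption the image before and after masking; report whether it changed."""
    before = caption_fn(image)
    after = caption_fn(mask_region(image, top, left, height, width))
    return before, after, before != after


def toy_captioner(image: Image) -> str:
    """Toy stand-in, NOT the audited model: wording depends on one pixel."""
    return "a man in a kitchen" if image[0][0] > 128 else "a person in a kitchen"


example_image: Image = [[200, 10], [10, 10]]
result = probe_with_mask(toy_captioner, example_image,
                         top=0, left=0, height=1, width=1)
```

A caption that changes after masking suggests the hidden region was acting as a cue for the model's wording, which is the kind of evidence a participant could record against a bias hypothesis.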
What Changed
Canon before
Prior work assumes algorithm auditing is mainly an expert-driven process that focuses on performance metrics or guided explanations. Audit tools typically constrain user agency and focus on isolated bias patterns rather than supporting open-ended, non-expert sensemaking. Gender bias in image captioning has been studied via benchmarks, but not through user audits that reveal how biases interact or how auditors interpret them socially.
Departure from common sense
The paper argues against the idea that meaningful algorithm auditing must remain expert-led. With sensemaking-oriented interfaces, non-expert participants were able to surface multiple interrelated forms of gender bias and produce structured hypotheses and evidence. It also shows that interface affordances do not merely assist auditing efficiency; they materially shape which bias patterns users notice and how they interpret them.
Actual novelty
The main contribution is a sensemaking-grounded design and empirical comparison of three user-driven auditing interfaces for image captioning bias: a baseline interface, an image masking tool, and a text filtering tool. The novelty is not just the tools themselves, but the demonstration that different interface supports systematically steer non-expert auditors toward different kinds of bias findings, spanning visual-cue testing, broader linguistic asymmetries, and confidence formation.
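To make the text filtering idea concrete, here is a minimal sketch of querying captions by word and tallying the other words that appear alongside it. It assumes captions are plain strings; `filter_captions`, `co_occurring_words`, the stopword list, and the example captions are all invented for illustration and do not come from the study or its interface.

```python
import re
from collections import Counter
from typing import Iterable, List

# Assumption: a small hand-picked stopword list, for illustration only.
STOPWORDS = frozenset({"a", "an", "the", "in", "on", "of", "with", "is"})

# Made-up captions, NOT model outputs from the study.
EXAMPLE_CAPTIONS = [
    "a man standing in a kitchen",
    "a woman standing in a kitchen",
    "a woman holding a cake",
    "a man in a suit",
]


def filter_captions(captions: Iterable[str], word: str) -> List[str]:
    """Keep captions that contain `word` as a whole word, ignoring case."""
    pattern = re.compile(rf"\b{re.escape(word)}\b", re.IGNORECASE)
    return [caption for caption in captions if pattern.search(caption)]


def co_occurring_words(captions: Iterable[str], word: str) -> Counter:
    """Count non-stopword tokens in the captions that mention `word`."""
    counts: Counter = Counter()
    for caption in filter_captions(captions, word):
        tokens = re.findall(r"[a-z']+", caption.lower())
        counts.update(t for t in tokens
                      if t != word.lower() and t not in STOPWORDS)
    return counts


men = co_occurring_words(EXAMPLE_CAPTIONS, "man")
women = co_occurring_words(EXAMPLE_CAPTIONS, "woman")
```

Comparing `men` and `women` side by side is one way to spot the broader linguistic asymmetries described above. The whole-word match matters here: a plain substring search for "man" would also return every caption containing "woman".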
Evidence
The paper is grounded in a between-participants study with 60 participants across three interface conditions, combined with thematic analysis of participants’ bias cards and quantitative comparisons across conditions. Evidence supports both the existence of multiple discovered bias themes and the claim that interface design shaped audit outcomes and confidence.
“We designed a web-based interface that scaffolds the key stages of the sensemaking process as described in Cabrera et al. [4] (and depicted in Figure 1), including interpreting instances and outputs, aggregating them into schemas and generating hypotheses, and organizing evidence.”
actual novelty · 3.3 Auditing interfaces · confidence 0.93
“A Text Filtering Tool enabled participants to query the dataset based on words present in the output captions, supporting comparisons across similar instances. We performed a mixed-methods analysis of the audit outcomes. Our findings show that non-expert participants uncovered a range of gender biases in the model’s behavior, including th”
departure from common sense · 1 Introduction · confidence 0.96
“While this study offers empirical insights into how non-expert users audit algorithmic behavior, it suffers several limitations. First, participants audited a fixed set of 80 images pre-selected for gender bias in occupational contexts, which oriented sensemaking toward occupational stereotypes while excluding other forms of harm, such as intersectional dynamics of race, age, or ability”
limitation · 5.4 Limitations and Future Directions · confidence 0.97
“4 Study Design. The study follows a between-participants design with three conditions: baseline: auditing with the Baseline interface (original interface design, Figure 2); masking: auditing with the baseline interface augmented by the Image Masking Tool (Figure 3); filtering: auditing with the baseline interface augmented by the Text Filtering Tool (Figure 4).”
validation scope · 3.4 Study Design · confidence 0.95
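In a between-participants design, each participant sees exactly one condition. The sketch below shows one common way to allocate participants, shuffling and then dealing them round-robin. The review does not state how the authors assigned participants or how large each group was, so the equal group sizes, the seed, and the `assign_conditions` helper are assumptions made purely for illustration.

```python
import random
from typing import Dict, Hashable, Iterable, Optional, Sequence

# Condition names taken from the study design quoted above.
CONDITIONS = ("baseline", "masking", "filtering")


def assign_conditions(participant_ids: Iterable[Hashable],
                      conditions: Sequence[str] = CONDITIONS,
                      seed: Optional[int] = None) -> Dict[Hashable, str]:
    """Shuffle participants, then deal them round-robin across conditions.

    Group sizes differ by at most one; each participant gets one condition.
    """
    rng = random.Random(seed)  # local RNG so the global state is untouched
    ids = list(participant_ids)
    rng.shuffle(ids)
    return {pid: conditions[i % len(conditions)] for i, pid in enumerate(ids)}


# Hypothetical participant IDs 0..59 and an arbitrary seed.
assignment = assign_conditions(range(60), seed=0)
group_sizes = {c: sum(1 for v in assignment.values() if v == c)
               for c in CONDITIONS}
```

With 60 participants and three conditions, this scheme produces 20 participants per group. Fixing the seed makes the allocation reproducible for later reporting.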
Limits
Method limits
The study used a fixed set of 80 pre-selected occupational images centered on gender bias, which narrows the harms participants could surface. The sample did not intentionally include marginalized communities, and the analysis emphasized final audit artifacts rather than detailed trajectories through sensemaking.
Deployment limits
The evaluation was conducted in a controlled, individual study setting rather than embedded in everyday AI use. The interfaces were not tested in collaborative or longitudinal auditing contexts, so practical performance in situated real-world auditing remains uncertain.
Boundary conditions
Findings are bounded to non-expert auditing of gender bias in an image captioning model using curated occupational images and three specific interface conditions. Generalization to other harms, domains, datasets, or naturally occurring audit settings should be made cautiously.
Position in field
This paper extends user-driven algorithm auditing by operationalizing sensemaking as an interface design principle and showing how concrete affordances shape what non-experts can discover. It connects HCI theories of sensemaking with practical AI auditing support, moving the field beyond expert-only audits and beyond tools focused mainly on explanation or performance inspection.