CHI '26 · Best paper · full-paper review · confidence high

Writing with AI Can Reduce Gender Bias in Hiring Evaluations

Alicia T.H. Liu, Mina Lee, Xuechunzi Bai

This is a strong CHI contribution because it turns autocomplete from a productivity feature into a causal intervention on evaluative language, then shows both benefit and cost. The paper is especially valuable for refusing a simplistic debiasing story: competence and salary outcomes improve, but warmth and affiliative judgments can worsen.

Axes Lens

Rare contribution shape, typical evidence profile. The point here is not a score. It is to show what kind of claim the paper makes, and whether the evidence pattern is unusual or baseline in this 268-review set.

Contribution shape

Knowledge form: causal knowledge · typical · 31/268
Novelty type: interaction technique · less common · 7/268
Abstraction level: task · typical · 36/268
Generalization target: task class · typical · 63/268
Validation mode: controlled experiment · typical · 47/268

Evidence profile

Evidence strength: strong · typical · 158/268
Claim alignment: strong · typical · 231/268
Overclaim risk: low · typical · 53/268
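For readers who want the axis counts as proportions, the following is a minimal sketch that converts each count into a share of the 268-review set. The counts are copied from the labels above; the dictionary layout and the `share` helper are illustrative assumptions, and nothing here reproduces how the "typical" or "less common" labels were actually assigned.

```python
# Counts copied from the Axes Lens labels above; 268 is the size of the review set.
TOTAL_REVIEWS = 268

# Hypothetical layout: one entry per axis value reported for this paper.
axis_counts = {
    "Knowledge form: causal knowledge": 31,
    "Novelty type: interaction technique": 7,
    "Abstraction level: task": 36,
    "Generalization target: task class": 63,
    "Validation mode: controlled experiment": 47,
    "Evidence strength: strong": 158,
    "Claim alignment: strong": 231,
    "Overclaim risk: low": 53,
}


def share(count: int, total: int = TOTAL_REVIEWS) -> float:
    """Return count as a percentage of total, rounded to one decimal place."""
    return round(100 * count / total, 1)


shares = {label: share(count) for label, count in axis_counts.items()}

# Print from rarest to most common so the unusual axis values come first.
for label, pct in sorted(shares.items(), key=lambda item: item[1]):
    print(f"{pct:5.1f}%  {label}")
```

Sorted this way, the interaction-technique novelty type comes out as the rarest value in the list, which is consistent with it being the only axis labeled "less common".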

Review Summary

This paper stands out because it identifies a very plausible intervention point for bias mitigation: not post hoc auditing, not awareness training, and not abstract exhortations to be fair, but the moment-by-moment language production process through which evaluations are actually written. That is a meaningful HCI move. The authors use an autocomplete-style writing assistant to inject counter-stereotypical descriptors while participants evaluate candidates, and the experiment shows that these suggestions are not merely cosmetic. They shift downstream judgments in consequential ways, including competence-related impressions, trusted-leader judgments, and salary recommendations. The design is also theoretically grounded in stereotype content work, so the manipulation is not arbitrary; it targets the competence-versus-warmth structure that often organizes gendered evaluation.

Just as important, the paper does not overstate what happened. Hiring choice itself improved only directionally, and the same intervention that helped Jennifer on competence-linked dimensions also reduced warmth and enjoyment-of-working-with judgments. That backlash-like pattern is arguably the most important result, because it shows that changing one evaluative dimension can worsen another. In other words, the paper contributes more than a debiasing trick; it contributes a sharper understanding of the trade-offs involved when AI systems reshape social description.

The limitations section further improves credibility by being explicit about the prototype nature of the system, the use of online non-recruiter participants, the single occupation, the binary-gender framing, and the somewhat artificial requirement that users view suggestions repeatedly. So the right reading is not that AI writing assistants solve hiring bias, but that they open a credible and testable design space for language-level interventions in high-stakes evaluation. That is a substantial contribution for CHI because it combines interaction design, social computing, and causal evidence while remaining appropriately cautious about deployment.

What Changed

Canon before

Existing stereotype interventions focus on raising awareness or providing counter-stereotypical role models, on the assumption that reflection or exposure will produce behavioral change. In practice they often produce mixed or limited results and are cognitively costly. Language use perpetuates stereotypes, and changing language has high cognitive and social costs. AI writing assistants have not yet been purposefully designed to reduce gender bias through subtle, language-level interventions in high-stakes decision-making contexts.

Departure from common sense

Counter-stereotypical AI-generated writing suggestions can increase the perceived competence of female candidates and reduce salary gaps. Yet they can simultaneously reduce those candidates' likability without significantly increasing hiring rates. This backlash effect is at odds with the assumption that raising perceived competence straightforwardly improves hiring outcomes.

Actual novelty

This study presents a novel intervention that uses an AI autocomplete writing assistant to subtly shift evaluative language toward counter-stereotypical competence traits during hiring evaluations. In a large preregistered experiment, these linguistic nudges measurably influenced written evaluations, trait impressions, salary recommendations, and affiliative judgments, and they exposed nuanced trade-offs, including gender backlash effects.

Evidence

Strong evidence from a preregistered online experiment with 672 participants shows that AI autocomplete suggestions changed the language used in résumé evaluations, which in turn shifted competence, leadership, warmth, and salary-related judgments. The paper also explicitly reports backlash-like effects and substantial ecological and generalizability limits, so the central causal claim is persuasive but bounded by the simulated hiring setting.

“We conducted an online experiment (N = 672) in which participants viewed two résumés, one female (“Jennifer”) and one male (“John”), and wrote short evaluations with the help of an autocomplete tool. When writing about the female candidate, we configured the writing assistant to generate either gender neutral, stereotypical, or counter-stereotypical completions, while ”

actual novelty · 3 Methods · confidence 0.98

“ decreasing warmth-related evaluations. Overall, these results suggest that counter-stereotypical suggestions improved Jennifer’s standing as a trusted leader, suggesting shifts in participants’ written evaluations carried over into measurable and consistent changes in their attitudes. However, these suggestions simultaneously reduced affiliat”

departure from common sense · 4.3 Did the AI writing assistant affect participants’ impressions of the candidates? · confidence 0.97

“ The design focused on binary gender, a single job type, and one AI system, which also limits generalizability. As our intervention is rooted in stereotype content model and the warmth-competence dimension that underlie universal social impression formation, we are hopeful that it may be successfully extended towards other stereotypes (e.g., race, age, nationality).”

limitation · 5.4 Limitations and future directions · confidence 0.99

“We conducted an online experiment (N = 672) in which participants viewed two résumés, one female (“Jennifer”) and one male (“John”), and wrote short evaluations with the help of an autocomplete tool. When writing about the female candidate, we configured the writing assistant to generate either gender neut”

validation scope · 3 Methods · confidence 0.96

Limits

Method limits

The study used online participants rather than professional recruiters, manipulated suggestions only for the woman candidate, focused on a single Financial Analyst role, and constrained participants to short evaluations under time pressure. It also required participants to view suggestions at least eight times, which improves control but may not reflect naturalistic autocomplete use.
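The design constraints described above can be made concrete with a small sketch. It encodes only what this review states: three suggestion conditions, manipulation applied to Jennifer only, neutral suggestions for John, and a minimum of eight suggestion views. Every name here (`CONDITIONS`, `suggestion_style`, `EvaluationSession`) is hypothetical, and this is not the authors' implementation.

```python
from dataclasses import dataclass

# Condition labels follow the review's wording; the data layout is an assumption.
CONDITIONS = ("neutral", "stereotypical", "counter_stereotypical")

# "At least eight times", as reported in the method limits above.
MIN_SUGGESTION_VIEWS = 8


def suggestion_style(condition: str, candidate: str) -> str:
    """Return the style of completion offered while writing about a candidate."""
    if condition not in CONDITIONS:
        raise ValueError(f"unknown condition: {condition!r}")
    # Only Jennifer's suggestions were manipulated; John's stayed neutral.
    return condition if candidate == "Jennifer" else "neutral"


@dataclass
class EvaluationSession:
    """Hypothetical per-evaluation state tracking how many suggestions were seen."""

    condition: str
    candidate: str
    suggestion_views: int = 0

    def record_view(self) -> None:
        """Count one displayed suggestion."""
        self.suggestion_views += 1

    def may_submit(self) -> bool:
        """Allow submission only after the minimum number of views."""
        return self.suggestion_views >= MIN_SUGGESTION_VIEWS
```

Under these assumptions, `suggestion_style("counter_stereotypical", "John")` returns `"neutral"`, which makes the asymmetry of the manipulation explicit. The `may_submit` gate shows why the viewing requirement improves control: exposure is guaranteed in the study, whereas in everyday autocomplete use a writer can ignore suggestions entirely.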

Deployment limits

The paper notes open governance questions about who decides which language patterns to counter, how model behavior is monitored, and how responsibility is assigned when biased outcomes still occur. It also emphasizes that adoption of AI-assisted writing varies across organizations and some hiring workflows may not centrally involve writing interfaces where autocomplete can intervene.

Boundary conditions

Effects were demonstrated in a controlled online hiring simulation with binary gender, one job type, one AI system, and neutral suggestions for the male candidate. The authors explicitly note that longer-form writing, other occupations, different user populations, real organizational workflows, and repeated longitudinal exposure may produce different effects, including persistence or backlash.

Position in field

The paper extends stereotype-intervention research by moving from explicit awareness or role-model approaches to an embedded language-production intervention. Within HCI, it contributes evidence that AI writing assistants can function as equity-oriented interaction techniques, while also showing that such interventions can redistribute impressions across competence and warmth rather than simply eliminating bias.

Abstract

Women remain underrepresented in the workplace, partly due to stereotypes associating competence traits with men rather than women. Efforts to change such stereotypes often yield mixed results. As language models become integrated into daily life, AI writing assistants offer an opportunity to shift gender images. In a preregistered experiment (N=672), participants evaluated résumés for a female ("Jennifer") and a male ("John") candidate applying to a financial analyst role. They wrote evaluations using AI-generated suggestions in one of three conditions: suggestions for Jennifer integrated stereotypically male, female, or neutral traits. Suggestions for John remained neutral. Participants exposed to male-trait suggestions evaluated Jennifer as more competent, selected her as the leader, and offered higher salaries. However, we also observed signs of backlash: participants were less willing to work with competent Jennifer. We discuss implications for designing AI writing assistants to mitigate gender bias in hiring contexts.