CHI '26 · Honorable mention · full-paper review · confidence medium-high

Red Teaming LLMs as Socio-Technical Practice: From Exploration and Data Creation to Evaluation

Adriana Alvarado Garcia, Ruyuan Wan, Ozioma Collins Oguine, Karla Badillo-Urquiola

This is a solid, honorable-mention-style CHI contribution: it does not propose a new red-teaming algorithm, but it usefully reframes red teaming as a socio-technical practice and backs that reframing with 22 practitioner interviews. The paper’s value is in exposing how dataset scope and evaluation standards are socially constructed. Its limits are those typical of interview-based evidence: it relies on practitioner self-report and measures no safety outcomes.

Axes Lens

Rare contribution shape, typical evidence profile. The point here is not a score. It is to show what kind of claim the paper makes, and whether the evidence pattern is unusual or baseline in this 268-review set.

Contribution shape

Knowledge form: descriptive knowledge · typical · 92/268
Novelty type: empirical finding · typical · 68/268
Abstraction level: practice · typical · 85/268
Generalization target: field argument · typical · 55/268
Validation mode: qualitative study · typical · 63/268

Evidence profile

Evidence strength: moderate · typical · 105/268
Claim alignment: strong · typical · 231/268
Overclaim risk: medium · typical · 210/268
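
To make the counts above easier to compare, the following is a minimal sketch that converts each "k/268" figure into a percentage. It assumes that "k/268" means k of the 268 reviews in the set carry the same label as this paper; the page does not state that definition, so treat it as an assumption. The names `TOTAL_REVIEWS`, `AXIS_COUNTS`, and `share` are illustrative and not part of the review.

```python
# Counts copied from the Axes Lens lines above.
# ASSUMPTION: "k/268" is read as "k of the 268 reviews share this label".
TOTAL_REVIEWS = 268

AXIS_COUNTS = {
    "Knowledge form: descriptive knowledge": 92,
    "Novelty type: empirical finding": 68,
    "Abstraction level: practice": 85,
    "Generalization target: field argument": 55,
    "Validation mode: qualitative study": 63,
    "Evidence strength: moderate": 105,
    "Claim alignment: strong": 231,
    "Overclaim risk: medium": 210,
}


def share(count, total=TOTAL_REVIEWS):
    """Return count/total as a percentage rounded to one decimal place."""
    return round(100 * count / total, 1)


# Percentage of the review set carrying each label (under the assumption above).
shares = {axis: share(count) for axis, count in AXIS_COUNTS.items()}

for axis, pct in shares.items():
    print(f"{axis}: {pct}% of reviews")
```

Under that reading, the least common label is the generalization target (55/268, about 20.5%) and the most common is strong claim alignment (231/268, about 86.2%).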

Review Summary

This paper’s strongest contribution is conceptual and empirical rather than technical. It takes a domain that is often discussed through benchmark metrics, attack success rates, and model scoring, and shows that the real work of red teaming is also about how practitioners define the dataset, decide what counts as harmful, and choose what kinds of interactions and users are represented. That is a meaningful CHI move because it relocates attention from the model alone to the practices that make evaluation possible. The abstract explicitly states that most existing work emphasizes technical benchmarks and attack success rates, while the paper examines the socio-technical practices of how red teaming datasets are defined, created, and evaluated.

The interview study design is appropriate for that question: 22 semi-structured interviews, each 40 to 60 minutes long, are enough to surface recurring practices, tensions, and vocabulary, but not enough to support broad causal claims about effectiveness or safety outcomes. So the paper is best understood as descriptive knowledge about a practice area, with field-level implications for how red teaming should be conceptualized. The main limitation is that the evidence supports practitioner perspectives and workflow patterns, not direct comparisons of red-teaming methods or measured improvements in model safety. In other words, the paper convincingly argues that dataset practices shape evaluation scope and accuracy, but it does not empirically show which red-teaming approach is superior.

That said, for CHI this is still a valuable contribution because it identifies a blind spot in current AI safety discourse: context, interaction type, and user specificity can be overlooked when evaluation is reduced to generic risk scoring. The paper therefore reads as a strong honorable-mention candidate: important, timely, and well aligned with HCI’s interest in socio-technical systems, while remaining appropriately bounded by qualitative evidence and practitioner self-report.

What Changed

Canon before

Prior CHI and adjacent work on LLM safety/red teaming has largely treated red teaming as a technical evaluation problem centered on benchmarks, attack success, and model scoring rather than as a situated socio-technical practice of dataset definition, creation, and evaluation.

Departure from common sense

The paper argues that red teaming should not be understood only as a technical benchmark or attack-success exercise; instead, the dataset practices themselves shape what counts as harm and what gets evaluated. That reframes evaluation as a socio-technical construction rather than a neutral measurement step.

Actual novelty

Its main novelty is empirical: 22 practitioner interviews are used to surface how red teaming datasets are conceptualized, created, and evaluated, and how practitioners’ backgrounds and risk framings influence what is included or omitted. The contribution is a practice-level account of red teaming work, not a new algorithm or benchmark.

Evidence

The paper’s evidence base is a qualitative interview study of 22 AI practitioners, each interviewed for 40 to 60 minutes, focused on how red teaming datasets are designed and evaluated. The abstract and findings support the claim that dataset practices are central to evaluation scope and that current approaches can overlook context, interaction type, and user specificity. The evidence is well aligned with a descriptive/practice contribution, but it does not directly validate downstream safety outcomes or compare alternative red-teaming methods experimentally.

“Recently, red teaming, with roots in security, has become a key evaluative approach to e”

actual novelty · Abstract · confidence 0.78

“Recently, red teaming, with roots in security, has become a key evaluative approach to e”

departure from common sense · Abstract/Introduction framing · confidence 0.66

“[54] Michael Muller and Angelika Strohmayer. 2022. Forgetting Practices in the Data Sciences. In ACM Conferences. Association for Computing Machinery, New York, NY, USA, 1–19”

limitation · Findings 4.3.2 · confidence 0.74

“3 Interview Study. We conducted 22 semi-structured interviews with AI practitioners, and each interview took between 40 to 60 minutes”

validation scope · Methods · confidence 0.60

Limits

Method limits

The study relies on semi-structured interviews rather than direct observation, artifact analysis at scale, or outcome-based evaluation of model safety. As a result, it supports claims about practitioner conceptualizations and workflows more strongly than claims about the effectiveness of any specific red-teaming practice.

Deployment limits

The findings are most applicable to organizations and teams already doing LLM red teaming or adjacent safety evaluation work. They are less directly transferable to settings without dedicated evaluation infrastructure, to non-LLM systems, or to contexts where red teaming is governed by different regulatory or operational constraints.

Boundary conditions

The conclusions are bounded by the interviewed practitioners’ perspectives and by the current state of red-teaming practice, including its reliance on evaluators, classifiers, and LLM judges. The paper’s implications are strongest where dataset scope, context coverage, and user specificity are still active design decisions.

Position in field

This paper sits in CHI’s growing socio-technical AI safety literature, shifting attention from benchmark performance to the labor, standards, and judgment practices that produce evaluation datasets. It is best read as a field-level reframing of red teaming rather than a technical advance in attack generation or scoring.

Abstract

Recently, red teaming, with roots in security, has become a key evaluative approach to ensure the safety and reliability of Generative Artificial Intelligence. However, most existing work emphasizes technical benchmarks and attack success rates, leaving the socio-technical practices of how red teaming datasets are defined, created, and evaluated under-examined. Drawing on 22 interviews with practitioners who design and evaluate red teaming datasets, we examine the data practices and standards that underpin this work. Because adversarial datasets determine the scope and accuracy of model evaluations, they are critical artifacts for assessing potential harms from large language models. Our contributions are first, empirical evidence of practitioners conceptualizing red teaming and developing and evaluating red teaming datasets. Second, we reflect on how practitioners’ conceptualization of risk leads to overlooking the context, interaction type, and user specificity. We conclude with three opportunities for HCI researchers to expand the conceptualization and data practices for red-teaming.