Red Teaming LLMs as Socio-Technical Practice: From Exploration and Data Creation to Evaluation
This is a solid CHI honorable-mention-style contribution: it does not propose a new red-teaming algorithm, but it usefully reframes red teaming as a socio-technical practice and backs that reframing with 22 practitioner interviews. The paper’s value is in exposing how dataset scope and evaluation standards are socially constructed; its limits are those of interview-based evidence: practitioner self-report, with no direct observation or outcome measurement.
Axes Lens
Rare contribution shape, typical evidence profile. The point here is not a score. It is to show what kind of claim the paper makes, and whether the evidence pattern is unusual or baseline in this 268-review set.
Contribution shape
- Knowledge form: descriptive knowledge (typical · 92/268)
- Novelty type: empirical finding (typical · 68/268)
- Abstraction level: practice (typical · 85/268)
- Generalization target: field argument (typical · 55/268)
- Validation mode: qualitative study (typical · 63/268)
Evidence profile
- Evidence strength: moderate (typical · 105/268)
- Claim alignment: strong (typical · 231/268)
- Overclaim risk: medium (typical · 210/268)
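To make the "typical" labels easier to weigh, the counts above can be turned into shares of the 268-review set. The following is a minimal Python sketch, not part of any review tooling: the dictionary keys, the `share` helper, and the output format are illustrative assumptions, and only the counts and the 268 denominator come from the lists above. No cutoff between "typical" and "rare" is assumed, since none is stated here.

```python
# Counts copied from the Axes Lens lists above; 268 is the size of the review set.
TOTAL_REVIEWS = 268

# Keys are illustrative labels (axis: value); values are the reported counts.
axis_counts = {
    "Knowledge form: descriptive knowledge": 92,
    "Novelty type: empirical finding": 68,
    "Abstraction level: practice": 85,
    "Generalization target: field argument": 55,
    "Validation mode: qualitative study": 63,
    "Evidence strength: moderate": 105,
    "Claim alignment: strong": 231,
    "Overclaim risk: medium": 210,
}


def share(count: int, total: int = TOTAL_REVIEWS) -> float:
    """Return the fraction of reviews in the set that carry this axis value."""
    return count / total


# Print each axis value with its raw count and its share of the review set.
for label, count in axis_counts.items():
    print(f"{label}: {count}/{TOTAL_REVIEWS} = {share(count):.1%}")
```

Read this way, the contribution-shape values each cover roughly a fifth to a third of the set, while the claim-alignment and overclaim-risk values cover a large majority of it.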
Review Summary
This paper’s strongest contribution is conceptual and empirical rather than technical. It takes a domain that is often discussed through benchmark metrics, attack success rates, and model scoring, and shows that the real work of red teaming is also about how practitioners define the dataset, decide what counts as harmful, and choose what kinds of interactions and users are represented. That is a meaningful CHI move because it relocates attention from the model alone to the practices that make evaluation possible. The abstract explicitly states that most existing work emphasizes technical benchmarks and attack success rates, while the paper examines the socio-technical practices of how red teaming datasets are defined, created, and evaluated.
The interview study design is appropriate for that question: 22 semi-structured interviews, each 40 to 60 minutes, is enough to surface recurring practices, tensions, and vocabulary, but not enough to support broad causal claims about effectiveness or safety outcomes. So the paper is best understood as descriptive knowledge about a practice area, with field-level implications for how red teaming should be conceptualized.
The main limitation is that the evidence supports practitioner perspectives and workflow patterns, not direct comparisons of red-teaming methods or measured improvements in model safety. In other words, the paper convincingly argues that dataset practices shape evaluation scope and accuracy, but it does not empirically prove which red-teaming approach is superior. That said, for CHI this is still a valuable contribution because it identifies a blind spot in current AI safety discourse: context, interaction type, and user specificity can be overlooked when evaluation is reduced to generic risk scoring.
The paper therefore reads as a strong honorable-mention candidate: important, timely, and well aligned with HCI’s interest in socio-technical systems, while remaining appropriately bounded by qualitative evidence and practitioner self-report.
What Changed
Canon before
Prior CHI and adjacent work on LLM safety/red teaming has largely treated red teaming as a technical evaluation problem centered on benchmarks, attack success, and model scoring rather than as a situated socio-technical practice of dataset definition, creation, and evaluation.
Departure from common sense
The paper argues that red teaming should not be understood only as a technical benchmark or attack-success exercise; instead, the dataset practices themselves shape what counts as harm and what gets evaluated. That reframes evaluation as a socio-technical construction rather than a neutral measurement step.
Actual novelty
Its main novelty is empirical: 22 practitioner interviews are used to surface how red teaming datasets are conceptualized, created, and evaluated, and how practitioners’ backgrounds and risk framings influence what is included or omitted. The contribution is a practice-level account of red teaming work, not a new algorithm or benchmark.
Evidence
The paper’s evidence base is a qualitative interview study of 22 AI practitioners, each interviewed for 40 to 60 minutes, focused on how red teaming datasets are designed and evaluated. The abstract and findings support the claim that dataset practices are central to evaluation scope and that current approaches can overlook context, interaction type, and user specificity. The evidence is well aligned with a descriptive/practice contribution, but it does not directly validate downstream safety outcomes or compare alternative red-teaming methods experimentally.
“Recently, red teaming, with roots in security, has become a key evaluative approach to e”
actual novelty · Abstract · confidence 0.78
“Recently, red teaming, with roots in security, has become a key evaluative approach to e”
departure from common sense · Abstract/Introduction framing · confidence 0.66
“[54] Michael Muller and Angelika Strohmayer. 2022. Forgetting Practices in the Data Sciences. In ACM Conferences. Association for Computing Machinery, New York, NY, USA, 1–19”
limitation · Findings 4.3.2 · confidence 0.74
“3 Interview Study We conducted 22 semi-structured interviews with AI practitioners, and each interview took between 40 to 60 minutes”
validation scope · Methods · confidence 0.60
Limits
Method limits
The study relies on semi-structured interviews rather than direct observation, artifact analysis at scale, or outcome-based evaluation of model safety. As a result, it supports claims about practitioner conceptualizations and workflows more strongly than claims about the effectiveness of any specific red-teaming practice.
Deployment limits
The findings are most applicable to organizations and teams already doing LLM red teaming or adjacent safety evaluation work. They are less directly transferable to settings without dedicated evaluation infrastructure, to non-LLM systems, or to contexts where red teaming is governed by different regulatory or operational constraints.
Boundary conditions
The conclusions are bounded by the interviewed practitioners’ perspectives and by the current state of red-teaming practice, including its reliance on human evaluators, classifiers, and LLM judges. The paper’s implications are strongest where dataset scope, context coverage, and user specificity are still open design decisions.
Position in field
This sits in CHI’s growing socio-technical AI safety literature by shifting attention from benchmark performance to the labor, standards, and judgment practices that produce evaluation datasets. It is best read as a field-level reframing of red teaming rather than a technical advance in attack generation or scoring.