CHI '26 · Honorable mention · full-paper review · confidence medium

Gold Standard or Gold-Plated? Human Practices of Triple Verification in CSAM Takedown

Melissa Rottier, Michel van Eeten, Savvas Zannettou

This is a strong CHI-style empirical critique of a taken-for-granted safety procedure. Its value is not a new interface or algorithm, but a careful unpacking of how triple verification actually works in practice and when it fails to behave like a simple reliability guarantee.

Axes Lens

Rare contribution shape, typical evidence profile. The point here is not a score. It is to show what kind of claim the paper makes, and whether the evidence pattern is unusual or baseline in this 268-review set.

Contribution shape

Knowledge form: descriptive knowledge · typical · 92/268
Novelty type: empirical finding · typical · 68/268
Abstraction level: practice · typical · 85/268
Generalization target: organizational context · typical · 20/268
Validation mode: mixed methods · typical · 136/268

Evidence profile

Evidence strength: moderate · typical · 105/268
Claim alignment: medium · typical · 32/268
Overclaim risk: medium · typical · 210/268

Review Summary

The paper’s main contribution is to destabilize a common operational assumption in online safety: that adding more human reviewers automatically yields a more trustworthy classification process. The title already signals that tension, and the abstract shows the authors backing it with a mixed-methods design that is well matched to the question. Interviews with experts from multiple organizations map how triple verification is implemented and perceived, while the experiment with Dutch National Police experts tests whether agreement is stable across blind versus non-blind conditions and under varied ordering. The focus group then helps explain why disagreements arise. That combination matters because it moves the paper beyond a purely normative critique or a purely statistical reliability report; it connects organizational practice, expert judgment, and condition-dependent agreement. For CHI, that is a meaningful contribution in the space of content moderation, trust, and human decision-making under high-stakes constraints. At the same time, the evidence available here is mostly abstract-level, so the exact strength of the experimental claims, the statistical treatment, and the boundary conditions cannot be fully inspected from the supplied text alone. I would therefore read the paper as a solid empirical and conceptual intervention whose novelty lies in showing that triple verification is not a universal gold standard but a situated practice whose reliability depends on workflow design and content type.

What Changed

Canon before

Triple verification is commonly treated as a straightforward safeguard for CSAM classification; the paper questions that assumption by foregrounding how much implementations vary and how strongly reliability depends on voting conditions.

Departure from common sense

The title itself signals a challenge to the default belief that requiring three reviewers automatically produces a dependable gold standard. Framing the practice as potentially "gold-plated" suggests added procedure does not necessarily guarantee better judgment or reliability.

Actual novelty

The paper’s novelty is in combining a mapping of real-world triple-verification practices with an inter-rater reliability experiment and a focus group, so it does not just describe policy in general terms but examines how voting conditions and expert perceptions shape agreement in CSAM takedown work.

Evidence

The supplied metadata supports a mixed-methods contribution: interviews with 14 experts from seven organizations, an inter-rater reliability experiment in which Dutch National Police experts reviewed 2,031 images and videos under blind/non-blind and order variations, and a focus group to explain disagreements. The abstract also states that practices vary widely and agreement depends on voting conditions and content type.

“Gold Standard or Gold-Plated? Human Practices of Triple Verification in CSAM Takedown | Proceedings of the 2026 CHI Conference on Human Factors in Computing Systems”

actual novelty · Abstract · confidence 0.78

“Gold Standard or Gold-Plated? Human Practices of Triple Verification in CSAM Takedown | Proceedings of the 2026 CHI Conference on Human Factors in Computing Systems”

departure from common sense · Front Matter · confidence 0.40

“Gold Standard or Gold-Plated? Human Practices of Triple Verification in CSAM Takedown | Proceedings of the 2026 CHI Conference on Human Factors in Computing Systems”

limitation · No Discussion/Limitations/Conclusion/Future Work sections were provided in the focused sections · confidence 0.72

“Gold Standard or Gold-Plated? Human Practices of Triple Verification in CSAM Takedown | Proceedings of the 2026 CHI Conference on Human Factors in Computing Systems”

validation scope · Abstract · confidence 0.86

Limits

Method limits

The provided text does not include the paper’s full methods, statistical details, or discussion, so the exact experimental design, measures, and robustness checks cannot be verified here. The evidence available is enough to identify the study type and scope, but not enough to assess all methodological choices.
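For orientation only, here is a minimal sketch of how agreement among three reviewers is often quantified. It assumes Fleiss' kappa, a standard multi-rater statistic; the supplied text does not say which statistic the authors used. The function name `fleiss_kappa`, the labels, and the example votes are all hypothetical and are not taken from the paper.

```python
from collections import Counter


def fleiss_kappa(ratings):
    """Fleiss' kappa for a list of items, each a list of labels (one per rater).

    Assumption: every item is rated by the same number of raters (three in a
    triple-verification setting). This is an illustrative statistic, not
    necessarily the one used in the reviewed paper.
    """
    n_items = len(ratings)
    n_raters = len(ratings[0])
    if n_raters < 2 or any(len(item) != n_raters for item in ratings):
        raise ValueError("each item needs the same number (>= 2) of ratings")

    categories = sorted({label for item in ratings for label in item})
    counts = [Counter(item) for item in ratings]

    # Observed agreement: per item, the share of rater pairs that agree.
    per_item = [
        (sum(c[cat] ** 2 for cat in categories) - n_raters)
        / (n_raters * (n_raters - 1))
        for c in counts
    ]
    p_observed = sum(per_item) / n_items

    # Chance agreement from the overall proportion of each category.
    proportions = [
        sum(c[cat] for c in counts) / (n_items * n_raters) for cat in categories
    ]
    p_expected = sum(p * p for p in proportions)

    if p_expected == 1.0:
        return float("nan")  # kappa is undefined when only one label is ever used
    return (p_observed - p_expected) / (1 - p_expected)


# Hypothetical votes from three reviewers on four items (invented data).
example_votes = [
    ["csam", "csam", "csam"],
    ["csam", "csam", "not_csam"],
    ["not_csam", "not_csam", "not_csam"],
    ["csam", "not_csam", "not_csam"],
]
kappa = fleiss_kappa(example_votes)
```

A chance-corrected statistic of this kind matters for the paper's argument because raw percent agreement can look high simply when one label dominates the review queue.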

Deployment limits

The findings are situated in CSAM takedown workflows and expert review settings, so transfer to other moderation or safety domains may be limited by organizational policy, legal constraints, reviewer training, and content-specific judgment demands.

Boundary conditions

The abstract indicates that agreement depends on voting conditions and content type, implying that triple verification is not uniformly reliable across all review contexts. The title and abstract together suggest the practice should be evaluated as a context-sensitive organizational procedure rather than a universal safeguard.
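To make the condition-dependence concrete, the following sketch shows one way a unanimous-agreement rate could be broken down by voting condition and content type. The record layout, the field names (`condition`, `content_type`, `votes`), and every data value are assumptions made for illustration; they do not reproduce the paper's data or results.

```python
from collections import defaultdict


def unanimity_by_group(records):
    """Share of items with unanimous votes, per (condition, content_type).

    Each record is a dict with hypothetical keys 'condition', 'content_type',
    and 'votes' (one label per reviewer).
    """
    totals = defaultdict(int)
    unanimous = defaultdict(int)
    for rec in records:
        key = (rec["condition"], rec["content_type"])
        totals[key] += 1
        # An item counts as unanimous when all reviewers gave the same label.
        if len(set(rec["votes"])) == 1:
            unanimous[key] += 1
    return {key: unanimous[key] / totals[key] for key in totals}


# Invented records; real data would come from the review workflow.
records = [
    {"condition": "blind", "content_type": "image", "votes": ["csam", "csam", "csam"]},
    {"condition": "blind", "content_type": "image", "votes": ["csam", "not_csam", "csam"]},
    {"condition": "blind", "content_type": "video", "votes": ["not_csam", "not_csam", "not_csam"]},
    {"condition": "non_blind", "content_type": "image", "votes": ["csam", "csam", "csam"]},
    {"condition": "non_blind", "content_type": "image", "votes": ["csam", "csam", "csam"]},
    {"condition": "non_blind", "content_type": "video", "votes": ["csam", "not_csam", "not_csam"]},
]
rates = unanimity_by_group(records)
```

A breakdown like this is the kind of view in which a single pooled agreement figure would hide differences between blind and non-blind voting or between images and videos.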

Position in field

This reads as a critical CHI paper on trust, moderation labor, and safety operations: it interrogates a widely assumed verification norm, then grounds that critique in mixed-method evidence from practitioners and an experiment. The contribution is more diagnostic than prescriptive, with relevance to content moderation governance and human-in-the-loop safety review.

Abstract

Child sexual abuse material (CSAM) presents a critical challenge for online safety, yet the verification procedures that determine which items are classified as CSAM remain poorly understood. Triple verification (requiring three reviewers to agree) is promoted as a safeguard, but little is known about how it is implemented, how it is perceived by experts, and how voting conditions affect reliability. We address this gap through a mixed-methods study. We interviewed 14 experts from seven organizations (e.g., law enforcement, hotlines, etc.) to map current verification practices, then ran an inter-reliability experiment with Dutch National Police experts who reviewed 2,031 images and videos under different voting conditions (blind vs. non-blind, varied order). Finally, we held a focus group to explore the reasons behind disagreements. We find that practices vary widely, perceptions of triple verification reflect both safeguards and burdens, and expert agreement depends on voting conditions and content type.