CHI '26 · Best paper · full-paper review · confidence high

Chasing Meaning and/or Insight? A Survey on Evaluation Practices at the Intersection of Visualization and the Humanities

Alejandro Benito-Santos, Florian Windhager, Aida Horaniet Ibañez, Rabea Kleymann, Alfie Abdul-Rahman, Eva Mayr

A standout contribution because it does not just complain that VIS*H evaluation is hard; it shows, with a broad survey and workflow analysis, where the field is methodologically stuck and why richer triangulation matters. Its strongest move is reframing rigor as a matter of evidence composition and epistemic fit, not just familiar validation rituals.

Axes Lens

Rare contribution shape, typical evidence profile. The point here is not a score. It is to show what kind of claim the paper makes, and whether the evidence pattern is unusual or baseline in this 268-review set.

Contribution shape

Knowledge form: method knowledge · typical · 29/268
Novelty type: synthesis · typical · 16/268
Abstraction level: field · typical · 41/268
Generalization target: field argument · typical · 55/268
Validation mode: survey synthesis · typical · 10/268

Evidence profile

Evidence strength: strong · typical · 158/268
Claim alignment: strong · typical · 231/268
Overclaim risk: low · typical · 53/268

Review Summary

This paper is impressive because it tackles a diffuse methodological problem at the scale of a field rather than at the scale of a single tool. The authors survey 171 VIS*H design studies and use that corpus to make a sharper argument than the usual call for “more evaluation.” Their evidence suggests that the field has settled into a methodological comfort zone dominated by case studies and interviews, but that popularity does not equal rigor. The paper’s real value is that it separates three things that are often conflated: the number of methods used, the specific kinds of methods used, and the broader epistemic question of what counts as valid evidence in humanities-facing visualization work. By combining descriptive coding with regression and workflow clustering, the authors show that stronger evaluations tend to emerge from strategically triangulated, evidence-rich combinations rather than from monomethod routines. Just as importantly, they do not stop at a procedural recommendation. They connect these empirical findings to deeper tensions between positivist validation norms in visualization and interpretivist, discursive traditions in the humanities. That makes the paper useful both as a practical benchmark for researchers and reviewers and as a conceptual intervention in how VIS*H should judge quality. The limitations are real: the coding lens is explicitly situated, and the corpus reflects published papers rather than all practice. But the authors acknowledge that situatedness directly, which strengthens rather than weakens the contribution. Overall, this is a benchmark synthesis that gives the field a vocabulary, an empirical baseline, and a more defensible account of what better evaluation could look like.

What Changed

Canon before

The dominant baseline in visualization evaluation assumes that rigorous methods such as controlled experiments, case studies, or interviews can objectively assess analytical utility and communication effectiveness. Within the humanities visualization community, evaluations typically rely on one or a few qualitative methods, such as case studies or interviews, with less emphasis on triangulation or reflexivity. VIS*H evaluation practices often do not explicitly address the complex epistemological and interpretive challenges posed by humanities scholarship, instead adopting validation frameworks suited to positivist, task-oriented inquiry.

Departure from common sense

The paper argues against the easy assumption that evaluation quality improves simply by doing familiar qualitative validation such as case studies or interviews. Instead, it shows that common VIS*H practice sits in a methodological comfort zone: monomethod evaluations are widespread, but stronger rigor comes from strategically combining richer forms of evidence rather than from relying on the popularity of methods alone.

Actual novelty

The contribution is a field-level synthesis of VIS*H evaluation practice based on a systematic survey of 171 design studies, paired with deeper analysis of how specific methods and recurring workflows relate to rigor. Rather than only counting methods, the paper uses regression and cluster analysis to characterize archetypical evaluation workflows and to distinguish frequent but weak practices from less common, more evidence-rich combinations.

Evidence

Evidence comes from a systematic survey of 171 VIS*H design studies, descriptive coding of evaluation methods and rigor, analysis of method co-occurrence and diversity, a multivariate regression framing the relation between methods and quality, and hierarchical clustering of recurrent workflows. The paper also grounds its interpretation in discussion of humanities-specific epistemic tensions and explicit author positionality.
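
To make the workflow-clustering step concrete, the following is a minimal sketch of how papers coded by evaluation method could be grouped into recurring workflows. It is not the authors' pipeline: the paper names, method labels, Jaccard distance, average linkage, and the choice of three clusters are all assumptions made for illustration, since this review does not state which distance or linkage the paper used.

```python
from itertools import combinations

# Hypothetical coding: each paper maps to the set of evaluation methods it
# reports. These rows are invented for illustration, not taken from the survey.
PAPERS = {
    "paper_a": {"case_study"},
    "paper_b": {"case_study"},
    "paper_c": {"interview"},
    "paper_d": {"case_study", "interview", "usability_test"},
    "paper_e": {"interview", "usability_test", "questionnaire"},
    "paper_f": {"case_study", "interview", "questionnaire", "usability_test"},
}


def jaccard_distance(a, b):
    """Return 1 - |A intersect B| / |A union B| (0.0 for identical sets)."""
    union = a | b
    if not union:
        return 0.0
    return 1.0 - len(a & b) / len(union)


def average_linkage(c1, c2, papers):
    """Mean pairwise Jaccard distance between the members of two clusters."""
    dists = [jaccard_distance(papers[p], papers[q]) for p in c1 for q in c2]
    return sum(dists) / len(dists)


def agglomerate(papers, n_clusters):
    """Repeatedly merge the two closest clusters until n_clusters remain."""
    clusters = [[name] for name in sorted(papers)]
    while len(clusters) > n_clusters:
        # Pick the pair of clusters with the smallest average-linkage distance.
        i, j = min(
            combinations(range(len(clusters)), 2),
            key=lambda ij: average_linkage(clusters[ij[0]], clusters[ij[1]], papers),
        )
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return [sorted(c) for c in clusters]


# Assumed cut at three clusters; a real analysis would justify this choice.
workflows = agglomerate(PAPERS, n_clusters=3)
for members in sorted(workflows):
    print(members)
```

On this toy input the two case-study-only papers form one group, the interview-only paper stays on its own, and the three multi-method papers merge into a third group, which mirrors the distinction the review draws between monomethod routines and triangulated workflows.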

“Following this, we apply hierarchical cluster analysis to identify recurrent methodological constellations, revealing the archetypical workflows that define the field’s current evaluative paradigms”

actual novelty · 6 Analysis of Evaluation Methods and Workflows · confidence 0.96

“they are characterized by a reliance on a single, primarily qualitative method. This finding suggests that, while case studies and interviews are essential for tool validation and gathering contextual insights, they are insufficient on their own. A simple walkthrough or an unstructured conversation often serves more to demonstrate a tool’s features than ”

departure from common sense · 7.1 The Methodological Comfort Zone · confidence 0.97

“We note that evaluation practices differ across global contexts and institutional traditions, and this situatedness may have influenced how we interpreted concepts such as evaluation quality and rigor”

limitation · 3.1 Team Positionality and Evaluation Lens · confidence 0.95

“To examine how this tension manifests in practice, we systematically surveyed 171 VIS*H design studies to analyze their evaluation workflows and rigor according to standard practice. Our findings reveal recurring flaws, such as an over-reliance on monomethod approaches, and show that higher-quality evaluations emerge from workflows that effectively triangulate diverse evidence”

validation scope · Abstract · confidence 0.99

Limits

Method limits

The study is bounded by its coded survey corpus and by an explicitly situated evaluation lens. The authors note that their emphasis on quantifiable features risks foregrounding forms of rigor familiar from empirical visualization research, and they acknowledge that their European institutional context may have influenced how concepts such as evaluation quality and rigor were interpreted.

Deployment limits

The paper studies published evaluation practices rather than deploying a new system in real-world settings. Its recommendations therefore primarily inform researchers, reviewers, and field-level methodological debates in VIS*H, not direct operational deployment outcomes or product adoption contexts.

Boundary conditions

Claims are scoped to visualization design studies in the humanities and to how those papers report evaluation workflows and rigor. The argument is especially relevant where interpretive, discursive, and humanities-specific criteria intersect with visualization evaluation, and it should not be read as a universal indictment of qualitative methods outside this VIS*H context.

Position in field

This paper positions itself as a bridge between mainstream visualization evaluation and humanities-oriented interpretivist critique. It contributes a rare field-wide empirical baseline for VIS*H evaluation practice while also arguing that future validation must better align with humanities theories, interpretive processes, and discursive forms of rigor.

Abstract

The intersection of visualization and the humanities (VIS*H) is marked by a tension between chasing analytical “insight” and interpretive “meaning.” The effectiveness of visualization techniques hinges on established evaluation frameworks that assess both analytical utility and communicative efficacy, creating a potential mismatch with the non-positivist, interpretive aims of humanities scholarship. To examine how this tension manifests in practice, we systematically surveyed 171 VIS*H design studies to analyze their evaluation workflows and rigor according to standard practice. Our findings reveal recurring flaws, such as an over-reliance on monomethod approaches, and show that higher-quality evaluations emerge from workflows that effectively triangulate diverse evidence. From these findings, we derive recommendations to refine quality and validation criteria for humanities visualizations, and juxtapose them to ongoing critical debates in the field, ultimately arguing for a paradigm shift that can reconcile the advantages of established validation techniques with the interpretive depth required for humanistic inquiry.