On the Computational Reproducibility of Human-Computer Interaction
This is a strong CHI meta-science paper with clear field relevance. Its main contribution is not a new interaction technique but a mixed-methods reproducibility audit that turns open-science rhetoric into measurable evidence. The scope is appropriately narrow, and the paper is careful to separate reproducibility from broader validity claims.
Axes Lens
Rare contribution shape, typical evidence profile. The point here is not a score. It is to show what kind of claim the paper makes, and whether the evidence pattern is unusual or baseline in this 268-review set.
Contribution shape
- Knowledge form: descriptive knowledge (typical · 92/268)
- Novelty type: empirical finding (typical · 68/268)
- Abstraction level: field (typical · 41/268)
- Generalization target: field argument (typical · 55/268)
- Validation mode: mixed methods (typical · 136/268)
Evidence profile
- Evidence strength: strong (typical · 158/268)
- Claim alignment: strong (typical · 231/268)
- Overclaim risk: low (typical · 53/268)
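The counts in the two lists above are easier to compare as shares of the review set. The following is a minimal sketch, assuming each count is the number of reviews in the 268-review set that carry the same label; the names `TOTAL_REVIEWS`, `axis_counts`, and `share` are illustrative and not part of the review tooling.

```python
# Counts copied from the axes lists above. Reading each count as "number of
# reviews with the same label" is an assumption, not something the page states.
TOTAL_REVIEWS = 268

axis_counts = {
    "Knowledge form": 92,
    "Novelty type": 68,
    "Abstraction level": 41,
    "Generalization target": 55,
    "Validation mode": 136,
    "Evidence strength": 158,
    "Claim alignment": 231,
    "Overclaim risk": 53,
}


def share(count, total=TOTAL_REVIEWS):
    """Return count as a percentage of total, rounded to one decimal place."""
    return round(100 * count / total, 1)


shares = {axis: share(count) for axis, count in axis_counts.items()}

for axis, pct in shares.items():
    print(f"{axis}: {pct}% of reviews")
```

Under that assumption, the shares range from about 15% (abstraction level) to about 86% (claim alignment), so the labels on different axes are far from equally common.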
Review Summary
This paper is best read as a field-level methodological contribution rather than a conventional HCI systems or interaction paper. The authors do something concrete and valuable: they identify CHI papers that shared study data and analysis code, attempt to reproduce the reported results, and then use surveys and interviews to understand why reproducibility succeeds or fails. That combination gives the paper both a descriptive benchmark and explanatory context. The abstract’s 49% full reproduction rate is a memorable result, but the more important contribution is the framing: shared artifacts are only meaningful if they actually enable third parties to rerun the analysis and recover the reported findings. The paper also makes a normative move that is somewhat stronger than common practice in HCI, arguing that computationally derived evidence should be reproducible by default unless there is a valid reason it cannot be recomputed. At the same time, the authors are careful about scope. They explicitly say they did not evaluate the correctness of the statistical analysis or the validity of the study, and that their results reflect reporting quality rather than overall paper quality. That restraint matters because it keeps the evidence aligned with the claim. The limitations are also credible: they relied on openly provided repositories, did not contact authors when materials were missing or troubleshooting failed, and therefore may undercount reproducibility in some cases. Overall, the paper’s value is in making reproducibility in HCI measurable, discussable, and actionable, while avoiding the common mistake of overgeneralizing from a computational rerun audit to broader scientific validity.
What Changed
Canon before
Prior CHI open-science discussions often treated sharing data and code as a positive practice, but did not establish a field-wide empirical baseline for whether shared computational artifacts actually let others rerun and recover reported results.
Departure from common sense
The paper makes a stronger-than-default methodological claim: shared data and code should not be treated as sufficient in themselves, because the default expectation is that computationally derived evidence ought to be reproducible by third parties unless there is a valid reason it cannot be recomputed.
Actual novelty
Its novelty is empirical and field-level: the authors systematically attempted to reproduce CHI papers that shared study data and analysis code, then complemented that with author surveys and interviews to explain why reproducibility succeeds or fails.
Evidence
The paper combines a reproducibility audit of CHI papers with evidence from author surveys and interviews. The abstract states that the authors identified all CHI papers sharing study data and analysis code and attempted to reproduce them, reporting a 49% full reproduction rate. The paper explicitly limits the evaluation to whether the reported results can be obtained from the provided data and code, not whether the statistics are correct or the study is valid. The limitations note that the authors relied on openly provided repositories and did not contact the original authors when materials were missing or troubleshooting failed.
“To better understand the practices leading to such reproducibility rates, we conducted a survey and interviews with the authors of the sampled papers, asking them about their perceived reproducibility, motivation for engaging in open science, and the obstacles authors face in making data and analysis code reproducible”
actual novelty · Abstract/Introduction and Section 3 (Reproducing CHI) · confidence 0.80
“But the default assumption should still be that computationally derived evidence should be reproducible unless there is a valid reason it can not be recomputed by third party”
departure from common sense · Section 5.1 (Why reproducibility is essential) · confidence 0.72
Limits
Method limits
The study measures computational reproducibility only: it checks whether provided data and code can recreate reported results. It does not assess whether the statistical analysis is correct or whether the underlying study is valid, so the findings should not be read as a general quality audit of the papers.
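As a rough illustration of what such a check involves, the sketch below compares values reported in a paper with values recomputed from its shared data and code. It is not the authors' procedure: the function name `classify_reproduction`, the outcome labels `full`, `partial`, and `none`, the tolerances, and the example statistics are all hypothetical, and the review does not describe the paper's actual matching criteria.

```python
import math


def classify_reproduction(reported, recomputed, rel_tol=1e-3, abs_tol=1e-9):
    """Classify a rerun by how many reported values it recovers.

    reported and recomputed map a result name to a number. The labels and
    tolerances are illustrative assumptions, not the paper's criteria.
    """
    if not reported:
        raise ValueError("reported must contain at least one value")

    # Count reported values that the rerun recovers within tolerance.
    matched = sum(
        1
        for name, value in reported.items()
        if name in recomputed
        and math.isclose(value, recomputed[name], rel_tol=rel_tol, abs_tol=abs_tol)
    )

    if matched == len(reported):
        return "full"
    if matched > 0:
        return "partial"
    return "none"


# Hypothetical reported statistics, for illustration only.
reported = {"mean_task_time": 12.4, "t_statistic": 2.31}

print(classify_reproduction(reported, {"mean_task_time": 12.4, "t_statistic": 2.31}))
print(classify_reproduction(reported, {"mean_task_time": 12.4, "t_statistic": 2.50}))
print(classify_reproduction(reported, {}))
```

A check of this kind only says whether the numbers can be recovered. A rerun classified as `full` here could still rest on an inappropriate statistical test, which is the distinction this limit draws.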
Deployment limits
The conclusions are most applicable to CHI papers that share usable data and analysis code in public repositories. They do not directly transfer to papers without shared artifacts, to studies whose outputs depend on unavailable environment details, or to contexts where authors can be contacted to resolve missing pieces.
Boundary conditions
The paper’s claims are bounded by the availability and completeness of shared repositories, the authors’ troubleshooting process, and the specific CHI corpus they sampled. Reproducibility here is operationalized as rerunning provided computational artifacts, not as replication of the study in a broader sense.
Position in field
This work sits at the intersection of open science, reproducibility auditing, and HCI methods. It provides a field-level benchmark and a normative argument that reproducibility should be treated as a desirable minimum for computational evidence in HCI.