CHI '26 · Honorable mention · full-paper review · confidence medium-high

Oops, I Did It Again (But I Know It): Robot Failure Consistency and Awareness in Human-Robot Collaboration

Ramtin Tabatabaei, Vassilis Kostakos, Wafa Johal

This is a solid CHI honorable-mention contribution because it moves the conversation from “how bad was the failure?” to “what pattern of failures did users just experience?” The main value is empirical: a controlled study shows that failure sequence and robot awareness both matter, and the claims stay appropriately bounded to scripted lab interactions.

Axes Lens

Rare contribution shape, typical evidence profile. The point here is not a score. It is to show what kind of claim the paper makes, and whether the evidence pattern is unusual or baseline in this 268-review set.

Contribution shape

Knowledge form: causal knowledge · typical · 31/268
Novelty type: empirical finding · typical · 68/268
Abstraction level: task · typical · 36/268
Generalization target: task class · typical · 63/268
Validation mode: controlled experiment · typical · 47/268

Evidence profile

Evidence strength: strong · typical · 158/268
Claim alignment: strong · typical · 231/268
Overclaim risk: medium · typical · 210/268
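
To make the counts above easier to compare, the short sketch below converts each "n/268" figure into a percentage of the review set. It assumes that each fraction is the number of reviews in the 268-review set that share the same label on that axis; this page does not spell that out, so treat the interpretation as an assumption. The names AXIS_COUNTS and share are illustrative only.

```python
# Counts copied from the Axes Lens lines above.
# Assumption: "n/268" means n reviews in the set carry the same label on that axis.
TOTAL_REVIEWS = 268

AXIS_COUNTS = {
    "Knowledge form: causal knowledge": 31,
    "Novelty type: empirical finding": 68,
    "Abstraction level: task": 36,
    "Generalization target: task class": 63,
    "Validation mode: controlled experiment": 47,
    "Evidence strength: strong": 158,
    "Claim alignment: strong": 231,
    "Overclaim risk: medium": 210,
}


def share(count, total=TOTAL_REVIEWS):
    """Return count as a percentage of total, rounded to one decimal place."""
    return round(100 * count / total, 1)


if __name__ == "__main__":
    for label, count in AXIS_COUNTS.items():
        print(f"{label:40s} {count:3d}/{TOTAL_REVIEWS}  {share(count):5.1f}%")
```

Read this way, the contribution-shape labels each cover roughly a tenth to a quarter of the set, while the evidence-profile labels cover more than half of it.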

Review Summary

This paper’s strongest contribution is that it makes robot failure perception history-sensitive. In much of the prior CHI/HRI discussion, the central variables are failure frequency, severity, and repair quality; here, the authors show that the order and diversity of failures also matter, and that this can change perceived intelligence and, to a lesser extent, trust. That is a meaningful departure from a simple common-sense model in which users merely count mistakes. The study design is also well matched to the claim: a controlled collaborative physical task with 54 participants, manipulating homogeneous versus heterogeneous failure sequences and none/partial/full awareness, gives the authors a credible basis for the causal story they tell. The paper’s own discussion is careful about scope, explicitly noting that the setup may not generalize to more complex, unpredictable, or safety-critical real-world settings, and that the awareness behaviors were pre-scripted. That restraint improves credibility. I would characterize the contribution as an empirical finding with moderate-to-strong novelty rather than a new framework or system. The main limitation is external validity: the task is narrow, the failures are safe and scripted, and there are only three failure instances per session. Still, within that boundary, the paper offers a useful refinement for designers of collaborative robots: if a robot must fail, the pattern of failure and whether the robot acknowledges it can shape how competent and trustworthy it appears.

What Changed

Canon before

Prior CHI work on robot failure and trust emphasized failure frequency, severity, and repair, but not the temporal structure of repeated failures or how a robot’s explicit awareness of its own failures changes user judgments.

Departure from common sense

The paper argues that people do not simply react to how often a robot fails; they also respond to whether failures repeat in a homogeneous pattern or vary across types. That is a non-obvious departure from a frequency-only intuition, because the sequence structure itself changes perceived intelligence and, in some cases, trust.

Actual novelty

The paper’s novelty is a sequence-aware account of robot failure perception: it extends trust calibration thinking by showing that trust and perceived intelligence depend on failure type, order, and diversity, and that awareness cues can soften negative judgments. The contribution is not a new robot platform, but a new empirical relationship between failure history and user evaluation.

Evidence

The paper reports a controlled lab study with 54 participants in a collaborative physical task, manipulating failure sequence (homogeneous vs. heterogeneous) and awareness (none, partial, full). The discussion states that evaluations are shaped by failure type and sequence, and that heterogeneous sequences tended to produce larger declines, especially for perceived intelligence. The paper also explicitly notes limits to generalization because the setup used safe, scripted failures and only three failure instances per session.
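
As a reading aid, the two manipulated factors can be laid out as a condition grid. The sketch below is illustrative only: it assumes the two factors are fully crossed, which this review does not state, and it makes no claim about whether the design was between- or within-subjects or how the 54 participants were distributed. The names SEQUENCES, AWARENESS, and condition_grid are hypothetical.

```python
from itertools import product

# Factor levels as named in the paper's abstract.
SEQUENCES = ("homogeneous", "heterogeneous")
AWARENESS = ("none", "partial", "full")


def condition_grid():
    """Return every (sequence, awareness) pairing.

    Assumption: the factors are fully crossed; the review does not confirm this.
    """
    return list(product(SEQUENCES, AWARENESS))


if __name__ == "__main__":
    for sequence, awareness in condition_grid():
        print(f"{sequence:13s} x {awareness}")
```

Under that assumption the grid has six cells, which is the space over which the sequence and awareness effects described here would be compared.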

“[21] by demonstrating that evaluations of trust and perceived intelligence depend not just on failure frequency but on the type and temporal structure of failures”

actual novelty · Abstract + Discussion 5.1-5.3 · confidence 0.66

“5.1 Beyond Frequency: Mixed Failure Sequences Erode Trust and Perceived Intelligence. Our results indicate that user evaluations are shaped by the type of failures and their sequence.”

departure from common sense · Discussion 5.1 (Beyond Frequency...) · confidence 0.70

limitation · Discussion 5.4 Limitations and Paths Forward · confidence 0.88

“In human–robot collaboration, repeated failures are in…”

validation scope · Abstract + Methodology 3.2-3.3 · confidence 0.82

Limits

Method limits

The evidence comes from a controlled experiment with scripted failure sequences and awareness behaviors, so causal interpretation is strongest within that design. However, the study does not test open-ended interaction dynamics, long-horizon adaptation, or alternative robot embodiments, and the paper itself notes that the awareness behaviors were pre-scripted.

Deployment limits

The findings are most directly applicable to short, structured collaborative tasks where failures can be anticipated and narrated. They are less directly transferable to high-stakes, unpredictable, or safety-critical deployments, where failure consequences, repair costs, and user expectations differ substantially.

Boundary conditions

The paper itself identifies boundary conditions: only three failure instances per session, safe and non-damaging failures, and a Tangram-style collaborative task. It also notes that the freezing failure always ended in a successful task outcome, which may constrain how broadly the sequence effects generalize across failure types.

Position in field

This sits in the CHI/HRI literature on trust calibration and robot transparency, but shifts attention from isolated failure events to failure histories and awareness cues. It is best read as an empirical refinement of how users interpret repeated robot mistakes rather than as a general theory of robot trust.

Abstract

In human–robot collaboration, repeated failures are inevitable and can undermine trust and perceptions of robot intelligence. While some failures severely disrupt tasks and others are relatively benign, their cumulative impact on trust is not clearly understood. We investigated whether users perceive repeated failures of the same type differently from varied failures, and how robot awareness of its own failures affects these perceptions. In a collaborative physical task with 54 participants, we manipulated failure sequence (homogeneous vs. heterogeneous) and awareness (none, partial, full). Results show that trust and perceived intelligence were influenced by both current and prior failures, with homogeneous sequences leading to smaller reductions in these evaluations compared to heterogeneous ones. Robots displaying awareness, whether partial or full, were consistently rated higher than unaware robots, particularly for grasping and planning failures. Our findings provide a deeper understanding of how failure type, sequence, and robot awareness shape users' perceptions of collaborative robots.