CHI '26 · Honorable mention · full-paper review · confidence medium-high

When Help Hurts: Verification Load and Fatigue with AI Coding Assistants

Guangrui Fan, Dandan Liu, Lihu Pan, Rui Zhang

This is a solid CHI paper because it does more than report that AI coding assistants can help or hurt: it isolates interface effects, quantifies a hidden verification burden, and connects that burden to stress and fatigue. The contribution is strongest as a measurement-plus-experiment package rather than as a broad systems claim.

Axes Lens

Rare contribution shape, typical evidence profile. The point here is not a score. It is to show what kind of claim the paper makes, and whether the evidence pattern is unusual or baseline in this 268-review set.

Contribution shape

Knowledge form: causal knowledge (typical · 31/268)
Novelty type: measurement (less common · 3/268)
Abstraction level: interaction (typical · 22/268)
Generalization target: design family (typical · 38/268)
Validation mode: controlled experiment (typical · 47/268)

Evidence profile

Evidence strength: strong (typical · 158/268)
Claim alignment: strong (typical · 231/268)
Overclaim risk: medium (typical · 210/268)

Review Summary

This paper’s value is that it turns an intuitive but often under-measured concern into a concrete HCI result: AI coding assistants can reduce some immediate workload while simultaneously creating verification work that developers must absorb. The authors do not just assert this; they hold the LLM fixed, compare Inline, Chat, Structured, and no-AI conditions, and then introduce a verification-load index built from observable interaction traces such as failures, time-to-first-compile, churn, pauses, and switches. That makes the paper more than a standard productivity comparison. It is a measurement contribution that helps explain why “help” can still feel costly in practice.

The evidence base is reasonably strong for the claims made. The study is controlled, within-subjects, and includes N=60 participants across three Python tasks, which is enough to support the reported workload, time, correctness, and trajectory differences at the interface level. The paper also appears careful about scope: it frames the results around task-bounded programming work and explicitly notes that broader settings such as larger collaborative codebases, other languages, and longer-term adoption remain open. That restraint matters because the verification-load construct is promising, but its exact form and thresholds will likely vary with domain, repository context, and tool ecosystem.

As a CHI contribution, the paper is strongest in its combination of empirical finding and measurement framing. The novelty is not a new assistant model, but a way to see and compare the hidden costs of using assistants. The main limitation is that the study design cannot fully capture longitudinal adaptation, organizational workflows, or backend/tool integration effects. Still, within its scope, the paper makes a persuasive case that evaluation of AI coding assistants should include verification burden, not just speed or correctness.

What Changed

Canon before

Prior CHI work on AI coding assistants largely emphasized productivity gains, correctness, or user satisfaction, but less often isolated interface-level effects from backend model effects or quantified verification burden as a distinct construct.

Departure from common sense

The paper’s core counterintuitive point is that AI assistance does not simply reduce effort; it can shift effort into verification, so “help” may increase burden through checking and repair. That reframes assistant value away from raw output generation and toward the hidden cost of validating it.

Actual novelty

The paper’s main novelty is a mode-agnostic verification-load index built from failures, time-to-first-compile, churn, pauses, and switches, paired with a controlled comparison of Inline, Chat, and Structured prompting while holding the LLM fixed. It uses that construct to explain how stress and fatigue rise across tasks and to separate interface effects from backend effects.
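To make the shape of such an index concrete, here is a minimal sketch of one way a composite could be formed from the five trace signals named above. It assumes equal weighting of z-scored signals; the weights, the field names, and the sample sessions are all hypothetical, since the review text does not state how the paper combines or scales the components.

```python
from statistics import mean, pstdev

# The five trace signals named in the review. Field names are hypothetical.
SIGNALS = (
    "failures",
    "time_to_first_compile_s",
    "churn_lines",
    "pauses",
    "switches",
)


def zscores(values):
    """Standardize a list of numbers; a constant column maps to all zeros."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]


def verification_load(sessions):
    """Return one composite per session: the mean of its z-scored signals.

    Equal weighting is an assumption made for illustration only.
    """
    columns = {name: zscores([s[name] for s in sessions]) for name in SIGNALS}
    return [
        mean(columns[name][i] for name in SIGNALS)
        for i in range(len(sessions))
    ]


# Illustrative sessions only; these are not data from the paper.
sessions = [
    {"failures": 1, "time_to_first_compile_s": 40, "churn_lines": 12, "pauses": 2, "switches": 3},
    {"failures": 4, "time_to_first_compile_s": 95, "churn_lines": 60, "pauses": 7, "switches": 11},
    {"failures": 2, "time_to_first_compile_s": 60, "churn_lines": 25, "pauses": 4, "switches": 5},
]
scores = verification_load(sessions)
```

Because every signal is standardized before averaging, the composite does not depend on which interaction mode produced the trace, which is one plausible reading of "mode-agnostic". A session that is high on every signal receives the highest score.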

Evidence

The evidence comes from a controlled within-subjects experiment with N=60 participants on three Python tasks, comparing Inline, Chat, and Structured prompting against a no-AI control while holding the model fixed. Reported outcomes relative to the control include lower workload, faster completion, and improved correctness, along with mode-specific tradeoffs and a verification-load composite that partially mediates stress/fatigue trajectories.
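For readers less familiar with what "partially mediates" means operationally, the sketch below shows a simple product-of-coefficients decomposition on synthetic numbers. It is an illustration of the concept, not the paper's analysis: the variable names and values are made up, plain OLS is an assumption, and the paper's actual model (for example, any mixed-effects structure or bootstrapped intervals) is not described in this review.

```python
from statistics import mean


def slope(x, y):
    """OLS slope of y on a single predictor x."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    return sxy / sxx


def two_predictor_slopes(x, m, y):
    """OLS slopes of y on x and m jointly, via centered normal equations."""
    mx, mm, my = mean(x), mean(m), mean(y)
    xc = [a - mx for a in x]
    mc = [a - mm for a in m]
    yc = [a - my for a in y]
    sxx = sum(a * a for a in xc)
    smm = sum(a * a for a in mc)
    sxm = sum(a * b for a, b in zip(xc, mc))
    sxy = sum(a * b for a, b in zip(xc, yc))
    smy = sum(a * b for a, b in zip(mc, yc))
    det = sxx * smm - sxm ** 2
    return (sxy * smm - smy * sxm) / det, (smy * sxx - sxy * sxm) / det


# Synthetic illustration only: task position, verification load, stress rating.
task = [1, 2, 3, 1, 2, 3, 1, 2, 3]
load = [0.1, 0.6, 1.2, -0.2, 0.5, 0.9, 0.0, 0.4, 1.4]
stress = [2.0, 2.9, 4.1, 1.8, 2.8, 3.6, 2.1, 2.7, 4.3]

a = slope(task, load)                       # task -> mediator
total = slope(task, stress)                 # total effect of task on stress
direct, b = two_predictor_slopes(task, load, stress)
indirect = a * b                            # portion carried by the mediator
share_mediated = indirect / total
```

With OLS, the total effect equals the direct effect plus the indirect effect. Mediation is "partial" when the indirect path is nonzero but the direct effect does not vanish, as in this toy data.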

“Our contribution is to instantiate such a composite (verification‑load), show that it varies systematically by interaction mode and complexity (with expertise moderation), and demonstrate that it partially mediates increases in stress and fatigue across tasks—thereby linking theory to actionable interface design (”

actual novelty · Abstract + Results mechanism framing · confidence 0.82

“AI coding ass”

departure from common sense · Abstract/Introduction framing · confidence 0.80

“Our study focuses on three Python tasks with controlled oracles, a single structured-elicitation design, and a single-session exposure (plus a small next-day check).”

limitation · Limitations and scope of claims · confidence 0.90

“Our methods isolate interaction design from model capability to answer three questions: (RQ1) the immediate effects of mode on workload, time, and correctness under backend parity; (RQ2) whether repeated-use trajectories in stress/fatigue differ with and without AI and whether a mode-agnostic verification-load mediates these trajectories; and (RQ3) how expertise and continuous task complexity moderate mode eff”

validation scope · Method/Results overview · confidence 0.55

Limits

Method limits

The study is constrained to three Python tasks, a controlled-oracle setup, a single structured-elicitation design, and mostly single-session exposure with only a small next-day check. These constraints limit inference about long-term learning, broader programming domains, and more complex collaborative settings.

Deployment limits

The findings are most directly applicable to interface design for AI coding assistants in task-bounded programming workflows. They may not transfer unchanged to larger codebases, other languages, repository-scale work, or systems that combine retrieval, tool use, or personalized context.

Boundary conditions

The reported effects are bounded by task complexity, prompting mode, and the fixed backend model. The paper itself notes that repository context, tool-augmented backends, and personalized retrieval were prohibited, so crossover points and effect magnitudes may shift in real-world deployments.

Position in field

This paper sits at the intersection of AI-assisted programming, HCI measurement, and workload/fatigue research. Its contribution is to make verification burden visible as an interface-level phenomenon and to argue that evaluation should track that burden alongside conventional productivity and correctness outcomes.

Abstract

AI coding assistants help, but developers still spend effort verifying model output. We isolate interface effects by holding a single LLM fixed while N=60 participants solve three Python tasks with Inline, Chat, or Structured prompting, plus a no-AI control. AI reduced workload by 18.2 TLX points and time by 22% (25.0 vs. 32.1 min) and improved correctness (OR=1.71). Within AI, Inline is fastest and lowest-load on simple work; Chat yields higher correctness beyond a per-observation complexity threshold (z≈+0.41) without a time cost; Structured benefits novices at mid complexity. We introduce a mode-agnostic verification-load index (failures, time-to-first-compile, churn, pauses, switches) that partially mediates rising stress/fatigue across tasks. We translate these findings into design guidance: adaptive mode orchestration, transparency on demand, and verification-aware packaging, and propose reporting verification load alongside outcomes to evaluate interfaces as models evolve.
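As a quick check on the headline numbers, the sketch below recomputes the relative time reduction from the two reported means and shows what an odds ratio of 1.71 would imply for a pass rate. The 50% baseline pass rate is purely an assumption for illustration, because the abstract does not report the no-AI correctness rate. An odds ratio from a fitted model is also conditional on that model's covariates, so the converted probability is a rough reading aid, not a reported result.

```python
def relative_reduction(baseline, treated):
    """Fractional drop from baseline to treated."""
    return (baseline - treated) / baseline


def apply_odds_ratio(baseline_p, odds_ratio):
    """Convert a baseline probability and an odds ratio into a new probability."""
    odds = baseline_p / (1 - baseline_p) * odds_ratio
    return odds / (1 + odds)


# Reported means: 25.0 min with AI vs. 32.1 min without.
time_cut = relative_reduction(32.1, 25.0)

# Hypothetical baseline: assume half of no-AI attempts are correct.
assumed_baseline = 0.50
p_with_ai = apply_odds_ratio(assumed_baseline, 1.71)

print(f"time reduction: {time_cut:.1%}")
print(f"correctness at assumed {assumed_baseline:.0%} baseline: {p_with_ai:.1%}")
```

The recomputed reduction is about 22%, matching the abstract. Under the assumed 50% baseline, OR=1.71 corresponds to roughly 63% correct, and the gain shrinks as the baseline approaches 0% or 100%.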