CHI '26 · Honorable mention · full-paper review · confidence medium-high

Quantifying the Novelty Bias when Evaluating Interactive Prototypes

Yumeng Ma, Alexis Hiniker, Jacob O. Wobbrock

This is a solid CHI paper because it converts a familiar but often hand-waved concern, novelty bias, into a controlled causal result across multiple prototype categories. The main value is not a new interface but a sharper empirical warning: participant preference and ratings can be distorted by framing even when functionality is held constant.

Axes Lens

Rare contribution shape, typical evidence profile. The point here is not a score. It is to show what kind of claim the paper makes, and whether the evidence pattern is unusual or baseline in this 268-review set.

Contribution shape

Knowledge form: causal knowledge (typical · 31/268)
Novelty type: empirical finding (typical · 68/268)
Abstraction level: field (typical · 41/268)
Generalization target: field argument (typical · 55/268)
Validation mode: controlled experiment (typical · 47/268)

Evidence profile

Evidence strength: strong (typical · 158/268)
Claim alignment: strong (typical · 231/268)
Overclaim risk: medium (typical · 210/268)

Review Summary

This paper’s contribution is best understood as a field-level empirical finding with clear methodological implications for HCI evaluation. The authors do not merely speculate that novelty can bias judgments; they run a within-subjects study with 48 participants and four pairs of functionally identical prototypes, spanning mice, keyboards, search engines, and AI chatbots, where the only salient difference is cosmetic presentation plus an explicit “old/new” label. That design supports a causal reading of the effect.

The reported outcomes are also nicely differentiated: preference shifts are large, subjective ratings move upward for the “new” version, and performance effects are present but comparatively modest and mostly visible in error measures. That pattern matters because it suggests novelty is not just a generic positivity halo; it changes interpretation and preference more than it changes task execution. The paper’s novelty is therefore not a new artifact or interaction technique, but a quantified causal demonstration that should make CHI reviewers and authors more cautious about treating participant judgments as neutral evidence of prototype quality.

The limitations are also appropriately bounded. The manipulation is explicit and somewhat artificial, the setting is controlled, and the exposure is short and first-use oriented. The authors acknowledge that these conditions may not capture longitudinal use or subtler novelty cues in the wild. That restraint improves credibility rather than weakening the paper.

Overall, this is a strong honorable-mention-level contribution because it is methodologically clean, broadly relevant to HCI practice, and likely to be cited whenever prototype evaluation, framing effects, or novelty confounds are discussed.

What Changed

Canon before

HCI evaluation commonly treats participant judgments as evidence of prototype quality; this paper targets the assumption that novelty labels are neutral rather than influential.

Departure from common sense

A simple “new” label can materially change how people judge otherwise identical prototypes, and that influence appears stronger for preference and subjective ratings than for objective performance. That is counter to the common expectation that cosmetic framing should be secondary to actual functionality.

Actual novelty

The paper’s novelty is a causal experimental demonstration that labeling functionally identical prototypes as “old” versus “new” shifts preference, subjective ratings, and some error measures across several technology categories. It turns a familiar concern into quantified evidence.

Evidence

The study reports a within-subjects experiment with 48 participants and four pairs of functionally identical prototypes spanning mice, keyboards, search engines, and AI chatbots. The manipulation was cosmetic plus explicit old/new labeling. Results show strong preference shifts toward “new,” some subjective rating inflation, and modest performance effects, especially on errors. The paper also states limitations about the explicit manipulation and controlled setting.

“Our study makes three contributions to HCI research by providing: (1) a causal (experimental) demonstration that labeling a functionally identical system as “new” shifts preferences, inflates subjective ratings, and reduces input errors, (2) evidence that these perceptual shifts can outweigh or misalign with objective performance, and (3) clarification that TR influences baseline performance and can interact with novelty, but does not shield judgments from novelty-driven bia”

actual novelty · Abstract; Discussion (5.4) implications · confidence 0.72

“Our study makes three contributions to HCI research by providing: (1) a causal (experimental) demonstration that labeling a functionally identical system as “new” shifts preferences, inflates subjective ratings, and reduces input errors, (2) evidence that these perceptual shifts can outweigh or misalign with objective performance, and (3) clarification that TR influences baseline performance and can interact with novelty, but does not shield judgments from novelty-driven bia”

departure from common sense · Abstract; Discussion (5.4) framing of novelty reweighting interpretations · confidence 0.76

“er, Theo Raimbault, Lisa Geierhaas, and Matthew Smith. 2025. Small, Medium, Large? A Meta-Study of Effect Sizes at CHI to Aid Interpretation of Effect Sizes and Power Calculation. In Proceedings of the 2025 CHI Conference on Human Factors in Computing Systems. 1–28. [64] Ra”

limitation · 5.5 Limitations and Future Work · confidence 0.91

“To quantify this effect, we conducted a within-subjects study of 48 participants comparing four pairs of functionally identical prototypes (mice, keyboards, search engines, and AI chatbots)”

validation scope · Abstract; Limitations/Future Work (5.5) describing controlled conditions and task/session format · confidence 0.84

Limits

Method limits

The evidence comes from a controlled within-subjects lab study with explicit old/new labels, cosmetic differences, and short first-use tasks. The design supports causal inference about the labeling manipulation, but it does not isolate all real-world novelty cues or longer-term adaptation.

Deployment limits

The findings are most directly relevant to early-stage prototype evaluation and lab studies that rely on participant judgments. They may be less directly transferable to longitudinal use, field settings, or evaluations where novelty is conveyed implicitly rather than by explicit labels.

Boundary conditions

The effect is demonstrated under explicit “old/new” labeling, single-session exposure, and stripped-down prototype comparisons. The paper itself notes uncertainty about whether novelty is fleeting or persistent and whether other signals produce similar effects.

Position in field

This paper strengthens a longstanding HCI concern by quantifying novelty bias as an experimental effect rather than a vague confound. Its contribution is less a new interaction technique than a field-level argument about how prototype evaluations should be interpreted.

Abstract

Experiments in human-computer interaction (HCI) often evaluate whether a prototype is “better,” but novelty alone can affect users’ judgments and possibly performance. To quantify this effect, we conducted a within-subjects study of 48 participants comparing four pairs of functionally identical prototypes (mice, keyboards, search engines, and AI chatbots). Each pair differed only in cosmetic features and a label marking one as “old” and the other as “new.” Novelty labeling shifted preference: up to 77% favored the version labeled “new.” Subjective ratings for the search engine increased under the “new” label by up to 7.1%. For the AI chatbot, ratings were driven by preference, with the preferred version rated up to 11.6% higher than the unpreferred one. Performance differences were modest and emerged for errors (e.g., 9.7% fewer misses with the “new” mouse, up to 7.2% lower error rates with the “new” keyboard). Technology readiness predicted baseline skill and occasionally moderated performance but did not protect judgments from novelty bias. These results show that novelty labeling reframes interpretation and preference more than performance, raising concerns for HCI evaluations relying on participant judgments.
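
For readers who want a feel for how far a 77% preference share sits from an even split, the following is a minimal sketch of an exact two-sided binomial test. It is not the authors' analysis. The count of 37 out of 48 is an assumption chosen only because it rounds to 77%; the paper's per-category counts and its actual statistical tests may differ, and the function name is hypothetical.

```python
from math import comb


def binomial_two_sided_p(successes: int, n: int, p0: float = 0.5) -> float:
    """Exact two-sided binomial p-value.

    Sums the probability of every outcome that is no more likely
    than the observed one under the null share p0.
    """
    pmf = [comb(n, k) * p0**k * (1 - p0) ** (n - k) for k in range(n + 1)]
    observed = pmf[successes]
    # A small relative tolerance guards against floating-point ties.
    total = sum(p for p in pmf if p <= observed * (1 + 1e-9))
    return min(1.0, total)


n_participants = 48  # sample size stated in the abstract
chose_new = 37       # ASSUMPTION: illustrative count, 37/48 is roughly 77%

share = chose_new / n_participants
p_value = binomial_two_sided_p(chose_new, n_participants)
print(f"share choosing 'new': {share:.3f}, two-sided p: {p_value:.5f}")
```

Under the null hypothesis that the label is irrelevant, each participant would be equally likely to prefer either version, so a small p-value here indicates a preference split that chance alone rarely produces. The sketch ignores the within-subjects structure across the four prototype categories, which a full analysis would need to model.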