CHI '26 · Honorable mention · full-paper review · confidence medium-high

Emulating Aggregate Human Choice Behavior and Biases with GPT Conversational Agents

Stephen Pilli, Vivek Nallur

This is a strong CHI paper because it reframes LLM bias evaluation from isolated prompt responses to conversational behavioral emulation, backed by a large human study and GPT-4/GPT-5 comparisons. The main caveat is scope: the evidence is compelling for the tested abstract decision tasks, but not yet for broader real-world decision support.

Axes Lens

Rare contribution shape, typical evidence profile. The point here is not a score. It is to show what kind of claim the paper makes, and whether the evidence pattern is unusual or baseline in this 268-review set.

Contribution shape

Knowledge form: descriptive knowledge · typical · 92/268
Novelty type: empirical finding · typical · 68/268
Abstraction level: task · typical · 36/268
Generalization target: task class · typical · 63/268
Validation mode: mixed methods · typical · 136/268

Evidence profile

Evidence strength: strong · typical · 158/268
Claim alignment: strong · typical · 231/268
Overclaim risk: medium · typical · 210/268

Review Summary

This paper’s main value is conceptual as much as empirical: it pushes LLM bias evaluation away from the familiar question of whether a model reproduces a known bias in a single prompt and toward whether it can emulate human decision dynamics in an interactive setting. That is a meaningful departure from common-sense expectations about what these models can do, because it treats bias as something that emerges under dialogue structure and contextual load rather than as a static output property. The validation is also unusually substantial for this kind of claim: the authors report a human experiment with N=1100, use three adapted decision scenarios, and then simulate comparable conditions with GPT-4 and GPT-5 using participant demographics and dialogue transcripts. That gives the paper credible evidence for a bounded empirical finding about bias reproduction in conversational decision tasks. At the same time, the paper is careful enough to acknowledge that the scenarios are classic, abstract, and text-based, which limits ecological validity. So the right reading is not that GPT agents generally model human bias in the wild, but that they can reproduce specific bias patterns under controlled conversational conditions. In CHI terms, that makes the contribution strong as an empirical demonstration and as a methodological reframing, but not as a broad deployment warrant for high-stakes settings. The paper is best understood as establishing a promising task family and a sharper evaluation lens for future work on adaptive, bias-aware LLM systems.

What Changed

Canon before

Prior CHI and HCI work on LLM bias largely emphasized whether models reproduce known biases in isolated prompts or benchmark-style settings; this paper shifts the question to conversational, interactive decision support and asks whether models can emulate human bias dynamics under contextual manipulation.

Departure from common sense

The non-obvious move is to treat bias emulation as an interactive behavioral modeling problem, not just a prompt-level artifact. The paper asks whether GPT-based agents can track individual-level bias dynamics when dialogue complexity and cognitive load vary, which goes beyond the common expectation that LLMs merely mirror surface stereotypes or static response tendencies.

Actual novelty

The paper claims novelty on the grounds that no prior work had examined LLM-based human behavioral modeling in this conversational, context-sensitive form. It combines three adapted decision scenarios, a large human study, and GPT-4/GPT-5 simulations using participant demographics and dialogue transcripts to test whether models can reproduce bias patterns under comparable interactive conditions.

Evidence

The paper validates its claims with a human experiment of N=1100 across three adapted decision scenarios, then replays comparable conditions with GPT-4 and GPT-5 using participant demographics and dialogue transcripts. The evidence supports a bounded claim about Status Quo bias emulation in conversational decision tasks, with explicit discussion of generalization limits to abstract text-based scenarios.
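As a rough illustration of what comparing aggregate bias between human and simulated choices can look like, the sketch below computes a status-quo effect as the difference between the share of choices for an option when it is framed as the status quo and the share when it is presented neutrally, once for human records and once for LLM records. The record layout, the function names, and the toy counts are all hypothetical and are not taken from the paper; the authors' actual measures and analysis may differ.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass(frozen=True)
class Choice:
    """One decision record (hypothetical layout, not the paper's data format)."""
    source: str         # "human" or "llm"
    condition: str      # "status_quo" or "neutral"
    chose_target: bool  # True if the option of interest was picked


def choice_rate(records: Iterable[Choice], source: str, condition: str) -> float:
    """Share of records for (source, condition) in which the target option was chosen."""
    picked = [r.chose_target for r in records
              if r.source == source and r.condition == condition]
    if not picked:
        raise ValueError(f"no records for {source!r} / {condition!r}")
    return sum(picked) / len(picked)


def status_quo_effect(records: Iterable[Choice], source: str) -> float:
    """Target-choice rate under status-quo framing minus the rate under neutral framing."""
    records = list(records)  # allow generators to be scanned twice
    return (choice_rate(records, source, "status_quo")
            - choice_rate(records, source, "neutral"))


def make_records(source: str, condition: str, n_target: int, n_total: int) -> List[Choice]:
    """Build n_total toy records, the first n_target of which chose the target option."""
    return [Choice(source, condition, i < n_target) for i in range(n_total)]


# Toy counts, invented purely for illustration (NOT results from the paper).
records = (
    make_records("human", "status_quo", 7, 10)
    + make_records("human", "neutral", 4, 10)
    + make_records("llm", "status_quo", 8, 10)
    + make_records("llm", "neutral", 4, 10)
)

human_effect = status_quo_effect(records, "human")
llm_effect = status_quo_effect(records, "llm")
effect_gap = llm_effect - human_effect  # how far the simulated effect is from the human one

print(f"human effect: {human_effect:.2f}")
print(f"llm effect:   {llm_effect:.2f}")
print(f"gap:          {effect_gap:.2f}")
```

A comparison of this kind only speaks to aggregate alignment: two populations can show the same effect size while disagreeing on many individual choices, which is why the review reads the evidence as supporting a bounded, aggregate-level claim.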

“We address this gap through a step-by-step empirical investigation, focusing on the question of How well do LLMs represent human biased decision-making behavior within a simulated conversational context, particularly when contextual factors are at play”

actual novelty · Introduction gap statement · confidence 0.72

“Predicting Biased Human Decision-Making with Large Language Models in Conversational Settings IUI '26: Proceedings of the 31st International Conference on Intelligent User Interfaces We examine whether large language models (LLMs) can predict biased decision-making in conversational settings, and whether their predictions capture not only human cognitive biases but also how those effects change under cognitive load.”

departure from common sense · Abstract/Introduction framing of the research question · confidence 0.60

“Limitations The study in this paper uses classic decision scenarios that are abstract and text-ba”

limitation · Limitations section · confidence 0.78

“To evaluate how LLMs emulate human decision-making under similar interactive conditions, we used participant demographics and dialogue transcripts to simulate these conditions with LLMs based on GPT-4 and GPT-5”

validation scope · Abstract + Results/Agent Experiments overview · confidence 0.66

Limits

Method limits

The study uses classic decision scenarios that are abstract and text-based, which supports experimental control but narrows ecological validity. The paper also appears centered on Status Quo bias and does not establish broad coverage across bias types or richer conversational settings.

Deployment limits

Any deployment claim should be limited to controlled, text-based decision-support interactions. The results do not by themselves justify use in high-stakes domains or in settings where dialogue structure, multimodal cues, or real-world stakes differ substantially from the study tasks.

Boundary conditions

Findings are bounded by abstract, text-based decision scenarios, conversational framing, and the specific bias/task combinations tested. The paper itself notes that real decision-support dialogues in finance, recruitment, or healthcare may be more complex than the experimental setting.

Position in field

This sits at the intersection of LLM behavioral modeling, bias research, and interactive decision support. Its contribution is to move from static bias reproduction toward conversational emulation of aggregate human choice behavior, while remaining an empirical demonstration rather than a general theory of human-AI decision dynamics.

Abstract

Cognitive biases often shape human decisions. While large language models (LLMs) have been shown to reproduce well-known biases, a more critical question is whether LLMs can predict biases at the individual level and emulate the dynamics of biased human behavior when contextual factors, such as cognitive load, interact with these biases. We adapted three well-established decision scenarios into a conversational setting and conducted a human experiment (N=1100). Participants engaged with a chatbot that facilitates decision-making through simple or complex dialogues. Results revealed robust biases. To evaluate how LLMs emulate human decision-making under similar interactive conditions, we used participant demographics and dialogue transcripts to simulate these conditions with LLMs based on GPT-4 and GPT-5. The LLMs reproduced human biases with precision. We found notable differences between models in how they aligned with human behavior. This has important implications for designing and evaluating adaptive, bias-aware LLM-based AI systems in interactive contexts.