Emulating Aggregate Human Choice Behavior and Biases with GPT Conversational Agents
This is a strong CHI paper because it reframes LLM bias evaluation from isolated prompt responses to conversational behavioral emulation, backed by a large human study and GPT-4/GPT-5 comparisons. The main caveat is scope: the evidence is compelling for the tested abstract decision tasks, but not yet for broader real-world decision support.
Axes Lens
Rare contribution shape, typical evidence profile. The point here is not a score. It is to show what kind of claim the paper makes, and whether the evidence pattern is unusual or baseline in this 268-review set.
Contribution shape
- Knowledge form: descriptive knowledge (typical · 92/268)
- Novelty type: empirical finding (typical · 68/268)
- Abstraction level: task (typical · 36/268)
- Generalization target: task class (typical · 63/268)
- Validation mode: mixed methods (typical · 136/268)
Evidence profile
- Evidence strength: strong (typical · 158/268)
- Claim alignment: strong (typical · 231/268)
- Overclaim risk: medium (typical · 210/268)
Review Summary
This paper’s main value is conceptual as much as empirical: it pushes LLM bias evaluation away from the familiar question of whether a model reproduces a known bias in a single prompt and toward whether it can emulate human decision dynamics in an interactive setting. That is a meaningful departure from common-sense expectations about what these models can do, because it treats bias as something that emerges under dialogue structure and contextual load rather than as a static output property. The validation is also unusually substantial for this kind of claim: the authors report a human experiment with N=1100, use three adapted decision scenarios, and then simulate comparable conditions with GPT-4 and GPT-5 using participant demographics and dialogue transcripts. That gives the paper credible evidence for a bounded empirical finding about bias reproduction in conversational decision tasks. At the same time, the paper is careful enough to acknowledge that the scenarios are classic, abstract, and text-based, which limits ecological validity. So the right reading is not that GPT agents generally model human bias in the wild, but that they can reproduce specific bias patterns under controlled conversational conditions. In CHI terms, that makes the contribution strong as an empirical demonstration and as a methodological reframing, but not as a broad deployment warrant for high-stakes settings. The paper is best understood as establishing a promising task family and a sharper evaluation lens for future work on adaptive, bias-aware LLM systems.
What Changed
Canon before
Prior CHI and HCI work on LLM bias largely emphasized whether models reproduce known biases in isolated prompts or benchmark-style settings; this paper shifts the question to conversational, interactive decision support and asks whether models can emulate human bias dynamics under contextual manipulation.
Departure from common sense
The non-obvious move is to treat bias emulation as an interactive behavioral modeling problem, not just a prompt-level artifact. The paper asks whether GPT-based agents can track individual-level bias dynamics when dialogue complexity and cognitive load vary, which goes beyond the common expectation that LLMs merely mirror surface stereotypes or static response tendencies.
Actual novelty
The paper claims to be the first to examine LLM-based human behavioral modeling in this conversational, context-sensitive form. It combines three adapted decision scenarios, a large human study, and GPT-4/GPT-5 simulations using participant demographics and dialogue transcripts to test whether models can reproduce bias patterns under comparable interactive conditions.
Evidence
The paper validates its claims with a human experiment of N=1100 across three adapted decision scenarios, then replays comparable conditions with GPT-4 and GPT-5 using participant demographics and dialogue transcripts. The evidence supports a bounded claim about Status Quo bias emulation in conversational decision tasks, with explicit discussion of generalization limits to abstract text-based scenarios.
“We address this gap through a step-by-step empirical investigation, focusing on the question of How well do LLMs represent human biased decision-making behavior within a simulated conversational context, particularly when contextual factors are at play”
actual novelty · Introduction gap statement · confidence 0.72
“Predicting Biased Human Decision-Making with Large Language Models in Conversational Settings. IUI '26: Proceedings of the 31st International Conference on Intelligent User Interfaces. We examine whether large language models (LLMs) can predict biased decision-making in conversational settings, and whether their predictions capture not only human cognitive biases but also how those effects change under cognitive load.”
departure from common sense · Abstract/Introduction framing of the research question · confidence 0.60
“Limitations: The study in this paper uses classic decision scenarios that are abstract and text-ba”
limitation · Limitations section · confidence 0.78
“To evaluate how LLMs emulate human decision-making under similar interactive conditions, we used participant demographics and dialogue transcripts to simulate these conditions with LLMs based on GPT-4 and GPT-5”
validation scope · Abstract + Results/Agent Experiments overview · confidence 0.66
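The human-versus-agent comparison described under Evidence can be pictured as comparing aggregate choice rates per condition. The following is a minimal sketch under stated assumptions, not the authors' analysis: the condition names (`low_load`, `high_load`), the 0/1 encoding of "kept the status quo option", and all data values are hypothetical placeholders invented for illustration.

```python
# Sketch: compare aggregate status quo choice rates between human participants
# and simulated agents. All names and numbers below are hypothetical; they are
# not taken from the paper.

def status_quo_rate(choices):
    """Share of decisions that kept the status quo option (1 = kept, 0 = switched)."""
    if not choices:
        raise ValueError("no choices recorded for this condition")
    return sum(choices) / len(choices)

def compare(human, agent):
    """Per-condition human rate, agent rate, and agent-minus-human gap."""
    report = {}
    for condition in human:
        h = status_quo_rate(human[condition])
        a = status_quo_rate(agent[condition])
        report[condition] = {"human": h, "agent": a, "gap": a - h}
    return report

def load_effect(rates_by_condition, key):
    """Change in status quo rate from the low-load to the high-load condition."""
    return rates_by_condition["high_load"][key] - rates_by_condition["low_load"][key]

# Hypothetical toy data: one entry per decision.
human = {
    "low_load": [1, 1, 0, 1, 0, 1, 0, 0],
    "high_load": [1, 1, 1, 1, 0, 1, 1, 0],
}
agent = {
    "low_load": [1, 0, 0, 1, 0, 1, 0, 0],
    "high_load": [1, 1, 1, 1, 1, 1, 1, 0],
}

report = compare(human, agent)
for condition, row in report.items():
    print(condition, row)
print("human load effect:", load_effect(report, "human"))
print("agent load effect:", load_effect(report, "agent"))
```

In this toy data the agent's per-condition gaps are small, but its load effect is twice the human one. That is the kind of pattern a dynamics-focused evaluation would flag and a single-prompt bias check would miss.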
Limits
Method limits
The study uses classic decision scenarios that are abstract and text-based, which supports experimental control but narrows ecological validity. The paper also appears centered on Status Quo bias and does not establish broad coverage across bias types or richer conversational settings.
Deployment limits
Any deployment claim should be limited to controlled, text-based decision-support interactions. The results do not by themselves justify use in high-stakes domains or in settings where dialogue structure, multimodal cues, or real-world stakes differ substantially from the study tasks.
Boundary conditions
Findings are bounded by abstract, text-based decision scenarios, conversational framing, and the specific bias/task combinations tested. The paper itself notes that real decision-support dialogues in finance, recruitment, or healthcare may be more complex than the experimental setting.
Position in field
This sits at the intersection of LLM behavioral modeling, bias research, and interactive decision support. Its contribution is to move from static bias reproduction toward conversational emulation of aggregate human choice behavior, while remaining an empirical demonstration rather than a general theory of human-AI decision dynamics.