CHI '26 · Honorable mention · full-paper review · confidence medium-high

Interview-Informed Generative Agents for Product Discovery: A Validation Study

Zichao Wang, Alexa Siu

This is a strong validation paper because it resists the easy story that synthetic users either work or fail outright. Instead, it shows a more useful and more cautious result: interview-informed agents are poor stand-ins for specific people, yet they may still help with early concept screening when teams only need approximate population-level signals.

Axes Lens

Rare contribution shape, typical evidence profile. The point here is not a score. It is to show what kind of claim the paper makes, and whether the evidence pattern is unusual or baseline in this 268-review set.

Contribution shape

Knowledge form: descriptive knowledge · typical · 92/268
Novelty type: empirical finding · typical · 68/268
Abstraction level: task · typical · 36/268
Generalization target: task class · typical · 63/268
Validation mode: mixed methods · typical · 136/268

Evidence profile

Evidence strength: strong · typical · 158/268
Claim alignment: strong · typical · 231/268
Overclaim risk: medium · typical · 210/268

Review Summary

What makes this paper valuable is not a flashy new agent architecture, but the discipline of its empirical framing. The authors take a method that had looked promising in social-science-style survey simulation and ask a harder, more practically relevant question for HCI and industry: can interview-informed agents help with product discovery, where people are reacting to speculative concepts rather than reporting on stable attitudes or established constructs? The answer is nuanced and, importantly, useful. The agents do not reproduce the specific individuals they are grounded in with adequate fidelity, and the paper is clear about that failure. But the study also shows that this does not automatically make the method worthless. In this case, the agents approximate population-level response distributions better than simpler baselines, which suggests a constrained role for simulation in early-stage concept screening and directional exploration.

That is the paper’s real contribution: it separates identity fidelity from decision utility. Many readers would assume that if a simulated participant cannot stand in for the actual person, then the whole enterprise collapses. The authors instead show that the relevant standard depends on the task. If the goal is to understand a particular worker’s workflow, trust concerns, or adoption barriers, the method is not good enough. If the goal is to cheaply compare broad reactions across concept variants, distributional calibration may still be informative.

The paper earns credibility because it does not hide the weaknesses. It discusses the narrow scope of the case study, the dependence on the interview protocol, the simplicity of the architecture, and the limitations of LLM-as-a-judge evaluation. It also avoids claiming general validity beyond AI document workflows and this specific simulation pipeline.

Overall, this reads as a mature CHI contribution: not a universal endorsement of synthetic users, but a careful calibration of where they may fit into design research and where human-centered methods remain indispensable.

What Changed

Canon before

Before this paper, the strongest precedent for interview-informed generative agents came from social-science-style survey simulation, where the task was to reproduce responses on established instruments. In product discovery, by contrast, LLM simulation was more speculative: people could imagine it helping with concept screening, but there was much less direct validation against the same participants’ reactions to novel design concepts in a realistic discovery workflow.

Departure from common sense

The paper’s key departure from common-sense expectations is that a simulation method can be practically useful even when it fails at the seemingly more intuitive benchmark of reproducing the exact individual it is supposed to model. Rather than treating poor person-level fidelity as a total failure, the authors argue that distributional alignment across a population may still support early-stage concept screening, where teams care more about aggregate directional signals than about faithfully reconstructing any one participant.
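
The gap between person-level fidelity and distributional alignment can be made concrete with a small sketch. The paper's figure captions mention Wasserstein distance for comparing response distributions and Gwet's AC2 for agreement; the code below is not the authors' pipeline. It assumes a 1-to-5 Likert scale, uses a plain exact-match rate as a simplified stand-in for a chance-corrected agreement statistic, and runs on invented ratings chosen only to illustrate the point.

```python
from collections import Counter


def wasserstein_1d(a, b, levels):
    """1-Wasserstein distance between two samples on ordered, unit-spaced levels.

    For unit spacing this is the sum of absolute differences between the
    two empirical CDFs, expressed in scale points.
    """
    count_a, count_b = Counter(a), Counter(b)
    cum_a = cum_b = 0.0
    distance = 0.0
    for level in levels:
        cum_a += count_a[level] / len(a)
        cum_b += count_b[level] / len(b)
        distance += abs(cum_a - cum_b)
    return distance


def exact_agreement(a, b):
    """Share of paired responses (same participant) that match exactly."""
    if len(a) != len(b):
        raise ValueError("paired samples must have equal length")
    return sum(x == y for x, y in zip(a, b)) / len(a)


# Hypothetical ratings: index i is participant i and that participant's agent.
human = [1, 2, 3, 4, 5, 5, 4, 3]
agent = [5, 4, 1, 5, 3, 4, 3, 2]  # same multiset of ratings, reassigned

likert_levels = range(1, 6)
print("Wasserstein distance:", wasserstein_1d(human, agent, likert_levels))
print("Exact-match agreement:", exact_agreement(human, agent))
```

In this toy data the two samples contain the same ratings, so the distributional distance is zero, yet no agent matches its own participant. Real results would fall somewhere between these extremes, which is why the two kinds of measure need to be reported separately.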

Actual novelty

The novelty is a bounded empirical validation study of interview-informed generative agents in a product-discovery setting rather than a new model or architecture. The paper tests whether agents grounded in workflow interviews can simulate concept-test responses for four AI document-workflow concepts, using both scalar measures such as TAM and NPS and open-ended feedback. Its main contribution is the finding that these agents are distribution-calibrated but identity-imprecise, which clarifies where simulation may help in practice and where it should not be trusted as a substitute for direct user research.
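
For readers unfamiliar with NPS as a scalar measure, the following is a minimal sketch of the conventional Net Promoter Score calculation: the percentage of promoters (ratings of 9 or 10) minus the percentage of detractors (ratings of 0 to 6) on a 0-to-10 scale. Whether the paper applies exactly this scoring is an assumption, and the ratings below are invented for illustration.

```python
def net_promoter_score(ratings):
    """Conventional NPS: percent promoters (9-10) minus percent detractors (0-6)."""
    if not ratings:
        raise ValueError("ratings must be non-empty")
    promoters = sum(r >= 9 for r in ratings)
    detractors = sum(r <= 6 for r in ratings)
    return 100.0 * (promoters - detractors) / len(ratings)


# Hypothetical 0-10 "likelihood to recommend" ratings for one concept.
human_ratings = [10, 9, 9, 8, 7, 6, 3, 10]
agent_ratings = [9, 8, 8, 8, 7, 7, 6, 9]

human_nps = net_promoter_score(human_ratings)
agent_nps = net_promoter_score(agent_ratings)
print("Human NPS:", human_nps)
print("Agent NPS:", agent_nps)
print("Gap (agent - human):", agent_nps - human_nps)
```

Because NPS collapses a full rating distribution into one number, comparing it alongside a distribution-level distance gives a fuller picture of how well simulated responses track human ones at the population level.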

Evidence

The evidence is substantial for a scoped validation claim. The paper reports a focused case study with 51 knowledge workers, workflow interviews, personalized agents, four AI document-workflow concepts, and comparisons between agent outputs and the same participants’ concept-test responses. It evaluates both quantitative alignment for categorical responses and qualitative similarity for open-ended responses, then interprets the results through discussion and limitations sections. The evidence strongly supports the paper’s central bounded claim that interview-informed agents can approximate population-level distributions better than individual identities in this setting, while also documenting important weaknesses in qualitative fidelity, evaluation methodology, and scope.

“We provide, to our knowledge, the first systematic evaluation of interview-informed generative agents on early-stage product concept testing, combining TAM, NPS, and open-ended feedback for four AI document workflow concept”

actual novelty · 1 Introduction · confidence 0.82

“mparisons (95.0% Confidence Interval) for the Gwet’s AC2 results in Figure 4. Figure 12: Distribution of the four constructs for the “Audio Assistant” concept as well as the Wasserstein distance comparing annotators themselves and the three different agent designs. Figure 13: Distribution of the four constructs for the “Highlight Assistant” concept as well as the Wa”

departure from common sense · 7 Discussion · confidence 0.78

Limits

Method limits

The method is limited by the narrow case-study setup, the relatively simple retrieval-reflection-answering architecture, and the dependence of outcomes on the specific interview protocol used to ground agents. The paper also notes that its evaluation stack captures only part of what matters in design discovery, especially because LLM-as-a-judge procedures can introduce bias and model-specific failure modes. These constraints mean the study is better read as an initial validation of one simulation pipeline than as a definitive test of all interview-informed agent approaches.

Deployment limits

The paper is explicit that these agents should not replace authentic user interviews when teams need participant-specific explanations, workflow understanding, trust dynamics, or adoption barriers. Their plausible use is limited to early-stage concept screening and directional exploration, where approximate population-level response patterns may be sufficient. Even there, the method depends on careful human data collection and should be treated as a complement to user research rather than a standalone decision system.

Boundary conditions

The findings are bounded to a product-discovery case study involving 51 U.S.-based knowledge workers, four AI document-workflow concepts, and one interview-informed agent architecture. The claims are most applicable to settings where users are reacting to hypothetical concepts, where aggregate distributions matter more than exact individual prediction, and where teams can tolerate approximate signals. The paper does not establish that the same results would hold for other domains, higher-stakes contexts, richer grounding modalities, or different model and retrieval designs.

Position in field

This paper sits in the emerging literature on LLM-based human simulation as a scope-setting validation study for HCI and product discovery. Relative to prior work that emphasized stronger performance on social-science instruments, it shows that transfer into concept evaluation is not straightforward and that fidelity depends heavily on domain, task, and measurement choices. Its contribution to the field is therefore less about proving that synthetic users work in general and more about clarifying a narrower methodological argument: interview-informed agents may complement early discovery by approximating population-level patterns, but they remain poor substitutes for situated, participant-level inquiry.

Abstract

Large language models (LLMs) have shown strong performance on standardized social science instruments, but their value for product discovery remains unclear. We investigate whether interview-informed generative agents can simulate user responses in concept testing scenarios. Using in-depth workflow interviews with knowledge workers, we created personalized agents and compared their evaluations of novel AI concepts against the same participants’ responses. Our results show that agents are distribution-calibrated but identity-imprecise: they fail to replicate the specific individual they are grounded in, yet approximate population-level response distributions. These findings highlight both the potential and the limits of LLM simulation in design research. While unsuitable as a substitute for individual-level insights, simulation may provide value for early-stage concept screening and iteration, where distributional accuracy suffices. We discuss implications for integrating simulation responsibly into product development workflows.