CHI '26 · Honorable mention · full-paper review · confidence medium-high

Breakdowns in Conversational AI: Interactional Failures in Emotionally and Ethically Sensitive Contexts

Jiawen Deng, Wentao Zhang, Ziyun Jiao, Fuji Ren

This is a solid CHI-style evaluation paper with a clear methodological contribution: it moves beyond static safety checks and shows how failures accumulate over emotionally escalating dialogue. The main strength is the stress-test setup and taxonomy; the main caution is that the evidence comes from synthetic, model-mediated validation rather than real user deployment.

Axes Lens

Rare contribution shape, typical evidence profile. The point here is not a score. It is to show what kind of claim the paper makes, and whether the evidence pattern is unusual or baseline in this 268-review set.

Contribution shape

Knowledge form: method knowledge · typical · 29/268
Novelty type: method · typical · 21/268
Abstraction level: system · typical · 61/268
Generalization target: methodological argument · typical · 16/268
Validation mode: mixed methods · typical · 136/268

Evidence profile

Evidence strength: moderate · typical · 105/268
Claim alignment: strong · typical · 231/268
Overclaim risk: medium · typical · 210/268
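
The counts above are easier to compare as shares of the 268-review set. The following is a minimal sketch that converts each count into a fraction and lists the axes from least to most common. The dictionary keys and the `share` helper are illustrative names, and the cutoff that separates "rare" from "typical" is not stated in this review, so the sketch does not assume one.

```python
# Counts copied from the axes above; the denominator is the 268-review set.
TOTAL_REVIEWS = 268

axis_counts = {
    "Knowledge form: method knowledge": 29,
    "Novelty type: method": 21,
    "Abstraction level: system": 61,
    "Generalization target: methodological argument": 16,
    "Validation mode: mixed methods": 136,
    "Evidence strength: moderate": 105,
    "Claim alignment: strong": 231,
    "Overclaim risk: medium": 210,
}


def share(count: int, total: int = TOTAL_REVIEWS) -> float:
    """Return the fraction of reviews that share a given axis value."""
    return count / total


shares = {axis: share(n) for axis, n in axis_counts.items()}

# Print the axes from the least common value to the most common one.
for axis, frac in sorted(shares.items(), key=lambda kv: kv[1]):
    print(f"{axis:<50} {frac:6.1%}")
```

Read this way, the contribution-shape values are each shared by a small fraction of the set apart from the validation mode, while the evidence-profile values are shared by a much larger fraction.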

Review Summary

This paper’s strongest contribution is not a new conversational model, but a new way to look for failure. The authors explicitly reject the common assumption that alignment can be assessed with static or single-turn checks, and instead argue that breakdowns appear as interactional patterns that intensify over multi-turn emotional escalation. That is a meaningful shift for CHI because it changes the evaluation target from isolated outputs to evolving dialogue behavior. The novelty is method-centered: a persona-conditioned user simulator with staged emotional pacing is used to stress-test agents, and the resulting failures are organized into a taxonomy that includes affective misalignments, ethical guidance failures, and cross-dimensional trade-offs. The evidence base is reasonably sized for a synthetic study: 2,980 simulated dialogues across 298 scenarios, five models, and two pacing settings, with LLM-as-judge analysis and some human validation of judge reliability. That supports the paper’s descriptive claims about recurring breakdown patterns, but it does not fully establish external validity in real-world emotionally sensitive interactions. The authors are also appropriately candid about limitations: the pipeline relies on LLMs from the same model family for persona extraction, simulation, and judging, emotion pacing is validated with an off-the-shelf classifier, and scenario filtering is not a definitive semantic categorization. So the paper is best read as a strong methodological and diagnostic contribution, with moderate evidence strength and a medium overclaim risk if one were to generalize too far beyond synthetic stress-testing.

What Changed

Canon before

Prior work emphasized emotional benchmarks and static safety checks, with less attention to how conversational alignment changes over multi-turn, emotionally escalating interaction.

Departure from common sense

The paper argues against the common evaluation habit of treating conversational AI safety or empathy as a static property. Instead, it frames breakdowns as interactional phenomena that emerge and intensify across multi-turn emotional trajectories, which shifts the unit of analysis from isolated responses to evolving dialogue.

Actual novelty

The paper’s novelty is the combination of a persona-conditioned user simulator with staged emotional pacing to stress-test agents in multi-turn dialogue, followed by a taxonomy of observed breakdowns. This is presented as a way to diagnose interactional failures in emotionally and ethically sensitive contexts rather than only scoring benchmark outputs.
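
To make the idea of persona conditioning plus staged pacing concrete, here is a minimal sketch of how a simulator prompt could vary by turn. It is not the authors' implementation: the `Persona` class, the three stage labels, the equal-length split of the dialogue into stages, and the prompt wording are all assumptions made for illustration. The `paced=False` branch stands in for the persona-only baseline condition.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Persona:
    """Hypothetical persona record; the paper's persona schema may differ."""
    name: str
    traits: tuple


# Hypothetical stage labels, ordered from least to most escalated.
STAGES = ("calm", "frustrated", "distressed")


def stage_for_turn(turn: int, total_turns: int, stages=STAGES) -> str:
    """Map a 0-based turn index to a stage by splitting the dialogue
    into roughly equal consecutive segments, one per stage."""
    if not 0 <= turn < total_turns:
        raise ValueError("turn must satisfy 0 <= turn < total_turns")
    return stages[turn * len(stages) // total_turns]


def simulator_prompt(persona: Persona, turn: int, total_turns: int,
                     paced: bool = True) -> str:
    """Build the instruction given to the user simulator for one turn.
    With paced=False only the persona is used (baseline condition)."""
    parts = [f"You are {persona.name}, a user who is {', '.join(persona.traits)}."]
    if paced:
        parts.append(f"Current emotional stage: {stage_for_turn(turn, total_turns)}.")
    return " ".join(parts)


persona = Persona(name="Alex", traits=("anxious", "self-critical"))
for t in range(6):
    print(t, simulator_prompt(persona, t, total_turns=6))
```

The design point the sketch tries to capture is that the agent under test sees an escalating trajectory rather than a single emotional snapshot, which is what lets breakdowns be observed as they accumulate.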

Evidence

The paper validates its claims through synthetic multi-turn dialogue generation and analysis. It reports 2,980 simulated dialogues across 298 scenarios, 5 models, and 2 pacing settings, then uses LLM-as-judge and breakdown annotations to compare baseline persona-only versus pacing-enabled conditions. The authors also describe limited human validation of judge reliability and note several methodological caveats.
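
The dialogue count follows from a fully crossed design: 298 scenarios × 5 models × 2 pacing settings. A minimal sketch that enumerates the cells is shown here. Only the three sizes come from the paper; the scenario, model, and condition identifiers are placeholders, and the sketch assumes one dialogue per cell, which is what the reported total implies.

```python
from itertools import product

# Sizes reported in the paper; the identifiers are placeholders,
# not the paper's actual scenario, model, or condition names.
scenarios = [f"scenario_{i:03d}" for i in range(298)]
models = [f"model_{j}" for j in range(5)]
pacing_settings = ["persona_only", "pacing_enabled"]

# One simulated dialogue per (scenario, model, pacing) cell.
design = list(product(scenarios, models, pacing_settings))

# Each pacing condition receives the same scenarios and models,
# so the two conditions can be compared cell by cell.
per_condition = {
    setting: sum(1 for _, _, s in design if s == setting)
    for setting in pacing_settings
}

print(len(design))
print(per_condition)
```

Because every scenario-model pair appears once under each pacing setting, the baseline versus pacing-enabled comparison is paired rather than drawn from independent samples.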

“Our main contributions are as follows: • We propose an interactional evaluation approach that leverages a persona-conditioned simulator with staged emotion pacing to probe conversational agents in emotionally and ethically sensitive contexts”

actual novelty · Abstract + Contributions + Method overview (persona-conditioned simulation, taxonomy) · confidence 0.70

“Conversational AI is increasingl”

departure from common sense · Introduction / Related Work (evaluation gap) · confidence 0.74

“First, because the persona extraction, user simulation, and LLM-as-judge components all rely on large language models from the same model family, the system may inherit same-source normative biases”

limitation · Limitations and Future Directions · confidence 0.84

“To support a large-scale comparison between the baseline (persona-only) and pacing-enabled conditions, we first produce breakdown annotations for all 2,980 simulated dialogues (298 scenarios × 5 models × 2 pacing settings”

validation scope · Experiment setup + LLM-as-judge validation + breakdown annotation reliability · confidence 0.80

Limits

Method limits

Validation is based on simulated dialogues rather than live user interactions, and the pipeline depends on LLM-based persona extraction, simulation, and judging. The authors also note that emotion-pacing validation uses an off-the-shelf classifier and that scenario filtering is not a definitive semantic categorization.

Deployment limits

The findings are most directly applicable to stress-testing and diagnosing conversational agents in synthetic, value-sensitive dialogue settings. Transfer to real-world deployment contexts will depend on whether the simulated personas, pacing, and judged breakdowns track actual user behavior and harms.

Boundary conditions

The approach is bounded by the quality of persona extraction, the realism of staged emotional pacing, and the extent to which synthetic scenarios capture ethically sensitive interactions. The authors explicitly caution that same-source model biases and classifier limitations may shape results.

Position in field

This work sits at the intersection of conversational AI evaluation, affect-sensitive interaction, and value-sensitive design. Its main contribution is methodological: it reframes evaluation around dynamic breakdowns in emotionally and ethically charged dialogue and packages that framing into a simulator-plus-taxonomy workflow.

Abstract

Conversational AI is increasingly deployed in emotionally charged and ethically sensitive interactions. Previous research has primarily concentrated on emotional benchmarks or static safety checks, overlooking how alignment unfolds in evolving conversation. We explore the research question: what breakdowns arise when conversational agents confront emotionally and ethically sensitive behaviors, and how do these affect dialogue quality? To stress-test chatbot performance, we develop a persona-conditioned user simulator capable of engaging in multi-turn dialogue with psychological personas and staged emotional pacing. Our analysis reveals that mainstream models exhibit recurrent breakdowns that intensify as emotional trajectories escalate. We identify several common failure patterns, including affective misalignments, ethical guidance failures, and cross-dimensional trade-offs where empathy supersedes or undermines responsibility. We organize these patterns into a taxonomy and discuss the design implications, highlighting the necessity to maintain ethical coherence and affective sensitivity throughout dynamic interactions. The study offers the HCI community a new perspective on the diagnosis and improvement of conversational AI in value-sensitive and emotionally charged contexts.