CHI '26 · Honorable mention · full-paper review · confidence medium-high

TurnStyle: A Framework for Analyzing Human Conversational Behaviors to Predict Success in LLM-Assisted Tasks

Urvi Awasthi, Lisa Krayer, Daniel Sack

TurnStyle is a solid CHI-style framework paper: its main contribution is not a flashy interface but a reusable way to code human turns in LLM conversations and connect them to outcomes. The strongest part is the cross-dataset validation; the main caution is that the evidence is predictive and associational, not causal.

Axes Lens

Rare contribution shape, typical evidence profile. The point here is not a score. It is to show what kind of claim the paper makes, and whether the evidence pattern is unusual or baseline in this 268-review set.

Contribution shape

Knowledge form: method knowledge (typical · 29/268)
Novelty type: framework (typical · 59/268)
Abstraction level: practice (typical · 85/268)
Generalization target: task class (typical · 63/268)
Validation mode: mixed methods (typical · 136/268)

Evidence profile

Evidence strength: strong (typical · 158/268)
Claim alignment: strong (typical · 231/268)
Overclaim risk: medium (typical · 210/268)

Review Summary

TurnStyle is best read as a framework-and-methods contribution that tries to move the field beyond prompt-centric or model-centric evaluation of LLM use. The paper’s central idea is that the meaningful unit of analysis is the human turn in a conversation, and that these turns can be categorized in a way that is both theoretically grounded and operationally useful for sequential modeling. That is a real CHI contribution because it reframes collaboration with LLMs as a behavioral process rather than a one-shot interaction artifact. The evidence summary suggests the authors do more than propose labels: they apply the taxonomy across three outcome-labeled corpora and report predictive regularities, including a pooled HMM result where spending too many turns in an Information Request-dominated state is associated with lower success. That supports the claim that the framework captures behavior linked to outcomes across contexts. At the same time, the paper is careful about scope: it explicitly says the analyses are associational, notes that logs omit outside work or help, and acknowledges possible bias from LLM-based annotation. So the contribution is strong as a reusable analytical framework and as descriptive/methodological knowledge, but it should not be oversold as evidence that the behaviors cause success. The most convincing reading is that TurnStyle offers a durable vocabulary and analysis pipeline for studying human–LLM collaboration, especially in structured tasks where success can be measured and conversational traces are available.

What Changed

Canon before

Prior CHI work on LLM-assisted work often emphasized prompt quality, model capability, or coarse conversation outcomes; fewer frameworks operationalized human behavior at the turn level in a way meant to survive model churn and support sequential analysis across tasks.

Departure from common sense

The paper’s core move is to analyze human–LLM collaboration as a turn-level behavioral trajectory rather than as a static prompt or a model-capability benchmark. That is a meaningful departure because it treats the human side as the analyzable object and explicitly aims for a framework that remains useful as models change.

Actual novelty

TurnStyle’s novelty is a domain-agnostic, turn-level taxonomy that adds LLM-specific behaviors such as information requests and prompt-engineering practices, defined at a granularity suitable for sequential modeling. The contribution is not just a new label set; it is a framework intended to connect conversational micro-behaviors to outcome prediction across datasets.

Evidence

The paper applies TurnStyle across three outcome-labeled corpora and reports that its behavioral signals predict success. Evidence includes a pooled HMM result in which time spent in an Information Request-dominated state is associated with lower success, plus the paper’s stated use of mixed-effects and sequence analyses across StudyChat, DevGPT, and a workplace reskilling trial. The limitations section also explicitly narrows the scope to associational findings and the available STEM-oriented datasets.
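To make the Information Request finding concrete, here is a minimal sketch of a much simpler proxy than the paper's HMM: the share of observed human turns coded as Information Request, compared against success. This is not the authors' pipeline. The function names, the 0.5 threshold, and the toy data are all assumptions for illustration; only the label names "Information Request" and "Task Management" come from the review text.

```python
from statistics import mean

# Label name taken from the review; everything else here is hypothetical.
INFO_REQUEST = "Information Request"

def info_request_share(turn_labels):
    """Fraction of human turns in one conversation coded as Information Request."""
    if not turn_labels:
        return 0.0
    return sum(label == INFO_REQUEST for label in turn_labels) / len(turn_labels)

def success_rate_by_share(conversations, threshold=0.5):
    """Compare success rates above vs. at-or-below a share threshold.

    conversations: list of (turn_labels, succeeded) pairs.
    Returns (rate_high_share, rate_low_share); a rate is None if its group is empty.
    """
    high, low = [], []
    for labels, succeeded in conversations:
        bucket = high if info_request_share(labels) > threshold else low
        bucket.append(1.0 if succeeded else 0.0)
    return (mean(high) if high else None, mean(low) if low else None)

# Toy, made-up data for illustration only (not from the paper).
toy = [
    (["Information Request", "Information Request", "Task Management"], False),
    (["Task Management", "Information Request", "Task Management"], True),
    (["Information Request", "Information Request", "Information Request"], False),
    (["Task Management", "Task Management"], True),
]
rate_high, rate_low = success_rate_by_share(toy)
```

A gap between the two rates would be descriptive only. As with the paper's own analyses, it would show an association and not that the behavior causes the outcome.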

“While there are surface similarities in the kinds of moves that appear, TurnStyle is scoped specifically to task‑oriented human–LLM collaboration and defined at a granularity intended for sequential modeling and outcome prediction, rather than generic conversation tagging”

actual novelty · Methods/Taxonomy development + “What TurnStyle adds beyond prior taxonomies” · confidence 0.74

“Since TurnStyle only provides a taxonomy to classify human behavior, it enables us to capture fluidity in conversational dynamics as humans use LLMs to navigate across contexts and the human-LLM relationship changes over time in terms of confidence in LLM capabilities and subsequent reliance or lack thereof”

departure from common sense · Abstract/Introduction (framing vs prompting/static evaluation) · confidence 0.66

“TurnStyle: A Framework for Analyzing Human Conversational Behaviors to Predict Success in LLM-Assisted Tasks | Proceedings of the 2026 CHI Conference on Human Factors in Computing Systems”

limitation · Limitations and Further Work (6.1–6.3) · confidence 0.84

“2 Validate the scope of the task with the LLM while in a sequence of conversational Task Management turns Local transition analysis at the subcategory level shows that following Defining task and asking for specific output with a single Agreeing or providing additional information in agreement is enriched (pooled odds ratio OR=1.”

validation scope · Abstract + Results (cross-dataset pooled effects) · confidence 0.78
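The local transition analysis quoted above reports an odds ratio for one subcategory following another. A minimal sketch of how such an enrichment could be computed from coded turn sequences follows. It is an assumption-laden illustration, not the paper's procedure: the function name, the Haldane-Anscombe correction of 0.5, the "Other" label, and the toy sequences are hypothetical, and the paper's pooled estimate across datasets likely involves additional modeling not shown here.

```python
# Subcategory names as quoted in the review; "Other" is a hypothetical filler label.
DEFINE = "Defining task and asking for specific output"
AGREE = "Agreeing or providing additional information in agreement"
OTHER = "Other"

def transition_odds_ratio(sequences, source, target, correction=0.5):
    """Odds ratio for `target` following `source` vs. following any other label.

    2x2 table over adjacent turn pairs (prev, next):
        a: prev == source, next == target
        b: prev == source, next != target
        c: prev != source, next == target
        d: prev != source, next != target
    `correction` is added to every cell to avoid division by zero.
    """
    a = b = c = d = 0
    for seq in sequences:
        for prev, nxt in zip(seq, seq[1:]):
            if prev == source and nxt == target:
                a += 1
            elif prev == source:
                b += 1
            elif nxt == target:
                c += 1
            else:
                d += 1
    a, b, c, d = (x + correction for x in (a, b, c, d))
    return (a * d) / (b * c)

# Toy, made-up sequences for illustration only (not from the paper).
toy_sequences = [
    [DEFINE, AGREE, OTHER],
    [DEFINE, AGREE],
    [OTHER, OTHER, DEFINE, OTHER],
    [OTHER, AGREE],
]
odds_ratio = transition_odds_ratio(toy_sequences, DEFINE, AGREE)
```

An odds ratio above 1 means the target subcategory follows the source more often than it follows other turns. The toy value says nothing about the effect size the paper reports.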

Limits

Method limits

The analyses are associational rather than causal, and the framework depends on annotation quality plus the availability of turn-level conversational logs. The paper also notes potential bias in LLM-assisted annotation and variability across datasets, which limits how far the predictive patterns can be generalized without further validation.
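Because the framework depends on annotation quality, one common way to check agreement between two annotators (for example, a human coder and an LLM-based coder) is Cohen's kappa. The review does not say which reliability statistic the paper's protocol uses, so the choice of kappa here is an assumption, and the function name and toy labels are hypothetical.

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labeling the same turns, in the same order."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("Both annotators must label the same non-empty set of turns.")
    n = len(labels_a)
    # Observed agreement: share of turns where both annotators chose the same label.
    observed = sum(x == y for x, y in zip(labels_a, labels_b)) / n
    # Chance agreement: product of each annotator's marginal label frequencies.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[k] * freq_b[k] for k in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # Both annotators used a single identical label throughout.
    return (observed - expected) / (1 - expected)

# Toy, made-up annotations for illustration only (not from the paper).
human = ["Information Request", "Information Request", "Task Management", "Task Management"]
llm = ["Information Request", "Task Management", "Task Management", "Task Management"]
kappa = cohens_kappa(human, llm)
```

Kappa corrects raw agreement for agreement expected by chance, which matters when one category, such as Information Request, dominates the corpus.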

Deployment limits

Deployment is constrained by the need for detailed conversational traces and by the fact that the strongest evidence comes from domains with objective success metrics and relatively structured tasks. The framework may be less directly transferable to settings where outcomes are ambiguous, logs are incomplete, or human work happens substantially outside the captured conversation.

Boundary conditions

The paper itself limits interpretation to domains such as programming, statistics, and course assignments, and it explicitly avoids causal claims. Its predictive claims are strongest where turn-level behavior is observable, task success is measurable, and the conversational interaction is central to the work.

Position in field

TurnStyle sits between taxonomy-building and predictive behavioral analysis: it extends prior conversation taxonomies into an LLM-specific, sequentially analyzable framework and validates it on multiple datasets. In CHI terms, it is a methods/framework contribution with empirical evidence rather than a pure system demo or a purely descriptive coding scheme.

Abstract

LLMs are widespread across educational and professional environments, often used to tackle tasks beyond users' prior expertise. However, there is limited work on task-agnostic, turn-level frameworks to characterize human communication styles with LLMs that are linked to better task outcomes. We introduce TurnStyle, a framework that provides a turn-level taxonomy of human contributions in human–AI conversations, grounded in collaborative learning theory, with a reliability protocol and open-source tooling for public datasets. We apply TurnStyle to two public multi-turn corpora with objective outcomes – StudyChat (college-level assignments) and DevGPT (software engineering issues and pull requests (PRs)) – and to a workplace reskilling randomized trial in which management consultants used ChatGPT for coding, statistics, and machine learning. Across 3,365 conversations with 26,335 human turns spanning the three datasets, mixed-effects and sequence analyses converge on task-agnostic, trainable behaviors that predict task success; this has implications for training, evaluation, and the design of collaborative AI systems.