CHI '26 · Best paper · full-paper review · confidence high

Bloom: Designing for LLM-Augmented Behavior Change Interactions

Matthew Jörke, Defne Genç, Valentin Teutschbein, Shardul Sapkota, Sarah Chung, Paul Schmiedmayer, Maria Ines Campero, Abby C King, Emma Brunskill, James A. Landay

Bloom is compelling because it does not merely add chat to a health app; it shows how an LLM can be woven into established behavior-change interactions and then evaluated against a no-LLM control. The key contribution is the reframing: the LLM condition showed no clear short-term activity advantage over the control, but mindset, enjoyment, and engagement improved.

Axes Lens

Rare contribution shape, typical evidence profile. The point here is not a score. It is to show what kind of claim the paper makes, and whether the evidence pattern is unusual or baseline in this 268-review set.

Contribution shape

Knowledge form: generative knowledge · typical · 35/268
Novelty type: artifact · typical · 20/268
Abstraction level: system · typical · 61/268
Generalization target: design family · typical · 38/268
Validation mode: mixed methods · typical · 136/268

Evidence profile

Evidence strength: strong · typical · 158/268
Claim alignment: strong · typical · 231/268
Overclaim risk: low · typical · 53/268

Review Summary

Bloom stands out as a strong CHI-style systems paper because its contribution is not just that it uses an LLM, but that it asks where LLMs actually add value inside a broader behavior-change intervention. The authors position Bloom against a literature that has largely emphasized text-only LLM interactions, then build a richer iOS system that combines conversational coaching with goal setting, action planning, tracking, visualization, ambient display, and notifications. That integration is the real design novelty: the chatbot is not isolated, but acts as both an intervention channel and a source of qualitative context for personalization elsewhere in the system.

The empirical contribution is also more careful than many early LLM papers. The authors explicitly describe the field study as formative, exploratory, and primarily design-oriented rather than as a definitive efficacy trial. That matters because the results are nuanced. Both conditions improved physical activity, but the LLM condition did not show a clear short-term quantitative advantage on wearable outcomes. Instead, the evidence points to stronger psychological and motivational shifts: participants reported stronger beliefs in the benefits of activity, greater enjoyment, and more self-compassion. This is a meaningful departure from the common expectation that LLM augmentation should immediately outperform simpler systems on behavioral metrics. The paper’s more credible claim is that LLMs may be especially useful for shaping mindset, engagement, and flexible planning, which are plausible precursors to longer-term change.

The safety work also strengthens the package. Rather than treating safety as a generic disclaimer, the paper reports a benchmark dataset and evaluation process for harmful coaching outputs. Just as importantly, the authors avoid overstating that result: they say the filters provide meaningful risk reduction, not elimination, and note that larger-scale deployments may need further effort.

Overall, Bloom is best read as a rigorous design-and-evaluation contribution that clarifies the likely near-term role of LLMs in digital health: less as magic behavior-change engines, more as relational and contextual components that can enrich established intervention structures when deployed carefully.

What Changed

Canon before

Prior work focused primarily on text-only LLM interactions or on personalization via quantitative data, and it assumed that LLMs would directly improve short-term physical activity by providing better personal plans or reminders.

Departure from common sense

The paper breaks from the assumption that LLM augmentation directly increases short-term physical activity levels. Instead, it finds that LLMs primarily shift psychological mindsets such as beliefs about benefits, enjoyment, and self-compassion, which may precede longer-term behavior change rather than producing immediate behavioral gains over a strong control.

Actual novelty

The paper presents Bloom, an iOS physical-activity support system that integrates an LLM coaching chatbot with established behavior change interactions such as goal setting, action planning, activity tracking, data visualization, ambient display, and push notifications. It also contributes a safety benchmark dataset for LLM coaching and reports a four-week randomized field study against a no-LLM control.

Evidence

Evidence comes from multiple sources described in the paper’s focused sections: the introduction and abstract define the system contribution and the claim that LLM benefits appear more psychological than immediately behavioral; the safety evaluation reports benchmark construction and residual risk language; and the field-study section states a four-week randomized study with N=54 oriented toward design insights. Together these support a design-oriented systems contribution with moderate-to-strong empirical grounding, while also showing explicit caution about efficacy and deployment risk.

“To address this gap, we present Bloom, a mobile application for promoting PA that combines an LLM coaching chatbot with established behavior change interactions, including goal setting, action planning, activity tracking, data visualization, an ambient display, and push notifications (Figure 1).”

actual novelty · 1 Introduction · confidence 0.97

“…the proportion of participants meeting recommended weekly guidelines, though descriptively, we observed no advantage for the LLM condition in short-term physical activity levels. Instead, our findings suggest that LLMs may be more effective at shifting mindsets that precede longer-term behavior change.”

departure from common sense · Abstract · confidence 0.96

“These results provide evidence that our safety filters substantially mitigate risk by detecting and revising harmful outputs, which gave us confidence to deploy the agent in a field study. However, we caution that our findings should be interpreted as evidence of meaningful risk reduction, not elimination, and future efforts may be required for larger-scale deployments.”

limitation · 4.3 Safety Evaluation · confidence 0.98

“ed a safety benchmark dataset for LLM coaching with 600 examples to evaluate our system’s safety filters. We evaluate Bloom in a four-week, between-subjects, randomized field study with N = 54 participants, comparing Bloom to a no-LLM control that removes the LLM coach and all LLM augmentation.”

validation scope · 1 Introduction · confidence 0.95

Limits

Method limits

The study is explicitly described as formative, exploratory, and primarily design-oriented, and the authors note it was not designed to establish statistically significant treatment-control differences. The field study lasted four weeks with N=54, which limits inference about longer-term behavioral efficacy.
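To make the between-subjects design concrete, here is a minimal Python sketch of one way 54 participants could be randomized into two conditions. It is an illustration only: the paper excerpts quoted here do not say how randomization was performed or whether the arms were equal in size. The function name `assign_conditions`, the condition labels, the participant IDs, and the seed are all hypothetical.

```python
import random


def assign_conditions(participant_ids, conditions=("llm", "control"), seed=0):
    """Shuffle participants, then deal them round-robin into conditions.

    Hypothetical helper: equal-sized arms are an assumption of this
    sketch, not something reported in the reviewed excerpts.
    """
    rng = random.Random(seed)          # seeded so the assignment is reproducible
    shuffled = list(participant_ids)   # copy so the caller's list is untouched
    rng.shuffle(shuffled)
    return {pid: conditions[i % len(conditions)] for i, pid in enumerate(shuffled)}


# Placeholder IDs P01..P54, matching the reported N=54.
ids = [f"P{n:02d}" for n in range(1, 55)]
assignment = assign_conditions(ids)
counts = {c: list(assignment.values()).count(c) for c in ("llm", "control")}
print(counts)  # {'llm': 27, 'control': 27}
```

With roughly 27 participants per arm in a layout like this, only fairly large between-condition differences would be detectable, which is consistent with the authors framing the study as formative rather than as an efficacy trial.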

Deployment limits

Bloom is an iOS application tied to Apple HealthKit, and its safety filters reduce but do not eliminate harmful outputs. The authors explicitly caution that larger-scale deployments may require additional safety work.
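The "reduction, not elimination" framing can be illustrated with a small Python sketch of a detect-and-revise filter scored against labeled examples. This is not Bloom's implementation: the detector, reviser, judge, fallback message, and the three toy examples are all hypothetical stand-ins, and the paper's actual benchmark and filters are not reproduced here.

```python
from dataclasses import dataclass


@dataclass
class Example:
    draft: str      # candidate coach reply before filtering
    harmful: bool   # ground-truth label (assumed to come from annotation)


def filter_reply(draft, is_harmful, revise):
    """Pass the draft through unchanged unless the detector flags it."""
    return revise(draft) if is_harmful(draft) else draft


def residual_harm_rate(examples, is_harmful, revise, judge):
    """Fraction of truly harmful drafts that are still harmful after filtering."""
    harmful = [ex for ex in examples if ex.harmful]
    if not harmful:
        return 0.0
    remaining = sum(judge(filter_reply(ex.draft, is_harmful, revise)) for ex in harmful)
    return remaining / len(harmful)


# Toy stand-ins (hypothetical): a narrow keyword detector and a broader judge.
SAFE_FALLBACK = "I can't advise on that; please check with a clinician."


def toy_detector(text):
    return "ignore the pain" in text.lower()


def toy_revise(text):
    return SAFE_FALLBACK


def toy_judge(text):
    lowered = text.lower()
    return "ignore the pain" in lowered or "skip meals" in lowered


toy_benchmark = [
    Example("Just ignore the pain and keep running.", True),
    Example("Skip meals so the workout burns more.", True),
    Example("A ten-minute walk after lunch is a good start.", False),
]

rate = residual_harm_rate(toy_benchmark, toy_detector, toy_revise, toy_judge)
print(rate)  # 0.5
```

In the toy run the detector catches one harmful draft and misses the other, so half of the harmful drafts survive filtering. A nonzero residual rate of this kind is what "meaningful risk reduction, not elimination" describes, and it is why the caution about larger-scale deployments matters.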

Boundary conditions

The strongest supported claims concern LLM augmentation in a multimodal physical-activity coaching system over a short, four-week period, especially for mindset, engagement, and planning experiences rather than definitive short-term activity gains. Claims should not be generalized to long-term efficacy or to all health-coaching deployments.

Position in field

This paper advances HCI work on LLMs in health behavior change by moving beyond text-only chat toward multimodal augmentation of established behavior change interactions, while also grounding the system in safety benchmarking and a comparative field deployment.

Abstract

Large language models (LLMs) offer novel opportunities to support health behavior change, yet existing work has narrowly focused on text-only interactions. Building on decades of HCI research on effective behavior change interactions, we present Bloom, an application for physical activity promotion that integrates an LLM-based health coaching chatbot with existing design strategies and UI elements. As part of Bloom's development, we conducted a redteaming evaluation and contribute a safety benchmark dataset. In a four-week randomized field study (N=54) comparing Bloom to a non-LLM control, we observed important shifts in psychological outcomes: participants in the LLM condition reported stronger beliefs that activity was beneficial, greater enjoyment, and more self-compassion. Both conditions significantly increased physical activity levels, doubling the proportion of participants meeting recommended weekly guidelines, though descriptively, we observed no advantage for the LLM condition in short-term physical activity levels. Instead, our findings suggest that LLMs may be more effective at shifting mindsets that precede longer-term behavior change.