Cocoa: Co-Planning and Co-Execution with AI Agents
Cocoa is a strong CHI systems contribution because it turns a broad critique of rigid agent workflows into a concrete interaction design: editable shared plans, explicit delegation, and notebook-like execution inside documents. The evaluations suggest this design improves steerability without obvious usability loss, though the evidence is still bounded to short-term research settings and a CS-heavy participant pool.
Axes Lens
Rare contribution shape, typical evidence profile. The point here is not a score. It is to show what kind of claim the paper makes, and whether the evidence pattern is unusual or baseline in this 268-review set.
Contribution shape
- Knowledge form: generative knowledge (typical · 35/268)
- Novelty type: artifact (typical · 20/268)
- Abstraction level: artifact (typical · 19/268)
- Generalization target: task class (typical · 63/268)
- Validation mode: mixed methods (typical · 136/268)
Evidence profile
- Evidence strength: strong (typical · 158/268)
- Claim alignment: strong (typical · 231/268)
- Overclaim risk: medium (typical · 210/268)
Review Summary
Cocoa’s main contribution is not merely that it adds planning to an AI assistant, but that it treats planning and execution as continuously revisable collaborative objects. That is a meaningful departure from the dominant chatbot pattern and from systems that isolate planning before execution. By embedding interactive plans directly in a document editor, the paper gives users a shared representation of task structure, role assignment, and progress, which makes delegation visible and editable rather than implicit. This is a compelling HCI move because it translates mixed-initiative ideals into concrete interface mechanisms: users can rewrite steps, reassign them, inspect intermediate outputs, and move back and forth between planning and execution as the task unfolds.
The evidence base is also relatively solid for a CHI artifact paper. The formative study motivates the design space, the implementation section clearly states what Cocoa is, the lab study compares it against a strong status-quo chat baseline, and the deployment adds ecological validity. The reported results support the claim that Cocoa improves steerability while preserving ease of use, and the qualitative framing helps explain why: users can intervene earlier and more locally, instead of relying on coarse prompt-and-wait cycles.
That said, the paper should still be read as a strong demonstration in a specific domain rather than a universal recipe. The baseline does not isolate every design factor, the deployment is short, and the participant pool is concentrated in CS and adjacent research cultures. The authors are appropriately candid about these limits, including document linearity, lack of multimodality, and no code execution. Overall, this is an excellent best-paper-level contribution because it offers both a polished system and a reusable interaction thesis for future human-agent workflow design.
What Changed
Canon before
Many existing AI agent systems treat planning and execution as rigidly separate stages or support limited human interaction only to fix faults in "autonomous" workflows, lacking flexibility in delegation and iteration during collaborative research tasks.
Departure from common sense
This paper rejects the assumption that planning and execution in human-agent systems should be cleanly separated or that human involvement should mainly be reactive correction. Instead, it argues for interleaving co-planning and co-execution so users and agents can jointly edit plans, assign responsibility step by step, and revise workflows after partial results.
Actual novelty
The paper introduces Cocoa, a document-editor-based system for scientific research that uses interactive plans to support mixed-initiative co-planning and co-execution. Its novelty is the combination of editable shared plans, explicit assignment of steps to either user or agent, notebook-like stepwise execution, and direct refinement of intermediate outputs within one integrated workflow.
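To make the interaction pattern concrete, the following is a minimal illustrative sketch of an editable plan with per-step assignment and stepwise execution. It is not Cocoa's implementation: every name here (`Actor`, `Step`, `InteractivePlan`, `run_next`, and the `agent` callable) is hypothetical and chosen only for illustration, and the real system's data model and agent interface are not described at this level in the review.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Callable, List, Optional


class Actor(Enum):
    """Who is responsible for a step (hypothetical names, not from the paper)."""
    USER = "user"
    AGENT = "agent"


@dataclass
class Step:
    description: str
    assignee: Actor
    output: Optional[str] = None  # None until the step has been carried out


@dataclass
class InteractivePlan:
    """A shared plan that either party can edit, reassign, and run step by step."""
    steps: List[Step] = field(default_factory=list)

    def reassign(self, index: int, actor: Actor) -> None:
        # Explicit delegation: hand a single step to the user or the agent.
        self.steps[index].assignee = actor

    def edit(self, index: int, description: str) -> None:
        # Editing a step discards its old output so it will be run again.
        step = self.steps[index]
        step.description = description
        step.output = None

    def next_pending(self) -> Optional[int]:
        # Index of the first step without an output, or None if all are done.
        for i, step in enumerate(self.steps):
            if step.output is None:
                return i
        return None

    def run_next(
        self,
        agent: Callable[[str], str],
        user_input: Optional[str] = None,
    ) -> Optional[Step]:
        # Notebook-like execution: run exactly one step, then return control.
        i = self.next_pending()
        if i is None:
            return None
        step = self.steps[i]
        if step.assignee is Actor.AGENT:
            step.output = agent(step.description)
        else:
            if user_input is None:
                raise ValueError("a user-assigned step needs user_input")
            step.output = user_input
        return step
```

In this sketch, interleaving falls out of two choices: `run_next` executes a single step and returns, so the plan can be changed between steps, and `edit` clears a step's output, so a revised step is picked up again by the next call. Whether Cocoa invalidates outputs this way is an assumption of the sketch, not a claim about the system.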
Evidence
Evidence comes from a formative study with 9 researchers, a system implementation section describing Cocoa’s interactive-plan design, a within-subjects lab study with 16 researchers against a chat baseline, and a 7-day field deployment with 7 participants. The lab results report significantly greater steerability and more active co-execution behavior, while the deployment shows how explicit delegation and interleaving were used in day-to-day research work.
“4 Cocoa: System Walkthrough and Implementation We present Cocoa, a system that embeds an AI agent into a document editor and provide an affordance we call interactive plans that allow users to collaboratively plan (co-plan) with the agent—the agent proposes an initial plan of execution to tackle a user request that the user can edit to their liking.”
actual novelty · 4 Cocoa: System Walkthrough and Implementation · confidence 0.98
“ On the other hand, researchers have developed systems with user-guided workflows, where the user is in full control of task planning, while the agent executes scoped subtasks [1, 33, 47, 99]. Whether agent- or user-guided, these systems encode a particular configuration of human-agent collaboration—there are few affordances to flexibly delegate agency by altering who is guidin”
departure from common sense · 1 Introduction · confidence 0.96
“blating parts of our design. For example, our baseline could have been an interactive planning interface that did not support stepwise execution. In future work, we can expand our participant pool and run more studies with ablations of Cocoa to pinpoint which particular affordances participants valued most. Further, the underlying agent used in Cocoa also had some technical limitations”
limitation · 9 Limitations and Future Work · confidence 0.99
“ A lab study (n = 16) found that Cocoa enabled steerability without sacrificing ease-of-use, and a week-long field deployment (n = 7) showed how researchers collaborated with Cocoa to accomplish real-world tasks. Figure 1: On the left: an illustration of co-planning: the agent proposes a plan of action in response to a user request.”
validation scope · Abstract · confidence 0.97
Limits
Method limits
The evaluation is strong for an HCI systems paper but still bounded: the lab study involved 16 researchers in 90-minute sessions on research-document tasks, and the field deployment lasted only 7 days with 7 participants. The comparison baseline was a chatbot-style interface rather than a fuller ablation set, so the contribution of individual Cocoa affordances is not fully isolated.
Deployment limits
Deployment evidence is limited to a short 7-day study with researchers in CS and CS-adjacent domains. This constrains claims about long-term adoption, broader workplace integration, and transfer to non-research or non-CS settings.
Boundary conditions
Cocoa is designed for scientific research workflows centered on documents, literature-grounded tasks, and iterative planning/execution. Its usefulness depends on contexts where users benefit from explicit delegation and can inspect or edit intermediate outputs. The implementation also depends on a GPT-4o-based agent and Semantic Scholar-backed literature tools, so findings may not directly transfer to multimodal, code-heavy, or non-document-centered domains.
Position in field
The paper contributes a concrete interaction pattern for human-agent collaboration that sits between rigid user-led and agent-led workflows. Rather than only advocating more oversight, it operationalizes mixed initiative through interactive plans, explicit delegation, and notebook-like execution inside a document environment, positioning Cocoa as both a system contribution and a design argument for more fluid human-agent workflow control.