CHI '26 · Best paper · full-paper review · confidence high

RAG Without the Lag: Enabling "What-If" Analysis for Retrieval-Augmented Generation Pipelines

Quentin Romero Lauro, Shreya Shankar, Sepanta Zeighami, Aditya Parameswaran

This paper’s real contribution is not a better retriever or prompt recipe, but a better debugging loop for RAG. raggy turns parameter changes that normally trigger slow re-indexing into interactive exploration, and the study usefully shows that experienced developers often reason through retrieval before generation when diagnosing failures.


Axes Lens

Rare contribution shape, typical evidence profile. The point here is not a score. It is to show what kind of claim the paper makes, and whether the evidence pattern is unusual or baseline in this 268-review set.

Contribution shape

Knowledge form: method knowledge · typical · 29/268
Novelty type: tool · typical · 14/268
Abstraction level: practice · typical · 85/268
Generalization target: user population · typical · 75/268
Validation mode: qualitative study · typical · 63/268

Evidence profile

Evidence strength: strong · typical · 158/268
Claim alignment: strong · typical · 231/268
Overclaim risk: low · typical · 53/268

Review Summary

This is a strong CHI paper because it identifies a genuine workflow bottleneck in modern AI development and addresses it with a concrete, well-scoped tool. The authors do not claim to solve RAG quality in the abstract; instead, they focus on the practical problem that retrieval and generation are deeply entangled, while existing development workflows make it expensive to ask even basic what-if questions. raggy’s contribution is therefore a systems-plus-interface intervention: Python primitives let developers express pipelines in familiar code, while the browser interface exposes intermediate states, parameter controls, and comparative inspection in a way that supports active debugging rather than passive reruns. The backend story matters too, because the low-latency experience is enabled by pre-materialized indexes and checkpoints rather than hand-wavy claims of interactivity.

The qualitative study is also valuable. It does not prove universal productivity gains, but it does reveal a plausible and important practitioner pattern: developers often inspect retrieval first, even when generation appears suspect, because retrieval quality anchors the rest of the pipeline. That finding helps justify the interface emphasis on retrieval inspection and parameter tweaking.

At the same time, the paper is commendably explicit about its limits. The study is short, the task is bounded, the retriever space is precomputed rather than open-ended, and the interface was not tested on much larger corpora. Participants also wanted stronger experiment tracking and provenance support, which suggests the current tool is a compelling prototype and research contribution rather than a finished production environment. Overall, the claims are appropriately matched to the evidence, and the paper stands out by showing how HCI-style tool design can materially change the pace and structure of RAG development work.

What Changed

Canon before

Prior to this work, RAG development was commonly treated as a slow, tightly coupled engineering process: changing chunking or retrieval settings often implied re-indexing, and debugging retrieval versus generation failures was handled through ad hoc iteration rather than an integrated interactive environment.

Departure from common sense

The paper challenges the assumption that RAG experimentation must be batch-like and slow. It argues that developers can do interactive what-if analysis over chunking, retrieval, and pipeline structure by precomputing likely retrieval configurations and exposing them through a live debugging interface, rather than waiting through repeated re-indexing cycles.
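To make the precomputation idea concrete, here is a minimal sketch in plain Python. It is not raggy's API or implementation: the function names (`chunk`, `build_index`, `precompute`, `retrieve`), the word-window chunker, and the bag-of-words cosine scoring are all assumptions chosen to keep the example self-contained. The only point it illustrates is that indexes for a small grid of chunk sizes can be built once up front, so that switching chunk size at query time becomes a dictionary lookup instead of a re-indexing run.

```python
import math
from collections import Counter


def chunk(text: str, size: int) -> list[str]:
    """Split text into consecutive windows of `size` words (hypothetical chunker)."""
    words = text.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]


def build_index(docs: list[str], size: int) -> list[tuple[str, Counter]]:
    """Build one toy 'index': each chunk paired with its term-count vector."""
    return [(c, Counter(c.lower().split())) for d in docs for c in chunk(d, size)]


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two term-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def precompute(docs: list[str], chunk_sizes: list[int]) -> dict[int, list]:
    """Pay the indexing cost once for every chunk size we expect to explore."""
    return {size: build_index(docs, size) for size in chunk_sizes}


def retrieve(indexes: dict[int, list], query: str, size: int, k: int) -> list[str]:
    """Answer a what-if question by selecting an already-built index."""
    q = Counter(query.lower().split())
    ranked = sorted(indexes[size], key=lambda entry: cosine(q, entry[1]), reverse=True)
    return [text for text, _ in ranked[:k]]
```

With this layout, comparing `retrieve(indexes, query, 4, 1)` against `retrieve(indexes, query, 8, 1)` costs only a ranking pass over stored chunks. The trade-off, which the review's limits section also notes, is that only configurations materialized in advance can be explored this way.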

Actual novelty

The main novelty is raggy: a developer tool that combines Python primitives for composing RAG pipelines with a browser-based debugging interface and backend support for low-latency what-if analysis via pre-computed indexes and checkpoints. The paper also contributes qualitative findings about how experienced practitioners actually debug RAG systems, especially their retrieval-first workflow.
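The review does not describe how raggy's checkpoints work internally. One plausible reading, sketched below under that assumption, is that the output of each pipeline step is cached together with the parameters that produced it, so changing a downstream parameter reuses upstream results. The class `CheckpointedPipeline` and the stand-in steps `fake_retrieve` and `fake_generate` are hypothetical names invented for this illustration.

```python
from typing import Any, Callable


class CheckpointedPipeline:
    """Toy pipeline that caches each step's output, keyed by the query and
    the parameters of that step and every step before it."""

    def __init__(self, steps: list[tuple[str, Callable[..., Any]]]):
        self.steps = steps
        self._cache: dict[tuple, Any] = {}
        self.calls: dict[str, int] = {}  # how often each step actually ran

    def run(self, query: str, params: dict[str, dict[str, Any]]) -> Any:
        value: Any = query
        prefix: tuple = (query,)
        for name, fn in self.steps:
            step_params = params.get(name, {})
            # The key covers the whole upstream history, so an upstream
            # parameter change invalidates everything after it.
            prefix = prefix + ((name, tuple(sorted(step_params.items()))),)
            if prefix not in self._cache:
                self._cache[prefix] = fn(value, **step_params)
                self.calls[name] = self.calls.get(name, 0) + 1
            value = self._cache[prefix]
        return value


def fake_retrieve(query: str, k: int = 2) -> list[str]:
    """Stand-in for an expensive retrieval step."""
    return [f"{query}-chunk{i}" for i in range(k)]


def fake_generate(chunks: list[str], prefix: str = "Answer") -> str:
    """Stand-in for an LLM call."""
    return f"{prefix}: " + " | ".join(chunks)
```

Running the pipeline twice with the same retrieval settings but a different `prefix` for generation executes `fake_retrieve` once and `fake_generate` twice. Parameter values must be hashable in this sketch, since they become part of the cache key.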

Evidence

The paper is grounded in a direct system description, a qualitative study with 12 experienced RAG practitioners, and an explicit discussion of limitations. The evidence supports the core claims that raggy enables low-latency interactive debugging and that participants used it to explore retrieval-first debugging strategies, but the validation remains qualitative and bounded to a one-hour study task on a hospital-document corpus.

“Here, we present raggy, our tool for developing and debugging RAG pipelines. raggy provides developers with a Python library for building RAG pipelines along with an interactive debugging interface.”

actual novelty · 4 Raggy · confidence 0.98

“Developers’ ability to explore this vast parameter space is limited by prohibitively slow iteration cycles—changing parameters often requires re-indexing documents, which can take hours [2]. Overall, developers lack integrated tools that address this”

departure from common sense · 1 Introduction · confidence 0.97

“, use LLMs to synthesize domain-specific debugging primitives based on system architecture and generate corresponding visualizations. For example, in a customer support chatbot, an LLM could analyze retrieved support documents to generate a network visualization showing related issues that customers typically encounter together, based on customer usage patterns”

limitation · 7 Discussion, Limitations and Future Work · confidence 0.96

“To understand how practitioners use raggy and learn more about expert RAG pipeline development strategies, we conducted a user study with 12 participants experienced in RAG pipeline development. Participants completed programming tasks using raggy while thinking aloud”

validation scope · 5 User Study Design · confidence 0.97

Limits

Method limits

The evaluation is a one-hour qualitative study rather than a longitudinal or comparative performance study, so it captures workflow insights and perceived utility more than durable productivity or quality gains in production settings.

Deployment limits

raggy depends on precomputed retrievers and an external vector database, and the paper notes that its histogram and interaction design were not tested on substantially larger corpora. It also does not cover earlier ingestion and preprocessing stages.

Boundary conditions

The contribution is strongest for interactive debugging of RAG pipelines where common retrieval configurations can be pre-materialized and developers are iterating on moderate-scale corpora. Findings come from experienced practitioners working on a hospital QA task, so transfer to other domains or full agentic systems should be made cautiously.

Position in field

This paper sits at the intersection of AI developer tooling, debugging interfaces, and RAG practice. Its contribution is less about improving model quality directly and more about reshaping the developer workflow around retrieval-generation interdependence, making it a notable CHI contribution on interactive tooling for contemporary AI systems.

Abstract