CHI '26 · Honorable mention · full-paper review · confidence medium-high

Evalet: Evaluating Large Language Models through Functional Fragmentation

Tae Soo Kim, Heechan Lee, Yoonjoo Lee, Joseph Seering, Juho Kim

Evalet’s main contribution is not just a new interface, but a reframing of LLM evaluation itself: from trusting a single holistic score to inspecting how specific fragments function against criteria. That is a meaningful CHI move because it targets a real interpretability gap in judge-model workflows.

Axes Lens

Rare contribution shape, typical evidence profile. The point here is not a score. It is to show what kind of claim the paper makes, and whether the evidence pattern is unusual or baseline in this 268-review set.

Contribution shape

Knowledge form: method knowledge (typical · 29/268)
Novelty type: method (typical · 21/268)
Abstraction level: system (typical · 61/268)
Generalization target: methodological argument (typical · 16/268)
Validation mode: mixed methods (typical · 136/268)

Evidence profile

Evidence strength: strong (typical · 158/268)
Claim alignment: strong (typical · 231/268)
Overclaim risk: medium (typical · 210/268)

Review Summary

Evalet is strongest as a methodological reframing of LLM-as-a-judge practice rather than as a narrow UI contribution. The paper identifies a genuine pain point: holistic scores and justifications often leave practitioners unable to tell which parts of an output actually mattered, or whether the judge’s reasoning aligns with the intended criteria. Functional fragmentation responds by decomposing outputs into key fragments and assigning rhetorical functions to those fragments, then surfacing them in Evalet for inspection, rating, and comparison. That is a departure from the common-sense assumption that better judge evaluation simply means better scores or better explanations; here, the unit of analysis becomes the fragment-function relation.

The validation is appropriately mixed: a technical evaluation plus a within-subjects study with practitioners (N=10) that compares Evalet against a baseline offering only holistic scores and justifications. The reported gain (48% more evaluation misalignments identified) suggests the approach has practical value, but the evidence is still bounded by the study’s scale and by the specific tasks used.

The paper is also explicit about its limitations: fragment-level focus may miss overall quality, the method does not model the relative importance of functions, and it depends on the evaluator reliably surfacing key fragments. Those caveats matter because they define the boundary between a useful inspection aid and a complete evaluation framework. Overall, this is a solid CHI contribution: a novel method and system with a clear user-facing problem, credible validation, and a well-scoped set of limitations that guard against overclaiming.

What Changed

Canon before

LLM-as-a-judge evaluation is typically treated as a holistic scoring problem: a model emits a score and justification, and practitioners inspect those outputs as a whole. This paper reframes the problem around fragment-level inspection and function attribution.

Departure from common sense

The paper argues against treating a judge model’s holistic score as the primary object of validation. Instead, it pushes practitioners to inspect fragment-level functions, because the score can hide which parts of an output drove the judgment and whether those parts align with the intended criteria.

Actual novelty

Functional fragmentation is introduced as a novel LLM-based evaluation method that breaks outputs into key fragments, assigns rhetorical functions relative to criteria, and then uses those fragment-level functions to support inspection, rating, and comparison in Evalet.
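To make the fragment-function relation concrete, the following is a minimal Python sketch of how such records might be stored and grouped per criterion. It is an illustration under stated assumptions, not the authors' implementation: the `FragmentFunction` class, its field names, the `group_by_criterion` helper, and the sample records are all hypothetical, and the records are hand-written here rather than produced by an LLM evaluator as in the paper.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class FragmentFunction:
    """One fragment of one output, interpreted against one criterion.

    All field names are assumptions made for this sketch.
    """
    output_id: str   # which model output the fragment came from
    fragment: str    # verbatim span taken from that output
    criterion: str   # evaluation criterion the fragment is read against
    function: str    # short label for the role the fragment plays
    fulfills: bool   # True if it helps the criterion, False if it hinders it


def group_by_criterion(records):
    """Count, per criterion and function label, how often the function
    fulfills or hinders the criterion across all outputs."""
    grouped = defaultdict(
        lambda: defaultdict(lambda: {"fulfills": 0, "hinders": 0})
    )
    for record in records:
        key = "fulfills" if record.fulfills else "hinders"
        grouped[record.criterion][record.function][key] += 1
    # Convert nested defaultdicts to plain dicts for easier inspection.
    return {
        criterion: {label: dict(counts) for label, counts in functions.items()}
        for criterion, functions in grouped.items()
    }


# Hand-written illustrative records (hypothetical data, not from the paper).
records = [
    FragmentFunction("out-1", "This always works for everyone.",
                     "Factuality", "overgeneralized claim", False),
    FragmentFunction("out-2", "It never fails.",
                     "Factuality", "overgeneralized claim", False),
    FragmentFunction("out-2", "According to the cited 2020 survey",
                     "Factuality", "source attribution", True),
    FragmentFunction("out-1", "Here is a step-by-step example.",
                     "Helpfulness", "concrete example", True),
]

summary = group_by_criterion(records)
for criterion, functions in summary.items():
    print(criterion, functions)
```

Grouping by criterion first, then by function label, mirrors the idea of looking at the space of functions within one dimension of interest across many outputs. The plain counts treat every function as equally important, which is the same gap the paper acknowledges when it notes that function priority is not modeled.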

Evidence

The paper’s core claim is a method shift from holistic LLM-as-a-judge outputs to fragment-level functional analysis. Evidence includes a technical evaluation and a within-subjects practitioner study (N=10), which reports that participants identified 48% more evaluation misalignments with the fragment-level approach. The limitations explicitly note that fragment-level focus may miss overall quality and does not model function priority.

“We propose functional fragmentation, a method that dissects each output into key fragments and interprets the rhetoric functions that each fragment serves relative to evaluation criteria—surfacing the elements of interest and revealing how they fulfill or hinder user goals”

actual novelty · Abstract + Introduction (functional fragmentation definition) · confidence 0.78

“Instead of visualizing entire outputs, we extract fragment-level functions from multiple outputs for each criterion and then visualize the space of functions for each criterion—supporting exploration of fine-grained model behaviors within dimensions of interest”

departure from common sense · Abstract/Introduction + functional fragmentation framing · confidence 0.72

“2016. "Why should i trust you?" Explaining the predictions of any classifier. In Proceedings of the 22nd ACM SIGKDD international conference on knowledge discovery and data mining”

limitation · 8.5 Limitations · confidence 0.86

“To understand how users analyze models and validate evaluations with Evalet, we conducted a within-subjects study with practitioners (N=10) comparing Evalet against a baseline that only provides holistic scores and justifications, like existing LLM-based evaluation”

validation scope · Technical Evaluation (5) + User Study (6) + datasets/tasks · confidence 0.70

Limits

Method limits

The method may fail to represent overall output quality, and it does not account for the relative importance of fragment-level functions. Its effectiveness also depends on reliably surfacing all key fragments; missed fragments weaken the analysis.

Deployment limits

Evalet is most suitable where outputs can be meaningfully decomposed into criterion-relevant fragments and where practitioners can inspect fragment-level functions. It is less suitable when judgments depend on holistic qualities or when fragment extraction is incomplete.

Boundary conditions

The approach depends on the evaluator surfacing the right fragments and on tasks where fragment-level rhetoric can be mapped to evaluation criteria. If fragments are missing or if overall quality is the main concern, the method’s value drops.

Position in field

This work positions itself as a critique of holistic LLM-as-a-judge evaluation and as a practical alternative for qualitative, fine-grained inspection of model outputs. It contributes a system and method for making judge rationales more inspectable and actionable.

Abstract

Practitioners increasingly rely on Large Language Models (LLMs) to evaluate generative AI outputs through "LLM-as-a-Judge" approaches. However, these methods produce holistic scores that obscure which specific elements influenced the assessments. We propose functional fragmentation, a method that dissects each output into key fragments and interprets the rhetoric functions that each fragment serves relative to evaluation criteria—surfacing the elements of interest and revealing how they fulfill or hinder user goals. We instantiate this approach in Evalet, an interactive system that visualizes fragment-level functions across many outputs to support inspection, rating, and comparison of evaluations. A user study (N=10) found that, while practitioners struggled to validate holistic scores, our approach helped them identify 48% more evaluation misalignments. This helped them calibrate trust in LLM evaluations and rely on them to find more actionable issues in model outputs. Our work shifts LLM evaluation from quantitative scores toward qualitative, fine-grained analysis of model behavior.