Evalet: Evaluating Large Language Models through Functional Fragmentation
Evalet’s main contribution is not just a new interface, but a reframing of LLM evaluation itself: from trusting a single holistic score to inspecting how specific fragments function against criteria. That is a meaningful CHI move because it targets a real interpretability gap in judge-model workflows.
Axes Lens
Rare contribution shape, typical evidence profile. The point here is not a score. It is to show what kind of claim the paper makes, and whether the evidence pattern is unusual or baseline in this 268-review set.
Contribution shape
- Knowledge form: method knowledge (typical · 29/268)
- Novelty type: method (typical · 21/268)
- Abstraction level: system (typical · 61/268)
- Generalization target: methodological argument (typical · 16/268)
- Validation mode: mixed methods (typical · 136/268)
Evidence profile
- Evidence strength: strong (typical · 158/268)
- Claim alignment: strong (typical · 231/268)
- Overclaim risk: medium (typical · 210/268)
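To make the counts easier to compare, they can be converted to shares of the 268-review set. The sketch below only does that arithmetic; the names `axis_counts` and `share` are illustrative, and the cutoff this review uses to label a category "typical" is not stated, so none is assumed here.

```python
# Counts are copied from the Contribution shape and Evidence profile lists;
# the denominator is the 268-review set named in the text.
TOTAL_REVIEWS = 268

axis_counts = {
    "Knowledge form: method knowledge": 29,
    "Novelty type: method": 21,
    "Abstraction level: system": 61,
    "Generalization target: methodological argument": 16,
    "Validation mode: mixed methods": 136,
    "Evidence strength: strong": 158,
    "Claim alignment: strong": 231,
    "Overclaim risk: medium": 210,
}


def share(count, total=TOTAL_REVIEWS):
    """Return count as a percentage of total, rounded to one decimal place."""
    return round(100 * count / total, 1)


shares = {axis: share(count) for axis, count in axis_counts.items()}

# Print from least to most common so the rarer categories appear first.
for axis, pct in sorted(shares.items(), key=lambda item: item[1]):
    print(f"{pct:5.1f}%  {axis}")
```

Read this way, four of the five contribution-shape values each cover under a quarter of the set, while all three evidence-profile values each cover more than half.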
Review Summary
Evalet is strongest as a methodological reframing of LLM-as-a-judge practice rather than as a narrow UI contribution. The paper identifies a genuine pain point: holistic scores and justifications often leave practitioners unable to tell which parts of an output actually mattered, or whether the judge’s reasoning aligns with the intended criteria. Functional fragmentation responds by decomposing outputs into key fragments and assigning rhetorical functions to those fragments, then surfacing them in Evalet for inspection, rating, and comparison. That is a departure from the common-sense assumption that better judge evaluation simply means better scores or better explanations; here, the unit of analysis becomes the fragment-function relation. The validation is appropriately mixed: a technical evaluation plus a within-subjects study with practitioners (N=10). The reported gain in identifying misalignments suggests the approach has practical value, but the evidence is still bounded by the study’s scale and by the specific tasks used. The paper is also refreshingly explicit about limitations: fragment-level focus may miss overall quality, the method does not model the relative importance of functions, and it depends on the evaluator reliably surfacing key fragments. Those caveats matter because they define the boundary between a useful inspection aid and a complete evaluation framework. Overall, this is a solid CHI contribution: a novel method/system with a clear user-facing problem, credible validation, and a well-scoped set of limitations that prevent overclaiming.
What Changed
Canon before
LLM-as-a-judge evaluation is typically treated as a holistic scoring problem: a model emits a score and justification, and practitioners inspect those outputs as a whole. This paper reframes the problem around fragment-level inspection and function attribution.
Departure from common sense
The paper argues against treating a judge model’s holistic score as the primary object of validation. Instead, it pushes practitioners to inspect fragment-level functions, because the score can hide which parts of an output drove the judgment and whether those parts align with the intended criteria.
Actual novelty
Functional fragmentation is introduced as a novel LLM-based evaluation method that breaks outputs into key fragments, assigns rhetorical functions relative to criteria, and then uses those fragment-level functions to support inspection, rating, and comparison in Evalet.
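The fragment-function relation described above can be pictured as a small record type. This is a minimal sketch of one possible representation, not the paper's implementation: the class `FragmentFunction`, its field names, the helper `group_by_criterion`, and the sample records are all hypothetical and hand-written for illustration. In the method itself, an LLM evaluator produces the fragments and function labels.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class FragmentFunction:
    """One fragment of one output, interpreted against one criterion (hypothetical schema)."""
    output_id: str   # which model output the fragment came from
    fragment: str    # the excerpt taken from that output
    criterion: str   # the evaluation criterion being applied
    function: str    # short label for what the fragment does
    fulfills: bool   # True if it helps the criterion, False if it hinders it


def group_by_criterion(records):
    """Map criterion -> function label -> records, pooled across outputs."""
    grouped = defaultdict(lambda: defaultdict(list))
    for record in records:
        grouped[record.criterion][record.function].append(record)
    return {criterion: dict(functions) for criterion, functions in grouped.items()}


# Illustrative records only; these are not taken from the paper's data.
records = [
    FragmentFunction("out-1", "A cache stores recent results.", "clarity", "defines key term", True),
    FragmentFunction("out-2", "Here, latency means response time.", "clarity", "defines key term", True),
    FragmentFunction("out-2", "As everyone obviously knows,", "clarity", "assumes prior knowledge", False),
    FragmentFunction("out-1", "Studies show it is faster.", "evidence", "cites unnamed source", False),
]

by_criterion = group_by_criterion(records)
for criterion, functions in by_criterion.items():
    for function, items in functions.items():
        outputs = sorted({item.output_id for item in items})
        print(f"{criterion} | {function} | outputs: {outputs}")
```

Grouping by criterion first and function second mirrors the idea of comparing how different outputs behave within one dimension of interest, rather than comparing whole outputs.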
Evidence
The paper’s core claim is a method shift from holistic LLM-as-a-judge outputs to fragment-level functional analysis. Evidence includes a technical evaluation and a within-subjects practitioner study (N=10), with the study reporting improved detection of evaluation misalignments. The limitations explicitly note that fragment-level focus may miss overall quality and does not model function priority.
“We propose functional fragmentation, a method that dissects each output into key fragments and interprets the rhetoric functions that each fragment serves relative to evaluation criteria—surfacing the elements of interest and revealing how they fulfill or hinder user goal”
actual novelty · Abstract + Introduction (functional fragmentation definition) · confidence 0.78
“Instead of visualizing entire outputs, we extract fragment-level functions from multiple outputs for each criterion and then visualize the space of functions for each criterion—supporting exploration of fine-grained model behaviors within dimensions of interest”
departure from common sense · Abstract/Introduction + functional fragmentation framing · confidence 0.72
“2016. "Why should i trust you?" Explaining the predictions of any classifier. In Proceedings of the 22nd ACM SIGKDD international conference on knowledge discovery and data mining”
limitation · 8.5 Limitations · confidence 0.86
“To understand how users analyze models and validate evaluations with Evalet, we conducted a within-subjects study with practitioners (N=10) comparing Evalet against a baseline that only provides holistic scores and justifications, like existing LLM-based evaluation”
validation scope · Technical Evaluation (5) + User Study (6) + datasets/tasks · confidence 0.70
Limits
Method limits
The method may fail to represent overall output quality, and it does not account for the relative importance of fragment-level functions. Its effectiveness also depends on reliably surfacing all key fragments; missed fragments weaken the analysis.
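A toy tally shows why the missing priority model matters. The sketch below is an assumption made for illustration and is not Evalet's rating procedure: the function labels, the `net_score` helper, and the weights are all hypothetical. It only demonstrates that treating every fragment-level function as equally important can make two very different outputs look the same.

```python
# Each entry: (function label, +1 if the fragment fulfills the criterion, -1 if it hinders it).
# Labels and signs are invented for illustration.
output_a = [("polite greeting", +1), ("polite closing", +1), ("factual error", -1)]
output_b = [("correct core answer", +1)]

# Hypothetical importance weights; nothing in the source specifies these.
weights = {
    "polite greeting": 1,
    "polite closing": 1,
    "factual error": 5,
    "correct core answer": 5,
}


def net_score(fragments, weights=None):
    """Sum signed fragments; without weights every function counts as 1."""
    total = 0
    for label, sign in fragments:
        importance = weights.get(label, 1) if weights else 1
        total += sign * importance
    return total


# Equal importance: both outputs net out to the same value.
print(net_score(output_a), net_score(output_b))

# With the hypothetical weights, the factual error dominates output A.
print(net_score(output_a, weights), net_score(output_b, weights))
```

Under equal importance the two outputs tie, while the weighted version separates them sharply, which is the kind of distinction a practitioner would currently have to supply by judgment.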
Deployment limits
Evalet is most suitable where outputs can be meaningfully decomposed into criterion-relevant fragments and where practitioners can inspect fragment-level functions. It is less suitable when judgments depend on holistic qualities or when fragment extraction is incomplete.
Boundary conditions
The approach depends on the evaluator surfacing the right fragments and on tasks where fragment-level rhetoric can be mapped to evaluation criteria. If fragments are missing or if overall quality is the main concern, the method’s value drops.
Position in field
This work positions itself as a critique of holistic LLM-as-a-judge evaluation and as a practical alternative for qualitative, fine-grained inspection of model outputs. It contributes a system and method for making judge rationales more inspectable and actionable.