CHI '26 · Honorable mention · full-paper review · confidence medium-high

Building Benchmarks from the Ground Up: Community-Centered Evaluation of LLMs in Healthcare Chatbot Settings

Hamna Hamna, Gayatri Bhat, Sourabrata Mukherjee, Faisal M. Lalani, Evan Hadfield, Divya Siddarth, Kalika Bali, Sunayana Sitaram

This is a strong CHI-style methodological paper: the main contribution is not a new model, but a community-centered evaluation framework that makes benchmark construction itself participatory. The validation is credible but bounded to one healthcare case study, and the paper is careful enough to acknowledge that LLM judges do not substitute for humans.

Axes Lens

Rare contribution shape, typical evidence profile. The point here is not a score. It is to show what kind of claim the paper makes, and whether its evidence pattern is unusual or baseline within this 268-review set.

Contribution shape

Knowledge form: method knowledge (typical · 29/268)
Novelty type: framework (typical · 59/268)
Abstraction level: practice (typical · 85/268)
Generalization target: methodological argument (typical · 16/268)
Validation mode: mixed methods (typical · 136/268)

Evidence profile

Evidence strength: moderate (typical · 105/268)
Claim alignment: medium (typical · 32/268)
Overclaim risk: medium (typical · 210/268)

Review Summary

This paper’s value is in reframing LLM evaluation as a community-grounded design problem rather than a purely technical scoring exercise. The authors explicitly challenge the common benchmark assumption that standardized or simulated tasks are sufficient, especially in healthcare where everyday needs, cultural practices, and local context shape what “good” means. Samiksha is presented as a co-creation pipeline in which community feedback informs what to evaluate, how to build the benchmark, and how to score outputs. That makes the contribution primarily methodological and framework-oriented: it is about a process for constructing evaluation artifacts with stakeholders, not just a one-off benchmark instance.

The validation is meaningful but not universal. The paper demonstrates the approach in India on multilingual health queries and compares human judgments with LLM-as-judge approaches. The discussion strengthens the paper by showing that automated evaluators can agree with each other while diverging from human judgment, which is an important limitation for any claim that LLM judges can replace people in this setting.

Overall, the paper is well aligned with CHI’s interest in situated, participatory, and socially grounded systems work. Its strongest contribution is the methodological argument that benchmark design should be co-produced with the communities affected by the system, while its main limitation is scope: the evidence is compelling for the studied healthcare context, but the paper does not establish broad generality beyond that domain or beyond settings where community collaboration is feasible.

What Changed

Canon before

Prior CHI and HCI benchmark work often evaluates LLMs with generic, domain-specific, or simulated tasks that do not fully reflect lived community needs in healthcare settings.

Departure from common sense

The paper argues against the default assumption that benchmark quality comes from standardized or simulated tasks alone; instead, it treats community-grounded query creation and scoring as necessary for healthcare evaluation because generic benchmarks can miss everyday needs, cultural practices, and nuanced contexts.

Actual novelty

The core novelty is Samiksha, a co-creation pipeline that integrates community feedback into benchmark design, item creation, and scoring, so evaluation is not just about measuring model outputs but about building the evaluation protocol with the affected community.

Evidence

The paper presents a community-centered evaluation pipeline for healthcare chatbot benchmarking and demonstrates it in India with multilingual health queries. The evidence supports a methodological contribution rather than a new model or dataset alone: community input shapes what is evaluated, how the benchmark is built, and how outputs are scored. Validation includes a case study and comparative evaluation of human and LLM judges, with discussion of where automated judging diverges from human assessment.
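
To make "community input shapes how outputs are scored" concrete, here is a minimal sketch of rubric-based scoring. It is not the authors' implementation: the criterion names, weights, the 0 to 5 scale, and the weighted-average aggregation are all assumptions chosen for illustration, standing in for whatever rubric a CSO consultation would actually produce.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Criterion:
    """One rubric criterion (hypothetical; names and weights are illustrative)."""
    name: str
    weight: float   # relative importance, assumed to come from community consultation
    max_score: int  # top of the rating scale for this criterion


def rubric_score(ratings, rubric):
    """Return a weighted score in [0, 1] for one model response.

    ratings maps criterion name -> rating given by a human annotator or LLM judge.
    """
    total_weight = sum(c.weight for c in rubric)
    if total_weight <= 0:
        raise ValueError("rubric weights must sum to a positive value")
    score = 0.0
    for c in rubric:
        rating = ratings[c.name]  # raises KeyError if a criterion was not rated
        if not 0 <= rating <= c.max_score:
            raise ValueError(f"rating for {c.name!r} is outside 0..{c.max_score}")
        score += c.weight * rating / c.max_score
    return score / total_weight


# Hypothetical rubric: accuracy is weighted twice as heavily as the other criteria.
RUBRIC = [
    Criterion("medical_accuracy", weight=2.0, max_score=5),
    Criterion("cultural_fit", weight=1.0, max_score=5),
    Criterion("clarity", weight=1.0, max_score=5),
]

example = rubric_score(
    {"medical_accuracy": 5, "cultural_fit": 0, "clarity": 0}, RUBRIC
)
```

The design point is that the rubric is data rather than code, so a different community or domain could swap in its own criteria and weights without touching the scoring function.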

“The novelty of our work lies in this co-creation process and in the systematic integration of community feedback at every stage”

actual novelty · Introduction (novelty statement) · confidence 0.70

“Large Language Models (LLMs) are typically evaluated through general or domain-specific benchmarks testing capa…”

departure from common sense · Abstract/Introduction (Samiksha overview) · confidence 0.66

“Association for Computational Linguistics, Online, 9241–9250. [73] Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Da…”

limitation · Discussion: Human Evaluation vs. LLMs-as-Judges · confidence 0.74

“We conduct a fine-grained evaluation of three LLMs using both human annotators and LLM-as-judge methods [73], utilizing rubrics created in consultation with CSOs”

validation scope · Abstract/Methodology/Evaluation setup · confidence 0.62

Limits

Method limits

The evaluation is demonstrated in one healthcare context in India and on a multilingual set of queries; the paper’s own discussion indicates that automated judges compress scores and diverge from human judgment, limiting claims about replacing human evaluation.
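
The two failure modes named above (judges diverging from humans, and judges compressing scores) can be checked separately. The sketch below is one way to do that, not the paper's analysis: it assumes Pearson correlation as the agreement measure and population standard deviation as the spread measure, and every score in it is made up for illustration.

```python
from statistics import mean, pstdev


def pearson(xs, ys):
    """Pearson correlation between two equal-length score lists."""
    if len(xs) != len(ys):
        raise ValueError("score lists must have the same length")
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    if sx == 0 or sy == 0:
        raise ValueError("correlation is undefined for constant scores")
    return cov / (sx * sy)


def score_spread(scores):
    """Population standard deviation; a small value suggests compressed scores."""
    return pstdev(scores)


# Hypothetical 1-5 rubric scores for the same six responses (not from the paper).
human = [1, 5, 2, 4, 1, 5]
judge_a = [3, 4, 4, 3, 3, 4]
judge_b = [3, 4, 4, 3, 4, 4]

judge_vs_judge = pearson(judge_a, judge_b)
human_vs_judge = pearson(human, judge_a)
human_spread = score_spread(human)
judge_spread = score_spread(judge_a)
```

In this toy data the two judges track each other more closely than either tracks the human annotator, and the judge scores occupy a much narrower band. High judge-to-judge agreement on its own therefore says little about validity against human judgment.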

Deployment limits

The approach depends on community participation and CSO collaboration, so deployment requires access to local stakeholders, culturally informed facilitation, and enough resources to co-create and score benchmarks.

Boundary conditions

Best suited to domains where lived experience, cultural context, and local language matter strongly; less directly transferable to settings where community co-creation is infeasible or where benchmark tasks are narrowly technical and context-insensitive.

Position in field

Positions benchmark construction as a participatory, community-centered practice for healthcare LLM evaluation, extending beyond conventional benchmark design by embedding stakeholder input into the evaluation pipeline and by showing how community consultation can shape query curation, rubric design, and response scoring in a multilingual Indian healthcare setting.

Abstract

Large Language Models (LLMs) are typically evaluated through general or domain-specific benchmarks testing capabilities that often lack grounding in the lived realities of end users. Critical domains such as healthcare require evaluations that extend beyond artificial or simulated tasks to reflect the everyday needs, cultural practices, and nuanced contexts of communities. We propose Samiksha, a community-driven evaluation pipeline co-created with civil-society organizations (CSOs) and community members. Our approach enables scalable, automated benchmarking through a culturally aware, community-driven pipeline in which community feedback informs what to evaluate, how the benchmark is built, and how outputs are scored. We demonstrate this approach in the health domain in India. Our analysis highlights how current multilingual LLMs address nuanced community health queries, while also offering a scalable pathway for contextually grounded and inclusive LLM evaluation.