CHI '26 · Honorable mention · full-paper review · confidence medium-high

Towards Aligning Multimodal LLMs with Human Experts: A Focus on Parent–Child Interaction

Weiyan Shi, Kenny Tsu Wei Choo

This is a thoughtful exploratory paper with a clear CHI contribution: it reframes multimodal alignment as a two-stage problem and shows that descriptive agreement is easier than interpretive agreement. The empirical scope is modest, but the paper is honest about that and the results are useful as a boundary-setting case study.

Axes Lens

Rare contribution shape, typical evidence profile. The point here is not a score. It is to show what kind of claim the paper makes, and whether the evidence pattern is unusual or baseline in this 268-review set.

Contribution shape

Knowledge form: method knowledge · typical · 29/268
Novelty type: method · typical · 21/268
Abstraction level: task · typical · 36/268
Generalization target: methodological argument · typical · 16/268
Validation mode: mixed methods · typical · 136/268

Evidence profile

Evidence strength: moderate · typical · 105/268
Claim alignment: strong · typical · 231/268
Overclaim risk: medium · typical · 210/268

Review Summary

This paper’s main strength is conceptual clarity. Rather than treating multimodal LLM alignment as a single end-to-end prediction problem, it decomposes expert reasoning into observation and judgement, then tests whether an MLLM can track each layer separately. That is a sensible and useful move for a socially situated clinical task, because it exposes a common failure mode in AI-for-HCI work: models may reproduce surface descriptions while still missing the criteria experts use to make final assessments.

The reported results support that framing. Alignment is better when the task is to describe cues such as gaze, action, and vocalisation, and weaker when the model must infer the judgement itself. The many-shot prompting results further suggest that prompting can help on the more common classes, but not on the rare Poor class, which is exactly where one would expect sparse supervision and ambiguous criteria to hurt.

The paper is also appropriately cautious: it explicitly calls the work exploratory and notes that Stage 2 uses manually corrected Stage 1 descriptions, which means the judgement results are an upper bound rather than a fully end-to-end system result. That limitation matters, but it does not undermine the contribution; it actually sharpens it by making clear what has and has not been demonstrated.

My main reservation is scope. With only three SLPs and a skewed dataset, the paper cannot support broad claims about clinical generalization or robust deployment. Still, as a CHI paper, it succeeds as a case-based probe that identifies a promising decomposition for future work and gives the field a concrete example of where multimodal alignment is feasible and where it remains fragile.

What Changed

Canon before

Prior CHI work on multimodal LLMs and expert alignment typically assumes direct end-to-end prediction or broad annotation agreement; this paper instead separates observation from judgement in a socially situated clinical reasoning task.

Departure from common sense

The paper’s core move is to avoid asking the model to directly infer expert judgement from raw video. Instead, it decomposes the task into observation and judgement, arguing that alignment may be achievable at the descriptive layer even when expert criteria diverge at the interpretive layer.

Actual novelty

The paper’s novelty is an exploratory two-stage alignment pipeline for parent–child joint-attention analysis: first extract fine-grained behavioural cues such as gaze, action, and vocalisation, then prompt the MLLM to make segment-level Strong/Moderate/Poor judgements from those structured descriptions. The contribution is less a new model than a new way to probe where expert–MLLM alignment breaks.
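
To make the two-stage decomposition concrete, here is a minimal Python sketch of how Stage 2 might consume Stage 1 output. It is an illustration under stated assumptions, not the authors' implementation: the `Observation` fields, the prompt wording, the `judge` helper, and the `model` callable are all hypothetical, and the only details taken from the paper are the three cue types and the Strong/Moderate/Poor label set.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

LABELS = ("Strong", "Moderate", "Poor")  # label set reported in the paper


@dataclass
class Observation:
    """Stage 1 output for one segment (field names are illustrative)."""
    segment_id: int
    gaze: str
    action: str
    vocalisation: str


def describe(obs: Observation) -> str:
    """Flatten one structured observation into a single text line."""
    return (f"gaze: {obs.gaze}; action: {obs.action}; "
            f"vocalisation: {obs.vocalisation}")


def build_judgement_prompt(obs: Observation,
                           examples: List[Tuple[Observation, str]]) -> str:
    """Build a Stage 2 prompt from descriptions only; no video is passed."""
    lines = ["Rate joint attention as Strong, Moderate, or Poor."]
    for ex_obs, ex_label in examples:  # few-shot or many-shot examples
        lines.append(f"{describe(ex_obs)} -> {ex_label}")
    lines.append(f"{describe(obs)} ->")
    return "\n".join(lines)


def judge(obs: Observation,
          examples: List[Tuple[Observation, str]],
          model: Callable[[str], str]) -> str:
    """Ask a (hypothetical) model for a label; flag anything off-vocabulary."""
    reply = model(build_judgement_prompt(obs, examples)).strip().capitalize()
    return reply if reply in LABELS else "Unparsed"


# Toy usage with a stub standing in for a real MLLM call.
example = (Observation(1, "child looks at toy, then parent",
                       "child hands toy to parent", "child babbles"), "Strong")
target = Observation(2, "child looks at toy only",
                     "parent points at toy", "parent names the toy")
print(judge(target, [example], model=lambda prompt: "moderate"))
```

Keeping the judgement step a pure function of text descriptions mirrors the paper's setup, in which Stage 2 sees structured behavioural descriptions rather than raw video. It also makes it easy to swap corrected descriptions for raw Stage 1 output when measuring the end-to-end gap.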

Evidence

Evidence supports a case-based methodological contribution with limited but concrete validation. The paper reports interviews and annotations with three SLPs, then evaluates a two-stage prompting workflow on 25 videos / 615 segments. Results indicate stronger alignment for observation than judgement, with many-shot prompting helping Strong and Moderate cases but not the rare Poor class. The authors explicitly frame the work as exploratory and note that Stage 2 uses manually corrected Stage 1 descriptions, making the reported performance an upper bound rather than a fully end-to-end result.

“We design and evaluate an exploratory MLLM system that aligns with speech-language pathologists’ approaches to joint attention assessment in two stages: (1) observing fine-grained behavioural cues from parent-child interaction videos using expert-informed prompting, achieving up to 85% accuracy across dimensions; and (2) evaluating interaction quality using only structured behavioural descriptions, reaching over 64% average accuracy compared to expert la”

actual novelty · Abstract / Contributions / Section 4 pipeline description · confidence 0.70

“Our findings reveal that alignment is more robust at the observation layer, where experts share common descriptors, than at the judgement layer, where interpretive criteria diverge”

departure from common sense · Abstract / Introduction / Two-stage prompting description · confidence 0.74

“ Taking PRIDE in your home: Implementing home-based Parent–Child Interaction Therapy (PCIT) with fidelity. Handbook of parent-child interaction therapy: Innovations and applications for research and practice (2018), 161–1”

limitation · 5.3 Limitations · confidence 0.95

“Our findings reveal that alignment is more robust at the observation layer, where experts share common descriptors, than at the judgement layer, where interpretive criteria diverge”

validation scope · Discussion + Limitations + Stage 2 results · confidence 0.80

Limits

Method limits

The study is exploratory and small-scale: only three SLPs participated, and the evaluation is based on a limited set of videos and segments. Stage 2 also depends on manually corrected descriptions rather than raw Stage 1 outputs, so the reported judgement performance does not reflect a fully end-to-end pipeline.

Deployment limits

The approach is not yet validated for broader clinical deployment or for diverse interaction types. Its judgement layer appears sensitive to class imbalance and to interpretive disagreement among experts, which limits direct transfer to settings where labels are sparse or criteria are less shared.

Boundary conditions

The strongest alignment appears at the observation layer, where experts share common descriptors. Performance weakens at the judgement layer, especially for the rare Poor class. The dataset is also skewed toward short, likely neurotypical interactions, so the findings should be read as task- and corpus-specific.
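
The point about the rare Poor class can be illustrated with a short Python sketch showing how an overall accuracy figure can hide a failing minority class. The labels below are made-up toy data, not the paper's results, and the helper names are hypothetical.

```python
from collections import Counter
from typing import Dict, List


def overall_accuracy(gold: List[str], pred: List[str]) -> float:
    """Fraction of segments where the predicted label matches the expert."""
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)


def per_class_recall(gold: List[str], pred: List[str]) -> Dict[str, float]:
    """For each expert label, the share of its segments predicted correctly."""
    totals = Counter(gold)
    hits = Counter(g for g, p in zip(gold, pred) if g == p)
    return {label: hits[label] / totals[label] for label in totals}


# Toy, skewed label distribution (illustrative only).
gold = ["Strong"] * 6 + ["Moderate"] * 3 + ["Poor"] * 1
pred = ["Strong"] * 6 + ["Moderate"] * 2 + ["Strong"] + ["Moderate"]

print(overall_accuracy(gold, pred))   # 0.8
print(per_class_recall(gold, pred))   # Poor recall is 0.0 in this toy case
```

In this toy case the overall accuracy is 0.8 even though no Poor segment is recovered, which is why per-class figures are the more informative way to read judgement-layer results on a skewed corpus.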

Position in field

This sits as a CHI-style exploratory probe into expert–AI alignment for socially situated multimodal analysis. Its value is in showing that decomposition into observation and judgement can reveal where MLLMs align with expert reasoning and where they do not, rather than in claiming a general solution.

Abstract

While multimodal large language models (MLLMs) are increasingly applied in human-centred AI systems, their ability to understand complex social interactions remains uncertain. We present an exploratory study on aligning MLLMs with speech–language pathologists (SLPs) in analysing joint attention in parent–child interactions, a key construct in early social–communicative development. Drawing on interviews and video annotations with three SLPs, we characterise how observational cues of gaze, action, and vocalisation inform their reasoning processes. We then test whether an MLLM can approximate this workflow through a two-stage prompting approach that separates observation from judgement. Our findings reveal that alignment is more robust at the observation layer, where experts share common descriptors, than at the judgement layer, where interpretive criteria diverge. We position this work as a case-based probe into expert–AI alignment in complex social behaviour, highlighting both the feasibility and the challenges of applying MLLMs to socially situated interaction analysis.