CHI '26 · Honorable mention · full-paper review · confidence medium-high

Beyond Words: Measuring User Experience through Speech Analysis in Voice User Interfaces

Yong Ma, Xuesong Zhang, Xuedong Zhang, Natalia Bartłomiejczyk, Seungwoo Je, Adrian Holzer, Morten Fjeld, Andreas Martin Butz

This is a credible and nicely scoped CHI contribution: it reframes VUI evaluation around users’ own speech and backs that idea with a controlled study plus predictive modeling. The paper is strongest as an implicit measurement contribution, though its claims should stay tied to correlational evidence and the study’s constrained setting.

Axes Lens

Rare contribution shape, typical evidence profile. The point here is not a score. It is to show what kind of claim the paper makes, and whether the evidence pattern is unusual or baseline in this 268-review set.

Contribution shape

Knowledge form: descriptive knowledge (typical · 92/268)
Novelty type: measurement (less common · 3/268)
Abstraction level: task (typical · 36/268)
Generalization target: task class (typical · 63/268)
Validation mode: controlled experiment (typical · 47/268)

Evidence profile

Evidence strength: moderate (typical · 105/268)
Claim alignment: strong (typical · 231/268)
Overclaim risk: medium (typical · 210/268)

Review Summary

This paper is interesting because it shifts the measurement lens in voice-interface research: instead of treating speech only as an interaction medium, it treats speech as a source of UX evidence. That is a sensible but nontrivial move, and the paper supports it with a within-subjects study of 49 participants across three personas and three scenarios, plus analysis of temporal, spectral, and linguistic features alongside standard UX and mood/stress measures. The empirical story is therefore more than a conceptual proposal: it shows that some speech features correlate with self-reported satisfaction and experience, and that a model trained on speech features can classify UX levels with promising accuracy. As a CHI contribution, that makes the paper a solid measurement-oriented piece with practical implications for adaptive VUIs.

At the same time, the paper’s own limitations constrain how the results should be interpreted. The authors explicitly note heterogeneous remote recording conditions, the possibility that UEQ+ benchmark labeling conflates constructs, the inability to make causal claims because multiple assistant characteristics varied together, and limited generalizability because participants were primarily English-speaking. So the right reading is not that speech is a universal replacement for questionnaires, but that speech is a plausible complementary signal for implicit UX sensing in a constrained experimental setting. The work is strongest when framed as a validated proof-of-concept for a task class, not as a field-wide solution.

What Changed

Canon before

UX evaluation for voice assistants has largely relied on task performance metrics and self-report questionnaires rather than users’ own speech as a measurement signal.

Departure from common sense

The paper’s core move is to treat the user’s speech itself as a UX sensor, rather than only using task outcomes or questionnaires after the fact. That is a meaningful departure from the usual evaluation pattern in VUI work, where vocal output is not the primary measurement channel.

Actual novelty

The paper’s novelty is in showing that speech-derived features can be linked to UX labels and used to train a classifier that predicts UX levels from speech alone. The contribution is not just a conceptual suggestion; it is an empirical demonstration that compact acoustic, prosodic, and linguistic markers can support implicit UX measurement.

Evidence

The paper reports a within-subjects study with 49 participants, three VA personas, and three usage scenarios. It analyzes temporal, spectral, and linguistic speech markers alongside standardized UX measures and mood/stress ratings, then evaluates ML classification of UX levels with stratified 5-fold cross-validation. The authors also state limitations around remote audio variability, label construction, causal inference, and English-speaking participants.

“Furthermore, a machine learning model trained on speech features achieved promising accuracy in classifying UX levels, indicating that this might be a reasonable alternative to self-report instruments”

actual novelty · Abstract + Discussion (speech features correlate with UX and enable classification) · confidence 0.72

“Together, these contributions provide methodological guidance and empirical evidence for using speech as a real-time, low-friction proxy for UX evaluation and inform the design of adaptive VUIs that can respond dynamically to users’ vocal behavior during interaction”

departure from common sense · Abstract/Introduction (motivation for speech-based UX sensing) · confidence 0.76

“We present a within-subjects study (N=49) that systematically compared three VA personas across three usage scenarios to investigate whether speech-derived audio features can serve as a proxy for user experience (UX)”

validation scope · Abstract + Study design/UX classification + ML evaluation (RQ3) · confidence 0.70
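
The stratified 5-fold cross-validation mentioned in the evidence summary can be illustrated with a minimal sketch. It only shows the splitting idea: each test fold keeps roughly the same mix of UX levels as the full dataset. The label names, class counts, and the `stratified_kfold` helper are assumptions made for illustration; the review does not state how many UX levels the authors used, how samples were distributed, or how repeated samples from the same participant were handled, and this is not the authors' pipeline.

```python
import random
from collections import Counter, defaultdict


def stratified_kfold(labels, k=5, seed=0):
    """Yield (train_idx, test_idx) pairs; each test fold keeps class shares roughly equal.

    Hypothetical helper for illustration, not the paper's implementation.
    """
    rng = random.Random(seed)

    # Group sample indices by their class label.
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    # Shuffle within each class, then deal indices round-robin into k folds
    # so that every fold receives a similar share of each class.
    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        rng.shuffle(indices)
        for pos, idx in enumerate(indices):
            folds[pos % k].append(idx)

    all_idx = set(range(len(labels)))
    for test in folds:
        yield sorted(all_idx - set(test)), sorted(test)


# Assumed toy labels: three UX levels with unequal frequencies.
labels = ["low"] * 20 + ["medium"] * 50 + ["high"] * 30

for fold_no, (train_idx, test_idx) in enumerate(stratified_kfold(labels), start=1):
    mix = Counter(labels[i] for i in test_idx)
    print(fold_no, len(train_idx), len(test_idx), dict(mix))
```

With class counts divisible by five, as in this toy data, every test fold has exactly the same class mix; with uneven counts the folds differ by at most one sample per class.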

Limits

Method limits

The study is controlled and within-subjects, but the paper itself notes that remote audio collection introduced heterogeneous recording conditions, and that the concurrent variation of multiple assistant characteristics limits causal claims. The UX labels are also derived from UEQ+ benchmark categories, which may conflate constructs.

Deployment limits

The approach is promising for real-time sensing, but the paper’s own limitations indicate that deployment would need robust handling of noisy remote audio, careful label interpretation, and validation beyond English-speaking participants before broad use.

Boundary conditions

Findings are bounded by the study’s three personas, three scenarios, and English-speaking participant pool. The paper also cautions that assistant characteristics varied together, so the speech signal should be interpreted as correlational rather than causal evidence of UX.
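
To make the correlational reading concrete, here is a minimal sketch of a Pearson correlation between one speech feature and a UX score. The feature name, the numbers, and the `pearson_r` helper are all hypothetical; the review does not say which correlation statistic the authors used. A high coefficient here would show only that the two quantities move together, not that one causes the other.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient for two equal-length numeric sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length with at least 2 values")

    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)

    # Sum of cross-deviations and the two root sums of squared deviations.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sd_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    sd_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))

    if sd_x == 0 or sd_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (sd_x * sd_y)


# Hypothetical per-participant values, invented for illustration only.
speech_rate = [3.1, 3.4, 2.8, 3.9, 3.6, 2.5]   # e.g. syllables per second
ux_score = [4.0, 4.5, 3.5, 5.5, 5.0, 3.0]      # e.g. questionnaire scale mean

r = pearson_r(speech_rate, ux_score)
print(f"r = {r:.3f}")
```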

Position in field

This sits at the intersection of VUI evaluation, affective computing, and implicit UX sensing. It advances the field by proposing speech as a complementary measurement channel for UX, while remaining grounded in a controlled empirical study rather than a purely speculative framework.

Abstract

Voice assistants (VAs) are typically evaluated through task performance metrics and self-report questionnaires, but people’s voices themselves carry rich paralinguistic cues that reveal affect, effort, and interaction breakdowns. We present a within-subjects study (N=49) that systematically compared three VA personas across three usage scenarios to investigate whether speech-derived audio features can serve as a proxy for user experience (UX). Participants’ speech was analyzed for temporal, spectral, and linguistic markers, alongside standardized UX measures, brief mood and stress ratings, and a post-study questionnaire. We found correlations between specific speech features and self-reported satisfaction and experience. Furthermore, a machine learning model trained on speech features achieved promising accuracy in classifying UX levels, indicating that this might be a reasonable alternative to self-report instruments. Our findings establish speech as a viable, real-time signal for implicitly measuring UX and point toward adaptive VUIs that respond dynamically to emotional and usability-related vocal cues.