CHI '26 · Best paper · full-paper review · confidence high

Mining Player Experience Trends From Game Reviews Using Large Language Models

Supriya Dutta, Joel Oksanen, Jaakko Väkevä, Shamit Ahmed, Markus Kirjonen, Perttu Hämäläinen

This is a strong best-paper-level contribution. It turns a previously awkward methodological gap, namely connecting validated player-experience constructs to massive free-text review corpora, into a workable pipeline. It then uses that pipeline to produce substantive longitudinal findings while still acknowledging that the proxy measures discourse more directly than experience itself.


Axes Lens

Rare contribution shape, typical evidence profile. The point here is not a score. It is to show what kind of claim the paper makes, and whether the evidence pattern is unusual or baseline in this 268-review set.

Contribution shape

Knowledge form: method knowledge · typical · 29/268
Novelty type: method · typical · 21/268
Abstraction level: field · typical · 41/268
Generalization target: field argument · typical · 55/268
Validation mode: mixed methods · typical · 136/268

Evidence profile

Evidence strength: strong · typical · 158/268
Claim alignment: strong · typical · 231/268
Overclaim risk: low · typical · 53/268

Review Summary

The paper’s main strength is not just that it uses LLM-era tooling on a large dataset, but that it does so in a way that is conceptually anchored to established player-experience questionnaires rather than relying on vague topic or sentiment labels. That matters because it lets the authors ask longitudinal questions about specific constructs such as emotional challenge, meaning, nostalgia, and boredom, instead of merely reporting that reviews became more positive, negative, or topically different over time. The methodological move—treating review-item semantic similarity as a proxy for reviewer agreement—is simple enough to be reusable, yet the paper does not oversell it as a perfect measurement instrument. The appendix validation, thresholding rationale, and mixed-method follow-up analyses make the contribution feel disciplined rather than opportunistic. Empirically, the findings are interesting because they suggest both growth in eudaimonic or reflective experiences and a rise in boredom, while also showing that these shifts are generally compound effects rather than artifacts of one blockbuster title or one dominant genre. The qualitative coding adds interpretive depth by surfacing the reasons behind trends, and the score-correlation analysis increases practical relevance for design and evaluation. The biggest caveat is also the authors’ own: this is a study of review discourse filtered through an embedding-based proxy, not a direct longitudinal panel of player experiences with ground-truth questionnaire responses. Even so, within those bounds, the paper is methodologically inventive, carefully validated, and field-shaping in how it demonstrates a scalable route from messy public review text to structured player-experience insight.

What Changed

Canon before

Player experience research has traditionally relied on validated questionnaire studies, which require large-scale participant recruitment, and on rudimentary NLP techniques for game review analysis. Both limit scale, nuance, and longitudinal insight. Narrative and emotional experiences were often studied qualitatively on small datasets. Embedding models and other LLMs had not previously been applied at scale to questionnaire-based trend analysis over large user review datasets.

Departure from common sense

Contrary to the assumption that player experiences remain stable, or that negative emotions such as boredom would not rise, the work finds increasing trends in eudaimonic experiences such as emotional challenge, meaning, and nostalgia, alongside an unexpected rise in boredom. It also argues that these trends are driven not by a single landmark game or genre but by multiple compounding contributors.

Actual novelty

The paper contributes a scalable method that maps free-form game reviews to established player-experience questionnaire constructs using embedding-based semantic similarity as a proxy for reviewer agreement. It then applies that method longitudinally to 152,143 Metacritic reviews, combining trend analysis, genre/game contribution analysis, qualitative coding of reasons, and score correlations.
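The mapping step described here can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: the two-dimensional toy vectors stand in for embeddings from whichever embedding model is used, and the function names, the threshold of 0.5, and the sample data are all hypothetical.

```python
import math
from collections import defaultdict


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat an all-zero vector as unrelated to everything
    return dot / (norm_a * norm_b)


def yearly_share_above_threshold(reviews, item_vec, threshold):
    """Per-year share of reviews whose similarity to one questionnaire item
    exceeds the threshold.

    reviews: iterable of (year, embedding) pairs (hypothetical data layout).
    """
    totals = defaultdict(int)
    hits = defaultdict(int)
    for year, vec in reviews:
        totals[year] += 1
        if cosine_similarity(vec, item_vec) > threshold:
            hits[year] += 1
    return {year: hits[year] / totals[year] for year in sorted(totals)}


# Toy stand-ins for a questionnaire-item embedding and four review embeddings.
item_vec = [1.0, 0.0]
reviews = [
    (2010, [1.0, 0.0]),
    (2010, [0.0, 1.0]),
    (2011, [1.0, 1.0]),
    (2011, [2.0, 0.5]),
]

# Threshold value is an assumption chosen for the toy data.
shares = yearly_share_above_threshold(reviews, item_vec, threshold=0.5)
print(shares)
```

Tracking the share of reviews above a threshold, rather than raw similarity values, is one way to turn a noisy per-review signal into a yearly trend line.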

Evidence

The paper grounds its contribution in a precise description of the review-item semantic similarity method and reports longitudinal trend findings across three questionnaires. It shows that the observed trends are compound rather than dominated by single games or genres. An appendix validation against human-rated agreement shows moderate-to-strong correlations and better-than-random threshold classification. The authors also explicitly delimit their claims, noting that they measure review discourse rather than actual experience trends and lack paired questionnaire ground truth.
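To make "better-than-random" concrete: a scorer with no discriminative power yields an AUC of 0.5. The sketch below computes AUC with the pairwise (Mann-Whitney) formulation from similarity scores and binary human labels. The scores, labels, and function name are hypothetical, and the paper's validation data are not reproduced here.

```python
def auc_from_scores(scores, labels):
    """Area under the ROC curve via pairwise comparison.

    Equals the probability that a randomly chosen positive example receives
    a higher score than a randomly chosen negative one; ties count as half.
    """
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative label")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Hypothetical cosine similarities and human agreement labels (1 = agrees).
similarities = [0.62, 0.55, 0.48, 0.40, 0.35, 0.30]
human_agree = [1, 1, 0, 1, 0, 0]

auc = auc_from_scores(similarities, human_agree)
print(f"AUC = {auc:.3f}")
```

The quadratic loop is fine for small validation samples; a rank-based implementation would be preferable for large ones.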

“ Furthermore, we conduct various additional analyses to shed light on the underlying reasons for the trends, and link our findings to relevant academic discourse such as a shift or expansion of focus from hedonic to eudaimonic game experiences [10, 19, 24, 25]. Contribution. We make a methodological and empirical contribution [77] in the form of the first large-sca”

actual novelty · 1 Introduction · confidence 0.96

“ndings of our study. A general conclusion that one can draw based on all the trend breakdowns into genres, games, and qualitative codes is that the trends observable in our data are produced by multiple compounding factors, with no clear dominating contributors such as a single landmark game or genre”

departure from common sense · 7 Discussion · confidence 0.92

“An obvious limitation of our work is that while game reviews no doubt reflect player experiences, to some degree, we can only directly measure trends in review discourse instead of actual experience trends. Confounding factors such as language drift and shifting evaluation norms can also influence the share of reviews above threshold. ”

limitation · 8 Limitations and Future Work · confidence 0.99

“s produce meaningful results on a large dataset of reviews. The threshold-based classification analysis further confirms this. The AUC and F1 scores could be better but the AUC does indicate clearly better than random performance (which would yield AUC=0.5). As expected, the histogram of all the human-rated review-item agreements in Figure 10 indicates ”

validation scope · A Validation: Review-Item Cosine Similarity as an Approximation for Review-Item Agreement · confidence 0.95

Limits

Method limits

The semantic-similarity proxy is explicitly imperfect, requiring averaging and thresholding to reduce noise. The study lacks paired questionnaire ground truth for the reviewed players, so it cannot directly quantify reviewer-level accuracy. Additional risks include language drift, shifting evaluation norms, measurement invariance across changing reviewer populations, and the brevity and variable quality of user reviews.

Deployment limits

The analysis is based on English-language Metacritic user reviews from 2010 to 2024 and may not transfer directly to other platforms, languages, or longer expert reviews without methodological adaptation such as chunking or prompting-based approaches. Correlations with review scores are informative but do not establish causation and should not be read as direct guidance for design decisions.
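As an illustration of the kind of score correlation discussed here, the sketch below computes a Pearson correlation between per-review similarity values and review scores. The choice of Pearson, the function name, and the toy numbers are assumptions made for the example; the coefficient used in the paper may differ, and a nonzero value still says nothing about causation.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0.0 or var_y == 0.0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical per-review similarities to one item, and the review scores.
item_similarities = [0.2, 0.4, 0.5, 0.7]
review_scores = [3, 6, 5, 9]

r = pearson_r(item_similarities, review_scores)
print(f"r = {r:.3f}")
```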

Boundary conditions

Claims are bounded by the selected questionnaires (PXI, CORGIS, AESTHEMOS), the embedding-similarity and thresholding pipeline, and the assumption that reviews honestly foreground salient experiences. Trend interpretations are limited to review discourse in this corpus and may be affected by population shifts, sarcasm, and broader cultural changes over time.

Position in field

This paper extends player-experience research by connecting established questionnaire constructs with large-scale review mining through modern embedding methods. Relative to earlier game-review work centered on adjective counts, topic modeling, sentiment analysis, or small manual qualitative samples, it offers a more scalable bridge between validated measurement traditions and naturally occurring player discourse.

Abstract