CHI '26 · Honorable mention · full-paper review · confidence medium-high

Simple changes to content curation algorithms affect the beliefs people form in a collaborative filtering experiment

Jason W. Burton, Stefan M. Herzog, Philipp Lorenz-Spreen

This is a solid CHI paper because it turns a familiar recommender-systems question into a causal claim about belief formation, not just click or preference optimization. The main value is the controlled evidence that ranking objectives can alter consensus and accuracy, though the ecological scope is still constrained by the curated inventory and short exposure window.

Axes Lens

Rare contribution shape, typical evidence profile. The point here is not a score. It is to show what kind of claim the paper makes, and whether the evidence pattern is unusual or baseline in this 268-review set.

Contribution shape

Knowledge form: causal knowledge · typical · 31/268
Novelty type: empirical finding · typical · 68/268
Abstraction level: system · typical · 61/268
Generalization target: task class · typical · 63/268
Validation mode: controlled experiment · typical · 47/268

Evidence profile

Evidence strength: strong · typical · 158/268
Claim alignment: strong · typical · 231/268
Overclaim risk: medium · typical · 210/268

Review Summary

This paper is compelling because it makes a relatively simple but important move: instead of treating content curation as a problem of maximizing engagement or user satisfaction, it tests whether ranking choices change the beliefs people form. That is a meaningful CHI contribution because it connects algorithm design to epistemic outcomes, especially consensus and accuracy, in a preregistered controlled experiment with a sizable sample. The abstract supports a clear causal story: simple changes to sampling and ranking produced observable differences in belief outcomes, and the paper reports partial support for bridging-based ranking and intelligence-based ranking as alternatives to engagement-based ranking. The novelty is not a new model architecture or a new dataset; it is the experimental demonstration that ranking objectives can be reoriented toward collective belief effects and that those effects are measurable.

At the same time, the evidence packet also makes the limits visible. The authors explicitly note a relatively small content inventory of 72 posts and call for larger, more ecologically valid inventories. That matters because it means the strongest claim is about a controlled collaborative-filtering setting, not about broad real-world platform deployment. So my read is: strong empirical contribution, clear relevance to CHI, and a useful reframing of recommender evaluation, but with scope bounded by the experimental design and content corpus.

What Changed

Canon before

Prior CHI and HCI work on content curation and recommender systems has often emphasized engagement, user preference, or subjective satisfaction, with less direct experimental evidence on how ranking choices shape collective belief outcomes such as consensus and accuracy.

Departure from common sense

The paper’s core result is that modest algorithmic changes in sampling and ranking can measurably shift the beliefs people form, including consensus and accuracy, in a controlled collaborative-filtering setting. That is a non-obvious departure from the common assumption that ranking mainly changes what is seen or liked, not the structure of beliefs.

Actual novelty

The paper’s novelty is in experimentally comparing ranking strategies that use only naturally occurring engagement signals, while evaluating collective belief outcomes rather than only preference or perceived quality. It also positions bridging-based and intelligence-based ranking as concrete alternatives to engagement-based ranking in a preregistered two-wave experiment.
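To make the contrast between ranking objectives concrete, here is a minimal Python sketch. It assumes, beyond what this review states, that each post carries upvote and downvote counts split by voters' stated left-right leaning. The Post fields, the add-one smoothing, and both scoring rules are hypothetical illustrations of the general idea, not the authors' algorithms.

```python
from dataclasses import dataclass


@dataclass
class Post:
    # Hypothetical record: vote counts split by voters' stated leaning.
    post_id: str
    left_up: int
    left_down: int
    right_up: int
    right_down: int


def approval(up: int, down: int) -> float:
    """Share of upvotes, with add-one smoothing (an assumption) so that
    a post with no votes scores 0.5 instead of dividing by zero."""
    return (up + 1) / (up + down + 2)


def engagement_score(post: Post, viewer_leaning: str) -> float:
    """Assumed personalized engagement objective: approval among voters
    who share the viewer's leaning."""
    if viewer_leaning == "left":
        return approval(post.left_up, post.left_down)
    return approval(post.right_up, post.right_down)


def bridging_score(post: Post) -> float:
    """Assumed bridging objective: approval in the less favorable group,
    so a post ranks highly only if both sides tend to upvote it."""
    return min(
        approval(post.left_up, post.left_down),
        approval(post.right_up, post.right_down),
    )


def rank(posts, score):
    """Return post ids ordered from highest to lowest score."""
    return [p.post_id for p in sorted(posts, key=score, reverse=True)]


# Toy inventory (made-up counts, for illustration only).
posts = [
    Post("partisan", left_up=18, left_down=0, right_up=0, right_down=8),
    Post("crosscutting", left_up=6, left_down=4, right_up=6, right_down=4),
]

left_feed = rank(posts, lambda p: engagement_score(p, "left"))
bridging_feed = rank(posts, bridging_score)
```

With these toy counts, the personalized feed for a left-leaning viewer puts the partisan post first, while the bridging rule puts the cross-cutting post first. Neither rule inspects post content, which is the property the review highlights.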

Evidence

The abstract states that a preregistered, two-wave collaborative filtering experiment with N=1,500 showed that simple changes to sampling and ranking affect beliefs, with differences in belief accuracy and consensus. The paper also reports partial support for bridging-based ranking and intelligence-based ranking, and contrasts them with personalized engagement-based ranking. The evidence packet further notes a limitation: the content inventory was only 72 posts, and the authors call for larger and more ecologically valid inventories.
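The two outcome types named above can be illustrated with a small Python sketch. This review does not state how the paper operationalizes accuracy or consensus, so the definitions here are assumptions: accuracy is taken as the mean absolute error against a known true value, and consensus as the (inverse of the) spread of beliefs within a group. The function names and the numeric belief scale are hypothetical.

```python
from statistics import mean, pstdev


def belief_error(beliefs, truth):
    """Assumed accuracy measure: mean absolute distance from the true
    value. Lower means more accurate."""
    return mean(abs(b - truth) for b in beliefs)


def belief_dispersion(beliefs):
    """Assumed consensus measure: population standard deviation of
    beliefs within a group. Lower means more consensus."""
    return pstdev(beliefs)


# Made-up beliefs on a 0-100 scale for two groups that saw different feeds.
truth = 50.0
group_a = [40.0, 50.0, 60.0]
group_b = [10.0, 50.0, 90.0]

error_a, error_b = belief_error(group_a, truth), belief_error(group_b, truth)
spread_a, spread_b = belief_dispersion(group_a), belief_dispersion(group_b)
```

In this toy case both groups have the same average belief, yet group_b is both less accurate per person and more dispersed, which is why the two outcomes are worth reporting separately from a group mean.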

“First, instead of relying on content annotation and the development of a bespoke AI model, we design ranking algorithms that are agnostic to the substantive content in a post and require only naturally-occurring engagement signals — in this case, upvotes and downvotes — and basic demographic information about users — in this case, users’ stated left-right political leaning (which could alternatively be inferred from users’ online behavior [e”

actual novelty · Introduction contribution / related work · confidence 0.60

“In a preregistered, two-wave, collaborative filtering experiment (total N = 1,500), we demonstrate that simple changes to how posts are sampled and ranked can affect the beliefs people form”

departure from common sense · Abstract/Introduction framing · confidence 0.66

“The output of Study 1 is a content inventory — 72 posts across six topics that have been engaged with by liberals and conservatives — from which we can algorithmically sample and rank posts to generate feeds”

limitation · Discussion / limitations · confidence 0.84

“Although the primary purpose of Study 1 is instrumental — to create a content inventory to use for Study 2 — we also took the opportunity to preregister and test three hypotheses: H1 There is a significant association between concordance and upvoting, such that posts that are concordant with participants’ prior beliefs are more likely to be upvote”

validation scope · Study 2 hypotheses/results framing · confidence 0.52

Limits

Method limits

The study is experimentally strong but methodologically bounded by a curated inventory of 72 posts and a two-wave design. The evidence packet indicates the authors themselves frame the work as needing larger and more ecologically valid inventories, which suggests the causal claims are best read within a constrained experimental environment.

Deployment limits

The findings speak to ranking policy and algorithm design in curated content systems, but direct deployment claims are limited by the artificial inventory, the collaborative-filtering setup, and the focus on short-term belief formation rather than long-term platform behavior.

Boundary conditions

The results are bounded by the specific post inventory, the two-wave preregistered experiment, and the ranking algorithms considered. The strongest interpretation is for controlled content-curation settings where engagement signals can be used to implement alternative ranking objectives.

Position in field

This paper sits at the intersection of CHI work on recommender systems, algorithmic curation, and the social consequences of ranking. Its contribution is to move beyond engagement and preference toward experimentally measured belief outcomes, especially consensus and accuracy.

Abstract

Content-curating algorithms provide a crucial service for social media users by surfacing relevant content, but they can also bring about harms when their objectives are misaligned with user values and welfare. Yet, few controlled experiments on the potential behavioral and cognitive consequences of this alignment problem exist. In a preregistered, two-wave, collaborative filtering experiment (total N=1,500), we demonstrate that simple changes to how posts are sampled and ranked can affect the beliefs people form. Our results show observable differences in two types of outcomes within statistically constructed groups: belief accuracy and consensus. We find partial support for hypotheses that the recently proposed approaches of "bridging-based ranking" and "intelligence-based ranking" promote consensus and belief accuracy, respectively. We also find that while personalized, engagement-based ranking promotes posts that participants perceive favorably, it simultaneously leads those participants to form more polarized and less accurate beliefs than any of the other algorithms considered.