CHI '26 · Best paper · full-paper review · confidence medium-high

Sound2Hap: Learning Audio-to-Vibrotactile Haptic Generation from Human Ratings

Yinan Li, Hasti Seifi

A strong CHI contribution: instead of assuming fixed signal-processing rules transfer well to environmental sounds, the paper shows preference variability across sound types and responds with a human-rated generative model that performs better in controlled evaluation across two datasets.

Axes Lens

Rare contribution shape, typical evidence profile. The point here is not a score. It is to show what kind of claim the paper makes, and whether the evidence pattern is unusual or baseline in this 268-review set.

Contribution shape

Knowledge form: generative knowledge · typical · 35/268
Novelty type: artifact · typical · 20/268
Abstraction level: artifact · typical · 19/268
Generalization target: task class · typical · 63/268
Validation mode: controlled experiment · typical · 47/268

Evidence profile

Evidence strength: strong · typical · 158/268
Claim alignment: strong · typical · 231/268
Overclaim risk: medium · typical · 210/268

Review Summary

This paper’s strongest contribution is the way it reframes audio-to-vibration generation as a perceptual learning problem rather than a purely signal-processing one. The authors begin from a sensible but often under-tested assumption in prior work: that existing mappings developed for music, games, or narrow sound classes can be reused for broader environmental audio. Their first study directly probes that assumption at scale, using 1,000 clips and 4,000 generated vibrations, and the reported result is not that one baseline wins, but that preferences vary substantially across clips, classes, and categories. That empirical finding matters because it justifies the move to a learned model rather than treating machine learning as novelty for its own sake. Sound2Hap is therefore compelling as an artifact contribution: it is trained from human-rated audio-vibration pairs, evaluated against multiple baselines, and positioned as a perceptually aligned generator for diverse environmental sounds. The second study strengthens the case by testing on new clips from both ESC-50 and BBC sources, which gives at least some evidence of cross-dataset robustness. The contribution is further amplified by the open dataset and tool support described in the paper. At the same time, the available evidence here is still bounded to controlled studies, selected datasets, and the specific baselines implemented by the authors. So the paper is best read as a strong demonstration that perceptually trained generation can outperform fixed mappings for this task class, not as final proof of universal audio-haptic translation.

What Changed

Canon before

Existing audio-to-vibration methods rely on signal-processing rules tuned primarily for music or gaming contexts and show limited generalizability to diverse environmental sounds.
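To make "signal-processing rule" concrete, here is a minimal sketch of one generic mapping of this kind: the short-time RMS envelope of the audio modulates a fixed-frequency sine carrier. This is an illustration only and is not claimed to be one of the four baselines the paper implements. The function name, the 200 Hz carrier, and the 10 ms window are all assumptions chosen for the example.

```python
import math

def envelope_to_vibration(audio, sample_rate, carrier_hz=200.0, window_ms=10.0):
    """Map audio samples (floats in [-1, 1]) to a vibration signal.

    Hypothetical fixed rule (not taken from the paper): the RMS envelope
    of each short audio frame sets the amplitude of a sine carrier.
    """
    window = max(1, int(sample_rate * window_ms / 1000.0))
    vibration = []
    for start in range(0, len(audio), window):
        frame = audio[start:start + window]
        # Loudness of this frame, used as the carrier amplitude.
        rms = math.sqrt(sum(s * s for s in frame) / len(frame))
        for i in range(len(frame)):
            t = (start + i) / sample_rate
            vibration.append(rms * math.sin(2.0 * math.pi * carrier_hz * t))
    return vibration

# Usage: a one-second 1 kHz tone at half amplitude, sampled at 8 kHz.
sample_rate = 8000
tone = [0.5 * math.sin(2.0 * math.pi * 1000.0 * n / sample_rate)
        for n in range(sample_rate)]
vibration = envelope_to_vibration(tone, sample_rate)
silence = envelope_to_vibration([0.0] * 160, sample_rate)
```

A rule like this ignores everything about the sound except its loudness contour. That may suit a drum hit and suit a bird call poorly, which is the kind of mismatch the paper's first study sets out to measure.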

Departure from common sense

The paper argues against the expectation that one established signal-processing mapping should work broadly for environmental sounds, showing instead that preferences vary substantially across clips, classes, and categories.

Actual novelty

The main novelty is Sound2Hap, a CNN-based autoencoder trained from human-rated audio-vibration pairs to generate perceptually aligned vibrotactile signals for diverse environmental sounds, paired with a large rated dataset and comparative evaluation against four signal-processing baselines.

Evidence

The paper grounds its claims in two in-person user studies. Study 1 collected 8,000 ratings on 4,000 audio-vibration pairs from 1,000 ESC-50 clips with 34 participants, showing substantial variation in algorithm preference across sound types. Study 2 evaluated two Sound2Hap variants against class-best signal-processing baselines with 15 participants using new ESC-50 and BBC clips to test cross-dataset generalizability. The excerpted sections also indicate that the paper discusses limitations and future work.
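The figures above imply four vibrations per clip (4,000 pairs from 1,000 clips) and, on average, two ratings per pair (8,000 ratings on 4,000 pairs). The sketch below shows one plausible way to aggregate such ratings at the category, class, and clip levels and to pick a "class-best" algorithm by mean rating. The record keys, algorithm labels, and rating values are hypothetical, and the paper's actual analysis procedure is not shown in the excerpts.

```python
from collections import defaultdict
from statistics import mean

def best_algorithm_by(ratings, level):
    """Return {group: (best_algorithm, mean_rating)} for one grouping level.

    `ratings` is a list of dicts with hypothetical keys
    'category', 'sound_class', 'clip', 'algorithm', 'rating'.
    `level` is one of 'category', 'sound_class', 'clip'.
    """
    scores = defaultdict(lambda: defaultdict(list))
    for r in ratings:
        scores[r[level]][r["algorithm"]].append(r["rating"])

    best = {}
    for group, by_algorithm in scores.items():
        means = {algo: mean(vals) for algo, vals in by_algorithm.items()}
        # Ties go to whichever algorithm was seen first.
        winner = max(means, key=means.get)
        best[group] = (winner, means[winner])
    return best

# Toy data (made-up numbers on an arbitrary scale).
ratings = [
    {"category": "animals", "sound_class": "dog", "clip": "dog_01",
     "algorithm": "A", "rating": 70},
    {"category": "animals", "sound_class": "dog", "clip": "dog_01",
     "algorithm": "B", "rating": 40},
    {"category": "animals", "sound_class": "rooster", "clip": "rooster_01",
     "algorithm": "A", "rating": 30},
    {"category": "animals", "sound_class": "rooster", "clip": "rooster_01",
     "algorithm": "B", "rating": 80},
]

class_best = best_algorithm_by(ratings, "sound_class")
category_best = best_algorithm_by(ratings, "category")
```

In this toy data the class-level winners disagree (A for dog, B for rooster), and the category-level winner (B) hides that disagreement. That is the pattern the review describes: no single fixed mapping wins once the results are broken down by sound type.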

“Using this dataset, we trained Sound2Hap, a CNN-based autoencoder, to generate perceptually meaningful vibrations from diverse sounds with low latency. In Study 2, 15 participants rated its output higher than signal-processing baselines on both audio-vibration match and Haptic Experience Index (HXI), finding it more harmonious with diverse sounds”

actual novelty · Abstract · confidence 0.97

“User preferences toward different algorithms varied across sound clips, classes, and categories. Below, we report preferences at the levels of five major categories, 50 sound classes, and 1,000 individual sound clips”

departure from common sense · 5.2 Results: Insights from User Ratings · confidence 0.96

“Below, we first reflect on the result, then discuss the utility of Sound2Hap and outline its limitations and future work.”

limitation · 9 Discussion · confidence 0.72

“Step 4 - Final user evaluation (Section 7): A second in-person study with 15 participants showed that both Sound2Hap variants outperformed the best signal-processing baselines on perceptual ratings and HXI measures”

validation scope · 3 Overview of Sound2Hap Design and Evaluation Process · confidence 0.96

Limits

Method limits

The focused sections indicate only that the paper discusses limitations and future work; they do not expose the detailed limitation text. Based on the grounded sections, Sound2Hap should be read as validated primarily through controlled in-person studies on environmental sounds, not as a universal solution for all audio-haptic settings.

Deployment limits

Evidence in the provided sections supports evaluation in controlled study settings with selected datasets and baselines, not broad real-world deployment. Practical deployment constraints are discussed by the authors, but the detailed wording is not included in the focused excerpts.

Boundary conditions

Claims are bounded to environmental sound effects, four implemented signal-processing baselines, and user-rated perceptual alignment in the reported studies. Generalization evidence is limited to new clips from ESC-50 and the BBC Sound Effects library.

Position in field

This work advances audio-to-haptic translation by shifting the field from hand-crafted signal-processing mappings toward a perceptually trained generative model for environmental sounds. Its contribution is not just a model but a combined dataset-model-evaluation pipeline that makes user preference central to audio-to-vibration generation.

Abstract

Environmental sounds like footsteps, keyboard typing, or dog barking carry rich information and emotional context, making them valuable for designing haptics in user applications. Existing audio-to-vibration methods, however, rely on signal-processing rules tuned for music or games and often fail to generalize across diverse sounds. To address this, we first investigated user perception of four existing audio-to-haptic algorithms, then created a data-driven model for environmental sounds. In Study 1, 34 participants rated vibrations generated by the four algorithms for 1,000 sounds, revealing no consistent algorithm preferences. Using this dataset, we trained Sound2Hap, a CNN-based autoencoder, to generate perceptually meaningful vibrations from diverse sounds with low latency. In Study 2, 15 participants rated its output higher than signal-processing baselines on both audio-vibration match and Haptic Experience Index (HXI), finding it more harmonious with diverse sounds. This work demonstrates a perceptually validated approach to audio-haptic translation, broadening the reach of sound-driven haptics.