Sound2Hap: Learning Audio-to-Vibrotactile Haptic Generation from Human Ratings
A strong CHI contribution: instead of assuming fixed signal-processing rules transfer well to environmental sounds, the paper shows preference variability across sound types and responds with a human-rated generative model that performs better in controlled evaluation across two datasets.
Axes Lens
Rare contribution shape, typical evidence profile. The point here is not a score. It is to show what kind of claim the paper makes, and whether the evidence pattern is unusual or baseline in this 268-review set.
Contribution shape
- Knowledge form: generative knowledge (typical · 35/268)
- Novelty type: artifact (typical · 20/268)
- Abstraction level: artifact (typical · 19/268)
- Generalization target: task class (typical · 63/268)
- Validation mode: controlled experiment (typical · 47/268)
Evidence profile
- Evidence strength: strong (typical · 158/268)
- Claim alignment: strong (typical · 231/268)
- Overclaim risk: medium (typical · 210/268)
Review Summary
This paper’s strongest contribution is the way it reframes audio-to-vibration generation as a perceptual learning problem rather than a purely signal-processing one. The authors begin from a sensible but often under-tested assumption in prior work: that existing mappings developed for music, games, or narrow sound classes can be reused for broader environmental audio. Their first study directly probes that assumption at scale, using 1,000 clips and 4,000 generated vibrations, and the reported result is not that one baseline wins, but that preferences vary substantially across clips, classes, and categories. That empirical finding matters because it justifies the move to a learned model rather than treating machine learning as novelty for its own sake. Sound2Hap is therefore compelling as an artifact contribution: it is trained from human-rated audio-vibration pairs, evaluated against multiple baselines, and positioned as a perceptually aligned generator for diverse environmental sounds. The second study strengthens the case by testing on new clips from both ESC-50 and BBC sources, which gives at least some evidence of cross-dataset robustness. The contribution is further amplified by the open dataset and tool support described in the paper. At the same time, the available evidence here is still bounded to controlled studies, selected datasets, and the specific baselines implemented by the authors. So the paper is best read as a strong demonstration that perceptually trained generation can outperform fixed mappings for this task class, not as final proof of universal audio-haptic translation.
What Changed
Canon before
Existing audio-to-vibration methods rely on signal-processing rules tuned primarily for music or gaming contexts and show limited generalizability to diverse environmental sounds.
Departure from common sense
The paper argues against the expectation that one established signal-processing mapping should work broadly for environmental sounds, showing instead that preferences vary substantially across clips, classes, and categories.
Actual novelty
The main novelty is Sound2Hap, a CNN-based autoencoder trained from human-rated audio-vibration pairs to generate perceptually aligned vibrotactile signals for diverse environmental sounds, paired with a large rated dataset and comparative evaluation against four signal-processing baselines.
Evidence
The paper grounds its claims in two in-person user studies. Study 1 collected 8,000 ratings on 4,000 audio-vibration pairs from 1,000 ESC-50 clips with 34 participants, showing substantial variation in algorithm preference across sound types. Study 2 evaluated two Sound2Hap variants against class-best signal-processing baselines with 15 participants, using new ESC-50 and BBC clips to test cross-dataset generalizability. The excerpted sections also signal that the paper discusses limitations and future work, although they do not reproduce that discussion.
“Using this dataset, we trained Sound2Hap, a CNN-based autoencoder, to generate perceptually meaningful vibrations from diverse sounds with low latency. In Study 2, 15 participants rated its output higher than signal-processing baselines on both audio-vibration match and Haptic Experience Index (HXI), finding it more harmonious with diverse sounds”
actual novelty · Abstract · confidence 0.97
“User preferences toward different algorithms varied across sound clips, classes, and categories. Below, we report preferences at the levels of five major categories, 50 sound classes, and 1,000 individual sound clips”
departure from common sense · 5.2 Results: Insights from User Ratings · confidence 0.96
“9 Discussion. Below, we first reflect on the result, then discuss the utility of Sound2Hap and outline its limitations and future work.”
limitation · 9 Discussion · confidence 0.72
“Step 4 - Final user evaluation (Section 7): A second in-person study with 15 participants showed that both Sound2Hap variants outperformed the best signal-processing baselines on perceptual ratings and HXI measures”
validation scope · 3 Overview of Sound2Hap Design and Evaluation Process · confidence 0.96
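To make the study arithmetic and the idea of a "class-best" baseline concrete, here is a minimal Python sketch. It assumes, as the figures above suggest but the excerpts do not state outright, that the 4,000 pairs come from applying each of the four baselines to each of the 1,000 clips, and that a class-best baseline is the algorithm with the highest mean rating within a sound class. The class names, algorithm names, and rating values are hypothetical placeholders, not data from the paper.

```python
from collections import defaultdict
from statistics import mean

# Study 1 bookkeeping, using the counts reported in the review.
clips, baselines, total_ratings = 1000, 4, 8000
pairs = clips * baselines                 # 4000 audio-vibration pairs
ratings_per_pair = total_ratings / pairs  # 2.0 ratings per pair on average

# Hypothetical ratings as (sound_class, algorithm, rating).
# Names and values are illustrative only, not taken from the paper.
ratings = [
    ("dog", "algo_a", 62), ("dog", "algo_a", 70),
    ("dog", "algo_b", 48), ("dog", "algo_b", 55),
    ("rain", "algo_a", 40), ("rain", "algo_a", 44),
    ("rain", "algo_b", 66), ("rain", "algo_b", 71),
]


def class_best(rows):
    """Return {sound_class: algorithm with the highest mean rating}."""
    by_pair = defaultdict(list)
    for sound_class, algorithm, rating in rows:
        by_pair[(sound_class, algorithm)].append(rating)

    # Mean rating per algorithm, grouped by sound class.
    means = defaultdict(dict)
    for (sound_class, algorithm), values in by_pair.items():
        means[sound_class][algorithm] = mean(values)

    # Pick the top-rated algorithm within each class.
    return {c: max(m, key=m.get) for c, m in means.items()}


best = class_best(ratings)
print(pairs, ratings_per_pair, best)
```

In this toy data the preferred algorithm differs between the two classes, which is the kind of pattern the review describes as motivating a learned model over a single fixed mapping. The paper's actual selection procedure may differ from this simple mean-based rule.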
Limits
Method limits
The excerpted sections indicate only that the paper discusses limitations and future work; they do not expose the detailed limitation text. Based on the grounded sections, the paper should be read as validated primarily through controlled in-person studies on environmental sounds rather than as a universal solution for all audio-haptic settings.
Deployment limits
Evidence in the provided sections supports evaluation in controlled study settings with selected datasets and baselines, not broad real-world deployment. Practical deployment constraints are discussed by the authors, but the detailed wording is not included in the focused excerpts.
Boundary conditions
Claims are bounded to environmental sound effects, four implemented signal-processing baselines, and user-rated perceptual alignment in the reported studies. Generalization evidence is limited to new clips from ESC-50 and the BBC Sound Effects library.
Position in field
This work advances audio-to-haptic translation by shifting the field from hand-crafted signal-processing mappings toward a perceptually trained generative model for environmental sounds. Its contribution is not just a model but a combined dataset-model-evaluation pipeline that makes user preference central to audio-to-vibration generation.