CHI '26 · Best paper · full-paper review · confidence high

Scene2Hap: Generating Scene-Wide Haptics for VR from Scene Context with Multimodal LLMs

Arata Jingu, Easa AliAbbasi, Sara Safaee, Paul Strohmeier, Jürgen Steimle

Scene2Hap is a strong systems paper because it reframes VR haptic authoring as a scene-understanding problem, then operationalizes that idea with an LLM-plus-physics pipeline and backs it with three studies. Its contribution is meaningful, but its realism claims should be read within the authors’ explicit modeling limits.

Axes Lens

Rare contribution shape, typical evidence profile. The point here is not a score. It is to show what kind of claim the paper makes, and whether the evidence pattern is unusual or baseline in this 268-review set.

Contribution shape

Knowledge form: generative knowledge (typical · 35/268)
Novelty type: system architecture (typical · 35/268)
Abstraction level: system (typical · 61/268)
Generalization target: design family (typical · 38/268)
Validation mode: mixed methods (typical · 136/268)

Evidence profile

Evidence strength: strong (typical · 158/268)
Claim alignment: strong (typical · 231/268)
Overclaim risk: medium (typical · 210/268)

Review Summary

Scene2Hap’s real contribution is not merely that it uses an LLM for haptics, but that it identifies a missing systems layer in VR haptic authoring: scene-wide inference about what objects are, how they are being used, what they are made of, and how vibrations should travel through surrounding structures. That is a substantive departure from prior approaches centered on manually authored effects or generation from isolated prompts and images. The architecture is coherent: multimodal scene extraction feeds prompt-chained inference, inferred semantics drive audio retrieval or generation, and inferred material/spatial relations drive runtime attenuation and propagation. This makes the paper notable as a systems integration contribution rather than a narrow model paper.

The validation is also appropriately multi-pronged. Study 1 checks whether the inference pipeline can recover semantic and physical attributes across held-out scenes; Study 2 isolates the value of propagation and attenuation for usability, materiality, and spatial awareness; and Study 3 tests the end-to-end experience in a richer VR scene with both questionnaires and qualitative interviews. That combination gives the paper stronger support than a single demo study would.

At the same time, the authors are careful enough to state meaningful limitations, and those matter for interpretation. The system currently simplifies object semantics to scene-level use and binary vibration behavior, assumes simplified geometries, propagates only to neighboring objects, depends on GPT-4o and external audio tooling, and focuses on active vibration sources rather than the broader space of touch interactions. So the paper should be read as a convincing proof of a new architectural direction for scalable VR haptics, not as a complete solution to general-purpose haptic realism. Within that scope, it is an impressive and field-shaping contribution.

What Changed

Canon before

Designing haptic feedback for VR scenes has traditionally been a manual, time-consuming process focusing on individual objects without leveraging full scene context or realistic physical interactions. Prior machine learning methods for haptic generation often lack scene-wide semantic understanding and ignore physical relationships between objects, especially vibration propagation effects.

Departure from common sense

The paper argues against treating haptic generation as an object-isolated problem. It claims scene-wide vibrotactile design should depend on inferred object semantics, use context, material properties, and inter-object relationships, rather than only prompts or images for single objects.

Actual novelty

The main novelty is Scene2Hap as a system architecture that combines multimodal LLM-based haptic inference with physics-inspired haptic rendering, using inferred semantics and material properties to retrieve or generate vibration-driving audio and to propagate and attenuate vibrations across neighboring objects in real time.
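The propagation-and-attenuation idea can be illustrated with a minimal sketch. The review does not state the paper's actual attenuation model, so the exponential falloff, the `damping` coefficient, and the names `SceneObject` and `propagated_gains` are all assumptions chosen for illustration. Only the restriction to directly neighboring objects reflects the limits described in this review.

```python
import math
from dataclasses import dataclass, field


@dataclass
class SceneObject:
    # All fields are hypothetical; they stand in for attributes an LLM might infer.
    name: str
    position: tuple[float, float, float]  # scene coordinates, assumed to be metres
    damping: float = 1.0                  # assumed per-metre attenuation of the material
    neighbors: list[str] = field(default_factory=list)  # directly connected objects


def propagated_gains(source: str, objects: dict[str, SceneObject]) -> dict[str, float]:
    """Return a vibration gain for the source and each of its direct neighbors.

    Assumption: gain decays exponentially with distance, scaled by the
    receiving object's damping. Objects that are not direct neighbors of
    the source receive no vibration and are left out of the result.
    """
    src = objects[source]
    gains = {source: 1.0}  # the source itself vibrates at full strength
    for name in src.neighbors:
        neighbor = objects[name]
        distance = math.dist(src.position, neighbor.position)
        gains[name] = math.exp(-neighbor.damping * distance)
    return gains


if __name__ == "__main__":
    scene = {
        "engine": SceneObject("engine", (0.0, 0.0, 0.0), neighbors=["table"]),
        "table": SceneObject("table", (1.0, 0.0, 0.0), damping=0.5),
        "shelf": SceneObject("shelf", (5.0, 0.0, 0.0), damping=0.5),
    }
    # The shelf is not a neighbor of the engine, so it does not appear.
    print(propagated_gains("engine", scene))
```

At runtime, a gain like this would scale the amplitude of the source's vibration signal when the user touches a neighboring object; a stiffer, less damped material would simply carry a smaller `damping` value.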

Evidence

The paper grounds its contribution with three studies: Study 1 evaluates semantic and material inference quality across held-out VR scenes, Study 2 tests whether attenuated propagation improves usability, materiality, and spatial awareness, and Study 3 examines end-to-end experience in a full VR scene with questionnaires and interviews. Evidence is strong for feasibility and user-perceived benefits, but the authors explicitly bound claims through limitations on object semantics, geometry, neighboring-object propagation, dependence on GPT-4o, and focus on active vibration sources.

“Scene2Hap is an LLM-centered system that automatically designs object-level vibrotactile feedback for entire VR scenes, based on object semantics, physical properties, and spatial context. Its architecture is the first to use an LLM to extract information for haptic modeling from the VR scene, and uses this information for physics-inspired modeling for real-time user interaction”

actual novelty · 3 Scene2Hap · confidence 0.98

“Researchers have proposed generative machine learning models to design haptic signals from manually formulated text prompts or from images, for instance, with generative adversarial networks [83, 84] or LLMs [49, 77]. While these studies provide valuable insights regarding the automat”

departure from common sense · 1 Introduction · confidence 0.96

“Fourth, the system’s performance is dependent on the specific LLM used (we used GPT-4o)”

limitation · 6 Discussion · confidence 0.99

“To validate Scene2Hap, we conducted three studies investigating (1) the capability of LLM-based haptic inference, (2) the effect of physics-inspired haptic rendering on the user’s haptic perception, and (3) the overall experience in a full VR scene.”

validation scope · 5 Evaluation · confidence 0.97

Limits

Method limits

The method is limited by scene-level object semantics with binary vibration behavior, simplified geometry assumptions, propagation only to neighboring objects, dependence on external audio retrieval/generation quality, and reliance on a specific LLM configuration.

Deployment limits

The prototype runs on a Windows 10 PC with an RTX 4090 GPU, uses a client-server pipeline, requires around 9–12 seconds of LLM inference per object in the current setup, and was evaluated on moderate-size scenes and controller-mounted vibrotactile actuators rather than broader deployment settings.
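To make the per-object latency concrete, the following back-of-the-envelope sketch scales the reported 9–12 seconds to a whole scene. It assumes objects are processed strictly one after another with no batching or parallel requests, which this review does not confirm; the function name `inference_time_range` and the 40-object scene are hypothetical.

```python
def inference_time_range(num_objects: int,
                         per_object_s: tuple[float, float] = (9.0, 12.0)) -> tuple[float, float]:
    """Estimate total LLM inference time in seconds for a scene.

    Assumption: sequential processing, so total time grows linearly
    with the number of objects.
    """
    low, high = per_object_s
    return num_objects * low, num_objects * high


if __name__ == "__main__":
    low, high = inference_time_range(40)  # hypothetical 40-object scene
    print(f"{low / 60:.0f} to {high / 60:.0f} minutes")  # prints "6 to 8 minutes"
```

Under that assumption, authoring cost is an offline preprocessing step measured in minutes for moderate-size scenes, which is consistent with the review's note that larger scenes remain untested.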

Boundary conditions

The approach is best suited to VR scenes where vibrations are triggered by active sources such as machines or vibrating objects, where multimodal scene data are available, and where simplified propagation across connected neighboring objects is an acceptable approximation.

Position in field

This work extends prior automatic haptic generation by moving from single-object or prompt-based generation toward scene-wide, context-aware haptic authoring. Its contribution is less a new haptic actuator or isolated model than a hybrid architecture linking multimodal semantic inference to physically grounded rendering for scalable VR design.

Abstract

Haptic feedback contributes to immersive virtual reality (VR) experiences. However, designing such feedback at scale for all objects within a VR scene remains time-consuming. We present Scene2Hap, an LLM-centered system that automatically designs object-level vibrotactile feedback for entire VR scenes based on the objects' semantic attributes and physical context. Scene2Hap employs a multimodal large language model to estimate each object’s semantics and physical context, including its material properties and vibration behavior, from multimodal information in the VR scene. These estimated attributes are then used to generate or retrieve audio signals, subsequently converted into plausible vibrotactile signals. For more realistic spatial haptic rendering, Scene2Hap estimates vibration propagation and attenuation from vibration sources to neighboring objects, considering the estimated material properties and spatial relationships of virtual objects in the scene. Three user studies confirm that Scene2Hap successfully estimates the vibration-related semantics and physical context of VR scenes and produces realistic vibrotactile signals.