Scene2Hap: Generating Scene-Wide Haptics for VR from Scene Context with Multimodal LLMs
Scene2Hap is a strong systems paper because it reframes VR haptic authoring as a scene-understanding problem, then operationalizes that idea with an LLM-plus-physics pipeline and backs it with three studies. Its contribution is meaningful, but its realism claims should be read within the authors’ explicit modeling limits.
Axes Lens
Rare contribution shape, typical evidence profile. The point here is not a score. It is to show what kind of claim the paper makes, and whether the evidence pattern is unusual or baseline in this 268-review set.
Contribution shape
- Knowledge form: generative knowledge (typical · 35/268)
- Novelty type: system architecture (typical · 35/268)
- Abstraction level: system (typical · 61/268)
- Generalization target: design family (typical · 38/268)
- Validation mode: mixed methods (typical · 136/268)
Evidence profile
- Evidence strength: strong (typical · 158/268)
- Claim alignment: strong (typical · 231/268)
- Overclaim risk: medium (typical · 210/268)
Review Summary
Scene2Hap’s real contribution is not merely that it uses an LLM for haptics, but that it identifies a missing systems layer in VR haptic authoring: scene-wide inference about what objects are, how they are being used, what they are made of, and how vibrations should travel through surrounding structures. That is a substantive departure from prior approaches centered on manually authored effects or generation from isolated prompts and images. The architecture is coherent: multimodal scene extraction feeds prompt-chained inference, inferred semantics drive audio retrieval or generation, and inferred material/spatial relations drive runtime attenuation and propagation. This makes the paper notable as a systems integration contribution rather than a narrow model paper.
The validation is also appropriately multi-pronged. Study 1 checks whether the inference pipeline can recover semantic and physical attributes across held-out scenes; Study 2 isolates the value of propagation and attenuation for usability, materiality, and spatial awareness; and Study 3 tests the end-to-end experience in a richer VR scene with both questionnaires and qualitative interviews. That combination gives the paper stronger support than a single demo study would.
At the same time, the authors are careful enough to state meaningful limitations, and those matter for interpretation. The system currently simplifies object semantics to scene-level use and binary vibration behavior, assumes simplified geometries, propagates only to neighboring objects, depends on GPT-4o and external audio tooling, and focuses on active vibration sources rather than the broader space of touch interactions. So the paper should be read as a convincing proof of a new architectural direction for scalable VR haptics, not as a complete solution to general-purpose haptic realism. Within that scope, it is an impressive and field-shaping contribution.
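To make the "prompt-chained inference" step in this summary concrete, here is a minimal Python sketch of the general technique, in which each later prompt embeds the answers to the earlier ones. It is not the authors' implementation: the `HapticInference` record, the `run_chain` function, the prompt wording, and the `fake_llm` stand-in are all hypothetical names invented for illustration, and no real LLM API is called.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class HapticInference:
    """Per-object attributes of the kind the review says the pipeline infers."""
    name: str
    usage: str       # how the object is used in this scene (scene-level semantics)
    vibrates: bool   # binary vibration behavior
    material: str    # inferred main material


def run_chain(object_name: str, scene_summary: str,
              ask: Callable[[str], str]) -> HapticInference:
    """Chain three prompts; the later prompts embed the earlier answer."""
    usage = ask(f"Scene: {scene_summary}\nHow is '{object_name}' used here?")
    vibrates = ask(
        f"Object: {object_name}\nUse: {usage}\nDoes it actively vibrate? yes/no"
    )
    material = ask(f"Object: {object_name}\nUse: {usage}\nMain material?")
    return HapticInference(
        name=object_name,
        usage=usage,
        vibrates=vibrates.strip().lower().startswith("yes"),
        material=material.strip(),
    )


def fake_llm(prompt: str) -> str:
    """Hypothetical stand-in for a multimodal LLM call; returns canned answers."""
    if "vibrate" in prompt:
        return "yes" if "drill" in prompt.lower() else "no"
    if "material" in prompt.lower():
        return "metal"
    return "power tool held by the user"


result = run_chain("drill", "a workshop with a drill on a wooden bench", fake_llm)
print(result)
```

Passing the model call in as `ask` keeps the chain independent of any particular LLM, which is relevant given the stated dependence on GPT-4o.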
What Changed
Canon before
Designing haptic feedback for VR scenes has traditionally been a manual, time-consuming process focusing on individual objects without leveraging full scene context or realistic physical interactions. Prior machine learning methods for haptic generation often lack scene-wide semantic understanding and ignore physical relationships between objects, especially vibration propagation effects.
Departure from common sense
The paper argues against treating haptic generation as an object-isolated problem. It claims scene-wide vibrotactile design should depend on inferred object semantics, use context, material properties, and inter-object relationships, rather than only prompts or images for single objects.
Actual novelty
The main novelty is Scene2Hap as a system architecture that combines multimodal LLM-based haptic inference with physics-inspired haptic rendering, using inferred semantics and material properties to retrieve or generate vibration-driving audio and to propagate and attenuate vibrations across neighboring objects in real time.
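As an illustration of what "propagate and attenuate vibrations across neighboring objects" could look like, the sketch below applies a one-hop propagation with exponential distance decay. The review does not state the paper's attenuation formula, so the exponential model, the `DAMPING` coefficients, and the function names are all assumptions made for this example.

```python
import math
from typing import Dict, List, Tuple

# Hypothetical damping coefficients (per metre); illustrative values only,
# not taken from the paper.
DAMPING: Dict[str, float] = {"metal": 0.5, "wood": 2.0, "rubber": 8.0}


def attenuate(amplitude: float, distance_m: float, material: str) -> float:
    """Assumed model: exponential decay A * exp(-alpha * d) through a material."""
    return amplitude * math.exp(-DAMPING[material] * distance_m)


def propagate_to_neighbors(
    source: str,
    amplitude: float,
    neighbors: Dict[str, List[Tuple[str, float]]],
    materials: Dict[str, str],
) -> Dict[str, float]:
    """One-hop propagation: the source reaches its direct neighbors only."""
    out = {source: amplitude}
    for name, distance_m in neighbors.get(source, []):
        out[name] = attenuate(amplitude, distance_m, materials[name])
    return out


# A motor on a wooden table; the cup touches the table, not the motor.
scene_neighbors = {"motor": [("table", 0.5)], "table": [("cup", 0.2)]}
scene_materials = {"table": "wood", "cup": "metal"}
levels = propagate_to_neighbors("motor", 1.0, scene_neighbors, scene_materials)
print(levels)
```

The cup receives no vibration in this sketch because it is two hops from the source, which mirrors the neighbor-only propagation limit the authors acknowledge.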
Evidence
The paper grounds its contribution with three studies: Study 1 evaluates semantic and material inference quality across held-out VR scenes, Study 2 tests whether attenuated propagation improves usability, materiality, and spatial awareness, and Study 3 examines end-to-end experience in a full VR scene with questionnaires and interviews. Evidence is strong for feasibility and user-perceived benefits, but the authors explicitly bound claims through limitations on object semantics, geometry, neighboring-object propagation, dependence on GPT-4o, and focus on active vibration sources.
“Scene2Hap is an LLM-centered system that automatically designs object-level vibrotactile feedback for entire VR scenes, based on object semantics, physical properties, and spatial context. Its architecture is the first to use an LLM to extract information for haptic modeling from the VR scene, and uses this information for physics-inspired modeling for real-time user interaction”
actual novelty · 3 Scene2Hap · confidence 0.98
“Researchers have proposed generative machine learning models to design haptic signals from manually formulated text prompts or from images, for instance, with generative adversarial networks [83, 84] or LLMs [49, 77]. While these studies provide valuable insights regarding the automat”
departure from common sense · 1 Introduction · confidence 0.96
“Fourth, the system’s performance is dependent on the specific LLM used (we used GPT-4o)”
limitation · 6 Discussion · confidence 0.99
“To validate Scene2Hap, we conducted three studies investigating (1) the capability of LLM-based haptic inference, (2) the effect of physics-inspired haptic rendering on the user’s haptic perception, and (3) the overall experience in a full VR scene.”
validation scope · 5 Evaluation · confidence 0.97
Limits
Method limits
The method is limited by scene-level object semantics with binary vibration behavior, simplified geometry assumptions, propagation only to neighboring objects, dependence on external audio retrieval/generation quality, and reliance on a specific LLM configuration.
Deployment limits
The prototype runs on a Windows 10 PC with an RTX 4090 GPU, uses a client-server pipeline, requires around 9–12 seconds of LLM inference per object in the current setup, and was evaluated on moderate-size scenes and controller-mounted vibrotactile actuators rather than broader deployment settings.
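The reported 9–12 seconds per object translates into scene-level preprocessing time roughly as follows. This is back-of-the-envelope arithmetic, not a measurement: it assumes objects are processed one after another unless `parallel_workers` is raised, and that parameter is a hypothetical knob, since the review does not say whether the prototype batches or parallelizes requests.

```python
import math
from typing import Tuple


def inference_time_range(
    num_objects: int,
    per_object_s: Tuple[float, float] = (9.0, 12.0),  # range reported in the review
    parallel_workers: int = 1,                        # hypothetical; 1 means sequential
) -> Tuple[float, float]:
    """Return (low, high) estimated total LLM inference time in seconds."""
    low, high = per_object_s
    batches = math.ceil(num_objects / parallel_workers)
    return (batches * low, batches * high)


print(inference_time_range(20))                      # sequential estimate
print(inference_time_range(20, parallel_workers=4))  # assumed 4-way parallelism
```

Under the sequential assumption, a 20-object scene would need about 3 to 4 minutes of inference, which suggests the pipeline fits authoring-time preprocessing better than on-the-fly generation.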
Boundary conditions
The approach is best suited to VR scenes where vibrations are triggered by active sources such as machines or vibrating objects, where multimodal scene data are available, and where simplified propagation across connected neighboring objects is an acceptable approximation.
Position in field
This work extends prior automatic haptic generation by moving from single-object or prompt-based generation toward scene-wide, context-aware haptic authoring. Its contribution is less a new haptic actuator or isolated model than a hybrid architecture linking multimodal semantic inference to physically grounded rendering for scalable VR design.