CHI '26 · Honorable mention · full-paper review · confidence medium-high

JustShape: Exploring Co-Speech Gestures for Multimodal LLM-Powered 3D Parametric Modeling

Runlin Duan, Yuzhao Chen, Yichen Hu, Ziyi Liu, Chenfei Zhu, Xiyun Hu, Dizhi Ma, Xinyi Wang, Karthik Ramani

JustShape is a credible CHI systems paper that reframes gesture input for parametric modeling as semantic, co-speech intent rather than command substitution. The novelty is strongest at the interaction-system level, the evaluation is solid for bounded tasks, and the paper is appropriately cautious about latency and scalability limits.

Axes Lens

Rare contribution shape, typical evidence profile. The point here is not a score. It is to show what kind of claim the paper makes, and whether the evidence pattern is unusual or baseline in this 268-review set.

Contribution shape

Knowledge form: technical knowledge · typical · 50/268
Novelty type: system architecture · typical · 35/268
Abstraction level: system · typical · 61/268
Generalization target: task class · typical · 63/268
Validation mode: mixed methods · typical · 136/268

Evidence profile

Evidence strength: moderate · typical · 105/268
Claim alignment: strong · typical · 231/268
Overclaim risk: medium · typical · 210/268

Review Summary

JustShape reads as a well-scoped CHI contribution in multimodal design tools: it does not merely bolt gestures onto an LLM interface, but argues for co-speech gesture as a meaningful channel for expressing geometric intent in parametric modeling. That framing is important because the paper explicitly contrasts its approach with prior gesture systems that mainly use gestures as proxies for mouse clicks or menu selections. The novelty is therefore not just “gesture input,” but a system architecture that parametrizes gestures and fuses them with speech through a tool-augmented multimodal LLM pipeline. The evidence packet supports this as a system-level contribution rather than a new theory or a broad field claim.

Validation is also reasonably strong for CHI: the paper combines an elicitation study with two user studies, including a within-subject comparison against text-prompt and text-plus-sketch baselines and a second study on more complex compositional tasks. That gives the paper credible evidence for usability and task performance in bounded modeling scenarios.

At the same time, the authors are explicit about limits that matter for interpretation. They state that the system does not reduce multimodal LLM inference time and that this inference latency remains the dominant source of delay. They also acknowledge that the approach is not yet sufficient for complex, industry-level parametric modeling, and that later complex steps can be harder to recover from. So the right reading is: a strong interaction-system paper with a clear design contribution and good comparative evaluation, but not a general solution to parametric modeling at scale. The claim-evidence alignment is good, and the overclaim risk is moderate rather than low because the paper’s ambition is broad, even though the discussion is careful.

What Changed

Canon before

Prior CHI work on parametric modeling and gesture input largely treated gestures as command proxies or sketch-like supplements; natural language alone remained the main interface for LLM-assisted modeling.

Departure from common sense

The paper’s core stance is that gestures should not merely substitute for mouse/menu commands in CAD-like workflows; instead, co-speech gesture is treated as a direct, expressive channel for geometric intent that complements speech and can be translated into explicit parametric attributes.

Actual novelty

JustShape’s novelty is the combination of co-speech gestures with a multimodal LLM pipeline for parametric 3D modeling, including gesture parametrization and multimodal fusion to interpret synchronized speech and gesture into model parameters. The contribution is not just a new input modality, but a system architecture that turns embodied gesture into structured geometric attributes and then into executable parametric modeling commands, backed by comparative studies and a usability evaluation.
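
To make the "gesture into structured geometric attributes" step concrete, here is a minimal sketch under stated assumptions. It is not the authors' pipeline: the paper fuses the modalities with a multimodal LLM, whereas this stand-in uses a simple rule (a deictic word that overlaps a two-handed gesture in time is replaced by the measured hand span). All names (HandFrame, Word, DEICTIC, parametrize, fuse), the 0.3 s alignment padding, and the millimetre output format are hypothetical.

```python
import math
from dataclasses import dataclass


@dataclass
class HandFrame:
    t: float       # timestamp in seconds
    left: tuple    # (x, y, z) of the left palm, in metres
    right: tuple   # (x, y, z) of the right palm, in metres


@dataclass
class Word:
    text: str
    start: float   # speech onset in seconds
    end: float     # speech offset in seconds


# Hypothetical trigger words that signal "look at my hands for the value".
DEICTIC = {"this", "that"}


def hand_span(frame):
    """Euclidean distance between the two palms in one frame."""
    return math.dist(frame.left, frame.right)


def parametrize(frames, start, end, pad=0.3):
    """Upper-median hand span over frames in [start - pad, end + pad], or None."""
    spans = sorted(hand_span(f) for f in frames if start - pad <= f.t <= end + pad)
    if not spans:
        return None
    return spans[len(spans) // 2]


def fuse(words, frames):
    """Rewrite the utterance, replacing deictic words with explicit lengths."""
    out = []
    for w in words:
        span = None
        if w.text.lower() in DEICTIC:
            span = parametrize(frames, w.start, w.end)
        # Fall back to the spoken word when no gesture overlaps it.
        out.append(f"{span * 1000:.0f} mm" if span is not None else w.text)
    return " ".join(out)
```

With the utterance "make a box this wide" and palms held 0.2 m apart while "this" is spoken, fuse returns "make a box 200 mm wide"; with no tracked frames it returns the utterance unchanged. The explicit length is the kind of parametric attribute that a downstream modeling command could consume.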

Evidence

The paper positions co-speech gesture as a new interaction modality for LLM-empowered parametric modeling, then validates it with an elicitation study, a comparative within-subject user study, and a second usability study. The evidence supports a system-level contribution: a multimodal fusion pipeline, gesture parametrization, and comparative evaluation against text-prompt and text-plus-sketch baselines on modeling tasks. The limitations are explicit: inference latency remains dominant, and the approach is not yet suited to complex industry-scale parametric modeling.

“Figure 8: Interface overview of the JustShape AR system, showing how users review and refine 3D modeling results while wearing a head-mounted display. A prompt display panel (a) presents system-generated modeling prompts, with subpanels (a-1, a-2) illustrating example text prompts and their parsed descriptions”

actual novelty · 4.1 System Workflow · confidence 0.70

“Published: 13 April 2026 · Publication History · 1 citation · 3 downloads”

departure from common sense · 1 Introduction · confidence 0.66

“These challenges are common across multimodal interaction systems and are shared by recent generative-model-based interfaces [5, 29]. We anticipate that ongoing developments in specialized multimodal LLMs and spatial-reasoning models will reduce inference latency and support more reliable and seamless embodied interaction in future system”

limitation · 7.2 Robustness and Latency of the Multimodal Interaction · confidence 0.78

“In this study, we compared our co-speech gesture interaction against two baseline conditions: text prompt and text plus sketch prompt, on a set of single-feature modeling tasks”

validation scope · 5.1 Participant and Procedures · confidence 0.72

Limits

Method limits

The evaluation is centered on controlled user studies and task-based comparisons, so it supports interaction and usability claims more than broad claims about general parametric modeling performance. The paper also notes that the system does not reduce multimodal LLM inference time, and that some gesture analysis stages remain variable.
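
One way to check which stage dominates delay, and how variable each stage is, is a per-stage timing harness. The sketch below is an illustration only and reports nothing from the paper: the stage names a caller would pass in, the profile and dominant helpers, and the injectable clock are all assumptions.

```python
import statistics
import time


def profile(stages, runs=5, clock=time.perf_counter):
    """Time each stage `runs` times.

    `stages` maps a stage name to a zero-argument callable.
    Returns {name: (mean_seconds, stdev_seconds)}.
    """
    results = {}
    for name, fn in stages.items():
        samples = []
        for _ in range(runs):
            t0 = clock()
            fn()
            samples.append(clock() - t0)
        # stdev needs at least two samples; report 0.0 for a single run.
        spread = statistics.stdev(samples) if len(samples) > 1 else 0.0
        results[name] = (statistics.mean(samples), spread)
    return results


def dominant(results):
    """Name of the stage with the largest mean latency."""
    return max(results, key=lambda name: results[name][0])
```

Passing the real gesture-analysis and LLM-inference calls as stages would show whether the mean (dominant delay) or the spread (stage variability) is the bigger concern in a given setup. The clock parameter exists so the harness can be exercised deterministically with a simulated clock.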

Deployment limits

Practical deployment is constrained by LLM latency, and the authors indicate that reducing inference delay is outside the scope of the system architecture. The approach is also limited in handling complex or industry-level parametric models, and later-stage complex steps can be harder to recover from.

Boundary conditions

The contribution is best understood for novice or early-stage designers working on bounded parametric tasks where speech plus gesture can disambiguate intent. It is less established for expert CAD workflows, high-complexity industrial modeling, or settings where latency and error recovery are critical.

Position in field

This work sits at the intersection of multimodal interaction, gesture-based input, and LLM-assisted design tools. Its main field contribution is to move beyond gesture-as-command toward gesture-as-semantic design input for parametric modeling, backed by comparative user studies and a usability evaluation.

Abstract

Parametric modeling is a prevailing 3D modeling approach in design, architecture, and engineering. The emergence of multimodal large language models (LLMs) brings a new opportunity to lower the entry barriers to this powerful tool. However, describing 3D geometries through natural language can be fuzzy and challenging. We introduce co-speech gesture, a natural and expressive interaction modality to complement text prompts for LLM-empowered generative parametric modeling. We first conducted an elicitation study to explore and categorize co-speech gesture expressions. Based on the findings, we designed a multimodal fusion pipeline that parametrizes gestures and synthesizes them with speech. This approach reduces language ambiguity by translating implicit user intentions into explicit parametric attributes, thus lifting the model generation performance. We conducted a two-session user study testing and comparing it with traditional language and sketch inputs. This work streamlines the parametric modeling workflow and explores novel multimodal interaction paradigms for LLM-empowered design and creation.