CHI '26 · Honorable mention · full-paper review · confidence medium-high

DancingBox: A Lightweight MoCap System for Character Animation from Physical Proxies

Haocheng Yuan, Adrien Bousseau, Hao Pan, Lei Zhong, Changjian Li

DancingBox is a credible CHI-style systems contribution: it reframes motion capture around everyday objects and a single webcam, then uses bounding-box conditioning to recover plausible animation. The novelty is in the proxy representation and pipeline, while the evaluation is solid but still bounded by monocular capture and non-real-time constraints.

Axes Lens

Rare contribution shape, typical evidence profile. The point here is not a score. It is to show what kind of claim the paper makes, and whether the evidence pattern is unusual or baseline in this 268-review set.

Contribution shape

Knowledge form: technical knowledge · typical · 50/268
Novelty type: system architecture · typical · 35/268
Abstraction level: system · typical · 61/268
Generalization target: task class · typical · 63/268
Validation mode: mixed methods · typical · 136/268

Evidence profile

Evidence strength: moderate · typical · 105/268
Claim alignment: medium · typical · 32/268
Overclaim risk: medium · typical · 210/268

Review Summary

DancingBox is strongest as a systems-and-interaction contribution rather than as a pure motion-capture accuracy paper. The paper's central move is conceptually simple but practically meaningful: instead of asking novices to perform precise human motions or learn professional animation software, it lets them manipulate ordinary objects and treats those objects as proxies for character motion. That is a clear departure from common-sense expectations about what motion capture requires, and it aligns well with CHI's interest in lowering barriers to creative practice.

The technical novelty appears to be the intermediate bounding-box representation and the associated permutation-invariant encoding and guidance used to condition a generative motion model. This is a plausible and coherent design choice for handling variable proxy structures, and it is more than a superficial UI wrapper around an existing model.

The validation is reasonably aligned with the claims: the user study covers both replication and creative tasks, reports 9 participants, and gives a concrete failure rate of 2 out of 27 replication trials, while also noting that users found the system easy to use and the motions realistic or natural. At the same time, the paper does not present the system as a replacement for high-fidelity capture. The stated limitations (monocular occlusion sensitivity, reduced accuracy relative to multi-camera systems, and non-real-time processing) are important because they define the boundary between an accessible creative tool and a production-grade capture pipeline.

Overall, the work looks like a solid honorable-mention-level contribution: novel enough in interaction framing and representation, validated enough to support the main claims, but still constrained by the usual fidelity and deployment limits of monocular, learned reconstruction.

What Changed

Canon before

Prior CHI and graphics systems for character animation and motion capture generally rely on expert-operated capture rigs, precise human motion tracking, or manual animation tools that are not novice-friendly.

Departure from common sense

The paper departs from the usual assumption that useful motion capture must track precise human motion with specialized hardware. It instead treats everyday manipulated objects as proxies, uses a single webcam, and reconstructs plausible character animation from coarse motion cues.

Actual novelty

The core novelty is a proxy-to-animation pipeline that uses 3D bounding boxes as an intermediate representation, together with a permutation-invariant box motion encoder and box-joint guidance to condition a generative motion model. This is presented as a way to bridge variable proxy structures to human skeletal motion.
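To make the term "permutation-invariant" concrete, here is a minimal sketch of order-independent pooling over a variable number of boxes. It is not the paper's encoder, which is a learned model; the box layout `(cx, cy, cz, w, h, d)`, the hand-written `box_features` map, and the function names are all assumptions made for illustration.

```python
import math
from typing import List, Sequence

# Hypothetical box layout: centre (cx, cy, cz) and size (w, h, d).
Box = Sequence[float]


def box_features(box: Box) -> List[float]:
    """Per-box feature map, applied identically to every box.

    A learned encoder would use a small network here; this hand-written
    map is only a stand-in.
    """
    cx, cy, cz, w, h, d = box
    return [cx, cy, cz, w * h * d, max(w, h, d)]


def encode_boxes(boxes: Sequence[Box]) -> List[float]:
    """Pool per-box features with order-independent reductions.

    Mean and max do not depend on the order of their inputs, so the
    output is the same for any ordering of the boxes, and its length
    is fixed regardless of how many boxes the proxy has.
    """
    if not boxes:
        raise ValueError("need at least one box")
    feats = [box_features(b) for b in boxes]
    dims = range(len(feats[0]))
    # math.fsum returns a correctly rounded sum, so the result does not
    # vary with summation order the way a naive float sum can.
    mean_pool = [math.fsum(f[i] for f in feats) / len(feats) for i in dims]
    max_pool = [max(f[i] for f in feats) for i in dims]
    return mean_pool + max_pool


proxy = [
    (0.0, 1.0, 0.0, 0.2, 0.3, 0.2),
    (0.1, 0.5, 0.0, 0.4, 0.6, 0.2),
    (0.0, 0.1, 0.3, 0.1, 0.1, 0.5),
]
same = encode_boxes(proxy) == encode_boxes(proxy[::-1])
```

Because the pooled vector has the same length for two boxes or ten, a downstream model can be conditioned on proxies with different numbers of parts, which is the property the review credits for handling variable proxy structures.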

Evidence

The paper combines a system contribution with empirical validation. The abstract and method sections support the claim that DancingBox reimagines motion capture as digital puppetry from everyday objects and introduces bounding-box-based conditioning. The user study evidence indicates 9 participants completed replication and creative tasks, with 2 failures out of 27 replication trials and reported ease of use and realism. The paper also acknowledges monocular occlusion sensitivity and non-real-time processing as limitations.

“We thus introduce bounding boxes as an intermediate representation to bridge the input 3D points with realistic output motion”

actual novelty · Method (3 Method; 3.3 Box-guided Motion Generation) · confidence 0.72

“Creating compelling 3D character animations typically requires either expert use of professional s”

departure from common sense · Abstract / Introduction · confidence 0.80

“1), and demonstrated an application by extending our system to support keyframe-based motion capture (Sec. 5.2). Please refer to the supplemental video for better dynamic motion visualizati”

limitation · Results and Discussions (5) and Ablation/Discussions (5.1) and Future Work (6) · confidence 0.78

“Across 27 trials (9 participants × 3 replication tasks), we observed only 2 failures (7.4%), indicating that in over 92% of cases the system preserved or improved similarity to the target motion.”

validation scope · User Experience Study (4.1-4.2) · confidence 0.86
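The arithmetic behind the quoted failure rate can be checked directly, and a confidence interval helps show how much a 27-trial sample can support. The sketch below recomputes 2/27 and adds a 95% Wilson score interval. The interval is this review's own illustration, not something the paper reports, and it treats the 27 trials as independent even though each participant contributed three.

```python
import math
from typing import Tuple


def wilson_interval(failures: int, trials: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a binomial proportion (z = 1.96 gives about 95%)."""
    p = failures / trials
    denom = 1 + z * z / trials
    center = (p + z * z / (2 * trials)) / denom
    half = z * math.sqrt(p * (1 - p) / trials + z * z / (4 * trials * trials)) / denom
    return center - half, center + half


participants, tasks, failures = 9, 3, 2
trials = participants * tasks          # 27 replication trials
rate = failures / trials               # observed failure rate
low, high = wilson_interval(failures, trials)

print(f"failure rate: {rate:.1%}")
print(f"95% Wilson interval: {low:.1%} to {high:.1%}")
```

Under these assumptions the interval runs from roughly 2% to 23%, so the observed 7.4% is consistent with the paper's claim but does not pin the true failure rate down tightly.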

Limits

Method limits

The monocular capture pipeline is sensitive to occlusion, and the system does not claim precise motion tracking. The generative reconstruction also depends on learned priors and synthesized proxy-animation pairs, which may limit fidelity outside the training distribution.

Deployment limits

Use is constrained by single-camera visibility, proxy manipulability, and the need for sufficient scene clearance. The paper also indicates non-real-time processing, which limits interactive deployment where immediate feedback is required.

Boundary conditions

The approach is best suited to novice-friendly character animation tasks where approximate proxy motion is acceptable and where users can manipulate visible everyday objects in front of a webcam. Performance may degrade with occlusion, unusual proxy configurations, or motions far from the learned motion prior.

Position in field

DancingBox sits at the intersection of novice-facing animation tools, vision-based motion capture, and generative motion synthesis. Its contribution is less about precise capture accuracy and more about lowering the interaction barrier by translating coarse proxy motion into plausible character animation.

Abstract

Creating compelling 3D character animations typically requires either expert use of professional software or expensive motion capture systems operated by skilled actors. We present DancingBox, a lightweight, vision-based system that makes motion capture accessible to novices by reimagining the process as digital puppetry. Instead of tracking precise human motions, DancingBox captures the approximate movements of everyday objects manipulated by users with a single webcam. These coarse proxy motions are then refined into realistic character animations by conditioning a generative motion model on bounding-box representations, enriched with human motion priors learned from large-scale datasets. To overcome the lack of paired proxy–animation data, we synthesize training pairs by converting existing motion capture sequences into proxy representations. A user study demonstrates that DancingBox enables intuitive and creative character animation using diverse proxies, from plush toys to bananas, lowering the barrier to entry for novice animators.
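The abstract's last technical step, synthesizing training pairs by converting motion capture sequences into proxy representations, can be pictured with a small sketch. The paper's actual conversion is not described in this review. The version below assumes axis-aligned boxes, a hypothetical grouping of joint indices into body parts (`BODY_PARTS`), and an arbitrary padding value.

```python
from typing import Dict, List, Sequence, Tuple

Point = Tuple[float, float, float]
Box = Tuple[Point, Point]  # (min corner, max corner), axis-aligned

# Hypothetical grouping of joint indices into body parts.
BODY_PARTS: Dict[str, List[int]] = {
    "torso": [0, 1, 2],
    "left_arm": [3, 4],
    "right_arm": [5, 6],
}


def part_box(joints: Sequence[Point], indices: List[int], padding: float = 0.05) -> Box:
    """Smallest axis-aligned box around the selected joints, grown by padding."""
    pts = [joints[i] for i in indices]
    lo = tuple(min(p[a] for p in pts) - padding for a in range(3))
    hi = tuple(max(p[a] for p in pts) + padding for a in range(3))
    return lo, hi


def frame_to_boxes(joints: Sequence[Point]) -> Dict[str, Box]:
    """Convert one frame of joint positions into one box per body part."""
    return {name: part_box(joints, idx) for name, idx in BODY_PARTS.items()}


def sequence_to_pairs(sequence: Sequence[Sequence[Point]]):
    """Pair each frame's boxes (model input) with its joints (training target)."""
    return [(frame_to_boxes(frame), frame) for frame in sequence]


# One toy frame with seven joints (positions in metres, made up for illustration).
frame = [
    (0.0, 1.0, 0.0), (0.0, 1.25, 0.0), (0.0, 1.5, 0.0),   # torso
    (-0.25, 1.5, 0.0), (-0.5, 1.25, 0.0),                 # left arm
    (0.25, 1.5, 0.0), (0.5, 1.25, 0.0),                   # right arm
]
pairs = sequence_to_pairs([frame])
torso_min, torso_max = pairs[0][0]["torso"]
```

The point of the sketch is the direction of the conversion: the detailed skeleton is reduced to coarse boxes, so existing motion capture data supplies both the input and the target without anyone having to record paired proxy sessions.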