
CHI '26 · Honorable mention · full-paper review · confidence medium-high

MoSound: An Interactive Tool for Generative Sound Design in Motion Graphics

Jialin Huang, Prem Seetharaman, Timothy Richard Langlois, Li-Yi Wei, Rubaiat Habib Kazi, Yotam Gingold

MoSound is a credible CHI-style systems paper: the novelty is in integrating detection, mapping, and generative synthesis into a usable workflow for motion-graphics sound design. The evidence is strongest for the system contribution and mixed-initiative framing, while the main caveat is that automatic event placement and longer-duration coherence remain limited.


Axes Lens

Rare contribution shape, typical evidence profile. The point here is not a score. It is to show what kind of claim the paper makes, and whether the evidence pattern is unusual or baseline in this 268-review set.

Contribution shape

Knowledge form: technical knowledge · typical · 50/268
Novelty type: tool · typical · 14/268
Abstraction level: system · typical · 61/268
Generalization target: task class · typical · 63/268
Validation mode: mixed methods · typical · 136/268

Evidence profile

Evidence strength: moderate · typical · 105/268
Claim alignment: medium · typical · 32/268
Overclaim risk: medium · typical · 210/268

Review Summary

MoSound reads as a solid, task-focused interactive system contribution rather than a fundamentally new audio model. The paper’s main value is in assembling several technically plausible components (visual event detection, motion tracking, motion-to-sound mapping, and generative sound stylization) into a coherent workflow for motion graphics, where timing and sonic character are tightly coupled to short visual events. That is a meaningful CHI contribution because it turns a messy creative practice into something partially structured and inspectable, while still leaving room for user control.

The extracted evidence supports that framing: the paper explicitly describes events identified automatically by a VLM, user-adjustable timing, mapping of motion characteristics to volume and stereo panning, and synthesis from text plus optional mapped properties. The validation is also appropriately mixed: technical observations about duration and event placement are paired with a user study involving experts and novices.

At the same time, the limitations are important and should temper any broad reading. The paper itself says automatic event placement is the weakest aspect, and it acknowledges that high-quality temporal coherence is most reliable only for clips up to roughly 30 seconds, with drift or reduced consistency beyond that. It also appears to leave some professional production needs under-addressed, especially fine-grained layering, mixing, and continuous textures. So the right expert read is: strong system integration and a useful mixed-initiative workflow for a specific creative task class, with credible but bounded validation and clear constraints on scale and polish.

What Changed

Canon before

Prior CHI work on sound design tools typically separates event spotting, timing, and sound generation, or relies on manual authoring and library-based workflows. This paper positions MoSound against that baseline by integrating detection, mapping, and generative synthesis in one interactive workflow.

Departure from common sense

The paper’s core move is to treat motion-graphics sound design as a pipeline that can be partially automated from video itself: automatically identify events, suggest sound effects, let users map motion to sonic parameters, and synthesize audio. That departs from the common manual workflow of keyframing and library selection.

Actual novelty

MoSound’s novelty is presented as a human-in-the-loop interactive workflow that combines visual event detection, motion-to-sound mapping, sound effect suggestion, and user control in one interface. The paper also claims a motion-to-sound mapping that connects visual events with generative sound effects, plus evidence that the mixed-initiative design helps both novices and experts.
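To make the motion-to-sound mapping idea concrete, here is a minimal Python sketch of one way a tracked trajectory could drive the two sonic parameters this review names, volume and stereo panning. It is not MoSound's implementation. The function name `motion_to_gain_and_pan`, the linear speed-to-gain rule, the `min_gain`/`max_gain` defaults, and the use of constant-power panning are all assumptions chosen for illustration.

```python
import math


def motion_to_gain_and_pan(xs, frame_width, fps, min_gain=0.2, max_gain=1.0):
    """Map tracked horizontal positions to per-frame (left, right) gains.

    Hypothetical mapping (not taken from the paper):
      - object speed        -> overall volume (linear, normalized to peak speed)
      - horizontal position -> stereo pan (constant-power pan law)
    """
    if len(xs) < 2:
        raise ValueError("need at least two tracked positions")

    # Per-frame speed in pixels per second; the first frame reuses
    # the first frame-to-frame difference so every frame has a value.
    speeds = [abs(xs[1] - xs[0]) * fps]
    speeds += [abs(b - a) * fps for a, b in zip(xs, xs[1:])]
    peak = max(speeds) or 1.0  # avoid dividing by zero for a static object

    gains = []
    for x, speed in zip(xs, speeds):
        # Faster motion is louder, within [min_gain, max_gain].
        gain = min_gain + (max_gain - min_gain) * (speed / peak)
        # Position to pan in [-1, 1]: -1 is hard left, +1 is hard right.
        pan = max(-1.0, min(1.0, 2.0 * (x / frame_width) - 1.0))
        # Constant-power pan: left^2 + right^2 == gain^2 at every position.
        theta = (pan + 1.0) * math.pi / 4.0
        gains.append((gain * math.cos(theta), gain * math.sin(theta)))
    return gains


# Example: an object sweeping left to right across a 1920-pixel-wide frame.
sweep = motion_to_gain_and_pan([0, 960, 1920], frame_width=1920, fps=30)
```

In this sketch the sweep starts fully in the left channel, is balanced at the frame centre, and ends fully in the right channel. A real tool would presumably let the user edit or override such curves, in line with the user control the paper emphasizes.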

Evidence

The evidence supports a system-level contribution: MoSound integrates VLM-based event identification, motion tracking, parameter mapping, and generative synthesis into one workflow. Validation is mixed-methods, combining technical observations about duration/consistency and event-placement accuracy with a user study of experts and novices. The strongest grounded limitation is that automatic event placement is the weakest aspect, and the system is most reliable on short clips.

“The contributions of our work are as follows: • MoSound, a human-in-the-loop interactive workflow that facilitates sound designs for motion graphics videos by combining visual event detection, motion-to-sound mapping, sound effect suggestion, and user contro”

actual novelty · Contributions list in Introduction · confidence 0.65

“7 Discussion and Implications for Design MoSound shows how automatic event detection, motion curve extraction, and generative audio synthesis can be combined into a single workflow that lets users control sound timing and dynamics directly from the v”

departure from common sense · Introduction + System overview (MoSound pipeline description) · confidence 0.50

“The weakest aspect of MoSound is the accuracy of automatic event placement on the timeline (Q4)”

limitation · User Study (Q4) · confidence 0.75

“The sound-synthesis model does not impose a strict architectural limit on duration; however, empirical evidence from public examples and our own observations indicates that high-quality, temporally coherent generation is most reliable for clips up to roughly 30 second”

validation scope · Technical Evaluation (Running Time/Scalability) · confidence 0.70

Limits

Method limits

Validation is not a broad benchmark; it is tied to example motion-graphics videos, a user study, and technical observations about event placement and temporal coherence. The paper itself notes that automatic event placement is the weakest aspect, so claims about robustness should be read cautiously.

Deployment limits

The system is most effective for short motion-graphics sequences, with empirical reliability described as strongest up to roughly 30 seconds. Beyond that, synthesized audio may drift or lose consistency, and the workflow lacks some fine-grained professional controls for layering, mixing, and continuous textures.

Boundary conditions

Best suited to short, abstract motion-graphics clips where event detection and motion-to-sound mapping are useful. Performance and coherence are bounded by clip length, and the interface appears less complete for advanced production needs that require detailed mixing or texture control.

Position in field

MoSound sits at the intersection of creative AI, sound design tools, and motion-graphics authoring. Its contribution is less a new audio model than an integrated interactive system that operationalizes generative sound design for a specific creative task class.

Abstract