A Motion Synthesis Framework Without Motion Capture, Integrating Language and Visual Priors
DOI: https://doi.org/10.71465/fair289

Keywords: motion synthesis, multimodal fusion, large language model, visual prompt, skeleton modeling, unsupervised generation, motion prior

Abstract
To address the heavy dependence of human motion synthesis on motion capture data, this paper introduces a motion-capture-free motion synthesis framework that integrates language and visual priors. The method builds a multimodal representation module that maps natural-language and image prompts into a shared latent space, and a skeleton structure constraint network that improves the physical plausibility and continuity of the generated motion. Experiments are carried out on HumanML3D, UMLS-Motion, and a self-constructed multimodal instruction set. The results show that the proposed method improves motion quality (Fréchet Gesture Distance) and coherence by 9.4% and 7.8%, respectively, compared to existing methods. Transfer tests show that the method performs well under both cross-domain and zero-shot prompts. These results confirm that combining visual and language-based multimodal prompts effectively enhances the diversity and controllability of motion generation.
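The abstract does not specify the architecture, so the following is only a minimal sketch of the two ideas it names: projecting language and image prompts into one shared latent space, and constraining the generated skeleton. It assumes a PyTorch implementation with precomputed prompt embeddings, an InfoNCE-style alignment objective, and a bone-length consistency penalty as the skeleton constraint; all module names, dimensions, and loss forms are illustrative, not the authors' implementation.

```python
# Hedged sketch: shared latent space for language/image prompts + a simple
# skeleton constraint. Encoder choices, dimensions, and loss forms are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F


class MultimodalPromptEncoder(nn.Module):
    """Maps precomputed language and image prompt embeddings into one shared latent space."""

    def __init__(self, text_dim=768, image_dim=512, latent_dim=256):
        super().__init__()
        self.text_proj = nn.Sequential(
            nn.Linear(text_dim, latent_dim), nn.GELU(), nn.Linear(latent_dim, latent_dim)
        )
        self.image_proj = nn.Sequential(
            nn.Linear(image_dim, latent_dim), nn.GELU(), nn.Linear(latent_dim, latent_dim)
        )

    def forward(self, text_emb, image_emb):
        # L2-normalise so both modalities live on the same hypersphere.
        z_text = F.normalize(self.text_proj(text_emb), dim=-1)
        z_image = F.normalize(self.image_proj(image_emb), dim=-1)
        return z_text, z_image


def alignment_loss(z_text, z_image, temperature=0.07):
    """Symmetric InfoNCE-style loss pulling paired text/image prompts together."""
    logits = z_text @ z_image.t() / temperature
    targets = torch.arange(z_text.size(0), device=z_text.device)
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))


def bone_length_penalty(joints, parents, ref_lengths):
    """One possible skeleton-structure constraint: keep bone lengths constant over a clip.

    joints: (B, T, J, 3) generated joint positions; parents: parent index per joint
    (-1 for the root); ref_lengths: (J,) reference bone lengths (root entry ignored).
    """
    penalty = joints.new_zeros(())
    for j, p in enumerate(parents):
        if p < 0:  # skip the root joint
            continue
        lengths = (joints[:, :, j] - joints[:, :, p]).norm(dim=-1)  # (B, T)
        penalty = penalty + ((lengths - ref_lengths[j]) ** 2).mean()
    return penalty


if __name__ == "__main__":
    # Toy usage with random embeddings and a 3-joint chain.
    enc = MultimodalPromptEncoder()
    z_t, z_i = enc(torch.randn(4, 768), torch.randn(4, 512))
    print("alignment loss:", alignment_loss(z_t, z_i).item())
    joints = torch.randn(4, 16, 3, 3)
    print("bone penalty:", bone_length_penalty(joints, [-1, 0, 1], torch.ones(3)).item())
```

In a full system the projection heads would sit on top of pretrained language and vision encoders, and the bone-length term would be added to the generator's training loss alongside the alignment and reconstruction objectives.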
License
Copyright (c) 2025 Emily Carter, Jason M. Bennett, Sofia Almeida, Thomas R. Walker (Author)

This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License.