A Motion Synthesis Framework Without Motion Capture, Integrating Language and Visual Priors

Authors

  • Emily Carter, Department of Computer Science and Engineering, The Hong Kong University of Science and Technology (HKUST), Clear Water Bay, Kowloon, Hong Kong SAR
  • Jason M. Bennett, Department of Computer Science and Engineering, The Hong Kong University of Science and Technology (HKUST), Clear Water Bay, Kowloon, Hong Kong SAR
  • Sofia Almeida, Division of Life Science, The Hong Kong University of Science and Technology (HKUST), Clear Water Bay, Kowloon, Hong Kong SAR
  • Thomas R. Walker, Department of Computer Science and Engineering, The Hong Kong University of Science and Technology (HKUST), Clear Water Bay, Kowloon, Hong Kong SAR

DOI:

https://doi.org/10.71465/fair289

Keywords:

motion synthesis, multimodal fusion, large language model, visual prompt, skeleton modeling, unsupervised generation, motion prior

Abstract

To reduce the heavy dependence of human motion synthesis on motion capture data, this paper introduces a motion-capture-free synthesis framework that integrates language and visual priors. The method builds a multimodal representation module that maps natural language and image prompts into a shared latent space, and designs a skeleton structure constraint network to improve the physical plausibility and temporal continuity of the generated motion. Experiments are carried out on HumanML3D, UMLS-Motion, and a self-constructed multimodal instruction set. Compared with existing methods, the proposed approach improves motion quality (Fréchet Gesture Distance) by 9.4% and coherence by 7.8%. Transfer tests further show that the method performs well under both cross-domain and zero-shot prompts. These results confirm that combining visual and language-based multimodal prompts can effectively enhance the diversity and controllability of motion generation.
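To illustrate the two ideas summarized above, the sketch below shows (in PyTorch) one plausible way to project text and image prompt embeddings into a shared latent space and to penalize skeleton-inconsistent motion. This is a minimal, assumed reconstruction rather than the authors' implementation: the embedding dimensions, the contrastive alignment loss, and the toy chain skeleton are illustrative choices, and the paper's skeleton structure constraint network is reduced here to a simple bone-length consistency term.

```python
# Minimal sketch (not the authors' released code), assuming:
#   - pre-computed text embeddings of dim 768 and image embeddings of dim 512
#   - a CLIP-style contrastive loss to align the two modalities
#   - a bone-length consistency penalty standing in for the skeleton constraint network
import torch
import torch.nn as nn
import torch.nn.functional as F


class SharedLatentProjector(nn.Module):
    """Project language and visual prompt embeddings into one shared latent space."""

    def __init__(self, text_dim=768, image_dim=512, latent_dim=256):
        super().__init__()
        self.text_proj = nn.Sequential(nn.Linear(text_dim, latent_dim), nn.GELU(),
                                       nn.Linear(latent_dim, latent_dim))
        self.image_proj = nn.Sequential(nn.Linear(image_dim, latent_dim), nn.GELU(),
                                        nn.Linear(latent_dim, latent_dim))

    def forward(self, text_emb, image_emb):
        # L2-normalise so both modalities lie on the same unit hypersphere.
        z_text = F.normalize(self.text_proj(text_emb), dim=-1)
        z_image = F.normalize(self.image_proj(image_emb), dim=-1)
        return z_text, z_image


def bone_length_consistency(joints, parents):
    """Penalise frame-to-frame variation in bone lengths of a generated motion.

    joints:  (T, J, 3) joint positions over T frames.
    parents: parent index per joint (-1 for the root).
    """
    bones = [(j, p) for j, p in enumerate(parents) if p >= 0]
    child = torch.tensor([j for j, _ in bones])
    parent = torch.tensor([p for _, p in bones])
    lengths = (joints[:, child] - joints[:, parent]).norm(dim=-1)  # (T, num_bones)
    return lengths.var(dim=0).mean()  # zero when the skeleton stays rigid over time


if __name__ == "__main__":
    projector = SharedLatentProjector()
    z_t, z_i = projector(torch.randn(4, 768), torch.randn(4, 512))

    # Contrastive alignment: matching text/image prompt pairs attract each other.
    logits = z_t @ z_i.t() / 0.07
    align_loss = F.cross_entropy(logits, torch.arange(4))

    motion = torch.randn(60, 22, 3)          # 60 frames, 22 joints (illustrative)
    parents = [-1] + list(range(21))         # toy chain skeleton, not a real rig
    skel_loss = bone_length_consistency(motion, parents)
    print(float(align_loss), float(skel_loss))
```

In a full pipeline such alignment and skeleton terms would be weighted and added to the motion generator's training objective; the weighting is not specified by the abstract and is left out here.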

Published

2025-08-06