When exploring their surroundings, communicating with others and expressing themselves, humans can perform a wide range of body motions. The ability to realistically replicate these motions and apply them to human and humanoid characters could be highly valuable for developing video games, animations, virtual reality (VR) content and training videos for professionals.
Researchers at Peking University’s Institute for Artificial Intelligence (AI) and the State Key Laboratory of General AI recently introduced new models that could simplify the generation of realistic motions for human characters or avatars. The work is published on the arXiv preprint server.
Their proposed approach for the generation of human motions, outlined in a paper presented at CVPR 2025, relies on a data augmentation technique called MotionCutMix and a diffusion model called MotionReFit.
“As researchers exploring the intersection of artificial intelligence and computer vision, we were fascinated by recent advances in text-to-motion generation—systems that could create human movements from textual descriptions,” Yixin Zhu, senior author of the paper, told Tech Xplore.
“However, we noticed a critical gap in the technological landscape. While generating motions from scratch had seen tremendous progress, the ability to edit existing motions remained severely limited.”
Artists, video game developers and animation filmmakers typically do not create new content entirely from scratch, but rather draw inspiration from previous works, refining them and adjusting them until they attain their desired results. Most existing AI and machine learning systems, however, are not designed to support this editing and inspiration-based creative workflow.
“Previously developed systems that did attempt motion editing faced a significant constraint, namely, they required extensive pre-collected triplets of original motions, edited motions, and corresponding instructions—data that’s extremely scarce and expensive to create,” said Nan Jiang, co-author of the paper. “This made them inflexible, only capable of handling specific editing scenarios they were explicitly trained on.”
The key objective of the recent study by Zhu and his colleagues was to create a system that could edit arbitrary human motions based on written instructions provided by users, without the need for task-specific inputs or body part specifications.
They wanted this system to support both changes to specific body parts (i.e., spatial editing) and the adaptation of motions over time (i.e., temporal editing), generalizing well across various scenarios even when trained on limited annotated data.
“MotionCutMix, the approach to machine learning that we devised, is a simple yet effective training technique that helps AI systems learn to edit 3D human motions based on text instructions,” explained Hongjie Li, co-author of the paper.
“Similarly to how chefs can create many different dishes by mixing and matching ingredients—MotionCutMix creates diverse training examples by blending body parts from different motion sequences.”
The learning approach developed by the researchers can select specific body parts (e.g., a character’s arms, legs, torso, etc.) in a motion sequence, combining these with parts present in another sequence. Instead of abruptly transitioning from the movements of one body part to those of another, MotionCutMix gradually blends the boundaries between them, thus producing smoother movements.
“For example, when combining an arm movement from one motion with a torso from another, it smoothly interpolates the shoulder area,” said Jiang. “For each blended motion, it creates a new training example consisting of an original motion, an edited version of that motion, and a text instruction describing the change.”
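The blending step Jiang describes can be illustrated with a minimal sketch. It is not the authors' implementation: the six-joint skeleton, the 0.5 shoulder weight, the use of random arrays as stand-in clips and the hand-written instruction are all assumptions made for illustration, and a real system would likely blend joint rotations rather than raw coordinates.

```python
import numpy as np

# Hypothetical 6-joint skeleton; names and indices are illustrative only.
JOINTS = ["pelvis", "spine", "left_shoulder", "left_elbow", "left_wrist", "head"]
LEFT_ARM = [3, 4]      # joints taken entirely from the donor motion
BOUNDARY = {2: 0.5}    # shoulder is mixed half-and-half (assumed weight)


def soft_mask(num_joints, part, boundary):
    """Per-joint weights: 1 inside the part, fractional at its boundary, 0 elsewhere."""
    mask = np.zeros(num_joints)
    mask[part] = 1.0
    for joint, weight in boundary.items():
        mask[joint] = weight
    return mask


def blend(base, donor, mask):
    """Mix two motions of shape (frames, joints, 3) using per-joint weights."""
    w = mask[None, :, None]  # broadcast over frames and xyz coordinates
    return (1.0 - w) * base + w * donor


rng = np.random.default_rng(0)
walk = rng.normal(size=(30, len(JOINTS), 3))  # stand-in for a walking clip
wave = rng.normal(size=(30, len(JOINTS), 3))  # stand-in for a waving clip

mask = soft_mask(len(JOINTS), LEFT_ARM, BOUNDARY)
edited = blend(walk, wave, mask)

# One training example: original motion, edited motion, text instruction.
# The instruction is written by hand here purely for illustration.
triplet = {"original": walk, "edited": edited, "instruction": "wave with the left arm"}
```

In this sketch the elbow and wrist follow the waving clip, the pelvis, spine and head follow the walking clip, and the shoulder sits halfway between the two, which is the soft transition the quote refers to.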
Most previously introduced approaches for generating human motions were trained on fixed datasets, typically containing annotated videos of people moving in different ways. In contrast, MotionCutMix can generate new training samples on the fly, which enables learning from large libraries of motion data that do not need to be manually annotated.
This is advantageous considering that most content that is readily available online is not annotated and thus cannot be leveraged by other existing approaches. Notably, the new framework developed by the researchers supports both the editing of what movement a specific body part is performing (i.e., semantic elements) and how it is doing it (i.e., stylistic elements).
“MotionCutMix requires far fewer annotated examples to achieve good results, creating potentially millions of training variations from a small set of labeled examples,” said Zhu.
“By training on diverse combinations of body parts and motions, the model learns to handle a wider range of editing requests. Despite creating more complex training examples, it doesn’t significantly slow down the training process. The soft masking and body part coordination create smoother, more natural edited motions without awkward transitions or unrealistic movements.”
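To make the idea of on-the-fly augmentation concrete, here is a small sketch of a training sampler. The function names, the `ratio` parameter and the recombination scheme inside `toy_augment` are assumptions rather than details confirmed by the researchers, and motions are represented as plain strings so that the control flow is easy to follow.

```python
import random


def make_sampler(labeled, unlabeled, augment, ratio, seed=0):
    """Return a function that draws one training triplet per call.

    With probability `ratio`, a labeled triplet is recombined with a random
    unlabeled motion via `augment`; otherwise it is returned unchanged.
    """
    rng = random.Random(seed)

    def sample():
        triplet = rng.choice(labeled)
        if rng.random() < ratio:
            return augment(triplet, rng.choice(unlabeled))
        return triplet

    return sample


# Toy stand-ins: (original, edited, instruction) with motions as strings.
labeled = [("walk", "walk+wave", "wave with the left arm")]
unlabeled = ["jog", "crouch", "turn"]


def toy_augment(triplet, other):
    original, edited, instruction = triplet
    # Assumed scheme: the same unlabeled motion fills the untouched body
    # parts of both clips, so the instruction still describes the change.
    return (f"{original}|{other}", f"{edited}|{other}", instruction)


sample = make_sampler(labeled, unlabeled, toy_augment, ratio=1.0)
batch = [sample() for _ in range(4)]
```

Because new combinations are drawn at sampling time, a single labeled triplet can be paired with every clip in the unlabeled library without storing any of the resulting variations.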
In addition to the MotionCutMix training data augmentation approach, Zhu and his colleagues developed a motion generation and editing model called MotionReFit. While MotionCutMix can be used to create a diverse range of training samples, MotionReFit is an auto-regressive diffusion model that processes these samples and learns to generate and modify human motions.
In contrast with other human motion generation models, MotionReFit allows users to precisely modify sequences of human motions, simply by describing the changes they would like to make. To the best of the team’s knowledge, their system is the first that can handle both spatial and temporal edits without requiring additional inputs and user specifications.
“At its core, MotionReFit consists of an auto-regressive conditional diffusion model that processes motion segment by segment, guided by the original motion and text instructions,” explained Ziye Yuan, co-author of the paper.
“This design overcomes key limitations of previous approaches, as it works with arbitrary input motions and high-level text instructions, without needing explicit body part specifications. Meanwhile, it preserves natural coordination between body parts while making substantial changes to motion, while also achieving smooth transitions both spatially (between modified and unmodified body regions) and temporally (across frames).”
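The segment-by-segment processing Yuan describes can be sketched as a pair of nested loops. This shows only the control flow: `toy_denoiser` is a hypothetical placeholder for the learned network, the segment length and step count are arbitrary, and the update rule is a simple interpolation rather than a real diffusion sampler.

```python
import numpy as np


def toy_denoiser(noisy, original_seg, prev_seg, text, step, num_steps):
    """Placeholder for a learned network.

    It ignores `prev_seg` and `text` and only pulls the sample toward the
    source segment, so the loop below can run without a trained model.
    """
    return noisy + (original_seg - noisy) / (num_steps - step)


def edit_motion(original, text, denoiser, seg_len=16, num_steps=10, seed=0):
    """Process a (frames, joints, 3) motion one segment at a time."""
    rng = np.random.default_rng(seed)
    segments = []
    prev = np.zeros((seg_len,) + original.shape[1:])  # no history before the first segment
    for start in range(0, len(original), seg_len):
        src = original[start:start + seg_len]
        x = rng.normal(size=src.shape)  # each segment starts from noise
        for step in range(num_steps):
            # Conditioned on the source segment, the previous output and the text.
            x = denoiser(x, src, prev[: len(src)], text, step, num_steps)
        segments.append(x)
        prev = x  # the finished segment conditions the next one
    return np.concatenate(segments, axis=0)


motion = np.random.default_rng(1).normal(size=(40, 6, 3))
result = edit_motion(motion, "raise the left arm", toy_denoiser)
```

With the placeholder denoiser the output simply reproduces the input, since no edit is ever learned. In a trained model the same loop structure would let each segment depend on the one before it, which is what keeps transitions smooth across frames.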
The researchers evaluated their proposed system in a series of tests and found that the quality of the generated human motions improved as MotionCutMix was applied more heavily during training. This confirmed their prediction that exposing the MotionReFit model to a wider range of motion combinations during training leads to better generalization across different motions and scenarios.
In addition, Zhu and his colleagues combined their data augmentation technique with a baseline model, called TMED. Remarkably, they found that MotionCutMix substantially improved the performance of this model, suggesting that it could be used to boost the learning of other architectures beyond MotionReFit.
“Despite introducing more complex training examples, training convergence is maintained even with high MotionCutMix ratios,” said Zhu.
“All variants converge within 800k steps, indicating the technique doesn’t create significant computational overhead. These findings collectively demonstrate that MotionCutMix addresses a fundamental challenge in motion editing—the limited availability of annotated triplets—by leveraging existing motion data to create virtually unlimited training variations through smart compositional techniques.”
In the future, the data augmentation technique and human motion generation model developed by this team of researchers could be used to create and edit a wide range of content that features human or humanoid characters. It could prove to be a particularly valuable tool for animators, video game developers and other video content creators.
“Motion editing enables animators to rapidly iterate on character movements without starting from scratch,” said Zhu.
“Game developers can generate extensive motion variations from limited captured data, creating diverse NPC behaviors and player animations. Human-robot interaction can be improved by enabling robots to adjust their movements based on natural language feedback. Manufacturing environments can fine-tune robotic motion patterns without reprogramming.”
The system created by Zhu and his colleagues relies on a text-based interface, so it is also accessible to non-expert users who do not have experience with the creation of games or animations. In the future, it could be adapted for use in robotics research, for instance as a tool to improve the movements of humanoid service robots.
“Developing advanced motion representation techniques that better capture dependencies across longer sequences will be crucial for handling complex temporal patterns,” added Jiang. “This could involve specialized attention mechanisms to track consistency in sequential actions, and hierarchical models that understand both micro-movements and macro-level patterns.”
As part of their next studies, the researchers plan to broaden their system’s capabilities, for instance, allowing it to use uploaded images as visual references and make edits based on demonstrations provided by users.
They would also like to enhance its ability to edit motions in ways that are aligned with environmental constraints and with the context in which they are performed.
More information:
Nan Jiang et al, Dynamic Motion Blending for Versatile Motion Editing, arXiv (2025). DOI: 10.48550/arxiv.2503.20724
Citation: Dynamic model can generate realistic human motions and edit existing ones (2025, April 13), retrieved 13 April 2025 from https://techxplore.com/news/2025-04-dynamic-generate-realistic-human-motions.html
                                            
                                            

