In 2018, a simulated humanoid watched a motion capture clip of a backflip, then taught itself to execute one in a physics engine. The character stumbled, crashed, recovered, and eventually nailed it, all through reinforcement learning with no hand-coded rules about how a body should rotate in midair. The paper was called DeepMimic, and it was published at SIGGRAPH by Xue Bin Peng and colleagues. It has since been cited over 1,400 times.

Seven years later, in November 2025, a team building on that same lineage of research published BFM-Zero. A single neural network, trained without any task-specific reward, was deployed on a real Unitree G1 humanoid robot. The robot tracked motions, reached goals, and optimized behaviors it had never been explicitly taught, all zero-shot, all from one model.

Between those two dates lies a quiet revolution. The field of physics-based character control went from requiring a separate policy for every motion clip to building what researchers now call Behavior Foundation Models: pretrained generalist controllers that can be prompted, much like a language model, to produce any behavior within their learned distribution. This is the story of how that happened, and why it matters far beyond robotics.


The Forty-Year Warm-Up

Every modern breakthrough in physical character control rests on a foundation laid decades ago. In 1988, Andrew Witkin and Michael Kass published “Spacetime Constraints” at SIGGRAPH, framing character animation as constrained optimization: tell the character what to do, specify how it should be done, and let physics fill in the rest. The idea was elegant. The computation was brutal. A Luxo lamp jumping over an obstacle was about as complex as the hardware could handle.

Through the 1990s, Jessica Hodgins, then at Georgia Tech, pushed the boundary, simulating human athletics like running and vaulting with physically valid dynamics. These characters obeyed gravity and inertia. They also required hand-designed controllers for each motion, tuned by researchers who understood both biomechanics and differential equations. Adding a new skill meant months of engineering.

The motion capture revolution of the 2000s appeared to solve the problem by sidestepping it. If you could record a real human moving and replay that data on a digital character, why simulate physics at all? The game and film industries adopted this approach wholesale. It produced beautiful animation. It also produced characters that could not react to anything unexpected: a gust of wind, a shove, a floor that was not exactly where the animator expected it. Kinematic animation looks perfect until the world pushes back.

By 2014, a few researchers began asking whether deep learning could bridge the gap. Sergey Levine and colleagues at UC Berkeley started combining trajectory optimization with neural networks, showing that learned policies could handle continuous control problems that classical methods could not scale to. The pieces were falling into place. But the field needed a demonstration that was undeniable.


When a Backflip Changed Everything

DeepMimic, published at SIGGRAPH 2018 by Peng, Abbeel, Levine, and van de Panne, provided that demonstration. The core idea was deceptively simple: give a reinforcement learning agent a motion capture clip as a reference, reward it for matching that reference while obeying physics, and let it figure out the rest. The agent ran in a physics simulator. It experienced gravity, ground contact, joint limits, and inertia. There was no motion playback. The character had to actively generate torques at every joint to reproduce the motion.
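
To make the training signal concrete, here is a minimal sketch of a DeepMimic-style imitation reward: exponentials of tracking errors on joint rotations, velocities, end-effectors, and the center of mass, summed with fixed weights. The weights, scales, and use of plain vector differences (the paper compares joint rotations as quaternions) are illustrative rather than the exact published values.

```python
import numpy as np

def imitation_reward(q, q_ref, dq, dq_ref, ee, ee_ref, com, com_ref):
    """DeepMimic-style pose-tracking reward (illustrative weights and scales).

    Each term measures how closely the simulated character matches the
    reference clip at the current frame; the exponentials keep every term
    in (0, 1] so no single error dominates.
    """
    r_pose = np.exp(-2.0 * np.sum((q - q_ref) ** 2))      # joint rotations
    r_vel = np.exp(-0.1 * np.sum((dq - dq_ref) ** 2))     # joint velocities
    r_ee = np.exp(-40.0 * np.sum((ee - ee_ref) ** 2))     # end-effector positions
    r_com = np.exp(-10.0 * np.sum((com - com_ref) ** 2))  # center-of-mass position
    return 0.65 * r_pose + 0.10 * r_vel + 0.15 * r_ee + 0.10 * r_com
```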

The results were striking. Simulated humanoids performed backflips, spin kicks, cartwheels, and complex martial arts sequences. They recovered from perturbations. They adapted when their body proportions were changed. And the quality was far beyond anything prior RL-based methods had achieved.

But DeepMimic had a critical limitation. Every motion clip required training a separate policy from scratch. Want a character that can both walk and do a backflip? Train two policies and figure out how to switch between them. Want 100 skills? Train 100 policies. The approach produced spectacular individual motions but no path to a general-purpose controller.

AMP (Adversarial Motion Priors), published at SIGGRAPH 2021 by the same group, attacked the reward engineering bottleneck. Instead of hand-designing a reward function that measured how closely the character matched a reference clip, AMP trained a GAN-style discriminator to distinguish “natural” from “unnatural” motion. The discriminator learned from unstructured motion data what good movement looked like. The character then learned to fool the discriminator while accomplishing tasks. The manual reward design disappeared, replaced by a learned style prior that generalized across clips.
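
The mechanics are easiest to see in code. Below is a minimal sketch of the adversarial ingredient: a discriminator that scores state transitions, and a style reward derived from its output. The least-squares shaping follows the flavor of AMP, but the network size and exact formulation here are illustrative.

```python
import torch
import torch.nn as nn

class MotionDiscriminator(nn.Module):
    """Scores state transitions (s, s') as 'reference-like' or not."""
    def __init__(self, obs_dim: int, hidden: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(2 * obs_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, s, s_next):
        return self.net(torch.cat([s, s_next], dim=-1))

def style_reward(disc: MotionDiscriminator, s, s_next):
    """Least-squares-GAN style reward: high when the discriminator believes
    the policy's transition came from the motion dataset."""
    d = disc(s, s_next)
    return torch.clamp(1.0 - 0.25 * (d - 1.0) ** 2, min=0.0)
```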

ASE (Adversarial Skill Embeddings), published at SIGGRAPH 2022, took the next step: learning a latent space of reusable skills from a large motion corpus. A low-level policy mapped latent codes to physical actions; a high-level controller selected which latent code to activate. For the first time, a single trained model could perform diverse skills and be directed toward new tasks by training only the high-level selector. The repertoire was reusable. The architecture was hierarchical. And NVIDIA’s massively parallel GPU simulator, Isaac Gym, made it possible to train on over a decade of simulated experience.
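
Schematically, the hierarchy looks like this: a low-level policy conditioned on a latent skill code, and a small high-level network that emits codes for a downstream task. The layer sizes and unit-sphere latent convention below are a simplified sketch, not ASE's exact architecture.

```python
import torch
import torch.nn as nn

class LowLevelSkillPolicy(nn.Module):
    """Maps (proprioceptive state, latent skill code z) to joint actions."""
    def __init__(self, obs_dim, latent_dim, act_dim, hidden=512):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim + latent_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, act_dim), nn.Tanh(),
        )

    def forward(self, obs, z):
        return self.net(torch.cat([obs, z], dim=-1))

class HighLevelController(nn.Module):
    """Chooses which skill code to activate given a task observation.
    Only this module is retrained for a new downstream task."""
    def __init__(self, task_obs_dim, latent_dim, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(task_obs_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, latent_dim),
        )

    def forward(self, task_obs):
        z = self.net(task_obs)
        return z / (z.norm(dim=-1, keepdim=True) + 1e-8)  # project to the unit sphere
```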

The trajectory from DeepMimic to ASE traces a clear arc: from single-clip imitation, to style-aware reward learning, to reusable skill libraries. Each step removed a constraint that had made physics-based control impractical for production use. But even ASE required training a new high-level controller for every downstream task. The field still lacked a model that could generalize without retraining.


One Controller for Every Motion

The period from 2023 to 2024 produced a cluster of papers that, collectively, solved the generalization problem for simulated characters.

PULSE, published as an ICLR 2024 Spotlight paper by Zhengyi Luo and colleagues, distilled nearly the entire AMASS motion capture dataset into a 32-dimensional latent space with 99.8% coverage. The resulting physics-based controller could track virtually any human motion, from casual walking to complex dance, through a two-tier architecture: a low-level imitator handled the physics, while a learnable prior provided motion suggestions for downstream tasks. PULSE was not a single-task policy. It was a motion foundation, a universal low-level controller that higher-level systems could build upon.

MaskedMimic, published by Tessler, Peng, and colleagues at SIGGRAPH Asia 2024, reframed the entire problem. Instead of tracking a specific reference motion, MaskedMimic treated physics-based control as motion inpainting. Give the model partial information, a text description, a few keyframes, an object to interact with, any combination, and it fills in the full-body motion while respecting physics. The analogy to masked language modeling in NLP is deliberate. Just as BERT learned to predict missing words, MaskedMimic learned to predict missing motion. The result was a single controller that handled text commands, joystick input, path following, object interaction, and stylistic direction, all without task-specific reward engineering.
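
A rough sketch of the inpainting interface: each conditioning modality occupies a slot that is either filled with real features or zeroed out, with a mask bit telling the controller which constraints are actually present. The slot layout and shapes here are assumptions for illustration, not the paper's actual input format.

```python
import torch

def build_masked_conditioning(keyframes, text_embedding, object_state,
                              keep_keyframes=True, keep_text=False, keep_object=False):
    """Assemble a partial-information prompt in the spirit of MaskedMimic.

    Each modality slot carries either its real features or zeros, plus a mask
    bit indicating whether that constraint is present. The controller is
    trained to fill in a full-body motion consistent with whatever is given.
    """
    def slot(x, keep):
        mask_shape = x.shape[:-1] + (1,)
        mask = torch.ones(mask_shape) if keep else torch.zeros(mask_shape)
        features = x if keep else torch.zeros_like(x)
        return torch.cat([features, mask], dim=-1)

    return torch.cat([
        slot(keyframes, keep_keyframes),
        slot(text_embedding, keep_text),
        slot(object_state, keep_object),
    ], dim=-1)
```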

MoConVQ, presented at SIGGRAPH 2024 by Yao and colleagues, took yet another angle. Using VQ-VAE and model-based RL, it learned discrete motion tokens from tens of hours of motion data. Because the tokens were discrete, they could be generated by a GPT-style transformer, which meant text-to-motion through a language model. More remarkably, the discrete representation enabled in-context learning with large language models: describe a complex abstract task in natural language, and the LLM could compose motion tokens to accomplish it.
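
The tokenization step is standard vector quantization: continuous motion features are snapped to their nearest codebook entry, and the resulting integer ids are what a GPT-style transformer can emit. A minimal sketch, with an illustrative codebook size:

```python
import torch
import torch.nn as nn

class MotionTokenizer(nn.Module):
    """Vector-quantizes continuous motion features into discrete token ids,
    in the spirit of MoConVQ's VQ-VAE stage (sizes are illustrative)."""
    def __init__(self, feature_dim=64, codebook_size=512):
        super().__init__()
        self.codebook = nn.Embedding(codebook_size, feature_dim)

    def quantize(self, features):
        # features: (batch, feature_dim)
        dists = torch.cdist(features, self.codebook.weight)  # distance to every codebook entry
        token_ids = dists.argmin(dim=-1)                     # discrete motion tokens
        quantized = self.codebook(token_ids)                 # features a decoder/controller consumes
        return token_ids, quantized
```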

SuperPADL, also presented at SIGGRAPH 2024 by Juravsky, Peng, and colleagues, pushed the scale frontier. Training on over 5,000 motion skills through progressive distillation, where specialized RL experts were iteratively merged into larger generalist policies, SuperPADL produced a language-directed physics-based controller that ran in real time on a consumer GPU. The sheer number of skills was the point. Five thousand is not a research demo. It is approaching the vocabulary of motion needed for practical applications.
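
The distillation mechanic itself is simple; the engineering is in doing it progressively at scale. Below is a simplified sketch of one round, in which each specialist expert labels states from its own skill and the generalist student regresses onto those labels. Real pipelines also mix in previously distilled generalists and corrections on the student's own rollouts; the interfaces here are assumptions.

```python
import torch
import torch.nn.functional as F

def distillation_round(student, experts, expert_states, optimizer):
    """One simplified round of expert-to-generalist distillation.

    experts: list of frozen specialist policies, one per skill.
    expert_states: list of state batches, one batch per expert's skill.
    """
    total_loss = 0.0
    for expert, states in zip(experts, expert_states):
        with torch.no_grad():
            target_actions = expert(states)       # specialist labels its own skill's states
        loss = F.mse_loss(student(states), target_actions)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        total_loss += loss.item()
    return total_loss / len(experts)
```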

CLoSD, selected as an ICLR 2025 Spotlight, introduced the concept of closed-loop text-driven physics control. A fast autoregressive diffusion model, DiP, served as a real-time planner, generating 40-frame motion plans at 3,500 frames per second, which is 175 times faster than real time. An RL tracking controller executed those plans in physics simulation, and the simulation fed back to the planner, forming a closed loop. The character could navigate to locations, strike objects, sit on couches, and transition between these tasks seamlessly, all from text prompts. Where prior methods required dedicated controllers and elaborate reward functions for each task, CLoSD needed only a text instruction and a target.
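
The control loop, stripped to its skeleton, looks roughly like the sketch below: the planner produces a short kinematic plan, the tracker turns it into physics-level actions, and the simulated state feeds back so the next plan starts from where the character actually is. The function names and replanning interval are assumptions, not CLoSD's API.

```python
def closed_loop_control(planner, tracker, sim, text_prompt, target,
                        plan_horizon=40, replan_every=10, total_steps=1000):
    """Schematic closed-loop text-driven control in the spirit of CLoSD."""
    state = sim.reset()
    for step in range(total_steps):
        if step % replan_every == 0:
            # Replan from the *simulated* state, not from the previous plan.
            plan = planner.generate(state, text_prompt, target, horizon=plan_horizon)
        # The RL tracker converts the planned frame into physics-level actions.
        action = tracker.act(state, plan[step % replan_every])
        state = sim.step(action)
    return state
```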

These five papers arrived within roughly 18 months of each other. Taken together, they demonstrated that a single physics-based controller could handle arbitrary human motions directed by text, keyframes, objects, or latent codes. The generalization problem for simulated characters was, if not fully solved, fundamentally changed.


The Foundation Model Arrives

The work described above operated in simulation. Characters moved convincingly in physics engines, but they were not real robots. The Behavior Foundation Model paradigm emerged in 2025 to close that gap.

Meta FAIR fired the first shot with Meta Motivo, published in April 2025. The approach was methodologically distinctive. Rather than using supervised learning on motion capture data, Motivo trained an unsupervised RL algorithm called FB-CPR (Forward-Backward with Conditional Policy Regularization). The forward-backward representation embeds unlabeled trajectories into a shared latent space. A conditional discriminator encourages the policy to “cover” the full distribution of behaviors in the dataset. The result is a model that can be prompted at inference time, with no task-specific training, to track motions, reach goals, or optimize rewards.
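
Prompting works because the forward-backward representation turns task specification into arithmetic in latent space. For a reward-maximization prompt, the task embedding is the reward-weighted average of backward embeddings over a batch of labeled states; for goal reaching, it is simply the goal's embedding. A schematic sketch, with the normalization and interfaces treated as illustrative:

```python
import numpy as np

def prompt_for_reward(B, states, rewards):
    """Zero-shot reward prompting with a pretrained forward-backward model.

    B maps a (batch, state_dim) array to (batch, latent_dim) backward
    embeddings. The task code is z = E[B(s) * r(s)]; the pretrained policy
    pi(a | s, z) is then conditioned on z with no further training.
    """
    emb = B(states)                               # (batch, latent_dim)
    z = (emb * rewards[:, None]).mean(axis=0)     # reward-weighted average embedding
    return z / (np.linalg.norm(z) + 1e-8)         # keep z on the latent sphere

def prompt_for_goal(B, goal_state):
    """Goal reaching: the prompt is simply the goal's backward embedding."""
    z = B(goal_state[None])[0]
    return z / (np.linalg.norm(z) + 1e-8)
```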

The numbers are worth stating plainly. Without ever being trained on a specific task, Motivo achieved 73.4% of the performance of task-specific algorithms trained explicitly for each objective. Human evaluators rated its motions as “more human-like” than those of baselines in 83% of reward-based tasks and 69% of goal-reaching scenarios. Meta open-sourced the model and code under a CC BY-NC 4.0 license.

The limitations were equally clear. Motivo operated in simulation only. Its humanoid used proprioceptive observations, meaning no vision and no object interaction. Tasks that diverged significantly from the motion capture training distribution showed degraded performance. It was a proof of concept, but a powerful one.

In September 2025, Zeng and colleagues published BFM (Behavior Foundation Model), using a generative CVAE architecture pretrained on large-scale behavior data. The model acquired reusable behavioral knowledge through masked online distillation and was validated on both simulated and real humanoid platforms.

Then came BFM-Zero in November 2025, the paper that demonstrated what the paradigm could actually deliver. Built by a team spanning CMU, Meta FAIR, and the LeCAR Lab, BFM-Zero extended the FB-CPR algorithm with structured latent space learning, domain randomization across physical parameters (link masses, friction coefficients, joint offsets), and asymmetric training that handled the gap between full simulation state and partial real-world observations. The model was deployed zero-shot on a physical Unitree G1 robot. It tracked walking patterns, performed highly dynamic dances, executed fighting motions, and, when pushed off balance, recovered with what the authors described as “remarkably gentle, natural, and safe behavior.”
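
Domain randomization of the kind the paper describes amounts to resampling the simulator's physical parameters every episode, so the policy never overfits to one idealized robot. A minimal sketch with illustrative ranges:

```python
import numpy as np

def sample_randomized_physics(rng: np.random.Generator, num_joints: int):
    """Per-episode randomization of physical parameters (illustrative ranges)."""
    return {
        "link_mass_scale": rng.uniform(0.8, 1.2),                   # scale each link's nominal mass
        "ground_friction": rng.uniform(0.4, 1.2),                   # contact friction coefficient
        "joint_offset_rad": rng.normal(0.0, 0.02, size=num_joints), # small per-joint calibration offsets
    }
```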

No reward function was designed for any of these tasks. No fine-tuning was performed after deployment. A single pretrained model, prompted at inference time, controlled a real robot across diverse whole-body behaviors.


From Simulation to Factory Floor

BFM-Zero was not the only sim-to-real success story of 2025. A parallel stream of research focused less on generality and more on the raw physical difficulty of getting real robots to move like humans.

KungfuBot, accepted at NeurIPS 2025, tackled the motions that every prior system had avoided: kungfu strikes, rapid dance sequences, and other highly dynamic actions that push humanoid hardware to its physical limits. The key insight was a two-stage pipeline. First, raw motion capture data was processed through extraction, filtering, correction, and retargeting, ensuring physical feasibility for the target robot. Second, an adaptive motion tracking system dynamically adjusted its tolerance for tracking errors, allowing the policy to maintain stability during aggressive movements without being so loose that quality suffered. Deployed on a Unitree G1, KungfuBot achieved real-world tracking metrics for Tai Chi that closely matched its simulation results, a rare quantitative validation that the reality gap was genuinely small. The follow-up, KungfuBot2, extended the system to handle minute-long motion sequences spanning multiple skills.
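
One way to picture the adaptive tracking idea: the tolerance of the tracking reward widens when recent errors are large, so the learning signal stays informative during aggressive motions, and tightens again when tracking is good. The update rule and constants below illustrate the concept; they are not KungfuBot's published formulation.

```python
import numpy as np

def adaptive_tracking_reward(pos_error, recent_errors, min_sigma=0.02, max_sigma=0.3):
    """Tracking reward with an adaptive error tolerance (conceptual sketch).

    sigma grows with the policy's recent tracking error so the reward does not
    collapse to zero during hard, dynamic segments, and shrinks when tracking
    is tight so the policy is pushed toward higher fidelity.
    """
    sigma = np.clip(np.mean(recent_errors), min_sigma, max_sigma)
    return float(np.exp(-(pos_error ** 2) / (2.0 * sigma ** 2)))
```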

ExBody2, presented at an RSS 2025 workshop, approached the problem from the opposite direction: not maximal dynamics but maximal expressiveness. The framework separated keypoint tracking from velocity control and distilled privileged teacher policies into deployable student policies. The result was a humanoid that could walk, crouch, dance, and punch with expressive whole-body coordination, a repertoire better suited to service robotics and entertainment than to martial arts.

GMT (General Motion Tracking) by Chen, Ji, and colleagues, with Peng as co-advisor, introduced Adaptive Sampling and a Motion Mixture-of-Experts architecture to track diverse motions with a single unified policy in the real world. Where KungfuBot specialized in dynamic skills and ExBody2 in expressive ones, GMT aimed for breadth, a general-purpose tracking policy that worked across the spectrum.
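
A mixture-of-experts policy, in schematic form: a gating network blends several expert networks per observation, which lets a single unified policy cover very different motion regimes. The sizes and soft-blending scheme below are illustrative.

```python
import torch
import torch.nn as nn

class MotionMoEPolicy(nn.Module):
    """Schematic mixture-of-experts tracking policy (sizes are illustrative)."""
    def __init__(self, obs_dim, act_dim, num_experts=4, hidden=256):
        super().__init__()
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(obs_dim, hidden), nn.ReLU(),
                          nn.Linear(hidden, act_dim))
            for _ in range(num_experts)
        ])
        self.gate = nn.Sequential(nn.Linear(obs_dim, num_experts), nn.Softmax(dim=-1))

    def forward(self, obs):
        weights = self.gate(obs)                                       # (batch, num_experts)
        actions = torch.stack([e(obs) for e in self.experts], dim=1)   # (batch, num_experts, act_dim)
        return (weights.unsqueeze(-1) * actions).sum(dim=1)            # soft blend of expert actions
```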

The convergence point for all three is NVIDIA’s GR00T N1.6, announced in January 2026. GR00T combines a multimodal Vision-Language-Action model with whole-body RL control in Isaac Lab and synthetic data generation through COMPASS. It is designed to be the operating system for humanoid robots, integrating the perception, planning, and physical control layers that the research community has been developing separately. Whether that integration succeeds at production scale remains to be demonstrated, but the architecture reflects where the field is heading: a single foundation model that sees, understands language, and controls a physical body.


Beyond Robots: Where Physical Control Goes Next

The applications of physics-based character control extend well beyond humanoid robots. The same technology is reshaping industries that most AI coverage overlooks.

In game development, the implications are architectural. Current game characters use state machines: if the player presses jump, play the jump animation; if the character lands on uneven ground, hope the animation blends look acceptable. Physics-based controllers like MaskedMimic and CLoSD eliminate this entirely. The character exists in a physics simulation. It responds to the environment in real time. Text or design intent replaces manual animation triggers. For studios spending millions on motion capture and animation state graphs, this is not an incremental improvement. It is a different production model.

Film and visual effects face a similar disruption. Stunt previsualization, digital doubles, and crowd simulation have traditionally required separate specialized pipelines. A physics-based character that can take a text prompt like “stumble backward and fall down stairs” and produce a physically valid, unique performance each time eliminates the need for much of that infrastructure. Tools like Cascadeur, which uses AI-assisted physics to help animators create realistic keyframe animation, are already commercial. Ziva Dynamics, acquired by Unity and used by Sony Pictures Imageworks, simulates anatomical detail (muscles, fat, and skin) with finite element methods. NVIDIA has demonstrated video-to-motion-capture technology that extracts movement data from ordinary footage without specialized hardware. Each of these tools operates at a different level of the production pipeline, but they all depend on the same underlying principle: physics as the arbiter of visual realism.

Wearable robotics and exoskeletons represent a less obvious but potentially transformative application. Behavior Foundation Models learn what “natural human movement” looks like from massive motion capture datasets. That knowledge can be used to generate assistive torques for exoskeletons, predict the wearer’s intended motion, and adapt to individual gait patterns. The sim-to-real pipeline that trains robot policies in simulation and deploys them on physical hardware applies directly to powered orthoses and rehabilitation devices. Nature published a 2024 study by Luo and colleagues showing that simulation-trained policies could transfer to real exoskeletons, reducing metabolic cost by up to 24.3% for walking. The connection between character animation research and clinical rehabilitation is not metaphorical. It is the same math.

Sports science and biomechanical analysis benefit from the musculoskeletal simulation capabilities that underpin much of this work. Physics-based models can simulate injury mechanisms, optimize rehabilitation protocols, and detect abnormal movement patterns, all using the same BFM-derived motion priors that power virtual characters and robots.


The Hard Problems That Remain

Acknowledging what has not been solved is as important as documenting what has.

The generality-versus-precision tradeoff remains real. BFM-Zero achieves zero-shot versatility, but task-specialized controllers still outperform it on any individual task. Meta Motivo’s 73.4% of specialist performance is impressive for zero-shot, but it means a 26.6% gap. For applications where precision matters, such as surgical assistance or high-speed industrial assembly, that gap is not acceptable. The field has not yet produced a foundation model that matches specialists across the board.

Object manipulation is the conspicuous gap. Nearly every system described in this article controls a character or robot’s own body. Picking up objects, using tools, manipulating materials: these contact-rich interactions involve physics that current simulators approximate poorly. Friction models are simplified. Deformable objects are expensive to simulate. The result is that a BFM can make a humanoid walk, dance, and fight, but not fold laundry. That limitation is not architectural. It is a simulation fidelity problem, and closing it will require either dramatically better physics engines or vastly more real-world data.

Long-horizon behavior remains challenging. KungfuBot2 demonstrated minute-long motion sequences, which is progress. But planning coherent behavior over tens of minutes, the kind of temporal reasoning needed for a robot butler to prepare a meal, is beyond current capabilities. CLoSD’s closed-loop planning runs in real time, but its planning horizon is measured in seconds, not minutes.

The computational cost of training is non-trivial. BFM-Zero trains in Isaac Lab at 200 Hz using massive GPU parallelism. KungfuBot requires approximately 27 hours on an RTX 4090 per policy. SuperPADL’s progressive distillation pipeline is explicitly designed to manage the cost of scaling to thousands of skills. These training budgets are accessible to well-funded labs but remain prohibitive for independent researchers or small studios. Whether the field follows the language model pattern of increasing centralization, or develops more efficient training methods, is an open question.


What This Actually Means

Here is what the research trajectory supports.

The Behavior Foundation Model paradigm is real, and it is following the same scaling pattern as language models. The progression from DeepMimic (one clip, one policy) to BFM-Zero (one model, any prompt) mirrors the progression from task-specific NLP models to GPT. The critical ingredients are the same: large-scale pretraining data, unsupervised or self-supervised objectives, and prompting at inference time. The behavioral equivalent of “next token prediction” turns out to be forward-backward representation learning, and it works.

The merger of VLA models and Behavior Foundation Models is the most important convergence to watch. Vision-Language-Action models give robots eyes and ears. Behavior Foundation Models give them bodies. GR00T N1.6 represents the first serious attempt to integrate both into a single system. If that integration succeeds, the result is a robot that understands natural language, perceives its environment visually, and controls its body with the physical fluency that BFMs provide. Neither capability is sufficient alone. Together, they define the general-purpose physical agent.

Physics-based character control is no longer an academic curiosity. It is production infrastructure. SuperPADL runs 5,000 skills in real time on consumer hardware. CLoSD generates motion plans at 175 times real time. MaskedMimic accepts text, keyframes, objects, and scene information through a single interface. These are not research prototypes. They are components that game engines, film pipelines, and robotics stacks can integrate today. The transition from “interesting paper” to “shipping product” has already begun.

The people who solve sim-to-real transfer will own the future of physical AI. Every system in this article trains in simulation. The value is realized only when the policy works on real hardware or in a real production pipeline. Domain randomization, teacher-student distillation, and residual correction have closed the gap for locomotion and basic whole-body control. Manipulation and contact-rich interactions remain open. The lab that cracks sim-to-real for dexterous manipulation will have achieved something that unlocks the rest of the physical AI stack.

And finally: this field moves at the speed of one researcher’s career. Xue Bin Peng is a co-author on DeepMimic (2018), AMP (2021), ASE (2022), MaskedMimic (2024), SuperPADL (2024), GMT (2025), and PARC (SIGGRAPH 2025). He is not the only researcher driving progress, but the fact that a single person’s publication list traces the entire arc from single-clip imitation to behavior foundation models tells you something about the field’s maturity. It is young. The foundational ideas are still being written by the people who will be cited for the next thirty years. That makes this a field worth paying close attention to.


References

[1]. Witkin & Kass, “Spacetime Constraints,” ACM SIGGRAPH (1988)

[2]. Peng et al., “DeepMimic: Example-Guided Deep RL of Physics-Based Character Skills,” ACM TOG (2018)

[3]. Peng et al., “AMP: Adversarial Motion Priors for Stylized Physics-Based Character Control,” ACM TOG (2021)

[4]. Peng et al., “ASE: Large-Scale Reusable Adversarial Skill Embeddings,” ACM TOG (2022)

[5]. Luo et al., “PULSE: Universal Humanoid Motion Representations for Physics-Based Control,” ICLR 2024 Spotlight (2024)

[6]. Tessler, Peng et al., “MaskedMimic: Unified Physics-Based Character Control Through Masked Motion Inpainting,” ACM TOG (2024)

[7]. Yao et al., “MoConVQ: Unified Physics-Based Motion Control via Scalable Discrete Representations,” ACM TOG (SIGGRAPH 2024)

[8]. Juravsky et al., “SuperPADL: Scaling Language-Directed Physics-Based Control,” SIGGRAPH 2024 (2024)

[9]. Tirinzoni et al., “Meta Motivo: Zero-Shot Whole-Body Humanoid Control via Behavioral Foundation Models” (2025.04)

[10]. Tevet et al., “CLoSD: Closing the Loop between Simulation and Diffusion,” ICLR 2025 Spotlight (2025)

[11]. Zeng et al., “Behavior Foundation Model for Humanoid Robots,” arXiv (2025.09)

[12]. Li et al., “BFM-Zero: A Promptable Behavioral Foundation Model for Humanoid Control,” arXiv (2025.11)

[13]. Xie et al., “KungfuBot: Physics-Based Humanoid Whole-Body Control for Learning Highly-Dynamic Skills,” NeurIPS 2025 (2025)

[14]. Ji et al., “ExBody2: Advanced Expressive Humanoid Whole-Body Control,” RSS 2025 (2025)

[15]. Chen, Ji et al., “GMT: General Motion Tracking for Humanoid Whole-Body Control” (2025)

[16]. NVIDIA, “Building Generalist Humanoid Capabilities with GR00T N1.6” (2026.01)

[17]. Meta FAIR, Meta Motivo GitHub Repository (2024.12)

[18]. Yuan et al., “A Survey of Behavior Foundation Model,” arXiv (2025.06)

[19]. Luo et al., “Experiment-free exoskeleton assistance via learning in simulation,” Nature (2024)

[20]. Cascadeur, AI-Assisted Physics-Based Animation Software (2025)

[21]. Levine et al., “End-to-End Training of Deep Visuomotor Policies,” JMLR (2016)