In a Google DeepMind laboratory, a pair of robot arms picks up a flat sheet of paper and begins folding it. The creases are deliberate. The motions are smooth. A few minutes later, the paper has become something recognizable: an origami fox. The robot learned this from fewer than 100 demonstrations, using a model called Gemini Robotics that was introduced in March 2025.

Meanwhile, 6,000 miles away in Sunnyvale, California, a Figure 02 humanoid picks up household objects it has never encountered before, simply because someone asked it to in plain English. Its brain, called Helix, runs on two small GPUs embedded inside the robot’s body, producing 200 control signals per second across 35 degrees of freedom.

These are not incremental improvements. They reflect a fundamental shift in how researchers think about giving machines the ability to physically interact with the world. At ICLR 2026, 164 papers were submitted on a single class of model: Vision-Language-Action models. That is 18 times the number from the previous year. The science is outpacing the hardware.


The Brain Transplant Problem

The central question in Physical AI research is deceptively simple: how does a robot decide what to do with its hands? For decades, the answer involved hand-coded rules, inverse kinematics solvers, and task-specific programs. A robot that could weld car doors could not pour coffee. Each task required a separate engineering effort.

The new answer is a class of neural networks called Vision-Language-Action models, or VLAs. The concept emerged from a straightforward insight: if large language models could learn grammar and reasoning from text, and if vision-language models could learn to describe images, then perhaps a model trained on robot trajectories alongside internet-scale vision-language data could learn to control physical actions.

The starting gun was RT-1, published by Google in December 2022. The idea was almost absurdly simple: treat camera images, joint positions, and language commands all as tokens, then feed them into a Transformer. No inverse kinematics. No task-specific code. Just a 35-million-parameter model trained on 130,000 real-world episodes collected across 13 robots over 17 months. RT-1 performed over 700 tasks at 97% success, and generalized to novel objects 25% better than the next baseline. It proved that the language model paradigm could work for robot control.
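To make the tokenization idea concrete, here is a minimal Python sketch of the core trick: a continuous arm command becomes a handful of discrete tokens that a Transformer can predict like words. The 256-bin discretization follows the RT-1 paper's description; the action ranges, dimension layout, and helper functions are illustrative assumptions, not RT-1's actual code.

```python
import numpy as np

NUM_BINS = 256          # RT-1 discretizes each action dimension into 256 bins

def discretize(action, low, high, num_bins=NUM_BINS):
    """Map a continuous action vector to integer tokens, one per dimension."""
    action = np.clip(action, low, high)
    scaled = (action - low) / (high - low)                 # -> [0, 1]
    return np.minimum((scaled * num_bins).astype(int), num_bins - 1)

def undiscretize(tokens, low, high, num_bins=NUM_BINS):
    """Map action tokens back to (approximate) continuous values."""
    return low + (tokens + 0.5) / num_bins * (high - low)

# Example: a 7-DoF command (6D end-effector delta + gripper), hypothetical ranges.
low, high = np.full(7, -1.0), np.full(7, 1.0)
action = np.array([0.1, -0.4, 0.0, 0.2, 0.0, 0.0, 1.0])
tokens = discretize(action, low, high)
print(tokens)                            # [140  76 128 153 128 128 255]
print(undiscretize(tokens, low, high))   # close to the original action

# In the full model, these action tokens are appended to image-patch tokens and
# language tokens, and a single Transformer is trained to predict them.
```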

RT-2, published seven months later, took the next logical step: what if the robot’s brain was not just a Transformer, but a full vision-language model already trained on the entire internet? A 55-billion-parameter model trained to output robot actions as text tokens demonstrated something unexpected: it could perform chain-of-thought reasoning before executing physical tasks. When asked to pick up the object that is the same color as the banana in the picture, RT-2 used its world knowledge to reach for the yellow item. The paradigm shifted from skill-based control to semantic-based control.

Then came the cross-embodiment revolution. The Open X-Embodiment project, a collaboration of 34 research labs, pooled over one million robot trajectories from 22 different robot types into a single dataset. Octo, trained on 800,000 of those trajectories at UC Berkeley, became the first open-source generalist robot policy that could be fine-tuned to entirely new robot platforms in hours on a consumer GPU. It outperformed RT-1-X across three different embodiments and matched the 55-billion-parameter RT-2-X with a model under 100 million parameters. The message was clear: one model could control many robots if the data was diverse enough.

What followed was a research arms race. Stanford and UC Berkeley released OpenVLA, which built on the open cross-embodiment recipe Octo had established, this time fully integrating action tokenization into a 7-billion-parameter vision-language model. It outperformed the 55-billion-parameter RT-2-X by 16.5% in absolute task success rate across 29 tasks while being seven times smaller. Physical Intelligence launched pi-zero, a 3.3-billion-parameter model that abandoned discrete tokens entirely in favor of continuous action generation, and then open-sourced it in February 2025. By September 2025, they had raised $600 million at a $5.6 billion valuation.

The proliferation of models created an uncomfortable question that researchers are still debating: what actually qualifies as a VLA? Moritz Reuss, analyzing the ICLR 2026 submissions, argues that the defining feature is not the architecture but the training data. A true VLA, he proposes, requires internet-scale pretraining on vision-language data, the ingredient that theoretically enables generalization beyond the robot’s own experience. By that definition, many of the 164 submissions may not qualify.


Fast Hands, Slow Thoughts

The most consequential architectural debate in Physical AI right now is not about model size. It is about speed.

A robot folding origami needs to reason about what fold comes next, which is a slow, deliberate process. But the actual folding motion requires millisecond-scale adjustments to finger pressure and paper angle, which is extremely fast. These two cognitive demands pull in opposite directions. Reasoning benefits from large models with deep attention layers. Motor control benefits from small, fast networks running at hundreds of hertz.

Figure AI’s Helix solved this with a dual-system architecture explicitly inspired by Daniel Kahneman’s theory of fast and slow thinking. System 2, a 7-billion-parameter vision-language model, processes the scene and the language instruction at 7 to 9 hertz, deciding what to do. System 1, an 80-million-parameter visuomotor policy, translates those decisions into continuous robot actions at 200 hertz, deciding how to do it. The two systems are trained end-to-end, communicating through a latent channel, but they run asynchronously on separate processors.
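What such a dual-rate loop looks like in code is easier to see in a sketch. The 8 Hz and 200 Hz rates, the latent handoff, and the 35 degrees of freedom come from the description above; everything else, the dimensions, the stand-in networks, and the threading details, is an illustrative assumption rather than Figure AI's implementation.

```python
import threading
import time
import numpy as np

LATENT_DIM, DOF = 64, 35
latest_latent = np.zeros(LATENT_DIM)   # the channel System 2 writes and System 1 reads
latent_lock = threading.Lock()
stop = threading.Event()

def system2_planner(rate_hz=8):
    """Slow loop: turn observation + instruction into a latent plan (stand-in model)."""
    global latest_latent
    rng = np.random.default_rng(0)
    weights = rng.normal(size=(LATENT_DIM, 512))
    while not stop.is_set():
        observation = rng.normal(size=512)          # stand-in for image + text features
        latent = np.tanh(weights @ observation)
        with latent_lock:
            latest_latent = latent
        time.sleep(1.0 / rate_hz)

def system1_controller(rate_hz=200):
    """Fast loop: turn the freshest latent + robot state into joint commands."""
    rng = np.random.default_rng(1)
    while not stop.is_set():
        with latent_lock:
            latent = latest_latent.copy()
        robot_state = rng.normal(size=DOF)
        command = 0.1 * np.tanh(latent[:DOF] - robot_state)   # toy visuomotor policy
        # send `command` to the actuators here: 35 joint targets, 200 times a second
        time.sleep(1.0 / rate_hz)

threading.Thread(target=system2_planner, daemon=True).start()
threading.Thread(target=system1_controller, daemon=True).start()
time.sleep(1.0)   # let both loops run concurrently for one second, then stop
stop.set()
```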

NVIDIA adopted the same philosophy for GR00T N1.6, which integrates Cosmos Reason for high-level understanding and a separate action model for physical execution. The dual-system approach is becoming the default for humanoid robots, where the action space, spanning dozens of joints and fingers, is too complex for a single model to handle at adequate speed.

CogACT, from Microsoft Research, takes the separation further. Its upper module, built on DINOv2 and SigLIP vision encoders, processes the scene and decides what sub-tasks to execute, a high-level cognitive plan. Its lower module, a dedicated diffusion transformer, translates that plan into stable continuous motor commands. The results quantify what structural separation buys you: CogACT exceeded OpenVLA’s success rate by over 35% in simulation and 55% in real-robot experiments, while using the same 7-billion-parameter scale. The architecture proved that you do not need a bigger model. You need a better division of labor.

The alternative school of thought, represented by pi-zero and Gemini Robotics, argues that a single unified model can do both if the architecture is designed correctly. Pi-zero uses a two-expert Mixture-of-Experts design: one expert handles vision and language tokens through the pretrained PaliGemma backbone, while a separate 300-million-parameter Action Expert processes robot state and generates actions through flow matching. The experts share a transformer’s self-attention, but each has its own weights. It is a single model with two specialized pathways, rather than two separate models.
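A compact sketch of the "two experts, one attention" idea follows, with tiny random matrices standing in for the pretrained backbone and the action expert. The single head, single layer, and all dimensions are illustrative assumptions; the point is only that each token group keeps its own weights while attention mixes information across both.

```python
import numpy as np

rng = np.random.default_rng(0)
D = 32                                   # shared model width (illustrative)
n_vl, n_act = 10, 8                      # vision-language tokens, action tokens

def make_expert(d):
    """One expert = its own query/key/value/output projection matrices."""
    return {k: rng.normal(scale=d**-0.5, size=(d, d)) for k in ("q", "k", "v", "o")}

vl_expert, action_expert = make_expert(D), make_expert(D)

vl_tokens = rng.normal(size=(n_vl, D))       # stand-in for the VLM backbone's tokens
action_tokens = rng.normal(size=(n_act, D))  # stand-in for noisy actions + robot state

def project(tokens, expert, name):
    return tokens @ expert[name]

# Each group is projected with its *own* weights...
q = np.vstack([project(vl_tokens, vl_expert, "q"), project(action_tokens, action_expert, "q")])
k = np.vstack([project(vl_tokens, vl_expert, "k"), project(action_tokens, action_expert, "k")])
v = np.vstack([project(vl_tokens, vl_expert, "v"), project(action_tokens, action_expert, "v")])

# ...but self-attention is computed jointly over the concatenated sequence.
scores = q @ k.T / np.sqrt(D)
weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)
mixed = weights @ v

# The mixed outputs are routed back through each expert's own output projection.
vl_out = project(mixed[:n_vl], vl_expert, "o")
action_out = project(mixed[n_vl:], action_expert, "o")
print(vl_out.shape, action_out.shape)    # (10, 32) (8, 32)
```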

Google DeepMind’s Gemini Robotics takes the single-model approach furthest. Built on Gemini 2.0, it adds physical actions as a new output modality alongside text and images. Its head of robotics, Carolina Parada, describes dexterity as requiring “both spatial reasoning and complex physical manipulation.” The model’s origami-folding ability is real, but the caveat matters: the dexterity is trained per-task from curated demonstrations, not generalized across all manipulation. Single-task diffusion models trained from scratch performed comparably on simpler tasks.

A team at Shanghai AI Lab tried to end the guesswork. RoboVLMs ran over 600 controlled experiments across 8 vision-language backbones and 4 policy architectures, systematically varying every design choice: which vision encoder, which language model, discrete versus continuous actions, with or without history. Their conclusions are sobering: no single backbone dominates, cross-embodiment pretraining does not consistently help, and modest architectural choices like the action head formulation often matter more than model scale. It is the closest thing the field has to an empirical design manual.

The honest assessment is that neither approach has won. Dual systems achieve better real-time performance on complex embodiments. Single models offer simpler training and deployment. The field is converging toward hybrid solutions, like pi-zero’s MoE design, that try to capture the benefits of both.


Imagining Before Acting

Underneath the VLA debate lies an even more fundamental question: how should a robot generate actions?

The dominant paradigm until 2023 was behavior cloning, where a neural network learns to map observations directly to actions through supervised learning on expert demonstrations. It works, but it struggles with multimodal action distributions, situations where multiple valid actions exist for the same observation. A robot approaching a cluttered table could reach from the left or the right, and a behavior cloning model that averages these possibilities might freeze or produce a nonsensical intermediate motion.
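The averaging failure fits in a few lines of arithmetic. The numbers below are invented for illustration and are not taken from any paper's experiment.

```python
import numpy as np

# Half the demonstrations reach around an obstacle to the left (-1),
# half to the right (+1), for the same observation.
demos = np.array([-1.0] * 50 + [1.0] * 50)

mse_prediction = demos.mean()      # what an MSE-trained regressor converges to
print(mse_prediction)              # ~0.0: a motion nobody demonstrated, straight into the obstacle

# A generative policy instead learns the full distribution and samples a valid mode.
sample = np.random.default_rng(0).choice(demos)
print(sample)                      # -1.0 or +1.0
```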

Diffusion Policy, introduced by Cheng Chi and colleagues at Columbia University, solved this by borrowing from image generation. Instead of predicting actions directly, the model learns to iteratively denoise a random action sequence into a coherent trajectory, exactly the way Stable Diffusion generates images from noise. The results were dramatic: a 46.9% average improvement over existing methods across 15 manipulation tasks.

The insight is elegant. A diffusion process naturally represents multiple modes, so when several valid actions exist, the model can generate any of them rather than collapsing to an average. And by producing entire action sequences rather than individual timesteps, the robot plans several moves ahead while maintaining temporal consistency.
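Here is what that iterative denoising loop looks like at its simplest, with a closed-form stand-in for the learned noise predictor so the example runs on its own. The horizon, step count, DDIM-style update, and the toy denoiser are all illustrative assumptions, not the Diffusion Policy codebase.

```python
import numpy as np

HORIZON, ACTION_DIM, STEPS = 16, 7, 50
rng = np.random.default_rng(0)

# A simple linear noise schedule of the kind used by DDPM-style models.
betas = np.linspace(1e-4, 0.02, STEPS)
alphas = 1.0 - betas
alpha_bars = np.cumprod(alphas)

target = np.tile(np.linspace(0, 1, HORIZON)[:, None], (1, ACTION_DIM))   # the "true" plan

def toy_denoiser(noisy_actions, t):
    """Stand-in for eps_theta(noisy_actions, t, observation): predict the added noise.
    A real policy is a network conditioned on camera images and robot state;
    this toy version cheats by knowing the clean trajectory."""
    return (noisy_actions - np.sqrt(alpha_bars[t]) * target) / np.sqrt(1 - alpha_bars[t])

actions = rng.normal(size=(HORIZON, ACTION_DIM))    # start from pure Gaussian noise

for t in reversed(range(STEPS)):
    eps = toy_denoiser(actions, t)
    # Estimate the clean sequence, then step to the previous noise level (DDIM, eta = 0).
    x0 = (actions - np.sqrt(1 - alpha_bars[t]) * eps) / np.sqrt(alpha_bars[t])
    prev_alpha_bar = alpha_bars[t - 1] if t > 0 else 1.0
    actions = np.sqrt(prev_alpha_bar) * x0 + np.sqrt(1 - prev_alpha_bar) * eps

print(np.abs(actions - target).max())   # ~0: the noise has been removed from the whole chunk
```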

A comprehensive survey published in Frontiers in Robotics and AI in July 2025 cataloged how this approach has propagated. Diffusion models now appear in trajectory generation, grasp synthesis, and even data augmentation for training other policies. They handle high-dimensional action spaces better than GANs, which suffer from training instability, and better than variational autoencoders, which tend to produce blurry, averaged outputs.

The drawback is speed. A typical diffusion policy requires 10 to 100 iterative denoising steps to generate one action sequence, which is computationally expensive for real-time control. This is where the field is concentrating its engineering effort. A NeurIPS 2025 paper introduced “genetic denoising,” a population-based sampling strategy that reduces the required steps to just two neural function evaluations while maintaining or improving performance. That is a 5-to-50x speedup, bringing diffusion policies within reach of real-time deployment.

These sequence-level predictions connect to a foundational concept called action chunking. Introduced by Tony Zhao and colleagues at Stanford in 2023 alongside the ALOHA teleoperation system, the idea is to predict not one action per timestep but an entire chunk of the next k actions at once. This reduces the effective task horizon by a factor of k, which sharply cuts the compounding errors that plague single-step behavior cloning. When combined with temporal ensembling, where overlapping chunks are blended for smoother execution, action chunking enabled bimanual robots to perform tasks like threading cable ties from just 50 demonstrations. The concept became a standard ingredient: pi-zero, Helix, and virtually every modern VLA now predicts action chunks rather than individual timesteps.
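A sketch of chunked prediction with temporal ensembling, with a noisy sine wave standing in for the policy: the chunk size, the decay constant, and the toy policy are illustrative assumptions; the exponential blending over overlapping predictions is the general mechanism described above.

```python
import numpy as np

K, ACTION_DIM = 8, 7          # chunk size and action dimension (illustrative)
rng = np.random.default_rng(0)

def policy(t):
    """Stand-in policy: predict a chunk of actions for timesteps t .. t+K-1."""
    base = np.sin(0.1 * (t + np.arange(K)))[:, None] * np.ones((1, ACTION_DIM))
    return base + 0.05 * rng.normal(size=(K, ACTION_DIM))   # a little prediction noise

predictions = {}              # timestep -> every chunk prediction that covers it

for t in range(50):
    chunk = policy(t)                         # one inference call per control step
    for i in range(K):
        predictions.setdefault(t + i, []).append(chunk[i])

    # Temporal ensembling: blend every prediction ever made for the current step.
    preds = np.stack(predictions[t])          # index 0 is the oldest prediction
    weights = np.exp(-0.3 * np.arange(len(preds)))   # decay constant is a tunable choice
    executed = (weights[:, None] * preds).sum(axis=0) / weights.sum()
    # send `executed` to the robot; the blending smooths chunk-to-chunk transitions
```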

Flow matching, the technique used by pi-zero, offers a different solution. Instead of learning to reverse a noisy diffusion process, it learns to directly map noise to actions through a velocity field. The training objective is simpler, the noise schedule is unnecessary, and the generation can be faster. But flow matching has its own failure mode: with too few sampling steps, it risks mode collapse, generating only the most common action rather than exploring alternatives.
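Flow matching is just as compact to sketch: train a velocity field to carry noise to actions along straight-line interpolations, then generate with a handful of Euler steps. The closed-form "learned" velocity below stands in for a trained network such as pi-zero's action expert; dimensions, step count, and the single target action are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
ACTION_DIM = 7
target_action = np.array([0.3, -0.2, 0.1, 0.0, 0.5, -0.4, 1.0])   # one "expert" action

# Training objective, conceptually: sample t ~ U[0,1], interpolate noise and action,
# and regress the network toward the constant velocity (action - noise).
noise = rng.normal(size=ACTION_DIM)
t_train = rng.uniform()
x_t = (1 - t_train) * noise + t_train * target_action
velocity_target = target_action - noise        # what v_theta(x_t, t) should predict

# Generation: integrate the learned field from noise to an action with a few steps.
def learned_velocity(x, t):
    """Stand-in for v_theta: point from the current sample toward the target."""
    return (target_action - x) / max(1.0 - t, 1e-3)

x, steps = rng.normal(size=ACTION_DIM), 10
for i in range(steps):
    t = i / steps
    x = x + (1.0 / steps) * learned_velocity(x, t)

print(np.abs(x - target_action).max())   # small: the flow has reached the action
```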

A systematic review analyzing 102 VLA models found that diffusion-based decoders demonstrated superior cross-domain transfer and robustness compared to autoregressive token heads, the approach used by RT-2 and OpenVLA. The evidence favors diffusion and flow matching for action generation, but the gap narrows significantly on simpler tasks where autoregressive models remain competitive.


Building a Universe to Train a Hand

Even the best VLA model needs data to learn from. And real-world robot data is agonizingly expensive to collect. Helix was trained on approximately 500 hours of teleoperated demonstrations. Pi-zero used over 10,000 hours across seven robot types and 68 tasks. Google DeepMind collected “thousands of hours” over 12 months on a fleet of ALOHA 2 robots.

This is where World Foundation Models enter the picture. The concept is straightforward in principle and staggering in execution: build a generative model of the entire physical world, then use it to produce unlimited synthetic training data.

NVIDIA’s Cosmos platform is the most developed implementation. Cosmos Predict 2.5, trained on 200 million curated video clips and refined with reinforcement learning, can generate physically plausible videos from text, image, or video prompts. Cosmos Transfer 2.5 performs Sim-to-Real translation, taking a simulated scene and generating a photorealistic version of it, effectively adding weather, lighting, and terrain variations that would be impossible to replicate in a physical lab.

The pipeline is concrete. A developer creates a simulation environment in NVIDIA Isaac Sim. Cosmos Transfer 2.5 converts those simulated scenarios into photorealistic training data. A robot policy is trained in Isaac Lab on this synthetic data. The policy is then validated in the simulator before being deployed on the real robot, often with zero additional real-world training.
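In code form, the hand-offs look roughly like the sketch below. None of these functions are real Isaac Sim, Isaac Lab, or Cosmos APIs; they are hypothetical placeholders meant only to show how each stage's output becomes the next stage's input.

```python
def author_sim_scenes(n):
    """Stage 1: scenes authored in a simulator (e.g. Isaac Sim). Placeholder data."""
    return [{"scene_id": i, "render": f"sim_render_{i}.png"} for i in range(n)]

def augment_with_world_model(scenes):
    """Stage 2: a world foundation model (e.g. Cosmos Transfer) adds visual variety."""
    conditions = ("rain", "dusk", "harsh_sun")
    return [dict(scene, condition=c) for scene in scenes for c in conditions]

def train_policy(dataset):
    """Stage 3: train a robot policy on the synthetic data (e.g. in Isaac Lab)."""
    return {"trained_on": len(dataset)}

def validate_in_sim(policy, threshold=0.9):
    """Stage 4: closed-loop evaluation before (often zero-shot) deployment."""
    simulated_success_rate = 0.93          # placeholder number
    return simulated_success_rate >= threshold

scenes = author_sim_scenes(100)
dataset = augment_with_world_model(scenes)     # 300 photorealistic variations
policy = train_policy(dataset)
print("deploy to robot:", validate_in_sim(policy))
```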

Skild AI uses this exact pipeline: Isaac Lab for scalable simulation combined with Cosmos for synthetic data variation. Serve Robotics has generated training data from thousands of simulated scenarios, training a fleet that has completed over 100,000 real-world deliveries.

The skeptical question is whether synthetic data can truly replace real experience. The answer, based on current evidence, is: partially. World Foundation Models are excellent at generating visual diversity, teaching a robot to recognize objects under different lighting or from different angles. They are less reliable at simulating fine-grained physics: the way a sheet of paper bends, the friction between a rubber finger and a glass surface, the dynamics of liquid in a cup. These contact-rich interactions remain the hardest to simulate accurately, which is precisely why the origami fox is such a powerful benchmark. It tests everything that simulation gets wrong.


The Gap That Keeps Moving

The simulation-to-reality gap, or sim-to-real gap, has been the central unsolved problem in robotics for over a decade. Every time researchers close one aspect of it, another opens.

Domain randomization, the technique of randomly varying simulation parameters to make trained policies robust, produced early successes. The Humanoid-Gym framework demonstrated zero-shot sim-to-real transfer for humanoid locomotion. Hwangbo and colleagues at ETH Zurich showed in 2019 that reinforcement learning could produce agile quadruped gaits that transferred directly to real hardware. The approach works, but it works best for locomotion, where the physics is relatively well understood. Manipulation, with its complex contact dynamics, resists easy randomization.
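Domain randomization itself is almost disappointingly simple to write down: every training episode draws a fresh set of physical parameters. The parameter names and ranges below are illustrative assumptions, not those of Humanoid-Gym or any specific framework.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_sim_parameters():
    """Draw one randomized set of dynamics for the next training episode."""
    return {
        "friction":       rng.uniform(0.4, 1.2),
        "payload_kg":     rng.uniform(0.0, 2.0),
        "motor_strength": rng.uniform(0.8, 1.2),    # multiplier on commanded torque
        "latency_ms":     rng.uniform(0.0, 40.0),
        "ground_slope":   rng.uniform(-0.1, 0.1),   # radians
    }

for episode in range(3):
    params = sample_sim_parameters()
    # simulator.reset(**params); run the RL rollout under these dynamics
    print(episode, params)
```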

The latest approach, residual-action reinforcement learning, addresses a subtler problem. Even when a policy transfers successfully, it accumulates errors over time because the simulated robot body does not perfectly match the real one. RobotDancing, presented in September 2025, adds a residual correction layer that compensates for model-plant mismatch in real time, enabling robust long-horizon humanoid motion tracking that would otherwise drift into failure.
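The residual-action idea reduces to one line of arithmetic: the executed command is the nominal action plus a small, bounded correction. The sketch below uses stand-in functions for both policies; the gains, bounds, and joint count are illustrative assumptions, not RobotDancing's implementation.

```python
import numpy as np

DOF = 29                         # illustrative humanoid joint count
rng = np.random.default_rng(0)

def base_policy(observation):
    """Nominal action from the sim-trained motion-tracking controller (stand-in)."""
    return np.tanh(observation[:DOF])

def residual_policy(tracking_error):
    """Small learned correction for model-plant mismatch, bounded so it cannot
    override the base policy (stand-in for an RL-trained network)."""
    correction = 0.1 * np.tanh(tracking_error)
    return np.clip(correction, -0.2, 0.2)

observation = rng.normal(size=64)
tracking_error = rng.normal(scale=0.3, size=DOF)    # reference pose minus measured pose

action = base_policy(observation) + residual_policy(tracking_error)
```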

NVIDIA’s sim-to-real pipeline for GR00T N1.6 combines whole-body reinforcement learning in Isaac Lab with synthetic-data navigation using a system called COMPASS. The result is zero-shot transfer with strong cross-embodiment performance, meaning a policy trained on one robot body can work on another with minimal fine-tuning. This is important because the commercial viability of Physical AI depends on not having to retrain from scratch for every robot model.

But the ICLR 2026 analysis reveals an uncomfortable truth about the state of the field. Open-weight VLA models perform impressively on simulation benchmarks. LIBERO is “basically solved,” with most approaches reaching 95 to 98% success. CALVIN scores above 4.0 are routine. Yet these same models lag significantly behind closed-weight models like Gemini Robotics and pi-0.5 on zero-shot real-world tasks. The benchmarks are not testing what matters most.

A review of nine physics engines used in reinforcement learning research found that engine selection directly impacts sim-to-real success. MuJoCo, Isaac Gym, and PyBullet each have different strengths and failure modes. The choice of physics engine is not a minor engineering detail. It is a research decision that can determine whether a policy transfers to reality or does not.


Fitting Intelligence Inside the Body

Everything described so far assumes generous compute: cloud GPUs, TPU clusters, or at minimum a workstation-class machine connected to the robot by wire. But real robots do not live in data centers. They run on Jetson modules, embedded x86 boards, or the pair of small GPUs packed inside a humanoid’s torso. If the model cannot run on the robot, the model does not matter.

This is where a different kind of research begins. Not what intelligence a robot should have, but how to physically fit that intelligence inside its body.

BitVLA, published in June 2025, attacks the memory problem head-on. It quantizes every parameter to 1.58 bits, meaning the entire model uses only ternary values: -1, 0, and +1. The vision encoder is compressed through a distillation-aware training strategy where a full-precision teacher guides a 1.58-bit student to preserve representational quality. The result: BitVLA matches the performance of OpenVLA-OFT at 4-bit quantization on the LIBERO benchmark while consuming only 1.4 GB of memory, roughly 30% of the 4-bit model’s footprint. It is the first demonstration that a VLA can run within the memory constraints of edge hardware without collapsing in capability.
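The core quantization step is small enough to show directly. The absmean rounding rule below follows the published BitNet b1.58 recipe that BitVLA builds on; applying it as a standalone post-hoc function is a simplification, since BitVLA trains with ternary weights (and a distilled vision encoder) rather than quantizing after the fact.

```python
import numpy as np

def ternary_quantize(w):
    """Absmean ternarization: one full-precision scale per matrix, weights in {-1, 0, +1}."""
    scale = np.abs(w).mean() + 1e-8
    q = np.clip(np.round(w / scale), -1, 1)
    return q.astype(np.int8), scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
w = rng.normal(scale=0.02, size=(1024, 1024)).astype(np.float32)   # a toy weight matrix

q, scale = ternary_quantize(w)
print(np.unique(q))                              # [-1  0  1]
print(np.abs(dequantize(q, scale) - w).mean())   # mean absolute error of the ternary approximation

# Storage cost is roughly log2(3) ~ 1.58 bits per weight versus 16 for the original,
# which is how a multi-billion-parameter VLA can fit in about a gigabyte of memory.
```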

But a small model is not enough if it thinks slowly. PD-VLA, accepted at IROS 2025, attacks the latency problem. Standard VLAs generate action sequences autoregressively, one token at a time, which means that longer action chunks take proportionally longer to decode. PD-VLA reframes autoregressive decoding as a nonlinear fixed-point equation and solves it through parallel iteration, updating all timesteps simultaneously. The result is a 2.52x speedup in execution frequency on 7-degree-of-freedom manipulators with no retraining, no architecture changes, and no performance loss. It is the decoding equivalent of what FlashAttention did for transformer inference.
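The fixed-point view is easiest to see on a toy decoder: autoregressive generation is the unique solution of x[i] = f(x[:i]), so you can initialize the whole chunk and update every position in parallel until it stops changing. The toy transition rule below is an assumption, not PD-VLA's model; in this worst case the iteration count equals the chunk length, and the speedup PD-VLA reports comes from each iteration being a single batched forward pass that typically converges well before that.

```python
import numpy as np

CHUNK = 8

def decoder_step(prefix):
    """Stand-in for the model: the next action token depends on the prefix."""
    return (prefix.sum() * 7 + 3) % 101          # deterministic toy rule

# Sequential (autoregressive) decoding: CHUNK dependent calls.
seq = np.zeros(CHUNK)
for i in range(CHUNK):
    seq[i] = decoder_step(seq[:i])

# Parallel (Jacobi) decoding: update all positions at once, iterate to a fixed point.
par = np.zeros(CHUNK)
for iteration in range(CHUNK):
    new = np.array([decoder_step(par[:i]) for i in range(CHUNK)])
    if np.array_equal(new, par):                 # nothing changed: fixed point reached
        break
    par = new

print("matches sequential:", np.array_equal(seq, par), "| iterations:", iteration + 1)
```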

Even with fast decoding, a subtler problem remains. When a robot executes one action chunk and then switches to the next, there is a boundary where the two chunks meet. If the transition is not smooth, the robot jerks or pauses, a problem that becomes dangerous at high speeds. Real-Time Chunking (RTC), published by Physical Intelligence and presented at NeurIPS 2025, solves this by treating chunk boundaries as an inpainting problem. While the robot executes the current chunk, RTC generates the next one in the background, freezing the actions already committed and regenerating the remainder to ensure continuity. No retraining is needed; it works as an inference-time wrapper on any diffusion or flow-based VLA. RTC enabled robots to strike matches and plug in Ethernet cables even with 300+ milliseconds of inference delay.
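Conceptually, the wrapper is a freeze-and-regenerate loop like the sketch below. The smooth stand-in generator, chunk size, and delay are illustrative assumptions; in the real system the frozen prefix constrains a diffusion or flow model's sampling (the inpainting), rather than being pasted in after the fact.

```python
import numpy as np

K = 16          # actions per chunk (illustrative)
DELAY = 4       # steps consumed while the next chunk is being generated (set by latency)
ACTION_DIM = 7

def generate_chunk(start_t, frozen_prefix=None):
    """Stand-in for the policy: a smooth trajectory for timesteps start_t .. start_t+K-1,
    optionally constrained to keep an already-committed prefix fixed."""
    t = start_t + np.arange(K)
    chunk = np.sin(0.2 * t)[:, None] * np.ones((1, ACTION_DIM))
    if frozen_prefix is not None:
        chunk[: len(frozen_prefix)] = frozen_prefix      # keep committed actions untouched
    return chunk

current, t = generate_chunk(0), 0
for _ in range(3):                                       # a few chunk transitions
    # The robot keeps executing `current` while the next chunk is prepared in the background.
    committed = current[K - DELAY:]                      # steps guaranteed to run during inference
    next_chunk = generate_chunk(t + K - DELAY, frozen_prefix=committed)
    t += K - DELAY
    current = next_chunk                                 # seamless hand-off at the boundary
```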

These three papers address different layers of the same challenge. BitVLA shrinks the model to fit the hardware. PD-VLA accelerates how fast the model can think. RTC ensures that what the model thinks translates into smooth physical motion. Together, they define the engineering frontier where Physical AI leaves the cloud and enters the body.


What This Actually Means

Here is what the research data supports.

The VLA architecture has won the argument for robot brains, but the specific design is still in flux. The 18x growth in VLA submissions at ICLR 2026 signals that the research community has converged on the general approach: pretrain on internet-scale vision-language data, fine-tune on robot trajectories. Whether the output should be discrete tokens or continuous actions, whether the architecture should be single or dual, and whether the action decoder should use diffusion or flow matching are questions that remain genuinely open. Anyone claiming a settled answer is selling something.

Diffusion and flow matching have changed how robots generate actions, and the efficiency problem is being solved. The trajectory from 100 denoising steps to 2 steps in under two years is remarkable. If genetic denoising and flow matching continue to improve at this rate, real-time diffusion policies will be standard equipment within a year. The implications extend beyond robotics: the same techniques are applicable to any control problem with multimodal action distributions.

World Foundation Models are real engineering tools, not marketing. Cosmos Predict 2.5 and Transfer 2.5 are being used by production robotics companies to generate training data at scale. The sim-to-real pipeline of simulation, synthetic data augmentation, policy training, and zero-shot transfer is operational. The limitation is not the pipeline itself but the fidelity of contact physics simulation, and that is a problem with a long tail.

The most important gap in the field is not technical but evaluative. Current benchmarks are saturated. The best open models score near-perfectly on LIBERO and CALVIN, yet fail significantly on real-world zero-shot tasks. The field needs new evaluation frameworks that measure what actually matters: generalization to unseen objects, environments, and instructions in physical space. Until those exist, the distance between benchmark performance and real-world capability will remain invisible.

The real race is in data, not architecture. Helix used 500 hours of demonstrations. Pi-zero used 10,000. Gemini Robotics collected thousands of hours on dedicated robot fleets. The next generation of VLAs will be differentiated not by clever architectures but by who has the best data, both real and synthetic. Surprisingly few ICLR 2026 submissions addressed data quality, even though researchers acknowledge that the Open X-Embodiment dataset is “mostly low-quality data.” This is the blind spot.

On-device execution is no longer a nice-to-have. It is the deployment bottleneck. BitVLA proved that ternary weights can preserve VLA capability at 30% of normal memory. PD-VLA showed that parallel decoding can double execution frequency without retraining. RTC demonstrated that smooth real-time control is possible even with 300 ms inference delays. The convergence of these three lines of work means that the question is no longer whether a VLA can run on a robot’s onboard hardware, but when the engineering matures enough for production deployment. The companies that solve on-device inference first will have a structural advantage that no amount of cloud compute can replicate.


References

[1]. Brohan et al., “RT-1: Robotics Transformer for Real-World Control at Scale” (2022.12)

[2]. Google DeepMind, “RT-2: Vision-Language-Action Models Transfer Web Knowledge to Robotic Control” (2023.07)

[3]. Open X-Embodiment Collaboration, “Robotic Learning Datasets and RT-X Models” (2023.10)

[4]. Octo Model Team, “Octo: An Open-Source Generalist Robot Policy” (RSS 2024)

[5]. Kim et al., “OpenVLA: An Open-Source Vision-Language-Action Model” (2024.09)

[6]. Physical Intelligence, “pi-zero: A Vision-Language-Action Flow Model for General Robot Control” (2024.10)

[7]. Figure AI, “Helix: A Vision-Language-Action Model for Generalist Humanoid Control” (2025.02)

[8]. Google DeepMind, “Gemini Robotics: Bringing AI into the Physical World” (2025.03)

[9]. Google DeepMind, “Gemini Robotics” (arXiv paper) (2025.03)

[10]. Li et al., “CogACT: A Foundational VLA Model for Synergizing Cognition and Action” (2024.11)

[11]. Li et al., “RoboVLMs: Towards Generalist Robot Policies — What Matters in Building VLA Models” (2024.12)

[12]. Zhao et al., “Learning Fine-Grained Bimanual Manipulation with Low-Cost Hardware” (ACT) (RSS 2023)

[13]. Chi et al., “Diffusion Policy: Visuomotor Policy Learning via Action Diffusion” (IJRR 2025)

[14]. Chi et al., “Diffusion Policy” (SAGE/IJRR publication) (2025.09)

[15]. Song et al., “Diffusion Models for Robotic Manipulation: A Survey” (Frontiers, 2025.07)

[16]. NeurIPS 2025, “Two-Steps Diffusion Policy for Robotic Manipulation via Genetic Denoising” (2025.10)

[17]. ScienceDirect, “Multimodal Fusion with VLA Models: A Systematic Review” (2025.12)

[18]. NVIDIA, “World Simulation with Video Foundation Models for Physical AI” (Cosmos Predict 2.5) (2025.11)

[19]. NVIDIA, Cosmos World Foundation Models (2026)

[20]. NVIDIA Newsroom, “NVIDIA Releases New Physical AI Models” (2026.01)

[21]. NVIDIA Developer Blog, “Building Generalist Humanoid Capabilities with GR00T N1.6” (2026.01)

[22]. NVIDIA Newsroom, “NVIDIA Accelerates Robotics R&D with New Open Models” (2025.09)

[23]. Humanoid-Gym, “Zero-Shot Sim2Real Transfer for Humanoid Robot” (2024.04)

[24]. Sun et al., “RobotDancing: Residual-Action RL for Humanoid Motion Tracking” (2025.09)

[25]. Kaup et al., “A Review of Nine Physics Engines for RL Research” (2024.07)

[26]. Wang et al., “BitVLA: 1-bit Vision-Language-Action Models for Robotics Manipulation” (2025.06)

[27]. Song et al., “PD-VLA: Accelerating VLA via Parallel Decoding” (IROS 2025)

[28]. Black et al., “Real-Time Execution of Action Chunking Flow Policies” (RTC) (NeurIPS 2025)

[29]. Moritz Reuss, “State of VLA Research at ICLR 2026” (2026.01)

[30]. VLA Survey, “Vision-Language-Action Models for Robotics: A Review” (2025)

[31]. 엔지유니버스, “2025 Physical AI Year-End Review” (YouTube, 2025.12)