"Cross-Embodiment Full-Body Motion Data Factory" officially launched, BridgeDP fills the data gap in motion control


A Note Up Front

We have built a Cross-Embodiment Whole-Body Motion Data Factory, creating a complete chain that runs from motion design and synchronized capture through cross-embodiment retargeting and data augmentation to training and feedback. As continuously operating infrastructure, the data factory supplies data for the humanoid robot operating system, so that the whole-body motion models in the system keep receiving cross-embodiment, trainable, and reusable data assets.

The constraints on model capability are becoming increasingly clear. If data scale is insufficient, the model cannot cover enough motions; if data quality is unstable, the model will learn incorrect contact relationships and body coordination patterns; if data cannot be reused across embodiments, then once the robot hardware changes, much of the training asset must be produced again from scratch.

For general-purpose whole-body motion models, data is no longer just training material, but a critical asset that defines the boundary of capability.

Building on an internal pilot and two years of engineering practice across multiple legged robot platforms, we are moving the data factory from internal validation toward formal large-scale construction. Its job is to continuously plan motions, synchronize multi-source signals, retarget across embodiments, perform physical validation and data augmentation, and feed training results back into the next production cycle.

This article shares our phased thinking on the motion control data factory: what cross-embodiment whole-body motion data is, why we need a dedicated factory for it, and how the factory should operate internally.

What data we need, starting from motion capability

To answer “what data do we need,” we first need to answer “what motion capability do we want.”

For a general-purpose whole-body motion model, we want a motion capability that accepts multimodal motion intent from the layers above it, adapts to different embodied hardware below it, remains safe and reliable, and keeps evolving in complex environments.

Such capability places higher demands on data: the model needs data that preserves whole-body coordination, task intent, contact relationships, environmental context, physical feasibility, and cross-embodiment reuse value all at once.

Yet none of the data formats available today can satisfy all of these requirements on its own:

Motion capture data can accurately and structurally record human motion states, but it lacks environmental information and precise interaction between the human and the environment;

Teleoperation data is tightly bound to a specific robot embodiment; once the hardware changes, its reuse value drops sharply;

First-person video focuses on the end effector and object interaction, and cannot fully express the whole-body coordination among torso, lower limbs, center of mass, and contact;

Third-person video can show the overall motion, but it is difficult to extract accurate and reasonable human motions from it.

Each of these data types has value, but none alone is sufficient to support the data loop required by a general-purpose whole-body motion model.

Based on this judgment, we define the data assets genuinely suited to training a general-purpose whole-body motion model as cross-embodiment whole-body motion data (CWM), and require CWM to satisfy at least the following four properties:

Cross-embodiment retargetability

The same motion segment must be processed through a unified pipeline and produce physically self-consistent training samples across multiple target embodiments with significantly different link lengths, joint configurations, mass distributions, and actuation capabilities. This means the raw data itself needs to carry enough topological and kinematic information to support a unified structural mapping across embodiments, rather than being locked into the joint space of a single robot. Robot hardware will keep iterating; if data serves only one generation of embodiment, it depreciates together with that generation of hardware. CWM binds data value to human whole-body motion semantics and transferable regularities, allowing one dataset to be reused across multiple hardware generations.

Whole-body coverage

The data must fully express the torso, limbs, hands, fingers, and the coordination among them, rather than retaining only upper-body end trajectories or lower-body gait. Real tasks are rarely simple concatenations of local motions. A sequence such as “squat down to pick up an object, lift it, turn, and walk” simultaneously involves lower-limb support, center-of-mass transfer, torso posture, arm extension, finger grasping, and contact switching. Only by recording these coupled relationships of the body as a whole can the model learn the coordination rules between locomotion, manipulation, and posture change.

Physical feasibility

A qualified sample is not just kinematically smooth and reasonable; it must also be physically feasible on the target embodiment in dynamics, with no floating, penetration, slipping, instability, or torque limit violations. This is the hard threshold for upgrading a CWM asset from a candidate trajectory into a training sample.
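As a rough illustration of such a threshold, the sketch below screens a retargeted frame sequence against joint limits, torque limits, and foot height. The `Frame` fields, the tolerances, and the reason labels are hypothetical placeholders, not the factory's actual criteria; a real check would also cover slipping and stability.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Frame:
    joint_pos: List[float]      # joint angles in rad
    joint_torque: List[float]   # estimated joint torques in N*m
    foot_height: float          # lowest foot point above the ground in m
    foot_in_contact: bool       # whether the foot is supposed to be in contact


def feasibility_violations(frames: List[Frame],
                           pos_limits: List[Tuple[float, float]],
                           torque_limits: List[float],
                           penetration_tol: float = 0.005,
                           float_tol: float = 0.02):
    """Return (frame_index, reason) for every check that a frame violates."""
    violations = []
    for i, f in enumerate(frames):
        # Joint limits: report at most once per frame.
        for q, (lo, hi) in zip(f.joint_pos, pos_limits):
            if not lo <= q <= hi:
                violations.append((i, "joint_limit"))
                break
        # Torque limits are symmetric around zero in this sketch.
        if any(abs(t) > t_max
               for t, t_max in zip(f.joint_torque, torque_limits)):
            violations.append((i, "torque_limit"))
        # A foot below the ground plane counts as penetration.
        if f.foot_height < -penetration_tol:
            violations.append((i, "penetration"))
        # A foot flagged as in contact but hovering counts as floating.
        if f.foot_in_contact and f.foot_height > float_tol:
            violations.append((i, "floating"))
    return violations
```

A sample would be promoted from candidate trajectory to training sample only when the returned list is empty.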

Multi-source augmentability

During recording, CWM data synchronously captures human motion, semantic labels, first-person video, third-person video, environmental assets, and object assets, so that the motion carries complete body, task, and scene context. We then replay and augment the data in simulation, using custom camera positions, swapped scene and object material textures, and captured whole-body contact forces and motion states to expand a single capture into multi-view, multi-scene, and multi-physical-state training samples.

CWM data that satisfies these four properties cannot be obtained by simple capture alone, which is also the starting point for building a cross-embodiment whole-body motion data factory.

Why we need to build a data factory

We have defined what CWM data is, but for model training, merely “correct” data is not enough; data scale is equally critical, and this is already a consensus in the large-model field.

Research from Generalist AI points out that VLA models also exhibit a clear data scaling law; SONIC has likewise systematically verified in humanoid whole-body motion tracking that scaling motion data volume significantly improves motion control capability. For whole-body motion control, this means data must cover not just a few standard actions, but large numbers of continuous motion combinations such as walking, turning, squatting, carrying, grasping, supporting, obstacle avoidance, balance recovery, and contact switching.

Our internal judgment is that training a truly general whole-body motion model ultimately requires hundreds of thousands of hours of high-quality CWM data; at this scale, a small amount of data has almost no long-term training value. What truly matters is a data scale that can keep expanding.

At the same time, data diversity is equally important, because no matter how much walking data you collect, it still cannot train a model that can do a backflip. The complexity of whole-body motion data lies in the fact that it is not simply “the more motions, the better”; it must have the right data recipe and strict quality control.

The model needs to see enough motion categories, contact states, task semantics, environmental variations, and target embodiment differences; at the same time, each sample must undergo cleaning, annotation, retargeting, and physical validation. Otherwise, large-scale data easily turns into large-scale noise. Foot slippage, body penetration, floating, instability, and torque saturation are all direct data contaminants that lower model quality; they cause the model to learn incorrect contact relationships, incorrect body coordination patterns, and non-executable control modes.

This standard also means external data cannot be the main force: public motion capture and online video can serve as supplements, but they are insufficient in both quantity and quality to support training a general-purpose whole-body motion model.

Therefore, CWM data production must be designed as an industrialized production system, and capture is only one part of it. A motion, from being designed to entering the training set, must still go through quality inspection, cross-embodiment retargeting, dynamics and simulation augmentation, semantic annotation, and a feedback loop from the model training side.

This production line needs to define the data recipe, production process, and quality standards at the same time: which motions must be prioritized, which scenes and contact states are most scarce, which target embodiments need verification, which samples should be removed, and which data yields the highest training value all need to be continuously tracked and fed back. The larger the data scale, the less it can rely on manual experience; the more general the model goal, the more it needs a reproducible, auditable, and iterative production process.

This is also the core value of the CWM data factory: using stable sites, equipment, production lines, professional teams, and quality assurance systems to turn general-purpose whole-body motion data into a sustainable production capability.

Professional motion designers are responsible for defining the motion taxonomy, the capture team is responsible for high-quality synchronized recording, the engineering team is responsible for cleaning, formatting, retargeting, and simulation replay, the algorithm team is responsible for physical validation, training feedback, and data screening, and the quality inspection team keeps unusable samples out of the training set.

Only such a factory-level system can continuously produce CWM data assets that are large enough, accurate enough, clean enough, and able to keep updating alongside model training and robot iteration.

A data factory is not a “capture site” but an “infrastructure”

The BridgeDP cross-embodiment whole-body motion data factory is a full-process infrastructure built around CWM data asset production.

It starts with motion design, which defines motion categories, contact states, and task scenarios; in the capture stage, it synchronously acquires multi-source data such as human motion, video, contact, environment, and objects; then it transforms raw material into trainable samples through cross-embodiment retargeting, physical validation, and simulation augmentation; finally, it continuously refines the data recipe using training feedback.

Active coverage: enriching motion diversity

The first question a data factory must answer is “what should we collect?” General-purpose whole-body motion models need to see a continuously expanding motion space that covers body coordination patterns. This space cannot just be a stack of motion names; it must be continuously filled along several independent axes:

Horizontal expansion across capability dimensions

The capture plan should be organized by how the body is used, not by piling up motion names. Basic dimensions such as locomotion, posture transitions, limb coordination, contact switching, and object manipulation form the foundation for later complex capabilities. What we care about is how the body is recruited, how different body parts coordinate, and how center of mass and contact change, not whether a specific motion has been captured.

Complex terrain, multi-person interaction, and environment interaction

These three types of scenes are the most difficult beyond the basic dimensions and the closest to real deployment needs, yet they are also the easiest to miss. Complex terrain changes support and foot placement strategies, multi-person interaction introduces rhythm alignment and spatial negotiation, and environment interaction deeply couples body motion with objects, contact surfaces, and reachable space. They cannot be extrapolated naturally from flat-ground single-person motions, so they must be explicitly scheduled into the capture plan.

Instinctive behavior and free performance

A script can only define the task boundary; in real motion, there are many parts that are never written down: individual motion habits, on-the-spot adjustments, and instinctive responses to unexpected events. Professional motion designers provide intent and constraints during recording while leaving room for performers to complete the motions in their own style, so the data covers both task objectives and real bodily differences.

Motion recovery and failure fallback

Whether the model can be deployed in the real world depends largely on whether it can remain stable when things go wrong. Motion recovery therefore needs to be included separately in the capture plan, including rebalancing after loss of balance, obstacle-avoidance recoil after collision, and standing recovery from a fall or other non-ideal posture. Such samples are usually scarce, but they directly determine the safety boundary of the model.

Capture diversity must also be managed explicitly at the source. The diversity of capture personnel and capture equipment directly affects the diversity and richness of CWM data: performers of different body shapes, ages, genders, and physiques bring differentiated motion postures, joint angle ranges, and center-of-mass control strategies; different capture devices (inertial, optical, and electromagnetic motion capture) add a further dimension of variation through their differences in precision, coverage, wearing constraints, and applicable scenarios. Only by bringing both personnel and equipment diversity into the capture plan can the model avoid learning only the motion style of “one kind of person on one kind of device.”

These directions are organized and measured through a continuously updated motion coverage map, which records which combinations have already been covered, which dimensions remain sparse, and which samples repeatedly fail after cross-embodiment transfer.
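A minimal sketch of such a coverage map, assuming each captured segment is indexed by a (capability, terrain, contact state) cell. The three axes, the class name, and the method names are illustrative assumptions; the real map presumably tracks more dimensions.

```python
from collections import Counter
from itertools import product


class CoverageMap:
    """Counts captured segments per (capability, terrain, contact) cell."""

    def __init__(self, capabilities, terrains, contacts):
        self.axes = (list(capabilities), list(terrains), list(contacts))
        self.counts = Counter()             # segments recorded per cell
        self.transfer_failures = Counter()  # failed retargets per cell

    def record(self, cell, transfer_failed=False):
        self.counts[cell] += 1
        if transfer_failed:
            self.transfer_failures[cell] += 1

    def sparse_cells(self, min_count):
        """Cells with fewer than min_count segments, rarest first."""
        cells = [c for c in product(*self.axes) if self.counts[c] < min_count]
        return sorted(cells, key=lambda c: self.counts[c])

    def failure_rate(self, cell):
        """Share of segments in a cell that failed cross-embodiment transfer."""
        n = self.counts[cell]
        return self.transfer_failures[cell] / n if n else 0.0
```

Sparse cells would feed the next capture plan, while cells with a high failure rate point to retargeting or design problems rather than missing data.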

Beyond proactive coverage guided by the map, the data factory also explicitly receives data-type requirement feedback from the model-training side: which motion categories are unstable on which embodiments, which contact states yield the least training gain, and which samples passed quality inspection but did not produce actual improvement are all translated into new data-type requirements and written back into the capture plan, so that “what to collect” keeps being calibrated by training results.

To turn all these requirements into executable capture tasks, we have built an AI-native data design and recording management platform inside the factory, bringing motion requirements, coverage maps, scene assets, recording plans, data status, and training feedback into one system for management.

The platform’s core users are a group of full-time professional motion designers, who define motion semantics, break down body coordination, judge performance executability, and convert whole-body interaction, motion recovery, tool use, and scene tasks into recordable motion plans.

With built-in AI capabilities, the platform assists designers in creating motion plans along three dimensions:

For motion plan generation and expansion, the platform drafts motion descriptions based on coverage gaps and training feedback, performs semantic-level generalization, and derives a large number of variants across dimensions such as speed, body type, and rhythm;

For visual presentation of plans, AI can directly generate motion examples from text descriptions or motion keyframes, turning abstract descriptions into demonstrable reference motions;

For diversity review and personnel matching, the platform compares the distribution bias of the current batch against the coverage map, prompts designers about which dimensions are over-collected and which remain sparse, and, based on body shape, age, gender, and physique, helps assign each plan to the most suitable performer and capture device.

This toolchain closes the loop between the coverage map, designer judgment, and model-training feedback in one system, continuously turning “which motions have been learned stably, which motions have high transfer failure rates, and which scenes still lack coverage” into producible, reviewable, and feedback-ready production tasks.

Synchronized capture: synchronized alignment of multi-source information

CWM synchronized capture is not simply about recording a human motion segment; it must capture four things within the same motion segment at the same time: motion intent, bodily motion pattern, interaction target, and environment. “Whole-body” means that sub-tasks such as locomotion, manipulation, posture control, and contact changes are all valid within the same motion segment, rather than degrading into a simple concatenation of torso, hand, and leg trajectories. This naturally requires human motion, video, semantics, and scene information to be recorded synchronously. Under the current capture specification, a complete record will, whenever possible, synchronize the following four types of signals; what is actually available depends on the capture scenario and target embodiment.

Human motion (BVH)

This is the main reference signal for cross-embodiment retargeting, carrying motion semantics, body coordination, center-of-mass change, and posture transitions. We use different devices for different motion types:

Low-dynamic motions and motions on complex terrain are suited to inertial motion capture, which is insensitive to site conditions, occlusion, and terrain;

High-dynamic motions are suited to optical motion capture or higher-precision optical-inertial hybrid systems, which can stabilize joint positions during rapid movement;

Fine end-effector motions such as grasping, tool use, button pressing, and twisting are suited to electromagnetic motion capture, which can provide high-precision pose data in a small space.
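The device choice above can be written down as a small rule function. This is only a sketch: the flag names and the priority order used when several conditions apply at once (for example, a high-dynamic motion on complex terrain) are assumptions, since the text does not say how such conflicts are resolved.

```python
def choose_capture_device(high_dynamic: bool,
                          complex_terrain: bool,
                          fine_end_effector: bool) -> str:
    """Pick a motion-capture modality for one planned motion."""
    if fine_end_effector:
        # Grasping, tool use, button pressing, twisting.
        return "electromagnetic"
    if complex_terrain:
        # Inertial capture is insensitive to site, occlusion, and terrain.
        return "inertial"
    if high_dynamic:
        # Optical or optical-inertial hybrid for rapid movement.
        return "optical_or_hybrid"
    # Low-dynamic motions default to inertial capture.
    return "inertial"
```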

Raw video

Raw video does not directly enter the retargeting pipeline, but in the data factory it is a high-value auxiliary signal. It supports video-based motion completion and human motion extraction, which allows massive amounts of internet video to be incorporated into training assets, and it provides a visual modality for navigation and manipulation preparation. It is also used to train SLAM and to estimate contact states between humans and objects. On the device side, head-mounted cameras and external RGB / RGB-D cameras capture in parallel, providing first-person and third-person viewpoints respectively.

Scene interaction assets

These provide the environmental and object context in which the motion occurs, and are a prerequisite for bringing the motion into simulation.

We capture two categories: one is terrain and scene assets—room structure, ground undulation, and fixed furniture, which determine the reachable space and contact surfaces; the other is interactive object assets—objects to be carried, pushed, pulled, or used, which determine the target geometry of manipulation tasks.

Technically, we use 3D Gaussian Splatting + mesh extraction for overall reconstruction, and further use optical marker tagging for objects that require precise pose. Once these assets enter the simulation environment, they support reinforcement learning training and model evaluation.

Semantic labels

These are jointly generated by professional motion designers, on-site recorders, and AI annotation systems to define motion boundaries, motion categories, scenes, and intent, determining how each sample enters the training set and how it is sampled, weighted, and evaluated during training.

The reason synchronization is mandatory is that the value of whole-body motion lies not in any single modality, but in the correspondence among modalities. For the same “squat down to pick up an object” motion, human BVH only explains how the body posture changes; video shows where the object is and whether the hand actually makes contact; scene assets describe the environment and interactive surfaces around the object; semantic labels define the motion boundary and task intent. If these signals are not aligned, we cannot determine which frame of object contact corresponds to the hand trajectory, nor can we tell whether foot force corresponds to the current posture, and we certainly cannot verify whether this motion segment is truly fit to enter the training set.

To this end, the data factory establishes a unified capture clock and timestamp system for all capture devices: before capture, all devices complete spatial calibration and time calibration; during capture, the master control system centrally manages task IDs, motion IDs, device status, and start / end signals; devices capable of hardware synchronization are aligned using trigger signals, frame sync, timecode, or PTP where possible, while devices that cannot be hardware-synchronized record high-precision local timestamps and use synchronized motions, calibration events, or post-processing algorithms for time-synchronization correction.
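For devices that cannot be hardware-synchronized, the correction step can be sketched as follows, assuming a single shared calibration event (such as a clap) has been detected on both the master clock and the device clock. The function names are hypothetical, and the sketch ignores clock drift, which a production pipeline would also have to model.

```python
from bisect import bisect_right


def estimate_offset(master_event_t: float, device_event_t: float) -> float:
    """Offset to add to device timestamps so the shared event lines up."""
    return master_event_t - device_event_t


def resample_to_master(device_t, device_v, offset, master_t):
    """Linearly interpolate device samples at master-clock timestamps.

    device_t must be sorted. Master timestamps outside the corrected
    device range yield None instead of being extrapolated.
    """
    corrected = [t + offset for t in device_t]
    out = []
    for t in master_t:
        if t < corrected[0] or t > corrected[-1]:
            out.append(None)
            continue
        j = bisect_right(corrected, t)
        if j == len(corrected):  # t equals the last device timestamp
            out.append(device_v[-1])
            continue
        t0, t1 = corrected[j - 1], corrected[j]
        w = (t - t0) / (t1 - t0)
        out.append(device_v[j - 1] + w * (device_v[j] - device_v[j - 1]))
    return out
```

Returning None outside the device's range keeps gaps visible to quality inspection instead of hiding them behind extrapolated values.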

After synchronization, each sample must be organized into an asset that can directly enter the downstream pipeline, and this work is also completed by the aforementioned recording management platform.

The platform performs on-site automated quality inspection. It checks time synchronization, calibration, trajectory completeness, bone-length stability, keypoint anomalies, and motion-segment boundaries, with AI-assisted checks of motion semantics, performance consistency, and obvious recording anomalies. It also performs unified ingestion: all modalities from the same motion segment are packed into a single data package, bound to the session, device status, calibration version, time offset, frame drops, and quality inspection results, and then aligned, resampled, and sliced against the master clock to form the minimum data contract that can directly enter the retargeting and training pipelines.
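One way to picture this minimum data contract is a record type with a validation method. The field names, the choice of `motion_bvh` as the only required modality, and the zero-dropped-frames default are assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, List

# Assumption: the human-motion file is the only modality every package needs.
REQUIRED_MODALITIES = ("motion_bvh",)


@dataclass
class SegmentPackage:
    session_id: str
    motion_id: str
    calibration_version: str
    modalities: Dict[str, str]  # modality name -> file path
    time_offsets_s: Dict[str, float] = field(default_factory=dict)
    dropped_frames: Dict[str, int] = field(default_factory=dict)
    qc_results: Dict[str, bool] = field(default_factory=dict)

    def contract_errors(self, max_dropped: int = 0) -> List[str]:
        """List every reason this package may not enter downstream pipelines."""
        errors = []
        for name in REQUIRED_MODALITIES:
            if name not in self.modalities:
                errors.append(f"missing modality: {name}")
        for name in self.modalities:
            if name not in self.time_offsets_s:
                errors.append(f"no time offset recorded for {name}")
            dropped = self.dropped_frames.get(name, 0)
            if dropped > max_dropped:
                errors.append(f"dropped frames in {name}: {dropped}")
        errors += [f"failed check: {k}"
                   for k, ok in self.qc_results.items() if not ok]
        return errors
```

An empty error list would mean the package satisfies the contract; anything else is returned to the recording site with the reasons attached.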

Cross-embodiment retargeting: retargeting to multi-configuration robots

The core solution to the heterogeneity problem is motion retargeting: converting a motion expressed in a human or other reference embodiment coordinate system into a trajectory on the target robot embodiment. In industrialized production, the challenge is no longer just “can a motion be transferred to one robot,” but whether it can be done continuously, stably, and at low cost across large numbers of motions and large numbers of embodiments.

At the algorithm level, our self-developed retargeting engine is designed for “any motion × any model × any terrain.” On the input side, it covers arbitrary motions, upper-body / lower-body / whole-body configurations, and can process offline motion-capture files, real-time motion-capture streams, and motion signals from video and other sources; on the output side, it covers legged, humanoid, upper-limb, and composite robots with significantly different structures, joint configurations, scales, and actuation capabilities, and can incorporate terrain constraints such as flat ground, slopes, stairs, and uneven surfaces into a unified solver without writing a dedicated solving logic for every motion, every robot, or every terrain type. The solver uses kinematic solving and geometric constraints as its backbone, bringing contact states, support relationships, spatial constraints, terrain constraints, joint limits, and body interaction relationships into the same solving process, and outputting candidate trajectories that are semantically consistent, structurally reachable, and stable in quality.

At the engineering level, it has three direct advantages for factory-scale production.

First, no per-sample tuning and no motion templates: cross-embodiment capability comes from a unified embodiment abstraction layer—when a new robot is introduced, we only rely on its URDF definition, and the algorithm can automatically and quickly adapt to multiple configurations on this abstraction layer, without writing dedicated solving logic for each motion or robot, and without manual fine-tuning for each sample.
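To show what relying only on a URDF definition can look like in practice, the sketch below reads joint limits from a URDF file using Python's standard XML parser. It covers only the `<limit>` element of revolute and prismatic joints, and it is not the retargeting engine's embodiment abstraction layer, whose internals are not described here. The example robot is made up.

```python
import xml.etree.ElementTree as ET

EXAMPLE_URDF = """
<robot name="demo">
  <link name="thigh"/>
  <link name="shank"/>
  <link name="foot"/>
  <joint name="knee" type="revolute">
    <parent link="thigh"/>
    <child link="shank"/>
    <limit lower="-0.5" upper="2.0" effort="40" velocity="10"/>
  </joint>
  <joint name="foot_mount" type="fixed">
    <parent link="shank"/>
    <child link="foot"/>
  </joint>
</robot>
"""


def load_joint_limits(urdf_text: str):
    """Map joint name -> (lower, upper, effort, velocity) for limited joints."""
    root = ET.fromstring(urdf_text.strip())
    limits = {}
    for joint in root.findall("joint"):
        # Fixed and continuous joints carry no position range here.
        if joint.get("type") not in ("revolute", "prismatic"):
            continue
        limit = joint.find("limit")
        if limit is None:
            continue
        limits[joint.get("name")] = tuple(
            float(limit.get(key, "0"))
            for key in ("lower", "upper", "effort", "velocity")
        )
    return limits
```

Limits extracted this way can act as solver constraints for a new robot without any per-robot code.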

Second, dual-mode streaming and offline processing: it can absorb motion streams entering in real time from the capture side, and it can also process existing motion libraries in batches; this makes retargeting no longer an “after-capture offline step,” but enables capture while retargeting—as soon as a motion is recorded, candidate trajectories on the target embodiment are already available, and quality inspection and subsequent dynamics augmentation can follow immediately. In streaming mode, our retargeting tool supports output data from multiple devices such as Noitom and Xsens.

Third, stable cross-platform distribution: it can be deployed and replayed in the same form across engineering stations, capture sites, training clusters, and target robots, so the motion stream always relies on the same algorithm implementation throughout the production chain.

At the capacity level, it is already the backbone production service of the factory. According to our current statistics, this retargeting algorithm can exceed 1000 frames per second on a single CPU core, more than ten times the standard recording frame rate; we have prepared a compute cluster for this path so it can continuously absorb motion streams from the capture side and support parallel dispatch of the same motion to multi-configuration robots. In production terms, it compresses the hidden cost of “manual adaptation for every motion” into a one-time engineering calibration when a new embodiment is introduced, and compresses the chain time from “capture → retargeting → candidate training samples” from days down to near real time.
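The capacity claim can be turned into a back-of-the-envelope calculation. The 100 fps capture rate, the four capture streams, and the twelve target embodiments below are assumed numbers chosen for illustration, and the sketch assumes the 1000 frames per second figure applies per target embodiment.

```python
import math


def realtime_factor(solver_fps: float, capture_fps: float) -> float:
    """How many times faster than capture one core can retarget."""
    return solver_fps / capture_fps


def cores_needed(capture_streams: int, embodiments: int,
                 capture_fps: float, solver_fps: float) -> int:
    """Cores needed to retarget every stream to every embodiment in real time."""
    frames_per_second = capture_streams * embodiments * capture_fps
    return math.ceil(frames_per_second / solver_fps)


# Assumed workload: 4 live streams at 100 fps, 12 target embodiments.
factor = realtime_factor(solver_fps=1000, capture_fps=100)   # 10.0
cores = cores_needed(capture_streams=4, embodiments=12,
                     capture_fps=100, solver_fps=1000)        # 5
```

Under these assumptions, a handful of cores keeps pace with live capture across a dozen embodiments, which is what makes retargeting during capture practical.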

The same human dance motion is retargeted across multiple robot embodiments, with some confidential embodiments blurred. Achieving this result does not require manual parameter tuning or additional configuration of the algorithm.

Data augmentation: dynamics, simulation, and AI annotation augmentation

Cross-embodiment retargeting outputs high-quality candidate trajectories, but candidate trajectories are not yet the final training asset. What data augmentation does is continue turning these candidate trajectories into data that is more verifiable, more trainable, and easier for the model to consume. We proceed along three main lines: dynamics augmentation, simulation diversity augmentation, and semantic annotation.

Dynamics augmentation places the most valuable, most difficult, and most physically demanding samples into the target embodiment’s dynamics and contact models, and uses RL-based dynamic post-processing to jointly control tracking error and physical violation, upgrading candidate trajectories from “kinematically plausible” to “trackable on the target embodiment, with no penetration, no torque overload, and no friction-cone violation.” Samples deemed infeasible are not discarded outright; instead, they enter quality feedback with specific failure reasons attached.
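A reward that trades tracking error against physical violation might look like the sketch below. It uses an exponential tracking term, a common form in motion-tracking reinforcement learning, plus linear penalties; the specific terms, weights, and the name `tracking_reward` are assumptions and not the reward actually used in the factory.

```python
import math


def tracking_reward(q, q_ref, torque, torque_limit, penetration_depth,
                    sigma=0.25, w_torque=0.5, w_pen=10.0):
    """Per-step reward: high when tracking is tight and physics is respected."""
    # Squared joint-space tracking error against the retargeted reference.
    track_err = sum((a - b) ** 2 for a, b in zip(q, q_ref))
    r_track = math.exp(-track_err / sigma ** 2)
    # Fraction by which each joint torque exceeds its limit (0 if within).
    over = sum(max(0.0, abs(t) / t_max - 1.0)
               for t, t_max in zip(torque, torque_limit))
    # Penalize torque overload and ground penetration (depth in m).
    return r_track - w_torque * over - w_pen * max(0.0, penetration_depth)
```

With perfect tracking and no violations the reward is 1.0; any torque overload or penetration lowers it.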

Simulation diversity augmentation repeatedly executes the same motion in different virtual environments, dramatically expanding the coverage density of CWM assets.

On one hand, it fills in missing modalities: through physics simulation and rendering pipelines, it adds force signals, depth maps, semantic segmentation, multi-view images, and other modalities that were not originally captured to samples that only contain motion and video;

On the other hand, it expands visual and scene diversity: swapping the texture assets of objects and environments, adjusting materials and lighting, changing room layouts, introducing new interactive objects and initial states, and applying external disturbances from different directions and strengths. The same motion can generate many new samples across multiple target embodiments, multiple scene sets, different lighting conditions, and multiple disturbance configurations, so the model sees not “one way to do this motion,” but “a distribution of ways to do this motion.”
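The combinatorial expansion described here can be sketched by enumerating replay configurations for one motion. The four axes and the optional random subsampling are illustrative assumptions; a real pipeline would pass each configuration to a simulator and renderer.

```python
import random
from itertools import product


def augmentation_configs(embodiments, scenes, lightings, pushes,
                         limit=None, seed=0):
    """Enumerate replay configurations for one motion segment."""
    configs = [
        {"embodiment": e, "scene": s, "lighting": l, "push": p}
        for e, s, l, p in product(embodiments, scenes, lightings, pushes)
    ]
    # Optionally subsample so the replay budget stays bounded.
    if limit is not None and limit < len(configs):
        configs = random.Random(seed).sample(configs, limit)
    return configs
```

Two embodiments, three scenes, two lighting setups, and two disturbance settings already give 24 replays of a single captured motion.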

Semantic annotation turns data into assets that can be retrieved, weighted, filtered, and reused by the training pipeline. The AI annotation system assists in generating motion segments, motion categories, contact states, scene objects, task semantics, failure reasons, and capability-dimension labels, while professional motion designers review semantic boundaries and key samples, converging the annotation output into a standard format usable for training sampling and evaluation bucketing.

The three augmentation types share the same version and provenance records: every augmented sample is marked with which original motion it came from, which target embodiment it passed through, which round of dynamics post-processing, which round of simulation expansion, which annotation version, and whether it passed physical validation. In this way, the training system can safely reuse, compare, and roll back augmented samples across versions, and quality feedback can pinpoint responsibility to a specific augmentation stage when problems occur.
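A provenance record of this kind could be modeled as an immutable structure with a deterministic identifier, so that two samples with identical lineage always share an ID. The field names and the hashing scheme are assumptions for illustration.

```python
import hashlib
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class Provenance:
    source_motion_id: str
    target_embodiment: str
    dynamics_round: int
    simulation_round: int
    annotation_version: str
    passed_physical_validation: bool

    def sample_id(self) -> str:
        """Deterministic ID derived from the full lineage of the sample."""
        payload = json.dumps(asdict(self), sort_keys=True).encode("utf-8")
        return hashlib.sha256(payload).hexdigest()[:16]
```

Because the ID changes whenever any stage version changes, the training system can compare or roll back samples by lineage rather than by file name.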

Candidate trajectories after cross-embodiment retargeting (green markers) compared with the dynamics-consistent, high-quality data produced through dynamics augmentation (red markers). Slippage, penetration, and floating are significantly reduced.

Quality feedback: bringing model training results back into production

Traditional motion-capture quality inspection mostly checks whether trajectories are clean; CWM data factory quality management must take two steps: first, layered checks along the production chain, and second, closed-loop feedback from model training results.

The first step is layered gating. A sample, from motion requirement to training set, must pass four independent quality gates in sequence. Together, these four layers screen a candidate sample into an asset that can enter training, but whether it can truly train a general-purpose whole-body motion capability can ultimately only be told by the model.

Design layer

Whether the motion requirement truly aligns with the capability gap, whether it covers the still-sparse cells in the motion coverage map, and whether it can be implemented as a field-executable motion plan. This layer controls “whether it should be collected.”

Raw data layer

Whether the performer fully expressed the design intent, whether capture was synchronized, whether calibration was in place, and whether there are basic recording problems such as frame drops, drift, keypoint anomalies, or unstable bone length. This layer controls “whether it was captured correctly.”

Retargeted data layer

Whether the candidate trajectory is structurally reachable on the target embodiment, whether joints exceed limits, whether contact relationships hold, and whether the motion semantics still remain valid after retargeting. This layer controls “whether it still holds on the target embodiment.”

Augmented data layer

After dynamic post-processing, whether it remains trackable, free of penetration, free of torque overload, and compliant with the friction cone; whether simulation expansion and semantic annotation carry the correct version and provenance records. This layer controls “whether it is truly effective once placed into the training set.”
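The four layers can be read as a sequential gate pipeline that stops at the first failure and reports which layer rejected the sample. The sketch below assumes each gate is a function returning a pass flag and a reason; the layer keys and the `run_gates` name are hypothetical.

```python
GATES = ["design", "raw", "retargeted", "augmented"]


def run_gates(sample, checks):
    """Run the four gates in order and stop at the first failure.

    checks maps each layer name to a function sample -> (ok, reason).
    Returns (passed, failed_layer, reason).
    """
    for layer in GATES:
        ok, reason = checks[layer](sample)
        if not ok:
            return False, layer, reason
    return True, None, None


# Example gate set with a single real check on the raw data layer.
example_checks = {
    "design": lambda s: (True, ""),
    "raw": lambda s: (s["dropped_frames"] == 0, "frame drops"),
    "retargeted": lambda s: (True, ""),
    "augmented": lambda s: (True, ""),
}
```

Recording the failed layer together with the reason is what lets a rejection be attributed to a specific stage of the production chain.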

The second step is result feedback closure. The training side summarizes each model evaluation result into a retrievable failure profile. The inputs include which motion categories have been learned stably on which embodiments, which transfers failed, which contact states yielded the lowest training gain, and which samples passed the four-layer gates but brought no actual improvement. The profile records on which embodiment, under which motion category, at which contact state, and in which training version the issue occurred, and whether the root cause lies in design, raw capture, retargeting, or augmentation.

The failure profile is then written back directly to every upstream layer: the design layer uses it to adjust the priority of the motion coverage map and the recording plan; the raw data layer uses it to adjust capture specifications, synchronization strategy, and on-site quality thresholds; the retargeting layer uses it to iterate algorithm capability; the augmentation layer uses it to adjust the strength of dynamic post-processing, the simulation diversity configuration, and the annotation standard.
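The write-back can be sketched as routing failure profiles to per-layer queues and ranking the cells that fail most often. The record fields mirror the attributes named above, while the class name, layer keys, and helper functions are hypothetical.

```python
from collections import Counter, defaultdict
from dataclasses import dataclass

UPSTREAM_LAYERS = ("design", "raw", "retargeting", "augmentation")


@dataclass(frozen=True)
class FailureProfile:
    embodiment: str
    motion_category: str
    contact_state: str
    training_version: str
    root_cause: str  # one of UPSTREAM_LAYERS


def route_failures(profiles):
    """Group failure profiles by the upstream layer that must act on them."""
    queues = defaultdict(list)
    for p in profiles:
        if p.root_cause not in UPSTREAM_LAYERS:
            raise ValueError(f"unknown root cause: {p.root_cause}")
        queues[p.root_cause].append(p)
    return dict(queues)


def worst_cells(profiles, top_n=3):
    """Most frequent (embodiment, motion category, contact state) failures."""
    cells = Counter((p.embodiment, p.motion_category, p.contact_state)
                    for p in profiles)
    return cells.most_common(top_n)
```

Each layer then works through its own queue, and the ranking of worst cells can reprioritize the motion coverage map.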

Put together, these two steps form a continuously iterative loop for the data factory. In actual operation, it advances along two lines: one is active coverage based on long-term judgment, continuously expanding the human whole-body motion library according to the motion coverage map; the other is feedback-based gap filling from the model-training side, backfilling every upstream layer according to the failure profile. With each cycle, the quality of the data assets, cross-embodiment coverage density, and training returns all rise a little at the same time. Becoming more accurate and faster with every run is the core source of the CWM data factory's compounding value over time.

Closing Note: the current state and future of our data factory

Over the past three months, we have run the end-to-end chain of the cross-embodiment whole-body motion data factory in an internal pilot. The goal of this stage was not to pursue maximum throughput, but to truly get the entire production system running: whether motion design could be managed systematically, whether multi-source capture could remain stably aligned, whether retargeting could adapt quickly to new embodiments, whether augmentation and quality inspection could turn candidate trajectories into trainable assets, and whether training feedback could return to the next production cycle.

Along this chain, we have accumulated nearly one thousand hours of high-quality CWM data. The whole-body motion model trained on this dataset has completed key validation on more than a dozen legged robots with significant differences in structure, actuation performance, mass distribution, and inertia distribution.

The solution has now completed internal feasibility validation, and the data factory is about to complete formal construction. The next phase will shift from pilot validation to scaled production—expanding the site, capture studios, motion-capture equipment, motion design team, performer staffing, and algorithm / simulation / training compute clusters at the same time, so the pipeline that has already been proven can operate stably at a much larger scale.

Our goal, after the new factory is launched, is to establish a monthly production capacity of several thousand hours of high-quality CWM data for multi-configuration robots, and then climb in stages from “several thousand hours” to “tens of thousands of hours.” In this process, data quality, cross-embodiment reuse rate, and training gain will be continuously evaluated under the same production standard, so that every new batch of data can answer how many embodiments it can run on and which motion categories it delivers real training gains to, rather than merely “how many hours were captured.”