CS839 Topics in Advanced Robotics · Spring 2026

Data Composition vs Architecture in Robot Imitation Learning

When does narrow-task training transfer? A real-robot study across task composition, spatial generalization, and three policy architectures.

Aswinkumar
University of Wisconsin-Madison · SO-ARM101 Pick-and-Place Evaluation
Successful single-sponge rollout
Single-object task: policies can solve the demonstrated setting.
Single-task policy failing on multi-sponge rollout
Compositional shift: single-sponge training can fail on multi-sponge inference.

This project asks whether generalization comes from architecture, from pretraining, or from the composition of the robot demonstration dataset.

Abstract

Imitation learning policies can reproduce demonstrated robot behavior in-distribution, but it is less clear when they learn reusable manipulation primitives rather than scene-specific shortcuts. I study this question on a real SO-ARM101 robot using a sponge pick-and-place task family. The study varies demonstration data composition and compares three policy families: ACT trained from scratch, SmolVLA, and Pi0.5.

Across 684 scored robot rollouts, the results show that data composition often dominates architecture choice. Marked-position datasets overfit, single-task training does not reliably compose to multi-object behavior, and small amounts of task-relevant cross-task data recover most of the benefit of universal training. The completed ST3/ST4 atlas further shows that distractor tasks introduce a different failure profile: color-selection errors become as important as grasping errors. Pretrained VLA policies help in some regimes, but they do not remove the need for carefully structured robot data.

384 teleop demos · 684 scored rollouts · 39 policies · 202 GPU-hours

Research Question

When does narrow-task training transfer to harder task variants in imitation learning, and how do architecture and data composition interact to determine compositional generalization?

Q1: Spatial generalization

How do known, random, mixed, and combined datasets affect position robustness?

Q2: Task composition

Does training on a single-object task transfer to multi-object or distractor settings?

Q3: Architecture

Do pretrained VLAs close the generalization gap relative to ACT trained from scratch?

Method

Collect demos

Known, random, mixed, combined, and cross-task teleoperation datasets.

Train policies

ACT, SmolVLA, and Pi0.5 checkpoints trained on matched dataset recipes.

Evaluate rollouts

Known/random weighted scenes, timing metrics, and failure-mode labels.

Task family

Task | Description | Purpose
ST1  | Pick one blue sponge and place it in the bowl.   | Base single-object skill.
ST2  | Pick two blue sponges and place both in the bowl. | Multi-object composition.
ST3  | Pick the blue sponge in clutter with distractors. | Visual distractor robustness.
ST4  | Pick multiple blue sponges with distractors.      | Held-out compositional probe.
Marked known positions in the robot workspace
Known scenes use marked table positions; random scenes move objects away from those positions.

ACT

Transformer encoder-decoder with action chunking. Trained from scratch. Fast inference, no pretrained visual-language priors.

SmolVLA

Open VLA-style policy with SmolVLM backbone and action expert. Fine-tuned from pretrained weights.

Pi0.5

Large VLA policy with visual-language backbone and flow-matching action head. Fine-tuned with frozen vision components.

Experimental Design

Dataset recipes

Each task has known-position, random-position, mixed, and combined datasets. Cross-task datasets combine demonstrations from multiple task families. The universal dataset acts as a broad-data reference.

Recipe     | Meaning
K          | Known marked positions only.
R          | Random positions only.
Mixed      | 50/50 known and random demonstrations.
Combined   | Larger known+random task-specific set.
Cross-task | Half-mixes or universal mixtures across tasks.
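As a minimal sketch of how these recipes could be assembled from episode pools (function name and pool handling are illustrative, not the project's actual tooling):

```python
import random

def build_recipe(known: list, rand: list, recipe: str, seed: int = 0) -> list:
    """Assemble a training set from known-position and random-position
    episode pools. Recipe names mirror the table above; pool sizes and
    contents are illustrative."""
    rng = random.Random(seed)
    if recipe == "K":
        return list(known)
    if recipe == "R":
        return list(rand)
    if recipe == "Mixed":
        # 50/50: take an equal number of episodes from each pool.
        n = min(len(known), len(rand))
        return rng.sample(known, n) + rng.sample(rand, n)
    if recipe == "Combined":
        # Larger task-specific set: everything from both pools.
        return list(known) + list(rand)
    raise ValueError(f"unknown recipe: {recipe}")
```

Cross-task half-mixes (e.g. D1+D2-Half) would then concatenate the outputs of `build_recipe` across task families.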

Scoring protocol

Each evaluated cell uses 16 trials: 8 known and 8 random. Known successes earn 1.0 point; random successes earn 1.5 points, for a maximum score of 20 (8 × 1.0 + 8 × 1.5). The weighting penalizes policies that memorize marked positions.

The secondary metric is median time-to-success among successful trials, which captures decisiveness and trajectory quality.

iPad dashboard used during robot evaluation
The evaluation dashboard launched rollouts and recorded success, timing, and failure modes beside the robot.

Results

ST1 known versus random overfitting plot
Known-only training overfits marked positions: both ACT and SmolVLA score 10/20.

Marked positions are a trap

D1-K policies reach high known-position success but collapse on random positions. The same pattern appears across architectures, indicating that the failure lies in the data distribution rather than in model capacity alone.

ST1 score by dataset and architecture
Cross-task data helps ST1, but the useful auxiliary data differs by architecture.

Cross-task data helps differently

ACT reaches 20/20 with D1+D3-Half and universal data, suggesting that explicit distractor exposure improves visual alignment. SmolVLA reaches 20/20 with D1+D2-Half and universal data, suggesting that multi-object behavior still needs direct evidence.

ST2 full results
Single-task D1 training fails on ST2; adding limited D2 data recovers much of the universal benefit.

Single-task training does not compose

ACT trained on D1-Combined scores only 3.5/20 on ST2, SmolVLA 2.5/20, and Pi0.5 7.0/20. Adding 64 multi-sponge episodes via D1+D2-Half raises ACT to 18.5/20 and Pi0.5 to 17.0/20.

ST1 time-to-success plot
Broader training often reduces ST1 time-to-success.
ST2 time-to-success plot
ST2 timing reflects object ordering, regrasping, and decisiveness.
Architecture | Task | Dataset     | K | R | Score   | Median time
ACT          | ST1  | D1-K        | 7 | 2 | 10.0/20 | 18.0 s
ACT          | ST1  | D1+D3-Half  | 8 | 8 | 20.0/20 | 26.5 s
SmolVLA      | ST1  | D1+D2-Half  | 8 | 8 | 20.0/20 | 7.0 s
ACT          | ST2  | D1-Combined | 2 | 1 | 3.5/20  | 69.0 s
ACT          | ST2  | D1+D2-Half  | 8 | 7 | 18.5/20 | 85.0 s
SmolVLA      | ST2  | D_Universal | 8 | 8 | 20.0/20 | 64.5 s
Pi0.5        | ST2  | D_Universal | 8 | 8 | 20.0/20 | 113.5 s
(K/R = successes out of 8 known / 8 random trials.)
ST3 and ST4 success rates by architecture
Distractor tasks separate the architectures: ACT and Pi0.5 are much stronger than SmolVLA in ST3/ST4.

ST3/ST4 atlas: distractors change the problem

ST3 contains 144 scored distractor rollouts with 67 successes. ACT and Pi0.5 are close, with 28/48 and 29/48 successes respectively, while SmolVLA reaches 10/48. ST4 contains 60 held-out multi-sponge distractor rollouts with 35 successes: ACT reaches 17/20, Pi0.5 reaches 12/20, and SmolVLA reaches 6/20.

The key change is not just lower success. The failure vocabulary changes: color-selection errors become a first-class failure mode, especially in ST4. That makes distractor robustness a distinct capability from clean pick-and-place.

ST3 and ST4 dataset by architecture heatmap
Dataset-by-architecture view. Universal training is not uniformly best; the helpful data mixture depends on architecture and task.
ST3 and ST4 failure modes by architecture
Failure atlas for distractor tasks. ST3 is still grasp-heavy; ST4 is increasingly dominated by wrong-color selections.
Regular blue sponge used during training · Irregular blue sponge shape used at inference
Shape generalization check: training used cuboidal blue sponges, but the policy could still pick an irregular sponge shape at inference.

Shape generalized, but material state still mattered

The image comparison is a positive result: although training used regular cuboidal sponges, the learned policy could still pick up an irregular sponge shape at inference, suggesting it was not simply memorizing the exact cuboid outline.

The separate contact issue came from material state. A 30-second diagnostic clip showed that after a sponge was left out during robot work, it stiffened enough that the same pinch strategy used successfully in ST1/ST2 could make it jump out of the gripper. This gives a concrete physical explanation for why ST3/ST4 have elevated grasp-miss counts in addition to color-selection errors.

Task | Architecture | Successes | Failure-mode summary | Interpretation
ST3  | ACT          | 28 / 48   | G=15, C=4, D=1 | Mostly grasp-limited; distractors do not dominate.
ST3  | SmolVLA      | 10 / 48   | G=24, C=14     | Both grasp and color-selection robustness are weak.
ST3  | Pi0.5        | 29 / 48   | G=10, C=8, D=1 | Strongest ST3 success, but color errors remain visible.
ST4  | ACT          | 17 / 20   | G=2, C=1       | Best held-out compositional distractor result.
ST4  | SmolVLA      | 6 / 20    | G=7, C=6, S=1  | Struggles with both selection and multi-object completion.
ST4  | Pi0.5        | 12 / 20   | C=6, G=1, S=1  | Main bottleneck shifts from grasping to selecting the right object.
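Assuming a code mapping inferred from the prose (G = grasp miss, C = wrong-color selection, D = drop, S = sequencing/multi-object failure), the summary strings above can be produced by a simple tally over per-trial labels:

```python
from collections import Counter

def failure_summary(trial_labels: list) -> str:
    """Summarize per-trial failure labels (None = success) as e.g. 'G=2, C=1'.

    Single-letter codes are an assumed mapping: G grasp miss, C wrong-color
    selection, D drop, S sequencing / multi-object failure.
    """
    counts = Counter(label for label in trial_labels if label is not None)
    # most_common() orders modes by frequency, matching the table style.
    return ", ".join(f"{code}={n}" for code, n in counts.most_common())

# e.g. ST4 / ACT: 17 successes, 2 grasp misses, 1 wrong-color pick
st4_act = [None] * 17 + ["G", "G", "C"]
```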

Discussion and Limitations

01

Data can dominate architecture.

Both ACT and SmolVLA overfit marked-position data, while broader data lets even a from-scratch ACT policy perform strongly.

02

Composition needs direct evidence.

Single-object demonstrations teach useful primitives but not necessarily multi-object sequencing.

03

Failure modes reveal data gaps.

ST1 failures are mostly grasp misses; ST2 introduces drops and missed sponges; ST3/ST4 add wrong-color failures, showing that distractor robustness is a separate capability.

Failure mode distribution
Failure modes shift as the task changes from single-object grasping to multi-object sequencing.

Limitations. Each K/R split has only 8 trials, so individual cells have wide confidence intervals. A single checkpoint budget can favor faster-converging architectures. The robot setup uses one table, one object family, and one lighting regime, so deployment claims beyond this setup should be interpreted as hypotheses rather than final claims.
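The 8-trial caveat can be made concrete with a standard Wilson score interval (a generic statistical sketch, not part of the project's tooling):

```python
from math import sqrt

def wilson_interval(successes: int, n: int, z: float = 1.96) -> tuple:
    """95% Wilson score interval for a binomial success rate."""
    p = successes / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half = (z / denom) * sqrt(p * (1 - p) / n + z**2 / (4 * n**2))
    return (center - half, center + half)

# With n = 8, even 6/8 observed successes leaves a wide interval,
# roughly (0.41, 0.93), so single-cell comparisons are noisy.
lo, hi = wilson_interval(6, 8)
```

This is why the analysis leans on patterns that repeat across architectures and datasets rather than on any single 8-trial cell.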

Conclusion

Compositional generalization in this robot imitation learning setting does not emerge automatically from architecture scale or VLA pretraining. It appears when the training data contains the right kinds of variation. Marked-position demonstrations produce brittle shortcuts; single-task demonstrations do not reliably compose; small cross-task mixtures can close much of the gap. The completed distractor atlas adds one more constraint: color and object-selection robustness must be trained and evaluated explicitly, because it does not reduce to clean pick-and-place success.

Lessons Learned

Beyond the headline metrics, the project produced practical lessons about running real robot learning experiments end-to-end.

  • Evaluation UX matters: good tooling increases data quality. The iPad dashboard made hundreds of rollouts feasible by removing CSV bookkeeping and letting failures be tagged immediately.
  • Data bugs look like model bugs: marked-position overfitting initially looked like architecture failure until the K/R split exposed the dataset shortcut.
  • Failure labels are a microscope: success rate alone hid the difference between grasping, sequencing, dropping, and wrong-color selection.
  • Small data can be decisive: D1+D2-Half showed that a small amount of the right multi-object data can beat larger but incomplete data.
  • Inference speed changes behavior: ACT runs faster than the VLA policies, so time-to-success reflects both learned strategy and control-loop frequency.
  • Infrastructure is part of the experiment: lighting, camera naming, video encoding, reset timing, and partial recordings all affected what could be measured reliably.
  • Objects have state: the sponge itself changed as it dried and stiffened, turning a previously reliable pinch into a different contact problem.

References

  1. T. Z. Zhao et al. Learning Fine-Grained Bimanual Manipulation with Low-Cost Hardware. RSS, 2023.
  2. A. Padalkar et al. Open X-Embodiment: Robotic Learning Datasets and RT-X Models. 2023.
  3. Hugging Face LeRobot project and SmolVLA model documentation.
  4. Physical Intelligence Pi0/Pi0.5 vision-language-action policy family.
  5. Project artifacts: scored trials, plots, slide deck, and larger media/data artifacts maintained outside the project page.

BibTeX

@misc{aswinkumar2026datacomposition,
  title  = {Data Composition vs Architecture in Robot Imitation Learning},
  author = {Aswinkumar},
  year   = {2026},
  note   = {CS839 Final Project, University of Wisconsin-Madison}
}