CS839 Topics in Advanced Robotics · Spring 2026

Data Composition vs Architecture in Robot Imitation Learning

When does narrow-task training transfer? A real-robot study across task composition, spatial generalization, and three policy architectures.

384 teleop demos  ·  684 scored rollouts  ·  39 policies  ·  202 GPU-hours  ·  1 SO-ARM101

Aswinkumar
University of Wisconsin-Madison · SO-ARM101 Pick-and-Place Evaluation
In-distribution success
[Video: successful single-sponge rollout]
Single-object task: policies can solve the demonstrated setting.

Compositional failure
[Video: single-task policy failing on multi-sponge rollout]
Compositional shift: single-sponge training can fail on multi-sponge inference.

This project asks whether generalization comes from architecture, from pretraining, or from the composition of the robot demonstration dataset.

Abstract

Imitation learning policies can reproduce demonstrated robot behavior in-distribution, but it is less clear when they learn reusable manipulation primitives rather than scene-specific shortcuts. I study this question on a real SO-ARM101 robot using a sponge pick-and-place task family. The study varies demonstration data composition and compares three policy families: ACT trained from scratch, SmolVLA, and Pi0.5.

Across 684 scored robot rollouts, the results show that data composition often dominates architecture choice. Marked-position datasets overfit, single-task training does not reliably compose to multi-object behavior, and small amounts of task-relevant cross-task data recover most of the benefit of universal training. Adding visual distractors changes the failure profile itself: color- and object-selection errors become as important as grasping errors. Pretrained VLA policies help in some regimes, but they do not remove the need for carefully structured robot data.

Beyond the headline metrics, three engineering lessons stood out. Data bugs look like model bugs — marked-position overfitting initially read as an architecture failure until the K/R split exposed the dataset shortcut. Evaluation UX is part of the experiment — a custom iPad dashboard that tagged failure modes inline made 684 rollouts tractable in a weekend without transcription errors. And objects have state — the same blue sponge stiffens when left out, turning a previously reliable pinch grip into a different contact problem, which physically explains some of the ST3/ST4 grasp misses.

Research Question

When does narrow-task training transfer to harder task variants in imitation learning, and how do architecture and data composition interact to determine compositional generalization?

Q1: Spatial generalization

How do known, random, mixed, and combined datasets affect position robustness?

Q2: Task composition

Does training on a single-object task transfer to multi-object or distractor settings?

Q3: Architecture

Do pretrained VLAs close the generalization gap relative to ACT trained from scratch?

Method

Collect demos

384 teleop episodes across 3 tasks, organized into 16 dataset recipes (known / random / mixed / combined / cross-task).

Train policies

39 checkpoints across ACT (scratch), SmolVLA (full fine-tune), Pi0.5 (VLM-frozen fine-tune) — 202 GPU-hours total.

Evaluate rollouts

684 scored real-robot rollouts on known-position (K) and random-position (R) scenes, scored with a weighted K / R metric and annotated with timing and failure-mode labels.

Task family

The four scoring tasks (ST1–ST4) form a compositional ladder along two independent axes: number of target objects (one vs. many) and presence of distractors (clean scene vs. clutter). ST1–ST3 have matching teleop datasets (T1, T2, T3); ST4 has no dedicated training data and acts as a held-out compositional probe combining the multi-object and distractor axes.

| Task | Target sponges | Distractors | Description | Capability probed | Training data |
| --- | --- | --- | --- | --- | --- |
| ST1 | 1 blue | None | Pick one blue sponge and place it in the bowl. | Base single-object grasp + place. | T1 (128 eps) |
| ST2 | 1–6 blue | None | Pick every blue sponge in the scene and place them all in the bowl (count varies 1–6 per scene). | Multi-object sequencing, re-engagement. | T2 (128 eps) |
| ST3 | 1 blue | Yes (colored distractors) | Pick the blue sponge among visually similar distractors. | Color / object selection robustness. | T3 (128 eps) |
| ST4 | 1–6 blue | Yes (colored distractors) | Pick every blue sponge from a cluttered scene (count varies 1–6 per scene). | Joint composition: selection × sequencing. | None (held-out) |

Sponge color, bowl, and gripper geometry are fixed across all tasks; only object count and distractors change. This isolates compositional generalization from low-level perception changes.

1. Collect demos — from teleoperation to 16 dataset recipes

All demonstrations were teleoperated on a single SO-ARM101 with a leader/follower pair. 384 episodes were collected in ~5.7 hours of recording across three base tasks (T1 = single sponge, T2 = multi-sponge, T3 = clutter / distractor), with 128 episodes per task. Each task has known-position, random-position, mixed, and combined splits, plus four cross-task mixtures, giving 16 named datasets:

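| Task | K (known) | R (random) | Mixed | Combined |
| --- | --- | --- | --- | --- |
| T1 | D1-K | D1-R | D1-Mixed | D1-Combined |
| T2 | D2-K | D2-R | D2-Mixed | D2-Combined |
| T3 | D3-K | D3-R | D3-Mixed | D3-Combined |

Cross-task (spanning all tasks): D_Universal · D_D2D3 · D_D1D2-Half · D_D1D3-Half (the last two are written D1+D2-Half and D1+D3-Half in the tables below).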

How to read the dataset names

For each base task we collected 128 episodes = 64 known + 64 random. From those we built four task-specific subsets (D1/D2/D3 stand for tasks T1/T2/T3):

| Name | What it contains | Episode count |
| --- | --- | --- |
| DX-K | Only the 64 known-position demos. | 64 |
| DX-R | Only the 64 random-position demos. | 64 |
| DX-Mixed | A 50/50 mix: 32 known + 32 random. | 64 |
| DX-Combined | Everything for that task: all 64 known + 64 random. | 128 |

The four cross-task mixtures combine those subsets across tasks:

| Name | Made from | Episode count |
| --- | --- | --- |
| D1+D2-Half | D1-Combined (128) + D2-Mixed (64): single-sponge plus a half-portion of multi-sponge. | 192 |
| D1+D3-Half | D1-Combined (128) + D3-Mixed (64): single-sponge plus a half-portion of distractor. | 192 |
| D_D2D3 | D2-Combined (128) + D3-Combined (128): multi-sponge plus distractor, no single-sponge. | 256 |
| D_Universal | D1 + D2 + D3 Combined: every demonstration we collected. | 384 |

So when a results table shows D1+D2-Half on ST2, it means: trained on all 128 single-sponge demos plus 64 multi-sponge demos, evaluated on the multi-sponge task. The "Half" suffix flags that only half of the secondary task's data is used — a deliberate test of how little auxiliary data is enough.
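The naming scheme above can be checked with a few lines of arithmetic. The following is a minimal sketch that rebuilds the episode counts of all 16 recipes from the stated 64 known + 64 random split; the function and variable names are illustrative and are not taken from the project code.

```python
# Episode counts per base task, as stated in the text: 64 known + 64 random.
KNOWN, RANDOM = 64, 64

def task_recipes(task: str) -> dict[str, int]:
    """Return the four task-specific subsets for one base task (e.g. 'D1')."""
    return {
        f"{task}-K": KNOWN,
        f"{task}-R": RANDOM,
        f"{task}-Mixed": KNOWN // 2 + RANDOM // 2,  # 32 known + 32 random
        f"{task}-Combined": KNOWN + RANDOM,         # all 128 episodes
    }

recipes: dict[str, int] = {}
for task in ("D1", "D2", "D3"):
    recipes.update(task_recipes(task))

# Cross-task mixtures are sums of the subsets above.
recipes["D1+D2-Half"] = recipes["D1-Combined"] + recipes["D2-Mixed"]
recipes["D1+D3-Half"] = recipes["D1-Combined"] + recipes["D3-Mixed"]
recipes["D_D2D3"] = recipes["D2-Combined"] + recipes["D3-Combined"]
recipes["D_Universal"] = sum(recipes[f"{t}-Combined"] for t in ("D1", "D2", "D3"))

for name, count in recipes.items():
    print(f"{name:12s} {count:4d} episodes")
```

The counts only describe recipe sizes; which specific episodes go into each Mixed subset is not stated in the text and is not modeled here.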

The dataset is open-sourced on Hugging Face: aswinkumar99/LeRobot-SO101-Pick-Place.

T1 — single sponge
[Video: T1 teleop demo, single blue sponge to bowl]
Pick one blue sponge, place in bowl. 128 episodes.
T2 — multi-sponge
[Video: T2 teleop demo, multiple blue sponges to bowl]
Pick every blue sponge in the scene (1–6 per scene) sequentially. 128 episodes.
T3 — clutter / distractors
[Video: T3 teleop demo, blue sponge among distractors]
Pick blue sponge from a cluttered scene. 128 episodes.

2. Train policies — three architectures, three training regimes

Each of the three architectures was trained on the same dataset recipes, but with a different pretraining/fine-tuning regime. This is the core architecture comparison.

ACT — trained from scratch

~84 M params · Transformer encoder–decoder + CVAE

  • No visual or language priors — every weight learned from this dataset.
  • Action Chunking Transformer (Zhao et al., ALOHA): predicts a horizon of actions per observation.
  • --policy.type=act · 60 000 steps · batch 32 · checkpoints every 15 000 steps.
  • Cameras: overhead + wrist (native names).

SmolVLA — full fine-tune

~450 M params · SmolVLM-2 (256 M) + action expert (~200 M)

  • Open-source VLA from Hugging Face, pretrained on community LeRobot datasets.
  • Full fine-tune of the entire stack (VLM backbone and action expert) on each dataset recipe.
  • --policy.path=lerobot/smolvla_base · 20 000 steps · batch 128 (universal) / 64 (splits) · save every 5 000.
  • Cameras renamed: overhead → camera1, wrist → camera2.

Pi0.5 — VLM-frozen fine-tune

~3.4 B params · PaliGemma (3 B) + flow-matching action head

  • Physical Intelligence π₀.₅ — cross-embodiment pretraining, flow-matching action head.
  • VLM backbone frozen via freeze_vision_encoder=true; only the action expert trains (train_expert_only=true).
  • --policy.path=lerobot/pi05_base · BF16 · 20 000 steps · batch 32 · LR 5e-5 · 1 000-step warmup.
  • Cameras renamed: overhead → base_0_rgb, wrist → right_wrist_0_rgb.
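Because the three regimes use different step counts and batch sizes, the optimization budgets are not identical. As a rough sketch, the number of training samples drawn per run is steps × batch size, using only the numbers listed above; the `Regime` class is a hypothetical helper and not part of LeRobot.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Regime:
    name: str
    steps: int
    batch_size: int

    @property
    def samples_seen(self) -> int:
        # Number of training samples drawn over the whole run.
        return self.steps * self.batch_size

regimes = [
    Regime("ACT (scratch)", 60_000, 32),
    Regime("SmolVLA (universal)", 20_000, 128),
    Regime("SmolVLA (splits)", 20_000, 64),
    Regime("Pi0.5 (expert only)", 20_000, 32),
]

for r in regimes:
    print(f"{r.name:22s} {r.samples_seen:>9,d} samples")
```

These counts are not directly comparable across architectures, since model size, pretraining, and the set of trainable weights all differ, but they make the fixed-budget caveat concrete.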

Training split — 202 GPU-hours across two clusters

Total: 202 GPU-hours

  • ACT: ≈ 3 h × 16 = 48 h
  • SmolVLA: ≈ 4.5 h × 16 = 72 h
  • Pi0.5: ≈ 6.6 h × 7 = 46 h
  • DiT (excluded): ≈ 9 h × 4 = 36 h

Blackwell pool — ACT & SmolVLA

6 × RTX PRO 6000 Blackwell · 26 h wall-clock · 156 GPU-hours

  • 32 models: ACT × 16 (~3 h each) + SmolVLA × 16 (~4.5 h each).
  • Docker image huggingface/lerobot-gpu, launched via train_matrix.sh / train_combo_matrix.sh.

H200 — Pi0.5

1 × NVIDIA H200 · 46 h wall-clock · 46 GPU-hours

  • 7 Pi0.5 models (3 single-dataset + 4 combo).
  • BF16, batch 32, action expert only — VLM backbone frozen.
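The GPU-hour budget can be reproduced from the per-run estimates. This is a small sketch of that arithmetic; it assumes the excluded DiT runs used the Blackwell pool, which the text does not state explicitly but which is what makes the 156 GPU-hour pool total add up.

```python
# (hours per run, number of runs), as listed in the breakdown above.
runs = {
    "ACT": (3.0, 16),
    "SmolVLA": (4.5, 16),
    "Pi0.5": (6.6, 7),
    "DiT (excluded)": (9.0, 4),
}
gpu_hours = {name: hours * count for name, (hours, count) in runs.items()}
total = sum(gpu_hours.values())

# Per-pool check. Assumption: the excluded DiT runs used the Blackwell pool.
blackwell_pool = 6 * 26  # 6 GPUs x 26 h wall-clock
h200_pool = 1 * 46       # 1 GPU x 46 h wall-clock
blackwell_jobs = (
    gpu_hours["ACT"] + gpu_hours["SmolVLA"] + gpu_hours["DiT (excluded)"]
)

print(f"total ~ {total:.0f} GPU-hours")
print(f"Blackwell: pool {blackwell_pool} h vs jobs {blackwell_jobs:.0f} h")
print(f"H200: pool {h200_pool} h vs Pi0.5 {gpu_hours['Pi0.5']:.1f} h")
```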

| Architecture | Pretraining | Training regime | Inference rate (RTX 4090 Mobile) |
| --- | --- | --- | --- |
| ACT | None | From scratch · 60 K steps | 30 Hz |
| SmolVLA | SmolVLM-2 + community LeRobot data | Full fine-tune · 20 K steps | ~30 Hz (occasional drops) |
| Pi0.5 | PaliGemma + cross-embodiment robot data | VLM frozen, action expert only · 20 K steps | ~6 Hz |

Four DiT models were also trained but excluded after pre-flight checks showed zero in-distribution success. ACT and SmolVLA both run near 30 Hz on the laptop GPU; only Pi0.5 is materially slower (~6 Hz), and that gap is relevant to its time-to-success numbers.

3. Evaluate rollouts — dual metrics on identical scenes

Each evaluated cell (one architecture × one dataset × one task) uses 16 trials: 8 Known + 8 Random. Identical pre-generated scenes are replayed across all models for fairness.

[Figure: the 6 marked known positions on the bench]
Known scenes use the 6 marked positions visible on the bench; random scenes place sponges uniformly off the marks.
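One way to get identical scenes for every model is to generate them once from a fixed seed. The sketch below illustrates the idea only: the mark coordinates, the unit-square bench, and the `MIN_DIST` clearance are hypothetical placeholders, not the real bench layout or the project's scene generator.

```python
import math
import random

# Hypothetical mark coordinates on a unit-square bench (not the real layout).
MARKS = [(0.2, 0.3), (0.5, 0.3), (0.8, 0.3), (0.2, 0.7), (0.5, 0.7), (0.8, 0.7)]
MIN_DIST = 0.08  # hypothetical clearance that counts as "off the marks"

def known_scene(rng: random.Random, n_sponges: int) -> list[tuple[float, float]]:
    """Place sponges on distinct marked positions."""
    return rng.sample(MARKS, n_sponges)

def random_scene(rng: random.Random, n_sponges: int) -> list[tuple[float, float]]:
    """Place sponges uniformly on the bench, rejecting points near a mark."""
    placed: list[tuple[float, float]] = []
    while len(placed) < n_sponges:
        p = (rng.random(), rng.random())
        if all(math.dist(p, m) >= MIN_DIST for m in MARKS):
            placed.append(p)
    return placed

def make_scenes(seed: int, n_sponges: int = 1):
    """Return 8 known and 8 random scenes; the same seed gives the same scenes."""
    rng = random.Random(seed)
    known = [known_scene(rng, n_sponges) for _ in range(8)]
    rand = [random_scene(rng, n_sponges) for _ in range(8)]
    return known, rand

known, rand = make_scenes(seed=0, n_sponges=2)
print(len(known), len(rand))
```

The sketch does not prevent two sponges from overlapping each other; a real generator would also need a sponge-to-sponge clearance check.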

K vs R split

  • Known (8 trials × 1.0) — sponges placed on the 6 marked positions used during teleop.
  • Random (8 trials × 1.5) — sponges placed uniformly anywhere on the bench, off the marks.
  • The 1.5× weight makes random performance the dominant signal, so a policy that memorizes marks caps at 8/20.
Score = (K_wins × 1.0) + (R_wins × 1.5)   →   maximum 20

The secondary metric is median time-to-success among successful trials — a proxy for decisiveness and trajectory quality.
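Both metrics are simple to compute from per-trial records. The following is a minimal sketch under the definitions above; the function names and the convention that a failed trial is recorded as `None` are illustrative assumptions.

```python
from statistics import median
from typing import Optional

K_WEIGHT, R_WEIGHT = 1.0, 1.5

def weighted_score(k_wins: int, r_wins: int) -> float:
    """Primary metric: known wins x 1.0 plus random wins x 1.5."""
    return k_wins * K_WEIGHT + r_wins * R_WEIGHT

def median_time_to_success(times_s: list[Optional[float]]) -> Optional[float]:
    """Secondary metric: median over successful trials only (None = failure)."""
    ok = [t for t in times_s if t is not None]
    return median(ok) if ok else None

print(weighted_score(8, 8))  # a perfect cell
print(weighted_score(8, 0))  # memorizes the marks, misses every random scene
print(median_time_to_success([None, 4.0, 6.0, None]))
```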

[Figure: iPad dashboard used during robot evaluation]
Custom iPad dashboard: launch rollouts, tag failure modes (G/C/S/P/D/O), and finish episodes early. It made 684 rollouts tractable in a weekend without transcription errors.

Failure-mode taxonomy

  • G — grasp miss (closed on air, knocked sponge, slipped)
  • C — color/object selection error (grabbed a distractor)
  • S — sequencing error (one or more sponges left in the scene)
  • P — placement miss (dropped outside bowl)
  • D — mid-air drop after successful grasp
  • O — other (timeout, safety stop)

One-tap tagging in the dashboard removed CSV bookkeeping; labels feed directly into the ST3/ST4 failure atlas.
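A taxonomy like this maps naturally onto an enum plus a counter. The sketch below shows one possible representation; the tag list is made-up example data, and the record format is an assumption, not the dashboard's actual schema.

```python
from collections import Counter
from enum import Enum

class Failure(Enum):
    G = "grasp miss"
    C = "color/object selection error"
    S = "sequencing error"
    P = "placement miss"
    D = "mid-air drop"
    O = "other"

def tally(labels: list[str]) -> Counter:
    """Count failure tags; an empty string marks a successful rollout."""
    return Counter(Failure[tag] for tag in labels if tag)

# Hypothetical tags for one 16-trial cell (not real data).
cell = ["", "G", "", "C", "G", "", "", "S", "", "", "G", "", "", "", "D", ""]
counts = tally(cell)
print({f.name: n for f, n in counts.items()})
```

Looking tags up with `Failure[tag]` raises a `KeyError` on any label outside the six codes, which catches typos at logging time.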


Results

All scored cells at a glance

Every cell evaluated in this study — 45 (architecture × dataset × task) combinations — summarized as a single score (Known × 1.0 + Random × 1.5). Cell color is a quick visual guide: strong, partial, weak, failing. Dashes mark untrained combinations. The subsections below walk through these results one phenomenon at a time.

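Each cell lists ACT / SmolVLA / Pi0.5; "-" marks an untrained or unevaluated combination.

| Dataset | ST1 (max 20) | ST2 (max 20) | ST3 (max 20) | ST4 (max 15) |
| --- | --- | --- | --- | --- |
| D1-K | 10 / 10 / - | - | - | - |
| D1-R | 17 / 13 / - | - | - | - |
| D1-Mixed | 12 / 17 / - | - | - | - |
| D1-Combined | 17 / 16.5 / 18.5 | 3.5 / 2.5 / 7.0 | - | - |
| D2-Combined | - | 15.5 / 14.5 / 17.0 | - | - |
| D3-Combined | - | - | 5.0 / 4.0 / 15.5 | - |
| D1+D2-Half | 15.5 / 20 / 17.5 | 18.5 / 13.0 / 17.0 | - | - |
| D1+D3-Half | 20 / 17 / 14.5 | - | 14.0 / 3.0 / 9.0 | - |
| D_D2D3 | - | - | - | 12.0 / 3.0 / 7.5 |
| D_Universal | 20 / 20 / 18.5 | 19.0 / 20 / 20 | 15.5 / 5.0 / 13.0 | 13.5 / 6.0 / 10.5 |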

ST4 has no dedicated training set and is evaluated only on the random-position split (10 trials × 1.5 = max 15), so its scores are not directly comparable to ST1–ST3. The analyses below walk through the dataset/architecture interactions cell by cell.
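To put ST4 on a roughly common footing with ST1–ST3, scores can be divided by each task's maximum. This is a minimal sketch using the ACT values from the D_Universal row; the helper name is illustrative.

```python
MAX_SCORE = {"ST1": 20.0, "ST2": 20.0, "ST3": 20.0, "ST4": 15.0}

def normalized(task: str, score: float) -> float:
    """Fraction of the task's maximum score, in [0, 1]."""
    return score / MAX_SCORE[task]

# D_Universal row of the score matrix, ACT column.
act_universal = {"ST1": 20.0, "ST2": 19.0, "ST3": 15.5, "ST4": 13.5}

for task, score in act_universal.items():
    print(task, round(normalized(task, score), 3))
```

Even after normalization the comparison is only approximate: ST4 uses random-position scenes only, while ST1–ST3 mix known and random scenes with different weights.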

Sample rollouts — one policy per task

One representative successful rollout for each of the four scoring tasks, overhead camera only. These are real policies running on the SO-ARM101, all played back at 4× speed for compactness.

ST1 — single sponge SmolVLA · D1+D2-Half · 4×
ST2 — multi-sponge SmolVLA · D_Universal · 4×
ST3 — clutter / distractors Pi0.5 · D3-Combined · 4×
ST4 — multi-sponge + distractors ACT · D_D2D3 · 4×

Many more rollouts are browseable in the full eval_gallery/.



Inference 1. Marked positions are a trap.

Training only on the 64 known-position demos (DX-K) lets a policy nail the marked positions, but it collapses on random ones. Both ACT and SmolVLA land at exactly 10/20 on ST1 from D1-K, just above the 8/20 that a policy earns by handling all 8 known trials and missing all 8 random ones. The pattern holds for both architectures trained on D1-K, so the failure lies in the training distribution, not the model class.

[Figure: ST1 known versus random overfitting plot]
Known-only training overfits marked positions: both ACT and SmolVLA score 10/20 on ST1.
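A weighted score on its own does not reveal the known/random split, so per-split win counts are what confirm mark memorization. As a small sketch, the function below enumerates every (known wins, random wins) pair consistent with a given score under the 1.0 / 1.5 weighting; the function name is illustrative.

```python
def decompositions(score: float, k_trials: int = 8, r_trials: int = 8):
    """All (known wins, random wins) pairs that yield the given weighted score."""
    return [
        (k, r)
        for k in range(k_trials + 1)
        for r in range(r_trials + 1)
        if k * 1.0 + r * 1.5 == score  # exact: all values are multiples of 0.5
    ]

print(decompositions(8.0))
print(decompositions(10.0))
```

Several different splits map to the same total, which is why the K and R win counts are worth reporting alongside the combined score.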

Inference 2. Single-task training does not compose to multi-object (ST2).

Why we expected it to: humans who can pick one sponge can trivially pick several — each additional sponge is the same skill, repeated. Large language models show analogous behavior: a model that learns to add two numbers usually generalizes to adding three. So a reasonable prior is that a policy fluent in single-sponge picking should compose into multi-sponge picking with little or no additional data.

What actually happened: a policy trained on all 128 single-sponge demos (D1-Combined) does not stitch together a multi-sponge rollout on its own. ST2 scores from D1-Combined are 3.5 (ACT), 2.5 (SmolVLA), 7.0 (Pi0.5). Adding just 64 multi-sponge episodes (D1+D2-Half) raises ACT to 18.5 and Pi0.5 to 17.0 — close to the full D_Universal result. SmolVLA recovers more partially (13.0) and only reaches ceiling with the full universal mixture. Small amounts of the right data beat larger amounts of the wrong data — though the T1 → ST3 and T1 → ST4 analogues weren't trained, so the claim is anchored to ST2.

[Figure: ST2 full results]
ST2 scores by dataset × architecture. D1+D2-Half recovers most of the universal benefit.
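"Recovers most of the universal benefit" can be made quantitative as the share of the D1-Combined to D_Universal gain that D1+D2-Half already achieves. The sketch below computes that ratio from the ST2 scores in the score matrix; the ratio itself is an illustrative summary, not a metric defined by the study.

```python
# ST2 scores taken from the score matrix.
st2 = {
    "D1-Combined": {"ACT": 3.5, "SmolVLA": 2.5, "Pi0.5": 7.0},
    "D1+D2-Half": {"ACT": 18.5, "SmolVLA": 13.0, "Pi0.5": 17.0},
    "D_Universal": {"ACT": 19.0, "SmolVLA": 20.0, "Pi0.5": 20.0},
}

def recovered_fraction(arch: str) -> float:
    """Share of the D1-Combined -> D_Universal gain obtained by D1+D2-Half."""
    base = st2["D1-Combined"][arch]
    half = st2["D1+D2-Half"][arch]
    full = st2["D_Universal"][arch]
    return (half - base) / (full - base)

for arch in ("ACT", "SmolVLA", "Pi0.5"):
    print(arch, round(recovered_fraction(arch), 2))
```

With 16 trials per cell these ratios carry substantial sampling noise, so they are best read as a rough ordering rather than precise estimates.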

Inference 3. Cross-task data helps.*

* Terms & conditions apply — the useful mixture differs by architecture, and we only tested this on ST1.

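Looking only at the ST1 cross-task cells:

| Dataset | ACT | SmolVLA | Pi0.5 |
| --- | --- | --- | --- |
| D1+D2-Half | 15.5 | 20 | 17.5 |
| D1+D3-Half | 20 | 17 | 14.5 |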

So the two VLAs lean toward extra multi-object demonstrations, while the from-scratch ACT leans toward extra distractor demonstrations. One plausible hypothesis: the VLAs already inherit strong visual priors from pretraining, so the marginal value of more data is in additional grasping/sequencing instances; ACT trains its visual encoder from scratch on this dataset alone, so distractor exposure helps it build a more discriminative visual representation. This is a hypothesis, not a measurement: each cell has only n = 16 trials, and the result is checked only on ST1.

[Figure: ST1 score by dataset and architecture]
ST1 score by dataset × architecture: ACT and the two VLAs prefer opposite auxiliary mixtures.

Inference 4. On single-target tasks (ST1, ST3), broader data makes policies more decisive.

Time-to-success is only meaningful on single-target tasks, where a faster successful rollout reasonably implies a more confident policy. On multi-object tasks (ST2, ST4) the total time depends on how many sponges get picked and in what order, so a longer run can mean more thoroughness rather than less decisiveness — we omit those.

We also drop Pi0.5 from this analysis: it runs at ~6 Hz on the laptop GPU vs ~30 Hz for ACT and SmolVLA, so its absolute times are dominated by control-loop rate rather than policy decisiveness. The two 30 Hz architectures are directly comparable.

[Figure: ST1 time-to-success plot]
ST1 median time-to-success: broader training reduces completion time.
[Figure: ST3 time-to-success plot]
ST3 median time-to-success: trend is much weaker than ST1.

| Task | Dataset | ACT | SmolVLA |
| --- | --- | --- | --- |
| ST1 | D1-K | 18.0 s | 21.0 s |
| ST1 | D1-Combined | 3.5 s | 24.0 s |
| ST1 | D1+D2-Half | 28.0 s | 7.0 s |
| ST1 | D_Universal | 5.5 s | 6.0 s |
| ST3 | D3-Combined | 31.5 s | 43.0 s |
| ST3 | D1+D3-Half | 56.0 s | 39.0 s |
| ST3 | D_Universal | 29.0 s | 38.5 s |

ST1 is clean for SmolVLA at the endpoints: 21 s (D1-K) → 6 s (D_Universal), although D1-Combined (24 s) is no faster than D1-K. ACT is noisier (the 3.5 s D1-Combined cell is suspicious, possibly due to a small successful-trial count), but D_Universal (5.5 s) is still faster than D1-K (18 s). On ST3 the trend is much weaker: ACT goes 31.5 s → 29 s and SmolVLA 43 s → 38.5 s, improvements of 2.5 s and 4.5 s that could easily be noise. So the "decisiveness" story is real on ST1 and at best directional on ST3.

Inference 5. Emergence of architectural differences — distractor tasks (ST3/ST4) finally pull the three policies apart, and the split looks like a vision-backbone story.

On the 144 scored ST3 rollouts: ACT 28/48, Pi0.5 29/48, SmolVLA 10/48. On the held-out 60 ST4 rollouts: ACT 17/20, Pi0.5 12/20, SmolVLA 6/20. The architectures that looked roughly tied on ST1/ST2 diverge sharply once visual distractors are introduced.

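Our hypothesis is that the ordering tracks how each architecture's vision-language backbone is treated during training:

  • ACT trains a scratch encoder specialized to this dataset.
  • Pi0.5 keeps a frozen pretrained VLM; only the action expert trains.
  • SmolVLA full-fine-tunes a smaller VLM on a few hundred episodes.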

So the two architectures that don't disturb a competent visual representation (ACT specializes one; Pi0.5 preserves one) do well; the one that disturbs without replacing lags. This is a hypothesis consistent with the ranking, not a controlled ablation — we did not train a frozen-backbone SmolVLA or a from-scratch Pi0.5 to isolate the effect.

[Figure: ST3 and ST4 success rates by architecture]
ST3/ST4 success rates by architecture: distractor tasks pull the three policies apart along their vision-backbone treatments.
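To gauge how much of the ST3 separation could be sampling noise, one option is a Wilson score interval on each pooled success rate. The sketch below applies the standard Wilson formula to the ST3 counts quoted above. It treats the 48 rollouts per architecture as independent trials with a common success probability, which is a simplifying assumption given that they pool three training datasets.

```python
import math

def wilson_interval(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z = 1.96 for ~95%)."""
    p = successes / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return center - half, center + half

# ST3 successes out of scored rollouts, per architecture.
st3 = {"ACT": (28, 48), "Pi0.5": (29, 48), "SmolVLA": (10, 48)}

for arch, (s, n) in st3.items():
    lo, hi = wilson_interval(s, n)
    print(f"{arch:8s} {s}/{n}  [{lo:.2f}, {hi:.2f}]")
```

Under these assumptions the SmolVLA interval sits below the ACT and Pi0.5 intervals, while ACT and Pi0.5 overlap almost entirely.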

Inference 6. Emergence of architectural differences — universal training isn't uniformly best on ST3/ST4.

Pulling the ST3/ST4 cells from the score matrix:

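| Task | Dataset | ACT | SmolVLA | Pi0.5 |
| --- | --- | --- | --- | --- |
| ST3 (max 20) | D3-Combined | 5.0 | 4.0 | 15.5 |
| ST3 (max 20) | D1+D3-Half | 14.0 | 3.0 | 9.0 |
| ST3 (max 20) | D_Universal | 15.5 | 5.0 | 13.0 |
| ST4 (max 15) | D_D2D3 | 12.0 | 3.0 | 7.5 |
| ST4 (max 15) | D_Universal | 13.5 | 6.0 | 10.5 |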

Pi0.5 scores higher with in-task data on ST3 (D3-Combined 15.5 > D_Universal 13.0), while ACT does the opposite (D3-Combined 5.0 → D_Universal 15.5). The Pi0.5 gap of 2.5 points is within the roughly 3-point single-cell noise noted under Limitations, whereas the ACT gap of 10.5 points is not. This fits the Inference 5 hypothesis: Pi0.5's frozen visual backbone already knows "blue sponge", so adding T1/T2 single-color demos just dilutes its action expert; ACT's scratch-trained vision needs every bit of cross-task exposure.

The failure vocabulary also shifts from grasp-bound (ST3) to color-selection-bound (ST4), most sharply for Pi0.5:

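| Task | Architecture | Successes | Grasp (G) | Color (C) | Other |
| --- | --- | --- | --- | --- | --- |
| ST3 | ACT | 28 / 48 | 15 | 4 | D=1 |
| ST3 | SmolVLA | 10 / 48 | 24 | 14 | - |
| ST3 | Pi0.5 | 29 / 48 | 10 | 8 | D=1 |
| ST4 | ACT | 17 / 20 | 2 | 1 | - |
| ST4 | SmolVLA | 6 / 20 | 7 | 6 | S=1 |
| ST4 | Pi0.5 | 12 / 20 | 1 | 6 | S=1 |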

So distractor robustness is genuinely a distinct capability from clean pick-and-place — it has to be trained and evaluated explicitly, and the bottleneck (grasping vs selection) differs by architecture.
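The shift in failure vocabulary can be summarized as the share of each cell's failures that are color/object selection errors. This is a minimal sketch over the counts in the failure table above; the `color_share` helper is illustrative.

```python
# (grasp, color, other) failure counts from the failure table above.
failures = {
    ("ST3", "ACT"): (15, 4, 1),
    ("ST3", "SmolVLA"): (24, 14, 0),
    ("ST3", "Pi0.5"): (10, 8, 1),
    ("ST4", "ACT"): (2, 1, 0),
    ("ST4", "SmolVLA"): (7, 6, 1),
    ("ST4", "Pi0.5"): (1, 6, 1),
}

def color_share(task: str, arch: str) -> float:
    """Fraction of a cell's failures that are color/object selection errors."""
    grasp, color, other = failures[(task, arch)]
    return color / (grasp + color + other)

for arch in ("ACT", "SmolVLA", "Pi0.5"):
    st3, st4 = color_share("ST3", arch), color_share("ST4", arch)
    print(f"{arch:8s} ST3 {st3:.2f} -> ST4 {st4:.2f}")
```

The ST4 failure counts are small (3, 14, and 8 failures), so the shares for that task are coarse.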

Inference 7. Miscellaneous observations — shape generalized, but sponge material state didn't.

A positive surprise: although training used regular cuboidal sponges, the learned policy could still pick up an irregularly shaped blue sponge at inference. So it isn't memorizing the exact outline.

[Figures: regular blue sponge used during training; irregular blue sponge shape used at inference]
Training shape (left) vs. an irregular blue sponge that the policy still picked up (right).

The harder story was material state. A 30-second diagnostic clip showed that after a sponge was left out during robot work, it stiffened enough that the same pinch strategy used successfully in ST1/ST2 could make it pop out of the gripper. That gives a concrete physical explanation for the elevated grasp-miss counts on ST3/ST4 — alongside the color-selection errors.

[Figure: side-by-side of pressing a stiff sponge vs a fresh sponge by hand]
Stiff vs. fresh sponge, pressed by hand (4–8 s and 8–12 s of the diagnostic clip).

Discussion and Limitations

01. Data dominates architecture on the easy tasks.

On ST1, both architectures trained on known-only data (ACT and SmolVLA) overfit the marked positions in the same way, and on ST1 and ST2 broader mixtures lift all three architectures to or near ceiling. The choice of architecture is largely irrelevant when the dataset already covers the relevant variation.

02. Architecture re-emerges once distractors appear.

On ST3/ST4 the three policies finally separate: ACT and Pi0.5 stay competitive while SmolVLA collapses. The split tracks how each architecture treats its vision-language backbone — ACT trains a scratch encoder specialized to this dataset, Pi0.5 keeps a frozen pretrained VLM, and SmolVLA full-fine-tunes a smaller VLM. The two that don't disturb a competent visual representation do well.

03. Composition needs direct evidence.

Single-object demonstrations teach useful primitives but do not reliably compose to multi-object sequencing — adding even 64 multi-sponge episodes recovers most of the universal benefit.

04. Failure vocabulary shifts with the task.

ST1 failures are mostly grasp misses; ST2 introduces sequencing/drops; ST3 is still grasp-heavy; ST4 becomes color-selection-bound, most sharply for Pi0.5. Distractor robustness is a distinct capability from clean pick-and-place.

05. A small manual assist unsticks multi-object retries.

On ST2/ST4, when a policy missed a grasp it often re-attempted the same object repeatedly instead of moving on. Nudging the object slightly closer to the gripper — a small physical assist — let the policy finish the rest of the rollout cleanly. We don't fully understand why this works: the policy isn't stuck in a degenerate action sequence (its other behaviors are intact), and the assist is too small to change the visual scene meaningfully. One possibility is that the policy's grasp distribution has a sharp basin around demonstration positions, and the assist nudges the object back into that basin.

SmolVLA · D_Universal on ST2 (4× speed). A small assist mid-rollout lets the policy complete the second sponge.
[Figure: failure mode distribution]
Failure modes shift as the task changes from single-object grasping to multi-object sequencing to distractor selection.

Limitations.

  • Statistical power: 8 K + 8 R trials per cell, so any single-cell gap of ≤ 2 trial flips (≈ 3 points) is within sampling noise.
  • Vision-backbone hypothesis is not ablated: the scratch / frozen / full-FT story (Inf 5) fits the ranking, but we did not train a frozen-backbone SmolVLA or a scratch-Pi0.5 to isolate the effect.
  • Coverage gaps: the K-only overfitting cell (Inf 1) was trained only for D1 on ST1; the T1 → ST3 / T1 → ST4 single-task composition analogues of Inf 2 were never trained.
  • Time-to-success interpretation: only meaningful on single-target tasks (ST1, ST3); Pi0.5's ~6 Hz control loop confounds its absolute times even there.
  • Fixed checkpoint budget: a fixed step count can favor faster-converging architectures.
  • Single setup: one table, lighting condition, sponge family, and gripper, so deployment claims beyond this setup are hypotheses.
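The statistical-power caveat can be made concrete with a simple binomial model of one 8 K + 8 R cell. The sketch below computes the standard deviation of the weighted score, assuming every trial succeeds independently with the same probability p on both known and random scenes. That assumption is a simplification, not something the study measured.

```python
import math

def score_std(p: float, k_trials: int = 8, r_trials: int = 8) -> float:
    """Std. dev. of the weighted score if every trial succeeds independently
    with probability p (simplifying assumption: same p for K and R)."""
    var_k = k_trials * p * (1 - p) * 1.0**2  # known wins weigh 1.0 each
    var_r = r_trials * p * (1 - p) * 1.5**2  # random wins weigh 1.5 each
    return math.sqrt(var_k + var_r)

for p in (0.5, 0.75, 0.9):
    print(p, round(score_std(p), 2))
```

Under this model a mid-range policy has a score standard deviation of roughly 2 to 3 points, which is in line with treating gaps of about 3 points between single cells as noise.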

Conclusion

Compositional generalization in this robot imitation learning setting does not emerge automatically from architecture scale or VLA pretraining. It appears when the training data contains the right kinds of variation: marked-position demonstrations produce brittle shortcuts, single-task demonstrations do not reliably compose, and small cross-task mixtures can close much of the gap.

But "data dominates architecture" is only the easy-task story. Once distractors enter (ST3/ST4), the three architectures finally separate, and the ordering tracks how each treats its vision-language backbone: ACT trains a scratch encoder specialized to this dataset, Pi0.5 keeps a frozen pretrained VLM, and SmolVLA full-fine-tunes a smaller VLM. The two that don't disturb a competent visual representation do well; the one that disturbs without enough data to rebuild it lags. We don't claim this as a controlled finding — only a hypothesis consistent with the ranking — but it suggests that how a VLA is fine-tuned may matter as much as whether it has a VLA backbone at all.

Two further constraints follow from the distractor atlas: color- and object-selection robustness must be trained and evaluated explicitly, because it does not reduce to clean pick-and-place success; and Pi0.5's preference for in-task data on ST3 (over the universal mixture) shows that "more data" is not always the right answer for architectures with strong pretrained priors.

Lessons Learned

Beyond the headline metrics, the project produced practical lessons about running real robot learning experiments end-to-end.

  • Evaluation UX matters: good tooling increases data quality. The iPad dashboard made hundreds of rollouts feasible by removing CSV bookkeeping and letting failures be tagged immediately.
  • Data bugs look like model bugs: marked-position overfitting initially looked like architecture failure until the K/R split exposed the dataset shortcut.
  • Failure labels are a microscope: success rate alone hid the difference between grasping, sequencing, dropping, and wrong-color selection — and surfaced the G→C shift between ST3 and ST4.
  • Small data can be decisive: D1+D2-Half showed that a small amount of the right multi-object data can beat larger but incomplete data.
  • More data isn't always better: Pi0.5 scores higher with D3-Combined than D_Universal on ST3 — strong pretrained vision priors can be diluted by extra off-task demos.
  • How a VLA is fine-tuned matters: full fine-tuning a smaller VLM on a few hundred episodes (SmolVLA) underperforms both a frozen pretrained VLM (Pi0.5) and a scratch-trained encoder (ACT) on distractor tasks. The vision pathway's training regime is itself a design choice.
  • Inference speed only sometimes matters: ACT and SmolVLA both run near 30 Hz; only Pi0.5 is meaningfully slower (~6 Hz). Time-to-success comparisons across architectures are only clean when control-loop rate matches.
  • Time-to-success doesn't mean the same thing on every task: on single-target tasks it's a decisiveness signal; on multi-object tasks it confounds policy quality with rollout length, so we restricted that analysis to ST1 and ST3.
  • Infrastructure is part of the experiment: lighting, camera naming, video encoding, reset timing, and partial recordings all affected what could be measured reliably.
  • Objects have state: the sponge itself changed as it dried and stiffened, turning a previously reliable pinch into a different contact problem.

Acknowledgements

This work was carried out as the final project for Prof. Mike Hagenow's CS839 "Topics in Advanced Robotics" course at the University of Wisconsin–Madison. Thank you to Prof. Hagenow for the course, feedback, and support that made the real-robot evaluation possible.

The training stack and SO-ARM101 dataset tooling are built on the open-source LeRobot project from Hugging Face — thanks to the community for the policies, checkpoints, and dataset conventions.

The GPU-hours on 6 × RTX PRO 6000 Blackwell cards that trained the ACT and SmolVLA matrices were provided by CloudRift as part of their open-research GPU credits program. CloudRift's hourly Blackwell access made it practical to run the 32 training jobs in parallel across the six GPUs and finish the full architecture × dataset matrix in a single weekend.

BibTeX

@misc{aswinkumar2026datacomposition,
  title  = {Data Composition vs Architecture in Robot Imitation Learning},
  author = {Aswinkumar},
  year   = {2026},
  note   = {CS839 Final Project, University of Wisconsin-Madison}
}