Vision-based robotic policies often struggle with even minor viewpoint changes, underscoring the need for view-invariant visual representations. This challenge becomes more pronounced in real-world settings, where viewpoint variability is unavoidable and can significantly disrupt policy performance. Existing methods typically learn invariance from multi-view observations at the scene level, but such approaches rely on visual appearance and fail to incorporate the physical dynamics essential for robust generalization.
We propose View-Invariant Latent Action (VILA), which models a latent action capturing transition patterns across trajectories to learn view-invariant representations grounded in physical dynamics. VILA aligns these latent actions across viewpoints using an action-guided objective based on ground-truth action sequences. Experiments in both simulation and the real world show that VILA-based policies generalize effectively to unseen viewpoints and transfer well to new tasks, establishing VILA as a strong pretraining framework that improves robustness and downstream learning performance.
Our proposed framework consists of two stages: (i) Latent Action Learning, where we learn a compact, action-guided and view-invariant dynamics representation, and (ii) Latent Behavior Cloning, where we train a latent policy that predicts these latent actions.
We build on Latent Action Models (LAMs). The core insight is that invariance should be enforced not on a static, scene-level visual representation, but on the dynamics of transitions. To achieve this, we introduce an action-guided objective that uses ground-truth action sequences to align latent actions across viewpoints.
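As a rough illustration, the sketch below shows one way this latent action learning stage could be implemented: an inverse-dynamics encoder infers a latent action from a transition, a forward-dynamics decoder reconstructs the next observation, and an alignment term pulls together latent actions from two viewpoints of the same ground-truth transition. The module sizes, loss form, and names (`LatentActionModel`, `vila_style_losses`) are illustrative assumptions, not the paper's exact architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LatentActionModel(nn.Module):
    """Illustrative latent action model: IDM encoder + FDM decoder.

    A sketch of the general LAM recipe, not the exact VILA architecture;
    backbones, dimensions, and loss weights are assumptions.
    """

    def __init__(self, obs_dim=512, latent_dim=32):
        super().__init__()
        # Inverse dynamics model: infer a latent action from a transition.
        self.idm = nn.Sequential(
            nn.Linear(2 * obs_dim, 256), nn.ReLU(),
            nn.Linear(256, latent_dim),
        )
        # Forward dynamics model: predict the next observation feature
        # from the current observation feature and the latent action.
        self.fdm = nn.Sequential(
            nn.Linear(obs_dim + latent_dim, 256), nn.ReLU(),
            nn.Linear(256, obs_dim),
        )

    def forward(self, o_t, o_next):
        z = self.idm(torch.cat([o_t, o_next], dim=-1))       # latent action
        o_next_hat = self.fdm(torch.cat([o_t, z], dim=-1))   # reconstruction
        return z, o_next_hat


def vila_style_losses(model, o_t_a, o_next_a, o_t_b, o_next_b, align_weight=1.0):
    """Reconstruction loss per view plus a cross-view alignment term.

    o_*_a / o_*_b are features of the same transition seen from two cameras.
    Because both views share the same ground-truth action sequence, their
    latent actions are encouraged to match (L2 alignment is assumed here;
    the paper's action-guided objective may take a different form).
    """
    z_a, pred_a = model(o_t_a, o_next_a)
    z_b, pred_b = model(o_t_b, o_next_b)
    recon = F.mse_loss(pred_a, o_next_a) + F.mse_loss(pred_b, o_next_b)
    align = F.mse_loss(z_a, z_b)
    return recon + align_weight * align
```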
We train a latent policy to predict the learned latent actions from current observations. This policy serves as a view-robust vision encoder that conditions the downstream visuomotor policy.
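A companion sketch of the latent behavior cloning stage, again under assumed module names and shapes: the latent policy regresses the latent action inferred by the (frozen) latent action model from the previous sketch, and its encoder is then reused as the view-robust vision encoder for the downstream visuomotor policy.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LatentPolicy(nn.Module):
    """Illustrative latent policy: predict latent actions from observations.

    After training, its encoder serves as a view-robust feature extractor
    for the downstream visuomotor policy (exact head design is assumed).
    """

    def __init__(self, obs_dim=512, latent_dim=32):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(obs_dim, 256), nn.ReLU(),
            nn.Linear(256, 256), nn.ReLU(),
        )
        self.latent_head = nn.Linear(256, latent_dim)

    def forward(self, o_t):
        return self.latent_head(self.encoder(o_t))


def latent_bc_loss(latent_policy, lam, o_t, o_next):
    """Regress the latent action inferred by the frozen latent action model.

    `lam` is a trained LatentActionModel from the previous sketch.
    """
    with torch.no_grad():
        z_target, _ = lam(o_t, o_next)
    z_pred = latent_policy(o_t)
    return F.mse_loss(z_pred, z_target)
```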
We compare VILA against two baselines (Vanilla, CLASS) across various simulation tasks. VILA consistently succeeds in unseen viewpoints, while baselines often fail.
Rollout videos for each simulation task, comparing VILA, Vanilla, and CLASS.
Success rates of VILA and baseline methods with respect to angular differences from training viewpoints.
We validate our method on a real-world SO-ARM101 robot for Pick & Place and Drawer tasks. The figure below shows the setup for unseen views used in evaluation.
Overview of the unseen camera viewpoints for Pick & Place (Top) and Drawer (Bottom) tasks.
VILA achieves significantly higher success rates on these unseen views compared to baselines. The table below summarizes the success rates (%) on real-world tasks.
| Model | Pick & Place View 1 | Pick & Place View 2 | Pick & Place View 3 | Pick & Place Avg. | Drawer View 1 | Drawer View 2 | Drawer Avg. |
|---|---|---|---|---|---|---|---|
| VILA (Ours) | 70.00 | 80.00 | 40.00 | 63.33 | 80.00 | 90.00 | 85.00 |
| Vanilla | 0.00 | 0.00 | 10.00 | 3.33 | 0.00 | 0.00 | 0.00 |
| CLASS | 10.00 | 30.00 | 0.00 | 13.33 | 0.00 | 0.00 | 0.00 |
We compare VILA against baselines on the Drawer task. We provide videos from two angles (Above, Below). VILA successfully opens the drawer, while baselines fail to align properly.
VILA (Ours): Success · Vanilla: Fail · CLASS: Fail
We investigate whether representations learned on one dataset (Stack Three) provide useful priors for another task (Coffee). As shown in the graph, VILA provides a stronger prior than baselines, enabling data-efficient adaptation.
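For context, a hypothetical sketch of how such a transfer experiment could be set up: reuse the encoder of a latent policy pretrained on the source task and train a fresh action head on target-task demonstrations. Whether the encoder is frozen, and the head's shape, are assumptions for illustration, not the paper's exact protocol.

```python
import torch.nn as nn

def build_transfer_policy(pretrained_policy, action_dim=7, freeze_encoder=True):
    """Wrap a pretrained latent policy's encoder with a fresh action head.

    `pretrained_policy` is a LatentPolicy from the earlier sketch; its
    encoder outputs 256-d features in that sketch.
    """
    if freeze_encoder:
        for p in pretrained_policy.encoder.parameters():
            p.requires_grad_(False)
    head = nn.Sequential(
        nn.Linear(256, 128), nn.ReLU(),
        nn.Linear(128, action_dim),
    )
    return nn.Sequential(pretrained_policy.encoder, head)
```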
We analyze the quality of learned representations using qualitative visualizations (UMAP). The View UMAP is colored by camera ID (to check invariance), and the Action UMAP is colored by ground-truth action clusters (to check dynamics structure).
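A small sketch of how such plots could be generated with the umap-learn package, assuming latent action embeddings, camera IDs, and action-cluster labels are available as NumPy arrays; function and variable names are illustrative.

```python
import umap  # umap-learn package
import matplotlib.pyplot as plt

def plot_umaps(latents, camera_ids, action_clusters):
    """Project latent actions to 2D and color by camera view and by action cluster."""
    emb = umap.UMAP(n_components=2, random_state=0).fit_transform(latents)

    fig, axes = plt.subplots(1, 2, figsize=(10, 4))
    # View UMAP: well-mixed colors suggest view invariance.
    axes[0].scatter(emb[:, 0], emb[:, 1], c=camera_ids, cmap="tab10", s=4)
    axes[0].set_title("Colored by camera ID")
    # Action UMAP: clear clusters suggest the dynamics structure is preserved.
    axes[1].scatter(emb[:, 0], emb[:, 1], c=action_clusters, cmap="viridis", s=4)
    axes[1].set_title("Colored by action cluster")
    plt.tight_layout()
    plt.show()
```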
View UMAP, before and after policy training. Colors represent different camera views; VILA shows highly mixed distributions (view invariance) in both stages.
Action UMAP, before and after policy training. Colors represent ground-truth action clusters; VILA maintains clear cluster structure (dynamics awareness).
We introduced VILA, a pre-training framework that enforces invariance on latent actions instead of scene-level visual features. By aligning latent spaces with control dynamics, VILA achieves consistent gains in unseen-view generalization and data-efficient task adaptation. This suggests that targeting invariance at the level of dynamics is a promising direction for robust visuomotor policies.
@misc{jeong2026learningactrobustlyviewinvariant,
title={Learning to Act Robustly with View-Invariant Latent Actions},
author={Youngjoon Jeong and Junha Chun and Taesup Kim},
year={2026},
eprint={2601.02994},
archivePrefix={arXiv},
primaryClass={cs.RO},
url={https://arxiv.org/abs/2601.02994},
}