Learning to Act Robustly with
View-Invariant Latent Actions

1Graduate School of Data Science, Seoul National University
2Department of Electrical and Computer Engineering, Seoul National University
*Equal contribution, †Corresponding author

Abstract

Vision-based robotic policies often struggle with even minor viewpoint changes, underscoring the need for view-invariant visual representations. This challenge becomes more pronounced in real-world settings, where viewpoint variability is unavoidable and can significantly disrupt policy performance. Existing methods typically learn invariance from multi-view observations at the scene level, but such approaches rely on visual appearance and fail to incorporate the physical dynamics essential for robust generalization.

We propose View-Invariant Latent Action (VILA), which models a latent action capturing transition patterns across trajectories to learn view-invariant representations grounded in physical dynamics. VILA aligns these latent actions across viewpoints using an action-guided objective based on ground-truth action sequences. Experiments in both simulation and the real world show that VILA-based policies generalize effectively to unseen viewpoints and transfer well to new tasks, establishing VILA as a strong pretraining framework that improves robustness and downstream learning performance.

VILA Overview

VILA Overview. Our method learns view-invariant latent actions by aligning them across viewpoints with action-aware contrastive learning while also predicting future observations.

Method

Our proposed framework consists of two stages: (i) Latent Action Learning, where we learn a compact, action-guided, and view-invariant dynamics representation, and (ii) Latent Behavior Cloning, where we train a latent policy that predicts these latent actions.

1. Latent Action Learning

We build on Latent Action Models (LAMs). The core insight is that invariance should be enforced not on static, scene-level visual representations, but on the dynamics themselves. To achieve this, we introduce two objectives (a minimal sketch follows this list):

  • Action-Aware Contrastive Objective: Latent actions inferred from different viewpoints are pulled together if their corresponding ground-truth future action sequences are similar, and pushed apart otherwise.
  • Structural Alignment: We align the global similarity structure of the latent action space with that of the ground-truth action space.
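
As a rough illustration, the sketch below shows one way these two objectives could be implemented in PyTorch. It is not the authors' code: the batch layout, the similarity threshold, the soft-label contrastive formulation, and the MSE-based structural term are our assumptions.

import torch.nn.functional as F

def vila_alignment_losses(z_view_a, z_view_b, gt_actions,
                          temperature=0.1, sim_threshold=0.9):
    """Hypothetical sketch of the two alignment objectives (not the official code).

    z_view_a, z_view_b : (B, D) latent actions inferred from two viewpoints
                         of the same batch of transitions.
    gt_actions         : (B, T * A) flattened ground-truth future action sequences.
    """
    za = F.normalize(z_view_a, dim=-1)
    zb = F.normalize(z_view_b, dim=-1)
    logits = za @ zb.t() / temperature                   # cross-view latent-action similarities

    # Action-aware contrastive objective: pairs whose ground-truth action
    # sequences are similar act as (soft) positives across viewpoints.
    acts = F.normalize(gt_actions, dim=-1)
    action_sim = acts @ acts.t()                         # (B, B) ground-truth similarity
    positives = (action_sim > sim_threshold).float()
    targets = positives / positives.sum(dim=-1, keepdim=True)
    contrastive = -(targets * F.log_softmax(logits, dim=-1)).sum(dim=-1).mean()

    # Structural alignment: match the global similarity structure of the
    # latent action space to that of the ground-truth action space.
    latent_sim = za @ za.t()
    structural = F.mse_loss(latent_sim, action_sim)

    return contrastive, structural

In a full pipeline, these two terms would presumably be weighted and added to the LAM's future-prediction loss.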

2. Latent Behavior Cloning

We train a latent policy to predict the learned latent actions from current observations. This policy serves as a view-robust vision encoder that conditions the downstream visuomotor policy.
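
As a sketch only, the snippet below shows one plausible form of this stage: a small visual policy regresses the latent action produced by the frozen latent action model. The backbone architecture, the MSE loss, and the assumption that the frozen model is queried on observation pairs are placeholders rather than the authors' implementation.

import torch
import torch.nn as nn
import torch.nn.functional as F

class LatentPolicy(nn.Module):
    """Predicts the latent action from the current observation (placeholder backbone)."""
    def __init__(self, latent_dim=32):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 32, kernel_size=4, stride=2), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=4, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        self.head = nn.Linear(64, latent_dim)

    def forward(self, obs):                  # obs: (B, 3, H, W)
        return self.head(self.backbone(obs))

def latent_bc_step(policy, frozen_lam, obs_t, obs_t_next, optimizer):
    """One latent behavior-cloning step: regress the frozen model's latent action."""
    with torch.no_grad():
        target = frozen_lam(obs_t, obs_t_next)   # latent action inferred from the transition
    pred = policy(obs_t)                         # predicted from the current frame alone
    loss = F.mse_loss(pred, target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()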

Experimental Results

1. Unseen View Generalization (Simulation)

We compare VILA against two baselines (Vanilla, CLASS) across various simulation tasks. VILA consistently succeeds in unseen viewpoints, while baselines often fail.

Qualitative rollout videos for four tasks (Square, Stack Three, Coffee, Mug Cleanup), each comparing VILA, Vanilla, and CLASS from an unseen viewpoint.

Quantitative Results

Success rates of VILA and baseline methods with respect to angular differences from training viewpoints.


2. Real-World Experiments

We validate our method on a real-world SO-ARM101 robot for Pick & Place and Drawer tasks. The figure below shows the setup for unseen views used in evaluation.

Real-World Unseen Views

Overview of the unseen camera viewpoints for Pick & Place (Top) and Drawer (Bottom) tasks.


VILA achieves significantly higher success rates on these unseen views compared to baselines. The table below summarizes the success rates (%) on real-world tasks.

Model         | Pick & Place                    | Drawer
              | View 1   View 2   View 3   Avg. | View 1   View 2   Avg.
VILA (Ours)   | 70.00    80.00    40.00   63.33 | 80.00    90.00   85.00
Vanilla       |  0.00     0.00    10.00    3.33 |  0.00     0.00    0.00
CLASS         | 10.00    30.00     0.00   13.33 |  0.00     0.00    0.00

Qualitative Result: Drawer Task (Unseen View)

We compare VILA against baselines on the Drawer task. We provide videos from two angles (Above, Below). VILA successfully opens the drawer, while baselines fail to align properly.

Rollout videos (Above and Below angles): VILA (Ours): Success; Vanilla: Fail; CLASS: Fail.


3. Unseen Task Adaptation

We investigate whether representations learned on one dataset (Stack Three) provide useful priors for another task (Coffee). As shown in the graph, VILA provides a stronger prior than baselines, enabling data-efficient adaptation.

Task Adaptation

4. Representation Analysis

We analyze the quality of learned representations using qualitative visualizations (UMAP). The View UMAP is colored by camera ID (to check invariance), and the Action UMAP is colored by ground-truth action clusters (to check dynamics structure).
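
For readers who want to produce a similar plot from their own latents, a minimal sketch with umap-learn and matplotlib is shown below; the input arrays, UMAP parameters, and variable names are placeholders, not the settings used in the paper.

import umap
import matplotlib.pyplot as plt

def plot_latent_umap(latents, labels, title):
    """Project latent actions to 2-D with UMAP and color by a label array
    (camera ID for the view plot, action-cluster ID for the action plot)."""
    emb = umap.UMAP(n_neighbors=15, min_dist=0.1, random_state=0).fit_transform(latents)
    plt.figure(figsize=(4, 4))
    plt.scatter(emb[:, 0], emb[:, 1], c=labels, cmap="tab10", s=3)
    plt.title(title)
    plt.axis("off")
    plt.show()

# Usage (placeholder arrays): well-mixed colors in the view plot suggest
# view invariance; separated colors in the action plot suggest that the
# dynamics structure is preserved.
# plot_latent_umap(latent_actions, camera_ids, "View UMAP")
# plot_latent_umap(latent_actions, action_cluster_ids, "Action UMAP")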

View Invariance (Colored by View ID)
View UMAP Before

Before Policy Training

View UMAP After

After Policy Training

Colors represent different camera views. VILA shows highly mixed distributions (view invariance) in both stages.

Action Semantics (Colored by Action Cluster)
Action UMAP Before

Before Policy Training

Action UMAP After

After Policy Training

Colors represent ground-truth action clusters. VILA maintains clear cluster structures (dynamics awareness).

Conclusion

We introduced VILA, a pretraining framework that enforces invariance on latent actions instead of scene-level visual features. By aligning the latent action space with control dynamics, VILA achieves consistent gains in unseen-view generalization and data-efficient task adaptation. This suggests that targeting invariance at the level of dynamics is a promising direction for robust visuomotor policies.

BibTeX


@misc{jeong2026learningactrobustlyviewinvariant,
      title={Learning to Act Robustly with View-Invariant Latent Actions}, 
      author={Youngjoon Jeong and Junha Chun and Taesup Kim},
      year={2026},
      eprint={2601.02994},
      archivePrefix={arXiv},
      primaryClass={cs.RO},
      url={https://arxiv.org/abs/2601.02994}, 
}