PoLAR: Factorizing Extent and Mode in Latent Actions for Robot Policy Learning

Youngjoon Jeong, Jihwan Yu, Minsoo Jo, Junha Chun, Taesup Kim*
Seoul National University
*Corresponding author

PoLAR learns latent actions from observation pairs by separating transition extent from transition mode.

Abstract

Latent action pretraining learns representations of visual change from pairs of observations, but existing methods typically encode each transition as a single unstructured representation that entangles transition extent and transition mode. We introduce Polar Latent Actions with Radial structure (PoLAR), which imposes a radial-direction structure on latent actions, encouraging radius to encode transition extent and direction to retain transition mode.

PoLAR uses the temporal offset between two observations as a weak proxy for transition extent, encouraging latent actions from observation pairs separated by larger temporal gaps to occupy larger radii. We instantiate this structure in hyperbolic space, whose volume expands with radius and therefore offers a natural fit for more diverse transition modes at larger extents. Across in-task and large-scale pretraining settings, PoLAR improves downstream policy performance in simulation and real-world robot experiments, outperforming latent action baselines and strong pretrained VLAs.
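To make the radial-direction idea concrete, the following is a minimal sketch, not the authors' implementation. It assumes a continuous Euclidean latent vector, splits it into a radius (its norm) and a unit direction, and applies a pairwise hinge loss that penalizes a pair whenever the sample with the larger temporal offset does not also have the larger radius. The function names `to_polar` and `radial_ranking_loss`, the `margin` value, and the toy data are all illustrative assumptions; this page does not specify PoLAR's actual objective.

```python
import numpy as np


def to_polar(z, eps=1e-8):
    """Split latents z of shape (N, D) into radii (N,) and unit directions (N, D)."""
    z = np.asarray(z, dtype=float)
    radius = np.linalg.norm(z, axis=-1)
    direction = z / np.maximum(radius, eps)[..., None]
    return radius, direction


def radial_ranking_loss(radius, offset, margin=0.1):
    """Hinge loss over all pairs (i, j) with offset[i] > offset[j].

    A pair is penalized when radius[i] does not exceed radius[j] by at
    least `margin`. Pairs with equal offsets are ignored.
    """
    radius = np.asarray(radius, dtype=float)
    offset = np.asarray(offset)
    # larger[i, j] is True when sample i has the larger temporal offset.
    larger = offset[:, None] > offset[None, :]
    if not larger.any():
        return 0.0
    # gap[i, j] = radius[i] - radius[j]
    gap = radius[:, None] - radius[None, :]
    hinge = np.maximum(0.0, margin - gap)
    return float(hinge[larger].mean())


# Toy latents (assumed values): same direction, temporal offsets of 1, 2, 4 frames.
z = np.array([[0.3, 0.4], [0.6, 0.8], [1.2, 1.6]])
offset = np.array([1, 2, 4])

radius, direction = to_polar(z)
ordered_loss = radial_ranking_loss(radius, offset)         # radii follow the offsets
reversed_loss = radial_ranking_loss(radius[::-1], offset)  # radii contradict the offsets
```

In this toy case the loss vanishes when radii are ordered like the offsets and becomes positive when the order is reversed, while all three latents share one direction. Because temporal offset is only a weak proxy for extent, a soft penalty of this kind is one plausible choice among several.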

Overview

PoLAR factorizes transition extent and mode in latent actions.
Figure 1. PoLAR uses temporal offset to order transition extent along radius, allowing similar transition modes to remain in similar directions. Sweeping the radius token with fixed direction increases decoded transition extent.

Evaluation Tasks

Evaluation tasks for PoLAR.
Figure 2. PoLAR is evaluated across simulated and real-world tabletop manipulation tasks, including RoboMimic, MimicGen, SimplerEnv-WidowX, and real robot tasks.

Results

Simulation results for PoLAR.
Figure 3. PoLAR improves continuous latent-action conditioned diffusion policies on RoboMimic and MimicGen, and gives the best VLA success rates on SimplerEnv-WidowX.
Real-world robot results for PoLAR.
Figure 4. PoLAR with VLA achieves the highest success rates across three real-robot tasks.

Real Robot Rollouts

Rollouts shown at 2x playback.

Pick & Place Banana

PoLAR: Success

PoLAR: Success

UniVLA: Failed

Villa-X: Failed

Cup Stack

PoLAR: Success

PoLAR: Success

π0.5: Failed

UniVLA: Failed

Open Pot & Banana

PoLAR: Success

PoLAR: Success

UniVLA: Failed

π0.5: Failed

Model Zoo

Public PoLAR checkpoints are available on Hugging Face.

Tokenizer

PoLAR Tokenizer

Radial-direction latent action tokenizer pretrained on BridgeData V2.

Hugging Face

VLA

PoLAR VLA

Latent VLA trained with PoLAR action tokens on BridgeData V2.

Hugging Face

Analysis

Temporal offset and radial supervision diagnostics.
Figure 5. Temporal offset is an effective proxy for transition extent, and PoLAR radii increase with temporal offset while flat baselines remain nearly constant.
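A diagnostic of this kind could be sketched as follows: group latents by temporal offset and compare the mean radius per group. This is an illustrative sketch only. The data are synthetic stand-ins rather than results from the paper, the radius is taken to be the Euclidean norm of the latent, and the helper names `mean_radius_by_offset` and `is_increasing` are hypothetical.

```python
import numpy as np


def mean_radius_by_offset(z, offset):
    """Return the sorted unique offsets and the mean latent norm at each one."""
    z = np.asarray(z, dtype=float)
    offset = np.asarray(offset)
    radius = np.linalg.norm(z, axis=-1)
    unique = np.unique(offset)  # sorted ascending
    means = np.array([radius[offset == k].mean() for k in unique])
    return unique, means


def is_increasing(values):
    """True when each value is strictly larger than the previous one."""
    return bool(np.all(np.diff(values) > 0))


# Synthetic stand-in data (assumption): 400 latents with offsets cycling 1..4.
rng = np.random.default_rng(0)
offset = np.tile(np.array([1, 2, 3, 4]), 100)
direction = rng.normal(size=(400, 8))
direction /= np.linalg.norm(direction, axis=1, keepdims=True)

# "Radial" latents: the norm grows with the offset.
radial_z = direction * (0.5 * offset)[:, None]
# "Flat" latents: the norm is the same for every offset.
flat_z = direction * 1.0

offsets_seen, radial_means = mean_radius_by_offset(radial_z, offset)
_, flat_means = mean_radius_by_offset(flat_z, offset)
radial_is_ordered = is_increasing(radial_means)
```

On real latents, the same per-offset means could be plotted against offset to compare a radially structured tokenizer with a flat baseline.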
Radius controls transition extent.
Figure 6. With direction tokens fixed, increasing the radial token produces progressively larger visual transitions.
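The radius sweep can be illustrated geometrically with a small sketch. It assumes a continuous latent in the unit Poincaré ball with curvature -1, which is one common model of hyperbolic space; the page does not state which model or curvature PoLAR uses, and it describes discrete radial and direction tokens that this sketch does not model. The names `poincare_point` and `hyperbolic_distance_from_origin` and the toy values are hypothetical.

```python
import numpy as np


def poincare_point(direction, hyp_radius):
    """Point in the unit Poincare ball (curvature -1) at hyperbolic distance
    `hyp_radius` from the origin along `direction`."""
    u = np.asarray(direction, dtype=float)
    u = u / np.linalg.norm(u)
    # A point at hyperbolic distance r from the origin has Euclidean norm tanh(r / 2).
    return np.tanh(hyp_radius / 2.0) * u


def hyperbolic_distance_from_origin(x):
    """Hyperbolic distance from the origin: d(0, x) = 2 * artanh(||x||)."""
    return 2.0 * np.arctanh(np.linalg.norm(x))


# Fixed transition mode (toy direction) and a sweep over hyperbolic radii.
direction = np.array([1.0, 2.0, 2.0])
radii = np.linspace(0.5, 3.0, 6)
sweep = np.stack([poincare_point(direction, r) for r in radii])

# Check the sweep: the radius changes, the direction does not.
recovered = np.array([hyperbolic_distance_from_origin(x) for x in sweep])
unit = sweep / np.linalg.norm(sweep, axis=1, keepdims=True)
```

Each row of `sweep` would then be passed to a decoder to render the corresponding transition. For intuition on why larger radii leave room for more modes: in the hyperbolic plane a disk of radius r has area 2π(cosh r − 1), which grows exponentially in r, whereas a Euclidean disk has area πr².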
Additional radius sweep examples.
Appendix Figure. Additional radius-sweep examples show the same behavior across more transitions.

Citation

@misc{jeong2026polar,
  title         = {PoLAR: Factorizing Extent and Mode in Latent Actions for Robot Policy Learning},
  author        = {Youngjoon Jeong and Jihwan Yu and Minsoo Jo and Junha Chun and Taesup Kim},
  year          = {2026},
  eprint        = {2606.21139},
  archivePrefix = {arXiv},
  primaryClass  = {cs.RO},
  url           = {https://arxiv.org/abs/2606.21139}
}