¹State Key Laboratory of Multimedia Information Processing, School of Computer Science, Peking University &nbsp; ²AI² Robotics &nbsp; ³The Chinese University of Hong Kong
Overview. Unlike previous 3D VLA methods that encode point clouds with newly introduced 3D encoders or by projecting features between 2D and 3D spaces, Lift3D-VLA equips 2D VLA models with explicit 3D reasoning and temporally coherent action generation. Across 22 simulated and 8 real-world tasks, Lift3D-VLA achieves state-of-the-art results.
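The model-lifting idea can be sketched as follows: each 3D point is assigned a positional embedding from the pretrained 2D grid by virtually projecting it onto the image plane of the 2D encoder. This is an illustrative sketch only; the pinhole intrinsics and nearest-neighbour grid lookup below are our assumptions, not the exact implementation.

```python
import numpy as np

def lift_3d_positional_embeddings(points, pe_grid, intrinsics):
    """Assign each 3D point a positional embedding by virtually
    projecting it onto the 2D PE grid of a pretrained vision encoder.

    points     : (N, 3) xyz in the camera frame (z > 0)
    pe_grid    : (H, W, D) pretrained 2D positional embeddings
    intrinsics : (3, 3) pinhole camera matrix (assumed known)
    """
    H, W, _ = pe_grid.shape
    # Pinhole projection: homogeneous (u*z, v*z, z) = K @ (x, y, z).
    uvw = points @ intrinsics.T
    u = uvw[:, 0] / uvw[:, 2]
    v = uvw[:, 1] / uvw[:, 2]
    # Nearest-neighbour lookup into the PE grid, clamped to bounds
    # (a real system might interpolate instead).
    cols = np.clip(np.round(u).astype(int), 0, W - 1)
    rows = np.clip(np.round(v).astype(int), 0, H - 1)
    return pe_grid[rows, cols]  # (N, D) geometry-aligned 3D PEs
```

Because the lookup reuses the encoder's own 2D positional embeddings, the point cloud can be fed to the vision encoder without training a separate 3D encoder from scratch.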
Recently, Vision–Language–Action (VLA) models have demonstrated strong generalization across diverse tasks. However, effective robotic manipulation in physical environments fundamentally requires geometric understanding and spatial reasoning. While some VLA approaches attempt to incorporate 3D information, they are constrained by limited data availability and by geometric information loss in current 3D encoding pipelines, and they fail to jointly capture 3D geometry and temporally structured actions in dynamic environments. To address these limitations, we introduce Lift3D-VLA, a unified VLA framework that equips models with explicit 3D point-cloud reasoning and temporally coherent action generation. First, building upon our previous work Lift3D, we propose an enhanced 2D model-lifting strategy that geometrically aligns 3D points with pretrained 2D positional embeddings, enabling direct point-cloud encoding within the VLA vision encoder while minimizing spatial information loss. Second, on top of these explicit 3D inputs, we propose Geometry-Centric Masked Autoencoding (GC-MAE), a dual-objective self-supervised framework that reconstructs the current point cloud while predicting its future geometric evolution, allowing the 2D vision encoder to internalize both 3D structure and physical dynamics. Finally, we design layer-wise temporal action modeling, which leverages multiple layers of the LLM to collaboratively predict action chunks, yielding temporally consistent predictions.
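A minimal sketch of the GC-MAE dual objective: one term scores reconstruction of the current point cloud, the other scores prediction of its future evolution. The symmetric Chamfer distance and the `w_future` weight below are illustrative assumptions, not necessarily the exact losses used by GC-MAE.

```python
import numpy as np

def chamfer(a, b):
    """Symmetric Chamfer distance between point sets a (N, 3) and b (M, 3)."""
    d = np.linalg.norm(a[:, None, :] - b[None, :, :], axis=-1)  # (N, M) pairwise
    return d.min(axis=1).mean() + d.min(axis=0).mean()

def gc_mae_loss(recon_now, target_now, pred_future, target_future, w_future=1.0):
    """Dual objective: reconstruct the current cloud (static geometry)
    and predict the future cloud (physical dynamics).
    w_future is an assumed weighting hyperparameter."""
    return (chamfer(recon_now, target_now)
            + w_future * chamfer(pred_future, target_future))
```

Training on both terms is what pushes the 2D vision encoder to encode not just the scene's 3D structure but also how that structure is about to change.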
Lift3D-VLA Framework. (a) Following Lift3D, we perform virtual projection to align 3D points with pretrained 2D positional embeddings, constructing geometry-aligned 3D PEs. (b) Stage 1: GC-MAE reconstructs the current point cloud while predicting its future geometric evolution, enabling the model to capture physical dynamics. (c) Stage 2: Layer-wise temporal action modeling leverages intermediate and deep LLM layers to generate temporally consistent action sequences.
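The layer-wise temporal action modeling in (c) can be illustrated as follows: each selected LLM layer contributes one step of the action chunk through its own head, so shallow-to-deep layers collaboratively span the temporal horizon. The per-layer linear heads and the one-step-per-layer mapping are simplifying assumptions for this sketch.

```python
import numpy as np

def layerwise_action_chunk(hidden_states, heads):
    """Sketch of layer-wise temporal action modeling.

    hidden_states : list of (D,) hidden vectors, one per selected
                    LLM layer, ordered shallow -> deep
    heads         : list of (A, D) linear action heads (assumed),
                    one per selected layer
    Returns an (T, A) action chunk, with T = number of selected layers.
    """
    return np.stack([W @ h for h, W in zip(hidden_states, heads)])
```

Tying consecutive chunk steps to consecutive layers is one simple way to make the predicted action sequence temporally consistent: later timesteps are decoded from representations that have been refined by more of the network.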
Lift3D-VLA is evaluated on 22 tasks across the MetaWorld and RLBench benchmarks in both single-task and multi-task settings.
We compare Lift3D-VLA against state-of-the-art VLA methods across 8 real-world manipulation tasks.
| Models | Wipe whiteboard | Place dish on rack | Place egg on bread | Pick & place banana | Pour water into cup | Stack cola cans | Scoop popcorn | Open pot & pick corn | Mean S.R. (%) |
|---|---|---|---|---|---|---|---|---|---|
| SpatialVLA | 60 | 33 | 20 | 40 | 87 | 33 | 27 | 40 | 43 |
| π0.5 | 60 | 60 | 47 | 87 | 87 | 66 | 53 | 60 | 65 |
| CoT-VLA | 53 | 66 | 33 | 53 | 47 | 33 | 33 | 53 | 46 |
| Lift3D-VLA | 66 | 66 | 66 | 87 | 93 | 66 | 47 | 73 | 71 |
Lift3D-VLA successfully handles diverse tasks, including dynamic scenarios requiring continuous adaptation to environmental changes.
All videos sped up 10×.
Pick Banana
Pour Water
Stack Cola
We deploy Lift3D-VLA on single-arm and bimanual Franka Research 3 robots, using Intel RealSense D455 RGB-D cameras.
Lift3D-VLA generalizes robustly to unseen objects, backgrounds, and lighting conditions across multiple tasks.
All videos sped up 10×.
Unseen Background
Unseen Object
Unseen Lighting
Pick Banana
Pour Water
All experiments are conducted on MetaWorld with success rates reported as percentages.
Component-wise analysis comparing different pretraining strategies.
Effect of mask ratio during masked point reconstruction.
Impact of decoder depth on pretraining effectiveness.
Scalability analysis with varying pretraining data scales. Performance improves consistently with more data.
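The mask-ratio ablation above can be reproduced conceptually with a simple random point mask for MAE-style reconstruction. Uniform per-point sampling and the 0.6 default ratio are illustrative assumptions; the actual pipeline may mask grouped point patches.

```python
import numpy as np

def mask_point_cloud(points, mask_ratio=0.6, seed=None):
    """Randomly mask a fraction of points for masked reconstruction.

    points     : (N, 3) input point cloud
    mask_ratio : fraction of points hidden from the encoder
    Returns (visible_points, mask) where mask is True for masked points.
    """
    rng = np.random.default_rng(seed)
    n = len(points)
    n_mask = int(round(n * mask_ratio))
    mask = np.zeros(n, dtype=bool)
    mask[rng.choice(n, size=n_mask, replace=False)] = True
    return points[~mask], mask
```

The encoder sees only `visible_points`; the decoder is trained to recover the masked set, so sweeping `mask_ratio` trades reconstruction difficulty against the amount of visible context.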
@article{lift3dvla2026,
title = {Lift3D-VLA: Lifting VLA Models to 3D Geometry and Dynamics-Aware Manipulation},
author = {Anonymous Authors},
journal = {Under Review},
year = {2026},
}