Lift3D-VLA
Lifting VLA Models to 3D Geometry and Dynamics-Aware Manipulation

Equal contribution    Corresponding author

1 State Key Laboratory of Multimedia Information Processing, School of Computer Science, Peking University   2 AI² Robotics   3 The Chinese University of Hong Kong

TL;DR: Lift3D-VLA equips 2D VLA models with explicit 3D point cloud reasoning and temporally coherent action generation, achieving state-of-the-art results over prior VLA methods on 22 simulated and 8 real-world tasks.
[Teaser figure: static/images/lift3dvla/1_teaser.png]

Overview. Unlike previous 3D VLA methods that encode point clouds with newly introduced 3D encoders or by projecting features between 2D and 3D spaces, Lift3D-VLA equips 2D VLA models with explicit 3D reasoning and temporally coherent action generation. Across 22 simulated and 8 real-world tasks, Lift3D-VLA achieves state-of-the-art results.

Abstract

Recently, Vision–Language–Action (VLA) models have demonstrated strong generalization across diverse tasks. However, effective robotic manipulation in physical environments fundamentally requires geometric understanding and spatial reasoning. While some VLA approaches attempt to incorporate 3D information, they are constrained by limited data availability and by geometric information loss in current 3D encoding pipelines, and they fail to jointly capture 3D geometry and temporally structured actions in dynamic environments. To address these limitations, we introduce Lift3D-VLA, a unified VLA framework that equips models with explicit 3D point cloud reasoning and enables temporally coherent action generation. First, building upon our previous work Lift3D, we propose an enhanced 2D model-lifting strategy that geometrically aligns 3D points with pretrained 2D positional embeddings, enabling direct point-cloud encoding within the VLA vision encoder while minimizing spatial information loss. Second, given explicit 3D inputs, we propose Geometry-Centric Masked Autoencoding (GC-MAE), a dual-objective self-supervised framework that reconstructs the current point cloud while predicting its future geometric evolution, allowing the 2D vision encoder to internalize both 3D structure and physical dynamics. Third, we design layer-wise temporal action modeling, which leverages multiple layers of the LLM to collaboratively predict action chunks, enabling temporally consistent predictions.
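To make the dual objective of GC-MAE more concrete, here is a minimal sketch that scores a predicted current point cloud and a predicted future point cloud and sums the two terms. The use of a symmetric Chamfer distance, the `future_weight` parameter, and the function names are assumptions for illustration only; the abstract does not specify the reconstruction loss or how the two terms are weighted.

```python
def chamfer_distance(pred, target):
    """Symmetric Chamfer distance between two point sets (lists of (x, y, z))."""
    def one_way(src, dst):
        # Mean squared distance from each src point to its nearest dst point.
        return sum(
            min(sum((a - b) ** 2 for a, b in zip(p, q)) for q in dst)
            for p in src
        ) / len(src)
    return one_way(pred, target) + one_way(target, pred)


def gc_mae_loss(pred_current, current, pred_future, future, future_weight=1.0):
    """Hypothetical dual objective: reconstruct the current geometry and
    predict the future geometry (loss form and weighting are assumptions)."""
    recon = chamfer_distance(pred_current, current)
    forecast = chamfer_distance(pred_future, future)
    return recon + future_weight * forecast


current = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
future = [(0.0, 0.0, 0.5), (1.0, 0.0, 0.5)]  # the object moved up by 0.5

perfect = gc_mae_loss(current, current, future, future)
static_guess = gc_mae_loss(current, current, current, future)  # predicts "no motion"
```

In this toy case a model that reconstructs the present perfectly but predicts no motion is still penalized by the future term, which is the intuition behind asking the encoder to capture dynamics as well as structure.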

Method: Lift3D-VLA Framework

[Method figure: static/images/lift3dvla/2_method.png]

Lift3D-VLA Framework. (a) Following Lift3D, we perform virtual projection to align 3D points with pretrained 2D positional embeddings (PEs), constructing geometry-aligned 3D PEs. (b) Stage 1: GC-MAE reconstructs the current point cloud while predicting its future geometric evolution, enabling the model to capture physical dynamics. (c) Stage 2: Layer-wise temporal action modeling leverages intermediate and deep LLM layers to generate temporally consistent action sequences.
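The idea in step (a) can be illustrated with a small sketch: project a 3D point onto a few virtual image planes, find the 2D patch it lands in on each plane, and combine the pretrained 2D PEs of those patches. The choice of three axis-aligned planes, orthographic projection, nearest-patch lookup, and simple averaging are all assumptions made for this example, as are the function names and the toy embedding table; the caption above only states that virtual projection aligns 3D points with pretrained 2D PEs.

```python
def patch_index(u, v, grid):
    """Map normalized plane coordinates u, v in [0, 1] to a flat patch index."""
    col = min(int(u * grid), grid - 1)
    row = min(int(v * grid), grid - 1)
    return row * grid + col


def lifted_positional_embedding(point, pos_embed_2d, grid):
    """Average the pretrained 2D PEs hit by projecting `point` onto virtual planes.

    Assumptions (not from the source): the point is already normalized to the
    unit cube, the virtual planes are the three axis-aligned planes, and the
    projection is orthographic.
    """
    x, y, z = point
    views = [(x, y), (x, z), (y, z)]  # drop one axis per virtual plane
    picked = [pos_embed_2d[patch_index(u, v, grid)] for u, v in views]
    dim = len(picked[0])
    return [sum(pe[d] for pe in picked) / len(picked) for d in range(dim)]


# Toy "pretrained" table: a 2x2 patch grid with 2-dimensional embeddings.
grid = 2
pos_embed_2d = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

pe = lifted_positional_embedding((0.9, 0.1, 0.9), pos_embed_2d, grid)
```

Because the 3D PE is assembled from entries of the existing 2D table, the encoder sees positions in a space it was already pretrained on, which is the motivation for lifting rather than learning a new 3D embedding from scratch.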

Simulation Results

Lift3D-VLA is evaluated on 22 tasks across the MetaWorld and RLBench benchmarks in both single-task and multi-task settings.

MetaWorld — Single-Task

MetaWorld — Multi-Task

RLBench — Multi-Task

Real-World Manipulation

We compare Lift3D-VLA against state-of-the-art VLA methods across 8 real-world manipulation tasks.

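Success rate (S.R., %) per task:

| Models | Wipe whiteboard | Place dish on rack | Place egg on bread | Pick & place banana | Pour water into cup | Stack cola cans | Scoop popcorn | Open pot pick corn | Mean S.R. |
|---|---|---|---|---|---|---|---|---|---|
| SpatialVLA | 60 | 33 | 20 | 40 | 87 | 33 | 27 | 40 | 43 |
| π0.5 | 60 | 60 | 47 | 87 | 87 | 66 | 53 | 60 | 65 |
| CoT-VLA | 53 | 66 | 33 | 53 | 47 | 33 | 33 | 53 | 46 |
| Lift3D-VLA | 66 | 66 | 66 | 87 | 93 | 66 | 47 | 73 | 71 |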
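As a quick sanity check on the mean column, the snippet below recomputes each model's mean from the eight per-task entries. The rounding rule (round half up) is an assumption; it is the rule under which the recomputed means match the reported ones.

```python
import math

# Per-task success rates copied from the table above (8 tasks per model).
per_task = {
    "SpatialVLA": [60, 33, 20, 40, 87, 33, 27, 40],
    "π0.5": [60, 60, 47, 87, 87, 66, 53, 60],
    "CoT-VLA": [53, 66, 33, 53, 47, 33, 33, 53],
    "Lift3D-VLA": [66, 66, 66, 87, 93, 66, 47, 73],
}


def mean_success_rate(rates):
    """Arithmetic mean, rounded half up (Python's round() rounds half to even)."""
    return math.floor(sum(rates) / len(rates) + 0.5)


means = {model: mean_success_rate(rates) for model, rates in per_task.items()}
best = max(means, key=means.get)
```

Under this rule the recomputed means are 43, 65, 46, and 71, with Lift3D-VLA highest, 6 points above the strongest baseline (π0.5).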

Lift3D-VLA successfully handles diverse tasks, including dynamic scenarios requiring continuous adaptation to environmental changes.

All videos sped up 10×.

Pick Banana

Pour Water

Stack Cola

We deploy Lift3D-VLA on Franka Research 3 robots equipped with Intel RealSense D455 RGB-D cameras and 3D-printed UMI grippers.

Real-world Manipulation Setup
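Since the model reasons over point clouds while the cameras deliver RGB-D frames, depth images need to be back-projected into 3D at some point in the pipeline. This page does not describe that step, so the following is only a generic sketch under a pinhole camera assumption; the function name and the intrinsics `fx`, `fy`, `cx`, `cy` are hypothetical and are not real D455 calibration values.

```python
def backproject(depth, fx, fy, cx, cy):
    """Turn a depth image (rows of metres, 0 = no reading) into camera-frame points."""
    points = []
    for v, row in enumerate(depth):
        for u, z in enumerate(row):
            if z <= 0:
                continue  # skip invalid depth readings
            x = (u - cx) * z / fx
            y = (v - cy) * z / fy
            points.append((x, y, z))
    return points


# Toy 2x2 depth image and made-up intrinsics for illustration.
depth = [[1.0, 0.0],
         [2.0, 2.0]]
cloud = backproject(depth, fx=2.0, fy=2.0, cx=0.5, cy=0.5)
```

A real deployment would typically also transform the points into the robot base frame and crop or downsample them; those steps are omitted here.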

Generalization Experiments

Lift3D-VLA generalizes robustly to unseen objects, backgrounds, and lighting conditions across multiple tasks.

All videos sped up 10×.

Unseen Background

Unseen Object

Unseen Lighting

Pick Banana

Pour Water

BibTeX

@article{lift3dvla2024,
  title     = {Lift3D-VLA: Lifting VLA Models to 3D Geometry and Dynamics-Aware Manipulation},
  author    = {Jiaming Liu and Qingpo Wuwu and Nuowei Han and Hao Chen and Zhuoyang Liu and Fan Fei and Yueru Jia and Chenyang Gu and Yandong Guo and Boxin Shi and Shanghang Zhang},
  journal   = {Under Review},
  year      = {2026},
}