Lift3D-VLA
Lifting VLA Models to 3D Geometry and Dynamics-Aware Manipulation

Anonymous Authors

1State Key Laboratory of Multimedia Information Processing, School of Computer Science, Peking University   2AI² Robotics   3The Chinese University of Hong Kong

TL;DR: Lift3D-VLA achieves significant improvements over prior VLA methods by equipping 2D VLA models with explicit 3D point cloud reasoning and temporally coherent action generation.
Lift3D-VLA Overview
[Teaser figure: static/images/lift3dvla/1_teaser.png]

Overview. Unlike previous 3D VLA methods that encode point clouds with newly introduced 3D encoders or by projecting features between 2D and 3D spaces, Lift3D-VLA equips 2D VLA models with explicit 3D reasoning and temporally coherent action generation. Across 22 simulated and 8 real-world tasks, Lift3D-VLA achieves state-of-the-art results.

Abstract

Recently, Vision–Language–Action (VLA) models have demonstrated strong generalization across diverse tasks. However, effective robotic manipulation in physical environments fundamentally requires geometric understanding and spatial reasoning. While some VLA approaches attempt to incorporate 3D information, they are constrained by limited data availability and by geometric information loss in current 3D encoding pipelines, and they fail to jointly capture 3D geometry and temporally structured actions in dynamic environments. To address these limitations, we introduce Lift3D-VLA, a unified VLA framework that equips models with explicit 3D point cloud reasoning and temporally coherent action generation. First, building upon our previous work Lift3D, we propose an enhanced 2D model-lifting strategy that geometrically aligns 3D points with pretrained 2D positional embeddings, enabling direct point-cloud encoding within the VLA vision encoder while minimizing spatial information loss. Second, based on these explicit 3D inputs, we propose Geometry-Centric Masked Autoencoding (GC-MAE), a dual-objective self-supervised framework that reconstructs the current point cloud while predicting its future geometric evolution, allowing the 2D vision encoder to internalize both 3D structure and physical dynamics. Finally, we design layer-wise temporal action modeling, which leverages multiple layers of the LLM to collaboratively predict action chunks, yielding temporally consistent predictions.
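The model-lifting strategy above can be illustrated with a toy sketch: each 3D point is projected onto a virtual image plane, and the pretrained 2D positional-embedding grid is indexed at the projected pixel, so point tokens carry geometry-aligned PEs. All specifics here (grid size, intrinsics, nearest-neighbour lookup) are illustrative assumptions, not the paper's exact implementation.

```python
import numpy as np

def virtual_project(points, fx=200.0, fy=200.0, cx=112.0, cy=112.0):
    """Pinhole projection of (N, 3) camera-frame points to pixel coordinates."""
    x, y, z = points[:, 0], points[:, 1], np.clip(points[:, 2], 1e-6, None)
    u = fx * x / z + cx
    v = fy * y / z + cy
    return np.stack([u, v], axis=-1)  # (N, 2)

def geometry_aligned_pe(points, pe_grid, image_size=224):
    """Look up a pretrained 2D positional embedding for each 3D point.

    pe_grid: (H, W, D) 2D positional embeddings, e.g. a ViT patch-PE table
    reshaped to its spatial grid.
    """
    H, W, D = pe_grid.shape
    uv = virtual_project(points)  # (N, 2) pixel coordinates
    # Map pixels to PE-grid cells (nearest neighbour for simplicity;
    # bilinear interpolation would be the smoother choice).
    gu = np.clip((uv[:, 0] / image_size * W).astype(int), 0, W - 1)
    gv = np.clip((uv[:, 1] / image_size * H).astype(int), 0, H - 1)
    return pe_grid[gv, gu]  # (N, D) geometry-aligned per-point PEs

pe_grid = np.random.randn(14, 14, 64)         # toy 14x14 ViT PE grid
points = np.random.rand(512, 3) + [0, 0, 1]   # points in front of the camera
pes = geometry_aligned_pe(points, pe_grid)
print(pes.shape)  # (512, 64)
```

The point of the sketch is that no new 3D encoder is trained: the pretrained 2D PE table is reused as-is, with only the indexing changed from patch position to projected point position.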

Method: Lift3D-VLA Framework

Lift3D-VLA Framework
[Method figure: static/images/lift3dvla/2_method.png]

Lift3D-VLA Framework. (a) Following Lift3D, we perform virtual projection to align 3D points with pretrained 2D positional embeddings, constructing geometry-aligned 3D PEs. (b) Stage 1: GC-MAE reconstructs the current point cloud while predicting its future geometric evolution, enabling the model to capture physical dynamics. (c) Stage 2: Layer-wise temporal action modeling leverages intermediate and deep LLM layers to generate temporally consistent action sequences.
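Stage 1's dual objective can be sketched as two Chamfer-distance losses, one against the current point cloud and one against the next-frame point cloud. This is a toy illustration under our own assumptions (a simple symmetric Chamfer distance and stand-in decoder outputs), not the authors' code.

```python
import numpy as np

def chamfer(a, b):
    """Symmetric Chamfer distance between (N, 3) and (M, 3) point sets."""
    d2 = ((a[:, None, :] - b[None, :, :]) ** 2).sum(-1)  # (N, M) pairwise sq. dists
    return d2.min(axis=1).mean() + d2.min(axis=0).mean()

def gc_mae_loss(pred_current, pred_future, gt_current, gt_future, lam=1.0):
    """Dual objective: reconstruct frame t and predict frame t+1, weighted by lam."""
    return chamfer(pred_current, gt_current) + lam * chamfer(pred_future, gt_future)

rng = np.random.default_rng(0)
gt_t = rng.normal(size=(256, 3))                  # current-frame point cloud
gt_t1 = gt_t + 0.05 * rng.normal(size=(256, 3))   # slightly evolved future frame
# Stand-in decoder outputs (noisy copies of the targets) purely to run the loss.
pred_t = gt_t + 0.1 * rng.normal(size=(256, 3))
pred_t1 = gt_t1 + 0.1 * rng.normal(size=(256, 3))
loss = gc_mae_loss(pred_t, pred_t1, gt_t, gt_t1)
print(loss > 0)  # True
```

The reconstruction term anchors the encoder to current 3D structure, while the future-prediction term forces it to model how that structure evolves, matching the "geometry plus dynamics" framing above.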

Simulation Results

Lift3D-VLA is evaluated on 22 tasks across the MetaWorld and RLBench benchmarks in both single-task and multi-task settings.

MetaWorld — Single-Task

MetaWorld — Multi-Task

RLBench — Multi-Task

Real-World Manipulation

We compare Lift3D-VLA against state-of-the-art VLA methods across 8 real-world manipulation tasks.

| Models | Wipe whiteboard | Place dish on rack | Place egg on bread | Pick & place banana | Pour water into cup | Stack cola cans | Scoop popcorn | Open pot & pick corn | Mean S.R. |
|---|---|---|---|---|---|---|---|---|---|
| SpatialVLA | 60 | 33 | 20 | 40 | 87 | 33 | 27 | 40 | 43 |
| π0.5 | 60 | 60 | 47 | 87 | 87 | 66 | 53 | 60 | 65 |
| CoT-VLA | 53 | 66 | 33 | 53 | 47 | 33 | 33 | 53 | 46 |
| Lift3D-VLA | 66 | 66 | 66 | 87 | 93 | 66 | 47 | 73 | 71 |

Lift3D-VLA successfully handles diverse tasks, including dynamic scenarios requiring continuous adaptation to environmental changes.
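Continuous adaptation in such dynamic scenarios relies on the layer-wise temporal action modeling described above. A toy sketch (our assumptions, not the paper's implementation) of how hidden states from several LLM layers could each contribute one step of an action chunk:

```python
import numpy as np

def layerwise_action_chunk(hidden_states, heads):
    """hidden_states: list of (D,) vectors, one per selected LLM layer
    (intermediate -> deep); heads: matching list of (D, A) projection
    matrices. Returns an (L, A) action chunk, one action per layer."""
    return np.stack([h @ W for h, W in zip(hidden_states, heads)])

D, A, L = 32, 7, 4                    # hidden dim, action dim, chunk length
rng = np.random.default_rng(1)
hs = [rng.normal(size=D) for _ in range(L)]        # states from 4 chosen layers
heads = [0.1 * rng.normal(size=(D, A)) for _ in range(L)]
chunk = layerwise_action_chunk(hs, heads)
print(chunk.shape)  # (4, 7)
```

The chunk is emitted as an ordered sequence rather than independent per-step predictions, which is what makes the resulting actions temporally consistent.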

All videos sped up 10×.

Pick Banana

Pour Water

Stack Cola

We deploy Lift3D-VLA on single-arm and bimanual Franka Research 3 robots, using Intel RealSense D455 RGB-D cameras.

Single-arm and dual-arm real-world manipulation setup.

Generalization Experiments

Lift3D-VLA generalizes robustly to unseen objects, backgrounds, and lighting conditions across multiple tasks.

All videos sped up 10×.

Unseen Background

Unseen Object

Unseen Lighting

Pick Banana

Pour Water

Ablation Studies

All experiments are conducted on MetaWorld with success rates reported as percentages.

a) Component Ablations

Component-wise analysis comparing different pretraining strategies.

b) Mask Ratio

Effect of mask ratio during masked point reconstruction.

c) Decoder Depth

Impact of decoder depth on pretraining effectiveness.

d) Pretraining Data Scale

Scalability analysis with varying pretraining data scales. Performance improves consistently with more data.

BibTeX

@article{lift3dvla2026,
  title     = {Lift3D-VLA: Lifting VLA Models to 3D Geometry and Dynamics-Aware Manipulation},
  author    = {Anonymous Authors},
  journal   = {Under Review},
  year      = {2026},
}