Lift3D-VLA
Lifting VLA Models to 3D Geometry and Dynamics-Aware Manipulation

Anonymous Authors

1State Key Laboratory of Multimedia Information Processing, School of Computer Science, Peking University   2AI² Robotics   3The Chinese University of Hong Kong

TL;DR: Lift3D-VLA achieves significant improvements over prior VLA methods by equipping 2D VLA models with explicit 3D point cloud reasoning and temporally coherent action generation.
Lift3D-VLA Overview
[Teaser figure: static/images/lift3dvla/1_teaser.png]

Overview. Unlike previous 3D VLA methods that encode point clouds with newly introduced 3D encoders or by projecting features between 2D and 3D spaces, Lift3D-VLA equips 2D VLA models with explicit 3D reasoning and temporally coherent action generation. Across 22 simulated and 8 real-world tasks, Lift3D-VLA achieves state-of-the-art results.

Abstract

Recently, Vision–Language–Action (VLA) models have demonstrated strong generalization across diverse tasks. However, effective robotic manipulation in physical environments fundamentally requires geometric understanding and spatial reasoning. While some VLA approaches attempt to incorporate 3D information, they are constrained by limited data availability and by geometric information loss in current 3D encoding pipelines, and they fail to jointly capture 3D geometry and temporally structured actions in dynamic environments. To address these limitations, we introduce Lift3D-VLA, a unified VLA framework that equips models with explicit 3D point cloud reasoning and temporally coherent action generation. First, building upon our previous work Lift3D, we propose an enhanced 2D model-lifting strategy that geometrically aligns 3D points with pretrained 2D positional embeddings, enabling direct point-cloud encoding within the VLA vision encoder while minimizing spatial information loss. Second, based on these explicit 3D inputs, we propose Geometry-Centric Masked Autoencoding (GC-MAE), a dual-objective self-supervised framework that reconstructs the current point cloud while predicting its future geometric evolution, allowing the 2D vision encoder to internalize both 3D structure and physical dynamics. Finally, we design layer-wise temporal action modeling, which leverages multiple layers of the LLM to collaboratively predict action chunks, yielding temporally consistent predictions.
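The model-lifting strategy above can be illustrated with a toy sketch: each 3D point is projected onto a virtual image plane, and the pretrained 2D positional-embedding grid is indexed at the projected pixel, so point tokens carry geometry-aligned PEs. All specifics here (grid size, intrinsics, nearest-neighbour lookup) are illustrative assumptions, not the paper's exact implementation.

```python
import numpy as np

def virtual_project(points, fx=200.0, fy=200.0, cx=112.0, cy=112.0):
    """Pinhole projection of (N, 3) camera-frame points to pixel coordinates."""
    x, y, z = points[:, 0], points[:, 1], np.clip(points[:, 2], 1e-6, None)
    u = fx * x / z + cx
    v = fy * y / z + cy
    return np.stack([u, v], axis=-1)  # (N, 2)

def geometry_aligned_pe(points, pe_grid, image_size=224):
    """Look up a pretrained 2D positional embedding for each 3D point.

    pe_grid: (H, W, D) 2D positional embeddings, e.g. a ViT patch-PE table
    reshaped to its spatial grid.
    """
    H, W, D = pe_grid.shape
    uv = virtual_project(points)  # (N, 2) pixel coordinates
    # Map pixels to PE-grid cells (nearest neighbour for simplicity;
    # bilinear interpolation would be the smoother choice).
    gu = np.clip((uv[:, 0] / image_size * W).astype(int), 0, W - 1)
    gv = np.clip((uv[:, 1] / image_size * H).astype(int), 0, H - 1)
    return pe_grid[gv, gu]  # (N, D) geometry-aligned per-point PEs

pe_grid = np.random.randn(14, 14, 64)         # toy 14x14 ViT PE grid
points = np.random.rand(512, 3) + [0, 0, 1]   # points in front of the camera
pes = geometry_aligned_pe(points, pe_grid)
print(pes.shape)  # (512, 64)
```

The point of the sketch is that no new 3D encoder is trained: the pretrained 2D PE table is reused as-is, with only the indexing changed from patch position to projected point position.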

Method: Lift3D-VLA Framework

Lift3D-VLA Framework
[Method figure: static/images/lift3dvla/2_method.png]

Lift3D-VLA Framework. (a) Following Lift3D, we perform virtual projection to align 3D points with pretrained 2D positional embeddings, constructing geometry-aligned 3D PEs. (b) Stage 1: GC-MAE reconstructs the current point cloud while predicting its future geometric evolution, enabling the model to capture physical dynamics. (c) Stage 2: Layer-wise temporal action modeling leverages intermediate and deep LLM layers to generate temporally consistent action sequences.
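Stage 1's dual objective can be sketched as two Chamfer-distance losses, one against the current point cloud and one against the next-frame point cloud. This is a toy illustration under our own assumptions (a simple symmetric Chamfer distance and stand-in decoder outputs), not the authors' code.

```python
import numpy as np

def chamfer(a, b):
    """Symmetric Chamfer distance between (N, 3) and (M, 3) point sets."""
    d2 = ((a[:, None, :] - b[None, :, :]) ** 2).sum(-1)  # (N, M) pairwise sq. dists
    return d2.min(axis=1).mean() + d2.min(axis=0).mean()

def gc_mae_loss(pred_current, pred_future, gt_current, gt_future, lam=1.0):
    """Dual objective: reconstruct frame t and predict frame t+1, weighted by lam."""
    return chamfer(pred_current, gt_current) + lam * chamfer(pred_future, gt_future)

rng = np.random.default_rng(0)
gt_t = rng.normal(size=(256, 3))                  # current-frame point cloud
gt_t1 = gt_t + 0.05 * rng.normal(size=(256, 3))   # slightly evolved future frame
# Stand-in decoder outputs (noisy copies of the targets) purely to run the loss.
pred_t = gt_t + 0.1 * rng.normal(size=(256, 3))
pred_t1 = gt_t1 + 0.1 * rng.normal(size=(256, 3))
loss = gc_mae_loss(pred_t, pred_t1, gt_t, gt_t1)
print(loss > 0)  # True
```

The reconstruction term anchors the encoder to current 3D structure, while the future-prediction term forces it to model how that structure evolves, matching the "geometry plus dynamics" framing above.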

Simulation Results

Lift3D-VLA is evaluated on 22 tasks across the MetaWorld and RLBench benchmarks in both single-task and multi-task settings.

MetaWorld — Single-Task

MetaWorld — Multi-Task

RLBench — Multi-Task

Real-World Manipulation

We compare Lift3D-VLA against state-of-the-art VLA methods across 8 real-world manipulation tasks.

| Models | Wipe whiteboard | Place dish on rack | Place egg on bread | Pick & place banana | Pour water into cup | Stack cola cans | Scoop popcorn | Open pot & pick corn | Mean S.R. |
|---|---|---|---|---|---|---|---|---|---|
| SpatialVLA | 60 | 33 | 20 | 40 | 87 | 33 | 27 | 40 | 43 |
| π0.5 | 60 | 60 | 47 | 87 | 87 | 66 | 53 | 60 | 65 |
| CoT-VLA | 53 | 66 | 33 | 53 | 47 | 33 | 33 | 53 | 46 |
| Lift3D-VLA | 66 | 66 | 66 | 87 | 93 | 66 | 47 | 73 | 71 |

Lift3D-VLA successfully handles diverse tasks, including dynamic scenarios requiring continuous adaptation to environmental changes.
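Continuous adaptation in such dynamic scenarios relies on the layer-wise temporal action modeling described above. A toy sketch (our assumptions, not the paper's implementation) of how hidden states from several LLM layers could each contribute one step of an action chunk:

```python
import numpy as np

def layerwise_action_chunk(hidden_states, heads):
    """hidden_states: list of (D,) vectors, one per selected LLM layer
    (intermediate -> deep); heads: matching list of (D, A) projection
    matrices. Returns an (L, A) action chunk, one action per layer."""
    return np.stack([h @ W for h, W in zip(hidden_states, heads)])

D, A, L = 32, 7, 4                    # hidden dim, action dim, chunk length
rng = np.random.default_rng(1)
hs = [rng.normal(size=D) for _ in range(L)]        # states from 4 chosen layers
heads = [0.1 * rng.normal(size=(D, A)) for _ in range(L)]
chunk = layerwise_action_chunk(hs, heads)
print(chunk.shape)  # (4, 7)
```

The chunk is emitted as an ordered sequence rather than independent per-step predictions, which is what makes the resulting actions temporally consistent.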

All videos sped up 10×.

Pick Banana

Pour Water

Stack Cola

We deploy Lift3D-VLA on single-arm and bimanual Franka Research 3 robots, using Intel RealSense D455 RGB-D cameras.

Single-arm and dual-arm real-world manipulation setup.

Generalization Experiments

Lift3D-VLA generalizes robustly to unseen objects, backgrounds, and lighting conditions across multiple tasks.

All videos sped up 10×.

Unseen Background

Unseen Object

Unseen Lighting

Pick Banana

Pour Water

Ablation Studies

All experiments are conducted on MetaWorld with success rates reported as percentages.

a) Component Ablations

Component-wise analysis comparing different pretraining strategies.

b) Mask Ratio

Effect of mask ratio during masked point reconstruction.

c) Decoder Depth

Impact of decoder depth on pretraining effectiveness.

d) Pretraining Data Scale

Scalability analysis with varying pretraining data scales. Performance improves consistently with more data.

BibTeX

@article{lift3dvla2026,
  title     = {Lift3D-VLA: Lifting VLA Models to 3D Geometry and Dynamics-Aware Manipulation},
  author    = {Anonymous Authors},
  journal   = {Under Review},
  year      = {2026},
}