Ground Slow, Move Fast: A Dual-System Foundation Model for Generalizable Vision-Language Navigation

1Shanghai AI Laboratory, 2The University of Hong Kong, 3Zhejiang University, 4Tsinghua University,

TL;DR

We propose DualVLN, a dual-system foundation model for Vision-Language Navigation, which includes:

  • System 2: a large foundation VLM that performs slow but robust reasoning and produces explicit pixel goals.
  • System 1: a lightweight diffusion policy that generates smooth and safe trajectories in real time.

Abstract

While recent large vision-language models (VLMs) have improved generalization in vision-language navigation (VLN), existing methods typically rely on end-to-end pipelines that map vision-language inputs directly to short-horizon discrete actions. Such designs often produce fragmented motions, incur high latency, and struggle with real-world challenges like dynamic obstacle avoidance. We propose DualVLN, the first dual-system VLN foundation model that synergistically integrates high-level reasoning with low-level action execution. System 2, a VLM-based global planner, “grounds slowly” by predicting mid-term waypoint goals via image-grounded reasoning. System 1, a lightweight, multi-modal conditioning Diffusion Transformer policy, “moves fast” by leveraging both explicit pixel goals and latent features from System 2 to generate smooth and accurate trajectories. The dual-system design enables robust real-time control and adaptive local decision-making in complex, dynamic environments. By decoupling training, the VLM retains its generalization, while System 1 achieves interpretable and effective local navigation. DualVLN outperforms prior methods across all VLN benchmarks and real-world experiments demonstrate robust long-horizon planning and real-time adaptability in dynamic environments.

Approach

Dual-System Design for VLN

Overview of the DualVLN framework.

DualVLN adopts a compositional architecture featuring a dual-system design that synergistically combines high-level instruction interpretation with low-level action execution. Specifically, DualVLN decouples the VLN pipeline into two complementary systems:

  • System 2: A vision-language model (VLM)-based planning module that interprets navigation instructions to predict mid-term waypoint goals through image-grounded reasoning. By predicting pixel coordinates in the image space, it effectively connects instruction understanding with spatial reasoning, enabling long-horizon navigation instruction following.
  • System 1: A multi-modal, goal-conditioned diffusion policy that generates executable short-horizon trajectories conditioned on the current observation and the asynchronously updated pixel goals and latent features from System 2. It enables robust, real-time control and local decision-making in complex environments (see the sketch after this list).
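To make the division of labor concrete, below is a minimal sketch of the dual-system inference loop. It is not the released DualVLN code: the module internals, tensor shapes, and the System 2 update period K are illustrative assumptions; only the overall structure (a slow VLM-style planner that emits a pixel goal plus a latent plan, and a fast goal-conditioned diffusion-style policy that refines a short-horizon trajectory at every control step) follows the description above.

import torch
import torch.nn as nn

class System2Planner(nn.Module):
    """Stand-in for the VLM-based planner (System 2): maps the current RGB frame
    and an instruction embedding to a pixel goal and a latent plan feature."""
    def __init__(self, latent_dim=256):
        super().__init__()
        self.encoder = nn.Sequential(nn.Flatten(), nn.LazyLinear(latent_dim), nn.ReLU())
        self.goal_head = nn.Linear(latent_dim, 2)        # (u, v) pixel goal in image space
        self.latent_head = nn.Linear(latent_dim, latent_dim)

    def forward(self, rgb, instruction_emb):
        h = self.encoder(rgb) + instruction_emb
        return self.goal_head(h), self.latent_head(h)

class System1Policy(nn.Module):
    """Stand-in for the goal-conditioned diffusion policy (System 1): iteratively
    refines a short-horizon trajectory conditioned on the observation, the pixel
    goal, and System 2's latent plan (toy denoising loop, not a real DiT)."""
    def __init__(self, horizon=8, latent_dim=256, denoise_steps=10):
        super().__init__()
        self.horizon, self.denoise_steps = horizon, denoise_steps
        self.obs_enc = nn.Sequential(nn.Flatten(), nn.LazyLinear(latent_dim), nn.ReLU())
        self.denoiser = nn.Sequential(
            nn.Linear(horizon * 2 + 2 + 2 * latent_dim, 512), nn.ReLU(),
            nn.Linear(512, horizon * 2))

    @torch.no_grad()
    def forward(self, rgb, pixel_goal, plan_latent):
        obs = self.obs_enc(rgb)
        traj = torch.randn(rgb.shape[0], self.horizon * 2)      # start from noise
        for _ in range(self.denoise_steps):                      # iterative refinement
            cond = torch.cat([traj, pixel_goal, obs, plan_latent], dim=-1)
            traj = traj - 0.1 * self.denoiser(cond)              # toy denoising update
        return traj.view(-1, self.horizon, 2)                    # (x, y) waypoints

# Asynchronous loop: System 2 "grounds slowly", System 1 "moves fast".
planner, policy = System2Planner(), System1Policy()
instruction_emb = torch.zeros(1, 256)                            # placeholder text feature
pixel_goal = plan_latent = None
K = 8                                                            # assumed System 2 update period
for t in range(32):
    rgb = torch.rand(1, 3, 224, 224)                             # current first-person frame
    if t % K == 0:                                               # slow re-grounding
        pixel_goal, plan_latent = planner(rgb, instruction_emb)
    trajectory = policy(rgb, pixel_goal, plan_latent)            # fast, every-step control

The point the sketch tries to capture is the asynchrony: the expensive planner call is amortized over many cheap policy calls, so local control stays real-time even when high-level grounding is slow.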

Evaluation in Simulation

Quantitative Results on VLN-CE Benchmark

Evaluation results on R2R-CE and RxR-CE benchmarks.

We compare DualVLN under the VLN-CE evaluation against three representative categories of baselines:

  • Multi-sensor methods that incorporate panoramic RGB, odometry, and depth (e.g., HPN+DN, CMA, GridMM, ETPNav);
  • VLM-free methods trained on single-view first-person RGB and depth (e.g., CM2, LAW, WS-MGMap);
  • Video-LLM-based methods relying solely on single-view RGB (e.g., NaVid, MapNav, NaVILA, UniNaVid, StreamVLN).

With only first-person RGB inputs, DualVLN achieves substantial gains over all prior RGB-based approaches, highlighting the strength of our dual-system design.
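For reference, the tables above report the standard VLN-CE metrics. The snippet below is a minimal sketch of how success rate (SR) and success weighted by path length (SPL) are typically computed, assuming the common 3 m success radius used in R2R-CE; the episode records and numbers are made up for illustration and are not taken from this paper.

def success_rate_and_spl(episodes, success_radius=3.0):
    """episodes: iterable of dicts with keys
       'dist_to_goal'   final geodesic distance to the goal (m)
       'path_length'    length of the path the agent actually took (m)
       'shortest_path'  geodesic shortest-path length from start to goal (m)"""
    sr_sum, spl_sum, n = 0.0, 0.0, 0
    for ep in episodes:
        success = float(ep["dist_to_goal"] <= success_radius)
        sr_sum += success
        # SPL: success weighted by (shortest path / max(agent path, shortest path))
        spl_sum += success * ep["shortest_path"] / max(ep["path_length"], ep["shortest_path"])
        n += 1
    return sr_sum / n, spl_sum / n

# Toy usage with made-up episodes:
episodes = [
    {"dist_to_goal": 1.2, "path_length": 11.0, "shortest_path": 9.0},
    {"dist_to_goal": 4.5, "path_length": 14.0, "shortest_path": 10.0},
]
sr, spl = success_rate_and_spl(episodes)
print(f"SR = {sr:.2f}, SPL = {spl:.2f}")   # SR = 0.50, SPL = 0.41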



Quantitative Results on VLN-PE Benchmark

Evaluation results on the VLN-PE benchmark.

Despite not being fine-tuned on VLN-PE trajectories, DualVLN surpasses all baselines, including methods trained directly on VLN-PE as well as VLM-based methods.



Social-VLN Benchmark

Overview of the Social-VLN benchmark.

Comparison of DualVLN and StreamVLN on standard R2R VLN and Social-VLN.

We also introduce the first Social-VLN benchmark to evaluate navigation models on social awareness and task recovery in dynamic environments, where humanoid agents are placed along task trajectories.

We evaluate DualVLN and StreamVLN on the Social-VLN benchmark. StreamVLN is selected as the baseline because of its low action latency, which lets it react to dynamic obstacles to some extent. As shown in the table above, both methods suffer substantial performance drops: the success rate of DualVLN decreases by about 27% and that of StreamVLN by about 26% compared to their results on standard VLN tasks, highlighting the increased difficulty of the Social-VLN setting.
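To make the quoted percentages concrete, the tiny snippet below shows one way to read them, as a relative drop in success rate; the SR values used here are hypothetical placeholders, not numbers from the table above.

# Hypothetical SR values; only the relative-drop formula is the point.
standard_sr, social_sr = 0.60, 0.44           # assumed SR on standard VLN vs. Social-VLN
relative_drop = (standard_sr - social_sr) / standard_sr
print(f"relative drop: {relative_drop:.0%}")  # ~27%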


VLN Evaluation Examples

VLN-CE Evaluation in the Habitat Simulator. Green curves denote ground-truth (GT) trajectories.

VLN-PE Evaluation with physics simulation on a Unitree H1 robot.

Real-World In-the-Wild Testing

Baseline Comparison

NaVILA

DualVLN

StreamVLN

Real-world evaluation results.


Zero-Shot Transfer to Long-Horizon Navigation

Indoor Complex Instruction Following

Outdoor Autonomous Exploration


Collision Avoidance in Cluttered Environments

Instruction: Walk through the tables and chairs, turn left into the café, and stop at the coffee counter.

Instruction: Walk through the bamboo forest and turn left. Reach a man sitting on the sofa.


Semantic Understanding

Instruction: Walk straight ahead, turn right after seeing the billiard table, and enter the room with shelves.

Instruction: Walk out of this house and find a trash bin.

Instruction: Enter the office area. Immediately turn left and continue straight. Then make a right turn. Stop at the workstation near the whiteboard.

Instruction: Walk towards the orange coffee sculpture and go upstairs. Then keep walking straight and turn right at the end. After that, turn left and walk to the man with a black umbrella. Stop in front of the doors with red handles.


BibTeX

@article{YourPaperKey2024,
  title={Your Paper Title Here},
  author={First Author and Second Author and Third Author},
  journal={Conference/Journal Name},
  year={2024},
  url={https://your-domain.com/your-project-page}
}