Vision Language Models (VLMs) pretrained on Internet-scale vision-language data have demonstrated the potential to transfer their knowledge to robotic learning. However, the existing paradigm encounters three critical challenges: (1) expensive inference cost resulting from large-scale model parameters, (2) frequent domain shifts caused by mismatched data modalities, and (3) limited capacity to handle past or future experiences. In this work, we propose LiteVLP, a lightweight, memory-based, and general-purpose vision-language policy generation model. LiteVLP is built upon a pre-trained 1B-parameter VLM and fine-tuned on a tiny-scale, conversation-style robotic dataset. Through extensive experiments, we demonstrate that LiteVLP outperforms state-of-the-art vision-language policies on VIMA-Bench with minimal training time. Furthermore, LiteVLP exhibits superior inference speed while maintaining exceptionally high accuracy. In long-horizon manipulation tasks, LiteVLP also shows remarkable memory ability, outperforming the best-performing baseline by 18.8%. These results highlight LiteVLP as a promising route to integrating the intelligence of VLMs into robotic learning.
Integrating pre-trained Large Language Models (LLMs) and VLMs with low-level robotic policies has made significant progress. However, existing models face three key challenges. First, fine-tuning pre-trained LLMs and VLMs on robotic data often encounters frequent domain shifts due to the substantial differences between the pre-training web data and the fine-tuning robotic data. Second, current models have insufficient memory of past and future experiences, which leads to poor performance on long-horizon manipulation. Third, the large number of parameters in the backbone network results in high computational latency, which limits real-world deployment.
The LiteVLP framework begins with multi-observation compression (MOC) and then projects the image features into the same dimensional space as the text features. The image tokens are then interleaved with text tokens and processed by a large language model to generate a text output that includes the end-effector's action. Note that during fine-tuning, the parameters of the ViT are frozen, while the length embedding, the MLP projector, and the large language model are trained.
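To make the data flow concrete, the sketch below outlines this pipeline in PyTorch-style code. The module names (`LiteVLPSketch`, the average-pooling compressor, the two-layer projector), the hidden sizes, and the simple prepend-style interleaving are our illustrative assumptions, not the released implementation.

```python
# Hedged sketch of the LiteVLP forward pass (PyTorch-style).
# Module names, dimensions, and the interleaving scheme are illustrative
# assumptions, not the authors' released code.
import torch
import torch.nn as nn

class LiteVLPSketch(nn.Module):
    def __init__(self, vit, llm, vis_dim=1024, txt_dim=2048, num_compressed=16):
        super().__init__()
        self.vit = vit                              # frozen pre-trained vision encoder
        for p in self.vit.parameters():
            p.requires_grad = False
        # Multi-observation compression: reduce each frame's patch tokens
        # to a fixed, short token budget (average pooling used here as a stand-in).
        self.compress = nn.AdaptiveAvgPool1d(num_compressed)
        # MLP projector maps visual features into the LLM's text embedding space.
        self.projector = nn.Sequential(
            nn.Linear(vis_dim, txt_dim), nn.GELU(), nn.Linear(txt_dim, txt_dim)
        )
        self.llm = llm                              # ~1B-parameter language model (trained)

    def forward(self, observations, text_embeds):
        # observations: (B, T, C, H, W) -- a short history of camera frames
        B, T = observations.shape[:2]
        frames = observations.flatten(0, 1)              # (B*T, C, H, W)
        patch_tokens = self.vit(frames)                  # (B*T, N, vis_dim)
        compressed = self.compress(patch_tokens.transpose(1, 2)).transpose(1, 2)
        image_tokens = self.projector(compressed)        # (B*T, num_compressed, txt_dim)
        image_tokens = image_tokens.reshape(B, -1, image_tokens.shape[-1])
        # Interleave (here: simply prepend) image tokens with text tokens and let
        # the LLM generate a text output containing the end-effector action.
        inputs = torch.cat([image_tokens, text_embeds], dim=1)
        return self.llm(inputs_embeds=inputs)
```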
We evaluate our method on VIMA-Bench and compare its performance against models trained on datasets of different sizes; the results are shown in the following table. LiteVLP achieves highly competitive performance, reaching success rates of 84.2% on L1, 78.1% on L2, and 75.4% on L3 while using only 1.2% of the VIMA dataset. This is comparable to the state-of-the-art model VIMA, which achieves 81.5% on L1, 81.5% on L2, and 78.7% on L3 when trained on the full dataset. Our model also significantly outperforms other VLMs fine-tuned on the same small dataset, such as LLaRA-7B and Mipha-3B. These results indicate that LiteVLP can rapidly adapt to robotic manipulation and deliver highly competitive performance when fine-tuned with visuomotor instructions on a small robotic dataset. Note that the suffix '-m' in model names denotes multi-image input, while '-s' denotes single-image input.
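For reference, one conversation-style visuomotor instruction sample might look like the following. The field names and the text-based action encoding are illustrative assumptions, not the exact schema used in the paper's dataset.

```python
# Illustrative (assumed) conversation-style training sample; the actual field
# names and action encoding in the LiteVLP dataset may differ.
sample = {
    "images": ["episode_042/obs_t0.png", "episode_042/obs_t1.png"],  # multi-observation history
    "conversations": [
        {
            "from": "human",
            "value": "<image>\n<image>\nPut the red block into the green bowl.",
        },
        {
            "from": "assistant",
            # End-effector action expressed as text, e.g. discretized pick/place poses.
            "value": "pick: (x=0.42, y=0.17, rot=90); place: (x=0.63, y=0.55, rot=0)",
        },
    ],
}
```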
In long-horizon manipulation tasks, LiteVLP-m significantly outperforms the baseline models. Following CoTDiffusion, we select three representative long-horizon manipulation tasks from VIMA-Bench: visual rearrangement, visual reasoning, and visual constraints. Although CoTDiffusion is specifically designed to improve performance on long-horizon manipulation, LiteVLP-m achieves superior results, with an average improvement of 18.8% over CoTDiffusion across the three task types. We attribute this effectiveness to the model's memory of past and future experiences, which makes it better suited to long-horizon manipulation.
Our method demonstrates a significant advantage in training efficiency. Specifically, we achieve an average success rate of 80.7% after just 6.1 hours of training on 4 NVIDIA RTX 3090 GPUs. In comparison, VIMA, trained on 8 NVIDIA V100 GPUs, takes 24 hours and achieves an average success rate of 80.6%, while LLaRA-7B, trained on 4 NVIDIA RTX 3090 GPUs, requires 21 hours and achieves 74% on average. These results highlight the efficiency of our approach, which reduces training time by 17.9 hours relative to VIMA and 14.9 hours relative to LLaRA-7B while performing well even on less powerful GPUs.
With its lightweight design, our model not only reduces training time significantly but also accelerates inference, giving it a clear advantage in latency. We compare LiteVLP-m fairly against Mipha-3B and LLaRA-7B on VIMA-Bench tasks using the same NVIDIA RTX 3090 GPU. As shown in the following figure, LiteVLP-m achieves superior performance with 6.8 times lower inference latency than LLaRA. This result can be attributed to two main factors: the smaller size of our model and the effectiveness of our MOC module in shortening the input sequence.
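A minimal sketch of how such a per-step latency comparison can be measured on a single GPU is given below; the helper name, warmup counts, and the `model.generate`-style call are placeholder assumptions, not the evaluation harness used in the paper.

```python
# Minimal latency-measurement sketch (assumed setup, not the paper's harness).
import time
import torch

@torch.inference_mode()
def mean_latency_ms(model, build_inputs, n_trials=50, warmup=5):
    """Average per-step inference latency in milliseconds on the current GPU."""
    for _ in range(warmup):                     # warm up CUDA kernels and caches
        model.generate(**build_inputs())
    torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(n_trials):
        model.generate(**build_inputs())        # one policy step per call
    torch.cuda.synchronize()
    return (time.perf_counter() - start) / n_trials * 1e3
```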
@article{litevlp2025li,
title={Towards Fast, Memory-based and Data-Efficient Vision-Language Policy},
author={Haoxuan Li and Sixu Yan and Yuhan Li and Xinggang Wang},
journal={arXiv preprint arXiv:2503.10322},
year={2025}
}