Systems of Systems & Integration
Annual Plan
Adaptive Vision-Language Model (VLM) and Vision-Language-Action Model (VLA) Enhanced Offroad Autonomy for Heterogeneous Multi-Agent Systems
Project Team
Government
Jonathon Smereka, U.S. Army GVSC
Industry
Huanfei Zheng, Movensys Corporation
Student
Shahil Shaik, Aditya Parameshwaran, Anshul Nayak, Clemson University
Project Summary
Project #5.27 begins in 2026.
Future Army missions demand seamless coordination and rapid adaptation among heterogeneous teams of autonomous vehicles with limited sensing, communication, and computational capabilities. However, current multi-agent autonomy methods often struggle to adapt policies across varied terrains and mission profiles, or to integrate real-time contextual inputs from humans. The emergence of large language models (LLMs) and their multimodal descendants, vision-language models (VLMs) and vision-language-action models (VLAs), promises a paradigm shift in robotics. These models retain broad knowledge from large-scale pretraining and exhibit impressive out-of-domain zero-shot performance that extends beyond narrowly defined tasks, indicating significant potential utility for Army operations.
The objectives of this research are to develop adaptive offroad autonomy for teams of heterogeneous autonomous vehicles and to push VLM and VLA technologies toward enhanced multi-agent offroad autonomy for practical, high-impact Army applications. To realize these objectives, we seek to answer three fundamental research questions. Q1: How can VLM-enhanced multi-agent reinforcement learning (MARL) bring advanced vision-language knowledge into heterogeneous multi-vehicle cooperation? Q2: How can VLM- and VLA-enhanced MARL enable adaptive, distributed coordination of multiple autonomous vehicles? Q3: What strategies can ensure robust operation of the multi-agent system under potential performance degradation?
The main basic-research innovation lies in adapting foundation models to unlock capabilities for offroad planning and control of heterogeneous multi-agent vehicle teams. We will pursue three key research innovations in this project:
- Creation of a large-scale, multi-modal dataset encompassing diverse collaborative offroad tasks involving heterogeneous multi-robot systems for VLM/VLA training;
- Development of vision-language critic models (VLCMs) that serve as generalized value critics capable of evaluating team behaviors across visual, graph-structured, and linguistic modalities for sample-efficient policy learning (see the critic sketch after this list); and
- Development of multi-modal VLAs for distributed, adaptive MARL across diverse scenarios, including potential system degradation (see the actor sketch after this list).
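As a minimal sketch of what the second innovation might look like in code, the PyTorch module below fuses visual, graph-structured, and linguistic embeddings into a scalar estimate of joint team value, as a centralized critic would in MARL. All class names, dimensions, and the mean-pooling stand-in for a graph encoder are illustrative assumptions, not the project's actual design; in practice the projections would wrap frozen pretrained vision-language encoders.

```python
# Hypothetical VLCM sketch: a multi-modal value critic for MARL.
# Dimensions and module names are illustrative assumptions only.
import torch
import torch.nn as nn


class VLCM(nn.Module):
    """Fuses visual, graph-structured, and linguistic inputs into a
    scalar value estimate of joint team behavior."""

    def __init__(self, img_dim=512, node_dim=32, text_dim=384, hidden=256):
        super().__init__()
        # Placeholder linear projections; a real system would use frozen
        # pretrained encoders (e.g., a CLIP-style image/text backbone).
        self.img_proj = nn.Linear(img_dim, hidden)
        self.node_proj = nn.Linear(node_dim, hidden)
        self.text_proj = nn.Linear(text_dim, hidden)
        self.value_head = nn.Sequential(
            nn.Linear(3 * hidden, hidden), nn.ReLU(), nn.Linear(hidden, 1)
        )

    def forward(self, img_feat, node_feats, text_feat):
        # node_feats: (batch, num_agents, node_dim); mean-pooling here is
        # a stand-in for a graph neural network over the team graph.
        g = self.node_proj(node_feats).mean(dim=1)
        z = torch.cat(
            [self.img_proj(img_feat), g, self.text_proj(text_feat)], dim=-1
        )
        return self.value_head(z)  # (batch, 1) team value estimate


# Dummy features standing in for encoder outputs for a 3-agent team.
critic = VLCM()
value = critic(torch.randn(4, 512), torch.randn(4, 3, 32), torch.randn(4, 384))
print(value.shape)  # torch.Size([4, 1])
```

A critic of this shape supports centralized training with decentralized execution: it sees the whole team during learning but is discarded at deployment time.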
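For the third innovation, the sketch below shows one way a VLA-style decentralized actor could condition each vehicle's control on a shared language instruction embedding. The small MLP and all dimensions are placeholder assumptions; a real VLA would decode actions from a pretrained vision-language backbone.

```python
# Hypothetical VLA-style actor sketch for distributed, adaptive MARL.
# Names and dimensions are illustrative assumptions only.
import torch
import torch.nn as nn


class VLAActor(nn.Module):
    """Maps a local visual observation embedding plus a shared language
    instruction embedding to a continuous control action, letting each
    vehicle act on the same mission text without a central planner."""

    def __init__(self, obs_dim=512, text_dim=384, hidden=256, action_dim=2):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Linear(obs_dim + text_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
        )
        # Two outputs as an example action space, e.g., [throttle, steer].
        self.action_head = nn.Linear(hidden, action_dim)

    def forward(self, obs_feat, text_feat):
        h = self.backbone(torch.cat([obs_feat, text_feat], dim=-1))
        return torch.tanh(self.action_head(h))  # bounded actions in [-1, 1]


# Each agent runs its own copy at execution time, sharing only the
# instruction embedding; dummy features stand in for encoder outputs.
actor = VLAActor()
action = actor(torch.randn(1, 512), torch.randn(1, 384))
print(action.shape)  # torch.Size([1, 2])
```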