Automotive Research Center

Vehicle Controls & Behaviors

Annual Plan

Graph-Enhanced Vision-Language Sensor Fusion for Robust Perception in Data-Scarce and Ambiguous Off-Road Driving Scenarios

Project Team

Principal Investigator

Mohammad Al Faruque, University of California, Irvine
Pramod Khargonekar, University of California, Irvine

Government

Jonathon Smereka, U.S. Army GVSC

Industry

Marco Levorato, Aurelius Industries L.L.C.

Student

Junyao Wang, University of California, Irvine

Project Summary

Project #1.46 begins in 2026.

The brittleness of current state-of-the-art perception methods remains a central obstacle to deploying trustworthy autonomous systems in contested, uncertain, and dynamic mission environments. Despite strong benchmark performance, these models often fail under sensor failures, adverse weather, or distributional shifts, undermining reliability in safety-critical missions. To address these limitations, our research aims to establish a new scientific foundation for sensor fusion by integrating structured spatial reasoning with high-level semantic guidance. Our goal is to design and validate a unified perception framework that can tolerate sensor noise and failures, generalize from limited data, operate robustly in unstructured off-road environments, and support scalable multi-agent fusion under realistic communication constraints. We pursue this vision through four fundamental research questions (RQs):

RQ1: Robustness under edge conditions. How can region-level graph representations improve fusion resilience when sensors fail or degrade in adverse conditions? We hypothesize that modeling the environment as a graph, with Graph Neural Networks propagating information from reliable to corrupted regions, will enable graceful degradation rather than catastrophic collapse, improving perception accuracy under diverse degradation scenarios.
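The "reliable-to-corrupted" propagation hypothesized in RQ1 can be illustrated with a minimal sketch: regions are graph nodes carrying sensor features and a reliability score, and one round of reliability-weighted message passing lets a degraded region borrow information from trustworthy neighbors instead of collapsing. The function name, the blending rule, and the reliability scores are illustrative assumptions, not the project's actual architecture.

```python
import numpy as np

def fuse_regions(features, reliability, adj):
    """One round of reliability-weighted message passing over a region graph.

    features:    (N, D) per-region sensor features
    reliability: (N,) scores in [0, 1]; low means the region's sensors are degraded
    adj:         (N, N) binary adjacency between spatially neighboring regions
    """
    # Weight each incoming message by the *sender's* reliability, so corrupted
    # regions contribute little to their neighbors.
    w = adj * reliability[None, :]
    norm = w.sum(axis=1, keepdims=True)
    norm[norm == 0] = 1.0                      # isolated rows: avoid divide-by-zero
    neighbor_msg = (w @ features) / norm
    # Each region blends its own features with the neighbor estimate in
    # proportion to its own reliability: graceful degradation, not collapse.
    r = reliability[:, None]
    return r * features + (1 - r) * neighbor_msg
```

For a three-region chain where the middle region is fully corrupted, its fused feature becomes the reliability-weighted mean of its two intact neighbors, while the intact regions keep their own readings.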

RQ2: Semantic guidance for long-tail generalization. How can prompt-driven Vision-Language Models (VLMs) provide adaptive and interpretable guidance to handle ambiguity and recognize long-tail events with minimal supervision? By leveraging pre-trained VLMs and natural language prompts (e.g., “watch for construction vehicles”), we will generate semantic attention signals that guide the fusion process in real time, enabling few-shot and zero-shot recognition of novel objects and scenarios.
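The prompt-to-attention mechanism in RQ2 can be sketched as follows, assuming the region and prompt embeddings have already been produced by a pre-trained VLM's image and text encoders (which ones, and the temperature value, are assumptions for illustration): cosine similarity between the prompt embedding and each region embedding is softmax-normalized into attention weights that can gate the fusion process.

```python
import numpy as np

def prompt_attention(region_embs, prompt_emb, temperature=0.1):
    """Turn a natural-language prompt into per-region attention weights.

    region_embs: (N, D) region embeddings from a pre-trained VLM image encoder
    prompt_emb:  (D,) text embedding of a prompt such as
                 "watch for construction vehicles"
    Returns an (N,) softmax distribution over regions.
    """
    r = region_embs / np.linalg.norm(region_embs, axis=1, keepdims=True)
    p = prompt_emb / np.linalg.norm(prompt_emb)
    sim = r @ p                                   # cosine similarity per region
    z = np.exp((sim - sim.max()) / temperature)   # stable, temperature-scaled softmax
    return z / z.sum()
```

Regions semantically aligned with the prompt receive nearly all of the attention mass, which is what lets a single text prompt steer fusion toward long-tail objects without retraining.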

RQ3: Off-road perception and navigation. How can fused graph-semantic representations support robust navigation in complex, unstructured terrains that lack the strong priors of urban driving? We hypothesize that combining 3D topological modeling of terrain with semantic classification from VLMs (e.g., “muddy patch” vs. “gravel”) will enable resilient traversability analysis and obstacle avoidance in environments where conventional models fail.
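A minimal sketch of the traversability idea in RQ3: VLM-produced terrain labels (e.g., "muddy patch" vs. "gravel") are mapped to traversal costs per grid cell, with low-confidence cells pushed conservatively toward an unknown-terrain cost. The cost table and the confidence-blending rule are illustrative assumptions, not measured values.

```python
# Illustrative per-terrain traversal costs in [0, 1]; higher = harder to cross.
TERRAIN_COST = {"gravel": 0.2, "grass": 0.4, "muddy patch": 0.8, "rock": 1.0}

def traversability_map(labels, confidences, unknown_cost=0.9):
    """Combine VLM terrain labels and confidences into a per-cell cost grid.

    labels:      2D nested lists of terrain label strings
    confidences: 2D nested lists of classifier confidences in [0, 1]
    Cells the model is unsure about are treated pessimistically, so a planner
    avoids them rather than driving into unclassified terrain.
    """
    return [
        [conf * TERRAIN_COST.get(lab, unknown_cost) + (1 - conf) * unknown_cost
         for lab, conf in zip(row_labels, row_confs)]
        for row_labels, row_confs in zip(labels, confidences)
    ]
```

A planner can then run any standard grid search (e.g., A*) over this cost map; the semantic layer only changes the costs, not the planner.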

RQ4: Collective perception across agents. How can semantic communication protocols extend the framework to multi-agent systems, achieving collective situational awareness under strict bandwidth limits? We propose transmitting compact symbolic messages distilled by onboard VLMs (e.g., object states, confidence scores, hazard alerts), rather than raw sensor data. We hypothesize that such knowledge-level communication will achieve comparable accuracy to raw data fusion while drastically reducing bandwidth requirements, enabling scalable, practical multi-agent collaboration.
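The bandwidth argument in RQ4 can be made concrete with a sketch of a knowledge-level message: instead of streaming raw point clouds (megabytes per frame), each agent transmits a small symbolic record distilled by its onboard VLM. The schema below (field names, JSON encoding) is an illustrative assumption, not the project's actual protocol.

```python
import json
from dataclasses import dataclass, asdict

@dataclass
class SemanticMessage:
    """Compact knowledge-level message distilled by an onboard VLM (illustrative schema)."""
    agent_id: str        # sender identity
    obj_class: str       # detected object class
    position: tuple      # (x, y) in a shared map frame
    confidence: float    # detector confidence in [0, 1]
    hazard: bool         # whether receivers should treat this as a hazard alert

def encode(msg: SemanticMessage) -> bytes:
    """Serialize to a compact UTF-8 JSON payload for transmission."""
    return json.dumps(asdict(msg), separators=(",", ":")).encode("utf-8")
```

Such a payload is on the order of a hundred bytes, versus megabytes for a raw LiDAR sweep, which is the several-orders-of-magnitude bandwidth reduction the hypothesis targets.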
