Automotive Research Center

Human-Autonomy Interaction

Annual Plan

In-the-wild Question Answering: Toward Natural Human-Autonomy Interaction

Project Team

Principal Investigator

Rada Mihalcea, University of Michigan

Government

Matt Castanier, U.S. Army GVSC

Faculty

Mihai Burzo, University of Michigan

Industry

Glenn Taylor, Soar Technology, Inc.

Student

TBD

Project Summary

Project begins 2021.

Situational awareness remains a major goal for both humans and machines as they interact with complex and dynamic environments. Awareness of unfolding situations allows for rapid reactions to events in the surrounding environment and enables more informed, and thus better, decision making. Conversely, a lack of situational awareness is associated with decision errors, which in turn can lead to critical incidents. As we continue to advance the development and deployment of autonomous systems, a central question is how to equip such systems with the ability to acquire and maintain situational awareness in ways that match and complement human abilities.

Current autonomous vehicles are able to explore large, uncharted spaces in a short amount of time; however, they are not able to “report back” the information they have collected in a manner that is easily accessible to human users and does not produce information overload. As an example, consider a manned vehicle followed by several autonomous vehicles. To maintain situational awareness for the entire fleet, the human driver in the lead vehicle needs to communicate with the autonomous vehicles in ways that are similar to human-human communication.

Visual semantic graph construction

The main research question we are addressing is how to effectively and efficiently perform natural question answering against a large visual data stream. We are specifically targeting in-the-wild question answering, where the visual data stream is a close representation of real world settings reflecting the challenges of complex and dynamic environments.

The project targets three main research objectives:

(1) Construct a large dataset of video recordings paired with natural language questions that are representative of in-the-wild complex environments. This dataset will be used to both train and test in-the-wild multimodal question answering systems that can enhance situational awareness.

(2) Develop visual representation algorithms that convert the visual streams into semantic graphs capturing the entities in the videos as well as the relations between them.

(3) Develop algorithms for multimodal question answering that understand the type and intent of the questions and map them against the semantic graph representations of the visual streams to identify one or more candidate answers.
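To make objectives (2) and (3) concrete, the sketch below shows one minimal way a semantic graph could represent entities and relations extracted from a visual stream, and how a question might be matched against it. All names here (the `SemanticGraph` class, the example triples, the keyword-matching strategy) are illustrative assumptions, not the project's actual algorithms or data.

```python
from collections import defaultdict


class SemanticGraph:
    """Toy semantic graph: entities are nodes, and each
    (subject, relation, object) triple is a labeled edge.
    In the project, triples like these would be produced by
    visual representation algorithms run over the video stream."""

    def __init__(self):
        self.triples = []
        # Index each entity to the triples that mention it,
        # so lookups need not scan the whole graph.
        self.index = defaultdict(list)

    def add(self, subj, rel, obj):
        triple = (subj, rel, obj)
        self.triples.append(triple)
        self.index[subj].append(triple)
        self.index[obj].append(triple)

    def answer(self, question):
        """Naive question answering: return the objects of all
        triples whose subject and relation both occur verbatim in
        the question text. A real system would instead classify the
        question's type and intent before matching."""
        q = question.lower()
        return [obj for (subj, rel, obj) in self.triples
                if subj in q and rel in q]


# Hypothetical triples that detection on a video frame might yield.
g = SemanticGraph()
g.add("vehicle", "parked near", "building")
g.add("person", "walking toward", "vehicle")

print(g.answer("What is the vehicle parked near?"))  # ['building']
```

The keyword match stands in for the question-understanding step of objective (3); the design point it illustrates is that once the stream is reduced to a graph, answering becomes a graph query rather than a search over raw video.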

Our project will make new contributions by developing novel multimodal question answering algorithms that rely on semantic graph representations and address real, challenging settings. Most algorithms previously proposed for multimodal question answering have been developed and tested on data drawn from movies and TV series, which consist of acted, scripted, well-directed, and heavily edited video clips that are rarely encountered in the real world. In contrast, our project must overcome environmental noise, low lighting conditions, scenes that are less defined and not perfectly framed, and a lack of subject permanence.


#2.15