In-the-wild Question Answering: Toward Natural Human-Autonomy Interaction

Project Team

Principal Investigator

Rada Mihalcea, University of Michigan

Government

Matt Castanier, U.S. Army GVSC

Faculty

Mihai Burzo, University of Michigan

Industry

Glenn Taylor, Soar Technology, Inc.

Student

Santiago Castro, U. of Michigan

Project Summary

Project #2.15 began 2021 and was completed 2023.

Current autonomous vehicles are able to explore large and unchartered spaces in a short amount of time, however they are not able to “report back” the information they collected in a manner that is easily accessible to human users and does not produce information overload. As an example, consider the situation of a manned vehicle followed by several autonomous vehicles. In order to maintain situational awareness for the entire fleet, the human driver in the lead vehicle needs to communicate with the autonomous vehicles in ways that are similar to human-human communication.

Visual semantic graph construction

The main research question we are addressing is how to effectively and efficiently perform natural question answering against a large visual data stream. We are specifically targeting in-the-wild question answering, where the visual data stream is a close representation of real world settings reflecting the challenges of complex and dynamic environments.

The project targets three main research objectives: (1) Construct a large dataset of video recordings paired with natural language questions that are representative for in-the-wild complex environments. This dataset will be used to both train and test in-the-wild multimodal question answering systems that can be used to enhance situational awareness. (2) Develop visual representation algorithms that convert the visual streams into semantic graphs that capture the entities in the videos as well as their relations. (3) Develop algorithms for multimodal question answering that aim to understand the type and intent of the questions, and map them against the semantic graph representations of the visual streams to identify one or more candidate answers.

Our project makes new contributions by developing novel multimodal question answering algorithms relying on semantic graph representations, and addressing real and challenging settings. Most of the algorithms that have been previously proposed for multimodal question answering have been developed and tested on data drawn from movies and TV series, which consist of acted, scripted, well-directed and heavily edited video clips that are hard to find in the real world. In contrast, our project has to overcome the challenge of environmental noise, low lighting conditions, scenes that are less defined and not perfectly framed, and lack of subject permanence.

Other Publications:

Oana Ignat, Towards Human Action Understanding in Social Media Videos Using Multimodal Models, PhD Dissertation, August 2022.

Software:

In-the-wild Data
https://github.com/MichiganNLP/In-the-wild-QA/tree/main/src/example_data/wildQA-data
In-the-wild Question Answering benchmark and models
https://github.com/MichiganNLP/In-the-wild-QA
Video fill-in-the-blanks question answering: benchmark and models
https://github.com/MichiganNLP/vlog_action_recognition
Scalable probing of language-vision models
https://github.com/MichiganNLP/Scalable-VLM-Probing
Graph semantic representations for action recognition
https://github.com/MichiganNLP/video-fill-in-the-blank

* The project benefits from the unfunded contributions of Oana Ignat, who is a postdoctoral fellow.

2.15

Publications:

Oana Ignat, Santiago Castro, Hanwen Miao, Weiji Li, and Rada Mihalcea. 2021. WhyAct: Identifying Action Reasons in Lifestyle Vlogs. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 4770–4785, Online and Punta Cana, Dominican Republic. Association for Computational Linguistics. (https://arxiv.org/abs/2109.02747)
Castro, S., Wang, R., Huang, P., Stewart, I., Ignat, O., Liu, N., … & Mihalcea, R. (2022, May). FIBER: Fill-in-the-Blanks as a Challenging Video Understanding Evaluation Framework. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) (pp. 2925-2940).
Ignat, O., Castro, S., Li, W., & Mihalcea, R. (2024, August). Learning Human Action Representations from Temporal Context in Lifestyle Vlogs. In Proceedings of TextGraphs-17: Graph-based Methods for Natural Language Processing (pp. 1-18).
Castro, S., Deng, N., Huang, P., Burzo, M., & Mihalcea, R. (2022, October). In-the-Wild Video Question Answering. In Proceedings of the 29th International Conference on Computational Linguistics (pp. 5613-5635).
Castro, S., Ignat, O., & Mihalcea, R. (2023, July). Scalable Performance Analysis for Vision-Language Models. In Proceedings of the 12th Joint Conference on Lexical and Computational Semantics (* SEM 2023) (pp. 284-294).