Episodic Memory Question Answering
- URL: http://arxiv.org/abs/2205.01652v1
- Date: Tue, 3 May 2022 17:28:43 GMT
- Title: Episodic Memory Question Answering
- Authors: Samyak Datta, Sameer Dharur, Vincent Cartillier, Ruta Desai, Mukul
Khanna, Dhruv Batra, Devi Parikh
- Abstract summary: We envision a scenario wherein the human communicates with an AI agent powering an augmented reality device by asking questions.
In order to succeed, the egocentric AI assistant must construct semantically rich and efficient scene memories.
We introduce a new task - Episodic Memory Question Answering (EMQA)
We show that our choice of episodic scene memory outperforms naive, off-the-shelf solutions for the task as well as a host of very competitive baselines.
- Score: 55.83870351196461
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Egocentric augmented reality devices such as wearable glasses passively
capture visual data as a human wearer tours a home environment. We envision a
scenario wherein the human communicates with an AI agent powering such a device
by asking questions (e.g., where did you last see my keys?). In order to
succeed at this task, the egocentric AI assistant must (1) construct
semantically rich and efficient scene memories that encode spatio-temporal
information about objects seen during the tour and (2) possess the ability to
understand the question and ground its answer into the semantic memory
representation. Towards that end, we introduce (1) a new task - Episodic Memory
Question Answering (EMQA) wherein an egocentric AI assistant is provided with a
video sequence (the tour) and a question as an input and is asked to localize
its answer to the question within the tour, (2) a dataset of grounded questions
designed to probe the agent's spatio-temporal understanding of the tour, and
(3) a model for the task that encodes the scene as an allocentric, top-down
semantic feature map and grounds the question into the map to localize the
answer. We show that our choice of episodic scene memory outperforms naive,
off-the-shelf solutions for the task as well as a host of very competitive
baselines, and is robust to noise in depth and pose as well as camera jitter. The
project page can be found at: https://samyak-268.github.io/emqa .
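As a rough illustration of the idea of grounding a question into an allocentric, top-down semantic feature map, here is a minimal sketch. It is not the authors' model: the grid size, feature pooling, and dot-product grounding rule are all illustrative assumptions.

```python
# Minimal sketch (not the authors' code): scatter per-point features into a
# top-down grid, then localize an answer as the cell most similar to a
# question embedding. All shapes and the grounding rule are assumptions.
import numpy as np

GRID = 32          # top-down map is GRID x GRID cells
CELL_SIZE = 0.25   # metres per cell
FEAT_DIM = 16      # per-point / per-cell feature dimension


def build_topdown_map(points_world, features):
    """Average per-point features (N, FEAT_DIM) at world positions (N, 3)
    into a top-down grid keyed by the horizontal (x, z) coordinates."""
    fmap = np.zeros((GRID, GRID, FEAT_DIM))
    counts = np.zeros((GRID, GRID, 1))
    for p, f in zip(points_world, features):
        col = int(p[0] / CELL_SIZE) + GRID // 2   # x -> column
        row = int(p[2] / CELL_SIZE) + GRID // 2   # z (depth) -> row
        if 0 <= row < GRID and 0 <= col < GRID:
            fmap[row, col] += f
            counts[row, col] += 1
    return fmap / np.maximum(counts, 1)


def ground_question(fmap, question_emb):
    """Localize the answer as the cell whose feature is most similar to the
    question embedding (a stand-in for a learned grounding module)."""
    scores = fmap @ question_emb                  # (GRID, GRID)
    return np.unravel_index(np.argmax(scores), scores.shape)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    pts = rng.uniform(-3.5, 3.5, size=(500, 3))   # fake back-projected points
    feats = rng.normal(size=(500, FEAT_DIM))      # fake per-point features
    q = rng.normal(size=FEAT_DIM)                 # fake question embedding
    print("answer cell:", ground_question(build_topdown_map(pts, feats), q))
```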
Related papers
- Explore until Confident: Efficient Exploration for Embodied Question Answering [32.27111287314288]
We leverage the strong semantic reasoning capabilities of large vision-language models to efficiently explore and answer questions.
We propose a method that first builds a semantic map of the scene based on depth information and via visual prompting of a VLM.
Next, we use conformal prediction to calibrate the VLM's question answering confidence, allowing the robot to know when to stop exploration.
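As a rough illustration of using split conformal prediction to decide when to stop exploring, here is a minimal sketch. The calibration data, nonconformity score, and stopping rule are illustrative assumptions, not the paper's implementation.

```python
# Minimal sketch of split conformal prediction as a stopping criterion,
# assuming the VLM exposes softmax scores over candidate answers.
import numpy as np


def conformal_threshold(cal_probs, cal_labels, alpha=0.1):
    """Conformal quantile from calibration scores.
    cal_probs: (n, k) softmax over k answers; cal_labels: (n,) true indices."""
    n = len(cal_labels)
    # Nonconformity score: 1 - probability assigned to the true answer.
    scores = 1.0 - cal_probs[np.arange(n), cal_labels]
    q_level = np.ceil((n + 1) * (1 - alpha)) / n
    return np.quantile(scores, min(q_level, 1.0), method="higher")


def prediction_set(probs, qhat):
    """All answers whose nonconformity score stays below the threshold."""
    return np.where(1.0 - probs <= qhat)[0]


def should_stop_exploring(probs, qhat):
    """Illustrative rule: stop once the calibrated set contains one answer."""
    return len(prediction_set(probs, qhat)) <= 1


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    cal_probs = rng.dirichlet(np.ones(4), size=200)   # fake calibration scores
    cal_labels = rng.integers(0, 4, size=200)         # fake true answers
    qhat = conformal_threshold(cal_probs, cal_labels)
    print(should_stop_exploring(np.array([0.9, 0.05, 0.03, 0.02]), qhat))
```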
arXiv Detail & Related papers (2024-03-23T22:04:03Z)
- EgoTaskQA: Understanding Human Tasks in Egocentric Videos [89.9573084127155]
The EgoTaskQA benchmark provides a home for crucial dimensions of task understanding through question answering on real-world egocentric videos.
We meticulously design questions that target the understanding of (1) action dependencies and effects, (2) intents and goals, and (3) agents' beliefs about others.
We evaluate state-of-the-art video reasoning models on our benchmark and show that they fall significantly short of humans in understanding complex goal-oriented egocentric videos.
arXiv Detail & Related papers (2022-10-08T05:49:05Z)
- Equivariant and Invariant Grounding for Video Question Answering [68.33688981540998]
Most leading VideoQA models work as black boxes, which obscures the visual-linguistic alignment behind the answering process.
We devise a self-interpretable framework, Equivariant and Invariant Grounding for Interpretable VideoQA (EIGV).
EIGV is able to distinguish the causal scene from the environment information, and explicitly present the visual-linguistic alignment.
arXiv Detail & Related papers (2022-07-26T10:01:02Z)
- Explore before Moving: A Feasible Path Estimation and Memory Recalling Framework for Embodied Navigation [117.26891277593205]
We focus on navigation and address the problem that existing navigation algorithms lack experience and common sense.
Inspired by the human ability to think twice before moving and conceive several feasible paths to seek a goal in unfamiliar scenes, we present a route planning method, the Path Estimation and Memory Recalling (PEMR) framework.
We show strong experimental results of PEMR on the EmbodiedQA navigation task.
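For intuition about the "feasible path estimation" described above, here is a toy breadth-first search over an occupancy grid. PEMR's actual path estimation and memory recalling modules are learned; this only sketches the general notion of proposing a feasible route to a goal.

```python
# Toy sketch of feasible-path finding on an occupancy grid (illustrative only).
from collections import deque

GRID = [  # 0 = free, 1 = obstacle
    [0, 0, 0, 1],
    [1, 1, 0, 1],
    [0, 0, 0, 0],
    [0, 1, 1, 0],
]


def feasible_path(start, goal):
    """Return one feasible path from start to goal, or None if blocked."""
    queue, seen = deque([[start]]), {start}
    while queue:
        path = queue.popleft()
        r, c = path[-1]
        if (r, c) == goal:
            return path
        for nr, nc in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):
            if (0 <= nr < len(GRID) and 0 <= nc < len(GRID[0])
                    and GRID[nr][nc] == 0 and (nr, nc) not in seen):
                seen.add((nr, nc))
                queue.append(path + [(nr, nc)])
    return None


print(feasible_path((0, 0), (3, 3)))
```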
arXiv Detail & Related papers (2021-10-16T13:30:55Z)
- Unified Questioner Transformer for Descriptive Question Generation in Goal-Oriented Visual Dialogue [0.0]
Building an interactive artificial intelligence that can ask questions about the real world is one of the biggest challenges for vision and language problems.
We propose a novel Questioner architecture, called Unified Questioner Transformer (UniQer).
We build a goal-oriented visual dialogue task called CLEVR Ask. It synthesizes complex scenes that require the Questioner to generate descriptive questions.
arXiv Detail & Related papers (2021-06-29T16:36:34Z)
- Scene-Intuitive Agent for Remote Embodied Visual Grounding [89.73786309180139]
Humans learn from life events to form intuitions towards the understanding of visual environments and languages.
We present an agent that mimics such human behaviors.
arXiv Detail & Related papers (2021-03-24T02:37:48Z)
- Semantic MapNet: Building Allocentric Semantic Maps and Representations from Egocentric Views [50.844459908504476]
We study the task of semantic mapping - specifically, an embodied agent (a robot or an egocentric AI assistant) is given a tour of a new environment.
We build an allocentric top-down semantic map ("what is where?") from egocentric observations of an RGB-D camera with known pose.
We present SemanticMapNet (SMNet), which combines the strengths of projective camera geometry and neural representation learning.
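The projective-geometry half of such a pipeline can be sketched directly: back-project a depth image with known pinhole intrinsics and camera pose into world coordinates, after which the points can be pooled into a top-down grid as in the earlier sketch. The intrinsics, pose convention, and toy inputs below are assumptions, not SMNet's code.

```python
# Illustrative back-projection of a depth image to world points (not SMNet).
import numpy as np


def backproject(depth, fx, fy, cx, cy, cam_to_world):
    """depth: (H, W) metres; cam_to_world: (4, 4) pose. Returns (H*W, 3) world points."""
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    z = depth.ravel()
    x = (u.ravel() - cx) * z / fx          # pinhole model: X = (u - cx) * Z / fx
    y = (v.ravel() - cy) * z / fy
    pts_cam = np.stack([x, y, z, np.ones_like(z)], axis=1)   # homogeneous coords
    return (pts_cam @ cam_to_world.T)[:, :3]


if __name__ == "__main__":
    depth = np.full((4, 4), 2.0)                              # toy 4x4 depth image
    pose = np.eye(4); pose[:3, 3] = [1.0, 0.0, 0.5]           # camera translated in world
    print(backproject(depth, fx=2.0, fy=2.0, cx=2.0, cy=2.0, cam_to_world=pose)[:2])
```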
arXiv Detail & Related papers (2020-10-02T20:44:46Z)
- Scene Graph Reasoning for Visual Question Answering [23.57543808056452]
We propose a novel method that approaches the task by performing context-driven, sequential reasoning based on the objects and their semantic and spatial relationships present in the scene.
A reinforcement agent then learns to autonomously navigate over the extracted scene graph to generate paths, which are then the basis for deriving answers.
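As a toy illustration of deriving an answer by traversing a scene graph, the sketch below hand-codes a small graph and a relation path; the paper's agent learns this traversal with reinforcement learning, so the graph, question parse, and walk here are purely illustrative.

```python
# Toy sketch: answer a question by following relation edges in a scene graph.
# Scene graph: node -> list of (relation, neighbour) edges (hand-written).
scene_graph = {
    "table": [("supports", "mug"), ("left_of", "sofa")],
    "mug":   [("on", "table"), ("colour", "red")],
    "sofa":  [("right_of", "table")],
}


def answer(anchor, relation_path):
    """Walk the relation path from the anchor node; return the final node."""
    node = anchor
    for rel in relation_path:
        nxt = [n for r, n in scene_graph.get(node, []) if r == rel]
        if not nxt:
            return None
        node = nxt[0]
    return node


# "What colour is the object on the table?" -> anchor "table",
# path ["supports", "colour"] (a hand-written parse for illustration).
print(answer("table", ["supports", "colour"]))   # -> "red"
```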
arXiv Detail & Related papers (2020-07-02T13:02:54Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.