AssistQ: Affordance-centric Question-driven Task Completion for
Egocentric Assistant
- URL: http://arxiv.org/abs/2203.04203v1
- Date: Tue, 8 Mar 2022 17:07:09 GMT
- Title: AssistQ: Affordance-centric Question-driven Task Completion for
Egocentric Assistant
- Authors: Benita Wong, Joya Chen, You Wu, Stan Weixian Lei, Dongxing Mao, Difei
Gao, Mike Zheng Shou
- Abstract summary: We define a new task called Affordance-centric Question-driven Task Completion.
The AI assistant should learn from instructional videos and scripts to guide the user step-by-step.
To support the task, we constructed AssistQ, a new dataset comprising 529 question-answer samples.
- Score: 6.379158555341729
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: A long-standing goal of intelligent assistants such as AR glasses/robots has
been to assist users in affordance-centric real-world scenarios, such as "how
can I run the microwave for 1 minute?". However, there is still no clear task
definition and suitable benchmarks. In this paper, we define a new task called
Affordance-centric Question-driven Task Completion, where the AI assistant
should learn from instructional videos and scripts to guide the user
step-by-step. To support the task, we constructed AssistQ, a new dataset
comprising 529 question-answer samples derived from 100 newly filmed
first-person videos. Each question should be answered with multi-step
guidance by inferring from visual details (e.g., button positions) and
textual details (e.g., actions like press/turn). To address this unique task,
we developed a Question-to-Actions (Q2A) model that significantly outperforms
several baseline methods while still having large room for improvement. We
expect our task and dataset to advance the development of egocentric AI assistants.
Our project page is available at: https://showlab.github.io/assistq
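To make the setting concrete, the following is a minimal sketch of what an AQTC-style sample and a greedy step-by-step answer selector could look like. The field names, the hashed bag-of-words encoder, and the similarity-based scoring are illustrative assumptions only; they are not the released AssistQ schema or the authors' Q2A architecture.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class AQTCSample:
    question: str                       # e.g. "How can I run the microwave for 1 minute?"
    script: List[str]                   # transcribed instructional script sentences
    candidate_actions: List[List[str]]  # for each step, a set of candidate guidances
    answer_indices: List[int]           # ground-truth candidate index for each step

def toy_embed(text: str, dim: int = 16) -> List[float]:
    # Placeholder text encoder: hashed bag-of-words. A real model would use
    # visual features from the video plus textual features from the script.
    vec = [0.0] * dim
    for token in text.lower().split():
        vec[hash(token) % dim] += 1.0
    return vec

def dot(a: List[float], b: List[float]) -> float:
    return sum(x * y for x, y in zip(a, b))

def select_steps(sample: AQTCSample) -> List[int]:
    """Greedily pick one candidate action per step, conditioning on the
    question plus the actions chosen so far (a crude stand-in for
    multi-step inference)."""
    history = sample.question
    chosen = []
    for step_candidates in sample.candidate_actions:
        query = toy_embed(history)
        scores = [dot(query, toy_embed(c)) for c in step_candidates]
        chosen.append(max(range(len(step_candidates)), key=lambda i: scores[i]))
        history += " " + step_candidates[chosen[-1]]
    return chosen

sample = AQTCSample(
    question="How can I run the microwave for 1 minute?",
    script=["Press the time button.", "Turn the dial to one minute.", "Press start."],
    candidate_actions=[
        ["press the time button", "open the door"],
        ["turn the dial to 1:00", "press stop"],
        ["press start", "unplug the microwave"],
    ],
    answer_indices=[0, 0, 0],
)
print("predicted:", select_steps(sample), "ground truth:", sample.answer_indices)
```

A real system would replace the toy encoder with visual features from the egocentric video (e.g., button positions) fused with textual features from the script, but the step-wise, history-conditioned selection loop conveys the shape of the task.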
Related papers
- HoloAssist: an Egocentric Human Interaction Dataset for Interactive AI Assistants in the Real World [48.90399899928823]
This work is part of a broader research effort to develop intelligent agents that can interactively guide humans through performing tasks in the physical world.
We introduce HoloAssist, a large-scale egocentric human interaction dataset.
We present key insights into how human assistants correct mistakes, intervene in the task completion procedure, and ground their instructions to the environment.
arXiv Detail & Related papers (2023-09-29T07:17:43Z)
- A Solution to CVPR'2023 AQTC Challenge: Video Alignment for Multi-Step Inference [51.26551806938455]
Affordance-centric Question-driven Task Completion (AQTC) for Egocentric Assistant introduces a groundbreaking scenario.
We present a solution for enhancing video alignment to improve multi-step inference.
Our method secured the 2nd place in CVPR'2023 AQTC challenge.
arXiv Detail & Related papers (2023-06-26T04:19:33Z)
- TaskMatrix.AI: Completing Tasks by Connecting Foundation Models with Millions of APIs [71.7495056818522]
We introduce TaskMatrix.AI as a new AI ecosystem that connects foundation models with millions of APIs for task completion.
We will present our vision of how to build such an ecosystem, explain each key component, and use case studies to illustrate both the feasibility of this vision and the main challenges we need to address next.
arXiv Detail & Related papers (2023-03-29T03:30:38Z)
- Task Compass: Scaling Multi-task Pre-training with Task Prefix [122.49242976184617]
Existing studies show that multi-task learning with large-scale supervised tasks suffers from negative effects across tasks.
We propose a task prefix guided multi-task pre-training framework to explore the relationships among tasks (a toy illustration of the prefixing idea appears after this list).
Our model can not only serve as the strong foundation backbone for a wide range of tasks but also be feasible as a probing tool for analyzing task relationships.
arXiv Detail & Related papers (2022-10-12T15:02:04Z)
- EgoTaskQA: Understanding Human Tasks in Egocentric Videos [89.9573084127155]
The EgoTaskQA benchmark provides a home for crucial dimensions of task understanding through question answering on real-world egocentric videos.
We meticulously design questions that target the understanding of (1) action dependencies and effects, (2) intents and goals, and (3) agents' beliefs about others.
We evaluate state-of-the-art video reasoning models on our benchmark and show the significant gap between them and humans in understanding complex goal-oriented egocentric videos.
arXiv Detail & Related papers (2022-10-08T05:49:05Z)
- Winning the CVPR'2022 AQTC Challenge: A Two-stage Function-centric Approach [51.424201533529114]
Affordance-centric Question-driven Task Completion for Egocentric Assistant (AQTC) is a novel task that helps an AI assistant learn from instructional videos and scripts and guide the user step-by-step.
We tackle AQTC via a two-stage Function-centric approach, consisting of a Question2Function Module that grounds the question in the related function and a Function2Answer Module that predicts the action based on the historical steps (a rough sketch of this pipeline also appears after this list).
arXiv Detail & Related papers (2022-06-20T07:02:23Z)
- AssistSR: Affordance-centric Question-driven Video Segment Retrieval [4.047098915826058]
We present a new task called Affordance-centric Question-driven Video Segment Retrieval (AQVSR).
arXiv Detail & Related papers (2021-11-30T01:14:10Z)
- Meta-learning for Few-shot Natural Language Processing: A Survey [10.396506243272158]
Few-shot natural language processing (NLP) refers to NLP tasks that come with only a handful of labeled examples.
This paper focuses on the NLP domain, especially few-shot applications.
We aim to provide clearer definitions, a summary of progress, and common datasets for applying meta-learning to few-shot NLP.
arXiv Detail & Related papers (2020-07-19T06:36:41Z)
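Toy illustration for the Task Compass entry above: a minimal sketch, assuming a T5-style text-to-text setup, of how a task prefix can be prepended to every example when mixing supervised tasks for multi-task pre-training. The prefix format and task names are hypothetical, not the paper's exact configuration.

```python
from typing import Dict, List, Tuple

def build_prefixed_mixture(
    task_data: Dict[str, List[Tuple[str, str]]]
) -> List[Tuple[str, str]]:
    """Flatten several supervised tasks into one training list, marking each
    input with a task-prefix token such as "[nli]"."""
    mixture = []
    for task_name, pairs in task_data.items():
        for source, target in pairs:
            mixture.append((f"[{task_name}] {source}", target))
    return mixture

# Hypothetical two-task mixture; real pre-training would cover many more tasks.
task_data = {
    "nli": [("premise: It is raining. hypothesis: The ground is wet.", "entailment")],
    "qa": [("question: Who wrote Hamlet? context: ...", "Shakespeare")],
}
for source, target in build_prefixed_mixture(task_data):
    print(source, "->", target)
```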
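Rough sketch for the CVPR'2022 two-stage Function-centric entry above: stage 1 (Question2Function) grounds the question to the most related function passage, stage 2 (Function2Answer) picks an action from that function plus the steps taken so far. The retrieval-by-similarity grounding and the toy encoder are assumptions for illustration, not the authors' released implementation.

```python
from typing import List

def toy_embed(text: str, dim: int = 16) -> List[float]:
    # Placeholder hashed bag-of-words encoder; stands in for real
    # video/script features.
    vec = [0.0] * dim
    for token in text.lower().split():
        vec[hash(token) % dim] += 1.0
    return vec

def similarity(a: List[float], b: List[float]) -> float:
    return sum(x * y for x, y in zip(a, b))

def question2function(question: str, functions: List[str]) -> str:
    """Stage 1: retrieve the function passage most related to the question."""
    q = toy_embed(question)
    return max(functions, key=lambda f: similarity(q, toy_embed(f)))

def function2answer(function: str, history: List[str], candidates: List[str]) -> str:
    """Stage 2: score candidate actions against the grounded function plus
    the historical steps."""
    context = toy_embed(function + " " + " ".join(history))
    return max(candidates, key=lambda c: similarity(context, toy_embed(c)))

functions = [
    "Timer function: press the time button, then turn the dial to set minutes.",
    "Defrost function: press defrost and enter the food weight.",
]
grounded = question2function("How can I run the microwave for 1 minute?", functions)
first_step = function2answer(grounded, [], ["press the time button", "press defrost"])
print(first_step)
```

Splitting grounding from answering narrows the context the answer module must reason over, which is the intuition behind the function-centric design described above.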
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.