Related papers: ObjectNLQ @ Ego4D Episodic Memory Challenge 2024

URL: http://arxiv.org/abs/2406.15778v2
Date: Mon, 18 Nov 2024 03:02:17 GMT
Title: ObjectNLQ @ Ego4D Episodic Memory Challenge 2024
Authors: Yisen Feng, Haoyu Zhang, Yuquan Xie, Zaijing Li, Meng Liu, Liqiang Nie
Abstract summary: We present our approach for the Natural Language Query track and Goal Step track of the Ego4D Episodic Memory Benchmark at CVPR 2024. Both challenges require the localization of actions within long video sequences using textual queries. We introduce a novel approach, termed ObjectNLQ, which incorporates an object branch to augment the video representation with detailed object information.
Score: 51.57555556405898
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: In this report, we present our approach for the Natural Language Query track and Goal Step track of the Ego4D Episodic Memory Benchmark at CVPR 2024. Both challenges require the localization of actions within long video sequences using textual queries. To enhance localization accuracy, our method not only processes the temporal information of videos but also identifies fine-grained objects spatially within the frames. To this end, we introduce a novel approach, termed ObjectNLQ, which incorporates an object branch to augment the video representation with detailed object information, thereby improving grounding efficiency. ObjectNLQ achieves a mean R@1 of 23.15, ranking 2nd in the Natural Language Queries Challenge, and gains 33.00 in terms of the metric R@1, IoU=0.3, ranking 3rd in the Goal Step Challenge. Our code will be released at https://github.com/Yisen-Feng/ObjectNLQ.
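
The R@1 figures quoted in the abstract can be made concrete with a small sketch. It assumes the usual temporal grounding convention: a query counts as a hit when the top-ranked predicted segment overlaps the ground-truth segment with temporal IoU at or above a threshold. Treating "mean R@1" as the average over IoU=0.3 and IoU=0.5 is an assumption here, not something the abstract states. The function names and the toy segments are hypothetical, and this is not the authors' evaluation code.

```python
def temporal_iou(pred, gt):
    """Intersection over union of two (start, end) segments, in seconds."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0


def recall_at_1(top1_preds, gts, iou_threshold):
    """Percentage of queries whose top-1 segment reaches the IoU threshold."""
    hits = sum(
        temporal_iou(p, g) >= iou_threshold for p, g in zip(top1_preds, gts)
    )
    return 100.0 * hits / len(gts)


# Hypothetical toy data: one top-ranked prediction per query.
top1_preds = [(0.0, 10.0), (3.0, 10.0), (20.0, 30.0)]
gts = [(0.0, 10.0), (6.0, 12.0), (40.0, 50.0)]

r1_03 = recall_at_1(top1_preds, gts, 0.3)
r1_05 = recall_at_1(top1_preds, gts, 0.5)
# Assumed definition of "mean R@1": average over the two IoU thresholds.
mean_r1 = (r1_03 + r1_05) / 2
```

In this toy case the second prediction overlaps its ground truth with IoU 4/9, so it counts as a hit at IoU=0.3 but not at IoU=0.5.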

Related papers

RELOCATE: A Simple Training-Free Baseline for Visual Query Localization Using Region-Based Representations [55.74675012171316]
RELOCATE is a training-free baseline designed to perform the challenging task of visual query localization in long videos. To eliminate the need for task-specific training, RELOCATE leverages a region-based representation derived from pretrained vision models.
arXiv Detail & Related papers (2024-12-02T18:59:53Z)
Teaching VLMs to Localize Specific Objects from In-context Examples [56.797110842152]
Vision-Language Models (VLMs) have shown remarkable capabilities across diverse visual tasks. Current VLMs lack a fundamental cognitive ability: learning to localize objects in a scene by taking into account the context. This work is the first to explore and benchmark personalized few-shot localization for VLMs.
arXiv Detail & Related papers (2024-11-20T13:34:22Z)
3D-Aware Instance Segmentation and Tracking in Egocentric Videos [107.10661490652822]
Egocentric videos present unique challenges for 3D scene understanding. This paper introduces a novel approach to instance segmentation and tracking in first-person video. By incorporating spatial and temporal cues, we achieve superior performance compared to state-of-the-art 2D approaches.
arXiv Detail & Related papers (2024-08-19T10:08:25Z)
Point-VOS: Pointing Up Video Object Segmentation [16.359861197595986]
Current state-of-the-art Video Object Segmentation (VOS) methods rely on dense per-object mask annotations during both training and testing. We propose a novel Point-VOS task with a spatio-temporally sparse point-wise annotation scheme that substantially reduces the annotation effort. We show that our data can be used to improve models that connect vision and language, by evaluating it on the Video Narrative Grounding (VNG) task.
arXiv Detail & Related papers (2024-02-08T18:52:23Z)
Fully Transformer-Equipped Architecture for End-to-End Referring Video Object Segmentation [24.814534011440877]
We propose an end-to-end RVOS framework which treats the RVOS task as a mask sequence learning problem. To capture the object-level spatial context, we have developed the Stacked Transformer. The model finds the best matching between mask sequence and text query.
arXiv Detail & Related papers (2023-09-21T09:47:47Z)
Where is my Wallet? Modeling Object Proposal Sets for Egocentric Visual Query Localization [119.23191388798921]
This paper deals with the problem of localizing objects in image and video datasets from visual exemplars. We first identify grave implicit biases in current query-conditioned model design and visual query datasets. We propose a novel transformer-based module that allows for object-proposal set context to be considered.
arXiv Detail & Related papers (2022-11-18T22:50:50Z)
The Second Place Solution for The 4th Large-scale Video Object Segmentation Challenge--Track 3: Referring Video Object Segmentation [18.630453674396534]
ReferFormer aims to segment, in all video frames, the object instances referred to by a language expression. This work proposes several tricks to boost performance further, including cyclical learning rates, a semi-supervised approach, and test-time augmentation at inference. The improved ReferFormer ranks 2nd in the CVPR 2022 Referring YouTube-VOS Challenge.
arXiv Detail & Related papers (2022-06-24T02:15:06Z)
Local-Global Context Aware Transformer for Language-Guided Video Segmentation [103.35509224722097]
We explore the task of language-guided video segmentation (LVS). We present Locater, which augments the Transformer architecture with a finite memory so as to query the entire video with the language expression in an efficient manner. To thoroughly examine the visual grounding capability of LVS models, we contribute a new LVS dataset, A2D-S+, which is built upon the A2D-S dataset.
arXiv Detail & Related papers (2022-03-18T07:35:26Z)
O2NA: An Object-Oriented Non-Autoregressive Approach for Controllable Video Captioning [41.14313691818424]
We propose an Object-Oriented Non-Autoregressive approach (O2NA) for video captioning. O2NA performs caption generation in three steps: 1) identify the focused objects and predict their locations in the target caption; 2) generate the related attribute words and relation words of these focused objects to form a draft caption; and 3) combine video information to refine the draft caption to a fluent final caption. Experiments on two benchmark datasets, MSR-VTT and MSVD, demonstrate the effectiveness of O2NA.
arXiv Detail & Related papers (2021-08-05T04:17:20Z)
DORi: Discovering Object Relationship for Moment Localization of a Natural-Language Query in Video [98.54696229182335]
We study the task of temporal moment localization in a long untrimmed video using a natural language query. Our key innovation is to learn a video feature embedding through a language-conditioned message-passing algorithm. A temporal sub-graph captures the activities within the video through time.
arXiv Detail & Related papers (2020-10-13T09:50:29Z)

This list is automatically generated from the titles and abstracts of the papers on this site.