Sketch-based Video Object Localization
- URL: http://arxiv.org/abs/2304.00450v3
- Date: Wed, 29 Nov 2023 06:05:47 GMT
- Title: Sketch-based Video Object Localization
- Authors: Sangmin Woo, So-Yeong Jeon, Jinyoung Park, Minji Son, Sumin Lee,
Changick Kim
- Abstract summary: We build the Sketch-Video Attention Network (SVANet) to bridge the domain gap between sketch and video.
SVANet is equipped with a Cross-modal Transformer that models the interaction between learnable object tokens, query sketch, and video.
By design, SVANet successfully learns the mapping between the query sketches and video objects, achieving state-of-the-art results on the SVOL benchmark.
- Score: 18.75339976049454
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We introduce Sketch-based Video Object Localization (SVOL), a new task aimed
at localizing spatio-temporal object boxes in video queried by the input
sketch. We first outline the challenges in the SVOL task and build the
Sketch-Video Attention Network (SVANet) with the following design principles:
(i) to consider temporal information of video and bridge the domain gap between
sketch and video; (ii) to accurately identify and localize multiple objects
simultaneously; (iii) to handle various styles of sketches; (iv) to be
classification-free. In particular, SVANet is equipped with a Cross-modal
Transformer that models the interaction between learnable object tokens, query
sketch, and video through attention operations, and learns upon a per-frame set
matching strategy that enables frame-wise prediction while utilizing global
video context. We evaluate SVANet on a newly curated SVOL dataset. By design,
SVANet successfully learns the mapping between the query sketches and video
objects, achieving state-of-the-art results on the SVOL benchmark. We further
confirm the effectiveness of SVANet via extensive ablation studies and
visualizations. Lastly, we demonstrate its transfer capability on unseen
datasets and novel categories, suggesting its high scalability in real-world
applications.
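A minimal, hypothetical PyTorch sketch of the design the abstract describes (learnable object tokens cross-attending to sketch and video features, with box and objectness heads for classification-free prediction). All module names, dimensions, and layer choices here are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

class CrossModalDecoderSketch(nn.Module):
    """Illustrative only: object tokens attend over concatenated sketch and
    video features, then predict normalized boxes and objectness scores."""
    def __init__(self, d_model=256, num_tokens=10, num_heads=8, num_layers=3):
        super().__init__()
        # Learnable object tokens, shared across all inputs
        self.object_tokens = nn.Parameter(torch.randn(num_tokens, d_model))
        layer = nn.TransformerDecoderLayer(d_model, num_heads, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers)
        self.box_head = nn.Linear(d_model, 4)    # (cx, cy, w, h), sigmoid-normalized
        self.score_head = nn.Linear(d_model, 1)  # objectness only: classification-free

    def forward(self, sketch_feats, video_feats):
        # sketch_feats: (B, S, d); video_feats: (B, T*H*W, d)
        memory = torch.cat([sketch_feats, video_feats], dim=1)
        tokens = self.object_tokens.unsqueeze(0).expand(memory.size(0), -1, -1)
        out = self.decoder(tokens, memory)  # cross-attention to both modalities
        return self.box_head(out).sigmoid(), self.score_head(out).sigmoid()
```

In the actual method, such predictions would be matched per frame to ground-truth boxes with a set-matching (Hungarian-style) objective; that training machinery is omitted from this sketch.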
Related papers
- Sketch Video Synthesis [52.134906766625164]
We propose a novel framework for sketching videos represented by frame-wise Bézier curves (a minimal Bézier-sampling sketch follows the related-papers list below).
Our method unlocks applications in sketch-based video editing and video doodling, enabled through video composition.
arXiv Detail & Related papers (2023-11-26T14:14:04Z)
- Sketch-based Video Object Segmentation: Benchmark and Analysis [55.79497833614397]
This paper introduces a new task of sketch-based video object segmentation, an associated benchmark, and a strong baseline.
Our benchmark includes three datasets, Sketch-DAVIS16, Sketch-DAVIS17 and Sketch-YouTube-VOS, which exploit human-drawn sketches as an informative yet low-cost reference for video object segmentation.
Experimental results show that sketches are more effective yet annotation-efficient than other references such as photo masks, language, and scribbles.
arXiv Detail & Related papers (2023-11-13T11:53:49Z)
- Learning the What and How of Annotation in Video Object Segmentation [11.012995995497029]
Video Object Segmentation (VOS) is crucial for several applications, from video editing to video data generation.
The traditional way of annotating objects requires humans to draw detailed segmentation masks on the target objects in each video frame.
We propose EVA-VOS, a human-in-the-loop annotation framework for video object segmentation.
arXiv Detail & Related papers (2023-11-08T00:56:31Z)
- Dense Video Object Captioning from Disjoint Supervision [77.47084982558101]
We propose a new task and model for dense video object captioning.
This task unifies spatial and temporal localization in video.
We show how our model improves upon a number of strong baselines for this new task.
arXiv Detail & Related papers (2023-06-20T17:57:23Z)
- Visual Relationship Forecasting in Videos [56.122037294234865]
We present a new task named Visual Relationship Forecasting (VRF) in videos, which explores predicting future visual relationships by reasoning.
Given a subject-object pair with H existing frames, VRF aims to predict their future interactions for the next T frames without visual evidence.
To evaluate the VRF task, we introduce two video datasets named VRF-AG and VRF-VidOR, with a series of temporally localized visual relation annotations in a video.
arXiv Detail & Related papers (2021-07-02T16:43:19Z)
- Fine-Grained Instance-Level Sketch-Based Video Retrieval [159.12935292432743]
We propose a novel cross-modal retrieval problem of fine-grained instance-level sketch-based video retrieval (FG-SBVR).
Compared with sketch-based still image retrieval and coarse-grained category-level video retrieval, this is more challenging as both visual appearance and motion need to be matched simultaneously at a fine-grained level.
We show that this model significantly outperforms a number of existing state-of-the-art models designed for video analysis.
arXiv Detail & Related papers (2020-02-21T18:28:35Z)
- Learning spatio-temporal representations with temporal squeeze pooling [11.746833714322154]
We propose a new video representation learning method, named Temporal Squeeze (TS) pooling, which extracts the essential movement information from a long sequence of video frames and maps it into a small set of images, named Squeezed Images (an illustrative sketch follows this list).
The resulting Squeezed Images retain the essential movement information from the video frames and are optimized for the video classification task.
We evaluate our architecture on two video classification benchmarks and compare the results to the state-of-the-art.
arXiv Detail & Related papers (2020-02-11T21:13:12Z)
- SketchDesc: Learning Local Sketch Descriptors for Multi-view Correspondence [68.63311821718416]
We study the problem of multi-view sketch correspondence, where we take as input multiple freehand sketches with different views of the same object.
This problem is challenging since the visual features of corresponding points at different views can be very different.
We take a deep learning approach and learn a novel local sketch descriptor from data.
arXiv Detail & Related papers (2020-01-16T11:31:21Z)
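The Sketch Video Synthesis entry above represents video sketches with frame-wise Bézier curves; as the minimal illustration referenced there, here is how a single cubic Bézier stroke can be sampled with NumPy. The function and control points are assumptions for illustration, not the paper's code.

```python
import numpy as np

def cubic_bezier(p0, p1, p2, p3, n=50):
    """Sample n points on a cubic Bezier curve with control points p0..p3."""
    t = np.linspace(0.0, 1.0, n)[:, None]  # (n, 1), broadcasts against (2,) points
    return ((1 - t) ** 3 * p0 + 3 * (1 - t) ** 2 * t * p1
            + 3 * (1 - t) * t ** 2 * p2 + t ** 3 * p3)

# A frame-wise sketch could store a set of such curves per frame, with control
# points optimized so the rendered strokes match the frame content.
stroke = cubic_bezier(np.array([0.0, 0.0]), np.array([0.2, 1.0]),
                      np.array([0.8, 1.0]), np.array([1.0, 0.0]))
print(stroke.shape)  # (50, 2)
```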
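The Temporal Squeeze pooling entry above maps a long frame sequence to a few Squeezed Images; the sketch referenced there is below. It assumes the squeeze is a learned softmax-weighted combination over the time axis, which is a simplification of the paper's method, not its actual implementation.

```python
import torch
import torch.nn as nn

class TemporalSqueezeSketch(nn.Module):
    """Illustrative only: squeeze T frames into K 'Squeezed Images' using
    learned per-frame weights (softmax over time), one weight row per output."""
    def __init__(self, num_frames=64, num_squeezed=2):
        super().__init__()
        self.weights = nn.Parameter(torch.randn(num_squeezed, num_frames))

    def forward(self, frames):
        # frames: (B, T, C, H, W) -> squeezed: (B, K, C, H, W)
        w = self.weights.softmax(dim=1)  # (K, T); each row sums to 1 over time
        return torch.einsum('kt,btchw->bkchw', w, frames)

frames = torch.randn(2, 64, 3, 112, 112)
print(TemporalSqueezeSketch()(frames).shape)  # torch.Size([2, 2, 3, 112, 112])
```

The resulting squeezed images could then be fed to an ordinary 2D image classifier for video classification.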