Context Enhanced Transformer for Single Image Object Detection
- URL: http://arxiv.org/abs/2312.14492v2
- Date: Tue, 26 Dec 2023 05:54:22 GMT
- Title: Context Enhanced Transformer for Single Image Object Detection
- Authors: Seungjun An, Seonghoon Park, Gyeongnyeon Kim, Jeongyeol Baek,
Byeongwon Lee, Seungryong Kim
- Abstract summary: We propose a novel approach to single image object detection, called Context Enhanced TRansformer (CETR)
To efficiently store temporal information, we construct a class-wise memory that collects contextual information across data.
We present a classification-based sampling technique to selectively utilize the relevant memory for the current image.
- Score: 31.52466523847246
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: With the increasing importance of video data in real-world applications,
there is a rising need for efficient object detection methods that utilize
temporal information. While existing video object detection (VOD) techniques
employ various strategies to address this challenge, they typically depend on
locally adjacent frames or randomly sampled images within a clip. Although
recent Transformer-based VOD methods have shown promising results, their
reliance on multiple inputs and additional network complexity to incorporate
temporal information limits their practical applicability. In this paper, we
propose a novel approach to single image object detection, called Context
Enhanced TRansformer (CETR), by incorporating temporal context into DETR using
a newly designed memory module. To efficiently store temporal information, we
construct a class-wise memory that collects contextual information across data.
Additionally, we present a classification-based sampling technique to
selectively utilize the relevant memory for the current image. In the testing,
We introduce a test-time memory adaptation method that updates individual
memory functions by considering the test distribution. Experiments with CityCam
and ImageNet VID datasets exhibit the efficiency of the framework on various
video systems. The project page and code will be made available at:
https://ku-cvlab.github.io/CETR.
Related papers
- Multi-grained Temporal Prototype Learning for Few-shot Video Object
Segmentation [156.4142424784322]
Few-Shot Video Object (FSVOS) aims to segment objects in a query video with the same category defined by a few annotated support images.
We propose to leverage multi-grained temporal guidance information for handling the temporal correlation nature of video data.
Our proposed video IPMT model significantly outperforms previous models on two benchmark datasets.
arXiv Detail & Related papers (2023-09-20T09:16:34Z) - Glitch in the Matrix: A Large Scale Benchmark for Content Driven
Audio-Visual Forgery Detection and Localization [20.46053083071752]
We propose and benchmark a new dataset, Localized Visual DeepFake (LAV-DF)
LAV-DF consists of strategic content-driven audio, visual and audio-visual manipulations.
The proposed baseline method, Boundary Aware Temporal Forgery Detection (BA-TFD), is a 3D Convolutional Neural Network-based architecture.
arXiv Detail & Related papers (2023-05-03T08:48:45Z) - Improving Image Recognition by Retrieving from Web-Scale Image-Text Data [68.63453336523318]
We introduce an attention-based memory module, which learns the importance of each retrieved example from the memory.
Compared to existing approaches, our method removes the influence of the irrelevant retrieved examples, and retains those that are beneficial to the input query.
We show that it achieves state-of-the-art accuracies in ImageNet-LT, Places-LT and Webvision datasets.
arXiv Detail & Related papers (2023-04-11T12:12:05Z) - Dual Memory Aggregation Network for Event-Based Object Detection with
Learnable Representation [79.02808071245634]
Event-based cameras are bio-inspired sensors that capture brightness change of every pixel in an asynchronous manner.
Event streams are divided into grids in the x-y-t coordinates for both positive and negative polarity, producing a set of pillars as 3D tensor representation.
Long memory is encoded in the hidden state of adaptive convLSTMs while short memory is modeled by computing spatial-temporal correlation between event pillars.
arXiv Detail & Related papers (2023-03-17T12:12:41Z) - A novel efficient Multi-view traffic-related object detection framework [17.50049841016045]
We propose a novel traffic-related framework named CEVAS to achieve efficient object detection using multi-view video data.
Results show that our framework significantly reduces response latency while achieving the same detection accuracy as the state-of-the-art methods.
arXiv Detail & Related papers (2023-02-23T06:42:37Z) - ObjectFormer for Image Manipulation Detection and Localization [118.89882740099137]
We propose ObjectFormer to detect and localize image manipulations.
We extract high-frequency features of the images and combine them with RGB features as multimodal patch embeddings.
We conduct extensive experiments on various datasets and the results verify the effectiveness of the proposed method.
arXiv Detail & Related papers (2022-03-28T12:27:34Z) - Implicit Motion Handling for Video Camouflaged Object Detection [60.98467179649398]
We propose a new video camouflaged object detection (VCOD) framework.
It can exploit both short-term and long-term temporal consistency to detect camouflaged objects from video frames.
arXiv Detail & Related papers (2022-03-14T17:55:41Z) - Recent Trends in 2D Object Detection and Applications in Video Event
Recognition [0.76146285961466]
We discuss the pioneering works in object detection, followed by the recent breakthroughs that employ deep learning.
We highlight recent datasets for 2D object detection both in images and videos, and present a comparative performance summary of various state-of-the-art object detection techniques.
arXiv Detail & Related papers (2022-02-07T14:15:11Z) - Robust and efficient post-processing for video object detection [9.669942356088377]
This work introduces a novel post-processing pipeline that overcomes some of the limitations of previous post-processing methods.
Our method improves the results of state-of-the-art specific video detectors, specially regarding fast moving objects.
And applied to efficient still image detectors, such as YOLO, provides comparable results to much more computationally intensive detectors.
arXiv Detail & Related papers (2020-09-23T10:47:24Z) - MuCAN: Multi-Correspondence Aggregation Network for Video
Super-Resolution [63.02785017714131]
Video super-resolution (VSR) aims to utilize multiple low-resolution frames to generate a high-resolution prediction for each frame.
Inter- and intra-frames are the key sources for exploiting temporal and spatial information.
We build an effective multi-correspondence aggregation network (MuCAN) for VSR.
arXiv Detail & Related papers (2020-07-23T05:41:27Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.