Memory Enhanced Global-Local Aggregation for Video Object Detection
- URL: http://arxiv.org/abs/2003.12063v1
- Date: Thu, 26 Mar 2020 17:59:38 GMT
- Title: Memory Enhanced Global-Local Aggregation for Video Object Detection
- Authors: Yihong Chen, Yue Cao, Han Hu, Liwei Wang
- Abstract summary: We argue that there are two important cues for humans to recognize objects in videos: the global semantic information and the local localization information.
We introduce memory enhanced global-local aggregation (MEGA) network.
Our method achieves state-of-the-art performance on ImageNet VID dataset.
- Score: 33.624831537299734
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: How do humans recognize an object in a piece of video? Due to the
deteriorated quality of a single frame, it may be hard for people to identify an
occluded object in that frame using only the information within one image.
We argue that there are two important cues for humans to recognize objects in
videos: the global semantic information and the local localization information.
Recently, plenty of methods adopt the self-attention mechanisms to enhance the
features in key frame with either global semantic information or local
localization information. In this paper we introduce the memory enhanced
global-local aggregation (MEGA) network, which is among the first attempts to
take full consideration of both global and local information. Furthermore,
empowered by a novel and carefully designed Long Range Memory (LRM) module, our
proposed MEGA enables the key frame to access much more content than any
previous method. Enhanced by these two sources of information, our
method achieves state-of-the-art performance on ImageNet VID dataset. Code is
available at \url{https://github.com/Scalsol/mega.pytorch}.
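The aggregation idea in the abstract can be sketched in a few lines: key-frame features are enhanced by attending over a pool of local (nearby-frame) and global (distant-frame) features, while a long-range memory caches previously enhanced features so that later frames can attend far beyond the current aggregation window. The following is a minimal illustrative sketch, not the authors' implementation; the function names, the residual update, and the feature dimensions are all hypothetical simplifications of the actual MEGA architecture.

```python
import numpy as np

def softmax(x, axis=-1):
    # numerically stable softmax
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attend(query, support, dim):
    # scaled dot-product attention: enhance query features by
    # aggregating over the support (reference) features
    scores = query @ support.T / np.sqrt(dim)
    return softmax(scores, axis=-1) @ support

def enhance_key_frame(key_feats, local_feats, global_feats, memory):
    """Enhance key-frame features with local, global, and memorized
    features (illustrative sketch of global-local aggregation)."""
    d = key_feats.shape[-1]
    # pool everything the key frame may attend to; the long-range
    # memory extends this pool with features cached from past frames
    support = np.concatenate([local_feats, global_feats] + memory, axis=0)
    enhanced = key_feats + attend(key_feats, support, d)
    # cache the enhanced features so future key frames can reuse them,
    # in the spirit of the Long Range Memory (LRM) module
    memory.append(enhanced)
    return enhanced, memory
```

Because the memory stores already-enhanced features, each new key frame indirectly sees content from many earlier frames without recomputing their attention, which is the efficiency argument behind the LRM design.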
Related papers
- LTCA: Long-range Temporal Context Attention for Referring Video Object Segmentation [14.277537679679101]
We propose an effective long-range temporal context attention (LTCA) mechanism to aggregate global context information into object features.
We show our method achieves new state-of-the-art on four referring video segmentation benchmarks.
arXiv Detail & Related papers (2025-10-09T14:55:52Z)
- Perceive Anything: Recognize, Explain, Caption, and Segment Anything in Images and Videos [53.723410664944566]
We present the Perceive Anything Model (PAM), a framework for comprehensive region-level visual understanding in images and videos.
Our approach extends the powerful segmentation model SAM 2 by integrating Large Language Models (LLMs), enabling simultaneous object segmentation.
A key component, the Semantic Perceiver, is introduced to efficiently transform SAM 2's rich visual features into multi-modal tokens.
arXiv Detail & Related papers (2025-06-05T17:51:39Z) - GLUS: Global-Local Reasoning Unified into A Single Large Language Model for Video Segmentation [22.769692511220327]
This paper proposes a novel framework utilizing multi-modal large language models (MLLMs) for referring video object segmentation (RefVOS)
Our framework shows that global and local consistency can be unified into a single video segmentation MLLM.
To improve the information efficiency within the limited context window of MLLMs, we introduce object contrastive learning to distinguish hard false-positive objects.
arXiv Detail & Related papers (2025-04-10T17:59:55Z) - ReferEverything: Towards Segmenting Everything We Can Speak of in Videos [42.88584315033116]
We present REM, a framework for segmenting concepts in video that can be described through natural language.
Our method capitalizes on visual representations learned by video diffusion models on Internet-scale datasets.
arXiv Detail & Related papers (2024-10-30T17:59:26Z) - SEDS: Semantically Enhanced Dual-Stream Encoder for Sign Language Retrieval [82.51117533271517]
Previous works typically only encode RGB videos to obtain high-level semantic features.
Existing RGB-based sign retrieval works suffer from the huge memory cost of dense visual data embedding in end-to-end training.
We propose a novel sign language representation framework called the Semantically Enhanced Dual-Stream Encoder (SEDS).
arXiv Detail & Related papers (2024-07-23T11:31:11Z)
- Other Tokens Matter: Exploring Global and Local Features of Vision Transformers for Object Re-Identification [63.147482497821166]
We first explore the influence of global and local features of ViT and then propose a novel Global-Local Transformer (GLTrans) for high-performance object Re-ID.
Our proposed method achieves superior performance on four object Re-ID benchmarks.
arXiv Detail & Related papers (2024-04-23T12:42:07Z)
- Referring Camouflaged Object Detection [97.90911862979355]
Ref-COD aims to segment specified camouflaged objects based on a small set of referring images with salient target objects.
We first assemble a large-scale dataset, called R2C7K, which consists of 7K images covering 64 object categories in real-world scenarios.
arXiv Detail & Related papers (2023-06-13T04:15:37Z)
- Local-Aware Global Attention Network for Person Re-Identification Based on Body and Hand Images [0.0]
We propose a compound approach for end-to-end discriminative deep feature learning for person Re-Id based on both body and hand images.
The proposed method consistently outperforms existing state-of-the-art methods.
arXiv Detail & Related papers (2022-09-11T09:43:42Z)
- L2G: A Simple Local-to-Global Knowledge Transfer Framework for Weakly Supervised Semantic Segmentation [67.26984058377435]
We present L2G, a simple online local-to-global knowledge transfer framework for high-quality object attention mining.
Our framework conducts the global network to learn the captured rich object detail knowledge from a global view.
Experiments show that our method attains 72.1% and 44.2% mIoU on the validation sets of PASCAL VOC 2012 and MS COCO 2014, respectively.
arXiv Detail & Related papers (2022-04-07T04:31:32Z)
- Boosting Few-shot Semantic Segmentation with Transformers [81.43459055197435]
We propose a TRansformer-based Few-shot Semantic segmentation method (TRFS).
Our model consists of two modules: a Global Enhancement Module (GEM) and a Local Enhancement Module (LEM).
arXiv Detail & Related papers (2021-08-04T20:09:21Z)
- Efficient Regional Memory Network for Video Object Segmentation [56.587541750729045]
We propose a novel local-to-local matching solution for semi-supervised VOS, namely the Regional Memory Network (RMNet).
The proposed RMNet effectively alleviates the ambiguity of similar objects in both memory and query frames.
Experimental results indicate that the proposed RMNet performs favorably against state-of-the-art methods on the DAVIS and YouTube-VOS datasets.
arXiv Detail & Related papers (2021-03-24T02:08:46Z)
- Gait Recognition via Effective Global-Local Feature Representation and Local Temporal Aggregation [28.721376937882958]
Gait recognition is one of the most important biometric technologies and has been applied in many fields.
Recent gait recognition frameworks represent each gait frame by descriptors extracted from either global appearances or local regions of humans.
We propose a novel feature extraction and fusion framework to achieve discriminative feature representations for gait recognition.
arXiv Detail & Related papers (2020-11-03T04:07:13Z)
- An Explicit Local and Global Representation Disentanglement Framework with Applications in Deep Clustering and Unsupervised Object Detection [9.609936822226633]
We propose a framework, called SPLIT, which allows us to disentangle local and global information.
Our framework adds a generative assumption to the variational autoencoder (VAE) framework.
We show that the framework can effectively disentangle local and global information within these models.
arXiv Detail & Related papers (2020-01-24T12:09:20Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the content (including all information) and is not responsible for any consequences of its use.