Related papers: Language-guided Learning for Object Detection Tackling Multiple Variations in Aerial Images

URL: http://arxiv.org/abs/2505.23193v1
Date: Thu, 29 May 2025 07:31:39 GMT
Title: Language-guided Learning for Object Detection Tackling Multiple Variations in Aerial Images
Authors: Sungjune Park, Hyunjun Kim, Beomchan Park, Yong Man Ro
Abstract summary: In this paper, we introduce a novel object detection framework in aerial images, named LANGuage-guided Object detection (LANGO). Upon the proposed language-guided learning, the proposed framework is designed to alleviate the impacts from both scene and instance-level variations. Through extensive experiments, we demonstrate the effectiveness of the proposed method, and our method obtains noticeable detection performance improvements.
Score: 47.29074873769022
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Despite recent advancements in computer vision research, object detection in aerial images still suffers from several challenges. One primary challenge to be mitigated is the presence of multiple types of variation in aerial images, for example, illumination and viewpoint changes. These variations result in highly diverse image scenes and drastic alterations in object appearance, so that it becomes more complicated to localize objects from the whole image scene and recognize their categories. To address this problem, in this paper, we introduce a novel object detection framework in aerial images, named LANGuage-guided Object detection (LANGO). Upon the proposed language-guided learning, the proposed framework is designed to alleviate the impacts from both scene and instance-level variations. First, we are motivated by the way humans understand the semantics of scenes while perceiving environmental factors in the scenes (e.g., weather). Therefore, we design a visual semantic reasoner that comprehends visual semantics of image scenes by interpreting conditions where the given images were captured. Second, we devise a training objective, named relation learning loss, to deal with instance-level variations, such as viewpoint angle and scale changes. This training objective aims to learn relations in language representations of object categories, with the help of the robust characteristics against such variations. Through extensive experiments, we demonstrate the effectiveness of the proposed method, and our method obtains noticeable detection performance improvements.

Related papers

Improving Object Detection via Local-global Contrastive Learning [27.660633883387753]
We present a novel image-to-image translation method that specifically targets cross-domain object detection. We learn to represent objects by contrasting local-global information. This affords investigation of an under-explored challenge: obtaining performant detection, under domain shifts.
Date: 2024-10-07T14:18:32Z

SemAug: Semantically Meaningful Image Augmentations for Object Detection Through Language Grounding [5.715548995729382]
We propose an effective technique for image augmentation by injecting contextually meaningful knowledge into the scenes. Our method of semantically meaningful image augmentation for object detection via language grounding, SemAug, starts by calculating semantically appropriate new objects.
Date: 2022-08-15T19:00:56Z

Object-aware Contrastive Learning for Debiased Scene Representation [74.30741492814327]
We develop a novel object-aware contrastive learning framework that localizes objects in a self-supervised manner. We also introduce two data augmentations based on ContraCAM, object-aware random crop and background mixup, which reduce contextual and background biases during contrastive self-supervised learning.
Date: 2021-07-30T19:24:07Z

A Simple and Effective Use of Object-Centric Images for Long-Tailed Object Detection [56.82077636126353]
We take advantage of object-centric images to improve object detection in scene-centric images. We present a simple yet surprisingly effective framework to do so. Our approach can improve the object detection (and instance segmentation) accuracy of rare objects by 50% (and 33%) relatively.
Date: 2021-02-17T17:27:21Z

Self-Supervised Representation Learning from Flow Equivariance [97.13056332559526]
We present a new self-supervised learning representation framework that can be directly deployed on a video stream of complex scenes. Our representations, learned from high-resolution raw video, can be readily used for downstream tasks on static images.
Date: 2021-01-16T23:44:09Z

Finding It at Another Side: A Viewpoint-Adapted Matching Encoder for Change Captioning [41.044241265804125]
We propose a novel visual encoder to explicitly distinguish viewpoint changes from semantic changes in the change captioning task. We also propose a novel reinforcement learning process to fine-tune the attention directly with language evaluation rewards. Our method outperforms the state-of-the-art approaches by a large margin in both Spot-the-Diff and CLEVR-Change datasets.
Date: 2020-09-30T00:13:49Z

Improving Object Detection with Selective Self-supervised Self-training [62.792445237541145]
We study how to leverage Web images to augment human-curated object detection datasets. We retrieve Web images by image-to-image search, which incurs less domain shift from the curated data than other search methods. We propose a novel learning method motivated by two parallel lines of work that explore unlabeled data for image classification.
Date: 2020-07-17T18:05:01Z

COBE: Contextualized Object Embeddings from Narrated Instructional Video [52.73710465010274]
We propose a new framework for learning Contextualized OBject Embeddings from automatically-transcribed narrations of instructional videos. We leverage the semantic and compositional structure of language by training a visual detector to predict a contextualized word embedding of the object and its associated narration. Our experiments show that our detector learns to predict a rich variety of contextual object information, and that it is highly effective in the settings of few-shot and zero-shot learning.
Date: 2020-07-14T19:04:08Z

This list is automatically generated from the titles and abstracts of the papers on this site.