Tasks Integrated Networks: Joint Detection and Retrieval for Image Search
- URL: http://arxiv.org/abs/2009.01438v1
- Date: Thu, 3 Sep 2020 03:57:50 GMT
- Title: Tasks Integrated Networks: Joint Detection and Retrieval for Image Search
- Authors: Lei Zhang and Zhenwei He and Yi Yang and Liang Wang and Xinbo Gao
- Abstract summary: In many real-world search scenarios (e.g., video surveillance), objects are seldom accurately detected or annotated.
We first introduce an end-to-end Integrated Net (I-Net), which has three merits.
We further propose an improved I-Net, called DC-I-Net, which makes two new contributions.
- Score: 99.49021025124405
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The traditional object retrieval task aims to learn a discriminative
feature representation with intra-similarity and inter-dissimilarity, which
assumes that the objects in an image have been accurately pre-cropped, either
manually or automatically. However, in many real-world search scenarios (e.g.,
video surveillance), the objects (e.g., persons, vehicles, etc.) are seldom
accurately detected or annotated. Object-level retrieval therefore becomes
intractable without bounding-box annotations, which leads to a new but
challenging topic, i.e., image-level search. In this paper, to address the
image search issue, we first introduce an end-to-end Integrated Net (I-Net),
which has three merits: 1) a Siamese architecture and an on-line pairing
strategy are designed for similar and dissimilar objects in the given images;
2) a novel on-line pairing (OLP) loss with a dynamic feature dictionary is
introduced, which alleviates the multi-task training stagnation problem by
automatically generating a number of negative pairs to constrain the positives;
3) a hard example priority (HEP) based softmax loss is proposed to improve the
robustness of the classification task by selecting hard categories. Following
the philosophy of divide and conquer, we further propose an improved I-Net,
called DC-I-Net, which makes two new contributions: 1) two modules are tailored
to handle the two tasks separately within the integrated framework, so that
task specialization is guaranteed; 2) a class-center guided HEP loss (C2HEP) is
proposed, which exploits the stored class centers so that intra-similarity and
inter-dissimilarity are captured for the ultimate retrieval. Extensive
experiments on well-known image-level search benchmarks demonstrate that the
proposed DC-I-Net outperforms state-of-the-art tasks-integrated and
tasks-separated image search models.
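
The paper's code is not reproduced on this page, and the abstract describes the losses only at a high level. The following PyTorch sketch is therefore a hypothetical illustration of the two ingredients named above: an on-line pairing loss backed by a dynamic feature dictionary, and a hard-example-priority softmax. All class names, buffer sizes, the margin value, and the hinge/top-k formulations are assumptions for illustration, not the authors' implementation.

```python
import torch
import torch.nn.functional as F

class OnlinePairingLoss(torch.nn.Module):
    """Sketch of an OLP-style loss (hypothetical): a dynamic feature
    dictionary (a FIFO queue of recent embeddings and identity labels)
    supplies negatives, so each positive pair is constrained by many
    automatically generated negative pairs."""

    def __init__(self, feat_dim=256, dict_size=4096, margin=0.5):
        super().__init__()
        self.margin = margin
        self.register_buffer("dict_feats", torch.zeros(dict_size, feat_dim))
        self.register_buffer("dict_labels",
                             torch.full((dict_size,), -1, dtype=torch.long))
        self.ptr = 0

    def forward(self, feats, labels):
        feats = F.normalize(feats, dim=1)
        sims = feats @ self.dict_feats.t()  # (B, K) cosine similarities
        same = labels.unsqueeze(1) == self.dict_labels.unsqueeze(0)
        valid = (self.dict_labels >= 0).unsqueeze(0)
        pos = sims.masked_fill(~(same & valid), float("-inf")).max(dim=1).values
        neg = sims.masked_fill(same | ~valid, float("-inf")).max(dim=1).values
        has_pos = torch.isfinite(pos)
        self._enqueue(feats.detach(), labels)
        if not has_pos.any():  # dictionary still empty of matching identities
            return feats.new_zeros(())
        # Hinge: the best positive should beat the hardest negative by a margin.
        return F.relu(self.margin + neg - pos)[has_pos].mean()

    @torch.no_grad()
    def _enqueue(self, feats, labels):
        n, k = feats.size(0), self.dict_feats.size(0)
        idx = (self.ptr + torch.arange(n, device=feats.device)) % k
        self.dict_feats[idx] = feats
        self.dict_labels[idx] = labels
        self.ptr = (self.ptr + n) % k

def hep_softmax_loss(logits, labels, k=10):
    """Sketch of an HEP-style softmax (hypothetical): cross-entropy
    restricted to the true class plus the k hardest (highest-scoring)
    wrong classes."""
    wrong = logits.scatter(1, labels.unsqueeze(1), float("-inf"))
    hard = wrong.topk(min(k, logits.size(1) - 1), dim=1).indices
    keep = torch.cat([labels.unsqueeze(1), hard], dim=1)  # true class at column 0
    target = torch.zeros(labels.size(0), dtype=torch.long, device=logits.device)
    return F.cross_entropy(logits.gather(1, keep), target)
```

In this reading, the dictionary behaves as a queue of recent embeddings, so every image pair in a batch is contrasted against many stored negatives without re-pairing the batch; the C2HEP variant in DC-I-Net would additionally exploit stored per-class centers rather than raw features.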
Related papers
- Revolutionizing Text-to-Image Retrieval as Autoregressive Token-to-Voken Generation [90.71613903956451]
Text-to-image retrieval is a fundamental task in multimedia processing.
We propose an autoregressive voken generation method, named AVG.
We show that AVG achieves superior results in both effectiveness and efficiency.
arXiv Detail & Related papers (2024-07-24T13:39:51Z)
- TaskCLIP: Extend Large Vision-Language Model for Task Oriented Object Detection [23.73648235283315]
Task-oriented object detection aims to find objects suitable for accomplishing specific tasks.
Recent solutions are mainly all-in-one models.
We propose TaskCLIP, a more natural two-stage design composed of general object detection and task-guided object selection.
arXiv Detail & Related papers (2024-03-12T22:33:02Z)
- Advancing Image Retrieval with Few-Shot Learning and Relevance Feedback [5.770351255180495]
Image Retrieval with Relevance Feedback (IRRF) involves iterative human interaction during the retrieval process.
We propose a new scheme based on a hyper-network, that is tailored to the task and facilitates swift adjustment to user feedback.
We show that our method can attain SoTA results in few-shot one-class classification and reach comparable results in the binary classification task of few-shot open-set recognition.
arXiv Detail & Related papers (2023-12-18T10:20:28Z)
- MatchDet: A Collaborative Framework for Image Matching and Object Detection [33.09209198536698]
We propose a collaborative framework called MatchDet for image matching and object detection.
To achieve the collaborative learning of the two tasks, we propose three novel modules.
We evaluate the approaches on a new benchmark with two datasets called Warp-COCO and miniScanNet.
arXiv Detail & Related papers (2023-12-18T07:11:45Z)
- Class Anchor Margin Loss for Content-Based Image Retrieval [97.81742911657497]
We propose a novel repeller-attractor loss that falls within the metric learning paradigm, yet directly optimizes the L2 metric without the need to generate pairs.
We evaluate the proposed objective in the context of few-shot and full-set training on the CBIR task, by using both convolutional and transformer architectures.
arXiv Detail & Related papers (2023-06-01T12:53:10Z)
- Improving Long-tailed Object Detection with Image-Level Supervision by Multi-Task Collaborative Learning [18.496765732728164]
We propose a novel framework, CLIS, which leverages image-level supervision to enhance the detection ability in a multi-task collaborative way.
CLIS achieves an overall AP of 31.1 with a 10.1-point improvement on tail categories, establishing a new state of the art.
arXiv Detail & Related papers (2022-10-11T16:02:14Z)
- LSEH: Semantically Enhanced Hard Negatives for Cross-modal Information Retrieval [0.4264192013842096]
Visual Semantic Embedding (VSE) aims to extract the semantics of images and their descriptions, and embed them into the same latent space for information retrieval.
Most existing VSE networks are trained with a hard-negatives loss function that learns an objective margin between the similarities of relevant and irrelevant image-description embedding pairs.
This paper presents a novel approach with two main parts: (1) finding the underlying semantics of image descriptions; and (2) proposing a novel semantically enhanced hard-negatives loss function.
arXiv Detail & Related papers (2022-10-10T15:09:39Z)
- Two-stage Visual Cues Enhancement Network for Referring Image Segmentation [89.49412325699537]
Referring Image Segmentation (RIS) aims at segmenting the target object from an image referred to by a given natural language expression.
In this paper, we tackle this problem by devising a Two-stage Visual cues enhancement Network (TV-Net).
Through the two-stage enhancement, our proposed TV-Net achieves better performance in learning fine-grained matching behaviors between the natural language expression and the image.
arXiv Detail & Related papers (2021-10-09T02:53:39Z)
- CoADNet: Collaborative Aggregation-and-Distribution Networks for Co-Salient Object Detection [91.91911418421086]
Co-Salient Object Detection (CoSOD) aims at discovering salient objects that repeatedly appear in a given query group containing two or more relevant images.
One challenging issue is how to effectively capture co-saliency cues by modeling and exploiting inter-image relationships.
We present an end-to-end collaborative aggregation-and-distribution network (CoADNet) to capture both salient and repetitive visual patterns from multiple images.
arXiv Detail & Related papers (2020-11-10T04:28:11Z)
- Object-Centric Image Generation from Layouts [93.10217725729468]
We develop a layout-to-image-generation method to generate complex scenes with multiple objects.
Our method learns representations of the spatial relationships between objects in the scene, which lead to our model's improved layout-fidelity.
We introduce SceneFID, an object-centric adaptation of the popular Fréchet Inception Distance metric that is better suited for multi-object images.
arXiv Detail & Related papers (2020-03-16T21:40:09Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.