DetermiNet: A Large-Scale Diagnostic Dataset for Complex
Visually-Grounded Referencing using Determiners
- URL: http://arxiv.org/abs/2309.03483v1
- Date: Thu, 7 Sep 2023 05:13:52 GMT
- Title: DetermiNet: A Large-Scale Diagnostic Dataset for Complex
Visually-Grounded Referencing using Determiners
- Authors: Clarence Lee, M Ganesh Kumar, Cheston Tan
- Abstract summary: DetermiNet dataset comprises 250,000 synthetically generated images and captions based on 25 determiners.
The task is to predict bounding boxes to identify objects of interest, constrained by the semantics of the given determiner.
We find that current state-of-the-art visual grounding models do not perform well on the dataset.
- Score: 5.256237513030104
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: State-of-the-art visual grounding models can achieve high detection accuracy,
but they are not designed to distinguish between all objects versus only
certain objects of interest. In natural language, in order to specify a
particular object or set of objects of interest, humans use determiners such as
"my", "either" and "those". Determiners, as an important word class, are a type
of schema in natural language about the reference or quantity of the noun.
Existing grounded referencing datasets place much less emphasis on determiners,
compared to other word classes such as nouns, verbs and adjectives. This makes
it difficult to develop models that understand the full variety and complexity
of object referencing. Thus, we have developed and released the DetermiNet
dataset, which comprises 250,000 synthetically generated images and captions
based on 25 determiners. The task is to predict bounding boxes to identify
objects of interest, constrained by the semantics of the given determiner. We
find that current state-of-the-art visual grounding models do not perform well
on the dataset, highlighting the limitations of existing models on reference
and quantification tasks.
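
The task format can be made concrete with a small sketch. The snippet below is a minimal illustrative sketch, not the official DetermiNet evaluation code: the box representation, the greedy IoU matching, the 0.5 threshold, and the example coordinates are all assumptions. It shows an "all and only" reading of determiner-constrained grounding, i.e. why a caption like "both apples" demands a different prediction set than "an apple".

from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2), an assumed format

def iou(a: Box, b: Box) -> float:
    # Intersection-over-union of two axis-aligned boxes.
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def correct(pred: List[Box], gold: List[Box], thr: float = 0.5) -> bool:
    # "All and only" criterion: every referred object is covered and no
    # extra box is predicted; this is the set-level constraint that
    # determiner semantics impose on top of ordinary grounding.
    if len(pred) != len(gold):
        return False
    unused = list(range(len(pred)))
    for g in gold:
        best = max(unused, key=lambda i: iou(pred[i], g))
        if iou(pred[best], g) < thr:
            return False
        unused.remove(best)
    return True

# "both apples" must select exactly the two apple boxes;
# "an apple" is satisfied by a single one of them.
apples = [(10.0, 10.0, 50.0, 50.0), (60.0, 10.0, 100.0, 50.0)]
print(correct([(12.0, 11.0, 49.0, 52.0), (61.0, 9.0, 99.0, 51.0)], apples))  # True
print(correct([(12.0, 11.0, 49.0, 52.0)], apples))                           # False

Under this reading, quantifying determiners change what counts as a correct set of boxes rather than what counts as a correct box, which matches the reference and quantification failures the abstract attributes to current grounding models.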
Related papers
- ResVG: Enhancing Relation and Semantic Understanding in Multiple Instances for Visual Grounding [42.10086029931937]
Visual grounding aims to localize the object referred to in an image based on a natural language query.
Existing methods demonstrate a significant performance drop when there are multiple distractions in an image.
We propose a novel approach, the Relation and Semantic-sensitive Visual Grounding (ResVG) model, to address this issue.
arXiv Detail & Related papers (2024-08-29T07:32:01Z) - 1st Place Solution for MOSE Track in CVPR 2024 PVUW Workshop: Complex Video Object Segmentation [72.54357831350762]
We propose a semantic embedding video object segmentation model and use the salient features of objects as query representations.
We trained our model on a large-scale video object segmentation dataset.
Our model achieves first place (84.45%) on the test set of the Complex Video Object Segmentation challenge.
arXiv Detail & Related papers (2024-06-07T03:13:46Z) - Generative Region-Language Pretraining for Open-Ended Object Detection [55.42484781608621]
We propose a framework named GenerateU, which can detect dense objects and generate their names in a free-form way.
Our framework achieves comparable results to the open-vocabulary object detection method GLIP.
arXiv Detail & Related papers (2024-03-15T10:52:39Z) - Exploiting Unlabeled Data with Vision and Language Models for Object
Detection [64.94365501586118]
Building robust and generic object detection frameworks requires scaling to larger label spaces and bigger training datasets.
We propose a novel method that leverages the rich semantics available in recent vision and language models to localize and classify objects in unlabeled images.
We demonstrate the value of the generated pseudo labels in two specific tasks, open-vocabulary detection and semi-supervised object detection.
arXiv Detail & Related papers (2022-07-18T21:47:15Z) - Automatic dataset generation for specific object detection [6.346581421948067]
We present a method to synthesize object-in-scene images, which can preserve the objects' detailed features without bringing irrelevant information.
Our result shows that in the synthesized image, the boundaries of objects blend very well with the background.
arXiv Detail & Related papers (2022-07-16T07:44:33Z) - Contrastive Object Detection Using Knowledge Graph Embeddings [72.17159795485915]
We compare the error statistics of the class embeddings learned from a one-hot approach with semantically structured embeddings from natural language processing or knowledge graphs.
We propose a knowledge-embedded design for keypoint-based and transformer-based object detection architectures.
arXiv Detail & Related papers (2021-12-21T17:10:21Z) - Understanding Synonymous Referring Expressions via Contrastive Features [105.36814858748285]
We develop an end-to-end trainable framework to learn contrastive features on the image and object instance levels.
We conduct extensive experiments to evaluate the proposed algorithm on several benchmark datasets.
arXiv Detail & Related papers (2021-04-20T17:56:24Z) - Synthesizing the Unseen for Zero-shot Object Detection [72.38031440014463]
We propose to synthesize visual features for unseen classes, so that the model learns both seen and unseen objects in the visual domain.
We use a novel generative model that uses class-semantics to not only generate the features but also to discriminatively separate them.
arXiv Detail & Related papers (2020-10-19T12:36:11Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of the information provided and is not responsible for any consequences of its use.