Scene-centric vs. Object-centric Image-Text Cross-modal Retrieval: A
Reproducibility Study
- URL: http://arxiv.org/abs/2301.05174v2
- Date: Tue, 10 Oct 2023 22:58:45 GMT
- Title: Scene-centric vs. Object-centric Image-Text Cross-modal Retrieval: A
Reproducibility Study
- Authors: Mariya Hendriksen, Svitlana Vakulenko, Ernst Kuiper, Maarten de Rijke
- Abstract summary: Cross-modal retrieval (CMR) approaches usually focus on either object-centric or scene-centric datasets.
This paper studies the reproducibility of state-of-the-art CMR results and their generalizability across the two dataset types.
We select two state-of-the-art CMR models with different architectures, CLIP and X-VLM.
We determine the relative performance of the selected models on two scene-centric and three object-centric datasets.
- Score: 55.964387734180114
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Most approaches to cross-modal retrieval (CMR) focus either on object-centric
datasets, meaning that each document depicts or describes a single object, or
on scene-centric datasets, meaning that each image depicts or describes a
complex scene that involves multiple objects and relations between them. We
posit that a robust CMR model should generalize well across both dataset types.
Despite recent advances in CMR, the reproducibility of the results and their
generalizability across different dataset types have not been studied before. We
address this gap and focus on the reproducibility of the state-of-the-art CMR
results when evaluated on object-centric and scene-centric datasets. We select
two state-of-the-art CMR models with different architectures: (i) CLIP; and
(ii) X-VLM. Additionally, we select two scene-centric datasets, and three
object-centric datasets, and determine the relative performance of the selected
models on these datasets. We focus on reproducibility, replicability, and
generalizability of the outcomes of previously published CMR experiments. We
discover that the experiments are not fully reproducible and replicable.
Moreover, the relative performance results partially generalize across
object-centric and scene-centric datasets. Furthermore, the scores obtained
on object-centric datasets are much lower than the scores obtained on
scene-centric datasets. For reproducibility and transparency, we make our source
code and the trained models publicly available.
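As a concrete illustration of how such an evaluation is typically scored, the sketch below runs zero-shot text-to-image retrieval with an open-source CLIP checkpoint and computes Recall@K, the standard CMR metric; the checkpoint name, toy data, and exact protocol are illustrative assumptions, not the authors' setup.

```python
# Minimal sketch of zero-shot text-to-image retrieval with CLIP, scored by
# Recall@K over cosine similarities. Checkpoint and toy data are assumptions.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
model.eval()

# Hypothetical evaluation set: captions[i] describes images[i].
images = [Image.open(p) for p in ["img_0.jpg", "img_1.jpg", "img_2.jpg"]]
captions = ["a dog on a beach", "a red car", "a bowl of fruit"]

with torch.no_grad():
    inputs = processor(text=captions, images=images, return_tensors="pt", padding=True)
    outputs = model(**inputs)
    img_emb = outputs.image_embeds / outputs.image_embeds.norm(dim=-1, keepdim=True)
    txt_emb = outputs.text_embeds / outputs.text_embeds.norm(dim=-1, keepdim=True)
    sims = txt_emb @ img_emb.T  # (num_captions, num_images) cosine similarities

def recall_at_k(sims: torch.Tensor, k: int) -> float:
    """Fraction of queries whose ground-truth item is ranked in the top k."""
    ranks = sims.argsort(dim=-1, descending=True)
    targets = torch.arange(sims.size(0)).unsqueeze(-1)
    return (ranks[:, :k] == targets).any(dim=-1).float().mean().item()

print({f"R@{k}": recall_at_k(sims, k) for k in (1, 5, 10)})
```

Image-to-text retrieval is scored the same way, using the transposed similarity matrix.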
Related papers
- FineCops-Ref: A new Dataset and Task for Fine-Grained Compositional Referring Expression Comprehension [10.482908189805872]
Referring Expression Comprehension (REC) is a crucial cross-modal task that objectively evaluates language understanding, image comprehension, and language-to-image grounding capabilities.
We have established a new REC dataset characterized by two key features.
It includes negative text and images created through fine-grained editing and generation based on existing data.
arXiv Detail & Related papers (2024-09-23T06:56:51Z)
- dacl1k: Real-World Bridge Damage Dataset Putting Open-Source Data to the Test [0.6827423171182154]
"dacl1k" is a multi-label RCD dataset for multi-label classification based on building inspections including 1,474 images.
We trained the models on different combinations of open-source data (meta datasets) which were subsequently evaluated both extrinsically and intrinsically.
The performance analysis on dacl1k shows practical usability of the meta data, where the best model shows an Exact Match Ratio of 32%.
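For context on the reported metric, the Exact Match Ratio counts a multi-label prediction as correct only if every label in the vector is predicted exactly; the snippet below is a minimal sketch with made-up labels, not the dacl1k evaluation code.

```python
import numpy as np

def exact_match_ratio(y_true: np.ndarray, y_pred: np.ndarray) -> float:
    """Fraction of samples whose full multi-label vector is predicted exactly."""
    return float(np.all(y_true == y_pred, axis=1).mean())

# Toy example: 3 samples, 4 damage classes (binary indicator per class).
y_true = np.array([[1, 0, 1, 0],
                   [0, 1, 0, 0],
                   [1, 1, 0, 1]])
y_pred = np.array([[1, 0, 1, 0],   # exact match
                   [0, 1, 1, 0],   # one wrong label -> no credit
                   [1, 1, 0, 1]])  # exact match
print(exact_match_ratio(y_true, y_pred))  # 2/3 ~= 0.667
```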
arXiv Detail & Related papers (2023-09-07T15:05:35Z)
- OCTScenes: A Versatile Real-World Dataset of Tabletop Scenes for Object-Centric Learning [41.09407455527254]
We propose a versatile real-world dataset of tabletop scenes for object-centric learning called OCTScenes.
OCTScenes contains 5000 tabletop scenes with a total of 15 objects.
It is meticulously designed to serve as a benchmark for comparing, evaluating, and analyzing object-centric learning methods.
arXiv Detail & Related papers (2023-06-16T08:26:57Z)
- MMRDN: Consistent Representation for Multi-View Manipulation Relationship Detection in Object-Stacked Scenes [62.20046129613934]
We propose a novel multi-view fusion framework, namely the multi-view MRD network (MMRDN).
We project the 2D data from different views into a common hidden space and fit the embeddings with a set of von Mises-Fisher distributions.
We select a set of $K$ Maximum Vertical Neighbors (KMVN) points from the point cloud of each object pair, which encodes the relative position of these two objects.
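For readers unfamiliar with the distribution mentioned above, the following is a generic sketch of the von Mises-Fisher log-density on the unit sphere; it only illustrates what fitting such a distribution to unit-norm embeddings involves and is not the MMRDN implementation.

```python
import numpy as np
from scipy.special import ive

def vmf_log_density(x: np.ndarray, mu: np.ndarray, kappa: float) -> np.ndarray:
    """Log-density of a von Mises-Fisher distribution on the (d-1)-sphere.

    x: (n, d) unit vectors, mu: (d,) unit mean direction, kappa: concentration > 0.
    """
    d = mu.shape[0]
    nu = d / 2.0 - 1.0
    # log I_nu(kappa) computed stably via the exponentially scaled Bessel function.
    log_bessel = np.log(ive(nu, kappa)) + kappa
    log_norm = nu * np.log(kappa) - (d / 2.0) * np.log(2.0 * np.pi) - log_bessel
    return log_norm + kappa * (x @ mu)

# Toy check on the 2-sphere (d = 3) with unit-norm embeddings.
rng = np.random.default_rng(0)
x = rng.normal(size=(5, 3))
x /= np.linalg.norm(x, axis=1, keepdims=True)
mu = np.array([0.0, 0.0, 1.0])
print(vmf_log_density(x, mu, kappa=10.0))
```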
arXiv Detail & Related papers (2023-04-25T05:55:29Z)
- Revisiting the Evaluation of Image Synthesis with GANs [55.72247435112475]
This study presents an empirical investigation into the evaluation of synthesis performance, with generative adversarial networks (GANs) as a representative of generative models.
In particular, we make in-depth analyses of various factors, including how to represent a data point in the representation space, how to calculate a fair distance using selected samples, and how many instances to use from each set.
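The factors listed above (feature space, distance computation, sample size) are exactly the ingredients of Fréchet-style metrics such as FID; the following is a generic sketch of the Fréchet distance between two feature sets, assuming the features come from some fixed encoder, and is not the study's evaluation code.

```python
import numpy as np
from scipy.linalg import sqrtm

def frechet_distance(feats_a: np.ndarray, feats_b: np.ndarray) -> float:
    """Fréchet distance between two sets of feature vectors (as used by FID).

    feats_a, feats_b: (n, d) features of real/generated samples from some encoder.
    """
    mu_a, mu_b = feats_a.mean(axis=0), feats_b.mean(axis=0)
    cov_a = np.cov(feats_a, rowvar=False)
    cov_b = np.cov(feats_b, rowvar=False)
    covmean = sqrtm(cov_a @ cov_b)
    if np.iscomplexobj(covmean):  # numerical noise can create tiny imaginary parts
        covmean = covmean.real
    diff = mu_a - mu_b
    return float(diff @ diff + np.trace(cov_a + cov_b - 2.0 * covmean))

# Toy usage with random "features"; in practice these would come from an encoder.
rng = np.random.default_rng(0)
print(frechet_distance(rng.normal(size=(500, 16)), rng.normal(size=(500, 16))))
```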
arXiv Detail & Related papers (2023-04-04T17:54:32Z)
- Mitigating Representation Bias in Action Recognition: Algorithms and Benchmarks [76.35271072704384]
Deep learning models perform poorly when applied to videos with rare scenes or objects.
We tackle this problem from two different angles: algorithm and dataset.
We show that the debiased representation can generalize better when transferred to other datasets and tasks.
arXiv Detail & Related papers (2022-09-20T00:30:35Z)
- Learning from Temporal Spatial Cubism for Cross-Dataset Skeleton-based Action Recognition [88.34182299496074]
Action labels are available only on a source dataset but unavailable on a target dataset during the training stage.
We utilize a self-supervision scheme to reduce the domain shift between two skeleton-based action datasets.
By segmenting and permuting temporal segments or human body parts, we design two self-supervised learning classification tasks.
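Such segment-permutation pretext tasks are typically set up as small classification problems over the applied permutation; the sketch below illustrates one way to build a permuted clip and its pretext label, with shapes and helper names chosen for illustration rather than taken from the paper.

```python
import itertools
import numpy as np

# Enumerate all orderings of the temporal segments; the pretext label is the
# index of the permutation that was applied (a simple classification target).
NUM_SEGMENTS = 3
PERMUTATIONS = list(itertools.permutations(range(NUM_SEGMENTS)))

def make_permutation_sample(clip: np.ndarray, rng: np.random.Generator):
    """Split a skeleton clip (frames, joints, coords) into segments, shuffle them,
    and return the permuted clip plus the permutation index as the pretext label."""
    segments = np.array_split(clip, NUM_SEGMENTS, axis=0)
    label = rng.integers(len(PERMUTATIONS))
    permuted = np.concatenate([segments[i] for i in PERMUTATIONS[label]], axis=0)
    return permuted, label

rng = np.random.default_rng(0)
clip = rng.normal(size=(30, 25, 3))  # 30 frames, 25 joints, 3D coordinates
permuted_clip, pretext_label = make_permutation_sample(clip, rng)
print(permuted_clip.shape, pretext_label)
```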
arXiv Detail & Related papers (2022-07-17T07:05:39Z)
- Salient Objects in Clutter [130.63976772770368]
This paper identifies and addresses a serious design bias of existing salient object detection (SOD) datasets.
This design bias has led to a saturation in performance for state-of-the-art SOD models when evaluated on existing datasets.
We propose a new high-quality dataset and update the previous saliency benchmark.
arXiv Detail & Related papers (2021-05-07T03:49:26Z)