DIME: An Online Tool for the Visual Comparison of Cross-Modal Retrieval
Models
- URL: http://arxiv.org/abs/2010.09641v1
- Date: Mon, 19 Oct 2020 16:35:30 GMT
- Title: DIME: An Online Tool for the Visual Comparison of Cross-Modal Retrieval
Models
- Authors: Tony Zhao, Jaeyoung Choi, Gerald Friedland
- Abstract summary: Cross-modal retrieval relies on accurate models to retrieve relevant results for queries across modalities such as image, text, and video.
We present DIME, a modality-agnostic tool that handles multimodal datasets, trained models, and data preprocessors.
- Score: 5.725477071353354
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Cross-modal retrieval relies on accurate models to retrieve relevant results
for queries across modalities such as image, text, and video. In this paper, we
build upon previous work by tackling the difficulty of evaluating models both
quantitatively and qualitatively quickly. We present DIME (Dataset, Index,
Model, Embedding), a modality-agnostic tool that handles multimodal datasets,
trained models, and data preprocessors to support straightforward model
comparison with a web browser graphical user interface. DIME inherently
supports building modality-agnostic queryable indexes and extraction of
relevant feature embeddings, and thus effectively doubles as an efficient
cross-modal tool to explore and search through datasets.
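DIME's source code is not reproduced here; purely as a minimal sketch of the Dataset-Index-Model-Embedding idea, the snippet below builds a queryable index over precomputed embeddings and answers nearest-neighbour queries with cosine similarity. The class name, the embedding models referenced in the comments, and the usage lines are hypothetical placeholders, not DIME's actual API.
```python
import numpy as np

class EmbeddingIndex:
    """Toy modality-agnostic index: stores L2-normalised embeddings from
    any modality and answers nearest-neighbour queries against them."""

    def __init__(self, dim):
        self.dim = dim
        self.vectors = np.empty((0, dim), dtype=np.float32)
        self.items = []  # arbitrary payloads: image paths, captions, video ids, ...

    def add(self, embeddings, items):
        emb = np.asarray(embeddings, dtype=np.float32)
        emb = emb / (np.linalg.norm(emb, axis=1, keepdims=True) + 1e-12)
        self.vectors = np.vstack([self.vectors, emb])
        self.items.extend(items)

    def query(self, embedding, k=5):
        q = np.asarray(embedding, dtype=np.float32)
        q = q / (np.linalg.norm(q) + 1e-12)
        scores = self.vectors @ q              # cosine similarity to every item
        top = np.argsort(-scores)[:k]
        return [(self.items[i], float(scores[i])) for i in top]

# Hypothetical usage: index image embeddings, then query with a text
# embedding from any model that maps both modalities into the same space.
# index = EmbeddingIndex(dim=512)
# index.add(image_model(images), image_paths)
# results = index.query(text_model("a dog on a beach"), k=10)
```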
Related papers
- Data-Juicer Sandbox: A Comprehensive Suite for Multimodal Data-Model Co-development [67.55944651679864]
We present a novel sandbox suite tailored for integrated data-model co-development.
This sandbox provides a comprehensive experimental platform, enabling rapid iteration and insight-driven refinement of both data and models.
We also uncover fruitful insights gleaned from exhaustive benchmarks, shedding light on the critical interplay between data quality, diversity, and model behavior.
arXiv Detail & Related papers (2024-07-16T14:40:07Z)
- AttributionScanner: A Visual Analytics System for Model Validation with Metadata-Free Slice Finding [29.07617945233152]
Data slice finding is an emerging technique for validating machine learning (ML) models by identifying and analyzing subgroups in a dataset that exhibit poor performance.
This approach faces significant challenges, including the laborious and costly requirement for additional metadata.
We introduce AttributionScanner, an innovative human-in-the-loop Visual Analytics (VA) system, designed for metadata-free data slice finding.
Our system identifies interpretable data slices that involve common model behaviors and visualizes these patterns through an Attribution Mosaic design.
arXiv Detail & Related papers (2024-01-12T09:17:32Z)
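AttributionScanner itself is an interactive visual-analytics system; purely as a rough, hypothetical approximation of metadata-free slice finding, the sketch below clusters examples by their feature embeddings and ranks the resulting clusters by error rate. All variable names are assumptions, not the paper's implementation.
```python
import numpy as np
from sklearn.cluster import KMeans

def find_slices(embeddings, predictions, labels, n_slices=20):
    """Group examples by embedding-space clusters (candidate 'slices')
    and surface the clusters where the model performs worst."""
    clusters = KMeans(n_clusters=n_slices, n_init=10).fit_predict(embeddings)
    report = []
    for c in range(n_slices):
        mask = clusters == c
        error_rate = float(np.mean(predictions[mask] != labels[mask]))
        report.append({"slice": c, "size": int(mask.sum()), "error_rate": error_rate})
    return sorted(report, key=lambda r: -r["error_rate"])  # worst slices first
```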
- Align before Search: Aligning Ads Image to Text for Accurate Cross-Modal Sponsored Search [27.42717207107]
Cross-Modal sponsored search displays multi-modal advertisements (ads) when consumers look for desired products by natural language queries in search engines.
The ability to align ads-specific information in both images and texts is crucial for accurate and flexible sponsored search.
We propose a simple alignment network for explicitly mapping fine-grained visual parts in ads images to the corresponding text.
arXiv Detail & Related papers (2023-09-28T03:43:57Z)
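The paper's alignment network is not shown here; as a generic, assumed illustration of mapping fine-grained visual parts to the corresponding text, the snippet below computes a soft alignment between patch embeddings and token embeddings that already share a dimension.
```python
import numpy as np

def soft_align(patch_emb, token_emb, temperature=0.07):
    """Soft alignment of image patches to text tokens.

    patch_emb: (P, d) embeddings of visual parts
    token_emb: (T, d) embeddings of text tokens
    Returns a (P, T) matrix whose row p is a distribution over tokens."""
    p = patch_emb / (np.linalg.norm(patch_emb, axis=1, keepdims=True) + 1e-12)
    t = token_emb / (np.linalg.norm(token_emb, axis=1, keepdims=True) + 1e-12)
    logits = (p @ t.T) / temperature                # scaled cosine similarities
    logits = logits - logits.max(axis=1, keepdims=True)
    weights = np.exp(logits)
    return weights / weights.sum(axis=1, keepdims=True)  # row-wise softmax
```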
- StableLLaVA: Enhanced Visual Instruction Tuning with Synthesized Image-Dialogue Data [129.92449761766025]
We propose a novel data collection methodology that synchronously synthesizes images and dialogues for visual instruction tuning.
This approach harnesses the power of generative models, marrying the abilities of ChatGPT and text-to-image generative models.
Our research includes comprehensive experiments conducted on various datasets.
arXiv Detail & Related papers (2023-08-20T12:43:52Z)
- Dissecting Multimodality in VideoQA Transformer Models by Impairing Modality Fusion [54.33764537135906]
VideoQA Transformer models demonstrate competitive performance on standard benchmarks.
Do these models capture the rich multimodal structures and dynamics from video and text jointly?
Are they achieving high scores by exploiting biases and spurious features?
arXiv Detail & Related papers (2023-06-15T06:45:46Z)
- Zero-shot Composed Text-Image Retrieval [72.43790281036584]
We consider the problem of composed image retrieval (CIR).
It aims to train a model that can fuse multi-modal information, e.g., text and images, to accurately retrieve images that match the query, extending the user's ability to express what they are looking for.
arXiv Detail & Related papers (2023-06-12T17:56:01Z)
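Zero-shot CIR systems differ in how they fuse the query; one simple baseline, sketched here under the assumption of a shared CLIP-style embedding space (not necessarily what the paper does), combines the reference-image embedding with the modification-text embedding and ranks the gallery by cosine similarity.
```python
import numpy as np

def composed_query(image_emb, text_emb, alpha=0.5):
    """Fuse a reference-image embedding and a modification-text embedding
    into one retrieval query via a weighted sum."""
    q = alpha * image_emb + (1.0 - alpha) * text_emb
    return q / (np.linalg.norm(q) + 1e-12)

def retrieve(query, gallery_emb, k=10):
    """Rank gallery images by cosine similarity to the composed query."""
    g = gallery_emb / (np.linalg.norm(gallery_emb, axis=1, keepdims=True) + 1e-12)
    return np.argsort(-(g @ query))[:k]
```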
- Named Entity and Relation Extraction with Multi-Modal Retrieval [51.660650522630526]
Multi-modal named entity recognition (NER) and relation extraction (RE) aim to leverage relevant image information to improve the performance of NER and RE.
We propose a novel Multi-modal Retrieval based framework (MoRe)
MoRe contains a text retrieval module and an image-based retrieval module, which retrieve related knowledge of the input text and image in the knowledge corpus respectively.
arXiv Detail & Related papers (2022-12-03T13:11:32Z)
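MoRe's retrievers are not reproduced here; the sketch below is a hypothetical stand-in that fetches related entries from two knowledge corpora, one matched against the text embedding and one against the image embedding, and appends them as extra context for a downstream tagger.
```python
import numpy as np

def nearest_docs(query_emb, corpus_emb, corpus_docs, k=3):
    """Return the k corpus entries whose embeddings are closest to the query."""
    c = corpus_emb / (np.linalg.norm(corpus_emb, axis=1, keepdims=True) + 1e-12)
    q = query_emb / (np.linalg.norm(query_emb) + 1e-12)
    return [corpus_docs[i] for i in np.argsort(-(c @ q))[:k]]

def augmented_input(tokens, text_emb, image_emb, text_corpus, image_corpus):
    """Append knowledge retrieved for the sentence and for the image to the
    input tokens; text_corpus and image_corpus are (embeddings, docs) pairs."""
    text_knowledge = nearest_docs(text_emb, *text_corpus)
    image_knowledge = nearest_docs(image_emb, *image_corpus)
    return tokens + ["[SEP]"] + text_knowledge + image_knowledge
```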
- Visualising Deep Network's Time-Series Representations [93.73198973454944]
Despite the popularisation of machine learning models, more often than not they still operate as black boxes with no insight into what is happening inside the model.
In this paper, a method that addresses that issue is proposed, with a focus on visualising multi-dimensional time-series data.
Experiments on a high-frequency stock market dataset show that the method provides fast and discernible visualisations.
arXiv Detail & Related papers (2021-03-12T09:53:34Z)
- Does my multimodal model learn cross-modal interactions? It's harder to tell than you might think! [26.215781778606168]
Cross-modal modeling seems crucial in multimodal tasks, such as visual question answering.
We propose a new diagnostic tool, empirical multimodally-additive function projection (EMAP), for isolating whether or not cross-modal interactions improve performance for a given model on a given task.
For seven image+text classification tasks (on each of which we set new state-of-the-art benchmarks), we find that, in many cases, removing cross-modal interactions results in little to no performance degradation.
arXiv Detail & Related papers (2020-10-13T17:45:28Z)
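From the abstract, EMAP replaces a model's prediction for each (text, image) pair with its best purely additive approximation, computed by averaging the model's outputs over swapped partners; the sketch below is our reading of that projection, assuming the model's logits have been precomputed for every text-image combination.
```python
import numpy as np

def emap_logits(pairwise_logits):
    """Empirical multimodally-additive function projection (EMAP), roughly.

    pairwise_logits: (N, N, C) array where entry [i, j] holds the model's
    class logits for text i paired with image j; the diagonal [i, i]
    corresponds to the real dataset pairs.
    Returns an (N, C) array of projected logits for the real pairs."""
    text_effect = pairwise_logits.mean(axis=1)       # (N, C): average over images
    image_effect = pairwise_logits.mean(axis=0)      # (N, C): average over texts
    grand_mean = pairwise_logits.mean(axis=(0, 1))   # (C,): overall average
    return text_effect + image_effect - grand_mean   # additive approximation

# Comparing accuracy under these projected logits with the model's original
# diagonal predictions indicates how much cross-modal interaction contributes.
```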
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.