Bag of Tricks for Multimodal AutoML with Image, Text, and Tabular Data
- URL: http://arxiv.org/abs/2412.16243v1
- Date: Thu, 19 Dec 2024 20:52:10 GMT
- Title: Bag of Tricks for Multimodal AutoML with Image, Text, and Tabular Data
- Authors: Zhiqiang Tang, Zihan Zhong, Tong He, Gerald Friedland
- Abstract summary: This paper studies the best practices for automatic machine learning (AutoML).
We curate a benchmark comprising 22 multimodal datasets from diverse real-world applications.
Through extensive experimentation and analysis, we distill a collection of effective strategies and consolidate them into a unified pipeline.
- Score: 9.325441307607225
- Abstract: This paper studies the best practices for automatic machine learning (AutoML). While previous AutoML efforts have predominantly focused on unimodal data, the multimodal aspect remains under-explored. Our study delves into classification and regression problems involving flexible combinations of image, text, and tabular data. We curate a benchmark comprising 22 multimodal datasets from diverse real-world applications, encompassing all 4 combinations of the 3 modalities. Across this benchmark, we scrutinize design choices related to multimodal fusion strategies, multimodal data augmentation, converting tabular data into text, cross-modal alignment, and handling missing modalities. Through extensive experimentation and analysis, we distill a collection of effective strategies and consolidate them into a unified pipeline, achieving robust performance on diverse datasets.
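The abstract names the design axes studied (fusion strategy, data augmentation, tabular-to-text conversion, cross-modal alignment, missing-modality handling) without detailing them. As a rough, non-authoritative sketch of two of these ideas, the snippet below serializes a tabular row into text and performs simple late fusion over per-modality features, with a learned placeholder for a missing image; all module names, dimensions, and column names are illustrative assumptions, not the paper's actual pipeline.

```python
import torch
import torch.nn as nn

def tabular_to_text(row: dict) -> str:
    """One 'trick' from the tabular-to-text family: serialize a tabular row into a
    plain string so it can be fed through the text encoder."""
    return " ; ".join(f"{col} is {val}" for col, val in row.items())

class LateFusionHead(nn.Module):
    """Late fusion: encode each modality separately, concatenate, then classify.
    Real systems would use pretrained image/text backbones; here the per-modality
    features are assumed to be precomputed."""
    def __init__(self, img_dim=512, txt_dim=768, tab_in=10, tab_dim=64, num_classes=2):
        super().__init__()
        self.missing_img = nn.Parameter(torch.zeros(img_dim))  # learned fill-in for a missing image
        self.tab_proj = nn.Linear(tab_in, tab_dim)              # toy encoder for numeric tabular features
        self.head = nn.Sequential(
            nn.Linear(img_dim + txt_dim + tab_dim, 256),
            nn.ReLU(),
            nn.Linear(256, num_classes),
        )

    def forward(self, img_feat, txt_feat, tab_feat):
        if img_feat is None:  # simple missing-modality handling
            img_feat = self.missing_img.expand(txt_feat.size(0), -1)
        fused = torch.cat([img_feat, txt_feat, self.tab_proj(tab_feat)], dim=-1)
        return self.head(fused)

# Toy usage with random tensors standing in for encoder outputs.
model = LateFusionHead()
logits = model(None, torch.randn(4, 768), torch.randn(4, 10))
print(tabular_to_text({"price": 12.5, "brand": "acme"}), logits.shape)
```

Late fusion is only one of the choices such a study compares; early or intermediate fusion would merge the modalities before the classification head instead.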
Related papers
- mmE5: Improving Multimodal Multilingual Embeddings via High-quality Synthetic Data [71.352883755806]
Multimodal embedding models have gained significant attention for their ability to map data from different modalities, such as text and images, into a unified representation space.
However, the limited labeled multimodal data often hinders embedding performance.
Recent approaches have leveraged data synthesis to address this problem, yet the quality of synthetic data remains a critical bottleneck.
arXiv Detail & Related papers (2025-02-12T15:03:33Z)
- MRAMG-Bench: A BeyondText Benchmark for Multimodal Retrieval-Augmented Multimodal Generation [19.745059794932807]
We introduce the Multimodal Retrieval-Augmented Multimodal Generation (MRAMG) task.
This task aims to generate answers that combine both text and images, fully leveraging the multimodal data within a corpus.
We introduce the MRAMG-Bench, which incorporates a comprehensive suite of both statistical and LLM-based metrics.
arXiv Detail & Related papers (2025-02-06T16:07:24Z)
- FedMLLM: Federated Fine-tuning MLLM on Multimodal Heterogeneity Data [64.50893177169996]
Fine-tuning Multimodal Large Language Models (MLLMs) with Federated Learning (FL) allows for expanding the training data scope by including private data sources.
We introduce a benchmark for evaluating various downstream tasks in the federated fine-tuning of MLLMs within multimodal heterogeneous scenarios.
We develop a general FedMLLM framework that integrates four representative FL methods alongside two modality-agnostic strategies.
arXiv Detail & Related papers (2024-11-22T04:09:23Z)
- Multimodality Helps Few-Shot 3D Point Cloud Semantic Segmentation [61.91492500828508]
Few-shot 3D point cloud segmentation (FS-PCS) aims at generalizing models to segment novel categories with minimal support samples.
We introduce a cost-free multimodal FS-PCS setup, utilizing textual labels and the potentially available 2D image modality.
We propose a simple yet effective Test-time Adaptive Cross-modal Seg (TACC) technique to mitigate training bias.
arXiv Detail & Related papers (2024-10-29T19:28:41Z)
- Align and Attend: Multimodal Summarization with Dual Contrastive Losses [57.83012574678091]
The goal of multimodal summarization is to extract the most important information from different modalities to form output summaries.
Existing methods fail to leverage the temporal correspondence between different modalities and ignore the intrinsic correlation between different samples.
We introduce Align and Attend Multimodal Summarization (A2Summ), a unified multimodal transformer-based model that can effectively align and attend to the multimodal input.
arXiv Detail & Related papers (2023-03-13T17:01:42Z)
- Few-shot Multimodal Sentiment Analysis based on Multimodal Probabilistic Fusion Prompts [30.15646658460899]
Multimodal sentiment analysis has gained significant attention due to the proliferation of multimodal content on social media.
Existing studies in this area rely heavily on large-scale supervised data, which is time-consuming and labor-intensive to collect.
We propose a novel method called Multimodal Probabilistic Fusion Prompts (MultiPoint) that leverages diverse cues from different modalities for multimodal sentiment detection in the few-shot scenario.
arXiv Detail & Related papers (2022-11-12T08:10:35Z)
- Hybrid Transformer with Multi-level Fusion for Multimodal Knowledge Graph Completion [112.27103169303184]
Multimodal Knowledge Graphs (MKGs) organize visual-text factual knowledge.
The proposed hybrid transformer, MKGformer, obtains SOTA performance on four datasets covering multimodal link prediction, multimodal RE, and multimodal NER.
arXiv Detail & Related papers (2022-05-04T23:40:04Z)
- CLMLF: A Contrastive Learning and Multi-Layer Fusion Method for Multimodal Sentiment Detection [24.243349217940274]
We propose a Contrastive Learning and Multi-Layer Fusion (CLMLF) method for multimodal sentiment detection.
Specifically, we first encode text and image to obtain hidden representations, and then use a multi-layer fusion module to align and fuse the token-level features of text and image.
In addition to the sentiment analysis task, we also design two contrastive learning tasks: label-based and data-based contrastive learning (see the sketch after this entry).
arXiv Detail & Related papers (2022-04-12T04:03:06Z)
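As a minimal sketch of the CLMLF recipe summarized above (token-level multi-layer fusion plus a label-based contrastive objective), the snippet below fuses text and image tokens with a small transformer encoder and combines cross-entropy with a supervised contrastive loss. Layer sizes, pooling, and the loss weight are assumptions for illustration, not the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TokenFusion(nn.Module):
    """Multi-layer fusion: concatenate token-level text and image features and
    let a small transformer encoder align/fuse them."""
    def __init__(self, dim=256, layers=2, num_classes=3):
        super().__init__()
        block = nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True)
        self.fusion = nn.TransformerEncoder(block, num_layers=layers)
        self.classifier = nn.Linear(dim, num_classes)

    def forward(self, text_tokens, image_tokens):
        tokens = torch.cat([text_tokens, image_tokens], dim=1)  # join along the sequence axis
        fused = self.fusion(tokens)
        pooled = fused.mean(dim=1)                               # simple mean pooling
        return pooled, self.classifier(pooled)

def label_contrastive_loss(features, labels, temperature=0.1):
    """Label-based (supervised) contrastive loss: pull samples sharing a label together."""
    feats = F.normalize(features, dim=-1)
    sim = feats @ feats.t() / temperature
    eye = torch.eye(sim.size(0), dtype=torch.bool, device=sim.device)
    sim = sim.masked_fill(eye, -1e9)                             # drop self-similarity
    pos = (labels.unsqueeze(0) == labels.unsqueeze(1)).float().masked_fill(eye, 0.0)
    log_prob = sim - torch.logsumexp(sim, dim=1, keepdim=True)
    return -(pos * log_prob).sum(1).div(pos.sum(1).clamp(min=1)).mean()

# Toy forward/backward pass.
model = TokenFusion()
text = torch.randn(8, 16, 256)    # 8 samples, 16 text tokens each
image = torch.randn(8, 49, 256)   # 8 samples, 7x7 image patches
labels = torch.randint(0, 3, (8,))
pooled, logits = model(text, image)
loss = F.cross_entropy(logits, labels) + 0.5 * label_contrastive_loss(pooled, labels)
loss.backward()
```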
- Benchmarking Multimodal AutoML for Tabular Data with Text Fields [83.43249184357053]
We assemble 18 multimodal data tables that each contain some text fields.
Our benchmark enables researchers to evaluate their own methods for supervised learning with numeric, categorical, and text features.
arXiv Detail & Related papers (2021-11-04T09:29:16Z)
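For benchmark tables like the ones in the entry above (numeric, categorical, and text columns in one table), a typical evaluation run with a multimodal AutoML toolkit might look like the following. This assumes an AutoGluon-style MultiModalPredictor API, and the file and column names ("train.csv", "label") are hypothetical.

```python
# Hypothetical run on one benchmark table with an AutoGluon-style API.
import pandas as pd
from autogluon.multimodal import MultiModalPredictor

train_df = pd.read_csv("train.csv")   # rows mixing numeric, categorical, and free-text columns
test_df = pd.read_csv("test.csv")

predictor = MultiModalPredictor(label="label")        # target column to predict
predictor.fit(train_data=train_df, time_limit=600)    # training budget in seconds

print(predictor.evaluate(test_df))                    # held-out metrics
preds = predictor.predict(test_df)                    # per-row predictions
```

In toolkits of this kind, the per-column modality is typically inferred automatically, which is what makes flexible combinations of image, text, and tabular inputs practical to benchmark.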
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the content (including all information) and is not responsible for any consequences arising from its use.