Multimodal Metadata Assignment for Cultural Heritage Artifacts
- URL: http://arxiv.org/abs/2406.00423v1
- Date: Sat, 1 Jun 2024 12:41:03 GMT
- Title: Multimodal Metadata Assignment for Cultural Heritage Artifacts
- Authors: Luis Rei, Dunja Mladenić, Mareike Dorozynski, Franz Rottensteiner, Thomas Schleider, Raphaël Troncy, Jorge Sebastián Lozano, Mar Gaitán Salvatella,
- Abstract summary: We develop a multimodal classifier for the cultural heritage domain using a late fusion approach.
The three modalities are Image, Text, and Tabular data.
All individual classifiers accurately predict missing properties in the digitized silk artifacts, with the multimodal approach providing the best results.
- Score: 1.5826261914050386
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We develop a multimodal classifier for the cultural heritage domain using a late fusion approach and introduce a novel dataset. The three modalities are Image, Text, and Tabular data. We based the image classifier on a ResNet convolutional neural network architecture and the text classifier on a multilingual transformer architecture (XLM-RoBERTa). Both are trained as multitask classifiers and use the focal loss to handle class imbalance. Tabular data and late fusion are handled by Gradient Tree Boosting. We also show how we leveraged specific data models and a taxonomy in a Knowledge Graph to create the dataset and to store classification results. All individual classifiers accurately predict missing properties in the digitized silk artifacts, with the multimodal approach providing the best results.
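The abstract names focal loss as the tool for class imbalance. A minimal NumPy sketch of the multiclass focal loss (the toy probabilities are illustrative, not the paper's code or data):

```python
import numpy as np

def focal_loss(probs, targets, gamma=2.0, eps=1e-12):
    """Multiclass focal loss, FL(p_t) = -(1 - p_t)^gamma * log(p_t).

    probs:   (N, C) predicted class probabilities
    targets: (N,)   integer class labels
    With gamma = 0 this reduces to ordinary cross-entropy; larger gamma
    down-weights well-classified examples, which helps under class imbalance.
    """
    p_t = probs[np.arange(len(targets)), targets]
    return float(np.mean(-((1.0 - p_t) ** gamma) * np.log(p_t + eps)))

probs = np.array([[0.9, 0.1],
                  [0.6, 0.4]])
targets = np.array([0, 1])

ce = focal_loss(probs, targets, gamma=0.0)  # plain cross-entropy
fl = focal_loss(probs, targets, gamma=2.0)  # focal loss: confident example 1 is down-weighted
```

The late-fusion step described in the abstract would then correspond to stacking the per-modality probability vectors as tabular features for the gradient-boosted model.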
Related papers
- Web-Scale Visual Entity Recognition: An LLM-Driven Data Approach [56.55633052479446]
Web-scale visual entity recognition presents significant challenges due to the lack of clean, large-scale training data.
We propose a novel methodology to curate such a dataset, leveraging a multimodal large language model (LLM) for label verification, metadata generation, and rationale explanation.
Experiments demonstrate that models trained on this automatically curated data achieve state-of-the-art performance on web-scale visual entity recognition tasks.
arXiv Detail & Related papers (2024-10-31T06:55:24Z)
- ARMADA: Attribute-Based Multimodal Data Augmentation [93.05614922383822]
Attribute-based Multimodal Data Augmentation (ARMADA) is a novel multimodal data augmentation method based on knowledge-guided manipulation of visual attributes.
The framework extracts knowledge-grounded attributes from symbolic KBs to generate semantically consistent yet distinctive image-text pairs.
This also highlights the need to leverage external knowledge proxies for enhanced interpretability and real-world grounding.
arXiv Detail & Related papers (2024-08-19T15:27:25Z)
- Replica Tree-based Federated Learning using Limited Data [6.572149681197959]
In this work, we propose a novel federated learning framework, named RepTreeFL.
At the core of the solution is the concept of a replica, where we replicate each participating client by copying its model architecture and perturbing its local data distribution.
Our approach enables learning from limited data and a small number of clients by aggregating a larger number of models with diverse data distributions.
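As a rough illustration of the replica idea (not the RepTreeFL implementation), one can copy each client, perturb its local data distribution, fit a model per replica, and aggregate the enlarged pool of models. The nearest-centroid "model" below is a placeholder assumption:

```python
import numpy as np

rng = np.random.default_rng(0)

def fit_centroids(X, y):
    """Placeholder local model: per-class mean vectors (nearest centroid)."""
    return np.stack([X[y == c].mean(axis=0) for c in (0, 1)])

def make_replicas(X, y, n_replicas, noise=0.1):
    """Replicate a client by perturbing its local data distribution."""
    return [(X + rng.normal(0.0, noise, X.shape), y) for _ in range(n_replicas)]

# Two clients, each with very little local data
clients = []
for _ in range(2):
    X = rng.normal(size=(8, 2))
    y = np.array([0, 1, 0, 1, 0, 1, 0, 1])  # both classes present
    clients.append((X, y))

# Fit a model on each client's original data plus each of its replicas ...
local_models = [
    fit_centroids(Xr, yr)
    for X, y in clients
    for Xr, yr in [(X, y)] + make_replicas(X, y, n_replicas=3)
]

# ... then aggregate the larger pool of diverse models by simple averaging
global_model = np.mean(local_models, axis=0)  # shape: (2 classes, 2 features)
```

Averaging centroids stands in here for the federated aggregation step; the point is only that replicas multiply the number of models contributing to the global one.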
arXiv Detail & Related papers (2023-12-28T17:47:25Z)
- MotherNet: A Foundational Hypernetwork for Tabular Classification [1.9643748953805937]
We propose a hypernetwork architecture that we call MotherNet, trained on millions of classification tasks.
MotherNet replaces training on specific datasets with in-context learning through a single forward pass.
The child network generated by MotherNet using in-context learning outperforms neural networks trained using gradient descent on small datasets.
arXiv Detail & Related papers (2023-12-14T01:48:58Z)
- Fashionformer: A simple, Effective and Unified Baseline for Human Fashion Segmentation and Recognition [80.74495836502919]
In this work, we focus on joint human fashion segmentation and attribute recognition.
We introduce the object query for segmentation and the attribute query for attribute prediction.
For attribute stream, we design a novel Multi-Layer Rendering module to explore more fine-grained features.
arXiv Detail & Related papers (2022-04-10T11:11:10Z)
- Generating More Pertinent Captions by Leveraging Semantics and Style on Multi-Source Datasets [56.018551958004814]
This paper addresses the task of generating fluent descriptions by training on a non-uniform combination of data sources.
Large-scale datasets with noisy image-text pairs provide a sub-optimal source of supervision.
We propose to leverage and separate semantics and descriptive style through the incorporation of a style token and keywords extracted through a retrieval component.
arXiv Detail & Related papers (2021-11-24T19:00:05Z)
- Multi-dataset Pretraining: A Unified Model for Semantic Segmentation [97.61605021985062]
We propose a unified framework, termed as Multi-Dataset Pretraining, to take full advantage of the fragmented annotations of different datasets.
This is achieved by first pretraining the network via the proposed pixel-to-prototype contrastive loss over multiple datasets.
In order to better model the relationship among images and classes from different datasets, we extend the pixel level embeddings via cross dataset mixing.
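A pixel-to-prototype contrastive loss of the kind named above can be sketched as an InfoNCE objective over class prototypes. The NumPy version below is an illustrative guess at the general form, not the paper's definition:

```python
import numpy as np

def pixel_to_prototype_loss(emb, labels, prototypes, tau=0.1):
    """InfoNCE-style pixel-to-prototype contrastive loss (illustrative).

    emb:        (N, D) L2-normalized pixel embeddings
    labels:     (N,)   class index of each pixel
    prototypes: (C, D) L2-normalized class prototype vectors
    Each pixel is pulled toward its own class prototype and pushed away
    from the prototypes of all other classes.
    """
    logits = emb @ prototypes.T / tau                    # (N, C) similarities
    logits -= logits.max(axis=1, keepdims=True)          # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-log_probs[np.arange(len(labels)), labels].mean())

# Embeddings perfectly aligned with their class prototypes give near-zero loss ...
protos = np.eye(3)
emb = np.eye(3)
loss_good = pixel_to_prototype_loss(emb, np.array([0, 1, 2]), protos)
# ... while pixels assigned to the wrong prototype are penalized heavily
loss_bad = pixel_to_prototype_loss(emb, np.array([1, 2, 0]), protos)
```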
arXiv Detail & Related papers (2021-06-08T06:13:11Z)
- GraphFormers: GNN-nested Transformers for Representation Learning on Textual Graph [53.70520466556453]
We propose GraphFormers, where layerwise GNN components are nested alongside the transformer blocks of language models.
With the proposed architecture, the text encoding and the graph aggregation are fused into an iterative workflow.
In addition, a progressive learning strategy is introduced, where the model is successively trained on manipulated and original data to reinforce its capability of integrating information over the graph.
arXiv Detail & Related papers (2021-05-06T12:20:41Z)
- Robust Document Representations using Latent Topics and Metadata [17.306088038339336]
We propose a novel approach to fine-tuning a pre-trained neural language model for document classification problems.
We generate document representations that capture both text and metadata artifacts in a task-specific manner.
Our solution also incorporates metadata explicitly rather than just augmenting them with text.
arXiv Detail & Related papers (2020-10-23T21:52:38Z)
- Black Box to White Box: Discover Model Characteristics Based on Strategic Probing [0.0]
White Box Adversarial Attacks rely on knowledge of the underlying model attributes.
This work focuses on discovering two distinct pieces of model information: the underlying architecture and the primary training dataset.
For image classification, the focus is on probing commonly deployed architectures and datasets available in popular public libraries.
Using a single transformer architecture with multiple parameter scales, text generation is explored by fine-tuning on different datasets.
Each dataset explored, for both image and text, is distinguishable from the others.
arXiv Detail & Related papers (2020-09-07T14:44:28Z)
- End-to-End Entity Classification on Multimodal Knowledge Graphs [0.0]
We propose a multimodal message passing network which learns end-to-end from the structure of graphs.
Our model uses dedicated (neural) encoders to naturally learn embeddings for node features belonging to five different types of modalities.
Our result supports our hypothesis that including information from multiple modalities can help our models obtain a better overall performance.
arXiv Detail & Related papers (2020-03-25T14:57:52Z)
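The idea above of dedicated per-modality encoders feeding a message-passing step can be sketched roughly as follows; the encoders, fusion-by-summation, and toy graph are all assumptions for illustration, not the paper's architecture:

```python
import numpy as np

rng = np.random.default_rng(1)

def encode(feats, W):
    """Hypothetical per-modality encoder: project features of any width
    into a shared embedding space."""
    return np.tanh(feats @ W)

D = 4  # shared embedding size
modalities = {            # toy node features of different widths per modality
    "numeric": rng.normal(size=(5, 3)),
    "text":    rng.normal(size=(5, 7)),
    "image":   rng.normal(size=(5, 9)),
}
weights = {m: rng.normal(size=(f.shape[1], D)) for m, f in modalities.items()}

# Fuse modality-specific encodings into one embedding per node
node_emb = sum(encode(f, weights[m]) for m, f in modalities.items())

# One round of mean-aggregation message passing over a tiny 5-node graph
adj = np.array([[0, 1, 1, 0, 0],
                [1, 0, 0, 1, 0],
                [1, 0, 0, 0, 1],
                [0, 1, 0, 0, 1],
                [0, 0, 1, 1, 0]], dtype=float)
deg = adj.sum(axis=1, keepdims=True)
messages = (adj @ node_emb) / deg  # each node averages its neighbors' embeddings
```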
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences of its use.