PaliGemma 2: A Family of Versatile VLMs for Transfer
- URL: http://arxiv.org/abs/2412.03555v1
- Date: Wed, 04 Dec 2024 18:50:42 GMT
- Title: PaliGemma 2: A Family of Versatile VLMs for Transfer
- Authors: Andreas Steiner, André Susano Pinto, Michael Tschannen, Daniel Keysers, Xiao Wang, Yonatan Bitton, Alexey Gritsenko, Matthias Minderer, Anthony Sherbondy, Shangbang Long, Siyang Qin, Reeve Ingle, Emanuele Bugliarello, Sahar Kazemzadeh, Thomas Mesnard, Ibrahim Alabdulmohsin, Lucas Beyer, Xiaohua Zhai
- Abstract summary: PaliGemma 2 is an upgrade of the PaliGemma open Vision-Language Model (VLM) based on the Gemma 2 family of language models.
We combine the SigLIP-So400m vision encoder that was also used by PaliGemma with the whole range of Gemma 2 models, from the 2B one all the way up to the 27B model.
We train these models at three resolutions (224px, 448px, and 896px) in multiple stages to equip them with broad knowledge for transfer via fine-tuning.
- Score: 48.68777561571185
- License:
- Abstract: PaliGemma 2 is an upgrade of the PaliGemma open Vision-Language Model (VLM) based on the Gemma 2 family of language models. We combine the SigLIP-So400m vision encoder that was also used by PaliGemma with the whole range of Gemma 2 models, from the 2B one all the way up to the 27B model. We train these models at three resolutions (224px, 448px, and 896px) in multiple stages to equip them with broad knowledge for transfer via fine-tuning. The resulting family of base models, covering different model sizes and resolutions, allows us to investigate factors impacting transfer performance (such as learning rate) and to analyze the interplay between the type of task, model size, and resolution. We further increase the number and breadth of transfer tasks beyond the scope of PaliGemma, including OCR-related tasks such as table structure recognition, molecular structure recognition, and music score recognition, as well as long fine-grained captioning and radiography report generation, on which PaliGemma 2 obtains state-of-the-art results.
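Since the model family is positioned for transfer via fine-tuning, a short usage sketch may help make the setup concrete. The snippet below assumes the Hugging Face transformers PaliGemma integration; the checkpoint id, task prefix, and image path are illustrative assumptions, not details taken from the paper.

```python
# Minimal inference sketch for a PaliGemma-style checkpoint via Hugging Face
# transformers. The checkpoint id is an assumption; substitute whichever
# PaliGemma 2 size/resolution variant you plan to fine-tune or evaluate.
import torch
from PIL import Image
from transformers import AutoProcessor, PaliGemmaForConditionalGeneration

model_id = "google/paligemma2-3b-pt-224"  # assumed id: 3B weights, 224px stage

processor = AutoProcessor.from_pretrained(model_id)
model = PaliGemmaForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

image = Image.open("example.jpg")  # any RGB image (path is a placeholder)
prompt = "<image>caption en"       # PaliGemma-style task prefix; recent
                                   # transformers versions expect the <image>
                                   # placeholder in the text prompt

inputs = (
    processor(text=prompt, images=image, return_tensors="pt")
    .to(torch.bfloat16)   # cast pixel_values to match the model dtype
    .to(model.device)
)
output_ids = model.generate(**inputs, max_new_tokens=64)

# Decode only the newly generated tokens (the prompt tokens come first).
prompt_len = inputs["input_ids"].shape[-1]
print(processor.decode(output_ids[0][prompt_len:], skip_special_tokens=True))
```

Fine-tuning for a downstream task follows the same pattern, with a task-specific prefix and supervised target text fed through the processor.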
Related papers
- DiM-Gestor: Co-Speech Gesture Generation with Adaptive Layer Normalization Mamba-2 [6.6954598568836925]
DiM-Gestor is an end-to-end generative model leveraging the Mamba-2 architecture.
A fuzzy feature extractor and a speech-to-gesture mapping module are built on the Mamba-2.
Our approach delivers competitive results while reducing memory usage by approximately 2.4x and improving inference speed by 2 to 4 times.
arXiv Detail & Related papers (2024-11-23T08:02:03Z)
- VLM2Vec: Training Vision-Language Models for Massive Multimodal Embedding Tasks [60.5257456681402]
We study the potential for building universal embeddings capable of handling a wide range of downstream tasks.
We build a series of VLM2Vec models on state-of-the-art VLMs such as Phi-3.5-V and LLaVA-1.6, and evaluate them on MMEB's evaluation split.
Our results show that VLM2Vec achieves an absolute average improvement of 10% to 20% over existing multimodal embedding models.
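The general recipe for turning a generative VLM into an embedding model is to pool a hidden state into a fixed-size vector and train it contrastively. A hedged sketch of that pattern follows; the last-token pooling and InfoNCE objective are common defaults assumed here, not necessarily VLM2Vec's exact configuration.

```python
# Sketch of pooling a VLM hidden state into an embedding and training it
# contrastively. Pooling choice and loss are illustrative assumptions.
import torch
import torch.nn.functional as F

def pool_last_token(hidden_states: torch.Tensor, attention_mask: torch.Tensor) -> torch.Tensor:
    """Hidden state of each sequence's final non-padding token: (B, L, D) -> (B, D)."""
    last = attention_mask.sum(dim=1) - 1  # index of the last real token per row
    return hidden_states[torch.arange(hidden_states.size(0)), last]

def info_nce(query: torch.Tensor, target: torch.Tensor, temperature: float = 0.05) -> torch.Tensor:
    """In-batch contrastive loss: row i of `query` should match row i of `target`."""
    q = F.normalize(query, dim=-1)
    t = F.normalize(target, dim=-1)
    logits = q @ t.T / temperature
    return F.cross_entropy(logits, torch.arange(q.size(0), device=q.device))

# e.g. hidden = model(**inputs, output_hidden_states=True).hidden_states[-1]
#      emb = pool_last_token(hidden, inputs["attention_mask"])
```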
arXiv Detail & Related papers (2024-10-07T16:14:05Z)
- GaitMA: Pose-guided Multi-modal Feature Fusion for Gait Recognition [26.721242606715354]
Gait recognition is a biometric technology that recognizes the identity of humans through their walking patterns.
We propose a novel gait recognition framework, dubbed Gait Multi-modal Aggregation Network (GaitMA).
First, skeletons are represented by joint/limb-based heatmaps, and features from silhouettes and skeletons are respectively extracted using two CNN-based feature extractors.
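A minimal sketch of that two-branch pattern follows; the layer sizes, the 17-channel joint heatmaps, and the simple concatenation fusion are illustrative assumptions, not the paper's exact architecture.

```python
# Toy dual-branch gait network: one CNN encodes silhouettes, another encodes
# joint heatmaps rendered from skeletons, and their features are fused.
import torch
import torch.nn as nn

def conv_branch(in_ch: int) -> nn.Sequential:
    return nn.Sequential(
        nn.Conv2d(in_ch, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
        nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(),
        nn.AdaptiveAvgPool2d(1), nn.Flatten(),   # -> (B, 64) per branch
    )

class DualBranchGaitNet(nn.Module):
    def __init__(self, num_ids: int = 100):
        super().__init__()
        self.silhouette_branch = conv_branch(1)   # binary silhouette frames
        self.skeleton_branch = conv_branch(17)    # one heatmap per joint (assumed)
        self.classifier = nn.Linear(64 + 64, num_ids)

    def forward(self, silhouette: torch.Tensor, heatmaps: torch.Tensor) -> torch.Tensor:
        fused = torch.cat([self.silhouette_branch(silhouette),
                           self.skeleton_branch(heatmaps)], dim=1)
        return self.classifier(fused)

net = DualBranchGaitNet()
logits = net(torch.rand(2, 1, 64, 44), torch.rand(2, 17, 64, 44))
print(logits.shape)  # torch.Size([2, 100])
```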
arXiv Detail & Related papers (2024-07-20T09:05:17Z)
- PaliGemma: A versatile 3B VLM for transfer [112.41933621495446]
PaliGemma is an open Vision-Language Model (VLM) that is based on the SigLIP-So400m vision encoder and the Gemma-2B language model.
We evaluate PaliGemma on almost 40 diverse tasks, including not only standard VLM benchmarks but also more specialized tasks such as remote sensing and segmentation.
arXiv Detail & Related papers (2024-07-10T14:57:46Z)
- Masked Pre-Training of Transformers for Histology Image Analysis [4.710921988115685]
In digital pathology, whole slide images (WSIs) are widely used for applications such as cancer diagnosis and prognosis prediction.
Visual transformer models have emerged as a promising method for encoding large regions of WSIs while preserving spatial relationships among patches.
To address the scarcity of labeled data, we propose a pretext task for training the transformer model without labels.
Our model, MaskHIT, uses the transformer output to reconstruct masked patches and learn representative histological features based on their positions and visual features.
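A toy sketch of such a masked-patch pretext task follows; the dimensions, 50% mask ratio, and MSE reconstruction objective are assumptions for illustration, not MaskHIT's exact recipe.

```python
# Masked-patch pretext task: randomly replace patch embeddings with a learned
# mask token, run a transformer encoder, and regress the original embeddings
# at the masked positions. Positional embeddings carry the "where" signal.
import torch
import torch.nn as nn

dim, n_patches, batch = 256, 64, 8
encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=dim, nhead=8, batch_first=True),
    num_layers=4,
)
mask_token = nn.Parameter(torch.zeros(dim))
pos_embed = nn.Parameter(torch.randn(n_patches, dim) * 0.02)

patches = torch.randn(batch, n_patches, dim)   # stand-in for extracted patch features
mask = torch.rand(batch, n_patches) < 0.5      # mask ~50% of patches (assumed ratio)

inp = torch.where(mask.unsqueeze(-1), mask_token.expand_as(patches), patches)
out = encoder(inp + pos_embed)

# Reconstruction loss only on the masked positions.
loss = nn.functional.mse_loss(out[mask], patches[mask])
loss.backward()
```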
arXiv Detail & Related papers (2023-04-14T23:56:49Z)
- mPLUG-2: A Modularized Multi-modal Foundation Model Across Text, Image and Video [89.19867891570945]
mPLUG-2 is a new unified paradigm with modularized design for multi-modal pretraining.
It shares common universal modules to support modality collaboration, while disentangling modality-specific modules to deal with modality entanglement.
Different modules can be flexibly selected for different understanding and generation tasks across all modalities, including text, image, and video.
arXiv Detail & Related papers (2023-02-01T12:40:03Z)
- Part-aware Prototypical Graph Network for One-shot Skeleton-based Action Recognition [57.86960990337986]
One-shot skeleton-based action recognition poses unique challenges in learning transferable representation from base classes to novel classes.
We propose a part-aware prototypical representation for one-shot skeleton-based action recognition.
We demonstrate the effectiveness of our method on two public skeleton-based action recognition datasets.
arXiv Detail & Related papers (2022-08-19T04:54:56Z)
- Adversarial Bipartite Graph Learning for Video Domain Adaptation [50.68420708387015]
Domain adaptation techniques, which focus on adapting models between distributionally different domains, are rarely explored in the video recognition area.
Recent works on visual domain adaptation that leverage adversarial learning to unify source and target video representations are not highly effective on videos.
This paper proposes an Adversarial Bipartite Graph (ABG) learning framework which directly models the source-target interactions.
arXiv Detail & Related papers (2020-07-31T03:48:41Z)
- Deep brain state classification of MEG data [2.9048924265579124]
This paper uses Magnetoencephalography (MEG) data, provided by the Human Connectome Project (HCP), in combination with various deep artificial neural network models to perform brain decoding.
arXiv Detail & Related papers (2020-07-02T05:51:57Z)