PaliGemma 2: A Family of Versatile VLMs for Transfer
- URL: http://arxiv.org/abs/2412.03555v1
- Date: Wed, 04 Dec 2024 18:50:42 GMT
- Title: PaliGemma 2: A Family of Versatile VLMs for Transfer
- Authors: Andreas Steiner, André Susano Pinto, Michael Tschannen, Daniel Keysers, Xiao Wang, Yonatan Bitton, Alexey Gritsenko, Matthias Minderer, Anthony Sherbondy, Shangbang Long, Siyang Qin, Reeve Ingle, Emanuele Bugliarello, Sahar Kazemzadeh, Thomas Mesnard, Ibrahim Alabdulmohsin, Lucas Beyer, Xiaohua Zhai
- Abstract summary: PaliGemma 2 is an upgrade of the PaliGemma open Vision-Language Model (VLM) based on the Gemma 2 family of language models. We combine the SigLIP-So400m vision encoder that was also used by PaliGemma with the whole range of Gemma 2 models, from the 2B one all the way up to the 27B model. We train these models at three resolutions (224px, 448px, and 896px) in multiple stages to equip them with broad knowledge for transfer via fine-tuning.
- Score: 48.68777561571185
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: PaliGemma 2 is an upgrade of the PaliGemma open Vision-Language Model (VLM) based on the Gemma 2 family of language models. We combine the SigLIP-So400m vision encoder that was also used by PaliGemma with the whole range of Gemma 2 models, from the 2B one all the way up to the 27B model. We train these models at three resolutions (224px, 448px, and 896px) in multiple stages to equip them with broad knowledge for transfer via fine-tuning. The resulting family of base models covering different model sizes and resolutions allows us to investigate factors impacting transfer performance (such as learning rate) and to analyze the interplay between the type of task, model size, and resolution. We further increase the number and breadth of transfer tasks beyond the scope of PaliGemma, including different OCR-related tasks such as table structure recognition, molecular structure recognition, and music score recognition, as well as long fine-grained captioning and radiography report generation, on which PaliGemma 2 obtains state-of-the-art results.
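For readers who want to try the released checkpoints, below is a minimal inference sketch using the Hugging Face transformers classes for PaliGemma. The checkpoint id and prompt format are assumptions based on PaliGemma conventions (a size/resolution grid of pretrained models, and prompts built from an image placeholder plus a task prefix such as "caption en"); verify both against the actual Hub release.

```python
from PIL import Image
from transformers import PaliGemmaForConditionalGeneration, PaliGemmaProcessor
import torch

# Assumed checkpoint name following the paper's size/resolution grid
# (3B Gemma 2 backbone, 224px stage); confirm the exact id on the Hub.
model_id = "google/paligemma2-3b-pt-224"
processor = PaliGemmaProcessor.from_pretrained(model_id)
model = PaliGemmaForConditionalGeneration.from_pretrained(model_id)

image = Image.open("example.jpg").convert("RGB")
# PaliGemma-style prompt: an image placeholder plus a task prefix.
inputs = processor(text="<image>caption en", images=image, return_tensors="pt")

with torch.no_grad():
    output_ids = model.generate(**inputs, max_new_tokens=64)

# The generated sequence includes the prompt tokens; decode only the new ones.
prompt_len = inputs["input_ids"].shape[-1]
print(processor.decode(output_ids[0, prompt_len:], skip_special_tokens=True))
```

Fine-tuning for transfer, as studied in the paper, follows the same preprocessing path, with the processor outputs fed to a standard training loop.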
Related papers
- Enhancing Vehicle Make and Model Recognition with 3D Attention Modules [1.4999444543328293]
Vehicle make and model recognition (VMMR) is a crucial component of intelligent transport systems.
In this study, we implement an attention module to address inter-class similarity and intra-class variation challenges.
Our proposed model integrates the attention module into two different locations within the middle section of a convolutional model.
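The summary does not specify the attention design, so the sketch below stands in a generic channel-attention (squeeze-and-excitation style) block at two mid-network insertion points, purely to illustrate the placement the authors describe; all layer sizes are illustrative.

```python
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    """Generic channel attention: reweight feature maps by learned gates."""
    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        self.fc = nn.Sequential(
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(channels, channels // reduction), nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels), nn.Sigmoid(),
        )

    def forward(self, x):
        w = self.fc(x).unsqueeze(-1).unsqueeze(-1)  # (B, C, 1, 1) gates
        return x * w

class VMMRNet(nn.Module):
    def __init__(self, num_classes: int):
        super().__init__()
        self.stem = nn.Sequential(nn.Conv2d(3, 64, 7, 2, 3), nn.ReLU(inplace=True))
        self.block1 = nn.Sequential(nn.Conv2d(64, 128, 3, 2, 1), nn.ReLU(inplace=True))
        self.attn1 = ChannelAttention(128)   # first mid-network insertion point
        self.block2 = nn.Sequential(nn.Conv2d(128, 256, 3, 2, 1), nn.ReLU(inplace=True))
        self.attn2 = ChannelAttention(256)   # second mid-network insertion point
        self.head = nn.Sequential(nn.AdaptiveAvgPool2d(1), nn.Flatten(),
                                  nn.Linear(256, num_classes))

    def forward(self, x):
        x = self.attn1(self.block1(self.stem(x)))
        x = self.attn2(self.block2(x))
        return self.head(x)

logits = VMMRNet(num_classes=100)(torch.randn(2, 3, 224, 224))
```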
arXiv Detail & Related papers (2025-02-21T11:52:56Z) - DiM-Gestor: Co-Speech Gesture Generation with Adaptive Layer Normalization Mamba-2 [6.6954598568836925]
DiM-Gestor is an end-to-end generative model leveraging the Mamba-2 architecture.
A fuzzy feature extractor and a speech-to-gesture mapping module are built on the Mamba-2.
Our approach delivers competitive results, reduces memory usage by approximately 2.4 times, and improves inference speed by 2 to 4 times.
arXiv Detail & Related papers (2024-11-23T08:02:03Z) - VLM2Vec: Training Vision-Language Models for Massive Multimodal Embedding Tasks [60.5257456681402]
We build universal embedding models capable of handling a wide range of downstream tasks.
Our contributions are twofold: (1) MMEB (Massive Multimodal Embedding Benchmark), which covers 4 meta-tasks (i.e. classification, visual question answering, multimodal retrieval, and visual grounding) and 36 datasets, including 20 training and 16 evaluation datasets, and (2) VLM2Vec (Vision-Language Model -> Vector), a contrastive training framework that converts any state-of-the-art vision-language model into an embedding model via training on MMEB.
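The contrastive recipe behind such a framework can be sketched compactly: an in-batch InfoNCE loss over query and target embeddings pooled from a VLM. The pooling source and temperature below are assumptions for illustration, not VLM2Vec's exact choices.

```python
import torch
import torch.nn.functional as F

def info_nce(query_emb: torch.Tensor, target_emb: torch.Tensor,
             temperature: float = 0.05) -> torch.Tensor:
    """query_emb, target_emb: (batch, dim); row i of each forms a positive pair."""
    q = F.normalize(query_emb, dim=-1)
    t = F.normalize(target_emb, dim=-1)
    logits = q @ t.T / temperature                      # (batch, batch) similarities
    labels = torch.arange(q.shape[0], device=q.device)  # diagonal entries are positives
    return F.cross_entropy(logits, labels)

# Stand-ins for embeddings pooled from a VLM's last hidden state.
loss = info_nce(torch.randn(8, 768), torch.randn(8, 768))
print(loss.item())
```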
arXiv Detail & Related papers (2024-10-07T16:14:05Z) - GaitMA: Pose-guided Multi-modal Feature Fusion for Gait Recognition [26.721242606715354]
Gait recognition is a biometric technology that recognizes the identity of humans through their walking patterns.
We propose a novel gait recognition framework, dubbed Gait Multi-model Aggregation Network (GaitMA).
First, skeletons are represented by joint/limb-based heatmaps, and features from silhouettes and skeletons are respectively extracted using two CNN-based feature extractors.
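As a rough illustration of that two-branch data flow (not GaitMA's actual fusion module), the sketch below extracts features from silhouettes and joint heatmaps with two small CNNs and concatenates them; the 17-joint heatmap layout and all layer sizes are assumptions.

```python
import torch
import torch.nn as nn

def small_cnn(in_ch: int) -> nn.Sequential:
    """A tiny stand-in feature extractor producing a 64-dim vector per sample."""
    return nn.Sequential(
        nn.Conv2d(in_ch, 32, 3, 2, 1), nn.ReLU(inplace=True),
        nn.Conv2d(32, 64, 3, 2, 1), nn.ReLU(inplace=True),
        nn.AdaptiveAvgPool2d(1), nn.Flatten(),
    )

silhouette_net = small_cnn(1)   # binary silhouette frames
skeleton_net = small_cnn(17)    # one heatmap channel per joint (17 joints assumed)

silhouettes = torch.randn(4, 1, 64, 44)
heatmaps = torch.randn(4, 17, 64, 44)
fused = torch.cat([silhouette_net(silhouettes), skeleton_net(heatmaps)], dim=-1)
print(fused.shape)  # (4, 128): concatenated gait embedding per sample
```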
arXiv Detail & Related papers (2024-07-20T09:05:17Z) - PaliGemma: A versatile 3B VLM for transfer [112.41933621495446]
PaliGemma is an open Vision-Language Model (VLM) that is based on the SigLIP-So400m vision encoder and the Gemma-2B language model.
We evaluate PaliGemma on almost 40 diverse tasks, including standard VLM benchmarks as well as more specialized tasks such as remote sensing and segmentation.
arXiv Detail & Related papers (2024-07-10T14:57:46Z) - Dragonfly: Multi-Resolution Zoom-In Encoding Enhances Vision-Language Models [26.322856874796702]
Vision transformers (ViTs) struggle to capture fine-grained details from less prominent objects, charts, and embedded text.
We extend recent high-resolution and multi-crop techniques by not only preserving the native resolution, but also zooming in beyond it.
This enhancement allows our model to better capture fine-grained details, overcoming the limitations of current ViTs.
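The zoom-in idea can be sketched as a crop pyramid: encode a global view at the ViT's native resolution, then tile a higher-resolution version into additional crops so fine detail survives downsampling. The tile counts and sizes below are illustrative, not Dragonfly's exact configuration.

```python
from PIL import Image

def zoom_in_crops(image: Image.Image, base: int = 224, zoom: int = 2):
    """Return one global view plus a zoom x zoom grid of detail crops."""
    views = [image.resize((base, base))]            # global view at native resolution
    hi = image.resize((base * zoom, base * zoom))   # zoomed in beyond native resolution
    for row in range(zoom):
        for col in range(zoom):
            box = (col * base, row * base, (col + 1) * base, (row + 1) * base)
            views.append(hi.crop(box))              # each detail crop is base x base
    return views  # every view would be encoded by the ViT and concatenated

crops = zoom_in_crops(Image.new("RGB", (640, 480)))
print(len(crops))  # 1 global view + 4 detail crops
```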
arXiv Detail & Related papers (2024-06-03T04:17:12Z) - Dual Attention Model with Reinforcement Learning for Classification of Histology Whole-Slide Images [8.404881822414898]
Digital whole slide images (WSIs) are generally captured at microscopic resolution and encompass extensive spatial data.
We propose a novel dual attention approach, consisting of two main components, both inspired by the visual examination process of a pathologist.
We show that the proposed model achieves performance better than or comparable to the state-of-the-art methods while processing less than 10% of the WSI at the highest magnification.
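The budgeted-processing idea reduces to a simple select-then-zoom step: score patches cheaply at low magnification, then fetch only the top-scoring fraction at the highest magnification. The scoring source and the 10% budget below are stand-ins, not the paper's actual attention components.

```python
import numpy as np

def select_high_mag_patches(low_mag_scores: np.ndarray, budget: float = 0.10):
    """low_mag_scores: (num_patches,) attention scores from a low-magnification pass."""
    k = max(1, int(budget * low_mag_scores.size))
    return np.argsort(low_mag_scores)[-k:]  # indices to re-read at high magnification

scores = np.random.rand(1000)               # e.g. one score per candidate tissue patch
chosen = select_high_mag_patches(scores)
print(chosen.size)  # 100 patches: only ~10% of the WSI touched at full resolution
```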
arXiv Detail & Related papers (2023-02-19T22:26:25Z) - mPLUG-2: A Modularized Multi-modal Foundation Model Across Text, Image
and Video [89.19867891570945]
mPLUG-2 is a new unified paradigm with modularized design for multi-modal pretraining.
It shares common universal modules for modality collaboration and disentangles different modality modules to deal with modality entanglement.
It is flexible to select different modules for different understanding and generation tasks across all modalities including text, image, and video.
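A hedged sketch of what such modularization can look like in code: modality-specific encoders stay disentangled in a registry, while a single shared fusion stack is reused across task configurations. All module names and sizes here are illustrative, not mPLUG-2's actual components.

```python
import torch.nn as nn

MODALITY_ENCODERS = {           # disentangled, modality-specific modules
    "text": nn.Embedding(32000, 512),
    "image": nn.Conv2d(3, 512, 16, 16),                  # 16x16 patch embedding
    "video": nn.Conv3d(3, 512, (2, 16, 16), (2, 16, 16)),  # temporal patch embedding
}

shared_fusion = nn.TransformerEncoder(   # common universal module, shared by all tasks
    nn.TransformerEncoderLayer(d_model=512, nhead=8, batch_first=True),
    num_layers=2,
)

def build_model(task_modalities):
    """Select only the encoders a task needs, plus the shared fusion stack."""
    encoders = nn.ModuleDict({m: MODALITY_ENCODERS[m] for m in task_modalities})
    return nn.ModuleDict({"encoders": encoders, "fusion": shared_fusion})

captioning = build_model(["image", "text"])  # image captioning configuration
video_qa = build_model(["video", "text"])    # video QA reuses the same fusion module
```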
arXiv Detail & Related papers (2023-02-01T12:40:03Z) - Part-aware Prototypical Graph Network for One-shot Skeleton-based Action
Recognition [57.86960990337986]
One-shot skeleton-based action recognition poses unique challenges in learning transferable representation from base classes to novel classes.
We propose a part-aware prototypical representation for one-shot skeleton-based action recognition.
We demonstrate the effectiveness of our method on two public skeleton-based action recognition datasets.
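For context, the sketch below shows the basic prototypical-network step that part-aware methods build on: one-shot class prototypes and nearest-prototype classification by cosine similarity. The paper's part-aware decomposition would add per-body-part prototypes on top of this basic scheme.

```python
import torch
import torch.nn.functional as F

def one_shot_prototypes(support: torch.Tensor) -> torch.Tensor:
    """support: (num_classes, dim), one embedded example per novel class."""
    return F.normalize(support, dim=-1)

def classify(query: torch.Tensor, prototypes: torch.Tensor) -> torch.Tensor:
    """Assign each query to the nearest prototype by cosine similarity."""
    sims = F.normalize(query, dim=-1) @ prototypes.T  # (num_query, num_classes)
    return sims.argmax(dim=-1)

protos = one_shot_prototypes(torch.randn(5, 256))  # 5 novel action classes
preds = classify(torch.randn(10, 256), protos)
print(preds)
```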
arXiv Detail & Related papers (2022-08-19T04:54:56Z) - Adversarial Bipartite Graph Learning for Video Domain Adaptation [50.68420708387015]
Domain adaptation techniques, which focus on adapting models between distributionally different domains, are rarely explored in the video recognition area.
Recent works on visual domain adaptation, which leverage adversarial learning to unify the source and target video representations, are not highly effective on videos.
This paper proposes an Adversarial Bipartite Graph (ABG) learning framework which directly models the source-target interactions.
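The core data structure here is a bipartite affinity matrix whose edges connect only source-batch to target-batch samples. The minimal sketch below builds such a matrix and aggregates target features per source node; the adversarial training details are omitted, and the similarity and normalization choices are assumptions.

```python
import torch
import torch.nn.functional as F

def bipartite_affinity(source_feats: torch.Tensor,
                       target_feats: torch.Tensor) -> torch.Tensor:
    """Edge weights between every source and every target sample (no
    source-source or target-target edges, hence 'bipartite')."""
    s = F.normalize(source_feats, dim=-1)
    t = F.normalize(target_feats, dim=-1)
    return torch.softmax(s @ t.T, dim=-1)  # (num_source, num_target)

src, tgt = torch.randn(8, 512), torch.randn(6, 512)
A = bipartite_affinity(src, tgt)  # cross-domain edge weights
src_from_tgt = A @ tgt            # each source node aggregates target features
```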
arXiv Detail & Related papers (2020-07-31T03:48:41Z) - Deep brain state classification of MEG data [2.9048924265579124]
This paper uses Magnetoencephalography (MEG) data, provided by the Human Connectome Project (HCP), in combination with various deep artificial neural network models to perform brain decoding.
arXiv Detail & Related papers (2020-07-02T05:51:57Z)