T5Gemma 2: Seeing, Reading, and Understanding Longer
- URL: http://arxiv.org/abs/2512.14856v2
- Date: Tue, 23 Dec 2025 20:01:24 GMT
- Title: T5Gemma 2: Seeing, Reading, and Understanding Longer
- Authors: Biao Zhang, Paul Suganthan, Gaël Liu, Ilya Philippov, Sahil Dua, Ben Hora, Kat Black, Gus Martins, Omar Sanseviero, Shreya Pathak, Cassidy Hardin, Francesco Visin, Jiageng Zhang, Kathleen Kenealy, Qin Yin, Xiaodan Song, Olivier Lacombe, Armand Joulin, Tris Warkentin, Adam Roberts,
- Abstract summary: We introduce T5Gemma 2, the next generation of the T5Gemma family of lightweight open encoder-decoder models. T5Gemma 2 features strong multilingual, multimodal and long-context capabilities.
- Score: 16.370153525727066
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We introduce T5Gemma 2, the next generation of the T5Gemma family of lightweight open encoder-decoder models, featuring strong multilingual, multimodal and long-context capabilities. T5Gemma 2 follows the adaptation recipe (via UL2) in T5Gemma -- adapting a pretrained decoder-only model into an encoder-decoder model -- and extends it from the text-only regime to the multimodal setting based on the Gemma 3 models. We further propose two methods to improve efficiency: tied word embeddings, which share all embeddings across the encoder and decoder, and merged attention, which unifies decoder self- and cross-attention into a single joint module. Experiments demonstrate the generality of the adaptation strategy across architectures and modalities, as well as the unique strength of the encoder-decoder architecture on long-context modeling. Similar to T5Gemma, T5Gemma 2 yields comparable or better pretraining performance and significantly better post-training performance than its Gemma 3 counterpart. We release the pretrained models (270M-270M, 1B-1B and 4B-4B) to the community for future research.
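The two efficiency ideas in the abstract can be pictured with a short sketch. The Python/PyTorch code below is illustrative only, with hypothetical module and variable names rather than the released T5Gemma 2 implementation: one embedding table is shared between encoder and decoder (tied word embeddings), and a single attention module lets decoder queries attend jointly to encoder outputs and earlier decoder positions (merged attention) instead of using separate self- and cross-attention blocks.

```python
# Illustrative sketch (hypothetical names, not the released T5Gemma 2 code).
import torch
import torch.nn as nn
import torch.nn.functional as F

class MergedAttention(nn.Module):
    """Single decoder attention over [encoder outputs ; decoder states]."""

    def __init__(self, d_model: int, n_heads: int):
        super().__init__()
        self.n_heads, self.d_head = n_heads, d_model // n_heads
        self.q = nn.Linear(d_model, d_model, bias=False)
        self.kv = nn.Linear(d_model, 2 * d_model, bias=False)  # one K/V projection for both sources
        self.out = nn.Linear(d_model, d_model, bias=False)

    def _split(self, t: torch.Tensor) -> torch.Tensor:
        return t.view(t.size(0), t.size(1), self.n_heads, self.d_head).transpose(1, 2)

    def forward(self, dec_x: torch.Tensor, enc_out: torch.Tensor) -> torch.Tensor:
        s_dec, s_enc = dec_x.size(1), enc_out.size(1)
        ctx = torch.cat([enc_out, dec_x], dim=1)                  # keys/values come from both sources
        q = self._split(self.q(dec_x))
        k, v = (self._split(t) for t in self.kv(ctx).chunk(2, dim=-1))
        # Decoder queries see every encoder position plus earlier decoder positions (causal).
        allowed = torch.cat(
            [torch.ones(s_dec, s_enc, dtype=torch.bool, device=dec_x.device),
             torch.tril(torch.ones(s_dec, s_dec, dtype=torch.bool, device=dec_x.device))],
            dim=-1,
        )
        attn = F.scaled_dot_product_attention(q, k, v, attn_mask=allowed)
        return self.out(attn.transpose(1, 2).flatten(2))

# Tied word embeddings: encoder and decoder reuse the same table (one set of parameters).
vocab_size, d_model = 32_000, 256
shared_embed = nn.Embedding(vocab_size, d_model)
encoder_embed = decoder_embed = shared_embed

# Tiny usage check with random features.
block = MergedAttention(d_model=d_model, n_heads=4)
y = block(torch.randn(2, 5, d_model), torch.randn(2, 7, d_model))   # -> [2, 5, 256]
```

In a full model the shared table would typically also feed the output softmax; the sketch only shows the parameter sharing and the single joint attention module.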
Related papers
- Encoder-Decoder Gemma: Improving the Quality-Efficiency Trade-Off via Adaptation [52.19855651708349]
We study a novel problem: adapting decoder-only large language models to encoder-decoder models. We argue that adaptation not only enables inheriting the capability of decoder-only LLMs but also reduces the demand for computation. Under a similar inference budget, encoder-decoder LLMs achieve comparable (often better) pretraining performance but substantially better finetuning performance than their decoder-only counterparts.
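As a rough picture of the adaptation idea (not the authors' code), the snippet below seeds both halves of an encoder-decoder model from a single pretrained decoder-only checkpoint before further pretraining; the helper name and the state-dict copying are illustrative assumptions.

```python
# Hypothetical helper: initialize encoder and decoder stacks from one decoder-only checkpoint.
import copy
import torch.nn as nn

def seed_encoder_decoder(decoder_only: nn.Module) -> dict:
    base = decoder_only.state_dict()
    return {
        "encoder": copy.deepcopy(base),  # reused weights; the encoder simply drops the causal mask
        "decoder": copy.deepcopy(base),  # reused weights; cross-attention is added and learned on top
    }
```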
arXiv Detail & Related papers (2025-04-08T17:13:41Z)
- Adapting Decoder-Based Language Models for Diverse Encoder Downstream Tasks [24.674661807982865]
We introduce Gemma, adapting the powerful decoder model to an encoder architecture. To optimize the adaptation from decoder to encoder, we analyze various pooling strategies. We benchmark Gemma against established approaches on the GLUE benchmark and the MS MARCO ranking benchmark.
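As an illustration of what such pooling strategies look like in practice (a sketch under assumed tensor shapes, not the paper's implementation), the helper below turns per-token hidden states from an adapted encoder into a single vector via mean, last-token, or max pooling.

```python
# Illustrative pooling strategies over token hidden states; names are hypothetical.
import torch

def pool(hidden: torch.Tensor, mask: torch.Tensor, strategy: str = "mean") -> torch.Tensor:
    """hidden: [B, S, D] token states; mask: [B, S] with 1 for real tokens, 0 for padding."""
    mask = mask.unsqueeze(-1).to(hidden.dtype)                       # [B, S, 1]
    if strategy == "mean":
        return (hidden * mask).sum(1) / mask.sum(1).clamp(min=1.0)
    if strategy == "last":
        last_idx = mask.squeeze(-1).sum(1).long() - 1                # index of final real token
        return hidden[torch.arange(hidden.size(0)), last_idx]
    if strategy == "max":
        return hidden.masked_fill(mask == 0, float("-inf")).max(1).values
    raise ValueError(strategy)
```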
arXiv Detail & Related papers (2025-03-04T14:17:00Z)
- Expanding Performance Boundaries of Open-Source Multimodal Models with Model, Data, and Test-Time Scaling [191.7830199016589]
We introduce InternVL 2.5, an advanced multimodal large language model (MLLM) series that builds upon InternVL 2.0. InternVL 2.5 exhibits competitive performance, rivaling leading commercial models such as GPT-4o and Claude-3.5-Sonnet. We hope this model contributes to the open-source community by setting new standards for developing and applying multimodal AI systems.
arXiv Detail & Related papers (2024-12-06T18:57:08Z)
- When Video Coding Meets Multimodal Large Language Models: A Unified Paradigm for Video Coding [118.72266141321647]
Cross-Modality Video Coding (CMVC) is a pioneering approach to explore multimodality representation and video generative models in video coding. During decoding, previously encoded components and video generation models are leveraged to create multiple encoding-decoding modes. Experiments indicate that TT2V achieves effective semantic reconstruction, while IT2V exhibits competitive perceptual consistency.
arXiv Detail & Related papers (2024-08-15T11:36:18Z)
- Joint Beam Search Integrating CTC, Attention, and Transducer Decoders [53.297697898510194]
We propose a joint modeling scheme where four decoders share the same encoder. The 4D model is trained jointly, which provides regularization and improves robustness. In addition, we propose three novel joint beam search algorithms that combine three decoders.
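The generic idea behind such joint decoding can be sketched as a weighted combination of per-decoder log-probabilities inside one beam-search step; the code below is a simplified illustration with hypothetical scorer interfaces, not the paper's 4D algorithms.

```python
# Simplified illustration of joint scoring in beam search (hypothetical interfaces).
from typing import Callable, Dict, List, Sequence, Tuple

Scorer = Callable[[List[int], int], float]   # (prefix, next_token) -> log-probability

def joint_beam_step(beams: List[Tuple[List[int], float]],
                    scorers: Dict[str, Scorer],
                    weights: Dict[str, float],
                    vocab: Sequence[int],
                    beam_size: int) -> List[Tuple[List[int], float]]:
    """Extend each hypothesis by one token, ranked by a weighted sum of decoder scores."""
    candidates = []
    for prefix, score in beams:
        for tok in vocab:
            joint = sum(w * scorers[name](prefix, tok) for name, w in weights.items())
            candidates.append((prefix + [tok], score + joint))
    return sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_size]
```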
arXiv Detail & Related papers (2024-06-05T05:18:20Z)
- MUSTER: A Multi-scale Transformer-based Decoder for Semantic Segmentation [19.83103856355554]
MUSTER is a transformer-based decoder that seamlessly integrates with hierarchical encoders.
MSKA units enable the fusion of multi-scale features from the encoder and decoder, facilitating comprehensive information integration.
On the challenging ADE20K dataset, our best model achieves a single-scale mIoU of 50.23 and a multi-scale mIoU of 51.88.
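As a generic reference point (this is not MUSTER's MSKA unit, just the common skip-fusion pattern it refines), one way to fuse an encoder skip feature with a decoder feature at the same scale is cross-attention followed by a residual connection:

```python
# Generic skip-fusion sketch; module and argument names are illustrative.
import torch
import torch.nn as nn

class SkipFusion(nn.Module):
    def __init__(self, dim: int, heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, dec_feat: torch.Tensor, enc_skip: torch.Tensor) -> torch.Tensor:
        # dec_feat, enc_skip: [B, tokens_at_this_scale, dim]
        fused, _ = self.attn(query=dec_feat, key=enc_skip, value=enc_skip)
        return self.norm(dec_feat + fused)
```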
arXiv Detail & Related papers (2022-11-25T06:51:07Z)
- Z-Code++: A Pre-trained Language Model Optimized for Abstractive Summarization [108.09419317477986]
Z-Code++ is a new pre-trained language model optimized for abstractive text summarization.
The model is first pre-trained using text corpora for language understanding, and then is continually pre-trained on summarization corpora for grounded text generation.
Our model is parameter-efficient in that it outperforms the 600x larger PaLM-540B on XSum, and the finetuned 200x larger GPT3-175B on SAMSum.
arXiv Detail & Related papers (2022-08-21T01:00:54Z)
- EncT5: Fine-tuning T5 Encoder for Non-autoregressive Tasks [9.141586109808895]
We study fine-tuning pre-trained encoder-decoder models such as T5.
Our experimental results show that EncT5, with less than half of the parameters of T5, performs similarly to T5 models on the GLUE benchmark.
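A minimal setup in the same spirit, using Hugging Face transformers, keeps only the T5 encoder and trains a small classification head on pooled token states; the model name, mean pooling, and two-way head below are illustrative assumptions, and EncT5's actual head design differs.

```python
# Sketch: encoder-only fine-tuning of T5 for a non-autoregressive (classification) task.
import torch
import torch.nn as nn
from transformers import AutoTokenizer, T5EncoderModel

tok = AutoTokenizer.from_pretrained("t5-base")
encoder = T5EncoderModel.from_pretrained("t5-base")
head = nn.Linear(encoder.config.d_model, 2)           # e.g. a 2-way GLUE-style task

batch = tok(["a small example sentence"], return_tensors="pt", padding=True)
states = encoder(**batch).last_hidden_state           # [B, S, D]
mask = batch["attention_mask"].unsqueeze(-1)
pooled = (states * mask).sum(1) / mask.sum(1)         # mean over real tokens
logits = head(pooled)                                 # train the head (and optionally the encoder)
```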
arXiv Detail & Related papers (2021-10-16T00:50:08Z)
- UniVL: A Unified Video and Language Pre-Training Model for Multimodal Understanding and Generation [76.12027504427708]
This paper proposes UniVL: a Unified Video and Language pre-training model for both multimodal understanding and generation.
It comprises four components, including two single-modal encoders, a cross encoder, and a decoder with the Transformer backbone.
We develop two pre-training strategies, stage-by-stage pre-training (StagedP) and enhanced video representation (EnhancedV), to make the training of UniVL more effective.
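Schematically (placeholder modules only, not UniVL's implementation), the four components compose as two single-modal encoders feeding a cross-modal encoder whose output conditions the decoder:

```python
# Placeholder composition of the four components described above.
import torch.nn as nn

class UnifiedVideoLanguageModel(nn.Module):
    def __init__(self, text_enc, video_enc, cross_enc, decoder):
        super().__init__()
        self.text_enc, self.video_enc = text_enc, video_enc
        self.cross_enc, self.decoder = cross_enc, decoder

    def forward(self, text_tokens, video_frames, target_tokens):
        t = self.text_enc(text_tokens)                 # single-modal text features
        v = self.video_enc(video_frames)               # single-modal video features
        joint = self.cross_enc(t, v)                   # cross-modal fusion
        return self.decoder(target_tokens, joint)      # generation conditioned on the fusion
```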
arXiv Detail & Related papers (2020-02-15T10:03:25Z)
- Improved Multi-Stage Training of Online Attention-based Encoder-Decoder Models [20.81248613653279]
We propose a refined multi-stage multi-task training strategy to improve the performance of online attention-based encoder-decoder models.
A three-stage training scheme is proposed, based on three levels of architectural granularity: the character encoder, the byte pair encoding (BPE) based encoder, and the attention decoder.
Our models achieve a word error rate (WER) of 5.04% and 4.48% on the Librispeech test-clean data for the smaller and bigger models respectively.
arXiv Detail & Related papers (2019-12-28T02:29:33Z)
This list is automatically generated from the titles and abstracts of the papers on this site.