Solo Connection: A Parameter Efficient Fine-Tuning Technique for Transformers
- URL: http://arxiv.org/abs/2507.14353v1
- Date: Fri, 18 Jul 2025 20:11:50 GMT
- Title: Solo Connection: A Parameter Efficient Fine-Tuning Technique for Transformers
- Authors: Harsh Nilesh Pathak, Randy Paffenroth
- Abstract summary: Solo Connection is a novel method that adapts the representation at the decoder-block level rather than modifying individual weight matrices. Not only does Solo Connection outperform LoRA on E2E natural language generation benchmarks, but it also reduces the number of trainable parameters by 59%. This paper focuses on long skip connections that link outputs of different decoder blocks, potentially enhancing the model's ability to adapt to new tasks while leveraging pre-trained knowledge.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Parameter-efficient fine-tuning (PEFT) is a versatile and extensible approach for adapting a Large Language Model (LLM) to new tasks. One of the most prominent PEFT approaches, Low-Rank Adaptation (LoRA), primarily focuses on adjusting the attention weight matrices within individual decoder blocks of a Generative Pre-trained Transformer (GPT2). In contrast, we introduce Solo Connection, a novel method that adapts the representation at the decoder-block level rather than modifying individual weight matrices. Not only does Solo Connection outperform LoRA on E2E natural language generation benchmarks, but it also reduces the number of trainable parameters by 59% relative to LoRA and by more than 99% compared to full fine-tuning of GPT2, an early LLM. Solo Connection is also motivated by homotopy theory: we introduce a trainable linear transformation that gradually interpolates between a zero vector and the task-specific representation, enabling smooth and stable adaptation over time. While skip connections in the original 12-layer GPT2 are typically confined to individual decoder blocks, subsequent GPT2 variants scale up to 48 layers, and even larger language models can include 128 or more decoder blocks. These expanded architectures underscore the need to revisit how skip connections are employed during fine-tuning. This paper focuses on long skip connections that link the outputs of different decoder blocks, potentially enhancing the model's ability to adapt to new tasks while leveraging pre-trained knowledge.
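The abstract's core mechanism, a trainable linear transformation that interpolates from a zero vector toward a task-specific representation and is carried over a long skip connection between decoder blocks, can be pictured with a short sketch. The PyTorch module below is a minimal illustration of that idea only; the class name, the low-rank bottleneck, and the zero-initialized scalar gate are hypothetical choices for exposition, not the paper's actual implementation.

```python
# Minimal sketch of a Solo Connection-style block-level adapter (PyTorch).
# The module names and the low-rank bottleneck are assumptions for
# illustration; only the long skip connection and the zero-initialized,
# trainable interpolation come from the abstract.
import torch
import torch.nn as nn


class SoloConnectionSketch(nn.Module):
    def __init__(self, hidden_dim: int, rank: int = 8):
        super().__init__()
        # Trainable linear map applied to an earlier decoder block's output.
        self.down = nn.Linear(hidden_dim, rank, bias=False)
        self.up = nn.Linear(rank, hidden_dim, bias=False)
        nn.init.zeros_(self.up.weight)
        # Scalar gate starts at zero, so training begins from the frozen
        # pre-trained behaviour and gradually interpolates toward the
        # task-specific representation (the homotopy-style schedule).
        self.gate = nn.Parameter(torch.zeros(1))

    def forward(self, early_out: torch.Tensor, late_out: torch.Tensor) -> torch.Tensor:
        # Long skip connection: the adapted signal from an earlier block is
        # added to a later block's output; at initialization the result
        # equals the frozen model's output exactly.
        return late_out + self.gate * self.up(self.down(early_out))
```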
Related papers
- Transducer Tuning: Efficient Model Adaptation for Software Tasks Using Code Property Graphs [8.26418657158164]
The approach adapts large models for downstream code tasks using Code Property Graphs (CPGs). It introduces a modular component called a transducer that enriches code embeddings with structural and dependency information from CPGs. Results demonstrate competitive performance compared to full-parameter fine-tuning while reducing trainable parameters by up to 99% to save memory.
arXiv Detail & Related papers (2024-12-18T03:25:17Z) - Tracking Meets LoRA: Faster Training, Larger Model, Stronger Performance [87.19164603145056]
We propose LoRAT, a method that unveils the power of large ViT model for tracking within laboratory-level resources.
The essence of our work lies in adapting LoRA, a technique that fine-tunes a small subset of model parameters without adding inference latency.
We design an anchor-free head to adapt PETR, enabling better performance with less computational overhead.
arXiv Detail & Related papers (2024-03-08T11:41:48Z) - WeGeFT: Weight-Generative Fine-Tuning for Multi-Faceted Efficient Adaptation of Large Models [8.481707805559589]
We propose Weight-Generative Fine-Tuning (WeGeFT), a novel approach that learns to generate fine-tuning weights directly from the pretrained weights. This design achieves multi-faceted efficiency in parameters, representations, compute, and memory, while maintaining or exceeding the performance of LoRA and its variants.
arXiv Detail & Related papers (2023-12-01T16:33:57Z) - UniPT: Universal Parallel Tuning for Transfer Learning with Efficient Parameter and Memory [69.33445217944029]
PETL is an effective strategy for adapting pre-trained models to downstream domains.
Recent PETL work focuses on memory efficiency in particular.
We propose a new memory-efficient PETL strategy, Universal Parallel Tuning (UniPT).
arXiv Detail & Related papers (2023-08-28T05:38:43Z) - Deformable Mixer Transformer with Gating for Multi-Task Learning of Dense Prediction [126.34551436845133]
CNNs and Transformers have their own advantages, and both have been widely used for dense prediction in multi-task learning (MTL).
We present a novel MTL model by combining both merits of deformable CNN and query-based Transformer with shared gating for multi-task learning of dense prediction.
arXiv Detail & Related papers (2023-08-10T17:37:49Z) - AdaMix: Mixture-of-Adapter for Parameter-efficient Tuning of Large Language Models [119.7093605087114]
Fine-tuning large-scale pre-trained language models to downstream tasks requires updating hundreds of millions of parameters.
This not only increases the serving cost to store a large copy of the model weights for every task, but also exhibits instability during few-shot task adaptation.
We introduce a new mechanism to improve adapter capacity without increasing parameters or computational cost, via two key techniques.
arXiv Detail & Related papers (2022-05-24T23:41:22Z) - LoRA: Low-Rank Adaptation of Large Language Models [71.75808607987281]
Low-Rank Adaptation, or LoRA, freezes the pre-trained model weights and injects trainable rank-decomposition matrices into each layer of the Transformer architecture.
For GPT-3, LoRA can reduce the number of trainable parameters by 10,000 times and the computation hardware requirement by 3 times compared to full fine-tuning.
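As a concrete reference point for the mechanism this entry describes, the sketch below shows a LoRA-style linear layer in PyTorch: the pre-trained weight is frozen and a trainable low-rank update is added in parallel. The rank, scaling value, and class name here are illustrative, not taken from the paper.

```python
# LoRA-style linear layer sketch: freeze the pre-trained weight W and learn
# a low-rank update B @ A with rank r much smaller than the layer dimensions.
import torch
import torch.nn as nn


class LoRALinear(nn.Module):
    def __init__(self, in_features: int, out_features: int, rank: int = 4, alpha: float = 16.0):
        super().__init__()
        # Frozen pre-trained projection (weights and bias receive no gradients).
        self.base = nn.Linear(in_features, out_features)
        self.base.weight.requires_grad_(False)
        self.base.bias.requires_grad_(False)
        # Trainable rank-decomposition matrices A and B; B starts at zero so
        # the layer initially reproduces the frozen model exactly.
        self.lora_A = nn.Parameter(torch.randn(rank, in_features) * 0.01)
        self.lora_B = nn.Parameter(torch.zeros(out_features, rank))
        self.scaling = alpha / rank

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Frozen path plus scaled low-rank update; only A and B are trained.
        return self.base(x) + self.scaling * (x @ self.lora_A.T @ self.lora_B.T)
```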
arXiv Detail & Related papers (2021-06-17T17:37:18Z) - Glancing Transformer for Non-Autoregressive Neural Machine Translation [58.87258329683682]
We propose a method to learn word interdependency for single-pass parallel generation models.
With only single-pass parallel decoding, GLAT generates high-quality translations with an 8-15x speedup.
arXiv Detail & Related papers (2020-08-18T13:04:03Z)