LinkTransformer: A Unified Package for Record Linkage with Transformer Language Models
- URL: http://arxiv.org/abs/2309.00789v2
- Date: Mon, 24 Jun 2024 21:01:58 GMT
- Authors: Abhishek Arora, Melissa Dell
- Abstract summary: LinkTransformer aims to extend the familiarity and ease-of-use of popular string matching methods to deep learning.
At its core is an off-the-shelf toolkit for applying transformer models to record linkage with four lines of code.
LinkTransformer contains a rich repository of pre-trained transformer semantic similarity models for multiple languages.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Linking information across sources is fundamental to a variety of analyses in social science, business, and government. While large language models (LLMs) offer enormous promise for improving record linkage in noisy datasets, in many domains approximate string matching packages in popular software such as R and Stata remain predominant. These packages have clean, simple interfaces and can be easily extended to a diversity of languages. Our open-source package LinkTransformer aims to extend the familiarity and ease-of-use of popular string matching methods to deep learning. It is a general purpose package for record linkage with transformer LLMs that treats record linkage as a text retrieval problem. At its core is an off-the-shelf toolkit for applying transformer models to record linkage with four lines of code. LinkTransformer contains a rich repository of pre-trained transformer semantic similarity models for multiple languages and supports easy integration of any transformer language model from Hugging Face or OpenAI. It supports standard functionality such as blocking and linking on multiple noisy fields. LinkTransformer APIs also perform other common text data processing tasks, e.g., aggregation, noisy de-duplication, and translation-free cross-lingual linkage. Importantly, LinkTransformer also contains comprehensive tools for efficient model tuning, to facilitate different levels of customization when off-the-shelf models do not provide the required accuracy. Finally, to promote reusability, reproducibility, and extensibility, LinkTransformer makes it easy for users to contribute their custom-trained models to its model hub. By combining transformer language models with intuitive APIs that will be familiar to many users of popular string matching packages, LinkTransformer aims to democratize the benefits of LLMs among those who may be less familiar with deep learning frameworks.
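The abstract's central framing, record linkage as a text retrieval problem, means: embed each record as a vector, then match a query record to the nearest record in the target dataset. The following is a minimal, self-contained sketch of that idea using character-trigram vectors and cosine similarity in place of transformer embeddings; it illustrates the retrieval framing only and is not LinkTransformer's actual API (all function names here are hypothetical).

```python
from collections import Counter
from math import sqrt

def trigram_vector(text):
    """Bag of character trigrams as a sparse vector (a crude stand-in
    for a transformer semantic-similarity embedding)."""
    padded = f"  {text.lower()}  "
    return Counter(padded[i:i + 3] for i in range(len(padded) - 2))

def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(u[k] * v[k] for k in u if k in v)
    norm = sqrt(sum(c * c for c in u.values())) * sqrt(sum(c * c for c in v.values()))
    return dot / norm if norm else 0.0

def link(queries, corpus):
    """Link each query record to its most similar corpus record
    (a 1:1 retrieval-style merge on a single noisy field)."""
    corpus_vecs = [(record, trigram_vector(record)) for record in corpus]
    return {
        q: max(corpus_vecs, key=lambda rv: cosine(trigram_vector(q), rv[1]))[0]
        for q in queries
    }

firms_a = ["Internationl Business Machines", "Goggle LLC"]
firms_b = ["International Business Machines Corp.", "Google LLC", "Microsoft Corp."]
print(link(firms_a, firms_b))
```

In LinkTransformer the trigram embedding is replaced by a pre-trained transformer model, which is what lets the same retrieval step handle semantic variation (abbreviations, transliterations, cross-lingual matches) that character overlap alone cannot capture.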
Related papers
- Demystifying the Communication Characteristics for Distributed Transformer Models [2.849208476795592]
This paper examines the communication behavior of transformer models.
We use GPT-based language models as a case study of the transformer architecture due to their ubiquity.
At a high level, our analysis reveals a need to optimize small message point-to-point communication further.
arXiv Detail & Related papers (2024-08-19T17:54:29Z) - Dependency Transformer Grammars: Integrating Dependency Structures into Transformer Language Models [42.46104516313823]
Dependency Transformer Grammars (DTGs) are a new class of Transformer language model with explicit dependency-based inductive bias.
DTGs simulate dependency transition systems with constrained attention patterns.
They achieve better generalization while maintaining comparable perplexity with Transformer language model baselines.
arXiv Detail & Related papers (2024-07-24T16:38:38Z) - Isomer: Isomerous Transformer for Zero-shot Video Object Segmentation [59.91357714415056]
We propose two Transformer variants: Context-Sharing Transformer (CST) and Semantic Gathering-Scattering Transformer (SGST).
CST learns the global-shared contextual information within image frames with a lightweight computation; SGST models the semantic correlation separately for the foreground and background.
Compared with the baseline that uses vanilla Transformers for multi-stage fusion, our method increases speed by 13 times and achieves new state-of-the-art ZVOS performance.
arXiv Detail & Related papers (2023-08-13T06:12:00Z) - Meta-Transformer: A Unified Framework for Multimodal Learning [105.77219833997962]
Multimodal learning aims to build models that process and relate information from multiple modalities.
Despite years of development in this field, it remains challenging to design a unified network for processing various modalities.
We propose a framework, named Meta-Transformer, that leverages a frozen encoder to perform multimodal perception.
arXiv Detail & Related papers (2023-07-20T12:10:29Z) - Trankit: A Light-Weight Transformer-based Toolkit for Multilingual Natural Language Processing [22.38792093462942]
Trankit is a light-weight Transformer-based Toolkit for multilingual Natural Language Processing (NLP)
It provides a trainable pipeline for fundamental NLP tasks over 100 languages, and 90 pretrained pipelines for 56 languages.
Trankit significantly outperforms prior multilingual NLP pipelines over sentence segmentation, part-of-speech tagging, morphological feature tagging, and dependency parsing.
arXiv Detail & Related papers (2021-01-09T04:55:52Z) - XLM-T: Scaling up Multilingual Machine Translation with Pretrained Cross-lingual Transformer Encoders [89.0059978016914]
We present XLM-T, which initializes the model with an off-the-shelf pretrained cross-lingual Transformer and fine-tunes it with multilingual parallel data.
This simple method achieves significant improvements on a WMT dataset with 10 language pairs and the OPUS-100 corpus with 94 pairs.
arXiv Detail & Related papers (2020-12-31T11:16:51Z) - Parameter Efficient Multimodal Transformers for Video Representation Learning [108.8517364784009]
This work focuses on reducing the parameters of multimodal Transformers in the context of audio-visual video representation learning.
We show that our approach reduces parameters up to 80%, allowing us to train our model end-to-end from scratch.
To demonstrate our approach, we pretrain our model on 30-second clips from Kinetics-700 and transfer it to audio-visual classification tasks.
arXiv Detail & Related papers (2020-12-08T00:16:13Z) - Multi-channel Transformers for Multi-articulatory Sign Language Translation [59.38247587308604]
We tackle the multi-articulatory sign language translation task and propose a novel multi-channel transformer architecture.
The proposed architecture allows both the inter and intra contextual relationships between different sign articulators to be modelled within the transformer network itself.
arXiv Detail & Related papers (2020-09-01T09:10:55Z) - Segatron: Segment-Aware Transformer for Language Modeling and Understanding [79.84562707201323]
We propose a segment-aware Transformer (Segatron) to generate better contextual representations from sequential tokens.
We first introduce the segment-aware mechanism to Transformer-XL, which is a popular Transformer-based language model.
We find that our method can further improve the Transformer-XL base model and large model, achieving 17.1 perplexity on the WikiText-103 dataset.
arXiv Detail & Related papers (2020-04-30T17:38:27Z)