Extracting Text Representations for Terms and Phrases in Technical
Domains
- URL: http://arxiv.org/abs/2305.15867v1
- Date: Thu, 25 May 2023 08:59:36 GMT
- Title: Extracting Text Representations for Terms and Phrases in Technical
Domains
- Authors: Francesco Fusco, Diego Antognini
- Abstract summary: We propose a fully unsupervised approach to text encoding that consists of training small character-based models with the objective of reconstructing large pre-trained embedding matrices.
Models trained with this approach not only match the quality of sentence encoders in technical domains, but are also 5 times smaller and up to 10 times faster.
- Score: 9.27244202193623
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Extracting dense representations for terms and phrases is a task of great
importance for knowledge discovery platforms targeting highly-technical fields.
Dense representations are used as features for downstream components and have
multiple applications ranging from ranking results in search to summarization.
Common approaches to create dense representations include training
domain-specific embeddings with self-supervised setups or using sentence
encoder models trained over similarity tasks. In contrast to static embeddings,
sentence encoders do not suffer from the out-of-vocabulary (OOV) problem, but
impose significant computational costs. In this paper, we propose a fully
unsupervised approach to text encoding that consists of training small
character-based models with the objective of reconstructing large pre-trained
embedding matrices. Models trained with this approach not only match the
quality of sentence encoders in technical domains, but are also 5 times
smaller and up to 10 times faster, even on high-end GPUs.
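As a concrete illustration of the proposed setup, below is a minimal sketch: a small character-level encoder trained to reconstruct rows of a large pre-trained embedding matrix. The bidirectional GRU, the dimensions, and the MSE objective are illustrative assumptions, not the paper's exact specification.

```python
# Minimal sketch (illustrative, not the authors' code): a small
# character-level model that learns to reproduce the pre-trained
# embedding vector of each term in the vocabulary.
import torch
import torch.nn as nn

class CharToEmbedding(nn.Module):
    def __init__(self, n_chars=256, char_dim=64, hidden=256, out_dim=300):
        super().__init__()
        self.char_emb = nn.Embedding(n_chars, char_dim, padding_idx=0)
        self.encoder = nn.GRU(char_dim, hidden, batch_first=True,
                              bidirectional=True)
        self.proj = nn.Linear(2 * hidden, out_dim)

    def forward(self, char_ids):                # (batch, max_chars)
        x = self.char_emb(char_ids)             # (batch, max_chars, char_dim)
        _, h = self.encoder(x)                  # h: (2, batch, hidden)
        h = torch.cat([h[0], h[1]], dim=-1)     # (batch, 2 * hidden)
        return self.proj(h)                     # (batch, out_dim)

def train_step(model, optimizer, char_ids, target_vectors):
    """One step of the reconstruction objective: match the pre-trained
    embedding (target_vectors) of each term from its characters alone."""
    optimizer.zero_grad()
    loss = nn.functional.mse_loss(model(char_ids), target_vectors)
    loss.backward()
    optimizer.step()
    return loss.item()
```

Because the model reads raw characters, any unseen term or phrase can be encoded at inference time, which is how this kind of setup sidesteps the OOV problem of static embeddings while remaining much smaller and faster than a full sentence encoder.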
Related papers
- Spatio-Temporal Side Tuning Pre-trained Foundation Models for Video-based Pedestrian Attribute Recognition [58.79807861739438]
Existing pedestrian attribute recognition (PAR) algorithms are mainly developed for static images.
We propose to recognize human attributes from video frames, making full use of temporal information.
arXiv Detail & Related papers (2024-04-27T14:43:32Z)
- Harnessing the Power of Beta Scoring in Deep Active Learning for Multi-Label Text Classification [6.662167018900634]
Our study introduces a novel deep active learning strategy, capitalizing on the Beta family of proper scoring rules within the Expected Loss Reduction framework.
It computes the expected increase in scores using the Beta Scoring Rules, which are then transformed into sample vector representations.
Comprehensive evaluations across both synthetic and real datasets reveal our method's capability to often outperform established acquisition techniques in multi-label text classification.
arXiv Detail & Related papers (2024-01-15T00:06:24Z)
- Diffusion Models for Open-Vocabulary Segmentation [79.02153797465324]
OVDiff is a novel method that leverages generative text-to-image diffusion models for unsupervised open-vocabulary segmentation.
It relies solely on pre-trained components and outputs the synthesised segmenter directly, without training.
arXiv Detail & Related papers (2023-06-15T17:51:28Z)
- Tram: A Token-level Retrieval-augmented Mechanism for Source Code Summarization [76.57699934689468]
We propose a fine-grained Token-level retrieval-augmented mechanism (Tram) on the decoder side to enhance the performance of neural models.
To overcome the challenge of token-level retrieval in capturing contextual code semantics, we also propose integrating code semantics into individual summary tokens. (An illustrative sketch of token-level retrieval appears at the end of this page.)
arXiv Detail & Related papers (2023-05-18T16:02:04Z)
- MASTER: Multi-task Pre-trained Bottlenecked Masked Autoencoders are Better Dense Retrievers [140.0479479231558]
In this work, we aim to unify a variety of pre-training tasks into a multi-task pre-trained model, namely MASTER.
MASTER utilizes a shared-encoder, multi-decoder architecture that constructs a representation bottleneck to compress the semantic information shared across tasks into dense vectors. (An illustrative sketch of this architecture appears at the end of this page.)
arXiv Detail & Related papers (2022-12-15T13:57:07Z)
- Learning to Look Inside: Augmenting Token-Based Encoders with Character-Level Information [29.633735942273997]
XRayEmb is a method for retrofitting existing token-based models with character-level information.
We show that incorporating XRayEmb's learned vectors into sequences of pre-trained token embeddings helps performance on both autoregressive and masked pre-trained transformer architectures. (An illustrative sketch of character-level retrofitting appears at the end of this page.)
arXiv Detail & Related papers (2021-08-01T08:09:26Z)
- Robust Document Representations using Latent Topics and Metadata [17.306088038339336]
We propose a novel approach to fine-tuning a pre-trained neural language model for document classification problems.
We generate document representations that capture both text and metadata artifacts in a task-specific manner.
Our solution also incorporates metadata explicitly rather than simply appending them to the text.
arXiv Detail & Related papers (2020-10-23T21:52:38Z)
- Cross-Thought for Sentence Encoder Pre-training [89.32270059777025]
Cross-Thought is a novel approach to pre-training a sequence encoder.
We train a Transformer-based sequence encoder over a large set of short sequences.
Experiments on question answering and textual entailment tasks demonstrate that our pre-trained encoder can outperform state-of-the-art encoders.
arXiv Detail & Related papers (2020-10-07T21:02:41Z)
- DeCLUTR: Deep Contrastive Learning for Unsupervised Textual Representations [4.36561468436181]
We present DeCLUTR: Deep Contrastive Learning for Unsupervised Textual Representations.
Our approach closes the performance gap between unsupervised and supervised pretraining for universal sentence encoders.
Our code and pretrained models are publicly available and can be easily adapted to new domains or used to embed unseen text. (An illustrative sketch of the contrastive objective appears at the end of this page.)
arXiv Detail & Related papers (2020-06-05T20:00:28Z)
- Text Classification with Few Examples using Controlled Generalization [58.971750512415134]
Current practice relies on pre-trained word embeddings to map words unseen in training to similar seen ones.
Our alternative begins with sparse pre-trained representations derived from unlabeled parsed corpora.
We show that a feed-forward network over these vectors is especially effective in low-data scenarios.
arXiv Detail & Related papers (2020-05-18T06:04:58Z)
- General Purpose Text Embeddings from Pre-trained Language Models for Scalable Inference [34.47592026375839]
We show that some of the computational cost during inference can be amortized over the different tasks using a shared text encoder.
We also compare approaches for training such an encoder and show that encoders pre-trained over multiple tasks generalize well to unseen tasks. (An illustrative sketch of amortized inference appears at the end of this page.)
arXiv Detail & Related papers (2020-04-29T16:11:26Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
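Illustrative code sketches for selected related papers (generic reconstructions of the named techniques, not code from the papers):

For Tram, a minimal sketch of token-level retrieval augmentation in the kNN-LM style: the decoder's next-token distribution is interpolated with a distribution built from the nearest entries of a (hidden state, token) datastore. The datastore layout, the distance function, and the mixing weight are assumptions; the paper's exact mechanism differs in its details.

```python
# Generic token-level retrieval augmentation (kNN-LM-style sketch).
import torch
import torch.nn.functional as F

def retrieval_augmented_probs(hidden, keys, values, vocab_size,
                              model_logits, k=8, temperature=1.0, lam=0.3):
    """hidden: (B, d) decoder states; keys: (N, d) datastore states;
    values: (N,) token ids; model_logits: (B, V). Returns blended probs."""
    dists = torch.cdist(hidden, keys)                    # (B, N) L2 distances
    knn_dists, knn_idx = dists.topk(k, largest=False)    # (B, k) nearest
    weights = F.softmax(-knn_dists / temperature, -1)    # closer -> heavier
    p_knn = torch.zeros(hidden.size(0), vocab_size)
    p_knn.scatter_add_(1, values[knn_idx], weights)      # retrieval distribution
    return lam * p_knn + (1.0 - lam) * F.softmax(model_logits, -1)
```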
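For MASTER, a minimal sketch of a shared-encoder, multi-decoder bottleneck: every task head reads only a single pooled vector, so the encoder is forced to compress the semantics needed by all tasks into it. The layer counts, the pooling choice, and the single-vector heads are simplifications of the paper's shallow per-task decoders.

```python
# Generic shared-encoder, multi-decoder bottleneck (MASTER-like sketch).
import torch
import torch.nn as nn

class BottleneckedMultiDecoder(nn.Module):
    def __init__(self, vocab=30522, dim=256, n_tasks=3):
        super().__init__()
        self.embed = nn.Embedding(vocab, dim)
        layer = nn.TransformerEncoderLayer(dim, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=6)
        # one cheap decoder head per pre-training task; each sees ONLY the
        # bottleneck vector, never the full token sequence
        self.decoders = nn.ModuleList(
            [nn.Sequential(nn.Linear(dim, dim), nn.GELU(), nn.Linear(dim, vocab))
             for _ in range(n_tasks)])

    def forward(self, input_ids):                # (B, L)
        h = self.encoder(self.embed(input_ids))  # (B, L, dim)
        bottleneck = h[:, 0]                     # (B, dim) [CLS]-style pooling
        return bottleneck, [dec(bottleneck) for dec in self.decoders]
```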
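For XRayEmb, a minimal sketch of retrofitting token embeddings with character-level information: a small character encoder produces one vector per token, which is fused with the pre-trained token embedding. The gated-sum fusion is an assumption chosen for illustration.

```python
# Generic character-level retrofitting of token embeddings (XRayEmb-like).
import torch
import torch.nn as nn

class CharRetrofit(nn.Module):
    def __init__(self, n_chars=256, char_dim=32, token_dim=768):
        super().__init__()
        self.char_emb = nn.Embedding(n_chars, char_dim, padding_idx=0)
        self.char_enc = nn.GRU(char_dim, token_dim // 2, batch_first=True,
                               bidirectional=True)
        self.gate = nn.Linear(2 * token_dim, token_dim)

    def forward(self, token_vecs, char_ids):
        # token_vecs: (B, L, token_dim); char_ids: (B, L, max_chars)
        B, L, C = char_ids.shape
        x = self.char_emb(char_ids.view(B * L, C))      # (B*L, C, char_dim)
        _, h = self.char_enc(x)                         # (2, B*L, token_dim//2)
        char_vecs = torch.cat([h[0], h[1]], -1).view(B, L, -1)
        g = torch.sigmoid(self.gate(torch.cat([token_vecs, char_vecs], -1)))
        return g * token_vecs + (1 - g) * char_vecs     # gated fusion
```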
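For DeCLUTR, a minimal sketch of the contrastive objective: spans sampled from the same document are treated as positive pairs and scored with an InfoNCE loss; the span-sampling scheme and the handling of negatives are simplified here.

```python
# Generic InfoNCE loss over paired span embeddings (DeCLUTR-like objective).
import torch
import torch.nn.functional as F

def info_nce(anchors, positives, temperature=0.05):
    """anchors, positives: (B, dim); row i of each is a positive pair,
    every other row in the batch serves as a negative."""
    a = F.normalize(anchors, dim=-1)
    p = F.normalize(positives, dim=-1)
    logits = a @ p.t() / temperature           # (B, B) cosine similarities
    targets = torch.arange(a.size(0))          # true pairs sit on the diagonal
    return F.cross_entropy(logits, targets)
```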
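For the shared-encoder paper, a minimal sketch of amortized inference: the expensive text encoder runs once per input, and each task adds only a cheap head on top of the cached embedding. The head shapes and task count are placeholders.

```python
# Generic amortized inference with a frozen shared encoder (sketch).
import torch
import torch.nn as nn

class TaskHeads(nn.Module):
    def __init__(self, dim=768, n_classes=(2, 5, 3)):   # placeholder tasks
        super().__init__()
        self.heads = nn.ModuleList([nn.Linear(dim, c) for c in n_classes])

    def forward(self, text_embedding):          # (B, dim), computed ONCE
        # each task costs one matmul instead of a full encoder pass
        return [head(text_embedding) for head in self.heads]
```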