Statistical Depth for Ranking and Characterizing Transformer-Based Text
Embeddings
- URL: http://arxiv.org/abs/2310.15010v1
- Date: Mon, 23 Oct 2023 15:02:44 GMT
- Title: Statistical Depth for Ranking and Characterizing Transformer-Based Text
Embeddings
- Authors: Parker Seegmiller and Sarah Masud Preum
- Abstract summary: A statistical depth is a function for ranking k-dimensional objects by measuring centrality with respect to some observed k-dimensional distribution.
We adopt a statistical depth, transformer-based text embedding (TTE) depth, to measure distributions of transformer-based text embeddings, and introduce its practical use for both modeling and distributional inference in NLP pipelines.
- Score: 1.321681963474017
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The popularity of transformer-based text embeddings calls for better
statistical tools for measuring distributions of such embeddings. One such tool
would be a method for ranking texts within a corpus by centrality, i.e.
assigning each text a number signifying how representative that text is of the
corpus as a whole. However, an intrinsic center-outward ordering of
high-dimensional text representations is not trivial. A statistical depth is a
function for ranking k-dimensional objects by measuring centrality with respect
to some observed k-dimensional distribution. We adopt a statistical depth to
measure distributions of transformer-based text embeddings, which we call
transformer-based text embedding (TTE) depth, and introduce the practical use
of this depth for both modeling and distributional inference in NLP pipelines.
We first define
TTE depth and an associated rank sum test for determining whether two corpora
differ significantly in embedding space. We then use TTE depth for the task of
in-context learning prompt selection, showing that this approach reliably
improves performance over statistical baseline approaches across six text
classification tasks. Finally, we use TTE depth and the associated rank sum
test to characterize the distributions of synthesized and human-generated
corpora, showing that five recent synthetic data augmentation processes cause a
measurable distributional shift away from associated human-generated text.
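
The abstract does not spell out the form of TTE depth, so the sketch below is an illustration rather than the paper's definition: it uses mean cosine similarity to the corpus as an assumed stand-in for a depth function, and a Wilcoxon rank-sum test on depth scores as one plausible reading of the abstract's rank sum test. All function names here are hypothetical.

```python
# Hedged sketch of depth-based corpus comparison. The cosine-similarity
# depth below is an assumed stand-in; the paper's exact TTE depth
# definition may differ.
import numpy as np
from scipy.stats import ranksums


def cosine_depth(x, corpus):
    """Centrality of embedding x w.r.t. a corpus of unit-norm embeddings."""
    return float(np.mean(corpus @ x))  # mean cosine similarity


def depth_rank_sum_test(corpus_f, corpus_g):
    """Test whether two corpora differ in embedding space.

    Depths of both samples are computed w.r.t. corpus F, then compared
    with a Wilcoxon rank-sum test on the two sets of depth scores.
    """
    depths_f = [cosine_depth(x, corpus_f) for x in corpus_f]
    depths_g = [cosine_depth(y, corpus_f) for y in corpus_g]
    return ranksums(depths_f, depths_g)


# Usage: rows are unit-normalized text embeddings from any transformer
# encoder; random vectors stand in for real embeddings here.
rng = np.random.default_rng(0)
F = rng.normal(size=(100, 16))
G = rng.normal(loc=0.3, size=(100, 16))
F /= np.linalg.norm(F, axis=1, keepdims=True)
G /= np.linalg.norm(G, axis=1, keepdims=True)
res = depth_rank_sum_test(F, G)
print(f"statistic={res.statistic:.2f}, p-value={res.pvalue:.3g}")
```

Under the same reading, the in-context learning use would amount to ranking candidate demonstration texts by depth and selecting the most central ones; that is an inference from the abstract, not the paper's stated procedure.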
Related papers
- TexIm FAST: Text-to-Image Representation for Semantic Similarity Evaluation using Transformers [2.7651063843287718]
TexIm FAST is a novel methodology for generating fixed-length representations through a self-supervised Variational Auto-Encoder (VAE) for semantic evaluation using transformers.
The pictorial representations allow oblivious inference while retaining the linguistic intricacies, and are potent in cross-modal applications.
The efficacy of TexIm FAST has been extensively analyzed for the task of Semantic Textual Similarity (STS) on the MSRPC, CNN/Daily Mail, and XSum datasets.
arXiv Detail & Related papers (2024-06-06T18:28:50Z) - Language Model Decoding as Direct Metrics Optimization [87.68281625776282]
Current decoding methods struggle to generate texts that align with human texts across different aspects.
In this work, we frame decoding from a language model as an optimization problem with the goal of strictly matching the expected performance with human texts.
We prove that this induced distribution is guaranteed to improve the perplexity on human texts, which suggests a better approximation to the underlying distribution of human texts.
arXiv Detail & Related papers (2023-10-02T09:35:27Z) - LRANet: Towards Accurate and Efficient Scene Text Detection with
Low-Rank Approximation Network [63.554061288184165]
We propose a novel parameterized text shape method based on low-rank approximation.
By exploring the shape correlation among different text contours, our method achieves consistency, compactness, simplicity, and robustness in shape representation.
We implement an accurate and efficient arbitrary-shaped text detector named LRANet.
arXiv Detail & Related papers (2023-06-27T02:03:46Z) - TextFormer: A Query-based End-to-End Text Spotter with Mixed Supervision [61.186488081379]
We propose TextFormer, a query-based end-to-end text spotter with Transformer architecture.
TextFormer builds upon an image encoder and a text decoder to learn a joint semantic understanding for multi-task modeling.
It allows for mutual training and optimization of classification, segmentation, and recognition branches, resulting in deeper feature sharing.
arXiv Detail & Related papers (2023-06-06T03:37:41Z) - TextDCT: Arbitrary-Shaped Text Detection via Discrete Cosine Transform
Mask [19.269070203448187]
Arbitrary-shaped scene text detection is a challenging task due to the variety of text changes in font, size, color, and orientation.
We propose a novel light-weight anchor-free text detection framework called TextDCT, which adopts the discrete cosine transform (DCT) to encode the text masks as compact vectors.
TextDCT achieves F-measure of 85.1 at 17.2 frames per second (FPS) and F-measure of 84.9 at 15.1 FPS for CTW1500 and Total-Text datasets, respectively.
arXiv Detail & Related papers (2022-06-27T15:42:25Z) - Real-Time Scene Text Detection with Differentiable Binarization and
Adaptive Scale Fusion [62.269219152425556]
Segmentation-based methods have drawn extensive attention in the scene text detection field.
We propose a Differentiable Binarization (DB) module that integrates the binarization process into a segmentation network.
An efficient Adaptive Scale Fusion (ASF) module is proposed to improve the scale robustness by fusing features of different scales adaptively.
arXiv Detail & Related papers (2022-02-21T15:30:14Z) - Contextualized Semantic Distance between Highly Overlapped Texts [85.1541170468617]
Overlapping frequently occurs in paired texts in natural language processing tasks like text editing and semantic similarity evaluation.
This paper aims to address the issue with a mask-and-predict strategy.
We take the words in the longest common subsequence as neighboring words and use masked language modeling (MLM) to predict the distributions at their positions; a minimal sketch of this mask-and-predict idea appears after this list.
Experiments on Semantic Textual Similarity show the resulting metric, NDD, to be more sensitive to various semantic differences, especially on highly overlapped paired texts.
arXiv Detail & Related papers (2021-10-04T03:59:15Z) - Improving Text Generation Evaluation with Batch Centering and Tempered
Word Mover Distance [24.49032191669509]
We present two techniques for improving encoding representations for similarity metrics.
We show results over various BERT-backbone learned metrics, achieving state-of-the-art correlation with human ratings on several benchmarks.
arXiv Detail & Related papers (2020-10-13T03:46:25Z) - An Intelligent CNN-VAE Text Representation Technology Based on Text
Semantics for Comprehensive Big Data [15.680918844684454]
A text feature representation model based on convolutional neural network (CNN) and variational autoencoder (VAE) is proposed.
The proposed model outperforms baselines under k-nearest neighbor (KNN), random forest (RF), and support vector machine (SVM) classification algorithms.
arXiv Detail & Related papers (2020-08-28T07:39:45Z) - Text Perceptron: Towards End-to-End Arbitrary-Shaped Text Spotting [49.768327669098674]
We propose an end-to-end trainable text spotting approach named Text Perceptron.
It first employs an efficient segmentation-based text detector that learns the latent text reading order and boundary information.
Then a novel Shape Transform Module (STM) is designed to transform the detected feature regions into regular morphologies.
arXiv Detail & Related papers (2020-02-17T08:07:19Z)
This list is automatically generated from the titles and abstracts of the papers on this site.