Vectorizing string entries for data processing on tables: when are
larger language models better?
- URL: http://arxiv.org/abs/2312.09634v1
- Date: Fri, 15 Dec 2023 09:23:56 GMT
- Title: Vectorizing string entries for data processing on tables: when are
larger language models better?
- Authors: Léo Grinsztajn (SODA, MLIA, ISIR), Edouard Oyallon (MLIA, CNRS,
ISIR, SU), Myung Jun Kim (SODA), Gaël Varoquaux (SODA)
- Abstract summary: We study the benefits of language models in 14 analytical tasks on tables.
We show that larger language models tend to perform better, but it is useful to fine-tune them for embedding purposes.
- Score: 1.0840985826142429
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: There are increasingly efficient data processing pipelines that work on
vectors of numbers, for instance most machine learning models, or vector
databases for fast similarity search. These require converting the data to
numbers. While this conversion is easy for simple numerical and categorical
entries, databases are rife with text entries, such as names or descriptions.
In the age of large language models, what are the best strategies to vectorize
table entries, bearing in mind that larger models entail more operational
complexity? We study the benefits of language models in 14 analytical tasks on
tables while varying the training size, as well as for a fuzzy join benchmark.
We introduce a simple characterization of a column that reveals two settings:
1) a dirty categories setting, where strings share many similarities across
entries, and conversely 2) a diverse entries setting. For dirty categories,
pretrained language models bring little-to-no benefit compared to simpler
string models. For diverse entries, we show that larger language models improve
data processing. For these we investigate the complexity-performance tradeoffs
and show that they reflect those of classic text embedding: larger models tend
to perform better, but it is useful to fine-tune them for embedding purposes.
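To make the "simpler string models" side of this comparison concrete, here is a minimal sketch of vectorizing string entries with character n-gram counts and using the vectors for a fuzzy join. It is an illustration only, not the authors' method: the abstract does not name the baselines used, and the helper names `char_ngrams`, `cosine`, and `fuzzy_join`, the trigram size, and the sample company names are all assumptions made for this example.

```python
from collections import Counter
from math import sqrt


def char_ngrams(text, n=3):
    """Vectorize one string entry as a sparse bag of character n-grams."""
    # Lowercase and pad with spaces so word boundaries produce n-grams too.
    padded = f" {text.lower().strip()} "
    return Counter(padded[i:i + n] for i in range(len(padded) - n + 1))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(count * b[gram] for gram, count in a.items())
    norm_a = sqrt(sum(v * v for v in a.values()))
    norm_b = sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def fuzzy_join(left, right, n=3):
    """Map each entry of `left` to its most similar entry of `right`."""
    right_vecs = [(entry, char_ngrams(entry, n)) for entry in right]
    matches = {}
    for entry in left:
        vec = char_ngrams(entry, n)
        best = max(right_vecs, key=lambda pair: cosine(vec, pair[1]))
        matches[entry] = best[0]
    return matches


# Hypothetical "dirty categories" column: variants of the same names.
dirty = ["Pfizer Inc.", "Novartis AG"]
clean = ["Pfizer", "Novartis"]
print(fuzzy_join(dirty, clean))
```

Entries that are spelling variants of one another share most of their n-grams, which is why a purely string-based representation can be enough in the dirty categories setting. For diverse entries, where meaning matters more than surface overlap, the same join could be run on embeddings from a language model instead of n-gram counts.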
Related papers
- Less is More: Making Smaller Language Models Competent Subgraph Retrievers for Multi-hop KGQA [51.3033125256716]
We model the subgraph retrieval task as a conditional generation task handled by small language models.
Our base generative subgraph retrieval model, consisting of only 220M parameters, achieves competitive retrieval performance compared to state-of-the-art models.
Our largest 3B model, when plugged with an LLM reader, sets new SOTA end-to-end performance on both the WebQSP and CWQ benchmarks.
arXiv Detail & Related papers (2024-10-08T15:22:36Z)
- Language Models for Text Classification: Is In-Context Learning Enough? [54.869097980761595]
Recent foundational language models have shown state-of-the-art performance in many NLP tasks in zero- and few-shot settings.
An advantage of these models over more standard approaches is the ability to understand instructions written in natural language (prompts).
This makes them suitable for addressing text classification problems for domains with limited amounts of annotated instances.
arXiv Detail & Related papers (2024-03-26T12:47:39Z)
- CompoundPiece: Evaluating and Improving Decompounding Performance of Language Models [77.45934004406283]
We systematically study decompounding, the task of splitting compound words into their constituents.
We introduce a dataset of 255k compound and non-compound words across 56 diverse languages obtained from Wiktionary.
We introduce a novel methodology to train dedicated models for decompounding.
arXiv Detail & Related papers (2023-05-23T16:32:27Z)
- XRICL: Cross-lingual Retrieval-Augmented In-Context Learning for Cross-lingual Text-to-SQL Semantic Parsing [70.40401197026925]
In-context learning using large language models has recently shown surprising results for semantic parsing tasks.
This work introduces the XRICL framework, which learns to retrieve relevant English exemplars for a given query.
We also include global translation exemplars for a target language to facilitate the translation process for large language models.
arXiv Detail & Related papers (2022-10-25T01:33:49Z)
- TabLLM: Few-shot Classification of Tabular Data with Large Language Models [66.03023402174138]
We study the application of large language models to zero-shot and few-shot classification.
We evaluate several serialization methods including templates, table-to-text models, and large language models.
This approach is also competitive with strong traditional baselines like gradient-boosted trees.
arXiv Detail & Related papers (2022-10-19T17:08:13Z)
- Assessment of Massively Multilingual Sentiment Classifiers [7.852069123677559]
We present the biggest, unified, multilingual collection of sentiment analysis datasets.
We use these to assess 11 models and 80 high-quality sentiment datasets (out of 342 raw datasets collected) in 27 languages.
arXiv Detail & Related papers (2022-04-11T08:22:05Z)
- Comparison of Interactive Knowledge Base Spelling Correction Models for Low-Resource Languages [81.90356787324481]
Spelling normalization for low resource languages is a challenging task because the patterns are hard to predict.
This work compares a neural model and character language models with varying amounts of target language data.
Our usage scenario is interactive correction with nearly zero amounts of training examples, improving models as more data is collected.
arXiv Detail & Related papers (2020-10-20T17:31:07Z)