How to Probe Sentence Embeddings in Low-Resource Languages: On
Structural Design Choices for Probing Task Evaluation
- URL: http://arxiv.org/abs/2006.09109v2
- Date: Wed, 28 Oct 2020 12:38:37 GMT
- Title: How to Probe Sentence Embeddings in Low-Resource Languages: On
Structural Design Choices for Probing Task Evaluation
- Authors: Steffen Eger and Johannes Daxenberger and Iryna Gurevych
- Abstract summary: We investigate sensitivity of probing task results to structural design choices.
We probe embeddings in a multilingual setup with design choices that lie in a 'stable region', as we identify for English.
We find that results on English do not transfer to other languages.
- Score: 82.96358326053115
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Sentence encoders map sentences to real valued vectors for use in downstream
applications. To peek into these representations - e.g., to increase
interpretability of their results - probing tasks have been designed which
query them for linguistic knowledge. However, designing probing tasks for
lesser-resourced languages is tricky, because these often lack large-scale
annotated data or (high-quality) dependency parsers as a prerequisite of
probing task design in English. To investigate how to probe sentence embeddings
in such cases, we investigate sensitivity of probing task results to structural
design choices, conducting the first such large scale study. We show that
design choices like size of the annotated probing dataset and type of
classifier used for evaluation do (sometimes substantially) influence probing
outcomes. We then probe embeddings in a multilingual setup with design choices
that lie in a 'stable region', as we identify for English, and find that
results on English do not transfer to other languages. Fairer and more
comprehensive sentence-level probing evaluation should thus be carried out on
multiple languages in the future.
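The two design choices named in the abstract (size of the annotated probing dataset and type of classifier) can be illustrated with a small sketch. Everything below is an assumption made for illustration and is not taken from the paper: the embeddings are synthetic Gaussian vectors rather than real encoder outputs, the two probes (nearest-centroid and perceptron) are simple stand-ins for whatever classifiers the authors used, and all function names (make_probe_data, train_centroid, train_perceptron, run_grid) are hypothetical.

```python
import random


def make_probe_data(n, dim=16, seed=0):
    """Synthetic stand-in for sentence embeddings with a binary label.

    Assumption: the label shifts only the first coordinate, so the property
    is linearly decodable but noisy. Real probing would use encoder outputs.
    """
    rng = random.Random(seed)
    data = []
    for _ in range(n):
        label = rng.randint(0, 1)
        vec = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        vec[0] += 1.5 if label == 1 else -1.5
        data.append((vec, label))
    return data


def train_centroid(train):
    """Nearest-centroid probe: one mean vector per class."""
    dim = len(train[0][0])
    sums = {0: [0.0] * dim, 1: [0.0] * dim}
    counts = {0: 0, 1: 0}
    for vec, label in train:
        counts[label] += 1
        for i, v in enumerate(vec):
            sums[label][i] += v
    cents = {c: [s / max(counts[c], 1) for s in sums[c]] for c in sums}

    def predict(vec):
        dist = {c: sum((a - b) ** 2 for a, b in zip(vec, cents[c])) for c in cents}
        return min(dist, key=dist.get)

    return predict


def train_perceptron(train, epochs=10):
    """Linear probe trained with the classic perceptron update rule."""
    dim = len(train[0][0])
    w, b = [0.0] * dim, 0.0
    for _ in range(epochs):
        for vec, label in train:
            y = 1 if label == 1 else -1
            score = sum(wi * xi for wi, xi in zip(w, vec)) + b
            if y * score <= 0:  # misclassified: move the boundary
                w = [wi + y * xi for wi, xi in zip(w, vec)]
                b += y

    def predict(vec):
        return 1 if sum(wi * xi for wi, xi in zip(w, vec)) + b > 0 else 0

    return predict


def accuracy(predict, test):
    return sum(predict(v) == y for v, y in test) / len(test)


def run_grid(train_sizes=(20, 200, 2000), seed=0):
    """Probe accuracy for every (classifier, training size) combination."""
    pool = make_probe_data(max(train_sizes), seed=seed)
    test = make_probe_data(500, seed=seed + 1)
    probes = {"centroid": train_centroid, "perceptron": train_perceptron}
    results = {}
    for size in train_sizes:
        for name, trainer in probes.items():
            results[(name, size)] = accuracy(trainer(pool[:size]), test)
    return results


if __name__ == "__main__":
    for (name, size), acc in sorted(run_grid().items()):
        print(f"{name:10s} n={size:5d} acc={acc:.3f}")
```

Comparing the rows of such a grid shows whether a conclusion about an encoder depends on the probe or on the amount of annotated data. A setting where the numbers stop changing across rows is one informal reading of what the abstract calls a 'stable region'.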
Related papers
- Likelihood as a Performance Gauge for Retrieval-Augmented Generation [78.28197013467157]
We show that likelihoods serve as an effective gauge for language model performance.
We propose two methods that use question likelihood as a gauge for selecting and constructing prompts that lead to better performance.
arXiv Detail & Related papers (2024-11-12T13:14:09Z)
- Is It Good Data for Multilingual Instruction Tuning or Just Bad Multilingual Evaluation for Large Language Models? [17.011882550422452]
It is unknown whether the nature of the instruction data has an impact on the model output.
It is questionable whether translated test sets can capture such nuances.
We show that native or generation benchmarks reveal a notable difference between native and translated instruction data.
arXiv Detail & Related papers (2024-06-18T17:43:47Z)
- Multilingual Few-Shot Learning via Language Model Retrieval [18.465566186549072]
Transformer-based language models have achieved remarkable success in few-shot in-context learning.
We conduct a study of retrieving semantically similar few-shot samples and using them as the context.
We evaluate the proposed method on five natural language understanding datasets related to intent detection, question classification, sentiment analysis, and topic classification.
arXiv Detail & Related papers (2023-06-19T14:27:21Z)
- Idioms, Probing and Dangerous Things: Towards Structural Probing for
Idiomaticity in Vector Space [2.5288257442251107]
The goal of this paper is to learn more about how idiomatic information is structurally encoded in embeddings.
We perform a comparative probing study of static (GloVe) and contextual (BERT) embeddings.
Our experiments indicate that both encode some idiomatic information to varying degrees, but yield conflicting evidence as to whether idiomaticity is encoded in the vector norm.
arXiv Detail & Related papers (2023-04-27T17:06:20Z)
- Curious Case of Language Generation Evaluation Metrics: A Cautionary
Tale [52.663117551150954]
A few popular metrics remain as the de facto metrics to evaluate tasks such as image captioning and machine translation.
This is partly due to ease of use, and partly because researchers expect to see them and know how to interpret them.
In this paper, we urge the community for more careful consideration of how they automatically evaluate their models.
arXiv Detail & Related papers (2020-10-26T13:57:20Z)
- Comparison of Interactive Knowledge Base Spelling Correction Models for
Low-Resource Languages [81.90356787324481]
Spelling normalization for low-resource languages is a challenging task because the patterns are hard to predict.
This work compares a neural model and character language models trained with varying amounts of target-language data.
Our usage scenario is interactive correction with almost no training examples, improving the models as more data is collected.
arXiv Detail & Related papers (2020-10-20T17:31:07Z)
- Learning Universal Representations from Word to Sentence [89.82415322763475]
This work introduces and explores universal representation learning, i.e., embedding linguistic units of different levels in a uniform vector space.
We present our approach of constructing analogy datasets in terms of words, phrases and sentences.
We empirically verify that well pre-trained Transformer models, combined with appropriate training settings, can effectively yield universal representations.
arXiv Detail & Related papers (2020-09-10T03:53:18Z)
- Information-Theoretic Probing for Linguistic Structure [74.04862204427944]
We propose an information-theoretic operationalization of probing as estimating mutual information.
We evaluate on a set of ten typologically diverse languages often underrepresented in NLP research.
arXiv Detail & Related papers (2020-04-07T01:06:36Z)
This list is automatically generated from the titles and abstracts of the papers on this site.