Evaluation of Word Embeddings for the Social Sciences
- URL: http://arxiv.org/abs/2302.06174v1
- Date: Mon, 13 Feb 2023 08:23:03 GMT
- Title: Evaluation of Word Embeddings for the Social Sciences
- Authors: Ricardo Schiffers, Dagmar Kern, Daniel Hienert
- Abstract summary: We describe the creation and evaluation of word embedding models based on 37,604 social science research papers.
We found that the created domain-specific model covers a large part of social science concepts.
Across all relation types, we found more extensive coverage of semantic relationships.
- Score: 0.0
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Word embeddings are an essential instrument in many NLP tasks. Most available
resources are trained on general language from Web corpora or Wikipedia dumps.
However, word embeddings for domain-specific language are rare, in particular
for the social science domain. Therefore, in this work, we describe the
creation and evaluation of word embedding models based on 37,604 open-access
social science research papers. In the evaluation, we compare domain-specific
and general language models for (i) language coverage, (ii) diversity, and
(iii) semantic relationships. We found that the created domain-specific model,
even with a relatively small vocabulary size, covers a large part of social
science concepts and that its neighborhoods are diverse in comparison to more
general models. Across all relation types, we found more extensive coverage of
semantic relationships.
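As a rough illustration of the evaluation described in the abstract, the sketch below trains a skip-gram word2vec model on a domain corpus with gensim and measures language coverage as the share of domain concepts found in the vocabulary. The corpus file, hyperparameters, and concept list are hypothetical stand-ins, not the paper's actual pipeline.

```python
# Minimal sketch (assumptions noted in comments), not the paper's pipeline:
# train a domain-specific word2vec model and check concept coverage.
from gensim.models import Word2Vec
from gensim.utils import simple_preprocess

# Assumption: one social science document per line in this file.
with open("social_science_papers.txt", encoding="utf-8") as f:
    sentences = [simple_preprocess(line) for line in f]

# Skip-gram (sg=1) with illustrative hyperparameters.
model = Word2Vec(sentences, vector_size=300, window=5,
                 min_count=5, sg=1, workers=4)

# Language coverage: fraction of domain concepts in the model's vocabulary.
concepts = ["migration", "inequality", "socialization"]  # hypothetical list
covered = [c for c in concepts if c in model.wv]
print(f"language coverage: {len(covered) / len(concepts):.2%}")

# Neighborhood diversity could then be probed via nearest neighbors.
if "inequality" in model.wv:
    print(model.wv.most_similar("inequality", topn=10))
```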
Related papers
- LexGen: Domain-aware Multilingual Lexicon Generation [40.97738267067852]
We propose a new model to generate dictionary words for 6 Indian languages in the multi-domain setting.
Our model consists of domain-specific and domain-generic layers that encode information.
We release a new benchmark dataset across 6 Indian languages that span 8 diverse domains.
arXiv Detail & Related papers (2024-05-18T07:02:43Z)
- Language Models for Text Classification: Is In-Context Learning Enough? [54.869097980761595]
Recent foundational language models have shown state-of-the-art performance in many NLP tasks in zero- and few-shot settings.
An advantage of these models over more standard approaches is the ability to understand instructions written in natural language (prompts).
This makes them suitable for addressing text classification problems for domains with limited amounts of annotated instances.
arXiv Detail & Related papers (2024-03-26T12:47:39Z)
- Syntax and Semantics Meet in the "Middle": Probing the Syntax-Semantics Interface of LMs Through Agentivity [68.8204255655161]
We present the semantic notion of agentivity as a case study for probing such interactions.
This suggests LMs may serve as more useful tools for linguistic annotation, theory testing, and discovery.
arXiv Detail & Related papers (2023-05-29T16:24:01Z)
- Domain-Specific Word Embeddings with Structure Prediction [3.057136788672694]
We present an empirical evaluation on New York Times articles and two English Wikipedia datasets with articles on science and philosophy.
Our method, called Word2Vec with Structure Prediction (W2VPred), provides better performance than baselines in terms of the general analogy tests.
As a use case in the field of Digital Humanities we demonstrate how to raise novel research questions for high literature from the German Text Archive.
arXiv Detail & Related papers (2022-10-06T12:45:48Z)
- Taxonomy Enrichment with Text and Graph Vector Representations [61.814256012166794]
We address the problem of taxonomy enrichment which aims at adding new words to the existing taxonomy.
We present a new method that achieves strong results on this task with little effort.
We achieve state-of-the-art results across different datasets and provide an in-depth error analysis of mistakes.
arXiv Detail & Related papers (2022-01-21T09:01:12Z)
- A Survey On Neural Word Embeddings [0.4822598110892847]
The study of meaning in natural language processing relies on the distributional hypothesis.
The revolutionary idea of distributed representations for concepts is close to how the human mind works.
Neural word embeddings transformed the whole field of NLP by introducing substantial improvements in all NLP tasks.
arXiv Detail & Related papers (2021-10-05T03:37:57Z)
- Sentiment analysis in tweets: an assessment study from classical to modern text representation models [59.107260266206445]
Short texts published on Twitter have earned significant attention as a rich source of information.
Their inherent characteristics, such as an informal and noisy linguistic style, remain challenging for many natural language processing (NLP) tasks.
This study assesses how well existing language models distinguish the sentiment expressed in tweets, using a rich collection of 22 datasets.
arXiv Detail & Related papers (2021-05-29T21:05:28Z)
- AfriVEC: Word Embedding Models for African Languages. Case Study of Fon and Nobiin [0.015863809575305417]
We build Word2Vec and Poincaré word embedding models for Fon and Nobiin.
Our main contribution is to spark greater interest in creating word embedding models tailored to African languages.
arXiv Detail & Related papers (2021-03-08T22:58:20Z)
- Grounded Compositional Outputs for Adaptive Language Modeling [59.02706635250856]
A language model's vocabulary, typically selected before training and permanently fixed later, affects its size.
We propose a fully compositional output embedding layer for language models.
To our knowledge, the result is the first word-level language model with a size that does not depend on the training vocabulary.
arXiv Detail & Related papers (2020-09-24T07:21:14Z)
- Computational linguistic assessment of textbook and online learning media by means of threshold concepts in business education [59.003956312175795]
From a linguistic perspective, threshold concepts are instances of specialized vocabularies, exhibiting particular linguistic features.
The profiles of 63 threshold concepts from business education have been investigated in textbooks, newspapers, and Wikipedia.
The three kinds of resources can indeed be distinguished in terms of their threshold concepts' profiles.
arXiv Detail & Related papers (2020-08-05T12:56:16Z)
- Comparative Analysis of Word Embeddings for Capturing Word Similarities [0.0]
Distributed language representation has become the most widely used technique for language representation in various natural language processing tasks.
Most of the natural language processing models that are based on deep learning techniques use already pre-trained distributed word representations, commonly called word embeddings.
Selecting the appropriate word embeddings is a perplexing task, since the projected embedding space is not intuitive to humans.
arXiv Detail & Related papers (2020-05-08T01:16:03Z)
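Related to the comparative-analysis entry above, here is a minimal sketch of how word similarities can be compared across pre-trained embedding models, assuming gensim's downloader and two of its standard models (both are large downloads); the word pairs are arbitrary examples, not from the paper.

```python
# Minimal sketch: compare cosine similarities of word pairs across two
# pre-trained embedding models via gensim's downloader.
import gensim.downloader as api

glove = api.load("glove-wiki-gigaword-100")   # GloVe, 100-dim
w2v = api.load("word2vec-google-news-300")    # word2vec, 300-dim

for a, b in [("society", "community"), ("income", "inequality")]:
    print(f"{a}/{b}: glove={glove.similarity(a, b):.3f}, "
          f"w2v={w2v.similarity(a, b):.3f}")

# Nearest neighbors give a qualitative view of each embedding space.
print(glove.most_similar("education", topn=5))
```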
This list is automatically generated from the titles and abstracts of the papers on this site.