Polling Latent Opinions: A Method for Computational Sociolinguistics Using Transformer Language Models
- URL: http://arxiv.org/abs/2204.07483v2
- Date: Tue, 19 Apr 2022 18:09:39 GMT
- Title: Polling Latent Opinions: A Method for Computational Sociolinguistics Using Transformer Language Models
- Authors: Philip Feldman, Aaron Dant, James R. Foulds, Shimei Pan
- Abstract summary: We use the capacity for memorization and extrapolation of Transformer Language Models to learn the linguistic behaviors of a subgroup within larger corpora of Yelp reviews.
We show that even in cases where a specific keyphrase is limited or not present at all in the training corpora, the GPT is able to accurately generate large volumes of text that have the correct sentiment.
- Score: 4.874780144224057
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Text analysis of social media for sentiment, topic analysis, and other
analysis depends initially on the selection of keywords and phrases that will
be used to create the research corpora. However, keywords that researchers
choose may occur infrequently, leading to errors that arise from using small
samples. In this paper, we use the capacity for memorization, interpolation,
and extrapolation of Transformer Language Models such as the GPT series to
learn the linguistic behaviors of a subgroup within larger corpora of Yelp
reviews. We then use prompt-based queries to generate synthetic text that can
be analyzed to produce insights into specific opinions held by the populations
that the models were trained on. Once learned, more specific sentiment queries
can be made of the model with high levels of accuracy when compared to
traditional keyword searches. We show that even in cases where a specific
keyphrase is limited or not present at all in the training corpora, the GPT is
able to accurately generate large volumes of text that have the correct
sentiment.
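The small-sample problem the abstract mentions can be made concrete with a worked calculation. The sketch below is illustrative only and is not taken from the paper: it uses the standard normal-approximation margin of error for a proportion, and the counts (20 keyword matches versus 1,000 generated texts) are hypothetical numbers chosen for the example. The function name `margin_of_error` is likewise an assumption.

```python
import math


def margin_of_error(positive: int, total: int, z: float = 1.96) -> float:
    """Half-width of the normal-approximation confidence interval for a
    proportion (z = 1.96 corresponds to roughly 95% confidence)."""
    if total <= 0:
        raise ValueError("total must be positive")
    p = positive / total
    return z * math.sqrt(p * (1 - p) / total)


# Hypothetical counts, NOT results from the paper:
# a rare keyword that matches only 20 reviews, 12 of them positive ...
keyword_moe = margin_of_error(positive=12, total=20)
# ... versus 1,000 texts generated from a prompt, 600 of them positive.
synthetic_moe = margin_of_error(positive=600, total=1000)

print(f"keyword search margin:  +/- {keyword_moe:.3f}")
print(f"synthetic sample margin: +/- {synthetic_moe:.3f}")
```

With the same observed proportion (0.6), the interval for the 20-review sample is about seven times wider than for the 1,000-text sample. A larger synthetic sample only reduces sampling noise; it does not correct any bias in what the model has learned, so the generated text still needs to be validated against the real corpus.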
Related papers
- Likelihood as a Performance Gauge for Retrieval-Augmented Generation [78.28197013467157]
We show that likelihoods serve as an effective gauge for language model performance.
We propose two methods that use question likelihood as a gauge for selecting and constructing prompts that lead to better performance.
arXiv Detail & Related papers (2024-11-12T13:14:09Z)
- Towards Human Understanding of Paraphrase Types in ChatGPT [7.662751948664846]
Atomic paraphrase types (APT) decompose paraphrases into different linguistic changes.
We introduce APTY (Atomic Paraphrase TYpes), a dataset of 500 sentence-level and word-level annotations by 15 annotators.
Our results reveal that ChatGPT can generate simple APTs, but struggles with complex structures.
arXiv Detail & Related papers (2024-07-02T14:35:10Z)
- Retrieval is Accurate Generation [99.24267226311157]
We introduce a novel method that selects context-aware phrases from a collection of supporting documents.
Our model achieves the best performance and the lowest latency among several retrieval-augmented baselines.
arXiv Detail & Related papers (2024-02-27T14:16:19Z)
- Disco-Bench: A Discourse-Aware Evaluation Benchmark for Language Modelling [70.23876429382969]
We propose a benchmark that can evaluate intra-sentence discourse properties across a diverse set of NLP tasks.
Disco-Bench consists of 9 document-level testsets in the literature domain, which contain rich discourse phenomena.
For linguistic analysis, we also design a diagnostic test suite that can examine whether the target models learn discourse knowledge.
arXiv Detail & Related papers (2023-07-16T15:18:25Z)
- ChatGPT as a Text Simplification Tool to Remove Bias [0.0]
The presence of specific linguistic signals particular to a certain sub-group can be picked up by language models during training.
We explore a potential technique for bias mitigation in the form of simplification of text.
arXiv Detail & Related papers (2023-05-09T13:10:23Z)
- Beyond Contrastive Learning: A Variational Generative Model for Multilingual Retrieval [109.62363167257664]
We propose a generative model for learning multilingual text embeddings.
Our model operates on parallel data in $N$ languages.
We evaluate this method on a suite of tasks including semantic similarity, bitext mining, and cross-lingual question retrieval.
arXiv Detail & Related papers (2022-12-21T02:41:40Z)
- How much do language models copy from their training data? Evaluating linguistic novelty in text generation using RAVEN [63.79300884115027]
Current language models can generate high-quality text.
Are they simply copying text they have seen before, or have they learned generalizable linguistic abstractions?
We introduce RAVEN, a suite of analyses for assessing the novelty of generated text.
arXiv Detail & Related papers (2021-11-18T04:07:09Z)
- Corpus-Based Paraphrase Detection Experiments and Review [0.0]
Paraphrase detection is important for a number of applications, including plagiarism detection, authorship attribution, question answering, text summarization, etc.
In this paper, we give a performance overview of various types of corpus-based models, especially deep learning (DL) models, with the task of paraphrase detection.
arXiv Detail & Related papers (2021-05-31T23:29:24Z)
- Text Mining for Processing Interview Data in Computational Social Science [0.6820436130599382]
We use commercially available text analysis technology to process interview text data from a computational social science study.
We find that topical clustering and terminological enrichment provide for convenient exploration and quantification of the responses.
We encourage studies in social science to use text analysis, especially for exploratory open-ended studies.
arXiv Detail & Related papers (2020-11-28T00:44:35Z)
- Exemplar-Controllable Paraphrasing and Translation using Bitext [57.92051459102902]
We adapt models from prior work to be able to learn solely from bilingual text (bitext).
Our single proposed model can perform four tasks: controlled paraphrase generation in both languages and controlled machine translation in both language directions.
arXiv Detail & Related papers (2020-10-12T17:02:50Z)
- An Empirical Investigation of Pre-Trained Transformer Language Models for Open-Domain Dialogue Generation [23.343006562849126]
We present an empirical investigation of pre-trained Transformer-based auto-regressive language models for the task of open-domain dialogue generation.
The models are trained using the pre-training and fine-tuning paradigm.
Experiments are conducted on the typical single-turn and multi-turn dialogue corpora such as Weibo, Douban, Reddit, DailyDialog, and Persona-Chat.
arXiv Detail & Related papers (2020-03-09T15:20:21Z)
This list is automatically generated from the titles and abstracts of the papers in this site.