Personalized Jargon Identification for Enhanced Interdisciplinary
Communication
- URL: http://arxiv.org/abs/2311.09481v1
- Date: Thu, 16 Nov 2023 00:51:25 GMT
- Title: Personalized Jargon Identification for Enhanced Interdisciplinary
Communication
- Authors: Yue Guo, Joseph Chee Chang, Maria Antoniak, Erin Bransom, Trevor
Cohen, Lucy Lu Wang, Tal August
- Abstract summary: Current methods of jargon identification mainly use corpus-level familiarity indicators.
We collect a dataset of over 10K term familiarity annotations from 11 computer science researchers.
We investigate features representing individual, sub-domain, and domain knowledge to predict individual jargon familiarity.
- Score: 22.999616448996303
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Scientific jargon can impede researchers when they read materials from other
domains. Current methods of jargon identification mainly use corpus-level
familiarity indicators (e.g., Simple Wikipedia represents plain language).
However, researchers' familiarity with a term can vary greatly based on their own
background. We collect a dataset of over 10K term familiarity annotations from
11 computer science researchers for terms drawn from 100 paper abstracts.
Analysis of this data reveals that jargon familiarity and information needs
vary widely across annotators, even within the same sub-domain (e.g., NLP). We
investigate features representing individual, sub-domain, and domain knowledge
to predict individual jargon familiarity. We compare supervised and
prompt-based approaches, finding that prompt-based methods including personal
publications yield the highest accuracy, though zero-shot prompting provides a
strong baseline. This research offers insight into features and methods to
integrate personal data into scientific jargon identification.
Related papers
- De-jargonizing Science for Journalists with GPT-4: A Pilot Study [3.730699089967391]
The system achieves fairly high recall in identifying jargon and preserves relative differences in readers' jargon identification.
The findings highlight the potential of generative AI for assisting science reporters, and can inform future work on developing tools to simplify dense documents.
(2024-10-15T21:10:01Z)
- BLADE: Benchmarking Language Model Agents for Data-Driven Science [18.577658530714505]
LM-based agents equipped with planning, memory, and code execution capabilities have the potential to support data-driven science.
We present BLADE, a benchmark to automatically evaluate agents' multifaceted approaches to open-ended research questions.
(2024-08-19T02:59:35Z)
- SememeASR: Boosting Performance of End-to-End Speech Recognition against
Domain and Long-Tailed Data Shift with Sememe Semantic Knowledge [58.979490858061745]
We introduce sememe-based semantic knowledge information to speech recognition.
Our experiments show that sememe information can improve the effectiveness of speech recognition.
In addition, our further experiments show that sememe knowledge can improve the model's recognition of long-tailed data.
(2023-09-04T08:35:05Z)
- Understanding metric-related pitfalls in image analysis validation [59.15220116166561]
This work provides the first comprehensive common point of access to information on pitfalls related to validation metrics in image analysis.
Focusing on biomedical image analysis but with the potential of transfer to other fields, the addressed pitfalls generalize across application domains and are categorized according to a newly created, domain-agnostic taxonomy.
(2023-02-03T14:57:40Z)
- Author Name Disambiguation via Heterogeneous Network Embedding from
Structural and Semantic Perspectives [13.266320447769564]
Name ambiguity is common in academic digital libraries, for example when multiple authors share the same name.
The proposed method is mainly based on representation learning for heterogeneous networks and clustering.
The semantic representation is generated using NLP tools.
(2022-12-24T11:22:34Z)
- SciTweets -- A Dataset and Annotation Framework for Detecting Scientific
Online Discourse [2.3371548697609303]
Scientific topics, claims and resources are increasingly debated as part of online discourse.
This has led to both significant societal impact and increased interest in scientific online discourse from various disciplines.
Research across disciplines currently suffers from a lack of robust definitions of the various forms of science-relatedness.
(2022-06-15T08:14:55Z)
- LDKP: A Dataset for Identifying Keyphrases from Long Scientific
Documents [48.84086818702328]
Identifying keyphrases (KPs) from text documents is a fundamental task in natural language processing and information retrieval.
The vast majority of benchmark datasets for this task come from the scientific domain and contain only the document title and abstract.
This presents three challenges for real-world applications: human-written summaries are unavailable for most documents, the documents are almost always long, and a high percentage of KPs are directly found beyond the limited context of title and abstract.
(2022-03-29T08:44:57Z)
- Human-in-the-Loop Disinformation Detection: Stance, Sentiment, or
Something Else? [93.91375268580806]
Both politics and pandemics have recently provided ample motivation for the development of machine learning-enabled disinformation (a.k.a. fake news) detection algorithms.
Existing literature has focused primarily on the fully-automated case, but the resulting techniques cannot reliably detect disinformation on the varied topics, sources, and time scales required for military applications.
By leveraging an already-available analyst as a human-in-the-loop, canonical machine learning techniques of sentiment analysis, aspect-based sentiment analysis, and stance detection become plausible methods to use for a partially-automated disinformation detection system.
(2021-11-09T13:30:34Z)
- Domain Generalization: A Survey [146.68420112164577]
Domain generalization (DG) aims to achieve OOD generalization by only using source domain data for model learning.
For the first time, a comprehensive literature review is provided to summarize the ten-year development in DG.
(2021-03-03T16:12:22Z)
- Intrinsic Probing through Dimension Selection [69.52439198455438]
Most modern NLP systems make use of pre-trained contextual representations that attain astonishingly high performance on a variety of tasks.
Such high performance should not be possible unless some form of linguistic structure inheres in these representations, and a wealth of research has sprung up on probing for it.
In this paper, we draw a distinction between intrinsic probing, which examines how linguistic information is structured within a representation, and the extrinsic probing popular in prior work, which only argues for the presence of such information by showing that it can be successfully extracted.
(2020-10-06T15:21:08Z)
This list is automatically generated from the titles and abstracts of the papers on this site.