Software Entity Recognition with Noise-Robust Learning
- URL: http://arxiv.org/abs/2308.10564v1
- Date: Mon, 21 Aug 2023 08:41:46 GMT
- Title: Software Entity Recognition with Noise-Robust Learning
- Authors: Tai Nguyen, Yifeng Di, Joohan Lee, Muhao Chen and Tianyi Zhang
- Abstract summary: We leverage the Wikipedia taxonomy to develop a comprehensive entity lexicon with 79K unique software entities in 12 fine-grained types.
We then propose self-regularization, a noise-robust learning approach that trains our software entity recognition model to make consistent predictions across multiple dropout passes.
Results show that models trained with self-regularization outperform both their vanilla counterparts and state-of-the-art approaches on our Wikipedia benchmark and two Stack Overflow benchmarks.
- Score: 31.259250137320468
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Recognizing software entities such as library names from free-form text is
essential to enable many software engineering (SE) technologies, such as
traceability link recovery, automated documentation, and API recommendation.
While many approaches have been proposed to address this problem, they suffer
from small entity vocabularies or noisy training data, hindering their ability
to recognize software entities mentioned in sophisticated narratives. To
address this challenge, we leverage the Wikipedia taxonomy to develop a
comprehensive entity lexicon with 79K unique software entities in 12
fine-grained types, as well as a large labeled dataset of over 1.7M sentences.
Then, we propose self-regularization, a noise-robust learning approach that trains
our software entity recognition (SER) model to make consistent predictions across
multiple dropout passes. Results show that models trained with self-regularization outperform
both their vanilla counterparts and state-of-the-art approaches on our
Wikipedia benchmark and two Stack Overflow benchmarks. We release our models,
data, and code for future research.
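The self-regularization idea described in the abstract can be sketched in miniature: run the same input through the model under two independent dropout masks and penalize disagreement between the resulting output distributions, alongside the usual cross-entropy on the (possibly noisy) label. The sketch below is illustrative only, with a toy "dropout" applied directly to raw logits and hypothetical function names; it is not the authors' released code.

```python
import math
import random

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]

def kl(p, q):
    # KL(p || q) for two discrete distributions
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def forward_with_dropout(logits, rate=0.3, rng=random):
    # Toy "dropout": randomly zero out logits, rescale the survivors
    keep = 1.0 - rate
    return [x / keep if rng.random() > rate else 0.0 for x in logits]

def self_regularized_loss(logits, gold_index, alpha=1.0, rng=random):
    # Two stochastic forward passes over the same input
    p = softmax(forward_with_dropout(logits, rng=rng))
    q = softmax(forward_with_dropout(logits, rng=rng))
    # Cross-entropy on the label, averaged over both passes
    ce = -0.5 * (math.log(p[gold_index]) + math.log(q[gold_index]))
    # Symmetric KL consistency term: predictions should agree
    # regardless of which dropout mask was sampled
    consistency = 0.5 * (kl(p, q) + kl(q, p))
    return ce + alpha * consistency

rng = random.Random(0)
print(self_regularized_loss([2.0, 0.5, -1.0], gold_index=0, rng=rng))
```

The consistency term discourages the model from latching onto any single noisy label, since predictions that flip under small perturbations are penalized even when they match the label.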
Related papers
- RESTOR: Knowledge Recovery through Machine Unlearning [71.75834077528305]
Large language models trained on web-scale corpora can memorize undesirable datapoints.
Many machine unlearning methods have been proposed that aim to 'erase' these datapoints from trained models.
We propose the RESTOR framework for machine unlearning based on the following dimensions.
arXiv Detail & Related papers (2024-10-31T20:54:35Z) - CodeUnlearn: Amortized Zero-Shot Machine Unlearning in Language Models Using Discrete Concept [5.345828824625758]
We propose a novel amortized unlearning approach using codebook features and Sparse Autoencoders (SAEs).
By leveraging a bottleneck to decompose the activation space and regulate information flow, our method efficiently unlearns targeted information while preserving the model's performance on unrelated data.
arXiv Detail & Related papers (2024-10-08T10:26:22Z) - MUSE: Machine Unlearning Six-Way Evaluation for Language Models [109.76505405962783]
Language models (LMs) are trained on vast amounts of text data, which may include private and copyrighted content.
We propose MUSE, a comprehensive machine unlearning evaluation benchmark.
We benchmark how effectively eight popular unlearning algorithms can unlearn Harry Potter books and news articles.
arXiv Detail & Related papers (2024-07-08T23:47:29Z) - Generative AI for Software Metadata: Overview of the Information
Retrieval in Software Engineering Track at FIRE 2023 [18.616716369775883]
The Information Retrieval in Software Engineering (IRSE) track aims to develop solutions for automated evaluation of code comments.
The dataset consists of 9048 code comments and surrounding code snippet pairs extracted from open source C based projects.
Labels generated by large language models increase the bias in the prediction model but lead to less over-fitting.
arXiv Detail & Related papers (2023-10-27T14:13:23Z) - Lexically Aware Semi-Supervised Learning for OCR Post-Correction [90.54336622024299]
Much of the existing linguistic data in many languages of the world is locked away in non-digitized books and documents.
Previous work has demonstrated the utility of neural post-correction methods on recognition of less-well-resourced languages.
We present a semi-supervised learning method that makes it possible to utilize raw images to improve performance.
arXiv Detail & Related papers (2021-11-04T04:39:02Z) - Revisiting Self-Training for Few-Shot Learning of Language Model [61.173976954360334]
Unlabeled data carry rich task-relevant information and have proven useful for few-shot learning of language models.
In this work, we revisit the self-training technique for language model fine-tuning and present a state-of-the-art prompt-based few-shot learner, SFLM.
arXiv Detail & Related papers (2021-10-04T08:51:36Z) - Detecting Requirements Smells With Deep Learning: Experiences,
Challenges and Future Work [9.44316959798363]
This work aims to improve the previous work by creating a manually labeled dataset and using ensemble learning, Deep Learning (DL), and techniques such as word embeddings and transfer learning to overcome the generalization problem.
The current findings show that the dataset is unbalanced and indicate which classes need more examples.
arXiv Detail & Related papers (2021-08-06T12:45:15Z) - A Survey on Deep Learning with Noisy Labels: How to train your model
when you cannot trust on the annotations? [21.562089974755125]
Several approaches have been proposed to improve the training of deep learning models in the presence of noisy labels.
This paper presents a survey of the main techniques in the literature, in which we classify the algorithms into the following groups: robust losses, sample weighting, sample selection, meta-learning, and combined approaches.
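The first group in that taxonomy, robust losses, can be illustrated with the generalized cross-entropy loss, a well-known noise-robust loss that bounds the penalty on low-probability (likely mislabeled) examples. This toy sketch is my own illustration, not code from the survey:

```python
import math

def cross_entropy(p_true):
    # Standard CE: unbounded as p_true -> 0, so confident
    # disagreement with a (possibly wrong) label is punished hard
    return -math.log(p_true)

def generalized_ce(p_true, q=0.7):
    # Generalized cross-entropy: interpolates between CE (q -> 0)
    # and MAE (q = 1); bounded by 1/q, so mislabeled examples
    # cannot dominate the gradient
    return (1.0 - p_true ** q) / q

# On a low-probability example, GCE stays bounded while CE explodes
print(cross_entropy(0.01), generalized_ce(0.01))
```

The bounded loss is the key property: a handful of noisy labels contribute at most a fixed amount of loss each, instead of arbitrarily large gradients.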
arXiv Detail & Related papers (2020-12-05T15:45:20Z) - Adaptive Self-training for Few-shot Neural Sequence Labeling [55.43109437200101]
We develop techniques to address the label scarcity challenge for neural sequence labeling models.
Self-training serves as an effective mechanism to learn from large amounts of unlabeled data.
Meta-learning helps in adaptive sample re-weighting to mitigate error propagation from noisy pseudo-labels.
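A crude proxy for that adaptive re-weighting idea, shown purely for illustration (the paper's actual scheme is learned via meta-learning, not a fixed confidence threshold):

```python
def pseudo_label_weights(confidences, threshold=0.5):
    # Confidence-based stand-in for adaptive re-weighting:
    # drop pseudo-labels below the threshold and normalize the rest,
    # so more confident predictions contribute more to training
    weights = [c if c >= threshold else 0.0 for c in confidences]
    total = sum(weights)
    return [w / total for w in weights] if total > 0 else weights

# The low-confidence pseudo-label gets weight 0
print(pseudo_label_weights([0.9, 0.3, 0.6]))
```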
arXiv Detail & Related papers (2020-10-07T22:29:05Z) - Knowledge-Aware Procedural Text Understanding with Multi-Stage Training [110.93934567725826]
We focus on the task of procedural text understanding, which aims to comprehend such documents and track entities' states and locations during a process.
Two challenges, the difficulty of commonsense reasoning and data insufficiency, remain unsolved.
We propose a novel KnOwledge-Aware proceduraL text understAnding (KOALA) model, which effectively leverages multiple forms of external knowledge.
arXiv Detail & Related papers (2020-09-28T10:28:40Z)
This list is automatically generated from the titles and abstracts of the papers in this site.