Related papers: Improving Tagging Consistency and Entity Coverage for Chemical Identification in Full-text Articles

Improving Tagging Consistency and Entity Coverage for Chemical Identification in Full-text Articles

URL: http://arxiv.org/abs/2111.10584v1
Date: Sat, 20 Nov 2021 13:13:58 GMT
Title: Improving Tagging Consistency and Entity Coverage for Chemical Identification in Full-text Articles
Authors: Hyunjae Kim, Mujeen Sung, Wonjin Yoon, Sungjoon Park, Jaewoo Kang
Abstract summary: This paper is a technical report on our system submitted to the chemical identification task of the BioCreative VII Track 2 challenge. We aim to improve tagging consistency and entity coverage using various methods. In the official evaluation of the challenge, our system was ranked 1st in NER by significantly outperforming the baseline model.
Score: 17.24298646089662
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: This paper is a technical report on our system submitted to the chemical identification task of the BioCreative VII Track 2 challenge. The main feature of this challenge is that the data consists of full-text articles, while current datasets usually consist of only titles and abstracts. To effectively address the problem, we aim to improve tagging consistency and entity coverage using various methods such as majority voting within the same articles for named entity recognition (NER) and a hybrid approach that combines a dictionary and a neural model for normalization. In the experiments on the NLM-Chem dataset, we show that our methods improve models' performance, particularly in terms of recall. Finally, in the official evaluation of the challenge, our system was ranked 1st in NER by significantly outperforming the baseline model and more than 80 submissions from 16 teams.

Related papers

Segment Concealed Objects with Incomplete Supervision [63.637733655439334]
Incompletely-Supervised Concealed Object (ISCOS) involves segmenting objects that seamlessly blend into their surrounding environments.<n>This task remains highly challenging due to the limited supervision provided by the incompletely annotated training data.<n>In this paper, we introduce the first unified method for ISCOS to address these challenges.
arXiv Detail & Related papers (2025-06-10T16:25:15Z)
Enhancing Abstractive Summarization of Scientific Papers Using Structure Information [6.414732533433283]
We propose a two-stage abstractive summarization framework that leverages automatic recognition of structural functions within scientific papers.<n>In the first stage, we standardize chapter titles from numerous scientific papers and construct a large-scale dataset for structural function recognition.<n>In the second stage, we employ Longformer to capture rich contextual relationships across sections and generating context-aware summaries.
arXiv Detail & Related papers (2025-05-20T10:34:45Z)
Clustering Algorithms and RAG Enhancing Semi-Supervised Text Classification with Large LLMs [1.6575279044457722]
This paper proposes a Clustering, Labeling, then Augmenting framework that enhances performance in Semi-Supervised Text Classification tasks. Unlike traditional SSTC approaches, this framework employs clustering to select representative "landmarks" for labeling. Empirical results show that even in complex text document classification scenarios involving over 100 categories, our method achieves state-of-the-art accuracies of 95.41% on the Reuters dataset and 82.43% on the Web of Science dataset.
arXiv Detail & Related papers (2024-11-09T13:17:39Z)
Are Large Language Models Good Classifiers? A Study on Edit Intent Classification in Scientific Document Revisions [62.12545440385489]
Large language models (LLMs) have brought substantial advancements in text generation, but their potential for enhancing classification tasks remains underexplored. We propose a framework for thoroughly investigating fine-tuning LLMs for classification, including both generation- and encoding-based approaches. We instantiate this framework in edit intent classification (EIC), a challenging and underexplored classification task.
arXiv Detail & Related papers (2024-10-02T20:48:28Z)
SUMIE: A Synthetic Benchmark for Incremental Entity Summarization [6.149024468471498]
No existing dataset adequately tests how well language models can incrementally update entity summaries. We introduce SUMIE, a fully synthetic dataset designed to expose real-world IES challenges. This dataset effectively highlights problems like incorrect entity association and incomplete information presentation.
arXiv Detail & Related papers (2024-06-07T16:49:21Z)
Striking Gold in Advertising: Standardization and Exploration of Ad Text Generation [5.3558730908641525]
We propose a first benchmark dataset, CAMERA, to standardize the task of ATG. Our experiments show the current state and the remaining challenges. We also explore how existing metrics in ATG and an LLM-based evaluator align with human evaluations.
arXiv Detail & Related papers (2023-09-21T12:51:24Z)
NER-to-MRC: Named-Entity Recognition Completely Solving as Machine Reading Comprehension [29.227500985892195]
We frame NER as a machine reading comprehension problem, called NER-to-MRC. We transform the NER task into a form suitable for the model to solve with MRC in a efficient manner. We achieve state-of-the-art performance without external data, up to 11.24% improvement on the WNUT-16 dataset.
arXiv Detail & Related papers (2023-05-06T08:05:22Z)
Modeling Entities as Semantic Points for Visual Information Extraction in the Wild [55.91783742370978]
We propose an alternative approach to precisely and robustly extract key information from document images. We explicitly model entities as semantic points, i.e., center points of entities are enriched with semantic information describing the attributes and relationships of different entities. The proposed method can achieve significantly enhanced performance on entity labeling and linking, compared with previous state-of-the-art models.
arXiv Detail & Related papers (2023-03-23T08:21:16Z)
Questioning the Validity of Summarization Datasets and Improving Their Factual Consistency [14.974996886744083]
We release SummFC, a filtered summarization dataset with improved factual consistency. We argue that our dataset should become a valid benchmark for developing and evaluating summarization systems.
arXiv Detail & Related papers (2022-10-31T15:04:20Z)
Entity-Graph Enhanced Cross-Modal Pretraining for Instance-level Product Retrieval [152.3504607706575]
This research aims to conduct weakly-supervised multi-modal instance-level product retrieval for fine-grained product categories. We first contribute the Product1M datasets, and define two real practical instance-level retrieval tasks. We exploit to train a more effective cross-modal model which is adaptively capable of incorporating key concept information from the multi-modal data.
arXiv Detail & Related papers (2022-06-17T15:40:45Z)
Chemical Identification and Indexing in PubMed Articles via BERT and Text-to-Text Approaches [3.7462395049372894]
The Biocreative VII Track-2 challenge consists of named entity recognition, entity-linking (or entity-normalization), and topic indexing tasks. We achieve our best performance with BERT-based BioMegatron models. In addition to conventional NER methods, we attempt both named entity recognition and entity linking with a novel text-to-text or "prompt" based method.
arXiv Detail & Related papers (2021-11-30T18:21:06Z)
Unsupervised Domain Adaptive Learning via Synthetic Data for Person Re-identification [101.1886788396803]
Person re-identification (re-ID) has gained more and more attention due to its widespread applications in video surveillance. Unfortunately, the mainstream deep learning methods still need a large quantity of labeled data to train models. In this paper, we develop a data collector to automatically generate synthetic re-ID samples in a computer game, and construct a data labeler to simultaneously annotate them.
arXiv Detail & Related papers (2021-09-12T15:51:41Z)
Distantly-Supervised Named Entity Recognition with Noise-Robust Learning and Language Model Augmented Self-Training [66.80558875393565]
We study the problem of training named entity recognition (NER) models using only distantly-labeled data. We propose a noise-robust learning scheme comprised of a new loss function and a noisy label removal step. Our method achieves superior performance, outperforming existing distantly-supervised NER models by significant margins.
arXiv Detail & Related papers (2021-09-10T17:19:56Z)
DAGA: Data Augmentation with a Generation Approach for Low-resource Tagging Tasks [88.62288327934499]
We propose a novel augmentation method with language models trained on the linearized labeled sentences. Our method is applicable to both supervised and semi-supervised settings.
arXiv Detail & Related papers (2020-11-03T07:49:15Z)

This list is automatically generated from the titles and abstracts of the papers in this site.