TartuNLP at SemEval-2025 Task 5: Subject Tagging as Two-Stage Information Retrieval
- URL: http://arxiv.org/abs/2504.21547v1
- Date: Wed, 30 Apr 2025 11:44:08 GMT
- Title: TartuNLP at SemEval-2025 Task 5: Subject Tagging as Two-Stage Information Retrieval
- Authors: Aleksei Dorkin, Kairit Sirts
- Abstract summary: We present our submission to Task 5 of SemEval-2025. This task aims to aid librarians in assigning subject tags to library records by producing a list of likely relevant tags for a given document. We leverage two types of encoder models to build a two-stage information retrieval system.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We present our submission to Task 5 of SemEval-2025, which aims to aid librarians in assigning subject tags to library records by producing a list of likely relevant tags for a given document. We frame the task as an information retrieval problem, where the document content is used to retrieve subject tags from a large subject taxonomy. We leverage two types of encoder models to build a two-stage information retrieval system -- a bi-encoder for coarse-grained candidate extraction at the first stage, and a cross-encoder for fine-grained re-ranking at the second stage. This approach proved effective, demonstrating significant improvements in recall compared to single-stage methods and showing competitive results according to qualitative evaluation.
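The two-stage pipeline described in the abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: the toy vectors, the `cosine` stand-in for a trained bi-encoder, and the token-overlap stand-in for a trained cross-encoder are all hypothetical; in the actual system both stages would be transformer encoders scoring document-tag pairs over the full GND taxonomy.

```python
import math

def cosine(a, b):
    """Cosine similarity between two dense vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def bi_encoder_retrieve(doc_vec, tag_vecs, k):
    """Stage 1: coarse candidate extraction.

    Tags and documents are embedded independently, so every tag in the
    taxonomy can be scored with a cheap similarity lookup.
    """
    scored = [(tag, cosine(doc_vec, vec)) for tag, vec in tag_vecs.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return [tag for tag, _ in scored[:k]]

def cross_encoder_rerank(doc_text, candidates, cross_score):
    """Stage 2: fine-grained re-ranking.

    Only the shortlisted candidates are scored jointly with the document,
    which is more expensive but more accurate.
    """
    return sorted(candidates, key=lambda tag: cross_score(doc_text, tag),
                  reverse=True)

# Toy 3-dimensional "embeddings" for one document and a tiny tag taxonomy.
doc_vec = [0.9, 0.1, 0.0]
tag_vecs = {
    "machine learning": [1.0, 0.0, 0.0],
    "libraries":        [0.7, 0.7, 0.0],
    "cooking":          [0.0, 0.0, 1.0],
}

candidates = bi_encoder_retrieve(doc_vec, tag_vecs, k=2)

# Hypothetical cross-encoder: score by document/tag token overlap.
cross_score = lambda doc, tag: len(set(doc.split()) & set(tag.split()))
ranking = cross_encoder_rerank("applied machine learning in libraries",
                               candidates, cross_score)
print(ranking)  # → ['machine learning', 'libraries']
```

The design point the sketch captures is the recall/precision split: the bi-encoder keeps recall high by scanning the whole taxonomy cheaply, while the cross-encoder spends its joint-encoding budget only on the surviving candidates.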
Related papers
- Learning to Retrieve with Weakened Labels: Robust Training under Label Noise [0.0]
We consider a label weakening approach to generate robust retrieval models in the presence of label noise. Our initial results show that label weakening can improve the performance of the retrieval tasks in comparison to 10 different state-of-the-art loss functions.
arXiv Detail & Related papers (2025-12-15T11:52:13Z) - Homa at SemEval-2025 Task 5: Aligning Librarian Records with OntoAligner for Subject Tagging [1.2582887633807602]
This paper presents our system, Homa, for SemEval-2025 Task 5: Subject Tagging. It focuses on automatically assigning subject labels to technical records from TIBKAT using the Gemeinsame Normdatei (GND) taxonomy. Our approach formulates the subject tagging problem as an alignment task, where records are matched to categories based on semantic similarity.
arXiv Detail & Related papers (2025-04-30T09:52:51Z) - Generative Retrieval Meets Multi-Graded Relevance [104.75244721442756]
We introduce a framework called GRaded Generative Retrieval (GR$2$).
GR$2$ focuses on two key components: ensuring relevant and distinct identifiers, and implementing multi-graded constrained contrastive training.
Experiments on datasets with both multi-graded and binary relevance demonstrate the effectiveness of GR$2$.
arXiv Detail & Related papers (2024-09-27T02:55:53Z) - Leveraging Semantic Segmentation Masks with Embeddings for Fine-Grained Form Classification [0.0]
Efficient categorization of historical documents is crucial for fields such as genealogy, legal research and historical scholarship.
We propose a representational learning strategy that integrates deep learning models such as ResNet and masked image Transformers with semantic segmentation mask embeddings.
arXiv Detail & Related papers (2024-05-23T04:28:50Z) - Cross Encoding as Augmentation: Towards Effective Educational Text Classification [9.786833703453741]
We propose a novel retrieval approach CEAA that provides effective learning in educational text classification.
Our main contributions are as follows: 1) we leverage transfer learning from question-answering datasets, and 2) we propose a simple but effective data augmentation method.
arXiv Detail & Related papers (2023-05-30T12:19:30Z) - Retrieval-augmented Multi-label Text Classification [20.100081284294973]
Multi-label text classification is a challenging task in settings of large label sets.
Retrieval augmentation aims to improve the sample efficiency of classification models.
We evaluate this approach on four datasets from the legal and biomedical domains.
arXiv Detail & Related papers (2023-05-22T14:16:23Z) - Questions Are All You Need to Train a Dense Passage Retriever [123.13872383489172]
ART is a new corpus-level autoencoding approach for training dense retrieval models that does not require any labeled training data.
It uses a new document-retrieval autoencoding scheme, where (1) an input question is used to retrieve a set of evidence documents, and (2) the documents are then used to compute the probability of reconstructing the original question.
arXiv Detail & Related papers (2022-06-21T18:16:31Z) - Augmenting Document Representations for Dense Retrieval with Interpolation and Perturbation [49.940525611640346]
The Document Augmentation for dense Retrieval (DAR) framework augments the representations of documents through interpolation and perturbation.
We validate the performance of DAR on retrieval tasks with two benchmark datasets, showing that the proposed DAR significantly outperforms relevant baselines on the dense retrieval of both the labeled and unlabeled documents.
arXiv Detail & Related papers (2022-03-15T09:07:38Z) - Focused Attention Improves Document-Grounded Generation [111.42360617630669]
Document grounded generation is the task of using the information provided in a document to improve text generation.
This work focuses on two different document grounded generation tasks: Wikipedia Update Generation task and Dialogue response generation.
arXiv Detail & Related papers (2021-04-26T16:56:29Z) - Unsupervised Label Refinement Improves Dataless Text Classification [48.031421660674745]
Dataless text classification is capable of classifying documents into previously unseen labels by assigning a score to any document paired with a label description.
While promising, it crucially relies on accurate descriptions of the label set for each downstream task.
This reliance causes dataless classifiers to be highly sensitive to the choice of label descriptions and hinders the broader application of dataless classification in practice.
arXiv Detail & Related papers (2020-12-08T03:37:50Z) - Summary-Source Proposition-level Alignment: Task, Datasets and Supervised Baseline [94.0601799665342]
Aligning sentences in a reference summary with their counterparts in source documents was shown as a useful auxiliary summarization task.
We propose establishing summary-source alignment as an explicit task, while introducing two major novelties.
We create a novel training dataset for proposition-level alignment, derived automatically from available summarization evaluation data.
We present a supervised proposition alignment baseline model, showing improved alignment-quality over the unsupervised approach.
arXiv Detail & Related papers (2020-09-01T17:27:12Z) - Minimally Supervised Categorization of Text with Metadata [40.13841133991089]
We propose MetaCat, a minimally supervised framework to categorize text with metadata.
We develop a generative process describing the relationships between words, documents, labels, and metadata.
Based on the same generative process, we synthesize training samples to address the bottleneck of label scarcity.
arXiv Detail & Related papers (2020-05-01T21:42:32Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.