Text sampling strategies for predicting missing bibliographic links
- URL: http://arxiv.org/abs/2301.01673v1
- Date: Wed, 4 Jan 2023 15:53:50 GMT
- Title: Text sampling strategies for predicting missing bibliographic links
- Authors: F. V. Krasnov, I. S. Smaznevich, E. N. Baskakova
- Abstract summary: The paper proposes various strategies for sampling text when performing automatic sentence classification.
We examine a number of sampling strategies that differ in context size and position.
This method of detecting missing links can be used in recommendation engines of applied intelligent information systems.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The paper proposes various strategies for sampling text data when performing
automatic sentence classification for the purpose of detecting missing
bibliographic links. We construct samples based on sentences as semantic units
of the text and add their immediate context which consists of several
neighboring sentences. We examine a number of sampling strategies that differ
in context size and position. The experiment is carried out on the collection
of STEM scientific papers. Including sentence context in the samples
improves classification results. We automatically determine the optimal
sampling strategy for a given text collection by applying ensemble voting
to the same data sampled in different ways. The context-aware sampling
strategy combined with a hard-voting procedure achieves a classification
F1-score of 98%. This method
of detecting missing bibliographic links can be used in recommendation engines
of applied intelligent information systems.
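The two ideas in the abstract — building samples from a target sentence plus a window of neighboring sentences, and combining classifiers trained on differently sampled data by hard voting — can be sketched as follows. This is a minimal illustration, not the authors' code; the function names, window parameters, and label representation are our assumptions.

```python
from collections import Counter

def make_samples(sentences, before=1, after=1):
    """Build one sample per sentence: the sentence itself plus up to
    `before` preceding and `after` following sentences as context.
    Varying (before, after) yields the different sampling strategies."""
    samples = []
    for i in range(len(sentences)):
        lo = max(0, i - before)
        hi = min(len(sentences), i + after + 1)
        samples.append(" ".join(sentences[lo:hi]))
    return samples

def hard_vote(predictions_per_strategy):
    """Majority (hard) vote over per-sentence labels predicted by
    classifiers run on the same sentences sampled in different ways."""
    n = len(predictions_per_strategy[0])
    voted = []
    for i in range(n):
        votes = Counter(preds[i] for preds in predictions_per_strategy)
        voted.append(votes.most_common(1)[0][0])
    return voted
```

For example, `make_samples(["a", "b", "c"])` produces `["a b", "a b c", "b c"]`, and `hard_vote` over three strategies' label lists returns the per-sentence majority label.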
Related papers
- Ordered Semantically Diverse Sampling for Textual Data [6.280814487955095]
We introduce the ordered diverse sampling problem based on a new metric that measures the diversity in an ordered list of samples.
We present a novel approach for generating ordered diverse samples for textual data that uses principal components on the embedding vectors.
arXiv Detail & Related papers (2025-03-12T06:38:57Z)
- Balancing Diversity and Risk in LLM Sampling: How to Select Your Method and Parameter for Open-Ended Text Generation [60.493180081319785]
We propose a systematic way to estimate the intrinsic capacity of a truncation sampling method by considering the trade-off between diversity and risk at each decoding step.
Our work provides a comprehensive comparison between existing truncation sampling methods, as well as their recommended parameters as a guideline for users.
arXiv Detail & Related papers (2024-08-24T14:14:32Z)
- Detecting Statements in Text: A Domain-Agnostic Few-Shot Solution [1.3654846342364308]
State-of-the-art approaches usually involve fine-tuning models on large annotated datasets, which are costly to produce.
We propose and release a qualitative and versatile few-shot learning methodology as a common paradigm for any claim-based textual classification task.
We illustrate this methodology in the context of three tasks: climate change contrarianism detection, topic/stance classification, and depression-related symptom detection.
arXiv Detail & Related papers (2024-05-09T12:03:38Z)
- Bilevel Scheduled Sampling for Dialogue Generation [6.89978591161039]
We propose a bilevel scheduled sampling model that takes the sentence-level information into account and incorporates it with word-level quality.
Experiments conducted on the DailyDialog and PersonaChat datasets demonstrate the effectiveness of our proposed methods.
arXiv Detail & Related papers (2023-09-05T05:05:06Z)
- Identifying Semantically Difficult Samples to Improve Text Classification [4.545971444299925]
We investigate the effect of addressing difficult samples from a given text dataset on the downstream text classification task.
We define difficult samples as being non-obvious cases for text classification by analysing them in the semantic embedding space.
We conduct exhaustive experiments on 13 standard datasets to show a consistent improvement of up to 9%.
arXiv Detail & Related papers (2023-02-13T07:33:46Z)
- Leveraging Ensembles and Self-Supervised Learning for Fully-Unsupervised Person Re-Identification and Text Authorship Attribution [77.85461690214551]
Learning from fully-unlabeled data is challenging in Multimedia Forensics problems, such as Person Re-Identification and Text Authorship Attribution.
Recent self-supervised learning methods have been shown to be effective when dealing with fully-unlabeled data in cases where the underlying classes have significant semantic differences.
We propose a strategy to tackle Person Re-Identification and Text Authorship Attribution by enabling learning from unlabeled data even when samples from different classes are not prominently diverse.
arXiv Detail & Related papers (2022-02-07T13:08:11Z)
- Speaker Embedding-aware Neural Diarization for Flexible Number of Speakers with Textual Information [55.75018546938499]
We propose the speaker embedding-aware neural diarization (SEND) method, which predicts power-set-encoded labels.
Our method achieves a lower diarization error rate than target-speaker voice activity detection.
arXiv Detail & Related papers (2021-11-28T12:51:04Z)
- Adaptive Sampling for Heterogeneous Rank Aggregation from Noisy Pairwise Comparisons [85.5955376526419]
In rank aggregation problems, users exhibit various accuracy levels when comparing pairs of items.
We propose an elimination-based active sampling strategy, which estimates the ranking of items via noisy pairwise comparisons.
We prove that our algorithm can return the true ranking of items with high probability.
arXiv Detail & Related papers (2021-10-08T13:51:55Z)
- Constructing Contrastive samples via Summarization for Text Classification with limited annotations [46.53641181501143]
We propose a novel approach to constructing contrastive samples for language tasks using text summarization.
We use these samples for supervised contrastive learning to gain better text representations with limited annotations.
Experiments on real-world text classification datasets (Amazon-5, Yelp-5, AG News) demonstrate the effectiveness of the proposed contrastive learning framework.
arXiv Detail & Related papers (2021-04-11T20:13:24Z)
- An Unsupervised Sampling Approach for Image-Sentence Matching Using Document-Level Structural Information [64.66785523187845]
We focus on the problem of unsupervised image-sentence matching.
Existing research explores utilizing document-level structural information to sample positive and negative instances for model training.
We propose a new sampling strategy to select additional intra-document image-sentence pairs as positive or negative samples.
arXiv Detail & Related papers (2021-03-21T05:43:29Z)
- Extractive Summarization as Text Matching [123.09816729675838]
This paper creates a paradigm shift with regard to the way we build neural extractive summarization systems.
We formulate the extractive summarization task as a semantic text matching problem.
We have advanced the state-of-the-art extractive result on CNN/DailyMail to a new level (44.41 in ROUGE-1).
arXiv Detail & Related papers (2020-04-19T08:27:57Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences.