Duluth at SemEval-2025 Task 7: TF-IDF with Optimized Vector Dimensions for Multilingual Fact-Checked Claim Retrieval
- URL: http://arxiv.org/abs/2505.12616v1
- Date: Mon, 19 May 2025 01:58:22 GMT
- Title: Duluth at SemEval-2025 Task 7: TF-IDF with Optimized Vector Dimensions for Multilingual Fact-Checked Claim Retrieval
- Authors: Shujauddin Syed, Ted Pedersen
- Abstract summary: This paper presents the approach to the SemEval-2025 Task 7 on Multilingual and Crosslingual Fact-Checked Claim Retrieval. We implement a TF-IDF-based retrieval system with experimentation on vector dimensions and tokenization strategies.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: This paper presents the Duluth approach to the SemEval-2025 Task 7 on Multilingual and Crosslingual Fact-Checked Claim Retrieval. We implemented a TF-IDF-based retrieval system with experimentation on vector dimensions and tokenization strategies. Our best-performing configuration used word-level tokenization with a vocabulary size of 15,000 features, achieving an average success@10 score of 0.78 on the development set and 0.69 on the test set across ten languages. Our system showed stronger performance on higher-resource languages but still lagged significantly behind the top-ranked system, which achieved 0.96 average success@10. Our findings suggest that although advanced neural architectures are increasingly dominant in multilingual retrieval tasks, properly optimized traditional methods like TF-IDF remain competitive baselines, especially when compute resources are limited.
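To make the setup concrete, here is a minimal sketch of the kind of pipeline the abstract describes, assuming scikit-learn: word-level TF-IDF vectors capped at 15,000 features, cosine-similarity ranking, and a success@10 score. The toy data and variable names are illustrative, not the authors' released code.

```python
# Minimal sketch of a TF-IDF retrieval pipeline with word-level
# tokenization and a capped vocabulary, plus a success@10 metric.
# Hypothetical data and names; not the authors' released code.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

fact_checks = ["fact-checked claim one", "fact-checked claim two"]  # retrieval corpus
input_claims = ["claim one to verify"]                              # queries
gold = [0]  # gold[i] = index of the correct fact-check for claim i (illustrative)

# Word-level tokenization with a 15,000-feature vocabulary,
# matching the best configuration reported in the abstract.
vectorizer = TfidfVectorizer(analyzer="word", max_features=15000)
doc_matrix = vectorizer.fit_transform(fact_checks)
query_matrix = vectorizer.transform(input_claims)

# Rank fact-checks by cosine similarity and keep the top 10.
sims = cosine_similarity(query_matrix, doc_matrix)
top10 = np.argsort(-sims, axis=1)[:, :10]

# success@10: fraction of claims whose gold fact-check is retrieved.
success_at_10 = np.mean([gold[i] in top10[i] for i in range(len(gold))])
print(f"success@10 = {success_at_10:.2f}")
```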
Related papers
- ylmmcl at Multilingual Text Detoxification 2025: Lexicon-Guided Detoxification and Classifier-Gated Rewriting
In this work, we introduce our solution for the Multilingual Text Detoxification Task in the PAN-2025 competition for the ylmmcl team. Our approach departs from prior unsupervised or monolingual pipelines by leveraging explicit toxic word annotation via the multilingual_toxic_lexicon. Our model achieves the highest STA (0.922) among our attempts, and an average official J score of 0.612 for toxic inputs in both the development and test sets.
arXiv Detail & Related papers (2025-07-24T19:38:15Z)
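The lexicon-guided idea can be pictured with a small sketch: drop tokens found in an annotated toxic lexicon, then let a toxicity classifier gate whether the rewrite is accepted. The lexicon path and the is_toxic scorer below are hypothetical placeholders, not the ylmmcl pipeline.

```python
# Hedged sketch of lexicon-guided detoxification with a classifier gate.
# The lexicon path and is_toxic() scorer are hypothetical placeholders.
def load_lexicon(path="multilingual_toxic_lexicon.txt"):
    with open(path, encoding="utf-8") as f:
        return {line.strip().lower() for line in f if line.strip()}

def detoxify(text, lexicon):
    # Drop tokens that appear in the annotated toxic lexicon.
    kept = [tok for tok in text.split() if tok.lower() not in lexicon]
    return " ".join(kept)

def classifier_gated_rewrite(text, lexicon, is_toxic):
    # Accept the rewrite only if a toxicity classifier judges it clean;
    # otherwise fall back to the original input.
    rewrite = detoxify(text, lexicon)
    return rewrite if not is_toxic(rewrite) else text
```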
- Hybrid Extractive Abstractive Summarization for Multilingual Sentiment Analysis
The model integrates TF-IDF-based extraction with a fine-tuned XLM-R abstractive module, enhanced by dynamic thresholding and cultural adaptation. Experiments across 10 languages show significant improvements over baselines, achieving 0.90 accuracy for English and 0.84 for low-resource languages.
arXiv Detail & Related papers (2025-06-07T21:44:31Z)
- Logits-Based Finetuning
We propose a logits-based fine-tuning framework that integrates the strengths of supervised learning and knowledge distillation. Our approach constructs enriched training targets by combining teacher logits with ground-truth labels, preserving both correctness and linguistic diversity.
arXiv Detail & Related papers (2025-05-30T10:57:09Z)
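Combining teacher logits with ground-truth labels is the core of classic knowledge distillation; the sketch below shows that standard formulation as one plausible reading of the abstract, not the paper's exact target construction.

```python
# Sketch of a logits-based fine-tuning loss mixing teacher logits with
# ground-truth labels. This is ordinary knowledge distillation, used as
# an illustration; the paper's target construction may differ.
import torch
import torch.nn.functional as F

def logits_finetune_loss(student_logits, teacher_logits, labels,
                         alpha=0.5, temperature=2.0):
    # Cross-entropy against the ground-truth labels keeps correctness.
    ce = F.cross_entropy(student_logits, labels)
    # KL divergence against softened teacher logits preserves the
    # teacher's distribution over alternatives (linguistic diversity).
    kd = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1),
        F.softmax(teacher_logits / temperature, dim=-1),
        reduction="batchmean",
    ) * temperature**2
    return alpha * ce + (1 - alpha) * kd
```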
- Word2winners at SemEval-2025 Task 7: Multilingual and Crosslingual Fact-Checked Claim Retrieval
This paper describes our system for SemEval 2025 Task 7: Previously Fact-Checked Claim Retrieval. The task requires retrieving relevant fact-checks for a given input claim from the extensive, multilingual MultiClaim dataset. Our best model achieved an accuracy of 85% on crosslingual data and 92% on monolingual data.
arXiv Detail & Related papers (2025-03-12T02:59:41Z)
- Enhancing Multilingual LLM Pretraining with Model-Based Data Selection
We propose a model-based filtering framework for multilingual datasets. Our approach emphasizes transparency, simplicity, and efficiency. We extend our framework to 20 languages, for which we release the refined pretraining datasets.
arXiv Detail & Related papers (2025-02-14T18:42:07Z)
- Multilingual De-Duplication Strategies: Applying scalable similarity search with monolingual & multilingual embedding models
This paper addresses the deduplication of multilingual textual data using advanced NLP tools. We compare a two-step method involving translation to English followed by embedding with mpnet against a direct multilingual embedding model (distiluse).
arXiv Detail & Related papers (2024-06-19T16:48:14Z)
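As an illustration of the embedding-based route, the sketch below flags near-duplicates by cosine similarity over multilingual sentence embeddings. The distiluse checkpoint name is a real sentence-transformers model, but the threshold and pairing logic are illustrative choices rather than the paper's exact setup.

```python
# Sketch of embedding-based near-duplicate detection with a multilingual
# model. Threshold and pairing logic are illustrative.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("distiluse-base-multilingual-cased-v2")
texts = ["Hello world", "Hola mundo", "A completely different sentence"]

embeddings = model.encode(texts, convert_to_tensor=True,
                          normalize_embeddings=True)
sims = util.cos_sim(embeddings, embeddings)

# Flag pairs above a similarity threshold as duplicates.
THRESHOLD = 0.85
duplicates = [(i, j) for i in range(len(texts))
              for j in range(i + 1, len(texts))
              if sims[i, j] >= THRESHOLD]
print(duplicates)
```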
- Breaking Language Barriers in Multilingual Mathematical Reasoning: Insights and Observations
This paper pioneers the exploration and training of powerful Multilingual Math Reasoning (xMR) LLMs. By utilizing translation, we construct the first multilingual math reasoning instruction dataset, MGSM8KInstruct, encompassing ten distinct languages.
arXiv Detail & Related papers (2023-10-31T08:09:20Z)
- Strategies for improving low resource speech to text translation relying on pre-trained ASR models
This paper presents techniques and findings for improving the performance of low-resource speech-to-text translation (ST). We conducted experiments on both simulated and real low-resource setups, on the language pairs English-Portuguese and Tamasheq-French respectively.
arXiv Detail & Related papers (2023-05-31T21:58:07Z)
- Bag of Tricks for Effective Language Model Pretraining and Downstream Adaptation: A Case Study on GLUE
This report briefly describes our submission Vega v1 on the General Language Understanding Evaluation leaderboard.
GLUE is a collection of nine natural language understanding tasks, including question answering, linguistic acceptability, sentiment analysis, text similarity, paraphrase detection, and natural language inference.
With our optimized pretraining and fine-tuning strategies, our 1.3-billion-parameter model sets a new state of the art on 4 of the 9 tasks, achieving the best average score of 91.3.
arXiv Detail & Related papers (2023-02-18T09:26:35Z)
- Distributionally Robust Multilingual Machine Translation
We propose a new learning objective for multilingual neural machine translation (MNMT) based on distributionally robust optimization.
We show how to practically optimize this objective for large translation corpora using an iterated best response scheme.
Our method consistently outperforms strong baseline methods in terms of average and per-language performance under both many-to-one and one-to-many translation settings.
arXiv Detail & Related papers (2021-09-09T03:48:35Z)
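Schematically, a distributionally robust objective of this kind minimizes the worst-case weighted sum of per-language losses. The form below is a generic sketch; the paper's exact uncertainty set and constraints may differ.

```latex
% Schematic DRO objective over N language pairs: the model minimizes the
% worst-case weighted loss, with weights \lambda restricted to an
% uncertainty set Q within the probability simplex \Delta_N.
\min_{\theta} \; \max_{\lambda \in Q \subseteq \Delta_N}
    \sum_{i=1}^{N} \lambda_i \, \ell_i(\theta)
```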
- An Attention Ensemble Approach for Efficient Text Classification of Indian Languages
This paper focuses on the coarse-grained technical domain identification of short text documents in Marathi, a Devanagari script-based Indian language.
A hybrid CNN-BiLSTM attention ensemble model is proposed that competently combines the intermediate sentence representations generated by the convolutional neural network and the bidirectional long short-term memory, leading to efficient text classification.
Experimental results show that the proposed model outperforms various baseline machine learning and deep learning models on the given task, achieving a best validation accuracy of 89.57% and an F1-score of 0.8875.
arXiv Detail & Related papers (2021-02-20T07:31:38Z)
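A hybrid of this shape can be sketched in a few lines of PyTorch: a convolutional branch and a BiLSTM branch share an embedding, an attention layer summarizes the BiLSTM states, and the two intermediate representations are concatenated for classification. Layer sizes and the fusion scheme below are illustrative, not the paper's configuration.

```python
# Illustrative hybrid CNN-BiLSTM classifier with attention over the
# BiLSTM states; hyperparameters are placeholders, not the paper's.
import torch
import torch.nn as nn

class CNNBiLSTMEnsemble(nn.Module):
    def __init__(self, vocab_size, embed_dim=128, hidden=64, n_classes=5):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.conv = nn.Conv1d(embed_dim, hidden, kernel_size=3, padding=1)
        self.bilstm = nn.LSTM(embed_dim, hidden, bidirectional=True,
                              batch_first=True)
        self.attn = nn.Linear(2 * hidden, 1)    # attention over BiLSTM states
        self.fc = nn.Linear(hidden + 2 * hidden, n_classes)

    def forward(self, x):                        # x: (batch, seq_len) token ids
        e = self.embed(x)                        # (batch, seq, embed)
        # CNN branch: convolve then max-pool over time.
        c = torch.relu(self.conv(e.transpose(1, 2))).max(dim=2).values
        # BiLSTM branch: attention-weighted sum of hidden states.
        h, _ = self.bilstm(e)                    # (batch, seq, 2*hidden)
        w = torch.softmax(self.attn(h), dim=1)   # attention weights
        a = (w * h).sum(dim=1)                   # attended summary
        # Fuse the two intermediate sentence representations.
        return self.fc(torch.cat([c, a], dim=1))
```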
- Explicit Alignment Objectives for Multilingual Bidirectional Encoders
We present a new method for learning multilingual encoders, AMBER (Aligned Multilingual Bi-directional EncodeR).
AMBER is trained on additional parallel data using two explicit alignment objectives that align the multilingual representations at different granularities.
Experimental results show that AMBER obtains gains of up to 1.1 average F1 score on sequence tagging and up to 27.3 average accuracy on retrieval over the XLMR-large model.
arXiv Detail & Related papers (2020-10-15T18:34:13Z)
- FiSSA at SemEval-2020 Task 9: Fine-tuned For Feelings
In this paper, we present our approach for sentiment classification on Spanish-English code-mixed social media data.
We explore both monolingual and multilingual models with the standard fine-tuning method.
Although two-step fine-tuning improves sentiment classification performance over the base model, the large multilingual XLM-RoBERTa model achieves the best weighted F1-score.
arXiv Detail & Related papers (2020-07-24T14:48:27Z)