Efficient Multilingual Dialogue Processing via Translation Pipelines and Distilled Language Models
- URL: http://arxiv.org/abs/2601.09059v1
- Date: Wed, 14 Jan 2026 01:02:06 GMT
- Title: Efficient Multilingual Dialogue Processing via Translation Pipelines and Distilled Language Models
- Authors: Santiago Martínez Novoa, Nicolás Rozo Fajardo, Diego Alejandro González Vargas, Nicolás Bedoya Figueroa,
- Abstract summary: This paper presents team Kl33n3x's multilingual dialogue summarization and question answering system developed for the NLPAI4Health 2025 shared task. The approach employs a three-stage pipeline: forward translation from Indic languages to English, multitask text generation using a 2.55B parameter distilled language model, and reverse translation back to source languages. The system achieved strong win rates across the competition's tasks, with particularly robust performance on Marathi (86.7% QnA), Tamil (86.7% QnA), and Hindi (80.0% QnA).
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: This paper presents team Kl33n3x's multilingual dialogue summarization and question answering system developed for the NLPAI4Health 2025 shared task. The approach employs a three-stage pipeline: forward translation from Indic languages to English, multitask text generation using a 2.55B parameter distilled language model, and reverse translation back to source languages. By leveraging knowledge distillation techniques, this work demonstrates that compact models can achieve highly competitive performance across nine languages. The system achieved strong win rates across the competition's tasks, with particularly robust performance on Marathi (86.7% QnA), Tamil (86.7% QnA), and Hindi (80.0% QnA), demonstrating the effectiveness of translation-based approaches for low-resource language processing without task-specific fine-tuning.
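The three-stage pipeline described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the team's implementation: the function name `process_dialogue`, the `translate` and `generate` callables, the prompt wording, and the language codes are all hypothetical stand-ins for whatever translation system and distilled model the authors actually used.

```python
from typing import Callable, Optional

# Hypothetical interfaces (assumptions, not taken from the paper):
#   translate(text, src_lang, tgt_lang) -> translated text
#   generate(english_prompt)            -> English model output
Translate = Callable[[str, str, str], str]
Generate = Callable[[str], str]


def process_dialogue(
    dialogue: str,
    source_lang: str,
    task: str,
    translate: Translate,
    generate: Generate,
    question: Optional[str] = None,
) -> str:
    """Translate to English, generate in English, translate back."""
    # Stage 1: forward translation from the source language to English.
    english_dialogue = translate(dialogue, source_lang, "en")

    # Stage 2: one model handles both tasks, selected by the prompt.
    # The prompt templates below are illustrative assumptions.
    if task == "summarize":
        prompt = f"Summarize the following dialogue:\n{english_dialogue}"
    elif task == "qna":
        if question is None:
            raise ValueError("task 'qna' requires a question")
        english_question = translate(question, source_lang, "en")
        prompt = (
            f"Dialogue:\n{english_dialogue}\n"
            f"Question: {english_question}\nAnswer:"
        )
    else:
        raise ValueError(f"unknown task: {task}")
    english_output = generate(prompt)

    # Stage 3: reverse translation back to the source language.
    return translate(english_output, "en", source_lang)


# Stub components so the control flow can be exercised without any model.
def stub_translate(text: str, src: str, tgt: str) -> str:
    return f"[{src}->{tgt}] {text}"


def stub_generate(prompt: str) -> str:
    return "OUTPUT"


summary = process_dialogue(
    "नमस्ते", "hi", "summarize", stub_translate, stub_generate
)
print(summary)
```

Passing the translator and generator in as callables keeps the sketch independent of any particular library; real components would replace the two stubs.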
Related papers
- Seed-X: Building Strong Multilingual Translation LLM with 7B Parameters [53.59868121093848]
We introduce Seed-X, a family of open-source large language models (LLMs) with 7B parameters. The base model is pre-trained on a diverse, high-quality dataset encompassing both monolingual and bilingual content across 28 languages. The instruct model is then finetuned to translate by Chain-of-Thought (CoT) reasoning and further enhanced through reinforcement learning (RL) to achieve better generalization across diverse language pairs.
arXiv Detail & Related papers (2025-07-18T03:19:43Z)
- Machine Translation Advancements of Low-Resource Indian Languages by Transfer Learning [9.373815852241648]
We employ two distinct knowledge transfer strategies to develop a reliable machine translation system for low-resource Indian languages.
For Assamese (as) and Manipuri (mn), we fine-tuned the existing open-source IndicTrans2 model to enable bidirectional translation between English and these languages.
For Khasi (kh) and Mizo (mz), we trained a multilingual model as a baseline using bilingual data from these four language pairs, along with about 8kw of additional English-Bengali bilingual data.
arXiv Detail & Related papers (2024-09-24T08:53:19Z)
- Cross-lingual transfer of multilingual models on low resource African Languages [0.20793001310272596]
Monolingual models, trained on a single language, may better capture the nuances of the target language.
AfriBERT achieved the highest cross-lingual accuracy of 88.3% after fine-tuning.
BiGRU emerged as the best-performing neural model with 83.3% accuracy.
arXiv Detail & Related papers (2024-09-17T08:05:40Z)
- ComSL: A Composite Speech-Language Model for End-to-End Speech-to-Text Translation [79.66359274050885]
We present ComSL, a speech-language model built atop a composite architecture of public pretrained speech-only and language-only models.
Our approach has demonstrated effectiveness in end-to-end speech-to-text translation tasks.
arXiv Detail & Related papers (2023-05-24T07:42:15Z)
- KInITVeraAI at SemEval-2023 Task 3: Simple yet Powerful Multilingual Fine-Tuning for Persuasion Techniques Detection [0.0]
This paper presents the best-performing solution to subtask 3 of SemEval 2023 Task 3, dedicated to persuasion techniques detection.
Due to the highly multilingual character of the input data and the large number of predicted labels (23), we opted for fine-tuning pre-trained transformer-based language models.
arXiv Detail & Related papers (2023-04-24T09:06:43Z)
- Efficiently Aligned Cross-Lingual Transfer Learning for Conversational Tasks using Prompt-Tuning [98.60739735409243]
Cross-lingual transfer of language models trained on high-resource languages like English has been widely studied for many NLP tasks.
We introduce XSGD, a parallel and large-scale multilingual conversation dataset, for cross-lingual alignment pretraining.
To facilitate aligned cross-lingual representations, we develop an efficient prompt-tuning-based method for learning alignment prompts.
arXiv Detail & Related papers (2023-04-03T18:46:01Z)
- Multitask Finetuning for Improving Neural Machine Translation in Indian Languages [0.0]
Transformer-based language models have led to impressive results across all domains in Natural Language Processing.
We propose a Multitask Finetuning methodology which combines the Bilingual Machine Translation task with an auxiliary Causal Language Modeling task to improve performance.
arXiv Detail & Related papers (2021-12-03T06:43:56Z)
- Cross-lingual Transfer for Speech Processing using Acoustic Language Similarity [81.51206991542242]
Cross-lingual transfer offers a compelling way to help bridge this digital divide.
Current cross-lingual algorithms have shown success in text-based tasks and speech-related tasks over some low-resource languages.
We propose a language similarity approach that can efficiently identify acoustic cross-lingual transfer pairs across hundreds of languages.
arXiv Detail & Related papers (2021-11-02T01:55:17Z)
- Exploring Teacher-Student Learning Approach for Multi-lingual Speech-to-Intent Classification [73.5497360800395]
We develop an end-to-end system that supports multiple languages.
We exploit knowledge from a pre-trained multi-lingual natural language processing model.
arXiv Detail & Related papers (2021-09-28T04:43:11Z)
- Cross-lingual Machine Reading Comprehension with Language Branch Knowledge Distillation [105.41167108465085]
Cross-lingual Machine Reading Comprehension (CLMRC) remains a challenging problem due to the lack of large-scale datasets in low-resource languages.
We propose a novel augmentation approach named Language Branch Machine Reading Comprehension (LBMRC).
LBMRC trains multiple machine reading comprehension (MRC) models, each proficient in an individual language.
We devise a multilingual distillation approach to amalgamate knowledge from multiple language branch models to a single model for all target languages.
arXiv Detail & Related papers (2020-10-27T13:12:17Z)