Fugu-MT Paper Translation (Abstract): Learning the Topic, Not the Language: How LLMs Classify Online Immigration Discourse Across Languages

Paper overview: Learning the Topic, Not the Language: How LLMs Classify Online Immigration Discourse Across Languages

arxiv url: http://arxiv.org/abs/2508.06435v1
Date: Fri, 08 Aug 2025 16:23:24 GMT
Status: Translation complete
Last updated in system: 2025-08-11 20:39:06.302724
Title: Learning the Topic, Not the Language: How LLMs Classify Online Immigration Discourse Across Languages
Authors: Andrea Nasuto, Stefano Maria Iacus, Francisco Rowe, Devika Jain
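Abstract summary: Large language models (LLMs) are transforming social-science research by enabling scalable, precise analysis. We fine-tune lightweight LLaMA 3.2-3B models on monolingual, bilingual, or multilingual data sets to classify immigration-related tweets. We evaluate whether minimal language-specific fine-tuning enables cross-lingual topic detection and whether adding targeted languages corrects pre-training biases.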
Reference score (independently computed attention measure): 0.0
License: http://creativecommons.org/licenses/by-nc-nd/4.0/
Abstract: Large language models (LLMs) are transforming social-science research by enabling scalable, precise analysis. Their adaptability raises the question of whether knowledge acquired through fine-tuning in a few languages can transfer to unseen languages that only appeared during pre-training. To examine this, we fine-tune lightweight LLaMA 3.2-3B models on monolingual, bilingual, or multilingual data sets to classify immigration-related tweets from X/Twitter across 13 languages, a domain characterised by polarised, culturally specific discourse. We evaluate whether minimal language-specific fine-tuning enables cross-lingual topic detection and whether adding targeted languages corrects pre-training biases. Results show that LLMs fine-tuned in one or two languages can reliably classify immigration-related content in unseen languages. However, identifying whether a tweet expresses a pro- or anti-immigration stance benefits from multilingual fine-tuning. Pre-training bias favours dominant languages, but even minimal exposure to under-represented languages during fine-tuning (as little as $9.62\times10^{-11}$ of the original pre-training token volume) yields significant gains. These findings challenge the assumption that cross-lingual mastery requires extensive multilingual training: limited language coverage suffices for topic-level generalisation, and structural biases can be corrected with lightweight interventions. By releasing 4-bit-quantised, LoRA fine-tuned models, we provide an open-source, reproducible alternative to proprietary LLMs that delivers 35 times faster inference at just 0.00000989% of the dollar cost of the OpenAI GPT-4o model, enabling scalable, inclusive research.
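The magnitudes quoted in the abstract are easier to read once converted into plain ratios. The following is a minimal sketch of that arithmetic, not the authors' calculation: the three quoted figures come from the abstract, while `ASSUMED_PRETRAIN_TOKENS` is a hypothetical placeholder for the pre-training corpus size, which the abstract does not state.

```python
# Figures quoted in the abstract.
FINE_TUNE_FRACTION = 9.62e-11        # fine-tuning exposure as a share of pre-training tokens
COST_PERCENT_OF_GPT4O = 0.00000989   # dollar cost, as a percentage of GPT-4o's cost
SPEEDUP = 35                         # inference speed-up factor

# ASSUMPTION: hypothetical pre-training corpus size in tokens.
# This value is illustrative only and is not given in the abstract.
ASSUMED_PRETRAIN_TOKENS = 9e12


def percent_to_fraction(percent: float) -> float:
    """Convert a percentage into a plain fraction."""
    return percent / 100.0


def implied_tokens(fraction: float, pretrain_tokens: float) -> float:
    """Token count implied by a fraction of an assumed corpus size."""
    return fraction * pretrain_tokens


cost_fraction = percent_to_fraction(COST_PERCENT_OF_GPT4O)
cost_ratio = 1.0 / cost_fraction  # how many times cheaper than GPT-4o
tokens = implied_tokens(FINE_TUNE_FRACTION, ASSUMED_PRETRAIN_TOKENS)

print(f"Cost as a fraction of GPT-4o: {cost_fraction:.3e}")
print(f"Implied cost ratio: about {cost_ratio:,.0f}x cheaper")
print(f"Implied fine-tuning tokens under the assumed corpus size: {tokens:,.1f}")
print(f"Quoted inference speed-up: {SPEEDUP}x")
```

The implied token count scales linearly with the assumed corpus size, so it should be read as an order-of-magnitude illustration only.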
