Fugu-MT Paper Translation (Summary): PESTS: Persian_English Cross Lingual Corpus for Semantic Textual Similarity

Paper summary: PESTS: Persian_English Cross Lingual Corpus for Semantic Textual Similarity

arxiv url: http://arxiv.org/abs/2305.07893v2
Date: Fri, 29 Sep 2023 16:12:29 GMT
Status: Translation complete
Last updated in system: 2023-10-02 18:58:31.397697
Title: PESTS: Persian_English Cross Lingual Corpus for Semantic Textual Similarity
Authors: Mohammad Abdous, Poorya Piroozfar, Behrouz Minaei Bidgoli
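Abstract summary: Cross-lingual semantic similarity models rely on machine translation because cross-lingual semantic similarity datasets are unavailable. Persian is a low-resource language, and the need for a model that can understand the context of two languages is felt more than ever. In this paper, a corpus of semantic similarity between Persian and English sentences was created for the first time with the help of linguistic experts.
Reference score (independently computed attention measure): 6.113459147063378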
License: http://creativecommons.org/licenses/by-nc-nd/4.0/
Abstract: Semantic textual similarity is a component of natural language processing that has received a great deal of attention recently. In computational linguistics and natural language processing, assessing the semantic similarity of words, phrases, paragraphs, and texts is crucial. Semantic similarity is the task of calculating the degree of semantic resemblance between two pieces of text, paragraphs, or phrases, in either a monolingual or a cross-lingual setting. Cross-lingual semantic similarity requires corpora containing sentence pairs in the source and target languages, each annotated with a degree of semantic similarity. Because such datasets are unavailable, many existing cross-lingual semantic similarity models rely on machine translation, and the propagation of machine translation errors reduces the accuracy of the model. Conversely, when semantic similarity features are to be used for machine translation, the same machine translations should not be used to compute semantic similarity. For Persian, a low-resource language, no effort has been made in this regard, and the need for a model that can understand the context of both languages is felt more than ever. In this article, a corpus of semantic textual similarity between Persian and English sentences has been produced for the first time with the help of linguistic experts. We named this dataset PESTS (Persian English Semantic Textual Similarity). The corpus contains 5375 sentence pairs. Different transformer-based models have also been fine-tuned using this dataset. The results show that, using the PESTS dataset, the Pearson correlation of the XLM-RoBERTa model increases from 85.87% to 95.62%.
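The abstract reports its results as a Pearson correlation between model predictions and human similarity judgments. As a rough illustration of that metric only, the sketch below computes Pearson's r in plain Python. The function name `pearson`, the 0 to 5 score range, and the sample values are assumptions made for this example; they are not taken from the paper or the PESTS dataset.

```python
import math
from typing import Sequence


def pearson(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation coefficient between two equal-length score lists."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length sequences with at least 2 items")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Sum of co-deviations and sums of squared deviations from the means.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant scores")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical gold labels and model predictions (illustrative values only,
# assuming a 0-5 similarity scale; not drawn from PESTS).
gold = [0.0, 1.5, 3.0, 4.0, 5.0]
predicted = [0.4, 1.2, 2.9, 3.6, 4.8]
print(f"Pearson r = {pearson(gold, predicted):.4f}")
```

A value reported as a percentage, as in the abstract, would correspond to r multiplied by 100. In practice a library routine such as `scipy.stats.pearsonr` would usually be used instead of a hand-written function.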

Related papers

Training Models on Dialects of Translationese Shows How Lexical Diversity and Source-Target Syntactic Similarity Shape Learning [0.6599344783327054]
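We study the effect of training on machine-translated data on small English models. We train models on English text translated from 24 typologically and resource-diverse source languages.
Paper reference translation (metadata) (2026-02-18T13:59:08Z)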
False Friends Are Not Foes: Investigating Vocabulary Overlap in Multilingual Language Models [53.01170039144264]
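Subword tokenizers trained on multilingual corpora naturally produce tokens that overlap across languages. Does token overlap facilitate cross-lingual transfer, or does it introduce interference between languages? With models whose vocabularies conflict, overlapping results are obtained.
Paper reference translation (metadata) (2025-09-23T07:47:54Z)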
Languages in Multilingual Speech Foundation Models Align Both Phonetically and Semantically [58.019484208091534]
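Cross-lingual alignment in pretrained language models (LMs) has enabled efficient transfer in text-based LMs. Whether the findings and methods of text-based cross-lingual alignment apply to speech remains an open question.
Paper reference translation (metadata) (2025-05-26T07:21:20Z)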
Tomato, Tomahto, Tomate: Measuring the Role of Shared Semantics among Subwords in Multilingual Language Models [88.07940818022468]
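We take a first step toward measuring the role of shared semantics among subwords in encoder-only multilingual language models (mLMs). We merge semantically similar subwords and their embeddings to form "semantic tokens". Inspection of the grouped subwords shows various kinds of semantic similarity.
Paper reference translation (metadata) (2024-11-07T08:38:32Z)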
FarSSiBERT: A Novel Transformer-based Model for Semantic Similarity Measurement of Persian Social Networks Informal Texts [0.0]
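This paper proposes a new transformer-based model for measuring the semantic similarity of informal short Persian texts from social media. It is pretrained on approximately 99 million informal short Persian texts from social networks, making it one of a kind for the Persian language. The proposed method outperformed ParsBERT, LaBSE, and multilingual BERT on the Pearson and Spearman coefficient criteria.
Paper reference translation (metadata) (2024-07-27T05:04:49Z)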
Syntax and Semantics Meet in the "Middle": Probing the Syntax-Semantics Interface of LMs Through Agentivity [68.8204255655161]
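As a case study for exploring such interactions, we present the semantic notion of agentivity. This suggests that LMs may serve as more useful tools for linguistic annotation, theory testing, and discovery.
Paper reference translation (metadata) (2023-05-29T16:24:01Z)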
Beyond Contrastive Learning: A Variational Generative Model for Multilingual Retrieval [109.62363167257664]
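This paper proposes a generative model for learning multilingual text embeddings. Our model operates on parallel data in $N$ languages. We evaluate the method on a range of tasks including semantic similarity, bitext mining, and cross-lingual question retrieval.
Paper reference translation (metadata) (2022-12-21T02:41:40Z)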
Quantifying Synthesis and Fusion and their Impact on Machine Translation [79.61874492642691]
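In natural language processing (NLP), it is common to label whole languages as having a strict morphological type, such as fusional or agglutinative. In this work, we propose to reduce the rigidity of such claims by quantifying morphological typology at the word and segment level. We examine unsupervised and supervised morphological segmentation methods for English, German, and Turkish, while for fusion we propose a semi-automatic method using Spanish. We then analyze the relationship between machine translation quality and the degree of synthesis and fusion in words (nouns and verbs).
Paper reference translation (metadata) (2022-05-06T17:04:58Z)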
A Massively Multilingual Analysis of Cross-linguality in Shared Embedding Space [61.18554842370824]
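In cross-lingual models, representations for many different languages live in the same space. We compute a task-based measure of cross-lingual alignment in the form of bitext retrieval performance. We examine linguistic, quasi-linguistic, and training-related features as potential predictors of these alignment metrics.
Paper reference translation (metadata) (2021-09-13T21:05:37Z)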
Aligning Cross-lingual Sentence Representations with Dual Momentum Contrast [12.691501386854094]
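This paper proposes aligning sentence representations from different languages into a unified embedding space in which semantic similarity can be computed with a simple dot product. As the experimental results show, the sentence representations produced by our model achieve a new state of the art on multiple tasks.
Paper reference translation (metadata) (2021-09-01T08:48:34Z)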
Learning Contextualised Cross-lingual Word Embeddings and Alignments for Extremely Low-Resource Languages Using Parallel Corpora [63.5286019659504]
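We propose a method for learning contextualised cross-lingual word embeddings based on a small parallel corpus. The method obtains word embeddings with an LSTM encoder-decoder model that simultaneously translates and reconstructs an input sentence.
Paper reference translation (metadata) (2020-10-27T22:24:01Z)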
A Deep Reinforced Model for Zero-Shot Cross-Lingual Summarization with Bilingual Semantic Similarity Rewards [40.17497211507507]
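Cross-lingual text summarization is a practically important but underexplored task. This paper proposes an end-to-end text summarization model.
Paper reference translation (metadata) (2020-06-27T21:51:38Z)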

The related papers list is generated automatically from the titles and abstracts of papers on this site.
