Mining Word Boundaries from Speech-Text Parallel Data for Cross-domain Chinese Word Segmentation
- URL: http://arxiv.org/abs/2412.09045v1
- Date: Thu, 12 Dec 2024 08:13:32 GMT
- Title: Mining Word Boundaries from Speech-Text Parallel Data for Cross-domain Chinese Word Segmentation
- Authors: Xuebin Wang, Lei Zhang, Zhenghua Li, Shilin Zhou, Chen Gong, Yang Hou,
- Abstract summary: We employ the Montreal Forced Aligner (MFA) toolkit to perform character-level alignment on speech-text data.
Based on analysis of collected pauses, we propose an effective probability-based strategy for filtering unreliable word boundaries.
We have annotated about 1,000 sentences as the evaluation data of AISHELL2.
- Abstract: Inspired by early research on exploring naturally annotated data for Chinese Word Segmentation (CWS), and also by recent research on integration of speech and text processing, this work for the first time proposes to explicitly mine word boundaries from speech-text parallel data. We employ the Montreal Forced Aligner (MFA) toolkit to perform character-level alignment on speech-text data, giving pauses as candidate word boundaries. Based on detailed analysis of collected pauses, we propose an effective probability-based strategy for filtering unreliable word boundaries. To more effectively utilize word boundaries as extra training data, we also propose a robust complete-then-train (CTT) strategy. We conduct cross-domain CWS experiments on two target domains, i.e., ZX and AISHELL2. We have annotated about 1,000 sentences as the evaluation data of AISHELL2. Experiments demonstrate the effectiveness of our proposed approach.
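The abstract's first step, turning character-level alignments into candidate word boundaries via pauses, can be illustrated with a minimal sketch. Everything here is an assumption for illustration: the `(character, start, end)` tuple layout, the function names, the example timings, and the fixed `min_pause` threshold. The paper itself filters pauses with a probability-based strategy that the abstract does not specify, so the plain duration cutoff below is only a stand-in, and no MFA API is called.

```python
from typing import List, Tuple

# Hypothetical layout: (character, start_sec, end_sec) for each character,
# similar to what a forced aligner could report. Not MFA's actual output format.
CharInterval = Tuple[str, float, float]


def candidate_boundaries(alignment: List[CharInterval],
                         min_pause: float = 0.05) -> List[int]:
    """Return indices i such that a pause of at least `min_pause` seconds
    falls between character i and character i + 1.

    `min_pause` is an illustrative stand-in for the paper's
    probability-based filtering, whose details the abstract omits.
    """
    boundaries = []
    for i in range(len(alignment) - 1):
        # Pause = start of the next character minus end of the current one.
        gap = alignment[i + 1][1] - alignment[i][2]
        if gap >= min_pause:
            boundaries.append(i)
    return boundaries


def render(alignment: List[CharInterval], boundaries: List[int]) -> str:
    """Join the characters, inserting '|' after each boundary index."""
    marked = set(boundaries)
    pieces = []
    for i, (char, _start, _end) in enumerate(alignment):
        pieces.append(char)
        if i in marked:
            pieces.append("|")
    return "".join(pieces)


# Made-up timings: a 0.15 s pause separates the second and third characters.
example = [
    ("我", 0.00, 0.20),
    ("爱", 0.20, 0.40),
    ("北", 0.55, 0.70),
    ("京", 0.70, 0.90),
]
```

Positions without a detected pause are left undetermined rather than marked as word-internal, so the result is a partial annotation; this appears to be the situation the complete-then-train (CTT) strategy is meant to address.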
Related papers
- Scalable and Domain-General Abstractive Proposition Segmentation [20.532804009152255]
We focus on the task of abstractive proposition segmentation (APS): transforming text into simple, self-contained, well-formed sentences.
We first introduce evaluation metrics for the task to measure several dimensions of quality.
We then propose a scalable, yet accurate, proposition segmentation model.
arXiv Detail & Related papers (2024-06-28T10:24:31Z)
- Spotting AI's Touch: Identifying LLM-Paraphrased Spans in Text [61.22649031769564]
We propose a novel framework, paraphrased text span detection (PTD).
PTD aims to identify paraphrased text spans within a text.
We construct a dedicated dataset, PASTED, for paraphrased text span detection.
arXiv Detail & Related papers (2024-05-21T11:22:27Z)
- Cross-lingual Contextualized Phrase Retrieval [63.80154430930898]
We propose a new task formulation of dense retrieval, cross-lingual contextualized phrase retrieval.
We train our Cross-lingual Contextualized Phrase Retriever (CCPR) using contrastive learning.
On the phrase retrieval task, CCPR surpasses baselines by a significant margin, achieving a top-1 accuracy that is at least 13 points higher.
arXiv Detail & Related papers (2024-03-25T14:46:51Z)
- Augmenting text for spoken language understanding with Large Language Models [13.240782495441275]
We show how to use transcript-semantic parse data (unpaired text) without corresponding speech.
Experiments show that unpaired text from existing and new domains improves performance by 2% and 30% in absolute Exact Match (EM) respectively.
We propose to prompt Large Language Models (LLMs) to generate unpaired text for existing and new domains.
arXiv Detail & Related papers (2023-09-17T22:25:34Z)
- Mining Word Boundaries in Speech as Naturally Annotated Word Segmentation Data [41.494096583913105]
Inspired by early research on exploring naturally annotated data for Chinese word segmentation (CWS), this work proposes to mine word boundaries from parallel speech/text data.
First, we collect parallel speech/text data from two Internet sources that are related to the CWS data used in our experiments.
We obtain character-level alignments and design simple rules for determining word boundaries according to pause duration between adjacent characters.
arXiv Detail & Related papers (2022-10-31T08:02:21Z)
- Towards End-to-End Unified Scene Text Detection and Layout Analysis [60.68100769639923]
We introduce the task of unified scene text detection and layout analysis.
The first hierarchical scene text dataset is introduced to enable this novel research task.
We also propose a novel method that is able to simultaneously detect scene text and form text clusters in a unified way.
arXiv Detail & Related papers (2022-03-28T23:35:45Z)
- Extracting and filtering paraphrases by bridging natural language inference and paraphrasing [0.0]
We propose a novel methodology for the extraction of paraphrasing datasets from NLI datasets and cleaning existing paraphrasing datasets.
The results show high quality of extracted paraphrasing datasets and surprisingly high noise levels in two existing paraphrasing datasets.
arXiv Detail & Related papers (2021-11-13T14:06:37Z)
- Abstractive Summarization of Spoken and Written Instructions with BERT [66.14755043607776]
We present the first application of the BERTSum model to conversational language.
We generate abstractive summaries of narrated instructional videos across a wide variety of topics.
We envision this integrated as a feature in intelligent virtual assistants, enabling them to summarize both written and spoken instructional content upon request.
arXiv Detail & Related papers (2020-08-21T20:59:34Z)
- Learning to Select Bi-Aspect Information for Document-Scale Text Content Manipulation [50.01708049531156]
We focus on a new practical task, document-scale text content manipulation, which is the opposite of text style transfer.
In detail, the input is a set of structured records and a reference text for describing another recordset.
The output is a summary that accurately describes the partial content in the source recordset in the same writing style as the reference.
arXiv Detail & Related papers (2020-02-24T12:52:10Z)