Mining Word Boundaries in Speech as Naturally Annotated Word
Segmentation Data
- URL: http://arxiv.org/abs/2210.17122v2
- Date: Mon, 30 Oct 2023 11:11:46 GMT
- Authors: Lei Zhang, Zhenghua Li, Shilin Zhou, Chen Gong, Zhefeng Wang, Baoxing
Huai, Min Zhang
- Abstract summary: Inspired by early research on exploring naturally annotated data for Chinese word segmentation (CWS), this work proposes to mine word boundaries from parallel speech/text data.
First, we collect parallel speech/text data from two Internet sources that are related to the CWS data used in our experiments.
We then obtain character-level alignments and design simple rules for determining word boundaries based on the pause duration between adjacent characters.
- Score: 41.494096583913105
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Inspired by early research on exploring naturally annotated data for Chinese
word segmentation (CWS), and also by recent research on the integration of speech
and text processing, this work proposes, for the first time, to mine word
boundaries from parallel speech/text data. First, we collect parallel
speech/text data from two Internet sources that are related to the CWS data used
in our experiments. Then, we obtain character-level alignments and design
simple heuristic rules for determining word boundaries according to pause
duration between adjacent characters. Finally, we present an effective
complete-then-train strategy that can better utilize extra naturally annotated
data for model training. Experiments demonstrate that our approach can significantly
boost CWS performance in both cross-domain and low-resource scenarios.
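The pause-based rule described in the abstract can be illustrated with a minimal sketch. It assumes each character already has start and end times from an aligner; the `CharSpan` type, the `pause_boundaries` function, the 0.1-second threshold, and the timings in `example` are all hypothetical choices for illustration, not the paper's actual rules or data. Gaps shorter than the threshold are left undecided rather than treated as "no boundary", since a missing pause does not by itself rule out a word boundary.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class CharSpan:
    """One aligned character with its start/end time in seconds (hypothetical type)."""
    char: str
    start: float
    end: float


def pause_boundaries(spans: List[CharSpan], min_pause: float = 0.1) -> List[int]:
    """Return every index i where the silence between spans[i] and spans[i + 1]
    lasts at least min_pause seconds.

    The threshold is an illustrative assumption, not a value from the paper.
    Positions that are not returned are undecided, not confirmed non-boundaries.
    """
    boundaries = []
    for i in range(len(spans) - 1):
        pause = spans[i + 1].start - spans[i].end
        if pause >= min_pause:
            boundaries.append(i)
    return boundaries


# Made-up timings for a four-character sentence.
example = [
    CharSpan("我", 0.00, 0.20),
    CharSpan("爱", 0.22, 0.40),
    CharSpan("北", 0.60, 0.78),
    CharSpan("京", 0.79, 0.95),
]

# Only the 0.2-second gap after the second character exceeds the threshold.
print(pause_boundaries(example, min_pause=0.1))
```

The result is a partial annotation: a few boundaries are marked and the remaining positions are left open, which is the kind of naturally annotated data a later training step would need to complete.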
Related papers
- Cross-lingual Contextualized Phrase Retrieval [63.80154430930898]
We propose a new task formulation of dense retrieval, cross-lingual contextualized phrase retrieval.
We train our Cross-lingual Contextualized Phrase Retriever (CCPR) using contrastive learning.
On the phrase retrieval task, CCPR surpasses baselines by a significant margin, achieving a top-1 accuracy that is at least 13 points higher.
arXiv Detail & Related papers (2024-03-25T14:46:51Z)
- A General and Flexible Multi-concept Parsing Framework for Multilingual Semantic Matching [60.51839859852572]
We propose to resolve the text into multi concepts for multilingual semantic matching to liberate the model from the reliance on NER models.
We conduct comprehensive experiments on English datasets QQP and MRPC, and Chinese dataset Medical-SM.
arXiv Detail & Related papers (2024-03-05T13:55:16Z)
- Suffix Retrieval-Augmented Language Modeling [1.8710230264817358]
Causal language modeling (LM) uses word history to predict the next word.
BERT, on the other hand, makes use of bi-directional word information in a sentence to predict words at masked positions.
We propose a novel model that simulates a bi-directional contextual effect in an autoregressive manner.
arXiv Detail & Related papers (2022-11-06T07:53:19Z)
- Knowing Where and What: Unified Word Block Pretraining for Document Understanding [11.46378901674016]
We propose UTel, a language model with Unified TExt and layout pre-training.
Specifically, we propose two pre-training tasks: Surrounding Word Prediction (SWP) for the layout learning, and Contrastive learning of Word Embeddings (CWE) for identifying different word blocks.
In this way, the joint training of Masked Layout-Language Modeling (MLLM) and two newly proposed tasks enables the interaction between semantic and spatial features in a unified way.
arXiv Detail & Related papers (2022-07-28T09:43:06Z)
- Summ^N: A Multi-Stage Summarization Framework for Long Input Dialogues and Documents [13.755637074366813]
Summ^N is a simple, flexible, and effective multi-stage framework for input texts longer than the maximum context lengths of typical pretrained LMs.
It can process input text of arbitrary length by adjusting the number of stages while keeping the LM context size fixed.
Our experiments demonstrate that Summ^N significantly outperforms previous state-of-the-art methods.
arXiv Detail & Related papers (2021-10-16T06:19:54Z)
- Abstractive Summarization of Spoken and Written Instructions with BERT [66.14755043607776]
We present the first application of the BERTSum model to conversational language.
We generate abstractive summaries of narrated instructional videos across a wide variety of topics.
We envision this integrated as a feature in intelligent virtual assistants, enabling them to summarize both written and spoken instructional content upon request.
arXiv Detail & Related papers (2020-08-21T20:59:34Z)
- Extractive Summarization as Text Matching [123.09816729675838]
This paper creates a paradigm shift with regard to the way we build neural extractive summarization systems.
We formulate the extractive summarization task as a semantic text matching problem.
We have driven the state-of-the-art extractive result on CNN/DailyMail to a new level (44.41 in ROUGE-1).
arXiv Detail & Related papers (2020-04-19T08:27:57Z)
- Learning to Select Bi-Aspect Information for Document-Scale Text Content Manipulation [50.01708049531156]
We focus on a new practical task, document-scale text content manipulation, which is the opposite of text style transfer.
In detail, the input is a set of structured records and a reference text for describing another recordset.
The output is a summary that accurately describes the partial content in the source recordset with the same writing style of the reference.
arXiv Detail & Related papers (2020-02-24T12:52:10Z)
This list is automatically generated from the titles and abstracts of the papers on this site.