CATS: A Pragmatic Chinese Answer-to-Sequence Dataset with Large Scale
and High Quality
- URL: http://arxiv.org/abs/2306.11477v1
- Date: Tue, 20 Jun 2023 12:02:26 GMT
- Title: CATS: A Pragmatic Chinese Answer-to-Sequence Dataset with Large Scale
and High Quality
- Authors: Liang Li, Ruiying Geng, Chengyang Fang, Bing Li, Can Ma, Rongyu Cao,
Binhua Li, Fei Huang, Yongbin Li
- Abstract summary: We present CATS, a pragmatic Chinese answer-to-sequence dataset with large scale and high quality.
The dataset aims to generate textual descriptions for the answer in the practical TableQA system.
We propose a Unified Graph Transformation approach to establish a joint encoding space for the two hybrid knowledge resources.
- Score: 42.246771022648765
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: There are three problems existing in the popular data-to-text datasets.
First, the large-scale datasets either contain noise or lack real application
scenarios. Second, the datasets close to real applications are relatively small
in size. Last, current datasets are biased toward the English language while leaving
other languages underexplored. To alleviate these limitations, in this paper,
we present CATS, a pragmatic Chinese answer-to-sequence dataset with large
scale and high quality. The dataset aims to generate textual descriptions for
the answer in the practical TableQA system. Further, to bridge the structural
gap between the input SQL and table and establish better semantic alignments,
we propose a Unified Graph Transformation approach to establish a joint
encoding space for the two hybrid knowledge resources and convert this task to
a graph-to-text problem. The experiment results demonstrate the effectiveness
of our proposed method. Further analysis on CATS attests to both the high
quality and challenges of the dataset.
Related papers
- InfoAffect: A Dataset for Affective Analysis of Infographics [21.63643063062395]
We introduce a 3.5k-sample affect-annotated InfoAffect dataset, which combines textual content with real-world infographics.
Five state-of-the-art multimodal large language models (MLLMs) then analyze both modalities, and their outputs are fused with the Reciprocal Rank Fusion (RRF) algorithm to yield robust affects and confidences.
arXiv Detail & Related papers (2025-11-09T14:35:59Z)
- ChineseWebText 2.0: Large-Scale High-quality Chinese Web Text with Multi-dimensional and fine-grained information [29.57708536491853]
We propose a new tool-chain called MDFG-tool for constructing large-scale and high-quality Chinese datasets with multi-dimensional and fine-grained information.
We release ChineseWebText2.0, the largest high-quality, fine-grained Chinese text dataset, which totals 3.8TB; each text is associated with a quality score, domain labels, a toxicity label and a toxicity score.
arXiv Detail & Related papers (2024-11-29T12:48:49Z)
- Unleashing the Power of LLMs as Multi-Modal Encoders for Text and Graph-Structured Data [42.18348019901044]
Graph-structured information offers rich contextual information that can enhance language models.
Existing methods for integrating graph and text embeddings are limited in their ability to fully exploit the heterogeneous nature of these modalities.
We propose Janus, a framework that leverages Large Language Models (LLMs) to jointly encode text and graph data.
arXiv Detail & Related papers (2024-10-15T03:40:20Z)
- Datasets for Multilingual Answer Sentence Selection [59.28492975191415]
We introduce new high-quality datasets for AS2 in five European languages (French, German, Italian, Portuguese, and Spanish).
Results indicate that our datasets are pivotal in producing robust and powerful multilingual AS2 models.
arXiv Detail & Related papers (2024-06-14T16:50:29Z)
- PAXQA: Generating Cross-lingual Question Answering Examples at Training
Scale [53.92008514395125]
PAXQA (Projecting annotations for cross-lingual (x) QA) decomposes cross-lingual QA into two stages.
We propose a novel use of lexically-constrained machine translation, in which constrained entities are extracted from the parallel bitexts.
We show that models fine-tuned on these datasets outperform prior synthetic data generation models over several extractive QA datasets.
arXiv Detail & Related papers (2023-04-24T15:46:26Z)
- Importance of Synthesizing High-quality Data for Text-to-SQL Parsing [71.02856634369174]
State-of-the-art text-to-SQL algorithms did not further improve on popular benchmarks when trained with augmented synthetic data.
We propose a novel framework that incorporates key relationships from the schema, imposes strong typing, and uses schema-weighted column sampling.
arXiv Detail & Related papers (2022-12-17T02:53:21Z)
- Towards Complex Document Understanding By Discrete Reasoning [77.91722463958743]
Document Visual Question Answering (VQA) aims to understand visually-rich documents to answer questions in natural language.
We introduce a new Document VQA dataset, named TAT-DQA, which consists of 3,067 document pages and 16,558 question-answer pairs.
We develop a novel model named MHST that takes into account the information in multi-modalities, including text, layout and visual image, to intelligently address different types of questions.
arXiv Detail & Related papers (2022-07-25T01:43:19Z)
- Improving Classifier Training Efficiency for Automatic Cyberbullying
Detection with Feature Density [58.64907136562178]
We study the effectiveness of Feature Density (FD) using different linguistically-backed feature preprocessing methods.
We hypothesise that estimating dataset complexity allows for the reduction of the number of required experiments.
The difference in linguistic complexity of datasets allows us to additionally discuss the efficacy of linguistically-backed word preprocessing.
arXiv Detail & Related papers (2021-11-02T15:48:28Z)
- EventNarrative: A large-scale Event-centric Dataset for Knowledge
Graph-to-Text Generation [8.216976747904726]
EventNarrative consists of approximately 230,000 graphs and their corresponding natural language text, 6 times larger than the current largest parallel dataset.
Our aim is two-fold: to help break new ground in event-centric research where data is lacking, and to give researchers a well-defined, large-scale dataset.
arXiv Detail & Related papers (2021-10-30T15:39:20Z)
- Towards More Equitable Question Answering Systems: How Much More Data Do
You Need? [15.401330338654203]
We take a step back and study which approaches allow us to take the most advantage of existing resources in order to produce QA systems in many languages.
Specifically, we perform extensive analysis to measure the efficacy of few-shot approaches augmented with automatic translations and permutations of context-question-answer pairs.
We make suggestions for future dataset development efforts that make better use of a fixed annotation budget, with a goal of increasing the language coverage of QA datasets and systems.
arXiv Detail & Related papers (2021-05-28T21:32:04Z)
- Does Putting a Linguist in the Loop Improve NLU Data Collection? [34.34874979524489]
Crowdsourced NLP datasets contain systematic gaps and biases that are identified only after data collection is complete.
We take natural language inference as a test case and ask whether it is beneficial to put a linguist 'in the loop' during data collection.
arXiv Detail & Related papers (2021-04-15T00:31:10Z)
- Synthetic Data Augmentation for Zero-Shot Cross-Lingual Question
Answering [8.558954185502012]
We propose a method to improve the Cross-lingual Question Answering performance without requiring additional annotated data.
We report a new state-of-the-art on four multilingual datasets: MLQA, XQuAD, SQuAD-it and PIAF (fr).
arXiv Detail & Related papers (2020-10-23T20:09:01Z)
This list is automatically generated from the titles and abstracts of the papers in this site.