Developing PUGG for Polish: A Modern Approach to KBQA, MRC, and IR Dataset Construction
- URL: http://arxiv.org/abs/2408.02337v1
- Date: Mon, 5 Aug 2024 09:23:49 GMT
- Title: Developing PUGG for Polish: A Modern Approach to KBQA, MRC, and IR Dataset Construction
- Authors: Albert Sawczyn, Katsiaryna Viarenich, Konrad Wojtasik, Aleksandra Domogała, Marcin Oleksy, Maciej Piasecki, Tomasz Kajdanowicz
- Abstract summary: We introduce a modern, semi-automated approach for creating datasets, encompassing tasks such as KBQA, Machine Reading Comprehension (MRC), and Information Retrieval (IR).
We provide a comprehensive implementation, insightful findings, detailed statistics, and evaluation of baseline models.
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: Advancements in AI and natural language processing have revolutionized machine-human language interactions, with question answering (QA) systems playing a pivotal role. The knowledge base question answering (KBQA) task, utilizing structured knowledge graphs (KG), allows for handling extensive knowledge-intensive questions. However, a significant gap exists in KBQA datasets, especially for low-resource languages. Many existing construction pipelines for these datasets are outdated and inefficient in their use of human labor, and modern assisting tools like Large Language Models (LLMs) are not utilized to reduce the workload. To address this, we have designed and implemented a modern, semi-automated approach for creating datasets, encompassing tasks such as KBQA, Machine Reading Comprehension (MRC), and Information Retrieval (IR), tailored explicitly for low-resource environments. We executed this pipeline and introduced the PUGG dataset, the first Polish KBQA dataset, and novel datasets for MRC and IR. Additionally, we provide a comprehensive implementation, insightful findings, detailed statistics, and evaluation of baseline models.
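The abstract describes a semi-automated pipeline in which an LLM reduces annotation workload while humans retain final verification. A minimal sketch of one such step, assuming an illustrative workflow (LLM drafts a question from a knowledge-graph triple, a human later approves it); the names, data structures, and prompt wording below are hypothetical and not the actual PUGG implementation:

```python
# Hypothetical sketch of one step in a semi-automated KBQA dataset pipeline:
# an LLM drafts a natural-language question from a knowledge-graph triple,
# and a human annotator verifies it afterward. All names here are illustrative.

from dataclasses import dataclass


@dataclass
class Triple:
    """A Wikidata-style (subject, predicate, object) fact."""
    subject: str
    predicate: str
    obj: str


@dataclass
class Candidate:
    """An LLM-drafted QA pair awaiting human verification."""
    triple: Triple
    question: str
    verified: bool = False  # set True only after human review


def build_prompt(triple: Triple, language: str = "Polish") -> str:
    """Format an instruction prompt asking an LLM to draft a question
    whose answer is the triple's object."""
    return (
        f"Write one natural {language} question whose answer is "
        f"'{triple.obj}', based on the fact: "
        f"{triple.subject} -- {triple.predicate} --> {triple.obj}."
    )


def make_candidate(triple: Triple, llm_draft: str) -> Candidate:
    """Wrap an LLM draft as an unverified candidate for the human queue."""
    return Candidate(triple=triple, question=llm_draft.strip())


triple = Triple("Warszawa", "stolica", "Polska")
prompt = build_prompt(triple)
# In a real pipeline the prompt would be sent to an LLM; a draft is stubbed here.
candidate = make_candidate(triple, "Ktorego kraju stolica jest Warszawa?")
print(prompt)
print(candidate.verified)  # False until a human annotator approves it
```

The design point is the split of labor: the LLM handles the expensive drafting step, while the `verified` flag keeps a human in the loop before anything enters the dataset.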
Related papers
- BabelBench: An Omni Benchmark for Code-Driven Analysis of Multimodal and Multistructured Data
Large language models (LLMs) have become increasingly pivotal across various domains.
BabelBench is an innovative benchmark framework that evaluates the proficiency of LLMs in managing multimodal multistructured data with code execution.
Our experimental findings on BabelBench indicate that even cutting-edge models like ChatGPT 4 exhibit substantial room for improvement.
arXiv Detail & Related papers (2024-10-01T15:11:24Z) - DataAgent: Evaluating Large Language Models' Ability to Answer Zero-Shot, Natural Language Queries
We evaluate OpenAI's GPT-3.5 as a "Language Data Scientist" (LDS)
The model was tested on a diverse set of benchmark datasets to evaluate its performance across multiple standards.
arXiv Detail & Related papers (2024-03-29T22:59:34Z) - Automatic Question-Answer Generation for Long-Tail Knowledge
We propose an automatic approach to generate specialized QA datasets for tail entities.
We conduct extensive experiments by employing pretrained LLMs on our newly generated long-tail QA datasets.
arXiv Detail & Related papers (2024-03-03T03:06:31Z) - Interactive-KBQA: Multi-Turn Interactions for Knowledge Base Question Answering with Large Language Models
Interactive-KBQA is a framework designed to generate logical forms through direct interaction with knowledge bases (KBs)
Our method achieves competitive results on the WebQuestionsSP, ComplexWebQuestions, KQA Pro, and MetaQA datasets.
arXiv Detail & Related papers (2024-02-23T06:32:18Z) - ADMUS: A Progressive Question Answering Framework Adaptable to Multiple Knowledge Sources
We present ADMUS, a progressive knowledge base question answering framework designed to accommodate a wide variety of datasets.
Our framework supports the seamless integration of new datasets with minimal effort, only requiring creating a dataset-related micro-service at a negligible cost.
arXiv Detail & Related papers (2023-08-09T08:46:39Z) - XTREME-UP: A User-Centric Scarce-Data Benchmark for Under-Represented Languages
Data scarcity is a crucial issue for the development of highly multilingual NLP systems.
We propose XTREME-UP, a benchmark defined by its focus on the scarce-data scenario rather than zero-shot.
XTREME-UP evaluates the capabilities of language models across 88 under-represented languages over 9 key user-centric technologies.
arXiv Detail & Related papers (2023-05-19T18:00:03Z) - Cross-Lingual Question Answering over Knowledge Base as Reading Comprehension
Cross-lingual question answering over knowledge base (xKBQA) aims to answer questions in languages different from that of the provided knowledge base.
One of the major challenges facing xKBQA is the high cost of data annotation.
We propose a novel approach for xKBQA in a reading comprehension paradigm.
arXiv Detail & Related papers (2023-02-26T05:52:52Z) - Towards Complex Document Understanding By Discrete Reasoning
Document Visual Question Answering (VQA) aims to understand visually-rich documents to answer questions in natural language.
We introduce a new Document VQA dataset, named TAT-DQA, which consists of 3,067 document pages and 16,558 question-answer pairs.
We develop a novel model named MHST that takes into account the information in multi-modalities, including text, layout and visual image, to intelligently address different types of questions.
arXiv Detail & Related papers (2022-07-25T01:43:19Z) - Towards More Equitable Question Answering Systems: How Much More Data Do You Need?
We take a step back and study which approaches allow us to take the most advantage of existing resources in order to produce QA systems in many languages.
Specifically, we perform extensive analysis to measure the efficacy of few-shot approaches augmented with automatic translations and permutations of context-question-answer pairs.
We make suggestions for future dataset development efforts that make better use of a fixed annotation budget, with a goal of increasing the language coverage of QA datasets and systems.
arXiv Detail & Related papers (2021-05-28T21:32:04Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.