A Hierarchical Approach to exploiting Multiple Datasets from TalkBank
- URL: http://arxiv.org/abs/2306.12596v1
- Date: Wed, 21 Jun 2023 22:37:51 GMT
- Title: A Hierarchical Approach to exploiting Multiple Datasets from TalkBank
- Authors: Man Ho Wong
- Abstract summary: This paper introduces a pipeline framework that employs a hierarchical search approach, enabling efficient complex data selection.
The framework can also be adapted to process data from other open-science platforms.
- Score: 0.0
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: TalkBank is an online database that facilitates the sharing of linguistics
research data. However, TalkBank's existing API has limited data filtering
and batch processing capabilities. To overcome these limitations, this paper
introduces a pipeline framework that employs a hierarchical search approach,
enabling efficient complex data selection. This approach involves a quick
preliminary screening of relevant corpora that a researcher may need, followed
by an in-depth search for target data based on specific criteria. The
identified files are then indexed, providing easier access for future analysis.
Furthermore, the paper demonstrates how data from different studies curated
with the framework can be integrated by standardizing and cleaning metadata,
allowing researchers to extract insights from a large, integrated dataset.
Although designed for TalkBank, the framework can also be adapted to process
data from other open-science platforms.
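The three stages named in the abstract (corpus-level screening, file-level search, indexing) can be illustrated with a minimal Python sketch. This is not the paper's implementation and does not call any TalkBank API: the `Corpus` and `TranscriptFile` classes, the function names, the metadata fields (language, participant age in months), and the toy corpora and file paths are all hypothetical, chosen only to show how a cheap coarse filter can narrow the set of corpora before a more detailed per-file search runs.

```python
from dataclasses import dataclass, field


@dataclass
class TranscriptFile:
    # Hypothetical per-file metadata; real corpora carry far more fields.
    path: str
    participant_age_months: int
    language: str


@dataclass
class Corpus:
    # Hypothetical corpus-level metadata used for the quick screening pass.
    name: str
    languages: set
    files: list = field(default_factory=list)


def screen_corpora(corpora, language):
    """Stage 1: coarse screening that looks only at corpus-level metadata."""
    return [c for c in corpora if language in c.languages]


def search_files(corpora, language, min_age, max_age):
    """Stage 2: detailed search over the files of the screened corpora."""
    return [
        (c.name, f)
        for c in corpora
        for f in c.files
        if f.language == language
        and min_age <= f.participant_age_months <= max_age
    ]


def build_index(matches):
    """Stage 3: index the matched file paths by corpus name for later access."""
    index = {}
    for corpus_name, f in matches:
        index.setdefault(corpus_name, []).append(f.path)
    return index


def hierarchical_search(corpora, language, min_age, max_age):
    screened = screen_corpora(corpora, language)
    return build_index(search_files(screened, language, min_age, max_age))


# Toy data: corpus names and paths are invented for illustration.
CORPORA = [
    Corpus("CorpusA", {"eng"}, [
        TranscriptFile("CorpusA/01.cha", 24, "eng"),
        TranscriptFile("CorpusA/02.cha", 60, "eng"),
    ]),
    Corpus("CorpusB", {"fra"}, [
        TranscriptFile("CorpusB/01.cha", 30, "fra"),
    ]),
    Corpus("CorpusC", {"eng", "spa"}, [
        TranscriptFile("CorpusC/01.cha", 36, "eng"),
        TranscriptFile("CorpusC/02.cha", 36, "spa"),
    ]),
]

index = hierarchical_search(CORPORA, "eng", 18, 48)
print(index)
```

In this sketch the screening pass discards `CorpusB` without opening any of its files, so the file-level search only touches corpora that can contain matches; the resulting dictionary plays the role of the index of identified files.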
Related papers
- BabelBench: An Omni Benchmark for Code-Driven Analysis of Multimodal and Multistructured Data [61.936320820180875]
Large language models (LLMs) have become increasingly pivotal across various domains.
BabelBench is an innovative benchmark framework that evaluates the proficiency of LLMs in managing multimodal multistructured data with code execution.
Our experimental findings on BabelBench indicate that even cutting-edge models like ChatGPT 4 exhibit substantial room for improvement.
arXiv Detail & Related papers (2024-10-01T15:11:24Z)
- Oracle Bone Inscriptions Multi-modal Dataset [58.20314888996118]
Oracle bone inscriptions (OBI) are the earliest developed writing system in China, bearing invaluable written exemplifications of early Shang history and paleography.
This paper proposes an Oracle Bone Inscriptions Multi-modal dataset, which includes annotation information for 10,077 pieces of oracle bones.
This dataset can be used for a variety of AI-related research tasks relevant to the field of OBI, such as OBI Character Detection and Recognition, Rubbing Denoising, Character Matching, Character Generation, Reading Sequence Prediction, Missing Characters Completion task and so on.
arXiv Detail & Related papers (2024-07-04T12:47:32Z)
- DiscoveryBench: Towards Data-Driven Discovery with Large Language Models [50.36636396660163]
We present DiscoveryBench, the first comprehensive benchmark that formalizes the multi-step process of data-driven discovery.
Our benchmark contains 264 tasks collected across 6 diverse domains, such as sociology and engineering.
Our benchmark, thus, illustrates the challenges in autonomous data-driven discovery and serves as a valuable resource for the community to make progress.
arXiv Detail & Related papers (2024-07-01T18:58:22Z)
- UQE: A Query Engine for Unstructured Databases [71.49289088592842]
We investigate the potential of Large Language Models to enable unstructured data analytics.
We propose a new Universal Query Engine (UQE) that directly interrogates and draws insights from unstructured data collections.
arXiv Detail & Related papers (2024-06-23T06:58:55Z)
- IQLS: Framework for leveraging Metadata to enable Large Language Model based queries to complex, versatile Data [0.20482269513546458]
The Intelligent Query and Learning System (IQLS) simplifies data retrieval by allowing queries in natural language.
It maps structured data into a framework based on the available metadata and available data models.
The IQLS enables the agent to fulfill tasks given by the user query through interfaces.
arXiv Detail & Related papers (2024-05-04T13:44:05Z)
- ConvSDG: Session Data Generation for Conversational Search [29.211860955861244]
We propose a framework to explore the feasibility of boosting conversational search by using large language models (LLMs) for session data generation.
Within this framework, we design dialogue/session-level and query-level data generation with unsupervised and semi-supervised learning.
The generated data are used to fine-tune the conversational dense retriever.
arXiv Detail & Related papers (2024-03-17T20:34:40Z)
- Beyond Extraction: Contextualising Tabular Data for Efficient Summarisation by Language Models [0.0]
The conventional use of the Retrieval-Augmented Generation architecture has proven effective for retrieving information from diverse documents.
This research introduces an innovative approach to enhance the accuracy of complex table queries in RAG-based systems.
arXiv Detail & Related papers (2024-01-04T16:16:14Z)
- DataFinder: Scientific Dataset Recommendation from Natural Language Descriptions [100.52917027038369]
We operationalize the task of recommending datasets given a short natural language description.
To facilitate this task, we build the DataFinder dataset which consists of a larger automatically-constructed training set and a smaller expert-annotated evaluation set.
This system, trained on the DataFinder dataset, finds more relevant search results than existing third-party dataset search engines.
arXiv Detail & Related papers (2023-05-26T05:22:36Z)
- QBSUM: a Large-Scale Query-Based Document Summarization Dataset from Real-world Applications [20.507631900617817]
We present QBSUM, a high-quality large-scale dataset consisting of 49,000+ data samples for the task of Chinese query-based document summarization.
We also propose multiple unsupervised and supervised solutions to the task and demonstrate their high-speed inference and superior performance via both offline experiments and online A/B tests.
arXiv Detail & Related papers (2020-10-27T07:30:04Z)
- Conversations with Search Engines: SERP-based Conversational Response Generation [77.1381159789032]
We create a suitable dataset, the Search as a Conversation (SaaC) dataset, for the development of pipelines for conversations with search engines.
We also develop a state-of-the-art pipeline for conversations with search engines, the Conversations with Search Engines (CaSE) using this dataset.
CaSE enhances the state-of-the-art by introducing a supporting token identification module and a prior-aware pointer generator.
arXiv Detail & Related papers (2020-04-29T13:07:53Z)