Related papers: A Hierarchical Approach to exploiting Multiple Datasets from TalkBank

URL: http://arxiv.org/abs/2306.12596v1
Date: Wed, 21 Jun 2023 22:37:51 GMT
Title: A Hierarchical Approach to exploiting Multiple Datasets from TalkBank
Authors: Man Ho Wong
Abstract summary: This paper introduces a pipeline framework that employs a hierarchical search approach, enabling efficient complex data selection. The framework can also be adapted to process data from other open-science platforms.
Score: 0.0
License: http://creativecommons.org/licenses/by-nc-sa/4.0/
Abstract: TalkBank is an online database that facilitates the sharing of linguistics research data. However, TalkBank's existing API has limited data filtering and batch processing capabilities. To overcome these limitations, this paper introduces a pipeline framework that employs a hierarchical search approach, enabling efficient complex data selection. This approach involves a quick preliminary screening of the corpora that may be relevant to a researcher, followed by an in-depth search for target data based on specific criteria. The identified files are then indexed, providing easier access for future analysis. Furthermore, the paper demonstrates how data from different studies curated with the framework can be integrated by standardizing and cleaning metadata, allowing researchers to extract insights from a large, integrated dataset. Although designed for TalkBank, the framework can also be adapted to process data from other open-science platforms.
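The screen-then-search-then-index flow described in the abstract can be illustrated with a minimal sketch. This is not the paper's implementation and does not call TalkBank's API: the `Corpus` and `Transcript` classes, the function names, the metadata fields (language, speaker age in months), and the sample corpus names and file paths are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Transcript:
    # Hypothetical file-level metadata; real files carry far more fields.
    path: str
    speaker_age_months: int


@dataclass
class Corpus:
    # Hypothetical corpus-level metadata used for the quick screening step.
    name: str
    language: str
    transcripts: list = field(default_factory=list)


def screen_corpora(corpora, language):
    """Stage 1: cheap screening that looks only at corpus-level metadata."""
    return [c for c in corpora if c.language == language]


def search_files(corpora, min_age, max_age):
    """Stage 2: in-depth search over the files of the surviving corpora."""
    return [
        (c.name, t)
        for c in corpora
        for t in c.transcripts
        if min_age <= t.speaker_age_months <= max_age
    ]


def build_index(matches):
    """Stage 3: index matched file paths by corpus name for later access."""
    index = {}
    for corpus_name, transcript in matches:
        index.setdefault(corpus_name, []).append(transcript.path)
    return index


def hierarchical_search(corpora, language, min_age, max_age):
    """Run the three stages in order and return the resulting index."""
    candidates = screen_corpora(corpora, language)
    matches = search_files(candidates, min_age, max_age)
    return build_index(matches)


# Made-up sample data (names and paths are placeholders, not real corpora).
SAMPLE_CORPORA = [
    Corpus("CorpusA", "eng", [Transcript("CorpusA/01.cha", 24),
                              Transcript("CorpusA/02.cha", 40)]),
    Corpus("CorpusB", "fra", [Transcript("CorpusB/01.cha", 25)]),
    Corpus("CorpusC", "eng", [Transcript("CorpusC/01.cha", 60)]),
]

if __name__ == "__main__":
    print(hierarchical_search(SAMPLE_CORPORA, "eng", 18, 36))
```

The point of the ordering is that the expensive per-file check in stage 2 only runs on corpora that passed the cheap stage 1 filter, so corpora in the wrong language are never opened.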

Related papers

TARGET: Benchmarking Table Retrieval for Generative Tasks [7.379012456053551]
TARGET is a benchmark for evaluating TAble Retrieval for GEnerative Tasks. We analyze the retrieval performance of different retrievers in isolation, as well as their impact on downstream tasks. We find that dense embedding-based retrievers far outperform a BM25 baseline which is less effective than it is for retrieval over unstructured text.
arXiv Detail & Related papers (2025-05-14T19:39:46Z)
Harmonizing Metadata of Language Resources for Enhanced Querying and Accessibility [0.0]
This paper addresses the harmonization of metadata from diverse repositories of language resources (LRs). Our methodology supports text-based search, faceted browsing, and advanced SPARQL queries through Linghub, a newly developed portal. The study highlights significant metadata issues and advocates for adherence to open vocabularies and standards to enhance metadata harmonization.
arXiv Detail & Related papers (2025-01-09T22:48:43Z)
BabelBench: An Omni Benchmark for Code-Driven Analysis of Multimodal and Multistructured Data [61.936320820180875]
Large language models (LLMs) have become increasingly pivotal across various domains. BabelBench is an innovative benchmark framework that evaluates the proficiency of LLMs in managing multimodal multistructured data with code execution. Our experimental findings on BabelBench indicate that even cutting-edge models like ChatGPT 4 exhibit substantial room for improvement.
arXiv Detail & Related papers (2024-10-01T15:11:24Z)
Oracle Bone Inscriptions Multi-modal Dataset [58.20314888996118]
Oracle bone inscriptions (OBI) are the earliest developed writing system in China, bearing invaluable written exemplifications of early Shang history and paleography. This paper proposes an Oracle Bone Inscriptions Multi-modal dataset, which includes annotation information for 10,077 pieces of oracle bones. This dataset can be used for a variety of AI-related research tasks relevant to the field of OBI, such as OBI Character Detection and Recognition, Rubbing Denoising, Character Matching, Character Generation, Reading Sequence Prediction, and Missing Characters Completion.
arXiv Detail & Related papers (2024-07-04T12:47:32Z)
DiscoveryBench: Towards Data-Driven Discovery with Large Language Models [50.36636396660163]
We present DiscoveryBench, the first comprehensive benchmark that formalizes the multi-step process of data-driven discovery. Our benchmark contains 264 tasks collected across 6 diverse domains, such as sociology and engineering. Our benchmark, thus, illustrates the challenges in autonomous data-driven discovery and serves as a valuable resource for the community to make progress.
arXiv Detail & Related papers (2024-07-01T18:58:22Z)
ACE: A Generative Cross-Modal Retrieval Framework with Coarse-To-Fine Semantic Modeling [53.97609687516371]
We propose a pioneering generAtive Cross-modal rEtrieval framework (ACE) for end-to-end cross-modal retrieval. ACE achieves state-of-the-art performance in cross-modal retrieval and outperforms the strong baselines on Recall@1 by 15.27% on average.
arXiv Detail & Related papers (2024-06-25T12:47:04Z)
UQE: A Query Engine for Unstructured Databases [71.49289088592842]
We investigate the potential of Large Language Models to enable unstructured data analytics. We propose a new Universal Query Engine (UQE) that directly interrogates and draws insights from unstructured data collections.
arXiv Detail & Related papers (2024-06-23T06:58:55Z)
IQLS: Framework for leveraging Metadata to enable Large Language Model based queries to complex, versatile Data [0.20482269513546458]
The Intelligent Query and Learning System (IQLS) simplifies data retrieval by allowing the use of natural language. It maps structured data into a framework based on the available metadata and available data models. The IQLS enables the agent to fulfill tasks given by the user query through interfaces.
arXiv Detail & Related papers (2024-05-04T13:44:05Z)
ConvSDG: Session Data Generation for Conversational Search [29.211860955861244]
We propose a framework to explore the feasibility of boosting conversational search by using large language models (LLMs) for session data generation. Within this framework, we design dialogue/session-level and query-level data generation with unsupervised and semi-supervised learning. The generated data are used to fine-tune the conversational dense retriever.
arXiv Detail & Related papers (2024-03-17T20:34:40Z)
Beyond Extraction: Contextualising Tabular Data for Efficient Summarisation by Language Models [0.0]
The conventional use of the Retrieval-Augmented Generation architecture has proven effective for retrieving information from diverse documents. This research introduces an innovative approach to enhance the accuracy of complex table queries in RAG-based systems.
arXiv Detail & Related papers (2024-01-04T16:16:14Z)
DataFinder: Scientific Dataset Recommendation from Natural Language Descriptions [100.52917027038369]
We operationalize the task of recommending datasets given a short natural language description. To facilitate this task, we build the DataFinder dataset which consists of a larger automatically-constructed training set and a smaller expert-annotated evaluation set. This system, trained on the DataFinder dataset, finds more relevant search results than existing third-party dataset search engines.
arXiv Detail & Related papers (2023-05-26T05:22:36Z)
QBSUM: a Large-Scale Query-Based Document Summarization Dataset from Real-world Applications [20.507631900617817]
We present QBSUM, a high-quality large-scale dataset consisting of 49,000+ data samples for the task of Chinese query-based document summarization. We also propose multiple unsupervised and supervised solutions to the task and demonstrate their high-speed inference and superior performance via both offline experiments and online A/B tests.
arXiv Detail & Related papers (2020-10-27T07:30:04Z)
Conversations with Search Engines: SERP-based Conversational Response Generation [77.1381159789032]
We create a suitable dataset, the Search as a Conversation (SaaC) dataset, for the development of pipelines for conversations with search engines. We also develop a state-of-the-art pipeline for conversations with search engines, the Conversations with Search Engines (CaSE), using this dataset. CaSE enhances the state-of-the-art by introducing a supporting token identification module and a prior-aware pointer generator.
arXiv Detail & Related papers (2020-04-29T13:07:53Z)

This list is automatically generated from the titles and abstracts of the papers on this site.