GAIA Search: Hugging Face and Pyserini Interoperability for NLP Training
Data Exploration
- URL: http://arxiv.org/abs/2306.01481v1
- Date: Fri, 2 Jun 2023 12:09:59 GMT
- Title: GAIA Search: Hugging Face and Pyserini Interoperability for NLP Training
Data Exploration
- Authors: Aleksandra Piktus, Odunayo Ogundepo, Christopher Akiki, Akintunde
Oladipo, Xinyu Zhang, Hailey Schoelkopf, Stella Biderman, Martin Potthast,
Jimmy Lin
- Abstract summary: We discuss how Pyserini can be integrated with the Hugging Face ecosystem of open-source AI libraries and artifacts.
We include a Jupyter Notebook-based walk through the core interoperability features, available on GitHub.
We present GAIA Search - a search engine built following previously laid out principles, giving access to four popular large-scale text collections.
- Score: 97.68234051078997
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Noticing the urgent need to provide tools for fast and user-friendly
qualitative analysis of the large-scale textual corpora of modern NLP, we
propose to turn to the mature and well-tested methods from the domain of
Information Retrieval (IR) - a research field with a long history of tackling
TB-scale document collections. We discuss how Pyserini - a widely used toolkit
for reproducible IR research - can be integrated with the Hugging Face ecosystem
of open-source AI libraries and artifacts. We leverage the existing
functionalities of both platforms while proposing novel features further
facilitating their integration. Our goal is to give NLP researchers tools that
will allow them to develop retrieval-based instrumentation for their data
analytics needs with ease and agility. We include a Jupyter Notebook-based walk
through the core interoperability features, available on GitHub at
https://github.com/huggingface/gaia. We then demonstrate how the ideas we
present can be operationalized to create a powerful tool for qualitative data
analysis in NLP. We present GAIA Search - a search engine built following
previously laid out principles, giving access to four popular large-scale text
collections. GAIA serves a dual purpose: it illustrates the potential of the
methodologies we discuss and also acts as a standalone qualitative analysis tool
that can be leveraged by NLP researchers aiming to understand datasets prior to
using them in training. GAIA is hosted live on Hugging Face Spaces -
https://huggingface.co/spaces/spacerini/gaia.
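To make the idea of retrieval-based data exploration concrete, here is a minimal, self-contained sketch of BM25 keyword search over a tiny in-memory corpus. It is an illustration only and uses neither Pyserini nor any Hugging Face library: the class name `TinyBM25`, the whitespace tokenizer, the toy documents, and the parameter values `k1=0.9`, `b=0.4` are all assumptions chosen for this example, not the authors' implementation.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase and split on whitespace (a deliberately naive tokenizer)."""
    return text.lower().split()


class TinyBM25:
    """Hypothetical in-memory BM25 index, for illustration only."""

    def __init__(self, docs, k1=0.9, b=0.4):
        self.docs = docs
        self.k1, self.b = k1, b
        self.tokens = [tokenize(d) for d in docs]
        # Per-document term frequencies.
        self.tf = [Counter(t) for t in self.tokens]
        # Average document length, used for length normalization.
        self.avgdl = sum(len(t) for t in self.tokens) / len(docs)
        # Document frequency: number of documents containing each term.
        self.df = Counter(term for t in self.tokens for term in set(t))

    def idf(self, term):
        n = self.df.get(term, 0)
        return math.log(1 + (len(self.docs) - n + 0.5) / (n + 0.5))

    def score(self, query, doc_id):
        dl = len(self.tokens[doc_id])
        total = 0.0
        for term in tokenize(query):
            f = self.tf[doc_id].get(term, 0)
            if f == 0:
                continue
            norm = f + self.k1 * (1 - self.b + self.b * dl / self.avgdl)
            total += self.idf(term) * f * (self.k1 + 1) / norm
        return total

    def search(self, query, k=3):
        """Return up to k (doc_id, score, text) tuples, best match first."""
        scored = [(self.score(query, i), i) for i in range(len(self.docs))]
        hits = [(s, i) for s, i in scored if s > 0]
        hits.sort(key=lambda pair: (-pair[0], pair[1]))
        return [(i, round(s, 3), self.docs[i]) for s, i in hits[:k]]


corpus = [
    "the cat sat on the mat",
    "search engines rank documents by relevance",
    "large text corpora need fast search tools",
    "dogs chase cats",
]
index = TinyBM25(corpus)
# Document 2 contains both query terms, so it ranks above document 1.
for doc_id, score, text in index.search("search corpora"):
    print(doc_id, score, text)
```

A production setup would replace this toy index with a prebuilt inverted index over the full collection; the sketch only shows why ranked keyword lookup is a cheap way to inspect what a dataset contains.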
Related papers
- Capturing research literature attitude towards Sustainable Development Goals: an LLM-based topic modeling approach [0.7806050661713976]
The Sustainable Development Goals were formulated by the United Nations in 2015 to address global challenges by 2030.
Natural language processing techniques can help uncover discussions on SDGs within research literature.
We propose a completely automated pipeline to fetch content from the Scopus database and prepare datasets dedicated to five groups of SDGs.
arXiv Detail & Related papers (2024-11-05T09:37:23Z)
- DiscoveryBench: Towards Data-Driven Discovery with Large Language Models [50.36636396660163]
We present DiscoveryBench, the first comprehensive benchmark that formalizes the multi-step process of data-driven discovery.
Our benchmark contains 264 tasks collected across 6 diverse domains, such as sociology and engineering.
Our benchmark, thus, illustrates the challenges in autonomous data-driven discovery and serves as a valuable resource for the community to make progress.
arXiv Detail & Related papers (2024-07-01T18:58:22Z)
- Contrastive Transformer Learning with Proximity Data Generation for Text-Based Person Search [60.626459715780605]
Given a descriptive text query, text-based person search aims to retrieve the best-matched target person from an image gallery.
Such a cross-modal retrieval task is quite challenging due to the significant modality gap, fine-grained differences, and the scarcity of annotated data.
In this paper, we propose a simple yet effective dual Transformer model for text-based person search.
arXiv Detail & Related papers (2023-11-15T16:26:49Z)
- MEGAnno: Exploratory Labeling for NLP in Computational Notebooks [9.462926987075122]
We present MEGAnno, a novel annotation framework designed for NLP practitioners and researchers.
With MEGAnno, users can explore data through sophisticated search and interactive suggestion functions.
We demonstrate MEGAnno's flexible, exploratory, efficient, and seamless labeling experience through a sentiment analysis use case.
arXiv Detail & Related papers (2023-01-08T19:16:22Z)
- CorpusBrain: Pre-train a Generative Retrieval Model for Knowledge-Intensive Language Tasks [62.22920673080208]
A single-step generative model can dramatically simplify the search process and be optimized in an end-to-end manner.
We name the pre-trained generative retrieval model CorpusBrain, since all information about the corpus is encoded in its parameters without the need to construct an additional index.
arXiv Detail & Related papers (2022-08-16T10:22:49Z)
- Sionna: An Open-Source Library for Next-Generation Physical Layer Research [64.77840557164266]
Sionna is a GPU-accelerated open-source library for link-level simulations based on TensorFlow.
Sionna implements a wide breadth of carefully tested state-of-the-art algorithms that can be used for benchmarking and end-to-end performance evaluation.
arXiv Detail & Related papers (2022-03-22T16:31:44Z)
- DeepShovel: An Online Collaborative Platform for Data Extraction in Geoscience Literature with AI Assistance [48.55345030503826]
Geoscientists need to read a huge amount of literature to locate, extract, and aggregate relevant results and data.
DeepShovel is a publicly-available AI-assisted data extraction system to support their needs.
A follow-up user evaluation with 14 researchers suggested DeepShovel improved users' efficiency of data extraction for building scientific databases.
arXiv Detail & Related papers (2022-02-21T12:18:08Z)
- A Flexible Clustering Pipeline for Mining Text Intentions [6.599344783327053]
We create a flexible and scalable clustering pipeline within the Verint Intent Manager.
It integrates the fine-tuning of language models, a high-performing k-NN library, and community detection techniques.
As deployed in the VIM application, this clustering pipeline produces high quality results.
arXiv Detail & Related papers (2022-02-01T22:54:18Z)
- DRIFT: A Toolkit for Diachronic Analysis of Scientific Literature [0.7349727826230862]
We open source DRIFT, which allows researchers to track research trends and development over the years.
The analysis methods are collated from well-cited research works, with a few of our own methods added for good measure.
To demonstrate the utility and efficacy of our tool, we perform a case study on the cs.CL corpus of the arXiv repository and draw inferences from the analysis methods.
arXiv Detail & Related papers (2021-07-02T17:33:25Z)
This list is automatically generated from the titles and abstracts of the papers on this site.