News Signals: An NLP Library for Text and Time Series
- URL: http://arxiv.org/abs/2312.11399v1
- Date: Mon, 18 Dec 2023 18:02:41 GMT
- Title: News Signals: An NLP Library for Text and Time Series
- Authors: Chris Hokamp and Demian Gholipour Ghalandari and Parsa Ghaffari
- Abstract summary: News Signals is an open-source library for building and using datasets where inputs are clusters of textual data.
It supports diverse data science and NLP problem settings related to the prediction of time series behaviour.
- Score: 3.850666668546735
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We present an open-source Python library for building and using datasets
where inputs are clusters of textual data, and outputs are sequences of real
values representing one or more time series signals. The news-signals library
supports diverse data science and NLP problem settings related to the
prediction of time series behaviour using textual data feeds. For example, in
the news domain, inputs are document clusters corresponding to daily news
articles about a particular entity, and targets are explicitly associated
real-valued time series: the volume of news about a particular person or
company, or the number of pageviews of specific Wikimedia pages. Despite many
industry and research use cases for this class of problem settings, to the best
of our knowledge, News Signals is the only open-source library designed
specifically to facilitate data science and research settings with natural
language inputs and time series targets. In addition to the core codebase for
building and interacting with datasets, we also conduct a suite of experiments
using several popular Machine Learning libraries, which are used to establish
baselines for time series anomaly prediction using textual inputs.
Related papers
- A Novel Cartography-Based Curriculum Learning Method Applied on RoNLI: The First Romanian Natural Language Inference Corpus [71.77214818319054]
Natural language inference is a proxy for natural language understanding.
There is no publicly available NLI corpus for the Romanian language.
We introduce the first Romanian NLI corpus (RoNLI) comprising 58K training sentence pairs.
arXiv Detail & Related papers (2024-05-20T08:41:15Z)
- A Comprehensive Python Library for Deep Learning-Based Event Detection in Multivariate Time Series Data and Information Retrieval in NLP [0.0]
We present a new deep learning supervised method for detecting events in time series data.
It is based on regression instead of binary classification.
It does not require labeled datasets where each point is labeled.
It only requires reference events defined as time points or intervals of time.
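One way to realize this regression formulation, sketched here as an assumption rather than the paper's exact method, is to spread each reference event time into a smooth bump so that every time step gets a continuous target value without per-point labels:

```python
import math


def event_regression_target(n_steps: int, event_times: list[int],
                            width: float = 2.0) -> list[float]:
    """Continuous target in [0, 1]: a Gaussian bump centred on each
    reference event, replacing per-point binary labels."""
    target = []
    for t in range(n_steps):
        # keep the strongest contribution from any nearby reference event
        peak = max((math.exp(-((t - e) ** 2) / (2 * width ** 2))
                    for e in event_times), default=0.0)
        target.append(peak)
    return target


y = event_regression_target(n_steps=10, event_times=[3, 7], width=1.0)
# y peaks at 1.0 exactly at the reference events (t = 3 and t = 7)
```

A regressor trained against such a target can then recover event locations as peaks in its predictions.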
arXiv Detail & Related papers (2023-10-25T09:13:19Z)
- TemporAI: Facilitating Machine Learning Innovation in Time Domain Tasks for Medicine [91.3755431537592]

TemporAI is an open source Python software library for machine learning (ML) tasks involving data with a time component.
It supports data in time series, static, and event modalities, and provides an interface for prediction, causal inference, and time-to-event analysis.
arXiv Detail & Related papers (2023-01-28T17:57:53Z)
- PyRelationAL: A Library for Active Learning Research and Development [0.11545092788508224]
PyRelationAL is an open source library for active learning (AL) research.
It provides access to benchmark datasets and AL task configurations based on existing literature.
We perform experiments on the PyRelationAL collection of benchmark datasets and showcase the considerable economies that AL can provide.
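As background on the kind of AL loop such benchmarks exercise, here is a minimal uncertainty-sampling sketch in plain Python; it is generic and does not use the PyRelationAL API:

```python
def uncertainty(p: float) -> float:
    """Binary-classification uncertainty: highest when p is near 0.5."""
    return 1.0 - 2.0 * abs(p - 0.5)


def predict_proba(x: float, labeled: list[tuple[float, int]]) -> float:
    """Toy distance-weighted vote: probability that x belongs to class 1."""
    weights = [(1.0 / (abs(x - xi) + 1e-6), yi) for xi, yi in labeled]
    total = sum(w for w, _ in weights)
    return sum(w * y for w, y in weights) / total


def active_learning_loop(pool, oracle, seed_labeled, budget):
    """Repeatedly query the oracle for the most uncertain unlabeled point."""
    labeled, pool = list(seed_labeled), list(pool)
    for _ in range(budget):
        x = max(pool, key=lambda p: uncertainty(predict_proba(p, labeled)))
        labeled.append((x, oracle(x)))  # spend one unit of labeling budget
        pool.remove(x)
    return labeled


labeled = active_learning_loop(
    pool=[i / 10 for i in range(1, 10)],  # unlabeled points in (0, 1)
    oracle=lambda x: int(x >= 0.5),       # ground-truth labeler
    seed_labeled=[(0.0, 0), (1.0, 1)],
    budget=2,
)
# the first query lands on the most ambiguous point, x = 0.5
```

The "economies" of AL come from spending the labeling budget only on such maximally informative points rather than labeling the whole pool.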
arXiv Detail & Related papers (2022-05-23T08:21:21Z)
- SCROLLS: Standardized CompaRison Over Long Language Sequences [62.574959194373264]
We introduce SCROLLS, a suite of tasks that require reasoning over long texts.
SCROLLS contains summarization, question answering, and natural language inference tasks.
We make all datasets available in a unified text-to-text format and host a live leaderboard to facilitate research on model architecture and pretraining methods.
arXiv Detail & Related papers (2022-01-10T18:47:15Z)
- Benchmarking Multimodal AutoML for Tabular Data with Text Fields [83.43249184357053]
We assemble 18 multimodal data tables that each contain some text fields.
Our benchmark enables researchers to evaluate their own methods for supervised learning with numeric, categorical, and text features.
arXiv Detail & Related papers (2021-11-04T09:29:16Z)
- Assessing the quality of sources in Wikidata across languages: a hybrid approach [64.05097584373979]
We run a series of microtasks experiments to evaluate a large corpus of references, sampled from Wikidata triples with labels in several languages.
We use a consolidated, curated version of the crowdsourced assessments to train several machine learning models to scale up the analysis to the whole of Wikidata.
The findings help us ascertain the quality of references in Wikidata, and identify common challenges in defining and capturing the quality of user-generated multilingual structured data on the web.
arXiv Detail & Related papers (2021-09-20T10:06:46Z)
- Datasets: A Community Library for Natural Language Processing [55.48866401721244]
datasets is a community library for contemporary NLP.
The library includes more than 650 unique datasets, has more than 250 contributors, and has helped support a variety of novel cross-dataset research projects.
arXiv Detail & Related papers (2021-09-07T03:59:22Z)
- A Framework for Neural Topic Modeling of Text Corpora [6.340447411058068]
We introduce FAME, an open-source framework enabling an efficient mechanism of extracting and incorporating textual features.
To demonstrate the effectiveness of this library, we conducted experiments on the well-known News-Group dataset.
arXiv Detail & Related papers (2021-08-19T23:32:38Z)
- Documenting the English Colossal Clean Crawled Corpus [28.008953329187648]
This work provides the first documentation for the Colossal Clean Crawled Corpus (C4; Raffel et al., 2020), a dataset created by applying a set of filters to a single snapshot of Common Crawl.
We begin with a high-level summary of the data, including distributions of where the text came from and when it was written.
We then give more detailed analysis on salient parts of this data, including the most frequent sources of text.
arXiv Detail & Related papers (2021-04-18T07:42:52Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.