LRG at TREC 2020: Document Ranking with XLNet-Based Models
- URL: http://arxiv.org/abs/2103.00380v1
- Date: Sun, 28 Feb 2021 03:04:29 GMT
- Title: LRG at TREC 2020: Document Ranking with XLNet-Based Models
- Authors: Abheesht Sharma and Harshit Pandey
- Abstract summary: We are given a user's query along with a description and must find the most relevant short segment from a dataset containing all the podcasts.
Previous techniques that rely solely on classical Information Retrieval (IR) perform poorly when descriptive queries are presented.
We experiment with two hybrid models which first filter the best podcasts based on the user's query with a classical IR technique, and then re-rank the shortlisted documents based on the detailed description.
- Score: 0.9023847175654602
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Establishing a good information retrieval system in popular mediums of
entertainment is a quickly growing area of investigation for companies and
researchers alike. We delve into the domain of information retrieval for
podcasts. In Spotify's Podcast Challenge, we are given a user's query along
with a description and must find the most relevant short segment from a dataset
containing all the podcasts. Previous techniques that rely solely on classical
Information Retrieval (IR) perform poorly when descriptive queries are
presented. On the other hand, models that rely exclusively on large neural
networks tend to perform better. The downside to this approach is the
considerable time and computing power required to infer the result. We
experiment with two hybrid models which first filter the best podcasts based on
the user's query with a classical IR technique, and then re-rank the
shortlisted documents based on the detailed description using a
transformer-based model.
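The retrieve-then-rerank pipeline described above can be sketched in a few lines of pure Python. This is a minimal illustration, not the authors' implementation: the first stage scores segments against the short query with BM25, and `rerank` is a placeholder for the transformer cross-encoder (the paper uses XLNet-based models), standing in here as simple token overlap with the detailed description.

```python
import math
from collections import Counter

def bm25_scores(query, docs, k1=1.5, b=0.75):
    """First stage: BM25 lexical scores of each document against the short query."""
    tokenized = [d.lower().split() for d in docs]
    avgdl = sum(len(t) for t in tokenized) / len(tokenized)
    N = len(docs)
    df = Counter()
    for toks in tokenized:
        df.update(set(toks))
    scores = []
    for toks in tokenized:
        tf = Counter(toks)
        s = 0.0
        for w in query.lower().split():
            if w not in tf:
                continue
            idf = math.log(1 + (N - df[w] + 0.5) / (df[w] + 0.5))
            s += idf * tf[w] * (k1 + 1) / (tf[w] + k1 * (1 - b + b * len(toks) / avgdl))
        scores.append(s)
    return scores

def rerank(description, shortlist):
    """Second stage placeholder: a transformer cross-encoder would score each
    (description, segment) pair; token overlap stands in for that model here."""
    desc = set(description.lower().split())
    return sorted(shortlist, key=lambda d: len(desc & set(d.lower().split())), reverse=True)

def hybrid_search(query, description, docs, k=3):
    """Filter top-k segments with BM25, then re-rank them by the description."""
    scores = bm25_scores(query, docs)
    shortlist = [d for _, d in sorted(zip(scores, docs), reverse=True)[:k]]
    return rerank(description, shortlist)
```

The design point of the hybrid is that the cheap lexical stage prunes the candidate pool, so the expensive neural scorer only sees the top-k shortlist instead of the whole corpus.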
Related papers
- Generative Pre-trained Ranking Model with Over-parameterization at Web-Scale (Extended Abstract) [73.57710917145212]
Learning to rank is widely employed in web searches to prioritize pertinent webpages based on input queries.
We propose a Generative Semi-Supervised Pre-trained (GS2P) model to address these challenges.
We conduct extensive offline experiments on both a publicly available dataset and a real-world dataset collected from a large-scale search engine.
arXiv Detail & Related papers (2024-09-25T03:39:14Z)
- SPRINT: A Unified Toolkit for Evaluating and Demystifying Zero-shot Neural Sparse Retrieval [92.27387459751309]
We provide SPRINT, a unified Python toolkit for evaluating neural sparse retrieval.
We establish strong and reproducible zero-shot sparse retrieval baselines across the well-acknowledged benchmark, BEIR.
We show that SPLADEv2 produces sparse representations with a majority of tokens outside of the original query and document.
arXiv Detail & Related papers (2023-07-19T22:48:02Z)
- Incorporating Relevance Feedback for Information-Seeking Retrieval using Few-Shot Document Re-Ranking [56.80065604034095]
We introduce a kNN approach that re-ranks documents based on their similarity with the query and the documents the user considers relevant.
To evaluate our different integration strategies, we transform four existing information retrieval datasets into the relevance feedback scenario.
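The relevance-feedback idea above can be sketched simply. This is not the paper's model, just a hedged illustration of the mechanism: each candidate's score mixes its cosine similarity to the query with its mean similarity to the documents the user marked relevant, using bag-of-words vectors in place of learned embeddings.

```python
import math
from collections import Counter

def cosine(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    dot = sum(a[w] * b[w] for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def feedback_rerank(query, relevant, candidates, alpha=0.5):
    """Re-rank candidates by blending query similarity with mean similarity
    to the user's relevant documents (alpha weights the query term)."""
    q = Counter(query.lower().split())
    rel = [Counter(r.lower().split()) for r in relevant]
    def score(doc):
        d = Counter(doc.lower().split())
        fb = sum(cosine(d, r) for r in rel) / len(rel) if rel else 0.0
        return alpha * cosine(d, q) + (1 - alpha) * fb
    return sorted(candidates, key=score, reverse=True)
```

In the paper's setting the vectors would come from a neural encoder rather than raw term counts, but the blending of query and feedback signals is the same shape.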
arXiv Detail & Related papers (2022-10-19T16:19:37Z)
- Generate rather than Retrieve: Large Language Models are Strong Context Generators [74.87021992611672]
We present a novel perspective for solving knowledge-intensive tasks by replacing document retrievers with large language model generators.
We call our method generate-then-read (GenRead): it first prompts a large language model to generate contextual documents based on a given question, and then reads the generated documents to produce the final answer.
arXiv Detail & Related papers (2022-09-21T01:30:59Z)
- CorpusBrain: Pre-train a Generative Retrieval Model for Knowledge-Intensive Language Tasks [62.22920673080208]
A single-step generative model can dramatically simplify the search process and be optimized in an end-to-end manner.
We name the pre-trained generative retrieval model CorpusBrain, as all information about the corpus is encoded in its parameters without the need to construct an additional index.
arXiv Detail & Related papers (2022-08-16T10:22:49Z)
- Topic Modeling on Podcast Short-Text Metadata [0.9539495585692009]
We assess the feasibility of discovering relevant topics from podcast metadata (titles and descriptions) using modeling techniques for short text.
We propose a new strategy to leverage named entities (NEs), often present in podcast metadata, in a Non-negative Matrix Factorization modeling framework.
Our experiments on existing datasets from Spotify, iTunes, and Deezer show that our proposed document representation, NEiCE, leads to improved coherence over the baselines.
arXiv Detail & Related papers (2022-01-12T11:07:05Z)
- Spotify at TREC 2020: Genre-Aware Abstractive Podcast Summarization [4.456617185465443]
The goal of this challenge was to generate short, informative summaries that contain the key information present in a podcast episode.
We propose two summarization models that explicitly take genre and named entities into consideration.
Our models are abstractive, and supervised using creator-provided descriptions as ground truth summaries.
arXiv Detail & Related papers (2021-04-07T18:27:28Z)
- PodSumm -- Podcast Audio Summarization [0.0]
We propose a method to automatically construct a podcast summary via guidance from the text-domain.
Motivated by a lack of datasets for this task, we curate an internal dataset, find an effective scheme for data augmentation, and design a protocol to gather summaries from annotators.
Our method achieves ROUGE-F(1/2/L) scores of 0.63/0.53/0.63 on our dataset.
arXiv Detail & Related papers (2020-09-22T04:49:33Z)
- A Baseline Analysis for Podcast Abstractive Summarization [18.35061145103997]
This paper presents a baseline analysis of podcast summarization using the Spotify Podcast dataset.
It aims to help researchers understand current state-of-the-art pre-trained models and hence build a foundation for creating better models.
arXiv Detail & Related papers (2020-08-24T18:38:42Z)
- Query Resolution for Conversational Search with Limited Supervision [63.131221660019776]
We propose QuReTeC (Query Resolution by Term Classification), a neural query resolution model based on bidirectional transformers.
We show that QuReTeC outperforms state-of-the-art models, and furthermore, that our distant supervision method can be used to substantially reduce the amount of human-curated data required to train QuReTeC.
arXiv Detail & Related papers (2020-05-24T11:37:22Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.