DS4RS: Community-Driven and Explainable Dataset Search Engine for Recommender System Research
- URL: http://arxiv.org/abs/2508.10238v1
- Date: Wed, 13 Aug 2025 23:37:55 GMT
- Title: DS4RS: Community-Driven and Explainable Dataset Search Engine for Recommender System Research
- Authors: Xinyang Shao, Tri Kurniawan Wijaya,
- Abstract summary: We propose a community-driven and explainable dataset search engine tailored for recommender system research.<n>Our system supports semantic search across multiple dataset attributes, such as dataset names, descriptions, and recommendation domain.<n>The system encourages community participation by allowing users to contribute standardized dataset metadata in public repository.
- Score: 0.5294604210205507
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Accessing suitable datasets is critical for research and development in recommender systems. However, finding datasets that match specific recommendation task or domains remains a challenge due to scattered sources and inconsistent metadata. To address this gap, we propose a community-driven and explainable dataset search engine tailored for recommender system research. Our system supports semantic search across multiple dataset attributes, such as dataset names, descriptions, and recommendation domain, and provides explanations of search relevance to enhance transparency. The system encourages community participation by allowing users to contribute standardized dataset metadata in public repository. By improving dataset discoverability and search interpretability, the system facilitates more efficient research reproduction. The platform is publicly available at: https://ds4rs.com.
Related papers
- Multi-Disciplinary Dataset Discovery from Citation-Verified Literature Contexts [0.0]
We introduce a literature-driven framework that discovers datasets from citation contexts in scientific papers.<n>Our approach combines large-scale citation-context extraction, schema-guided dataset recognition, and provenance-preserving entity resolution.<n>We release our code, evaluation datasets, and results on GitHub.
arXiv Detail & Related papers (2026-01-08T16:46:06Z) - OpenDataArena: A Fair and Open Arena for Benchmarking Post-Training Dataset Value [74.80873109856563]
OpenDataArena (ODA) is a holistic and open platform designed to benchmark the intrinsic value of post-training data.<n>ODA establishes a comprehensive ecosystem comprising four key pillars: (i) a unified training-evaluation pipeline that ensures fair, open comparisons across diverse models; (ii) a multi-dimensional scoring framework that profiles data quality along tens of distinct axes; and (iii) an interactive data lineage explorer to visualize dataset genealogy and dissect component sources.
arXiv Detail & Related papers (2025-12-16T03:33:24Z) - A Survey on Open Dataset Search in the LLM Era: Retrospectives and Perspectives [13.669798235894064]
We focus on advances in open dataset search beyond traditional approaches that rely on metadata and keywords.<n>LLMs help address complex challenges in query understanding, semantic modeling, and interactive guidance within open dataset search.<n>This work aims to offer a structured reference for researchers and practitioners in the field of open dataset search.
arXiv Detail & Related papers (2025-08-31T07:45:40Z) - From Web Search towards Agentic Deep Research: Incentivizing Search with Reasoning Agents [96.65646344634524]
Large Language Models (LLMs), endowed with reasoning and agentic capabilities, are ushering in a new paradigm termed Agentic Deep Research.<n>We trace the evolution from static web search to interactive, agent-based systems that plan, explore, and learn.<n>We demonstrate that Agentic Deep Research not only significantly outperforms existing approaches, but is also poised to become the dominant paradigm for future information seeking.
arXiv Detail & Related papers (2025-06-23T17:27:19Z) - RecKG: Knowledge Graph for Recommender Systems [3.0969191504482247]
This study proposes RecKG, a standardized knowledge graph for recommender systems.<n> RecKG ensures the consistent representation of entities across different datasets.<n>We apply RecKG to standardize real-world datasets, subsequently developing an application for RecKG using a graph database.
arXiv Detail & Related papers (2025-01-07T07:55:35Z) - Metadata-based Data Exploration with Retrieval-Augmented Generation for Large Language Models [3.7685718201378746]
This research introduces a new architecture for data exploration which employs a form of Retrieval-Augmented Generation (RAG) to enhance metadata-based data discovery.
The proposed framework offers a new method for evaluating semantic similarity among heterogeneous data sources.
arXiv Detail & Related papers (2024-10-05T17:11:37Z) - DiscoveryBench: Towards Data-Driven Discovery with Large Language Models [50.36636396660163]
We present DiscoveryBench, the first comprehensive benchmark that formalizes the multi-step process of data-driven discovery.
Our benchmark contains 264 tasks collected across 6 diverse domains, such as sociology and engineering.
Our benchmark, thus, illustrates the challenges in autonomous data-driven discovery and serves as a valuable resource for the community to make progress.
arXiv Detail & Related papers (2024-07-01T18:58:22Z) - Dataset Regeneration for Sequential Recommendation [69.93516846106701]
We propose a data-centric paradigm for developing an ideal training dataset using a model-agnostic dataset regeneration framework called DR4SR.
To demonstrate the effectiveness of the data-centric paradigm, we integrate our framework with various model-centric methods and observe significant performance improvements across four widely adopted datasets.
arXiv Detail & Related papers (2024-05-28T03:45:34Z) - DNS-Rec: Data-aware Neural Architecture Search for Recommender Systems [79.76519917171261]
This paper addresses the computational overhead and resource inefficiency prevalent in Sequential Recommender Systems (SRSs)<n>We introduce an innovative approach combining pruning methods with advanced model designs.<n>Our principal contribution is the development of a Data-aware Neural Architecture Search for Recommender System (DNS-Rec)
arXiv Detail & Related papers (2024-02-01T07:22:52Z) - infoVerse: A Universal Framework for Dataset Characterization with
Multidimensional Meta-information [68.76707843019886]
infoVerse is a universal framework for dataset characterization.
infoVerse captures multidimensional characteristics of datasets by incorporating various model-driven meta-information.
In three real-world applications (data pruning, active learning, and data annotation), the samples chosen on infoVerse space consistently outperform strong baselines.
arXiv Detail & Related papers (2023-05-30T18:12:48Z) - DataFinder: Scientific Dataset Recommendation from Natural Language
Descriptions [100.52917027038369]
We operationalize the task of recommending datasets given a short natural language description.
To facilitate this task, we build the DataFinder dataset which consists of a larger automatically-constructed training set and a smaller expert-annotated evaluation set.
This system, trained on the DataFinder dataset, finds more relevant search results than existing third-party dataset search engines.
arXiv Detail & Related papers (2023-05-26T05:22:36Z) - INODE: Building an End-to-End Data Exploration System in Practice
[Extended Vision] [30.411996388471817]
INODE is an end-to-end data exploration system.
We demonstrate it in three significant use cases in the fields of Cancer Biomarker Reearch, Research and Innovation Policy Making, and Astrophysics.
arXiv Detail & Related papers (2021-04-09T05:04:04Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.