FuguReport

Paper Espresso: From Paper Overload to Research Insight

Authors: Mingzhe Du, Luu Anh Tuan, Dong Huang, See-kiong Ng
Affiliations: Nanyang Technological University / National University of Singapore
Categories: Method / Metadata Extraction / Automated paper metadata generation; Application / Research Assistance / Insight generation from arXiv papers; Evaluation / Model Evaluation / Large-scale system deployment analysis
License: CC BY 4.0

Abstract Overview

Paper Espresso is an open-source platform that proactively monitors community-trending arXiv papers by ingesting from the Hugging Face Daily Papers feed (~2–3% of arXiv), rather than relying on query-based discovery. It uses LLMs (specifically Google Gemini) to generate structured bilingual summaries, open-vocabulary topic labels, and keywords, and presents daily, monthly, and lifecycle-level trend views through a Streamlit web interface. The system has run continuously for 35 months (May 2023–April 2026), processing over 13,300 papers, and publicly releases all structured metadata as date-partitioned Parquet datasets on Hugging Face. Using this corpus, the paper conducts longitudinal empirical analyses of AI research dynamics, including topic emergence, co-occurrence structure, lifecycle behavior, and the relationship between topic novelty and community engagement.

Novelty

The work's distinctive contribution is its combination of proactive, continuous paper monitoring with LLM-based structured metadata generation and multi-timescale trend analysis (daily, monthly, lifecycle) in a single openly released system. It goes beyond paper summarization by performing monthly topic consolidation (~50:1 compression), Gartner Hype Cycle-style lifecycle classification using statistical indicators, and novelty-engagement analysis using Pointwise Mutual Information over topic co-occurrences.
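
To make the novelty measure concrete, the following is a minimal sketch of scoring a paper by the negated mean PMI of its topic pairs. The function names (`pmi_table`, `novelty`), the base-2 logarithm, and the choice to skip pairs never seen in the corpus are assumptions made for illustration; the summary does not specify how the paper handles these details.

```python
import math
from collections import Counter
from itertools import combinations


def pmi_table(papers):
    """Return PMI (in bits) for every topic pair that co-occurs in a paper.

    `papers` is a list of topic-label lists, one list per paper.
    Probabilities are estimated as document frequencies over the corpus.
    """
    n = len(papers)
    topic_counts = Counter()
    pair_counts = Counter()
    for topics in papers:
        unique = sorted(set(topics))  # sorted so each pair has one canonical key
        topic_counts.update(unique)
        pair_counts.update(combinations(unique, 2))

    pmi = {}
    for (a, b), count in pair_counts.items():
        p_ab = count / n
        p_a = topic_counts[a] / n
        p_b = topic_counts[b] / n
        pmi[(a, b)] = math.log2(p_ab / (p_a * p_b))
    return pmi


def novelty(topics, pmi):
    """Negated mean PMI over a paper's topic pairs (higher = more unusual).

    Assumption: pairs absent from the table are skipped; returns None
    when the paper has no scorable pair.
    """
    pairs = combinations(sorted(set(topics)), 2)
    scores = [pmi[p] for p in pairs if p in pmi]
    if not scores:
        return None
    return -sum(scores) / len(scores)


# Toy corpus (made-up labels, not data from the paper).
corpus = [
    ["rl", "llm"], ["rl", "llm"], ["rl", "llm"],
    ["vision", "llm"], ["vision", "diffusion"], ["rl", "diffusion"],
]
table = pmi_table(corpus)
common = novelty(["rl", "llm"], table)
rare = novelty(["rl", "diffusion"], table)
```

In this toy corpus the frequently paired topics have positive PMI, so `common` is negative, while the rarely paired topics have negative PMI, so `rare` is positive and ranks as more novel.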

Results

Over 35 months, Paper Espresso processed 13,388 papers and released the associated summaries, topic labels (6,673 unique coarse-grained topics), keywords, and trend reports as public datasets. The empirical analysis reports four findings: non-saturating topic emergence (up to 408 new topics per month) with stable Shannon entropy (~7.9 bits); a mid-2025 surge in reinforcement learning for LLM reasoning, driven by GRPO and RLVR; asymmetric topic velocity (a median of 8 months to peak but a median 1-month half-life); and a positive association between unusual topic combinations and community engagement, with the most novel papers receiving approximately 2.0× the median upvotes of papers with conventional topic combinations.
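
The two diversity measures mentioned above can be sketched as follows. This is an illustrative reading, not the paper's code: `shannon_entropy_bits` computes the entropy of the topic-label distribution in bits, and `new_topics_per_month` counts labels that appear for the first time in each month. Both function names and the input shapes are assumptions.

```python
import math
from collections import Counter


def shannon_entropy_bits(labels):
    """Shannon entropy H = -sum(p * log2(p)) of a list of topic labels."""
    counts = Counter(labels)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def new_topics_per_month(monthly_topics):
    """Count topics first seen in each month.

    `monthly_topics` maps a sortable month key such as "2023-05" to the
    topic labels observed that month (hypothetical input shape).
    """
    seen = set()
    emerged = {}
    for month in sorted(monthly_topics):
        current = set(monthly_topics[month])
        emerged[month] = len(current - seen)  # labels never seen before
        seen |= current
    return emerged


# Toy inputs (made-up labels, not data from the paper).
uniform = shannon_entropy_bits(["a", "b", "c", "d"])
emergence = new_topics_per_month({
    "2023-05": ["a", "b"],
    "2023-06": ["b", "c"],
})
```

For intuition, an entropy of about 7.9 bits is what a uniform distribution over roughly 2^7.9 ≈ 239 topics would give, so a stable value alongside continued emergence suggests that attention stays spread out rather than concentrating on a few labels.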

Key Points

  1. The platform targets the community-curated ~2–3% of arXiv surfaced by Hugging Face Daily Papers, using Google Gemini via LiteLLM to convert each paper into structured bilingual metadata (summaries, pros/cons, open-vocabulary topic labels, and keywords) stored as date-partitioned Parquet files.
  2. Its released datasets support daily, monthly, and lifecycle-level analysis, including LLM-driven topic consolidation (~50:1 compression), keyword evolution tracking, co-occurrence structure mapping via Jaccard similarity, and Gartner Hype Cycle-style topic classification using statistical indicators (peak proportion, decline ratio, trend slope).
  3. The 35-month longitudinal analysis shows that AI research topics continue to diversify (non-saturating emergence with stable entropy), that topics peak slowly (median 8 months) but decay rapidly (median 1-month half-life), and that papers with more novel topic combinations (measured by negated mean PMI) receive approximately 2.0× the median upvotes of papers with conventional combinations.
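
The co-occurrence and lifecycle indicators named in the points above can be sketched as below. The summary does not define them precisely, so every definition here is an assumption: Jaccard similarity is taken over the sets of papers tagged with each topic, peak proportion is the maximum monthly share, decline ratio is the latest share divided by the peak, trend slope is an ordinary least-squares slope over the month index, and half-life is the number of months after the peak until the share first falls to half the peak or below.

```python
def jaccard(papers_a, papers_b):
    """Jaccard similarity |A ∩ B| / |A ∪ B| of two topics' paper-ID sets."""
    a, b = set(papers_a), set(papers_b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def lifecycle_indicators(shares):
    """Summarize one topic's monthly share series (oldest month first).

    All indicator definitions are illustrative assumptions, not the
    paper's exact formulas.
    """
    if not shares:
        raise ValueError("shares must be non-empty")
    n = len(shares)
    peak = max(shares)
    peak_idx = shares.index(peak)  # months from first observation to peak
    decline_ratio = shares[-1] / peak if peak > 0 else 0.0

    # Ordinary least-squares slope of share against month index.
    mean_x = (n - 1) / 2
    mean_y = sum(shares) / n
    var_x = sum((i - mean_x) ** 2 for i in range(n))
    cov_xy = sum((i - mean_x) * (y - mean_y) for i, y in enumerate(shares))
    trend_slope = cov_xy / var_x if var_x else 0.0

    # Months after the peak until the share is at or below half the peak.
    half_life = next(
        (i - peak_idx for i in range(peak_idx + 1, n) if shares[i] <= peak / 2),
        None,  # the series never decays to half its peak
    )
    return {
        "peak_proportion": peak,
        "months_to_peak": peak_idx,
        "decline_ratio": decline_ratio,
        "trend_slope": trend_slope,
        "half_life_months": half_life,
    }


# Toy series (made-up shares, not data from the paper).
indicators = lifecycle_indicators([0.0, 0.02, 0.04, 0.08, 0.03, 0.02])
overlap = jaccard({1, 2, 3}, {2, 3, 4})
```

A Hype Cycle-style classifier could then threshold these indicators, for example treating a high peak with a low decline ratio as a topic past its peak; any such thresholds would need to come from the paper itself.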

This page was created using generative AI such as GPT-5, Claude Opus 4, Gemini 3, Gemini 3.1 Flash Image, and their higher-end successor versions. No guarantee can be made regarding its contents.