Paper Espresso: From Paper Overload to Research Insight
Abstract Overview
Paper Espresso is an open-source platform that proactively monitors community-trending arXiv papers by ingesting the Hugging Face Daily Papers feed (~2–3% of arXiv), rather than relying on query-based discovery. It uses LLMs (specifically Google Gemini) to generate structured bilingual summaries, open-vocabulary topic labels, and keywords, and presents daily, monthly, and lifecycle-level trend views through a Streamlit web interface. The system has run continuously for 35 months (May 2023–April 2026), processing over 13,300 papers, and publicly releases all structured metadata as date-partitioned Parquet datasets on Hugging Face. Using this corpus, the paper conducts longitudinal empirical analyses of AI research dynamics, including topic emergence, co-occurrence structure, lifecycle behavior, and the relationship between topic novelty and community engagement.
Novelty
The work's distinctive contribution is its combination of proactive, continuous paper monitoring with LLM-based structured metadata generation and multi-timescale trend analysis (daily, monthly, lifecycle) in a single openly released system. It goes beyond paper summarization by performing monthly topic consolidation (~50:1 compression), Gartner Hype Cycle-style lifecycle classification using statistical indicators, and novelty-engagement analysis using Pointwise Mutual Information over topic co-occurrences.
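As a rough illustration of the novelty-engagement measure, the sketch below scores a paper by the negated mean PMI of its topic pairs, so that rarely co-occurring topics yield a higher score. The function names, the base-2 logarithm, and the choice to estimate probabilities as simple document frequencies are assumptions made for this example; the paper's exact formulation may differ.

```python
import math
from collections import Counter
from itertools import combinations


def pmi_table(papers):
    """Return PMI (in bits) for every topic pair that co-occurs in at least one paper.

    `papers` is a list of topic-label lists, one list per paper (assumed input shape).
    """
    n = len(papers)
    topic_counts = Counter()
    pair_counts = Counter()
    for topics in papers:
        unique = sorted(set(topics))  # sort so each pair has one canonical key
        topic_counts.update(unique)
        pair_counts.update(combinations(unique, 2))

    pmi = {}
    for (a, b), count in pair_counts.items():
        p_ab = count / n
        p_a = topic_counts[a] / n
        p_b = topic_counts[b] / n
        pmi[(a, b)] = math.log2(p_ab / (p_a * p_b))
    return pmi


def novelty(topics, pmi):
    """Negated mean PMI over a paper's topic pairs (higher means more unusual).

    Returns None for papers with fewer than two distinct topics. Assumes the
    paper belongs to the corpus used to build `pmi`, so every pair is present.
    """
    pairs = list(combinations(sorted(set(topics)), 2))
    if not pairs:
        return None
    return -sum(pmi[pair] for pair in pairs) / len(pairs)


# Toy corpus (made-up labels): "rl" + "reasoning" is common, "rl" + "video" is rare.
corpus = [
    ["rl", "reasoning"],
    ["rl", "reasoning"],
    ["rl", "reasoning"],
    ["diffusion", "video"],
    ["diffusion", "video"],
    ["rl", "video"],
]
table = pmi_table(corpus)
common_score = novelty(["rl", "reasoning"], table)
rare_score = novelty(["rl", "video"], table)
```

In this toy corpus the rare pairing scores higher than the common one, which is the ordering the novelty measure is meant to capture.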
Results
Over 35 months, Paper Espresso processed 13,388 papers and released associated summaries, topic labels (6,673 unique coarse-grained topics), keywords, and trend reports as public datasets. The empirical analysis reports non-saturating topic emergence (up to 408 new topics/month) with stable Shannon entropy (~7.9 bits), a mid-2025 surge in reinforcement learning for LLM reasoning (driven by GRPO and RLVR), asymmetric topic velocity (median 8 months to peak but 1-month half-life), and a positive association between unusual topic combinations and higher community engagement, with the most novel papers receiving approximately 2.0× median upvotes.
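The entropy and velocity figures above can be made concrete with a small sketch. It assumes entropy is computed over the empirical distribution of topic labels, that time to peak is counted from a topic's first non-zero month, and that half-life is the number of months after the peak until the count first falls to half the peak or below. These definitions and all function names are assumptions for illustration, not necessarily those used in the paper.

```python
import math
from collections import Counter


def shannon_entropy_bits(topic_labels):
    """Shannon entropy (bits) of the empirical distribution of topic labels."""
    counts = Counter(topic_labels)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def months_to_peak(monthly_counts):
    """Months from a topic's first non-zero month to its peak month.

    Returns None if the topic never appears. Ties resolve to the earliest peak.
    """
    first = next((i for i, c in enumerate(monthly_counts) if c > 0), None)
    if first is None:
        return None
    peak = max(range(len(monthly_counts)), key=lambda i: monthly_counts[i])
    return peak - first


def half_life_months(monthly_counts):
    """Months after the peak until the count first drops to <= half the peak.

    Returns None if the series ends before that happens.
    """
    peak = max(range(len(monthly_counts)), key=lambda i: monthly_counts[i])
    threshold = monthly_counts[peak] / 2
    for offset, count in enumerate(monthly_counts[peak + 1:], start=1):
        if count <= threshold:
            return offset
    return None


# Made-up monthly paper counts for one topic: a slow rise, then a sharp drop.
series = [0, 1, 2, 3, 5, 8, 9, 10, 12, 20, 6, 4]
rise = months_to_peak(series)    # 8 months from first appearance to peak
decay = half_life_months(series)  # 1 month to fall to half the peak
```

For scale, an entropy of about 7.9 bits is what a uniform distribution over roughly 2^7.9 ≈ 239 topics would produce, so a stable value alongside continued topic emergence suggests attention stays spread across a few hundred effective topics.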
Key Points
- The platform targets the community-curated ~2–3% of arXiv surfaced by Hugging Face Daily Papers, using Google Gemini via LiteLLM to convert each paper into structured bilingual metadata (summaries, pros/cons, open-vocabulary topic labels, and keywords) stored as date-partitioned Parquet files.
- Its released datasets support daily, monthly, and lifecycle-level analysis, including LLM-driven topic consolidation (~50:1 compression), keyword evolution tracking, co-occurrence structure mapping via Jaccard similarity, and Gartner Hype Cycle-style topic classification using statistical indicators (peak proportion, decline ratio, trend slope).
- The 35-month longitudinal analysis shows that AI research topics continue to diversify (non-saturating emergence with stable entropy), topics peak slowly (median 8 months) but decay rapidly (median 1-month half-life), and papers with more novel topic combinations (measured by negated mean PMI) receive approximately 2.0× the median upvotes of papers with conventional combinations.
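The co-occurrence and lifecycle measures named in the points above can be sketched as follows. Jaccard similarity is the standard set overlap ratio, applied here to the sets of papers tagged with each topic. The three lifecycle indicators are computed under assumed definitions (peak proportion as the topic's largest monthly share of papers, decline ratio as the latest share divided by the peak share, trend slope as the least-squares slope of share against month index), since this summary does not spell out the paper's formulas or the thresholds that map indicators to hype-cycle phases.

```python
def jaccard(papers_a, papers_b):
    """Jaccard similarity between the sets of paper IDs tagged with two topics."""
    a, b = set(papers_a), set(papers_b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def lifecycle_indicators(monthly_share):
    """Compute three indicators from a topic's monthly share of papers (oldest first).

    The definitions below are assumptions for illustration:
      peak_proportion: largest monthly share
      decline_ratio:   latest share divided by the peak share
      trend_slope:     ordinary least-squares slope of share vs. month index
    Requires at least two months and a non-zero peak.
    """
    n = len(monthly_share)
    peak = max(monthly_share)
    mean_x = (n - 1) / 2
    mean_y = sum(monthly_share) / n
    numerator = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(monthly_share))
    denominator = sum((x - mean_x) ** 2 for x in range(n))
    return {
        "peak_proportion": peak,
        "decline_ratio": monthly_share[-1] / peak,
        "trend_slope": numerator / denominator,
    }


# Hypothetical paper IDs and monthly shares, made up for this example.
overlap = jaccard({"p1", "p2", "p3"}, {"p2", "p3", "p4"})
indicators = lifecycle_indicators([0.00, 0.01, 0.02, 0.04, 0.02, 0.01])
```

A classifier in the spirit of the Gartner Hype Cycle would then compare these indicators against thresholds; those thresholds are not given here, so the sketch stops at the indicators themselves.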
References
- arXiv: https://arxiv.org/abs/2604.04562v1
- Fugu-MT: https://fugumt.com/fugumt/paper_check/2604.04562v1
- Hugging Face Papers: https://huggingface.co/papers/2604.04562
- Hugging Face Space: https://huggingface.co/spaces/Elfsong/Paper_Espresso