Efficient Reasoning LLMs
This theme centers on making reasoning-oriented LLMs more efficient in both training and inference, rather than treating stronger reasoning as purely a scaling problem.
Browse released weekly reports in reverse chronological order.
This week's theme centers on how video models are being evaluated and redesigned for stronger temporal reasoning, especially in action and egocentric settings.
This theme centers on new benchmarks and evaluation frameworks for instruction-based image editing, motivated by the gap between advancing visual generation and reliable edit assessment.
This week's theme centers on evaluating and improving LLM-based research and problem-solving agents beyond one-shot task success.
This week's papers advance world modeling away from monolithic black-box predictors toward structured, modular architectures designed to better capture the dynamics of diverse environments.
This week's theme centers on making model merging more controllable, scalable, and robust as the number of fine-tuned expert models grows.
This week's work marks a shift from evaluating multimodal models on static perception toward testing whether they can form actionable, physically grounded world models.
This week's AI safety research emphasizes the shift from broad concern about AI harms toward structured governance and quantitative risk-modeling frameworks.
This theme centers on evaluating and structuring LLM reasoning in settings where static prompting or generic inference heuristics break down—especially when retrieval, domain knowledge, and multi-step decision rules must interact.
This week's theme centers on applying reinforcement learning to move recommendation beyond greedy next-item prediction toward long-term user engagement.
This week's papers treat representation quality and cross-scale alignment as a central bottleneck in both generative modeling and general visual pretraining.
This week's theme centers on how vision-language and embodied models are being tested and redesigned for navigation when spatial reasoning, long-horizon decision-making, and safety become bottlenecks.
This week's theme centers on how LLM-based research agents should be assessed and scaffolded as they move beyond writing support into research planning, experimentation, review, and publication workflows.
This week's theme centers on equipping vision-language models with explicit geometric and navigational structure for embodied tasks, moving beyond brittle prompting or task-specific heads.
This theme centers on diffusion models that move beyond generic text-to-image generation toward more structured, grounded, and computationally practical image editing and perception.
This week saw continued progress toward unified models that combine image generation, editing, and understanding within single autoregressive or hybrid autoregressive-diffusion architectures.
This theme centers on coordinating multiple LLM-based agents to handle tasks beyond what a single model instance can easily support.
This week's theme centers on methods that recover richer scene structure and semantics from limited video observations.
This week's theme centers on benchmark work that evaluates world, video, and multi-view generation models beyond surface-level visual quality.
This week's reinforcement learning theme centers on making agents learn richer behaviors through curriculum design and modular skill representations.
This theme tracks activation steering as an inference-time method to control and adapt language models without modifying parameters.
This week's papers frame advanced video and multimodal generative systems as emerging world models rather than mere content generators.
This theme centers on how to evaluate LLM-based agents for scientific research and complex information seeking under realistic, controlled conditions.
This week's theme centers on discrete and masked diffusion language models as an alternative to autoregressive LLMs, with particular emphasis on how decoding order shapes capability and efficiency.
This week's evaluation work highlights persistent gaps between how visual models are assessed and the conditions they face in practice.
This theme addresses how to evaluate and improve models' understanding of temporal structure in video.
This week's work reflects a shift from building GUI-capable VLM/LLM agents toward evaluating them more rigorously across platforms, capability levels, and failure modes.
This week saw multiple new competition benchmarks that extend image restoration evaluation beyond single-degradation settings.
This week's representative papers address how to scale large language models more efficiently through mixture-of-experts architectures and smarter pre-training data-mixture design.
This week's progress centers on making diffusion-based multimedia generation more temporally coherent and controllable as these models expand from images into video and audio.
This week's theme centers on evaluating 3D reconstruction under realistic adverse conditions—noisy video, human-object interaction, and sparse or degraded observations.
This week saw continued progress on using transformer-based pretraining to enable in-context adaptation in sequential decision-making without weight updates.
This week's papers focus on making LLM agents more reliable on complex, long-horizon tasks by improving how they store, extract, share, and secure knowledge.
This week's representative papers highlight that medical AI progress hinges on clearer evaluation frameworks and richer clinical context, not only on stronger models.
This week's papers center on how to organize LLM-based multi-agent systems for complex, real-world tasks.
This theme centers on how LLM outputs can be attributed to supporting documents so that generated answers are more transparent, verifiable, and trustworthy.
This week's papers focus on making multimodal foundation models more efficient without sacrificing broad utility.
This week's theme concerns adapting and evaluating speech models when labeled in-domain data are scarce, domains shift, or speech departs from typical patterns.
This week's papers frame AI deployment as an environmental and governance challenge.
This week's evaluation work pushes beyond narrow benchmark settings toward broader tests for LLM- and VLM-based agents.
This week's theme centers on privacy evaluation in federated learning, where shared gradients, parameters, or predictions can leak sensitive information even when raw data stays on-device.
This week's papers treat the environmental impact of AI infrastructure as a direct evaluation concern.