Weekly Reports

LLM Code Agent Evaluation

2026-07-17 – 2026-07-23

This theme centers on evaluating LLM-based coding agents on realistic software-engineering tasks, especially repository-level issue resolution.

Embodied World Models for Robot Learning

2026-07-17 – 2026-07-23

This week's theme centers on using embodied world models not just to generate realistic futures, but to evaluate, supervise, and improve robotic policies.

Interactive LLM Behavior Evaluation

2026-07-17 – 2026-07-23

This week's theme centers on evaluating LLM behavior in interactive, socially grounded settings rather than judging single-turn text quality alone.

Multimodal Model Safety Evaluation

2026-07-10 – 2026-07-16

This theme centers on evaluating how safety alignment breaks down when language models operate across images and broader multimodal inputs.

Efficient Video Representations and Diffusion Sampling

2026-07-10 – 2026-07-16

This theme centers on making visual generation and reconstruction more practical by replacing slow per-scene optimization with feed-forward Gaussian representations and by improving video diffusion efficiency.

Affective Model Benchmarking

2026-07-10 – 2026-07-16

This week's benchmarking work in affective AI addresses the need for evaluation resources that better reflect the complexity of human emotion.

LLM Scientific Reasoning Evaluation

2026-07-03 – 2026-07-09

This theme centers on scientific reasoning as a demanding evaluation setting for LLMs, where models are tested on expert-level, multi-step problems rather than surface-form recall.

Embodied Manipulation: World Models, Tactile Feedback, and Evaluation

2026-07-03 – 2026-07-09

This week's work centers on robot manipulation systems that move beyond reactive control by integrating predictive world models and richer contact feedback.

Video Diffusion Model Quality and Efficiency

2026-07-03 – 2026-07-09

This week's work on video diffusion models advances along three coupled fronts: generation quality, temporal/spatial consistency, and inference efficiency.

Dynamic 4D Gaussian Reconstruction

2026-06-26 – 2026-07-02

This theme centers on reconstructing and rendering dynamic 3D/4D scenes from monocular or casual video using Gaussian-based representations.

Vision-Language Evaluation

2026-06-26 – 2026-07-02

This week's theme centers on how vision-language models should be evaluated and improved when standard web-scale data and static benchmarks fail to capture real capability.

Multimodal LLM Attribution & Evaluation

2026-06-26 – 2026-07-02

This theme centers on evaluating whether LLM systems can answer long-form, multi-document questions while staying grounded in evidence drawn from both text and visual materials.

LLM Agent Environments & Evaluation

2026-06-19 – 2026-06-25

This week's theme centers on evaluating and improving LLM agents through richer environments rather than static datasets alone.

Cross-Model and Cross-Modal Representation Alignment

2026-06-19 – 2026-06-25

This week's theme centers on making learned representations interoperable across tokenizers, models, and modalities.

Generative Segmentation with Diffusion Models

2026-06-19 – 2026-06-25

This theme centers on reframing segmentation from conventional discriminative per-pixel prediction toward generative mask construction and refinement using pretrained diffusion models.

Efficient Reasoning LLMs

2026-06-12 – 2026-06-18

This theme centers on making reasoning-oriented LLMs more efficient in both training and inference, rather than treating stronger reasoning as purely a scaling problem.

Temporal Reasoning for Egocentric and Action Video

2026-06-12 – 2026-06-18

This week's theme centers on how video models are being evaluated and redesigned for stronger temporal reasoning, especially in action and egocentric settings.

Image Editing Benchmarks

2026-06-12 – 2026-06-18

This theme centers on new benchmarks and evaluation frameworks for instruction-based image editing, motivated by the gap between advancing visual generation and reliable edit assessment.

LLM Research-Agent Evaluation

2026-06-05 – 2026-06-11

This week's theme centers on evaluating and improving LLM-based research and problem-solving agents beyond one-shot task success.

Structured World Models

2026-06-05 – 2026-06-11

This week's papers move world modeling away from monolithic black-box predictors toward structured, modular architectures designed to better capture the dynamics of diverse environments.

Controllable and Scalable Model Merging

2026-06-05 – 2026-06-11

This week's theme centers on making model merging more controllable, scalable, and robust as the number of fine-tuned expert models grows.

Embodied World Models & Evaluation

2026-05-29 – 2026-06-04

This week's work marks a shift from evaluating multimodal models on static perception toward testing whether they can form actionable, physically grounded world models.

AI Governance and Safety

2026-05-29 – 2026-06-04

This week's AI safety research emphasizes the shift from broad concern about AI harms toward structured governance and quantitative risk-modeling frameworks.

Agentic Reasoning Evaluation for LLMs

2026-05-29 – 2026-06-04

This theme centers on evaluating and structuring LLM reasoning in settings where static prompting or generic inference heuristics break down, especially when retrieval, domain knowledge, and multi-step decision rules must interact.

Reinforcement Learning for Recommendation

2026-05-22 – 2026-05-28

This week's theme centers on applying reinforcement learning to move recommendation beyond greedy next-item prediction toward long-term user engagement.

Aligned Visual Representations

2026-05-22 – 2026-05-28

This week's papers treat representation quality and cross-scale alignment as a central bottleneck in both generative modeling and general visual pretraining.

Spatial Reasoning and Uncertainty in Vision-Language Navigation

2026-05-22 – 2026-05-28

This week's theme centers on how vision-language and embodied models are being tested and redesigned for navigation when spatial reasoning, long-horizon decision-making, and safety become bottlenecks.

Evaluating LLM Co-Scientists

2026-05-15 – 2026-05-21

This week's theme centers on how LLM-based research agents should be assessed and scaffolded as they move beyond writing support into research planning, experimentation, review, and publication workflows.

Structured Representations for Embodied VLMs

2026-05-15 – 2026-05-21

This week's theme centers on equipping vision-language models with explicit geometric and navigational structure for embodied tasks, moving beyond brittle prompting or task-specific heads.

Structured and Efficient Diffusion Editing

2026-05-15 – 2026-05-21

This theme centers on diffusion models that move beyond generic text-to-image generation toward more structured, grounded, and computationally practical image editing and perception.

Unified Autoregressive Image Generation and Editing

2026-05-08 – 2026-05-14

This week saw continued progress toward unified models that combine image generation, editing, and understanding within single autoregressive or hybrid autoregressive-diffusion architectures.

LLM Multi-Agent Collaboration

2026-05-08 – 2026-05-14

This theme centers on coordinating multiple LLM-based agents to handle tasks beyond what a single model instance can easily support.

Generative 3D Reconstruction and Video Understanding

2026-05-08 – 2026-05-14

This week's theme centers on methods that recover richer scene structure and semantics from limited video observations.

Holistic Evaluation for World and Video Models

2026-05-01 – 2026-05-07

This week's theme centers on benchmark work that evaluates world, video, and multi-view generation models beyond surface-level visual quality.

Curriculum and Diverse Skill Learning in RL

2026-05-01 – 2026-05-07

This week's reinforcement learning theme centers on making agents learn richer behaviors through curriculum design and modular skill representations.

Activation Steering and Representation Geometry

2026-05-01 – 2026-05-07

This theme tracks activation steering as an inference-time method to control and adapt language models without modifying parameters.

Generative Models as World Models

2026-04-24 – 2026-04-30

This week's papers frame advanced video and multimodal generative systems as emerging world models rather than mere content generators.

Scientific Research Agent Benchmarking

2026-04-24 – 2026-04-30

This theme centers on how to evaluate LLM-based agents for scientific research and complex information seeking under realistic, controlled conditions.

Diffusion Language Models and Token Ordering

2026-04-24 – 2026-04-30

This week's theme centers on discrete and masked diffusion language models as an alternative to autoregressive LLMs, with particular emphasis on how decoding order shapes capability and efficiency.

Model Evaluation and Benchmarking

2026-04-17 – 2026-04-23

This week's evaluation work highlights persistent gaps between how visual models are assessed and the conditions they face in practice.

Temporal Video Reasoning Evaluation

2026-04-17 – 2026-04-23

This theme addresses how to evaluate and improve models' understanding of temporal structure in video.

GUI Agent Evaluation

2026-04-17 – 2026-04-23

This week's work reflects a shift from building GUI-capable VLM/LLM agents toward evaluating them more rigorously across platforms, capability levels, and failure modes.

Unified Image Restoration Benchmarking

2026-04-10 – 2026-04-16

This week saw multiple new competition benchmarks that extend image restoration evaluation beyond single-degradation settings.

Efficient MoE Methods for LLMs

2026-04-10 – 2026-04-16

This week's representative papers address how to scale large language models more efficiently through mixture-of-experts architectures and smarter pre-training data-mixture design.

Temporal Control in Multimedia Generation

2026-04-10 – 2026-04-16

This week's progress centers on making diffusion-based multimedia generation more temporally coherent and controllable as these models expand from images into video and audio.

Robust 3D Reconstruction Evaluation

2026-04-03 – 2026-04-09

This week's theme centers on evaluating 3D reconstruction under realistic adverse conditions: noisy video, human-object interaction, and sparse or degraded observations.

In-Context RL with Transformers

2026-04-03 – 2026-04-09

This week saw continued progress on using transformer-based pretraining to enable in-context adaptation in sequential decision-making without weight updates.

LLM Agent Memory and Collaboration

2026-04-03 – 2026-04-09

This week's papers focus on making LLM agents more reliable on complex, long-horizon tasks by improving how they store, extract, share, and secure knowledge.

Medical AI Evaluation and Temporal Multimodality

2026-03-27 – 2026-04-02

This week's representative papers highlight that medical AI progress hinges on clearer evaluation frameworks and richer clinical context, not only on stronger models.

LLM Multi-Agent Frameworks

2026-03-27 – 2026-04-02

This week's papers center on how to organize LLM-based multi-agent systems for complex, real-world tasks.

LLM Attribution and Citation Evaluation

2026-03-27 – 2026-04-02

This theme centers on how LLM outputs can be attributed to supporting documents so that generated answers are more transparent, verifiable, and trustworthy.

Efficient Multimodal Foundation Models

2026-03-19 – 2026-03-26

This week's papers focus on making multimodal foundation models more efficient without sacrificing broad utility.

Speech Model Adaptation for Atypical and Shifted Speech

2026-03-19 – 2026-03-26

This week's theme concerns adapting and evaluating speech models when labeled in-domain data are scarce, domains shift, or speech departs from typical patterns.

AI Sustainability and Trustworthiness

2026-03-19 – 2026-03-26

This week's papers frame AI deployment as an environmental and governance challenge.

Comprehensive LLM Agent Evaluation

2026-03-16 – 2026-03-22

This week's evaluation work pushes beyond narrow benchmark settings toward broader tests for LLM- and VLM-based agents.

Federated Learning Privacy Inference

2026-03-16 – 2026-03-22

This week's theme centers on privacy evaluation in federated learning, where shared gradients, parameters, or predictions can leak sensitive information even when raw data stays on-device.

AI Sustainability and Trustworthiness

2026-03-16 – 2026-03-22

This week's papers treat the environmental impact of AI infrastructure as a direct evaluation concern.