FuguReport
Browse the latest weekly themes first, then scan the most recent daily reports and archives.
2026-06-05 - 2026-06-11
LLM Research-Agent Evaluation
This week's theme centers on evaluating and improving LLM-based research and problem-solving agents beyond one-shot task success.
Theme 2: Structured World Models
This week's papers advance world modeling away from monolithic black-box predictors toward structured, modular architectures designed to better capture the dynamics of diverse environments.
Theme 3: Controllable and Scalable Model Merging
This week's theme centers on making model merging more controllable, scalable, and robust as the number of fine-tuned expert models grows.
Recent Daily Reports
Instruct-Particulate: Scaling Feed-Forward 3D Object Articulation with Kinematic Control
This paper introduces Instruct-Particulate, a feed-forward model for reconstructing articulated 3D objects from a static 3D mesh while conditioning on a target kinematic specification.
SkillAudit: Ground-Truth-Free Skill Evolution via Paired Trajectory Auditing
SkillAudit is a framework for improving agent skills without using hidden tests, reference solutions, rewards, or other external ground-truth signals during optimization.
From Chatbot to Digital Colleague: The Paradigm Shift Toward Persistent Autonomous AI
This survey explores the trajectory of Large Language Models (LLMs) from conversational chatbots to persistent autonomous digital colleagues.
Pano3D: Unified 3D Reconstruction and Panoptic Segmentation
Pano3D is a unified framework that performs 3D reconstruction and 3D panoptic segmentation directly from unposed RGB image collections.
A theoretical model for task routing in mixture-of-expert transformers
This paper develops a theoretical framework for task routing in mixture-of-experts (MoE) transformers using a discrete language model built from syntactic templates and finite key-value knowledge dictionaries.
EvoArena: Tracking Memory Evolution for Robust LLM Agents in Dynamic Environments
This paper introduces EvoArena, a benchmark suite for evaluating LLM agents in persistently evolving environments rather than static snapshots.
RepWAM: World Action Modeling with Representation Visual-Action Tokenizers
RepWAM is a representation-centric world action model designed for robot manipulation.
SpatialClaw: Rethinking Action Interface for Agentic Spatial Reasoning
This paper studies how the action interface of a tool-augmented agent affects open-ended spatial reasoning.
Efficient, Robust, and Anti-Collusion Fingerprinting of Image Diffusion Models
This paper studies model fingerprinting for text-to-image diffusion models under a threat that prior work largely ignores: collusion attacks in which multiple users combine their fingerprinted model copies to weaken attribution.
Who Pays the Price? Stakeholder-Centric Prompt Injection Benchmarking for Real-world Web Agents
This paper introduces StakeBench, a benchmark for evaluating prompt-injection attacks against real-world LLM web agents from a stakeholder-centric perspective.
Toward Generalist Autonomous Research via Hypothesis-Tree Refinement
This paper studies autonomous research as a long-horizon optimization problem and formalizes it as Autonomous Optimization (AO), where an agent must iteratively improve an artifact using development feedback while reserving held-out evaluation for admission decisions.
AudioX-Turbo: A Unified Framework for Efficient Anything-to-Audio Generation
AudioX-Turbo is a unified framework for generating audio or music from flexible combinations of text, video, and audio conditions.
InternVideo3: Agentify Foundation Models with Multimodal Contextual Reasoning
InternVideo3 is a framework for long-horizon video understanding that reformulates multimodal understanding as Multimodal Contextual Reasoning (MCR), a closed-loop process over an evolving shared context.
4DP-QA: Scalable QA for 4D Perception in Vision Language Models
This paper presents a scalable question-answer generation pipeline for training and evaluating vision-language models on 4D scene understanding, with an emphasis on motion and dynamic spatial reasoning.
Slots, Transitions, Loops: Learning Composable World Models for ARC
This paper studies ARC as demonstration-conditioned state transition learning rather than direct grid-to-grid prediction.
CIAware-Bench: Benchmarking Control Intervention Awareness Across Frontier LLMs
This paper introduces CIAware-Bench, a benchmark for measuring control intervention awareness in frontier language models: whether a model can tell when part of its trajectory has been replaced or edited by a control protocol.
Causal Ensemble Agent: Hierarchical Causal Discovery with LLM-guided Expert Reweighting
This paper proposes Causal Ensemble Agent (CEA), a hierarchical causal discovery framework that combines multiple statistical causal discovery experts across three graph levels: skeletons, v-structures, and edge orientations.
Kwai Keye-VL-2.0 Technical Report
Kwai Keye-VL-2.
Accelerating NeurASP with vectorization and caching
This paper studies the computational bottlenecks of NeurASP, a neuro-symbolic framework that trains neural networks through ASP-based reasoning when only downstream labels are available.
End-to-End Context Compression at Scale
This paper revisits encoder-decoder context compression for long-context language model inference, targeting the memory bottleneck created by growing KV caches.
SpatialWorld: Benchmarking Interactive Spatial Reasoning of Multimodal Agents in Real-World Tasks
SpatialWorld is a benchmark for evaluating interactive spatial reasoning in multimodal agents on complex real-world tasks.
EPS3D: End-to-End Feed-Forward 3D Panoptic Segmentation
EPS3D is an end-to-end feed-forward framework for open-vocabulary 3D panoptic segmentation from unposed multi-view images.
OmniGen-AR: AutoRegressive Any-to-Image Generation
OmniGen-AR is presented as a unified autoregressive framework for any-to-image generation that encodes text and diverse visual conditions into discrete tokens within a single model.
Bridging the Agent-World Gap: Text World Models for LLM-based Agents
This paper surveys text world models (TWMs) for LLM-based agents, starting from the observation that many current agents act reactively without an explicit model of how textual environments change over time.
Look Less, Reason More: Block-wise Attention Skipping for Efficient Multimodal LLMs
This paper studies inference inefficiency in multimodal large language models and argues that deep-layer visual self-attention becomes redundant after visual tokens have already formed stable spatial structure.
Understanding the Parameter Space Geometry of Transformers Encoding Boolean Functions
This paper studies why transformers often fail to learn certain Boolean functions even when those functions are expressible by some parameter settings.
Activation Steering Induces Emergent Misalignment: A More Comprehensive Evaluation
This paper studies whether activation steering can induce emergent misalignment, meaning broadly unsafe behavior that generalizes beyond the narrow task used to derive the steering signal.
Active Flow Expansion for Out-of-Distribution Discovery: from Theory to Molecules
This paper argues that standard flow and diffusion pre-training is limited for scientific discovery because it matches the observed data distribution, which may cover only a small portion of the full valid design space.
MaskAlign: Token-Subset Representation Alignment for Efficient Diffusion Training
This paper studies representation alignment for diffusion transformers from a token-level perspective and argues that aligning all diffusion tokens to clean-image encoder features creates a mismatch because diffusion inputs are noisy and informative content varies by timestep.
Diffusion Language Model Parallel Decoding via Product-of-Experts Bridge
This paper studies how to improve diffusion language model (DLM) decoding so it retains parallel generation speed while better matching the quality of a stronger autoregressive (AR) model.
DisCo: World Models with Discrete Camera Motion Control
DisCo is a controllable video world model that replaces continuous camera trajectories with a compact discrete action space for camera motion control.
Mind Your Steps: A General Learning Framework for Accurate Humanoid Foothold Tracking
This paper presents a lightweight reinforcement-learning framework for training general-purpose 3D foothold-tracking policies for humanoid locomotion.
REACT 2026: The Fourth Multiple Appropriate Facial Reaction Generation Challenge: Personalised MAFRG and Appropriate EEG Reaction Prediction
This paper presents the REACT 2026 challenge on multiple appropriate facial reaction generation (MAFRG) in dyadic interactions, extending prior editions with a stronger focus on personalisation.
IntentNav: Learning Spatial-Visual Object Navigation from Human Demonstrations
IntentNav is a framework for ObjectNav that learns human-like search policies from human demonstrations rather than relying on low-level action imitation alone.
Watch, Remember, Reason: Human-View Video Understanding with MLLMs
This paper is a survey on multimodal large language model (MLLM) based video understanding, motivated by the shift from short clips to long, multimodal, and knowledge-intensive video scenarios.
ThinkBooster: A Unified Framework for Seamless Test-Time Scaling of LLM Reasoning
ThinkBooster is presented as a unified framework for test-time compute scaling in LLM reasoning, intended to support both research and practical deployment.
Planning-aligned Token Compression for Long-Context Autonomous Driving
This paper introduces COMPACT-VA, a planning-aligned token compression framework for long-context autonomous driving built on a conditional VQ-VAE and a hierarchical Q-former memory buffer.
ForensicConcept: Transferable Forensic Concepts for AIGI Detection
This paper studies why AI-generated image detectors often generalize poorly to unseen generators and argues that one obstacle is the lack of explicit, inspectable evidence in current black-box detectors.
AnchorWorld: Embodied Egocentric World Simulation with View-based Evolution Customization
AnchorWorld is a framework for embodied egocentric world simulation that combines human-motion-driven control with localized world customization.
Towards World Models in Biomedical Research
This paper is a perspective article that proposes biomedical world models as a new AI paradigm for biomedical research.
Online KL-Regularized Reinforcement Learning with Function Approximation under Misspecification
This paper studies KL-regularized contextual bandits and episodic reinforcement learning with general function approximation when the model class is misspecified.
Goedel-Architect: Streamlining Formal Theorem Proving with Blueprint Generation and Refinement
Goedel-Architect is an agentic Lean 4 theorem-proving pipeline built around a global blueprint: a dependency graph of formally stated definitions and lemmas leading to a target theorem.
Imagine Before You Predict: Interleaved Latent Visual Reasoning for Video Event Prediction
This paper studies video event prediction, where a model must infer unobserved future events from a partial video.
Ouvia: A User-centered Framework for Measuring Usability of Speech Translation in Real-World Communication Scenarios
This paper introduces Ouvia, a user-centered evaluation framework for assessing the usability of speech translation in realistic one-to-one communication settings rather than decontextualized benchmark tests.
Audio Interaction Model
This paper formalizes the Audio Interaction Model, a streaming audio-language setting in which a model continuously listens to audio and decides when to remain silent or respond.
Agents' Last Exam
Agents’ Last Exam (ALE) is a benchmark for evaluating AI agents on long-horizon, economically valuable real-world tasks with verifiable outcomes.
ZipSplat: Fewer Gaussians, Better Splats
ZipSplat is a feed-forward 3D Gaussian Splatting model that predicts a compact scene representation from multi-view images without tying one Gaussian to each input pixel.
AutoLab: Can Frontier Models Solve Long-Horizon Auto Research and Engineering Tasks?
AutoLab is a benchmark for ultra long-horizon closed-loop optimization, designed to evaluate whether frontier models can improve working but suboptimal research and engineering artifacts through repeated experimentation and refinement over multi-hour budgets.
Stateful Visual Encoders for Vision-Language Models
This paper studies a limitation of open-weight vision-language models in multi-image and multi-turn settings: their visual encoders usually process each image independently, leaving cross-image comparison to the language model.
Towards Accurate Model Selection in Deep Unsupervised Domain Adaptation
This paper addresses model selection for deep unsupervised domain adaptation (UDA), where target labels are unavailable and commonly used validation strategies are biased, unstable, or rely on labeled target data.
Archive
Weekly Archive (15)
LLM Research-Agent Evaluation
This week's theme centers on evaluating and improving LLM-based research and problem-solving agents beyond one-shot task success.
Structured World Models
This week's papers advance world modeling away from monolithic black-box predictors toward structured, modular architectures designed to better capture the dynamics of diverse environments.
Controllable and Scalable Model Merging
This week's theme centers on making model merging more controllable, scalable, and robust as the number of fine-tuned expert models grows.
Embodied World Models and Evaluation
This week's work marks a shift from evaluating multimodal models on static perception toward testing whether they can form actionable, physically grounded world models.
AI Governance and Safety
This week's AI safety research emphasizes the shift from broad concern about AI harms toward structured governance and quantitative risk-modeling frameworks.
Agentic Reasoning Evaluation for LLMs
This theme centers on evaluating and structuring LLM reasoning in settings where static prompting or generic inference heuristics break down, especially when retrieval, domain knowledge, and multi-step decision rules must interact.
Applying Reinforcement Learning to Recommender Systems
This week's theme centers on applying reinforcement learning to move recommendation beyond greedy next-item prediction toward long-term user engagement.
Aligned Visual Representations
This week's papers treat representation quality and cross-scale alignment as a central bottleneck in both generative modeling and general visual pretraining.
Spatial Reasoning and Uncertainty in Vision-Language Navigation
This week's theme centers on how vision-language and embodied models are being tested and redesigned for navigation when spatial reasoning, long-horizon decision-making, and safety become bottlenecks.
Evaluating LLM Co-Researchers
This week's theme centers on how LLM-based research agents should be assessed and scaffolded as they move beyond writing support into research planning, experimentation, review, and publication workflows.
Structured Representations for Embodied VLMs
This week's theme centers on equipping vision-language models with explicit geometric and navigational structure for embodied tasks, moving beyond brittle prompting or task-specific heads.
Structured and Efficient Diffusion Model Editing
This theme centers on diffusion models that move beyond generic text-to-image generation toward more structured, grounded, and computationally practical image editing and perception.
Unified Autoregressive Image Generation and Editing
This week saw continued progress toward unified models that combine image generation, editing, and understanding within single autoregressive or hybrid autoregressive-diffusion architectures.
LLM Multi-Agent Coordination
This theme centers on coordinating multiple LLM-based agents to handle tasks beyond what a single model instance can easily support.
Generative 3D Reconstruction and Video Understanding
This week's theme centers on methods that recover richer scene structure and semantics from limited video observations.