FuguReport
Browse the latest weekly themes first, then scan the most recent daily reports and archives.
2026-03-27 - 2026-04-02
Medical AI Evaluation and Temporal Multimodality
This week's representative papers highlight that medical AI progress hinges on clearer evaluation frameworks and richer clinical context, not only on stronger models.
Theme 2: LLM Multi-Agent Frameworks
This week's papers center on how to organize LLM-based multi-agent systems for complex, real-world tasks.
Theme 3: LLM Attribution and Citation Evaluation
This theme centers on how LLM outputs can be attributed to supporting documents so that generated answers are more transparent, verifiable, and trustworthy.
Recent Daily Reports
OpenVLThinkerV2: A Generalist Multimodal Reasoning Model for Multi-domain Visual Tasks
OpenVLThinkerV2 is a multimodal reasoning model built on Qwen3-VL-Instruct-8B and trained with a novel reinforcement learning objective called Gaussian GRPO (G²RPO).
Towards Real-world Human Behavior Simulation: Benchmarking Large Language Models on Long-horizon, Cross-scenario, Heterogeneous Behavior Traces
This paper introduces OmniBehavior, a user simulation benchmark constructed from real-world Kuaishou platform logs rather than synthetic or isolated-scenario data.
KnowU-Bench: Towards Interactive, Proactive, and Personalized Mobile Agent Evaluation
KnowU-Bench is an online benchmark for evaluating mobile agents on personalization, interaction, and proactive assistance beyond explicit instruction following.
Small Vision-Language Models are Smart Compressors for Long Video Understanding
This paper introduces Tempo, a 6B-parameter query-aware framework that compresses long videos for downstream reasoning by multimodal large language models (MLLMs).
PASK: Toward Intent-Aware Proactive Agents with Long-Term Memory
This paper proposes DD-MM-PAS (Demand Detection, Memory Modeling, Proactive Agent System), a paradigm for streaming proactive AI agents that infer latent user needs from ongoing context rather than waiting for explicit queries.
SemEval-2026 Task 3: Dimensional Aspect-Based Sentiment Analysis (DimABSA)
This paper presents the SemEval-2026 shared task on Dimensional Aspect-Based Sentiment Analysis (DimABSA), which replaces categorical polarity labels in aspect-based sentiment analysis with continuous valence-arousal (VA) scores.
MARS: Enabling Autoregressive Models Multi-Token Generation
This paper introduces MARS (Mask AutoRegression), a lightweight fine-tuning method that enables instruction-tuned autoregressive language models to predict multiple tokens per forward pass while preserving standard left-to-right autoregressive behavior.
Fast Spatial Memory with Elastic Test-Time Training
This paper identifies that Large Chunk Test-Time Training (LaCT) for long-context 3D/4D reconstruction suffers from catastrophic forgetting and overfitting due to fully plastic fast-weight updates, and is typically limited to a single large chunk spanning the full input sequence.
BiDexGrasp: Coordinated Bimanual Dexterous Grasps across Object Geometries and Sizes
BiDexGrasp presents a large-scale bimanual dexterous grasp dataset and a learning-based generation framework for coordinated two-hand grasping of objects with diverse geometries and sizes.
Q-Zoom: Query-Aware Adaptive Perception for Efficient Multimodal Large Language Models
Q-Zoom is a query-aware adaptive perception framework for multimodal large language models (MLLMs) that reduces the cost of high-resolution visual processing by routing queries through a lightweight dynamic gating network.
FunRec: Reconstructing Functional 3D Scenes from Egocentric Interaction Videos
FunRec is a training-free, optimization-based method that reconstructs functional 3D digital twins of indoor scenes from a single egocentric RGB-D interaction video.
Paper Circle: An Open-source Multi-agent Research Discovery and Analysis Framework
Paper Circle is an open-source multi-agent framework for scientific literature discovery and analysis, built on two complementary pipelines.
Market-Bench: Benchmarking Large Language Models on Economic and Trade Competition
Market-Bench is a benchmark for evaluating large language models in a competitive supply-chain economy where agents must handle both quantitative decisions and marketing language.
Action Images: End-to-End Policy Learning via Multiview Video Generation
This paper introduces Action Images, a unified world-action model that formulates robot policy learning as multiview video generation.
Referring-Aware Visuomotor Policy Learning for Closed-Loop Manipulation
This paper introduces ReV (Referring-Aware Visuomotor Policy), a closed-loop imitation learning framework for robotic manipulation that incorporates sparse 3D referring points provided by a human or a high-level planner during execution.
Archive
Weekly Archive (9)
Medical AI Evaluation and Temporal Multimodality
This week's representative papers highlight that medical AI progress hinges on clearer evaluation frameworks and richer clinical context, not only on stronger models.
LLM Multi-Agent Frameworks
This week's papers center on how to organize LLM-based multi-agent systems for complex, real-world tasks.
LLM Attribution and Citation Evaluation
This theme centers on how LLM outputs can be attributed to supporting documents so that generated answers are more transparent, verifiable, and trustworthy.
Efficient Multimodal Foundation Models
This week's papers focus on making multimodal foundation models more efficient without sacrificing broad utility.
Speech Model Adaptation for Atypical and Domain-Shifted Speech
This week's theme concerns adapting and evaluating speech models when labeled in-domain data are scarce, domains shift, or speech departs from typical patterns.
AI Sustainability and Reliability
This week's papers frame AI deployment as an environmental and governance challenge.
Comprehensive LLM Agent Evaluation
This week's evaluation work pushes beyond narrow benchmark settings toward broader tests for LLM- and VLM-based agents.
Privacy Inference in Federated Learning
This week's theme centers on privacy evaluation in federated learning, where shared gradients, parameters, or predictions can leak sensitive information even when raw data stays on-device.
AI Sustainability and Reliability
This week's papers treat the environmental impact of AI infrastructure as a direct evaluation concern.
Daily Archive (38)
OpenWorldLib: A Unified Codebase and Definition of Advanced World Models
OpenWorldLib is a standardized inference framework and codebase for advanced world models, motivated by the absence of a widely accepted definition of what constitutes a world model.
Your Agent, Their Asset: A Real-World Safety Analysis of OpenClaw
This paper presents the first real-world safety evaluation of OpenClaw, a widely deployed personal AI agent with full local system access and integrations to services such as Gmail, Stripe, and the filesystem.
FileGram: Grounding Agent Personalization in File-System Behavioral Traces
FileGram is a unified framework for personalizing file-system agents using behavioral traces (action sequences and content deltas) rather than dialogue history alone.
Structured Causal Video Reasoning via Multi-Objective Alignment
This paper proposes a structure-first framework for video reasoning in which a model first produces Structured Event Facts—compact, time-ordered descriptions of salient events and their causal relations—and then reasons under those constraints.
Paper Espresso: From Paper Overload to Research Insight
Paper Espresso is an open-source platform that continuously discovers, summarizes, and analyzes community-trending arXiv papers sourced from the Hugging Face Daily Papers feed (approximately 2–3% of arXiv).
Vintix II: Decision Pre-Trained Transformer is a Scalable In-Context Reinforcement Learner
This paper extends the Decision Pre-Trained Transformer (DPT) framework to cross-domain in-context reinforcement learning in continuous-control settings by integrating a flow-based action head trained via rectified flow matching.
NTIRE 2026 3D Restoration and Reconstruction in Real-world Adverse Conditions: RealX3D Challenge Results
This paper presents the results of the NTIRE 2026 3D Restoration and Reconstruction (3DRR) Challenge, which evaluates robust 3D reconstruction pipelines under real-world adverse conditions using the RealX3D benchmark.
Fine-grained Analysis of Stability and Generalization for Stochastic Bilevel Optimization
This paper provides a systematic stability and generalization analysis for first-order stochastic bilevel optimization (SBO) methods.
Combee: Scaling Prompt Learning for Self-Improving Language Model Agents
Combee is a framework for scaling prompt learning in self-improving language model agents under high parallelism.
Align Your Structures: Generating Trajectories with Structure Pretraining for Molecular Dynamics
This paper proposes EGInterpolator, a two-stage framework for molecular dynamics (MD) trajectory generation.
Relay-Assisted Activation-Integrated SIM for Wireless Physical Neural Networks
This paper proposes a relay-assisted wireless physical neural network (WPNN) architecture based on activation-integrated stacked intelligent metasurfaces (AI-SIMs).
Love Me, Love My Label: Rethinking the Role of Labels in Prompt Retrieval for Visual In-Context Learning
This paper studies prompt retrieval for visual in-context learning (VICL) and argues that existing methods overemphasize visual similarity while neglecting prompt labels.
Imagine Before Concentration: Diffusion-Guided Registers Enhance Partially Relevant Video Retrieval
This paper addresses Partially Relevant Video Retrieval (PRVR), where a text query describes only a segment of an untrimmed video, making retrieval susceptible to spurious local matches.
Unveiling Language Routing Isolation in Multilingual MoE Models for Interpretable Subnetwork Adaptation
This paper analyzes expert routing patterns in multilingual Mixture-of-Experts (MoE) models and identifies a phenomenon termed Language Routing Isolation, where high-resource and low-resource languages activate largely disjoint expert sets.
ActivityForensics: A Comprehensive Benchmark for Localizing Manipulated Activity in Videos
This paper introduces ActivityForensics, a benchmark for temporal localization of manipulated human activities in videos, targeting semantic changes in human actions rather than appearance-level edits such as face swaps or object removal.
SecPI: Secure Code Generation with Reasoning Models via Security Reasoning Internalization
This paper presents SecPI, a fine-tuning pipeline for reasoning language models (RLMs) that aims to make secure code generation the default behavior without requiring explicit security prompts at inference time.
CoME-VL: Scaling Complementary Multi-Encoder Vision-Language Learning
CoME-VL is a modular multi-encoder vision-language framework that integrates a contrastively trained SigLIP2 encoder with a self-supervised DINOv3 encoder to improve both semantic understanding and spatial grounding.
PolyReal: A Benchmark for Real-World Polymer Science Workflows
PolyReal is a multimodal benchmark designed to evaluate multimodal large language models (MLLMs) on real-world polymer science workflows rather than isolated scientific subtasks.
Do Audio-Visual Large Language Models Really See and Hear?
This paper presents the first mechanistic interpretability study of Audio-Visual Large Language Models (AVLLMs), analyzing how audio and visual representations evolve and fuse across transformer layers during caption generation.
EvaNet: Towards More Efficient and Consistent Infrared and Visible Image Fusion Assessment
This paper addresses the efficiency and consistency shortcomings of existing evaluation metrics for infrared-visible image fusion (IVIF), which are largely borrowed from other vision tasks without adaptation.
Verbalizing LLMs' assumptions to explain and control sycophancy
This paper introduces Verbalized Assumptions, a framework for eliciting LLMs' inferred assumptions about users through both open-ended and structured prompting, and connects these assumptions to social sycophancy.
NearID: Identity Representation Learning via Near-identity Distractors
This paper identifies a systematic failure mode in vision encoders used for identity-focused tasks: embeddings entangle object identity with background context, allowing visually similar but distinct objects placed on the same background to outscore true cross-view matches.
A3R: Agentic Affordance Reasoning via Cross-Dimensional Evidence in 3D Gaussian Scenes
This paper addresses affordance reasoning in 3D Gaussian Splatting (3DGS) scenes, where the goal is to localize the region supporting a text-specified action.
Are VLMs Lost Between Sky and Space? LinkS$^2$Bench for UAV-Satellite Dynamic Cross-View Spatial Intelligence
This paper introduces LinkS$^2$Bench, a benchmark for evaluating vision-language models (VLMs) on dynamic UAV-satellite cross-view spatial intelligence.
Steerable Visual Representations
This paper introduces SteerViT, a method that makes pretrained vision transformer (ViT) representations steerable via natural language by inserting lightweight gated cross-attention layers into frozen ViT blocks, enabling text to influence intermediate visual features through early fusion.
CORAL: Towards Autonomous Multi-Agent Evolution for Open-Ended Discovery
CORAL is a framework for autonomous multi-agent evolution on open-ended discovery tasks, replacing fixed evolutionary heuristics with long-running agents that decide what to retrieve, test, evaluate, and store.
VideoZeroBench: Probing the Limits of Video MLLMs with Spatio-Temporal Evidence Verification
VideoZeroBench is a hierarchical benchmark for long-video question answering that evaluates not only answer correctness but also whether models identify the correct temporal and spatial evidence.
Proactive Agent Research Environment: Simulating Active Users to Evaluate Proactive Assistants
This paper introduces the Proactive Agent Research Environment (Pare), a framework for building and evaluating proactive assistants through interaction with active user simulators in digital environments.
Do Phone-Use Agents Respect Your Privacy?
This paper investigates whether phone-use agents handle user data appropriately while completing benign mobile tasks.
Diff3R: Feed-forward 3D Gaussian Splatting with Uncertainty-aware Differentiable Optimization
Diff3R is a framework for feed-forward 3D Gaussian Splatting (3DGS) that trains models to produce initializations explicitly optimized for subsequent test-time refinement, rather than solely for zero-shot prediction.
OrgAgent: Organize Your Multi-Agent System like a Company
This paper introduces OrgAgent, a company-style hierarchical multi-agent system that separates collaboration into governance, execution, and compliance layers.
Deconfounding Scores and Representation Learning for Causal Effect Estimation with Weak Overlap
This paper addresses causal treatment effect estimation under weak overlap between treated and control covariate distributions, a setting where standard estimators become unstable, particularly in high dimensions.
PET-DINO: Unifying Visual Cues into Grounding DINO with Prompt-Enriched Training
PET-DINO extends the text-prompted Grounding DINO detector to support both text and visual prompts for open-set object detection.
Square Superpixel Generation and Representation Learning via Granular Ball Computing
This paper proposes a square superpixel generation method inspired by granular-ball computing, designed to produce grid-aligned, multi-scale square regions that are more compatible with modern deep learning pipelines than irregular superpixels.
DIAL: Decoupling Intent and Action via Latent World Modeling for End-to-End VLA
DIAL (Decoupling Intent and Action via Latent World Modeling) is an end-to-end vision-language-action framework that separates high-level intent formation from low-level motor execution through a differentiable latent intent bottleneck.
Cold-Starts in Generative Recommendation: A Reproducibility Study
This paper presents a systematic reproducibility study of generative recommendation under unified cold-start protocols, covering both new-user and new-item settings.
Curvature-Guided LoRA: Steering in the pretrained NTK subspace
This paper introduces the prediction alignment problem for parameter-efficient fine-tuning (PEFT), which aims to match the outputs of a LoRA-adapted model to those of full fine-tuning at the function level rather than aligning parameter updates.
Dummy-Aware Weighted Attack (DAWA): Breaking the Safe Sink in Dummy Class Defenses
This paper identifies a systematic robustness overestimation problem in dummy-class-based adversarial defenses (e.