FuguReport

Anchor Date: 2026-07-30

Next (2026-07-23)

Weekly

2026-07-24 - 2026-07-30

Theme 1

Multimodal Reasoning Evaluation

This theme centers on how increasingly capable language and multimodal models should be evaluated and controlled as they move beyond text-only benchmarks.

Theme 2

Manipulation Generalization and Evaluation

This week's theme centers on evaluating and improving embodied manipulation in contact-rich settings where vision-only supervision and narrowly collected robot data are insufficient.

Theme 3

LLM Software Engineering Evaluation

This theme centers on evaluating AI coding assistants and software-engineering agents beyond narrow benchmarks or final-output metrics.

Daily

Recent Daily Reports

50 reports

2026-07-30 Method / Self-Distillation / Online self-distillation methods

$β$-OPSD: Deriving with Policy Optimization, Training with Self-Distillation

This paper argues that vanilla on-policy self-distillation (OPSD) can be viewed as the β=1 case of a broader KL-regularized policy-optimization objective that balances movement toward a privileged teacher against staying close to a reference policy.

2026-07-30 Method / Feature Aggregation / Adaptive anchor frame framework

AdaAnchor4D: Anchor-Conditioned Spatiotemporal Feature Aggregation for Monocular UAV 4D Reconstruction

AdaAnchor4D is a monocular UAV 4D reconstruction method designed for urban scenes with heterogeneous local motion over space and time.

2026-07-30 Evaluation / Visual Reasoning Benchmarking / Benchmark for visual-geometric reasoning

JigShape: Evaluating Visual-Geometric Reasoning in VLMs through Jigsaw Puzzles

JigShape is a benchmark for evaluating visual-geometric reasoning in vision-language models through jigsaw puzzle completion.

2026-07-30 Evaluation / Reward Model Evaluation / Assessment of computer-use reward models

OSReward: Instituting Standardized Evaluation for Cross-Platform Computer-Use Reward Models

This paper studies whether vision-language models can reliably act as judges for computer-using agent trajectories, a role that is increasingly used for evaluation, data curation, and reinforcement learning.

2026-07-29 Evaluation / Embodied Interaction Benchmark / HumanCLAW-Bench indoor navigation

HumanCLAW: Can Vision-Language Models Act Through a Body?

HumanCLAW is an evaluation framework for testing whether vision-language models can make effective embodied decisions when their outputs are executed through a humanoid body.

2026-07-29 Method / Model Inversion / One-step inversion framework

FARI: Robust One-Step Inversion for Watermarking in Diffusion Models

This paper studies inversion-based watermark verification for diffusion-generated images and argues that, in practice, robustness to external distortions is more important than minimizing internal inversion truncation error.

2026-07-29 Method / Game Modeling / State-aware game world modeling

StatePlay: State-Aware Game World Models for Mechanics-Consistent Generation

StatePlay is presented as a state-aware game world model for fighting-game generation that jointly predicts video frames and internal game states such as health, skill meters, and timers.

2026-07-29 Evaluation / Model Evaluation / Visual evidence response assessment

Visual Credit Audit for Multimodal Spatial Reasoning

This paper presents Visual Credit Audit (VCA), a decision-level evaluation framework for multimodal spatial reasoning that determines if benchmark images genuinely support a model's answers better than text-only or blank controls.

2026-07-29 Method / Vision-Language Models / Explicit object structure representation

Explicit Kinematic Guidance from Analytic Concepts for Vision-Language-Action Models

This paper presents SAGE, a post-training framework for vision-language-action models that injects explicit object structure and kinematic knowledge through executable Analytic Concepts.

2026-07-28 Method / Agent Framework / Agentic decomposition for design editing

ReDesign: Recovering Editable Design Structures from Images via Agentic Decomposition

This paper studies the reconstruction of editable design files from raster images, aiming to recover hierarchy, text, vector shapes, colors, grouping, and layer order rather than only matching appearance.

2026-07-28 Method / Reward Modeling / Injecting turn-level signals from game solvers

CAST: Game Solvers as Turn-Level Teachers for LLM Agents

This paper studies how to train LLM agents for long-horizon games when standard reinforcement learning with verifiable rewards provides only sparse terminal feedback.

2026-07-28 Method / Memory Management / Self-routing memory controller framework

UniMem: Complementary Episodic-to-Parametric Memory for Boundary-Agnostic Task Streams

UniMem is a memory-management framework for LLM agents operating over heterogeneous, boundary-agnostic task streams.

2026-07-28 Method / Vulnerability Classification / Multi-label CVE classification

Mapping CVEs to MITRE ATT&CK Techniques: A Curated Gold-Set Classifier and the Limits of LLM-Assisted Label Expansion

This paper studies automated mapping of CVE descriptions to MITRE ATT&CK Enterprise techniques using a supervised multi-label classifier trained on a curated gold set of 1,207 CVEs derived from expert MITRE CTID mappings.

2026-07-27 Task / Embodied Manipulation / Data organization for manipulation tasks

Data Pyramid for Embodied Manipulation

This paper is a data-centric survey of embodied manipulation. It argues that, unlike internet-scale vision-language models, embodied agents need supervision that couples observations with physical states and actions.

2026-07-27 Method / Mixture-of-Experts / Routing and shared lightweight experts

MMOE: Modernizing Diffusion Transformers with Efficient Expert Design

This paper presents MMOE, a modernization of SiT-style diffusion transformers that imports several efficiency-oriented mixture-of-experts designs from large language models.

2026-07-27 Method / Multimodal Model Design / Codec-native streaming model architecture

Mage-VL: An Efficient Codec-Native Streaming Multimodal Foundation Model

Mage-VL is a 4B-parameter streaming vision-language model built around Mage-ViT, a codec-native visual encoder that uses motion vectors and residual information to select dynamic patches instead of uniformly sampling frames.

2026-07-27 Method / Visual Generation / Unified autoregressive framework

UniGen-AR: Unifying Visual Generation with Auto-Regressive Modeling

UniGen-AR studies unified visual generation with a single model that produces image-valued outputs for more than 15 tasks across text-to-image generation, editing, restoration, and classical perception.

2026-07-26 Method / Actor-Critic Methods / Joint optimization of token and turn-level rewards

Hybrid Advantage Estimation with Unified Critic for VLM Agentic Reinforcement Learning

This paper studies reinforcement learning for vision-language model agents that operate over multiple interaction turns, where both token-level generation and turn-level decision making matter.

2026-07-26 Method / Vision-and-Language Navigation / Memory-augmented navigation framework

MemVLN: Episodic and Procedural Memory for Vision-and-Language Navigation

MemVLN is a vision-and-language navigation framework for continuous environments that aims to balance long-horizon visual memory with low-latency control.

2026-07-26 Method / Pose Tracking / Robust 6D pose tracking in dynamic scenes

RRTrack: Robust and Recoverable Object 6D Pose Tracking for Dynamic Scenes

RRTrack is a training-free 6D object pose tracking framework designed for dynamic scenes with fast motion, occlusion, and disappearance–reappearance events.

2026-07-26 Method / Attribution / Regularization under geometric transformations

Consistent Evidence, Robust Recognition: Faithful Attribution Regularization under Geometric Transformations

This paper proposes an annotation-free attribution regularization framework for image recognition that aims to enforce consistent evidence use under label-preserving geometric transformations.

2026-07-25 Evaluation / Image Segmentation Benchmarking / Food instance segmentation evaluation

DishSeg24k: A Large-Scale Benchmark for Food Segmentation with Stochastic Expert Decoding

This paper introduces DishSeg24k, a large-scale dish-level food segmentation benchmark designed to reflect real-world dining scenes with dense inter-dish overlap, fine-grained class similarity, and long-tail category imbalance.

2026-07-25 Task / Continual Learning / Class incremental learning without retraining

Breaking the Synthetic-Real Domain Shortcut for Training-Free Generative Replay-based Class Incremental Learning

This paper studies training-free generative replay for exemplar-free class-incremental learning and identifies a failure mode that arises when synthetic old-class images are mixed directly with real new-class images during incremental training.

2026-07-25 Method / Reward Design / Practical reward framework for RL

SeekJudge: A Practical Reward Framework for Reinforcement Learning in Computer-Use Agents

The paper addresses reward design for reinforcement learning in computer-use agents, focusing on how to judge whether long GUI interaction trajectories actually satisfy an instruction.

2026-07-25 Method / Genetic Algorithm / Efficient binary decision space search

A genetic algorithm for student academic resource allocation

The paper formulates personalized academic resource selection for a single high school mathematics student as a 0–1 binary combinatorial optimization problem with a strict maximum study-time constraint.
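As an illustration of this kind of formulation, the sketch below runs a small genetic algorithm over 0–1 selection vectors with a hard study-time budget. It is not the paper's algorithm: the resource values, durations, operators (tournament selection, one-point crossover, bit-flip mutation, elitism), and all parameter names are assumptions chosen for the example.

```python
import random

def fitness(bits, values, minutes, max_minutes):
    """Total value of the selected resources, or 0 if the time budget is exceeded."""
    total_time = sum(m for b, m in zip(bits, minutes) if b)
    if total_time > max_minutes:
        return 0  # strict constraint: infeasible selections score nothing
    return sum(v for b, v in zip(bits, values) if b)

def genetic_select(values, minutes, max_minutes, pop_size=30,
                   generations=60, mutation_rate=0.05, seed=0):
    """Search 0-1 selection vectors; needs at least two candidate resources."""
    rng = random.Random(seed)  # seeded so runs are repeatable
    n = len(values)

    def score(bits):
        return fitness(bits, values, minutes, max_minutes)

    population = [[rng.randint(0, 1) for _ in range(n)] for _ in range(pop_size)]
    best = max(population, key=score)
    for _ in range(generations):
        next_pop = [best[:]]  # elitism: carry the best individual forward
        while len(next_pop) < pop_size:
            # Tournament selection: best of three random individuals, twice.
            p1 = max(rng.sample(population, 3), key=score)
            p2 = max(rng.sample(population, 3), key=score)
            cut = rng.randint(1, n - 1)  # one-point crossover
            child = p1[:cut] + p2[cut:]
            # Bit-flip mutation with a small per-gene probability.
            child = [1 - g if rng.random() < mutation_rate else g for g in child]
            next_pop.append(child)
        population = next_pop
        best = max(population, key=score)
    return best, score(best)

# Hypothetical data: benefit score and duration (minutes) of six resources.
values = [8, 5, 6, 3, 7, 4]
minutes = [40, 20, 30, 10, 35, 15]
max_minutes = 90

best, score = genetic_select(values, minutes, max_minutes)
print(best, score)
```

Scoring infeasible selections as zero is the simplest way to enforce a strict budget; a repair step or a graded penalty are common alternatives when feasible solutions are rare.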

2026-07-24 Method / Graphical Models / Factor graph-based inference for MSA

Evolution-Aware MSA Reasoning for Subsampling via Factor Graphs

This paper studies multiple sequence alignment (MSA) subsampling for protein language models under fixed token budgets, arguing that current approaches are mostly heuristics with limited control over which evolutionary signals are retained.

2026-07-24 Evaluation / Model Evaluation / Performance degradation after merging

Enough is as good as a feast: A Comprehensive Analysis of How Reinforcement Learning Mitigates Task Conflicts in LLMs

This paper studies how the post-training paradigm affects model merging in large language models by comparing reinforcement learning (RL)-trained models with supervised fine-tuned (SFT) models.

2026-07-24 Application / Social Robotics / Goal-oriented social navigation tasks

ACME: A Multi-Cultural, Multi-Embodiment Social-Navigation Dataset

This paper introduces ACME, a multi-modal social-navigation dataset designed to capture how robot and pedestrian behavior varies across cultural contexts, environments, and robot forms.

2026-07-24 Method / Graph Neural Networks / Adaptive graph coarsening techniques

Efficient Recommendations via Graph Coarsening and Label Propagation

This paper studies large-scale graph-based recommendation in telecommunications, where full-graph propagation is computationally expensive and Graph Neural Networks (GNNs) can run out of memory.

2026-07-24 Method / Low-Rank Methods / Low-rank adaptation optimization

On the Convergence of Stochastic Low-Rank Adaptation

This paper studies the optimization theory of simultaneous two-factor LoRA, where a frozen pretrained weight matrix is adapted through a low-rank update parameterized by factors B and A.
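For readers unfamiliar with the parameterization, here is a minimal sketch of a two-factor low-rank update: the adapted weight is the frozen matrix plus the product of B and A. The dimensions, the `scale` argument, and the zero initialization of B are assumptions for illustration (zero-initializing B is a common LoRA convention); none of them comes from the paper.

```python
def matmul(X, Y):
    """Plain-Python matrix product of two lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)] for row in X]

def lora_weight(W0, B, A, scale=1.0):
    """Return W0 + scale * (B @ A); W0 stays frozen, only B and A are trained."""
    delta = matmul(B, A)
    return [[w + scale * d for w, d in zip(w_row, d_row)]
            for w_row, d_row in zip(W0, delta)]

d_out, d_in, r = 4, 3, 1  # hypothetical sizes; r is the adapter rank

# Frozen pretrained weight (here just an identity-like 4x3 matrix).
W0 = [[1.0 if i == j else 0.0 for j in range(d_in)] for i in range(d_out)]

B = [[0.0] for _ in range(d_out)]  # d_out x r, zero init so the update starts at zero
A = [[0.5, -0.5, 1.0]]             # r x d_in

W_start = lora_weight(W0, B, A)    # equals W0 because B is all zeros

# Pretend one training step changed B; both factors would be updated in practice.
B_trained = [[2.0], [0.0], [0.0], [0.0]]
W_adapted = lora_weight(W0, B_trained, A)

trainable = r * (d_out + d_in)     # parameters in B and A
full = d_out * d_in                # parameters in a full-rank update
print(W_adapted[0], trainable, full)
```

The parameter saving grows with matrix size: the adapter holds r(d_out + d_in) values against d_out × d_in for a full update.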

2026-07-23 Method / Self-Distillation / Visual contrastive distillation technique

Visual Contrastive Self-Distillation

This paper introduces Visual Contrastive Self-Distillation (VCSD), an on-policy self-distillation method for vision-language models that does not rely on an external teacher, privileged answers, reasoning traces, or explicit visual evidence signals.

2026-07-23 Method / View Synthesis / Efficient dynamic novel view synthesis

GrainGS: Gradient-Decoupled Gaussian Splatting for Efficient Dynamic Novel View Synthesis

GrainGS is a dynamic 3D Gaussian Splatting method for novel view synthesis that combines a hierarchical anchor scaffold with per-Gaussian temporal deformation.

2026-07-23 Method / Multi-Agent Collaboration / Modular agent cooperation framework

Agentic Designer: Progressive Multi-Agent Collaboration for Structure-Aware Interior Layout Generation

This paper introduces Agentic Designer, a structure-aware interior layout generation framework that treats layout synthesis as an iterative decision process rather than one-shot prediction.

2026-07-23 Evaluation / Benchmarking / Multi-domain coding-agent evaluation suite

Tencent WorkBuddy Bench: A Multi-Domain Coding-Agent Benchmark with Contamination-Resistant Task Construction

Tencent WorkBuddy Bench is an open, multi-domain benchmark for coding agents spanning Code, Web, Office, and Security tasks under a shared execution harness.

2026-07-23 Method / Model Compression / Compression for ASR models on edge CPUs

VibeVoice-ASR-BitNet Technical Report

This technical report presents VibeVoice-ASR-BitNet, a compressed version of VibeVoice-ASR designed for real-time automatic speech recognition on edge CPUs.

2026-07-22 Method / Video Generation / Native long video extrapolation

Self Gradient Forcing: Native Long Video Extrapolation

This paper studies long-horizon autoregressive video diffusion and argues that existing Self Forcing training still leaves a missing supervision path for how self-generated history is written into the model’s key-value memory.

2026-07-22 Method / Mixture-of-Experts / Large-scale MoE language modeling

Solar Open 2 Technical Report

Solar Open 2 is a 250B-parameter, 15B-active Mixture-of-Experts language model designed for long-horizon agentic tasks and Korean-language strength.

2026-07-22 Method / Multi-Object Tracking / Spike-driven tracker design

SpikingMOT: A Spike-Driven Multi-Object Tracker

This paper studies multi-object tracking from the perspective of activation sparsity, arguing that dense neural responses may be unnecessary and sometimes counterproductive for trajectory prediction.

2026-07-22 Method / Model Evaluation / Evaluator feedback post-training

Co-Evolving LLM Evaluators and Policies via DynamicRubric

This paper studies evaluator-guided post-training for large language models and argues that policy optimization depends critically on preserving relative score gaps among candidate responses sampled from the current policy.

2026-07-21 Method / World Modeling / Unified visual and action modeling

Masked Visual Actions for Unified World Modeling

This paper introduces Masked Visual Actions, a pixel-space control interface for video world models in which action is represented as a partially revealed trajectory of an entity in the video.

2026-07-21 Method / Self-Supervised Learning / Multi-interest modeling in conversation

Mitigating Matthew Effect: Multi-Hypergraph Boosted Multi-Interest Self-Supervised Learning for Conversational Recommendation

This paper studies the Matthew effect in conversational recommender systems, where repeated user-system interactions can increasingly favor popular items and narrow users' exposed interests over time.

2026-07-21 Method / Optimization / Spectrum-based optimization stack

ISO: An RLVR-Native Optimization Stack

This paper studies reinforcement learning with verifiable rewards (RLVR) through the singular-value structure of model weights and argues that RLVR often preserves the base model’s spectra while changing the associated left and right singular frames.

2026-07-20 Method / Domain Generalization / Training framework for domain generalization

Simple Domain Generalization for Strong Pixel-Level Image Tampering Detection in Modern VLMs

This paper studies domain generalization for pixel-level image tampering detection in modern vision-language models, focusing on robustness under cross-model and out-of-distribution generator shifts.

2026-07-20 Evaluation / Cryptanalysis Evaluation / Performance on cryptographic breaking tasks

CryptanalysisBench: Can LLMs do Cryptanalysis?

This paper introduces CryptanalysisBench, a benchmark comprising 191 cryptanalysis tasks across six families of cryptographic primitives, primarily drawn from NIST standardization competitions.

2026-07-20 Method / Video Personalization / Human-object centric video generation framework

HOMIE: Human-object Centric Video Personalization via Multimodal Intelligent Enhancement

HOMIE is a human-object centric video personalization framework designed to handle both inter-subject references, where different images depict different entities, and intra-subject references, where multiple inputs (such as OCR maps or multiple views) describe the same subject.

2026-07-20 Method / Graph Learning / Hypergraph multi-preference learning

HyCoRec: Hypergraph-Enhanced Multi-Preference Learning for Alleviating Matthew Effect in Conversational Recommendation

This paper studies the Matthew effect in conversational recommender systems, where repeated user-system interactions can increasingly favor already popular items and suppress less exposed ones.

2026-07-20 Evaluation / Speech Benchmarking / Performance gap with humans

ESCUCHA: A Spanish Speech Benchmark for Heterogeneous Acoustic Conditions

ESCUCHA is presented as the first Spanish speech understanding benchmark for evaluating large audio-language models under heterogeneous acoustic conditions.

2026-07-19 Evaluation / Model Robustness Evaluation / Adversarial attack performance

ALLUDE: A Unified Evaluation System for Configurable Attacks in Differentiable Environments

ALLUDE is a unified evaluation system for adversarial attacks on vision models in differentiable, photorealistic simulation environments.

2026-07-19 Evaluation / Benchmarking / GUI state-transition benchmarks

EvoGUI: An Evolution-Aware Benchmark for GUI State-Transition Understanding

EvoGUI is a diagnostic framework for GUI state-transition understanding that converts normalized GUI trajectories into three visual question answering probes: temporal ordering, inverse action/value prediction, and one-step successor discrimination.

2026-07-19 Method / Model Modification / Refusal removal effects on decision layers

Abliteration Is Not a Scalpel: Off-Target Effects of Refusal Removal on Decision Disposition Across Model Families

This paper tests whether abliteration, a weight-space intervention used to remove refusal behavior from open-weight models, has behavioral side effects beyond refusal.

This page was created using generative AI such as GPT-5, Claude Opus 4, Gemini 3, Gemini 3.1 Flash Image, and their higher-end successor versions. No guarantee can be made regarding its contents.

Archive

Weekly Archive

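Multimodal Reasoning Evaluation

Manipulation Generalization and Evaluation

LLM Software Engineering Evaluation

Evaluation of LLM Code Agents

Embodied World Models for Robot Learning

Interactive LLM Behavior Evaluation

Safety Evaluation of Multimodal Models

Efficient Video Representation and Diffusion Sampling

Benchmark Evaluation of Emotion Models

Evaluation of LLM Scientific Reasoning

Embodied Manipulation: World Models, Tactile Feedback, and Evaluation

Quality and Efficiency of Video Diffusion Models

Dynamic 4D Gaussian Reconstruction

Evaluation of Vision-Language Models

Attribution and Evaluation of Multimodal LLMs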