2026-04-29 Daily Report: HOI-aware Adaptive Network for Weakly-supervised Action Segmentation

HOI-aware Adaptive Network for Weakly-supervised Action Segmentation

Authors Runzhong Zhang, Suchen Wang, Yueqi Duan, Yansong Tang, Yue Zhang, Yap-Peng Tan

Affiliations Nanyang Technological University / Tsinghua University / Beijing Jiaotong University

Categories Method / Action Segmentation / Weakly-supervised segmentation with HOI, Application / Human-Object Interaction / Leveraging HOI for video-level prior, Task / Video Understanding / Temporal-spatial interaction modeling

License CC BY 4.0

Abstract Overview

This paper introduces AdaAct, an HOI-aware adaptive network for weakly supervised action segmentation under transcript supervision. The method addresses confusion between visually similar actions (e.g., pouring juice vs. pouring coffee) by exploiting temporally global but spatially local human-object interaction (HOI) cues as video-level prior knowledge. The pipeline includes a video HOI encoder that extracts, selects, and integrates representative interactions via a ViT-based network, and a two-branch HyperNetwork that combines HOI-dependent and HOI-independent knowledge to generate parameters for an adaptive temporal encoder (GRU + linear layer). The approach is evaluated on the Breakfast and 50Salads benchmarks for both action segmentation and action alignment tasks.

Novelty

The paper's main novelty is using video-level HOI information to dynamically adapt the temporal encoder's parameters at test time for weakly supervised action segmentation. It introduces a specific architecture for this: a three-step video HOI encoder (extracting, selecting, integrating via ViT) combined with a two-branch HyperNetwork that separately models HOI-dependent and HOI-independent knowledge, merged through element-wise multiplication and late fusion.
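
To make the parameter-generation idea concrete, here is a minimal NumPy sketch of a two-branch generator merged by element-wise multiplication. It is an illustration under stated assumptions, not the paper's implementation: the function names, the dimensions, the single linear projection used for the HOI-dependent branch, and the choice to generate only a linear classifier's weights are all hypothetical. The GRU, the multi-head mechanism, and late fusion are not modeled.

```python
import numpy as np

def generate_linear_weights(hoi_embedding, proj, shared_weights):
    """Hypothetical two-branch parameter generator.

    hoi_embedding:  (d_hoi,) video-level HOI vector (assumed given).
    proj:           (d_hoi, n_classes * d_in) maps the HOI vector to a
                    modulation tensor (stand-in for the HOI-dependent branch).
    shared_weights: (n_classes, d_in) weights shared across videos
                    (stand-in for the HOI-independent branch).
    """
    n_classes, d_in = shared_weights.shape
    modulation = (hoi_embedding @ proj).reshape(n_classes, d_in)
    # Merge the two branches by element-wise multiplication.
    return shared_weights * modulation

def classify_frames(frame_features, weights):
    """Apply the generated linear layer to every frame and pick a class."""
    logits = frame_features @ weights.T  # (n_frames, n_classes)
    return logits.argmax(axis=1)

rng = np.random.default_rng(0)
d_hoi, d_in, n_classes, n_frames = 8, 16, 5, 20
proj = rng.normal(size=(d_hoi, n_classes * d_in))
shared = rng.normal(size=(n_classes, d_in))
frames = rng.normal(size=(n_frames, d_in))

# Two videos with different HOI embeddings get different layer weights.
weights_a = generate_linear_weights(rng.normal(size=d_hoi), proj, shared)
weights_b = generate_linear_weights(rng.normal(size=d_hoi), proj, shared)
pred_a = classify_frames(frames, weights_a)
```

The point of the sketch is only that the same frame features can be scored by different weights depending on the video-level HOI vector, which is how a "pouring" frame could be pushed toward juice or coffee.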

Results

On action segmentation, AdaAct reports 51.2 MoF on Breakfast (exceeding the next best by 1.4%) and 55.6 MoF on 50Salads (exceeding the next best by 0.9%). On action alignment, it achieves 64.4 MoF on Breakfast and 69.8 MoF on 50Salads, outperforming the compared approaches. Ablation studies show that HOI-dependent knowledge provides the largest single gain (+3.7% MoF), with HOI-independent knowledge (+0.9%) and the multi-head mechanism (+1.8%) contributing additional improvements.
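
For readers unfamiliar with the metric, MoF (mean over frames) is conventionally the percentage of frames whose predicted label matches the ground truth. The sketch below computes it for a single sequence; the function name and the toy labels are illustrative, and the exact evaluation protocol used in the paper (for example, how frames are pooled across videos) is not stated in this report.

```python
def mean_over_frames(predicted, ground_truth):
    """Frame-wise accuracy in percent for one labelled sequence."""
    if len(predicted) != len(ground_truth):
        raise ValueError("sequences must have the same number of frames")
    if not predicted:
        raise ValueError("sequences must not be empty")
    correct = sum(p == g for p, g in zip(predicted, ground_truth))
    return 100.0 * correct / len(predicted)

# Toy example: one of four frames confuses two pouring actions.
predicted = ["pour_juice", "pour_juice", "pour_juice", "stir"]
ground_truth = ["pour_juice", "pour_juice", "pour_coffee", "stir"]
mof = mean_over_frames(predicted, ground_truth)  # 75.0
```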

Key Points

AdaAct uses representative human-object interactions selected across the full video (via a video-NMS algorithm and ViT-based integrator) as prior knowledge to disambiguate visually similar actions such as different pouring actions.
The model adapts its temporal encoder parameters per video through a two-branch HyperNetwork that fuses HOI-dependent cues with HOI-independent transferable knowledge via element-wise multiplication.
Experiments on Breakfast and 50Salads show consistent improvements over prior methods on both action segmentation and alignment tasks, with especially large per-activity gains on activities containing ambiguous actions (e.g., +13.7% MoF on coffee-making, +15.9% on pancake-making).
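
This report does not describe how the video-NMS algorithm works. As a rough intuition for "selecting representative interactions across the full video", the sketch below shows generic greedy non-maximum suppression over detection features: candidates are visited in descending score order and dropped when too similar to one already kept. The cosine-similarity criterion, the threshold, the `max_keep` limit, and all names are assumptions made for illustration and may differ from the paper's procedure.

```python
import numpy as np

def greedy_select(features, scores, sim_threshold=0.8, max_keep=4):
    """Generic greedy NMS over feature vectors (not the paper's algorithm).

    features: (n, d) array of non-zero detection feature vectors.
    scores:   (n,) array of detection confidences.
    Returns the indices of kept detections, highest score first.
    """
    unit = features / np.linalg.norm(features, axis=1, keepdims=True)
    kept = []
    for idx in np.argsort(-scores):
        if len(kept) >= max_keep:
            break
        # Keep a candidate only if it is dissimilar to everything kept so far.
        if all(unit[idx] @ unit[k] < sim_threshold for k in kept):
            kept.append(int(idx))
    return kept

# Four hypothetical HOI detections forming two near-duplicate pairs.
features = np.array([[1.0, 0.0], [0.99, 0.1], [0.0, 1.0], [0.1, 0.99]])
scores = np.array([0.9, 0.8, 0.7, 0.95])
selected = greedy_select(features, scores)  # [3, 0]
```

In this toy case the highest-scoring detection of each near-duplicate pair survives, leaving one representative per distinct interaction.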

References

arXiv: https://arxiv.org/abs/2604.26227v1
Fugu-MT: https://fugumt.com/fugumt/paper_check/2604.26227v1