Modeling Cross-view Interaction Consistency for Paired Egocentric
Interaction Recognition
- URL: http://arxiv.org/abs/2003.10663v1
- Date: Tue, 24 Mar 2020 05:05:34 GMT
- Title: Modeling Cross-view Interaction Consistency for Paired Egocentric
Interaction Recognition
- Authors: Zhongguo Li, Fan Lyu, Wei Feng, Song Wang
- Abstract summary: Paired egocentric interaction recognition (PEIR) is the task of collaboratively recognizing the interactions between two persons from the videos in their corresponding views.
We propose to build the relevance between the two views using bilinear pooling, which captures the consistency of the two views at the feature level.
Experimental results on the PEV dataset show the superiority of the proposed method on the PEIR task.
- Score: 16.094976277810556
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: With the development of Augmented Reality (AR), egocentric action
recognition (EAR) plays an important role in accurately understanding the
demands of the user. However, EAR is designed to recognize human-machine
interaction from a single egocentric view, and thus has difficulty capturing
interactions between two face-to-face AR users. Paired egocentric interaction
recognition (PEIR) is the task of collaboratively recognizing the interactions
between two persons from the videos in their corresponding views.
Unfortunately, existing PEIR methods directly use a linear decision function
to fuse the features extracted from the two corresponding egocentric videos,
which ignores the consistency of the interaction in the paired egocentric
videos: because the interactions in the paired videos are consistent, the
features extracted from them are correlated with each other. Based on this
observation, we propose to build the relevance between the two views using
bilinear pooling, which captures the consistency of the two views at the
feature level. Specifically, each neuron in the feature maps from one view is
connected to the neurons from the other view, which guarantees compact
consistency between the two views. All possible neuron pairs are then used for
PEIR to exploit the consistent information they carry. For efficiency, we use
compact bilinear pooling with Count Sketch to avoid directly computing the
outer product in bilinear pooling. Experimental results on the PEV dataset
show the superiority of the proposed method on the PEIR task.
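The fusion step described in the abstract, approximating the bilinear (outer-product) interaction between the two views' features with Count Sketch, follows the general compact bilinear pooling (TensorSketch) construction. The NumPy sketch below is a minimal illustration of that construction, not the authors' implementation; the feature dimensions, variable names, and random projections are illustrative assumptions.

```python
import numpy as np

def count_sketch(x, h, s, d):
    """Count Sketch of x: sketch[j] = sum over i with h[i] == j of s[i] * x[i]."""
    sketch = np.zeros(d)
    np.add.at(sketch, h, s * x)   # scatter-add with random sign flips
    return sketch

def compact_bilinear_pooling(x, y, d=1024, seed=0):
    """Approximate the flattened outer product of x and y in d dimensions:
    sketch both vectors independently, then convolve the sketches via FFT
    instead of forming the len(x) * len(y) outer product explicitly."""
    rng = np.random.default_rng(seed)
    h1 = rng.integers(0, d, size=x.shape[0])        # random hash for view 1
    s1 = rng.choice([-1.0, 1.0], size=x.shape[0])   # random signs for view 1
    h2 = rng.integers(0, d, size=y.shape[0])        # random hash for view 2
    s2 = rng.choice([-1.0, 1.0], size=y.shape[0])   # random signs for view 2
    fx = np.fft.fft(count_sketch(x, h1, s1, d))
    fy = np.fft.fft(count_sketch(y, h2, s2, d))
    # Element-wise product in the frequency domain = circular convolution of
    # the two sketches, which equals the Count Sketch of the outer product.
    return np.real(np.fft.ifft(fx * fy))

# Hypothetical usage: fuse frame-level CNN features from the two egocentric views.
feat_view_a = np.random.randn(2048)   # feature from user A's camera (assumed shape)
feat_view_b = np.random.randn(2048)   # feature from user B's camera (assumed shape)
fused = compact_bilinear_pooling(feat_view_a, feat_view_b, d=1024)
print(fused.shape)   # (1024,) -- fed to an interaction classifier
```

The convolution theorem lets the sketch of the outer product be computed from the element-wise product of the two sketches' FFTs, so the full outer product is never materialized; in practice the hash and sign functions are drawn once and fixed across all samples.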
Related papers
- ANNEXE: Unified Analyzing, Answering, and Pixel Grounding for Egocentric Interaction [16.338872733140832]
This paper presents a novel task named Egocentric Interaction Reasoning and pixel Grounding (Ego-IRG)
Taking an egocentric image with the query as input, Ego-IRG is the first task that aims to resolve the interactions through three crucial steps: analyzing, answering, and pixel grounding.
The Ego-IRGBench dataset includes over 20k egocentric images with 1.6 million queries and corresponding multimodal responses about interactions.
arXiv Detail & Related papers (2025-04-02T08:24:35Z)
- Disentangled Interaction Representation for One-Stage Human-Object Interaction Detection [70.96299509159981]
Human-Object Interaction (HOI) detection is a core task for human-centric image understanding.
Recent one-stage methods adopt a transformer decoder to collect image-wide cues that are useful for interaction prediction.
Traditional two-stage methods benefit significantly from their ability to compose interaction features in a disentangled and explainable manner.
arXiv Detail & Related papers (2023-12-04T08:02:59Z)
- Human-to-Human Interaction Detection [3.00604614803979]
We introduce a new task named human-to-human interaction detection (HID)
HID is devoted to detecting subjects, recognizing person-wise actions, and grouping people according to their interactive relations, all in one model.
First, based on the popular AVA dataset created for action detection, we establish a new HID benchmark, termed AVA-Interaction (AVA-I)
arXiv Detail & Related papers (2023-07-02T03:24:58Z)
- Learning Fine-grained View-Invariant Representations from Unpaired Ego-Exo Videos via Temporal Alignment [71.16699226211504]
We propose to learn fine-grained action features that are invariant to the viewpoints by aligning egocentric and exocentric videos in time.
To this end, we propose AE2, a self-supervised embedding approach with two key designs.
For evaluation, we establish a benchmark for fine-grained video understanding in the ego-exo context.
arXiv Detail & Related papers (2023-06-08T19:54:08Z)
- Joint Engagement Classification using Video Augmentation Techniques for Multi-person Human-robot Interaction [22.73774398716566]
We present a novel framework for identifying a parent-child dyad's joint engagement.
Using a dataset of parent-child dyads reading storybooks together with a social robot at home, we first train RGB frame- and skeleton-based joint engagement recognition models.
Second, we demonstrate experimental results on the use of trained models in the robot-parent-child interaction context.
arXiv Detail & Related papers (2022-12-28T23:52:55Z)
- A Hierarchical Interactive Network for Joint Span-based Aspect-Sentiment Analysis [34.1489054082536]
We propose a hierarchical interactive network (HI-ASA) to model two-way interactions between two tasks appropriately.
We use cross-stitch mechanism to combine the different task-specific features selectively as the input to ensure proper two-way interactions.
Experiments on three real-world datasets demonstrate HI-ASA's superiority over baselines.
arXiv Detail & Related papers (2022-08-24T03:03:49Z)
- A Co-Interactive Transformer for Joint Slot Filling and Intent Detection [61.109486326954205]
Intent detection and slot filling are two main tasks for building a spoken language understanding (SLU) system.
Previous studies either model the two tasks separately or only consider the single information flow from intent to slot.
We propose a Co-Interactive Transformer to consider the cross-impact between the two tasks simultaneously.
arXiv Detail & Related papers (2020-10-08T10:16:52Z)
- DRG: Dual Relation Graph for Human-Object Interaction Detection [65.50707710054141]
We tackle the challenging problem of human-object interaction (HOI) detection.
Existing methods either recognize the interaction of each human-object pair in isolation or perform joint inference based on complex appearance-based features.
In this paper, we leverage an abstract spatial-semantic representation to describe each human-object pair and aggregate the contextual information of the scene via a dual relation graph.
arXiv Detail & Related papers (2020-08-26T17:59:40Z)
- Learning Human-Object Interaction Detection using Interaction Points [140.0200950601552]
We propose a novel fully-convolutional approach that directly detects the interactions between human-object pairs.
Our network predicts interaction points, which directly localize and classify the interaction.
Experiments are performed on two popular benchmarks: V-COCO and HICO-DET.
arXiv Detail & Related papers (2020-03-31T08:42:06Z)
- Cascaded Human-Object Interaction Recognition [175.60439054047043]
We introduce a cascade architecture for a multi-stage, coarse-to-fine HOI understanding.
At each stage, an instance localization network progressively refines HOI proposals and feeds them into an interaction recognition network.
With our carefully-designed human-centric relation features, these two modules work collaboratively towards effective interaction understanding.
arXiv Detail & Related papers (2020-03-09T17:05:04Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.