Improving ProtoNet for Few-Shot Video Object Recognition: Winner of
ORBIT Challenge 2022
- URL: http://arxiv.org/abs/2210.00174v1
- Date: Sat, 1 Oct 2022 03:03:20 GMT
- Title: Improving ProtoNet for Few-Shot Video Object Recognition: Winner of
ORBIT Challenge 2022
- Authors: Li Gu, Zhixiang Chi, Huan Liu, Yuanhao Yu, Yang Wang
- Abstract summary: We present the winning solution for ORBIT Few-Shot Video Object Recognition Challenge 2022.
Built upon the ProtoNet baseline, the performance of our method is improved with three effective techniques.
- Score: 28.27029433676475
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In this work, we present the winning solution for ORBIT Few-Shot Video Object
Recognition Challenge 2022. Built upon the ProtoNet baseline, the performance
of our method is improved with three effective techniques. These techniques
include the embedding adaptation, the uniform video clip sampler and the
invalid frame detection. In addition, we re-factor and re-implement the
official codebase to encourage modularity, compatibility and improved
performance. Our implementation accelerates the data loading in both training
and testing.
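The ProtoNet baseline that the solution builds on classifies each query by its distance to class prototypes (the mean of the support embeddings per class), and a uniform clip sampler draws evenly spaced clips across the whole video. A minimal NumPy sketch of these two pieces, with function names and details that are illustrative assumptions rather than the authors' implementation:

```python
import numpy as np

def uniform_clip_starts(num_frames, clip_length, num_clips):
    """Evenly spaced clip start indices spanning the whole video."""
    last_start = max(num_frames - clip_length, 0)
    return np.linspace(0, last_start, num_clips).astype(int)

def prototypes(support_embeddings, support_labels, num_classes):
    """ProtoNet class prototype = mean support embedding per class."""
    return np.stack([
        support_embeddings[support_labels == c].mean(axis=0)
        for c in range(num_classes)
    ])

def classify(query_embeddings, protos):
    """Assign each query to the nearest prototype (squared Euclidean)."""
    dists = ((query_embeddings[:, None, :] - protos[None, :, :]) ** 2).sum(-1)
    return dists.argmin(axis=1)
```

Uniform sampling covers the full temporal extent instead of clustering clips at the start of the video, which is one reason it helps on long, variable-length ORBIT videos.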
Related papers
- EditScore: Unlocking Online RL for Image Editing via High-Fidelity Reward Modeling [71.8265422228785]
Reinforcement Learning (RL) offers a promising solution, but its adoption in image editing has been hindered by the lack of a high-fidelity, efficient reward signal. We present a comprehensive methodology to overcome this barrier, centered on the development of a state-of-the-art, specialized reward model.
arXiv Detail & Related papers (2025-09-28T14:28:24Z)
- InstructVEdit: A Holistic Approach for Instructional Video Editing [28.13673601495108]
InstructVEdit is a full-cycle instructional video editing approach that establishes a reliable dataset curation workflow.
It incorporates two model architectural improvements to enhance edit quality while preserving temporal consistency.
It also proposes an iterative refinement strategy leveraging real-world data to enhance generalization and minimize train-test discrepancies.
arXiv Detail & Related papers (2025-03-22T04:12:20Z)
- Zero-Shot Video Editing through Adaptive Sliding Score Distillation [51.57440923362033]
This study proposes a novel paradigm of video-based score distillation, facilitating direct manipulation of original video content.
We propose an Adaptive Sliding Score Distillation strategy, which incorporates both global and local video guidance to reduce the impact of editing errors.
arXiv Detail & Related papers (2024-06-07T12:33:59Z)
- 3rd Place Solution for MOSE Track in CVPR 2024 PVUW workshop: Complex Video Object Segmentation [63.199793919573295]
Video Object Segmentation (VOS) is a vital task in computer vision, focusing on distinguishing foreground objects from the background across video frames.
Our work draws inspiration from the Cutie model, and we investigate the effects of object memory, the total number of memory frames, and input resolution on segmentation performance.
arXiv Detail & Related papers (2024-06-06T00:56:25Z)
- InstructVideo: Instructing Video Diffusion Models with Human Feedback [65.9590462317474]
We propose InstructVideo to instruct text-to-video diffusion models with human feedback by reward fine-tuning.
InstructVideo has two key ingredients: 1) To ameliorate the cost of reward fine-tuning induced by generating through the full DDIM sampling chain, we recast reward fine-tuning as editing.
arXiv Detail & Related papers (2023-12-19T17:55:16Z)
- Boost Video Frame Interpolation via Motion Adaptation [73.42573856943923]
Video frame interpolation (VFI) is a challenging task that aims to generate intermediate frames between two consecutive frames in a video.
Existing learning-based VFI methods have achieved great success, but they still suffer from limited generalization ability.
We propose a novel optimization-based VFI method that can adapt to unseen motions at test time.
arXiv Detail & Related papers (2023-06-24T10:44:02Z)
- Self-Distilled Masked Auto-Encoders are Efficient Video Anomaly Detectors [117.61449210940955]
We propose an efficient abnormal event detection model based on a lightweight masked auto-encoder (AE) applied at the video frame level.
We introduce an approach to weight tokens based on motion gradients, thus shifting the focus from the static background scene to the foreground objects.
We generate synthetic abnormal events to augment the training videos, and task the masked AE model to jointly reconstruct the original frames.
arXiv Detail & Related papers (2023-06-21T06:18:05Z)
- A Dual-level Detection Method for Video Copy Detection [13.517933749704866]
Meta AI held the Video Similarity Challenge at CVPR 2023 to push the technology forward.
We propose a dual-level detection method with Video Editing Detection (VED) and Frame Scenes Detection (FSD) to tackle the core challenges on Video Copy Detection.
arXiv Detail & Related papers (2023-05-21T06:19:08Z)
- 3rd Place Solution to Meta AI Video Similarity Challenge [1.1470070927586016]
This paper presents our 3rd place solution in the Meta AI Video Similarity Challenge (VSC2022).
Our approach builds upon existing image copy detection techniques and incorporates several strategies to exploit the properties of video data.
arXiv Detail & Related papers (2023-04-24T10:00:09Z)
- DFA: Dynamic Feature Aggregation for Efficient Video Object Detection [15.897168900583774]
We propose a vanilla dynamic aggregation module that adaptively selects the frames for feature enhancement.
We extend the vanilla dynamic aggregation module to a more effective and reconfigurable deformable version.
On the ImageNet VID benchmark, integrated with our proposed methods, FGFA and SELSA can improve the inference speed by 31% and 76% respectively.
arXiv Detail & Related papers (2022-10-02T17:54:15Z)
- AR-Net: Adaptive Frame Resolution for Efficient Action Recognition [70.62587948892633]
Action recognition is an open and challenging problem in computer vision.
We propose a novel approach, called AR-Net, that selects on-the-fly the optimal resolution for each frame conditioned on the input for efficient action recognition.
arXiv Detail & Related papers (2020-07-31T01:36:04Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences arising from its use.