Appearance-Preserving 3D Convolution for Video-based Person Re-identification
- URL: http://arxiv.org/abs/2007.08434v2
- Date: Mon, 27 Jul 2020 10:57:04 GMT
- Title: Appearance-Preserving 3D Convolution for Video-based Person Re-identification
- Authors: Xinqian Gu, Hong Chang, Bingpeng Ma, Hongkai Zhang, Xilin Chen
- Abstract summary: We propose Appearance-Preserving 3D Convolution (AP3D), which is composed of two components: an Appearance-Preserving Module (APM) and a 3D convolution kernel.
It is easy to combine AP3D with existing 3D ConvNets by simply replacing the original 3D convolution kernels with AP3Ds.
- Score: 61.677153482995564
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Due to imperfect person detection results and posture changes, temporal
appearance misalignment is unavoidable in video-based person re-identification
(ReID). In this case, 3D convolution may destroy the appearance representation
of person video clips and is therefore harmful to ReID. To address this problem,
we propose Appearance-Preserving 3D Convolution (AP3D), which is composed of two
components: an Appearance-Preserving Module (APM) and a 3D convolution kernel.
With APM aligning adjacent feature maps at the pixel level, the subsequent 3D
convolution can model temporal information while preserving the quality of the
appearance representation. It is easy to combine AP3D with existing 3D ConvNets
by simply replacing the original 3D convolution kernels with AP3Ds. Extensive
experiments demonstrate the effectiveness of AP3D for video-based ReID, and the
results on three widely used datasets surpass the state of the art. Code is
available at: https://github.com/guxinqian/AP3D.
Related papers
- MeshFormer: High-Quality Mesh Generation with 3D-Guided Reconstruction Model [34.245635412589806]
MeshFormer is a sparse-view reconstruction model that explicitly leverages 3D native structure, input guidance, and training supervision.
It can be integrated with 2D diffusion models to enable fast single-image-to-3D and text-to-3D tasks.
arXiv Detail & Related papers (2024-08-19T17:55:17Z)
- DIRECT-3D: Learning Direct Text-to-3D Generation on Massive Noisy 3D Data [50.164670363633704]
We present DIRECT-3D, a diffusion-based 3D generative model for creating high-quality 3D assets from text prompts.
Our model is directly trained on extensive noisy and unaligned 'in-the-wild' 3D assets.
We achieve state-of-the-art performance in both single-class generation and text-to-3D generation.
arXiv Detail & Related papers (2024-06-06T17:58:15Z)
- A Unified Framework for 3D Point Cloud Visual Grounding [60.75319271082741]
This paper takes an initial step toward integrating 3DREC and 3DRES into a unified framework, termed 3DRefTR.
Its key idea is to build upon a mature 3DREC model and leverage ready query embeddings and visual tokens from the 3DREC model to construct a dedicated mask branch.
This elaborate design enables 3DRefTR to achieve both well-performing 3DRES and 3DREC capacities with only a 6% additional latency compared to the original 3DREC model.
arXiv Detail & Related papers (2023-08-23T03:20:31Z)
- DFA3D: 3D Deformable Attention For 2D-to-3D Feature Lifting [28.709044035867596]
We propose a new operator, called 3D DeFormable Attention (DFA3D), for 2D-to-3D feature lifting.
DFA3D transforms multi-view 2D image features into a unified 3D space for 3D object detection.
arXiv Detail & Related papers (2023-07-24T17:49:11Z)
- TR3D: Towards Real-Time Indoor 3D Object Detection [6.215404942415161]
TR3D is a fully-convolutional 3D object detection model trained end-to-end.
To take advantage of both point cloud and RGB inputs, we introduce an early fusion of 2D and 3D features.
Our model with early feature fusion, which we refer to as TR3D+FF, outperforms existing 3D object detection approaches on the SUN RGB-D dataset.
arXiv Detail & Related papers (2023-02-06T15:25:50Z)
- Tracking People with 3D Representations [78.97070307547283]
We present a novel approach for tracking multiple people in video.
Unlike past approaches which employ 2D representations, we employ 3D representations of people, located in three-dimensional space.
We find that 3D representations are more effective than 2D representations for tracking in these settings.
arXiv Detail & Related papers (2021-11-15T16:15:21Z)
- Learnable Sampling 3D Convolution for Video Enhancement and Action Recognition [24.220358793070965]
We introduce a new module, LS3D-Conv, to improve the capability of 3D convolution.
It adds learnable 2D offsets to 3D convolution, sampling locations on the spatial feature maps across frames.
Experiments on video interpolation, video super-resolution, video denoising, and action recognition demonstrate the effectiveness of the approach.
arXiv Detail & Related papers (2020-11-22T09:20:49Z)
- Implicit Functions in Feature Space for 3D Shape Reconstruction and Completion [53.885984328273686]
Implicit Feature Networks (IF-Nets) deliver continuous outputs, can handle multiple topologies, and complete shapes for missing or sparse input data.
IF-Nets clearly outperform prior work in 3D object reconstruction on ShapeNet and obtain significantly more accurate 3D human reconstructions.
arXiv Detail & Related papers (2020-03-03T11:14:29Z)
- DSGN: Deep Stereo Geometry Network for 3D Object Detection [79.16397166985706]
There is a large performance gap between image-based and LiDAR-based 3D object detectors.
Our method, called Deep Stereo Geometry Network (DSGN), significantly reduces this gap.
For the first time, we provide a simple and effective one-stage stereo-based 3D detection pipeline.
arXiv Detail & Related papers (2020-01-10T11:44:37Z)
This list is automatically generated from the titles and abstracts of the papers in this site.