Parallel Vertex Diffusion for Unified Visual Grounding
- URL: http://arxiv.org/abs/2303.07216v2
- Date: Tue, 14 Mar 2023 07:48:31 GMT
- Title: Parallel Vertex Diffusion for Unified Visual Grounding
- Authors: Zesen Cheng, Kehan Li, Peng Jin, Xiangyang Ji, Li Yuan, Chang Liu, Jie Chen
- Abstract summary: Unified visual grounding pursues a simple and generic technical route to leverage multi-task data with less task-specific design.
Most advanced methods typically present boxes and masks as a sequence to model referring detection and segmentation.
- Score: 38.94276071029081
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Unified visual grounding pursues a simple and generic technical route to
leverage multi-task data with less task-specific design. The most advanced
methods typically present boxes and masks as vertex sequences to model
referring detection and segmentation as an autoregressive sequential vertex
generation paradigm. However, generating high-dimensional vertex sequences
sequentially is error-prone because the upstream of the sequence remains static
and cannot be refined based on downstream vertex information, even if there is
a significant location gap. In addition, with a limited number of vertices,
objects with complex contours are fit poorly, which restricts the performance upper bound.
To deal with this dilemma, we propose a parallel vertex generation paradigm for
superior high-dimension scalability with a diffusion model by simply modifying
the noise dimension. An intuitive instantiation of our paradigm is Parallel
Vertex Diffusion (PVD), which directly sets vertex coordinates as the generation
target and uses a diffusion model for training and inference. We claim that it has two
flaws: (1) unnormalized coordinates cause high variance in the loss value; (2)
the original training objective of PVD considers only point consistency and
ignores geometry consistency. To address the first flaw, the Center Anchor Mechanism
(CAM) is designed to convert coordinates into normalized offset values, which
stabilizes the training loss value. For the second flaw, the Angle Summation Loss
(ASL) is designed to constrain the geometric difference between predicted and
ground-truth vertices for geometry-level consistency. Empirical results show that our
PVD achieves state-of-the-art results in both referring detection and segmentation, and
our paradigm is more scalable and efficient than sequential vertex generation
with high-dimensional data.
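The abstract names CAM and ASL but gives no formulas, so the following is only a rough sketch under stated assumptions, not the authors' implementation. It assumes CAM can be read as "subtract a center anchor and divide by a scale", and that ASL compares, at a set of probe points, the sum of signed angles subtended by consecutive polygon edges (about 2π inside a simple polygon, about 0 outside). All function names, the choice of center and scale, and the L1 comparison are hypothetical.

```python
import math

def to_center_offsets(vertices, center, scale):
    """Assumed CAM-style encoding: offsets from a center anchor, divided by a scale."""
    cx, cy = center
    return [((x - cx) / scale, (y - cy) / scale) for x, y in vertices]

def from_center_offsets(offsets, center, scale):
    """Inverse of to_center_offsets: recover absolute coordinates."""
    cx, cy = center
    return [(cx + dx * scale, cy + dy * scale) for dx, dy in offsets]

def angle_summation(point, vertices):
    """Sum of signed angles subtended at `point` by consecutive polygon edges."""
    px, py = point
    total = 0.0
    n = len(vertices)
    for i in range(n):
        ax, ay = vertices[i][0] - px, vertices[i][1] - py
        bx, by = vertices[(i + 1) % n][0] - px, vertices[(i + 1) % n][1] - py
        # atan2(cross, dot) gives the signed angle from vector a to vector b.
        total += math.atan2(ax * by - ay * bx, ax * bx + ay * by)
    return total

def angle_summation_l1(pred, target, probe_points):
    """Assumed ASL-style penalty: mean absolute gap between angle sums."""
    gaps = [abs(angle_summation(p, pred) - angle_summation(p, target))
            for p in probe_points]
    return sum(gaps) / len(gaps)

square = [(0, 0), (4, 0), (4, 4), (0, 4)]          # counter-clockwise
offsets = to_center_offsets(square, center=(2, 2), scale=2)
inside = angle_summation((2, 2), square)            # close to 2 * pi
outside = angle_summation((10, 10), square)         # close to 0
```

In this reading, the offsets stay in a small fixed range regardless of image size, and the angle sum responds to the shape of the whole polygon rather than to individual vertex positions, which is one way to interpret "geometry-level consistency".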
Related papers
- Dynamic Position Transformation and Boundary Refinement Network for Left Atrial Segmentation [17.09918110723713]
Left atrial (LA) segmentation is a crucial technique for irregular heartbeat (i.e., atrial fibrillation) diagnosis.
Most current methods for LA segmentation strictly assume that the input data is acquired using object-oriented center cropping.
We propose a novel Dynamic Position transformation and Boundary refinement Network (DPBNet) to tackle these issues.
(2024-07-07T22:09:35Z)
- Visual Prompt Tuning in Null Space for Continual Learning [51.96411454304625]
Existing prompt-tuning methods have demonstrated impressive performances in continual learning (CL).
This paper aims to learn each task by tuning the prompts in the direction orthogonal to the subspace spanned by previous tasks' features.
In practice, an effective null-space-based approximation solution has been proposed to implement the prompt gradient projection.
(2024-06-09T05:57:40Z)
- Implicit Bias and Fast Convergence Rates for Self-attention [30.08303212679308]
Self-attention, the core mechanism of transformers, distinguishes them from traditional neural networks and drives their outstanding performance.
We investigate the implicit bias of gradient descent (GD) in training a self-attention layer with a fixed linear decoder in binary classification.
We provide the first finite-time convergence rate for $W_t$ to $W_{mm}$, along with the rate of sparsification in the attention map.
(2024-02-08T15:15:09Z)
- Enhanced Laser-Scan Matching with Online Error Estimation for Highway and Tunnel Driving [0.0]
Lidar data can be used to generate point clouds for navigation of autonomous vehicles or mobile robotics platforms.
We propose the Iterative Closest Ellipsoidal Transform (ICET), a scan matching algorithm which provides two novel improvements.
(2022-07-29T13:42:32Z)
- Learned Vertex Descent: A New Direction for 3D Human Model Fitting [64.04726230507258]
We propose a novel optimization-based paradigm for 3D human model fitting on images and scans.
Our approach captures the underlying body of clothed people with very different body shapes, achieving a significant improvement over the state of the art.
LVD is also applicable to 3D model fitting of humans and hands, for which we show a significant improvement to the SOTA with a much simpler and faster method.
(2022-05-12T17:55:51Z)
- E2EC: An End-to-End Contour-based Method for High-Quality High-Speed Instance Segmentation [4.74225248496056]
We introduce a novel contour-based method, named E2EC, for high-quality instance segmentation.
E2EC is efficient for use in real-time applications, with an inference speed of 36 fps for 512×512 images on an NVIDIA A6000 GPU.
(2022-03-08T13:36:23Z)
- Homography Decomposition Networks for Planar Object Tracking [11.558401177707312]
Planar object tracking plays an important role in AI applications, such as robotics, visual servoing, and visual SLAM.
We propose a novel Homography Decomposition Networks (HDN) approach that drastically reduces and stabilizes the condition number by decomposing the homography transformation into two groups.
(2021-12-15T06:13:32Z)
- Self-supervised Geometric Perception [96.89966337518854]
Self-supervised geometric perception is a framework to learn a feature descriptor for correspondence matching without any ground-truth geometric model labels.
We show that SGP achieves state-of-the-art performance that is on-par or superior to the supervised oracles trained using ground-truth labels.
(2021-03-04T15:34:43Z)
- Cogradient Descent for Bilinear Optimization [124.45816011848096]
We introduce a Cogradient Descent algorithm (CoGD) to address the bilinear problem.
We solve one variable by considering its coupling relationship with the other, leading to a synchronous gradient descent.
Our algorithm is applied to solve problems with one variable under the sparsity constraint.
(2020-06-16T13:41:54Z)
- Dense Non-Rigid Structure from Motion: A Manifold Viewpoint [162.88686222340962]
The Non-Rigid Structure-from-Motion (NRSfM) problem aims to recover the 3D geometry of a deforming object from its 2D feature correspondences across multiple frames.
We show that our approach significantly improves accuracy, scalability, and robustness against noise.
(2020-06-15T09:15:54Z)