CrossGLG: LLM Guides One-shot Skeleton-based 3D Action Recognition in a Cross-level Manner
- URL: http://arxiv.org/abs/2403.10082v1
- Date: Fri, 15 Mar 2024 07:51:35 GMT
- Title: CrossGLG: LLM Guides One-shot Skeleton-based 3D Action Recognition in a Cross-level Manner
- Authors: Tingbing Yan, Wenzheng Zeng, Yang Xiao, Xingyu Tong, Bo Tan, Zhiwen Fang, Zhiguo Cao, Joey Tianyi Zhou
- Abstract summary: We propose to leverage text description generated from large language models to guide feature learning.
We first utilize the global text description to guide the skeleton encoder to focus on informative joints.
We build non-local interaction between local text and joint features, to form the final global representation.
- Score: 41.001366870464636
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Most existing one-shot skeleton-based action recognition focuses on raw low-level information (e.g., joint locations), and may suffer from local information loss and low generalization ability. To alleviate these issues, we propose to leverage text descriptions generated by large language models (LLMs), which contain high-level human knowledge, to guide feature learning in a global-local-global way. Specifically, during training, we design two prompts to obtain global and local text descriptions of each action from an LLM. We first utilize the global text description to guide the skeleton encoder to focus on informative joints (i.e., global-to-local). Then we build non-local interaction between local text and joint features to form the final global representation (i.e., local-to-global). To mitigate the asymmetry between the training and inference phases, we further design a dual-branch architecture that allows the model to perform novel-class inference without any text input, which also makes the additional inference cost negligible compared with the base skeleton encoder. Extensive experiments on three different benchmarks show that CrossGLG consistently outperforms existing SOTA methods by large margins, while its inference cost (model size) is only 2.8% of that of the previous SOTA. CrossGLG can also serve as a plug-and-play module that substantially enhances the performance of different SOTA skeleton encoders at negligible inference cost. The source code will be released soon.
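The global-local-global flow described in the abstract can be illustrated with a minimal NumPy sketch: a global text embedding reweights per-joint features (global-to-local), and local text embeddings then attend over the joints to pool a single global representation (local-to-global). The function names, dimensions, and the simple dot-product attention below are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def global_to_local(joint_feats, global_text, tau=1.0):
    """Score each joint by similarity to the global text description
    and reweight the joint features accordingly (global-to-local)."""
    scores = joint_feats @ global_text / tau        # (J,) per-joint relevance
    attn = softmax(scores)                          # joint importance weights
    return joint_feats * attn[:, None], attn        # (J, D), (J,)

def local_to_global(joint_feats, local_texts):
    """Non-local interaction: each local text description attends over
    all joints; attended features are pooled into one global vector."""
    attn = softmax(local_texts @ joint_feats.T)     # (T, J) text-to-joint attention
    attended = attn @ joint_feats                   # (T, D)
    return attended.mean(axis=0)                    # (D,) global representation

# Toy shapes: 25 joints, 64-dim features, 5 local descriptions.
rng = np.random.default_rng(0)
J, D, T = 25, 64, 5
joints = rng.standard_normal((J, D))
g_text = rng.standard_normal(D)
l_texts = rng.standard_normal((T, D))

weighted, attn = global_to_local(joints, g_text)
rep = local_to_global(weighted, l_texts)
```

In this sketch the attention weights sum to one over the joints, so the global text acts as a soft joint-selection mask before the local texts aggregate the result.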
Related papers
- Part-aware Unified Representation of Language and Skeleton for Zero-shot Action Recognition [57.97930719585095]
We introduce Part-aware Unified Representation between Language and Skeleton (PURLS) to explore visual-semantic alignment at both local and global scales.
Our approach is evaluated on various skeleton/language backbones and three large-scale datasets.
The results showcase the universality and superior performance of PURLS, surpassing prior skeleton-based solutions and standard baselines from other domains.
arXiv Detail & Related papers (2024-06-19T08:22:32Z) - Text-Video Retrieval with Global-Local Semantic Consistent Learning [122.15339128463715]
We propose a simple yet effective method, Global-Local Semantic Consistent Learning (GLSCL).
GLSCL capitalizes on latent shared semantics across modalities for text-video retrieval.
Our method achieves comparable performance with SOTA as well as being nearly 220 times faster in terms of computational cost.
arXiv Detail & Related papers (2024-05-21T11:59:36Z) - AGL-NET: Aerial-Ground Cross-Modal Global Localization with Varying Scales [45.315661330785275]
We present AGL-NET, a novel learning-based method for global localization using LiDAR point clouds and satellite maps.
We tackle two critical challenges: bridging the representation gap between image and point modalities for robust feature matching, and handling inherent scale discrepancies between global and local views.
arXiv Detail & Related papers (2024-04-04T04:12:30Z) - Fusion of Global and Local Knowledge for Personalized Federated Learning [75.20751492913892]
In this paper, we explore personalized models with low-rank and sparse decomposition.
We propose a two-stage algorithm named Federated learning with mixed Sparse and low-Rank representation (FedSLR).
Under proper assumptions, we show that the GKR trained by FedSLR can at least sub-linearly converge to a stationary point of the regularized problem.
arXiv Detail & Related papers (2023-02-21T23:09:45Z) - RING++: Roto-translation Invariant Gram for Global Localization on a Sparse Scan Map [20.276334172402763]
We propose RING++, which has a roto-translation invariant representation for place recognition and global convergence for both rotation and translation estimation.
With the theoretical guarantee, RING++ is able to address the large viewpoint difference using a lightweight map with sparse scans.
This is the first learning-free framework to address all subtasks of global localization in the sparse scan map.
arXiv Detail & Related papers (2022-10-12T07:49:24Z) - 3DGTN: 3D Dual-Attention GLocal Transformer Network for Point Cloud Classification and Segmentation [21.054928631088575]
This paper presents a novel point cloud representation learning network, called 3D Dual Self-attention Global Local (GLocal) Transformer Network (3DGTN).
The proposed framework is evaluated on both classification and segmentation datasets.
arXiv Detail & Related papers (2022-09-21T14:34:21Z) - Coalescing Global and Local Information for Procedural Text Understanding [70.10291759879887]
A complete procedural understanding solution should combine three core aspects: local and global views of the inputs, and global view of outputs.
In this paper, we propose Coalescing Global and Local Information (CGLI), a new model that builds entity and time representations.
Experiments on a popular procedural text understanding dataset show that our model achieves state-of-the-art results.
arXiv Detail & Related papers (2022-08-26T19:16:32Z) - Attention guided global enhancement and local refinement network for semantic segmentation [5.881350024099048]
A lightweight semantic segmentation network is developed using the encoder-decoder architecture.
A Global Enhancement Method is proposed to aggregate global information from high-level feature maps.
A Local Refinement Module is developed by utilizing the decoder features as the semantic guidance.
The two methods are integrated into a Context Fusion Block, and based on that, a novel Attention guided Global enhancement and Local refinement Network (AGLN) is elaborately designed.
arXiv Detail & Related papers (2022-04-09T02:32:24Z) - A Variational Hierarchical Model for Neural Cross-Lingual Summarization [85.44969140204026]
Cross-lingual summarization (CLS) aims to convert a document in one language into a summary in another language.
Existing studies on CLS mainly focus on utilizing pipeline methods or jointly training an end-to-end model.
We propose a hierarchical model for the CLS task, based on the conditional variational auto-encoder.
arXiv Detail & Related papers (2022-03-08T02:46:11Z) - PGL: Prior-Guided Local Self-supervised Learning for 3D Medical Image Segmentation [87.50205728818601]
We propose a PriorGuided Local (PGL) self-supervised model that learns the region-wise local consistency in the latent feature space.
Our PGL model learns the distinctive representations of local regions, and hence is able to retain structural information.
arXiv Detail & Related papers (2020-11-25T11:03:11Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.