A Low-Cost Real-Time Framework for Industrial Action Recognition Using Foundation Models
- URL: http://arxiv.org/abs/2403.08420v2
- Date: Fri, 29 Aug 2025 08:56:49 GMT
- Title: A Low-Cost Real-Time Framework for Industrial Action Recognition Using Foundation Models
- Authors: Zhicheng Wang, Wensheng Liang, Ruiyan Zhuang, Shuai Li, Jianwei Tan, Xiaoguang Ma
- Abstract summary: Action recognition in industrial environments faces persistent challenges due to high deployment costs, poor cross-scenario generalization, and limited real-time performance. We propose a low-cost real-time framework for industrial action recognition using foundation models, denoted as LRIAR, to enhance recognition accuracy and transferability.
- Score: 8.654703129948901
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Action recognition (AR) in industrial environments -- particularly for identifying actions and operational gestures -- faces persistent challenges due to high deployment costs, poor cross-scenario generalization, and limited real-time performance. To address these issues, we propose a low-cost real-time framework for industrial action recognition using foundation models, denoted as LRIAR, to enhance recognition accuracy and transferability while minimizing human annotation and computational overhead. The proposed framework constructs an automatically labeled dataset by coupling Grounding DINO with the pretrained BLIP-2 image encoder, enabling efficient and scalable action labeling. Leveraging the constructed dataset, we train YOLOv5 for real-time action detection, and a Vision Transformer (ViT) classifier is developed via LoRA-based fine-tuning for action classification. Extensive experiments conducted in real-world industrial settings validate the effectiveness of LRIAR, demonstrating consistent improvements over state-of-the-art methods in recognition accuracy, scenario generalization, and deployment efficiency.
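The abstract mentions LoRA-based fine-tuning of the ViT classifier. As a rough illustration of the general LoRA idea only (not the authors' implementation), the sketch below keeps a weight matrix W frozen and adds a trainable low-rank update (alpha / r) * B @ A. The class name `LoRALinear`, the helper `matmul`, and all parameter names and values are hypothetical, and plain Python lists stand in for tensors.

```python
import random


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


class LoRALinear:
    """Frozen weight W (out x in) plus a low-rank update (alpha / r) * B @ A.

    Illustrative only: names and defaults are assumptions, not taken from the paper.
    """

    def __init__(self, weight, rank, alpha=1.0, seed=0):
        rng = random.Random(seed)
        out_dim, in_dim = len(weight), len(weight[0])
        self.weight = weight              # frozen pretrained weight
        self.scale = alpha / rank         # common LoRA scaling factor
        # A (r x in) starts small and random; B (out x r) starts at zero,
        # so the adapted layer initially matches the frozen layer.
        self.A = [[rng.gauss(0.0, 0.01) for _ in range(in_dim)] for _ in range(rank)]
        self.B = [[0.0] * rank for _ in range(out_dim)]

    def effective_weight(self):
        """Return W + scale * (B @ A)."""
        delta = matmul(self.B, self.A)
        return [[w + self.scale * d for w, d in zip(w_row, d_row)]
                for w_row, d_row in zip(self.weight, delta)]

    def forward(self, x):
        """Apply the adapted layer to a batch x (batch x in); returns batch x out."""
        w_t = [list(col) for col in zip(*self.effective_weight())]  # transpose to in x out
        return matmul(x, w_t)

    def trainable_params(self):
        """Count only the entries of A and B, the parameters LoRA would train."""
        return len(self.A) * len(self.A[0]) + len(self.B) * len(self.B[0])
```

With a 2 x 3 weight and rank 1, only 5 adapter values would be trained instead of the 6 entries of W; the saving grows quickly for the much larger projection matrices in a transformer.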
Related papers
- FASTer: Toward Efficient Autoregressive Vision Language Action Modeling via Neural Action Tokenization [61.10456021136654]
We introduce FASTer, a unified framework for efficient and general robot learning. FASTerVQ encodes action chunks as single-channel images, capturing global-temporal dependencies while maintaining a high compression ratio. FASTerVLA builds on this tokenizer with block-wise autoregressive decoding and a lightweight action expert, achieving both faster inference and higher task performance.
arXiv Detail & Related papers (2025-12-04T16:21:38Z) - PosA-VLA: Enhancing Action Generation via Pose-Conditioned Anchor Attention [92.85371254435074]
PosA-VLA framework anchors visual attention via pose-conditioned supervision, consistently guiding the model's perception toward task-relevant regions. We show that our method executes embodied tasks with precise and time-efficient behavior across diverse robotic manipulation benchmarks.
arXiv Detail & Related papers (2025-12-03T12:14:29Z) - Synthetic Industrial Object Detection: GenAI vs. Feature-Based Methods [5.278929538141005]
We benchmark a range of domain randomization (DR) and domain adaptation (DA) techniques, including feature-based methods, generative AI (GenAI) and classical rendering approaches. Our evaluation focuses on the effectiveness and efficiency of low-level and high-level feature alignment, as well as a controlled diffusion-based DA method guided by prompts generated from real-world contexts. Results show that if render-based data with enough variability is available as seed, simpler feature-based methods, such as brightness-based and perceptual hashing filtering, outperform more complex GenAI-based approaches in both accuracy and resource efficiency.
arXiv Detail & Related papers (2025-11-28T14:51:08Z) - Adaptive Residual-Update Steering for Low-Overhead Hallucination Mitigation in Large Vision Language Models [13.32858759983739]
Large Vision-Language Models (LVLMs) often suffer from object hallucination, generating text inconsistent with visual inputs. Existing inference-time interventions to mitigate this issue present a challenging trade-off. We present Residual-Update Directed DEcoding Regulation (RUDDER), a framework that steers LVLMs towards visually-grounded generation.
arXiv Detail & Related papers (2025-11-13T13:29:38Z) - DRTA: Dynamic Reward Scaling for Reinforcement Learning in Time Series Anomaly Detection [7.185726339205792]
Anomaly detection in time series data is important for applications in finance, healthcare, sensor networks, and industrial monitoring. We propose a reinforcement learning-based framework that integrates dynamic reward shaping, Variational Autoencoder (VAE), and active learning, called DRTA. Our method uses an adaptive reward mechanism that balances exploration and exploitation by dynamically scaling the effect of VAE-based reconstruction error and classification rewards.
arXiv Detail & Related papers (2025-08-25T20:39:49Z) - Evaluating Large Language Models for Real-World Engineering Tasks [75.97299249823972]
This paper introduces a curated database comprising over 100 questions derived from authentic, production-oriented engineering scenarios. Using this dataset, we evaluate four state-of-the-art Large Language Models (LLMs). Our results show that LLMs demonstrate strengths in basic temporal and structural reasoning but struggle significantly with abstract reasoning, formal modeling, and context-sensitive engineering logic.
arXiv Detail & Related papers (2025-05-12T14:05:23Z) - Bounding Box-Guided Diffusion for Synthesizing Industrial Images and Segmentation Map [50.21082069320818]
We propose a novel diffusion-based pipeline for generating high-fidelity industrial datasets with minimal supervision. Our approach conditions the diffusion model on enriched bounding box representations to produce precise segmentation masks. Results demonstrate that diffusion-based synthesis can bridge the gap between artificial and real-world industrial data.
arXiv Detail & Related papers (2025-05-06T15:21:36Z) - More Clear, More Flexible, More Precise: A Comprehensive Oriented Object Detection benchmark for UAV [58.89234732689013]
CODrone is a comprehensive oriented object detection dataset for UAVs that accurately reflects real-world conditions. It also serves as a new benchmark designed to align with downstream task requirements. We conduct a series of experiments based on 22 classical or SOTA methods to rigorously evaluate CODrone.
arXiv Detail & Related papers (2025-04-28T17:56:02Z) - From Large to Super-Tiny: End-to-End Optimization for Cost-Efficient LLMs [23.253571170594455]
Large Language Models (LLMs) have significantly advanced artificial intelligence. This paper introduces a three-stage cost-efficient end-to-end LLM deployment pipeline. It produces super-tiny online models with enhanced performance and reduced costs.
arXiv Detail & Related papers (2025-04-18T05:25:22Z) - Pretrained LLMs as Real-Time Controllers for Robot Operated Serial Production Line [5.647265893402412]
This work investigates the feasibility of using Large Language Models (LLMs), particularly GPT-4, as a straightforward, adaptable solution for controlling manufacturing systems, specifically, mobile robot scheduling.
We introduce an LLM-based control framework to assign mobile robots to different machines in robot assisted serial production lines, evaluating its performance in terms of system throughput.
While it achieves performance that is on par with state-of-the-art methods like Multi-Agent Reinforcement Learning (MARL), it offers a distinct advantage by delivering comparable throughput without the need for extensive retraining.
arXiv Detail & Related papers (2025-03-05T20:43:49Z) - Reward-Guided Speculative Decoding for Efficient LLM Reasoning [80.55186052123196]
We introduce Reward-Guided Speculative Decoding (RSD), a novel framework aimed at improving the efficiency of inference in large language models (LLMs).
RSD incorporates a controlled bias to prioritize high-reward outputs, in contrast to existing speculative decoding methods that enforce strict unbiasedness.
RSD delivers significant efficiency gains against decoding with the target model only, while achieving significantly better accuracy than parallel decoding methods on average.
arXiv Detail & Related papers (2025-01-31T17:19:57Z) - MMAD: The First-Ever Comprehensive Benchmark for Multimodal Large Language Models in Industrial Anomaly Detection [66.05200339481115]
We present MMAD, the first-ever full-spectrum MLLMs benchmark in industrial anomaly detection.
We defined seven key subtasks of MLLMs in industrial inspection and designed a novel pipeline to generate the MMAD dataset.
With MMAD, we have conducted a comprehensive, quantitative evaluation of various state-of-the-art MLLMs.
arXiv Detail & Related papers (2024-10-12T09:16:09Z) - Interpretable Data-driven Anomaly Detection in Industrial Processes with ExIFFI [3.7516053899419104]
Industrial processes aim to streamline operations as much as possible, encompassing the production of the final product.
In light of the emergence of Industry 5.0, a more desirable approach involves providing interpretable outcomes.
This paper presents the first industrial application of ExIFFI, a recently developed approach focused on the production of fast and efficient explanations for the Extended Isolation Forest (EIF) Anomaly detection method.
arXiv Detail & Related papers (2024-05-02T10:23:17Z) - Leveraging Foundation Model Automatic Data Augmentation Strategies and Skeletal Points for Hands Action Recognition in Industrial Assembly Lines [3.0992677770545254]
We developed a strategy for expanding industrial datasets to achieve efficient, high-quality, and large-scale dataset expansion.
We also applied this strategy to video action recognition.
In the "hand movements during wire insertion" scenarios on the actual assembly line, the accuracy of hand action recognition reached 98.8%.
arXiv Detail & Related papers (2024-03-14T02:55:06Z) - Efficiency at Scale: Investigating the Performance of Diminutive Language Models in Clinical Tasks [2.834743715323873]
We present an investigation into the suitability of different PEFT methods to clinical decision-making tasks.
Our analysis shows that the performance of most PEFT approaches varies significantly from one task to another.
The effectiveness of PEFT methods in the clinical domain is evident, particularly for specialised models which can operate on low-cost, in-house computing infrastructure.
arXiv Detail & Related papers (2024-02-16T11:30:11Z) - A Cost-Sensitive Transformer Model for Prognostics Under Highly Imbalanced Industrial Data [1.6492989697868894]
This paper introduces a novel cost-sensitive transformer model developed as part of a systematic workflow.
We observed a substantial enhancement in performance compared to state-of-the-art methods.
Our findings highlight the potential of our method in addressing the unique challenges of failure prediction in industrial settings.
arXiv Detail & Related papers (2024-01-16T15:09:53Z) - An Experimental Design Framework for Label-Efficient Supervised Finetuning of Large Language Models [55.01592097059969]
Supervised finetuning on instruction datasets has played a crucial role in achieving the remarkable zero-shot generalization capabilities.
Active learning is effective in identifying useful subsets of samples to annotate from an unlabeled pool.
We propose using experimental design to circumvent the computational bottlenecks of active learning.
arXiv Detail & Related papers (2024-01-12T16:56:54Z) - Retrieval-based Knowledge Transfer: An Effective Approach for Extreme Large Language Model Compression [64.07696663255155]
Large-scale pre-trained language models (LLMs) have demonstrated exceptional performance in various natural language processing (NLP) tasks.
However, the massive size of these models poses huge challenges for their deployment in real-world applications.
We introduce a novel compression paradigm called Retrieval-based Knowledge Transfer (RetriKT) which effectively transfers the knowledge of LLMs to extremely small-scale models.
arXiv Detail & Related papers (2023-10-24T07:58:20Z) - Unifying Synergies between Self-supervised Learning and Dynamic Computation [53.66628188936682]
We present a novel perspective on the interplay between SSL and DC paradigms.
We show that it is feasible to simultaneously learn a dense and gated sub-network from scratch in an SSL setting.
The co-evolution during pre-training of both dense and gated encoders offers a good accuracy-efficiency trade-off.
arXiv Detail & Related papers (2023-01-22T17:12:58Z) - Deep Learning based pipeline for anomaly detection and quality enhancement in industrial binder jetting processes [68.8204255655161]
Anomaly detection describes methods of finding abnormal states, instances or data points that differ from a normal value space.
This paper contributes to a data-centric way of approaching artificial intelligence in industrial production.
arXiv Detail & Related papers (2022-09-21T08:14:34Z) - Enhancing the Generalization for Intent Classification and Out-of-Domain Detection in SLU [70.44344060176952]
Intent classification is a major task in spoken language understanding (SLU).
Recent works have shown that using extra data and labels can improve the OOD detection performance.
This paper proposes to train a model with only IND data while supporting both IND intent classification and OOD detection.
arXiv Detail & Related papers (2021-06-28T08:27:38Z) - Cycle and Semantic Consistent Adversarial Domain Adaptation for Reducing Simulation-to-Real Domain Shift in LiDAR Bird's Eye View [110.83289076967895]
We present a BEV domain adaptation method based on CycleGAN that uses prior semantic classification in order to preserve the information of small objects of interest during the domain adaptation process.
The quality of the generated BEVs has been evaluated using a state-of-the-art 3D object detection framework at KITTI 3D Object Detection Benchmark.
arXiv Detail & Related papers (2021-04-22T12:47:37Z) - Anomaly Detection Based on Selection and Weighting in Latent Space [73.01328671569759]
We propose a novel selection-and-weighting-based anomaly detection framework called SWAD.
Experiments on both benchmark and real-world datasets have shown the effectiveness and superiority of SWAD.
arXiv Detail & Related papers (2021-03-08T10:56:38Z) - Costs to Consider in Adopting NLP for Your Business [3.608765813727773]
We show the trade-off between performance gain and the cost across the models to give more insights for AI-pivoting business.
We call for more research into low-cost models, especially for under-resourced languages.
arXiv Detail & Related papers (2020-12-16T13:57:31Z)
This list is automatically generated from the titles and abstracts of the papers in this site.