Open Scene Understanding: Grounded Situation Recognition Meets Segment
Anything for Helping People with Visual Impairments
- URL: http://arxiv.org/abs/2307.07757v1
- Date: Sat, 15 Jul 2023 09:41:27 GMT
- Title: Open Scene Understanding: Grounded Situation Recognition Meets Segment
Anything for Helping People with Visual Impairments
- Authors: Ruiping Liu, Jiaming Zhang, Kunyu Peng, Junwei Zheng, Ke Cao, Yufan
Chen, Kailun Yang, Rainer Stiefelhagen
- Abstract summary: Grounded Situation Recognition (GSR) is capable of recognizing and interpreting visual scenes in a contextually intuitive way.
We propose an Open Scene Understanding (OpenSU) system that aims to generate pixel-wise dense segmentation masks of involved entities.
Our model achieves state-of-the-art performance on the SWiG dataset.
- Score: 23.673073261701226
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Grounded Situation Recognition (GSR) is capable of recognizing and
interpreting visual scenes in a contextually intuitive way, yielding salient
activities (verbs) and the involved entities (roles) depicted in images. In
this work, we focus on the application of GSR in assisting people with visual
impairments (PVI). However, PVI often require precise localization of detected
objects to navigate their surroundings confidently and make informed decisions.
For the first time, we propose an Open Scene Understanding
(OpenSU) system that aims to generate pixel-wise dense segmentation masks of
involved entities instead of bounding boxes. Specifically, we build our OpenSU
system on top of GSR by additionally adopting an efficient Segment Anything
Model (SAM). Furthermore, to enhance feature extraction and the interaction
between encoder and decoder, we construct our OpenSU system with a solid,
purely transformer-based backbone to improve GSR performance. To accelerate
convergence, we replace all activation functions within the GSR decoders with
GELU, thereby reducing the training duration. In quantitative
analysis, our model achieves state-of-the-art performance on the SWiG dataset.
Moreover, through field testing on dedicated assistive technology datasets and
application demonstrations, the proposed OpenSU system can be used to enhance
scene understanding and facilitate the independent mobility of people with
visual impairments. Our code will be available at
https://github.com/RuipingL/OpenSU.
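The abstract above describes a two-stage pipeline: the GSR branch predicts the salient verb, its semantic roles, and one bounding box per grounded entity; SAM then converts each box prompt into a dense segmentation mask; and GELU activations in the GSR decoders speed up convergence. Below is a minimal sketch of that flow, assuming the public segment_anything package (the paper adopts an efficient SAM variant; the ViT-B checkpoint here is only for illustration). The GSR stage is omitted, and boxes_to_masks and swap_relu_for_gelu are hypothetical helpers, not the released OpenSU code from the repository above.

    import numpy as np
    import torch.nn as nn
    from segment_anything import sam_model_registry, SamPredictor

    def swap_relu_for_gelu(module: nn.Module) -> None:
        # Recursively swap ReLU activations for GELU, mirroring the activation
        # replacement described for the GSR decoders (assumes the original
        # decoders use ReLU).
        for name, child in module.named_children():
            if isinstance(child, nn.ReLU):
                setattr(module, name, nn.GELU())
            else:
                swap_relu_for_gelu(child)

    def boxes_to_masks(image: np.ndarray, boxes: list) -> list:
        # Convert GSR bounding boxes (XYXY pixel coordinates, one per grounded
        # entity) into dense masks by prompting SAM with each box.
        sam = sam_model_registry["vit_b"](checkpoint="sam_vit_b_01ec64.pth")
        predictor = SamPredictor(sam)
        predictor.set_image(image)  # HxWx3 uint8 RGB array
        masks = []
        for box in boxes:
            m, _, _ = predictor.predict(box=np.asarray(box), multimask_output=False)
            masks.append(m[0])  # boolean HxW mask for this entity
        return masks

In practice the SAM model would be loaded once rather than per call; the per-entity masks can then be paired with the predicted roles to describe both what is in the scene and exactly where it is.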
Related papers
- SADG: Segment Any Dynamic Gaussian Without Object Trackers [39.77468734311312]
SADG, Segment Any Dynamic Gaussian Without Object Trackers, is a novel approach that combines a dynamic Gaussian Splatting representation with semantic information, without relying on object IDs.
We learn semantically-aware features by leveraging masks generated from the Segment Anything Model (SAM) and utilizing our novel contrastive learning objective based on hard pixel mining.
We evaluate SADG on proposed benchmarks and demonstrate the superior performance of our approach in segmenting objects within dynamic scenes.
arXiv Detail & Related papers (2024-11-28T17:47:48Z)
- REACT: Recognize Every Action Everywhere All At Once [8.10024991952397]
Group Activity Recognition (GAR) is a fundamental problem in computer vision, with diverse applications in sports analysis, surveillance, and social scene understanding.
We present REACT, an architecture inspired by the transformer encoder-decoder model.
Our method outperforms state-of-the-art GAR approaches in extensive experiments, demonstrating superior accuracy in recognizing and understanding group activities.
arXiv Detail & Related papers (2023-11-27T20:48:54Z)
- Visual In-Context Prompting [100.93587329049848]
In this paper, we introduce a universal visual in-context prompting framework for vision tasks such as open-set segmentation and detection.
We build on top of an encoder-decoder architecture, and develop a versatile prompt encoder to support a variety of prompts like strokes, boxes, and points.
Our extensive explorations show that the proposed visual in-context prompting elicits extraordinary referring and generic segmentation capabilities.
arXiv Detail & Related papers (2023-11-22T18:59:48Z)
- EventTransAct: A video transformer-based framework for Event-camera based action recognition [52.537021302246664]
Event cameras offer new opportunities for action recognition compared to standard RGB videos.
In this study, we employ a computationally efficient model, namely the video transformer network (VTN), which initially acquires spatial embeddings per event-frame.
In order to better adopt the VTN for the sparse and fine-grained nature of event data, we design Event-Contrastive Loss ($\mathcal{L}_{EC}$) and event-specific augmentations.
arXiv Detail & Related papers (2023-08-25T23:51:07Z)
- Deep Learning Computer Vision Algorithms for Real-time UAVs On-board Camera Image Processing [77.34726150561087]
This paper describes how advanced deep learning based computer vision algorithms are applied to enable real-time on-board sensor processing for small UAVs.
All algorithms have been developed using state-of-the-art image processing methods based on deep neural networks.
arXiv Detail & Related papers (2022-11-02T11:10:42Z)
- Exploring Intra- and Inter-Video Relation for Surgical Semantic Scene Segmentation [58.74791043631219]
We propose a novel framework STswinCL that explores the complementary intra- and inter-video relations to boost segmentation performance.
We extensively validate our approach on two public surgical video benchmarks, including EndoVis18 Challenge and CaDIS dataset.
Experimental results demonstrate the promising performance of our method, which consistently exceeds previous state-of-the-art approaches.
arXiv Detail & Related papers (2022-03-29T05:52:23Z)
- Scalable Perception-Action-Communication Loops with Convolutional and Graph Neural Networks [208.15591625749272]
We present a perception-action-communication loop design using Vision-based Graph Aggregation and Inference (VGAI).
Our framework is implemented by a cascade of a convolutional and a graph neural network (CNN / GNN), addressing agent-level visual perception and feature learning.
We demonstrate that VGAI yields performance comparable to or better than other decentralized controllers.
arXiv Detail & Related papers (2021-06-24T23:57:21Z)
- Leveraging Semantic Scene Characteristics and Multi-Stream Convolutional Architectures in a Contextual Approach for Video-Based Visual Emotion Recognition in the Wild [31.40575057347465]
We tackle the task of video-based visual emotion recognition in the wild.
Standard methodologies that rely solely on the extraction of bodily and facial features often fall short of accurate emotion prediction.
We aspire to alleviate this problem by leveraging visual context in the form of scene characteristics and attributes.
arXiv Detail & Related papers (2021-05-16T17:31:59Z)
- Perception Framework through Real-Time Semantic Segmentation and Scene Recognition on a Wearable System for the Visually Impaired [27.04316520914628]
We present a multi-task efficient perception system for the scene parsing and recognition tasks.
This system runs on a wearable belt with an Intel RealSense LiDAR camera and an Nvidia Jetson AGX Xavier processor.
arXiv Detail & Related papers (2021-03-06T15:07:17Z)
- Active Visual Localization in Partially Calibrated Environments [35.48595012305253]
Humans can robustly localize themselves without a map after they get lost, by following prominent visual cues or landmarks.
In this work, we aim at endowing autonomous agents with the same ability. Such an ability is important in robotics applications, yet it is very challenging when an agent is exposed to partially calibrated environments.
We propose an indoor scene dataset ACR-6, which consists of both synthetic and real data and simulates challenging scenarios for active visual localization.
arXiv Detail & Related papers (2020-12-08T08:00:55Z)
- Semantics-aware Adaptive Knowledge Distillation for Sensor-to-Vision Action Recognition [131.6328804788164]
We propose a framework, named Semantics-aware Adaptive Knowledge Distillation Networks (SAKDN), to enhance action recognition in the vision-sensor modality (videos).
The SAKDN uses multiple wearable sensors as teacher modalities and RGB videos as the student modality.
arXiv Detail & Related papers (2020-09-01T03:38:31Z)
This list is automatically generated from the titles and abstracts of the papers on this site.