Semantic Tracklets: An Object-Centric Representation for Visual
Multi-Agent Reinforcement Learning
- URL: http://arxiv.org/abs/2108.03319v1
- Date: Fri, 6 Aug 2021 22:19:09 GMT
- Title: Semantic Tracklets: An Object-Centric Representation for Visual
Multi-Agent Reinforcement Learning
- Authors: Iou-Jen Liu, Zhongzheng Ren, Raymond A. Yeh, Alexander G. Schwing
- Abstract summary: We study whether scalability can be achieved via a disentangled representation.
We evaluate semantic tracklets on the visual multi-agent particle environment (VMPE) and on the challenging visual multi-agent GFootball environment.
Notably, this method is the first to successfully learn a strategy for five players in the GFootball environment using only visual data.
- Score: 126.57680291438128
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Solving complex real-world tasks, e.g., autonomous fleet control, often
involves a coordinated team of multiple agents which learn strategies from
visual inputs via reinforcement learning. Many existing multi-agent
reinforcement learning (MARL) algorithms, however, don't scale to environments
where agents operate on visual inputs. To address this issue, algorithmically,
recent works have focused on non-stationarity and exploration. In contrast, we
study whether scalability can also be achieved via a disentangled
representation. For this, we explicitly construct an object-centric
intermediate representation to characterize the states of an environment, which
we refer to as 'semantic tracklets.' We evaluate 'semantic tracklets' on the
visual multi-agent particle environment (VMPE) and on the challenging visual
multi-agent GFootball environment. 'Semantic tracklets' consistently outperform
baselines on VMPE, and achieve a +2.4 higher score difference than baselines on
GFootball. Notably, this method is the first to successfully learn a strategy
for five players in the GFootball environment using only visual data.
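The abstract only sketches the approach at a high level, so the following is a minimal, hypothetical Python sketch of the general idea behind an object-centric tracklet representation: per-frame object detections are linked over time into tracklets, and each tracklet is summarized as a compact feature vector that a per-agent policy could consume. The greedy IoU association, feature layout, class labels, and all names here are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch (NOT the paper's code) of an object-centric tracklet state for visual MARL.
# Assumes detections come from some object detector; association is simple greedy IoU matching.
from dataclasses import dataclass, field
from typing import List, Tuple
import numpy as np


@dataclass
class Tracklet:
    track_id: int
    class_id: int                                           # semantic label of the tracked object
    boxes: List[np.ndarray] = field(default_factory=list)   # one [x1, y1, x2, y2] box per frame

    def feature(self) -> np.ndarray:
        """Summarize the tracklet as a compact vector: class, position, size, velocity."""
        last = self.boxes[-1]
        center = np.array([(last[0] + last[2]) / 2, (last[1] + last[3]) / 2])
        size = np.array([last[2] - last[0], last[3] - last[1]])
        if len(self.boxes) > 1:
            prev = self.boxes[-2]
            prev_center = np.array([(prev[0] + prev[2]) / 2, (prev[1] + prev[3]) / 2])
            velocity = center - prev_center
        else:
            velocity = np.zeros(2)
        return np.concatenate([[float(self.class_id)], center, size, velocity])


def iou(a: np.ndarray, b: np.ndarray) -> float:
    """Intersection-over-union of two [x1, y1, x2, y2] boxes."""
    x1, y1 = np.maximum(a[:2], b[:2])
    x2, y2 = np.minimum(a[2:], b[2:])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter + 1e-8)


def update_tracklets(tracklets: List[Tracklet],
                     detections: List[Tuple[np.ndarray, int]],
                     iou_thresh: float = 0.3) -> List[Tracklet]:
    """Greedily associate new (box, class_id) detections with existing tracklets."""
    used = set()
    for t in tracklets:
        best_j, best_iou = None, iou_thresh
        for j, (box, cls) in enumerate(detections):
            if j in used or cls != t.class_id:
                continue
            overlap = iou(t.boxes[-1], box)
            if overlap > best_iou:
                best_j, best_iou = j, overlap
        if best_j is not None:
            t.boxes.append(detections[best_j][0])
            used.add(best_j)
    next_id = max((t.track_id for t in tracklets), default=-1) + 1
    for j, (box, cls) in enumerate(detections):
        if j not in used:                                    # unmatched detection starts a new tracklet
            tracklets.append(Tracklet(next_id, cls, [box]))
            next_id += 1
    return tracklets


if __name__ == "__main__":
    # Two frames of toy detections (class 0 = player, class 1 = ball); a real system
    # would obtain these from an object detector run on the visual observation.
    frame0 = [(np.array([10., 10., 20., 20.]), 0), (np.array([40., 40., 44., 44.]), 1)]
    frame1 = [(np.array([12., 11., 22., 21.]), 0), (np.array([41., 42., 45., 46.]), 1)]
    tracks: List[Tracklet] = []
    for dets in (frame0, frame1):
        tracks = update_tracklets(tracks, dets)
    state = np.stack([t.feature() for t in tracks])          # object-centric state, shape (num_objects, 7)
    print(state.shape)
```

In this sketch the disentangled state is just a stack of per-object vectors; the paper's actual semantic tracklets are richer, so treat this only as an illustration of how an object-centric intermediate representation can replace raw pixels as policy input.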
Related papers
- ROCKET-1: Master Open-World Interaction with Visual-Temporal Context Prompting [24.56720920528011]
Vision-language models (VLMs) have excelled in multimodal tasks, but adapting them to embodied decision-making in open-world environments presents challenges.
A key issue is the difficulty in smoothly connecting individual entities in low-level observations with abstract concepts required for planning.
We propose visual-temporal context prompting, a novel communication protocol between VLMs and policy models.
arXiv Detail & Related papers (2024-10-23T13:26:59Z)
- Multi-Granularity Language-Guided Multi-Object Tracking [95.91263758294154]
We propose a new multi-object tracking framework, named LG-MOT, that explicitly leverages language information at different levels of granularity.
At inference, our LG-MOT uses the standard visual features without relying on annotated language descriptions.
Our LG-MOT achieves an absolute gain of 2.2% in terms of target object association (IDF1 score) compared to the baseline using only visual features.
arXiv Detail & Related papers (2024-06-07T11:18:40Z)
- Learning Robust Visual-Semantic Embedding for Generalizable Person Re-identification [11.562980171753162]
Generalizable person re-identification (Re-ID) is an active research topic in machine learning and computer vision.
Previous methods mainly focus on visual representation learning, while neglecting to explore the potential of semantic features during training.
We propose a Multi-Modal Equivalent Transformer called MMET for more robust visual-semantic embedding learning.
arXiv Detail & Related papers (2023-04-19T08:37:25Z)
- USER: Unified Semantic Enhancement with Momentum Contrast for Image-Text Retrieval [115.28586222748478]
Image-Text Retrieval (ITR) aims at searching for the target instances that are semantically relevant to the given query from the other modality.
Existing approaches typically suffer from two major limitations.
arXiv Detail & Related papers (2023-01-17T12:42:58Z)
- Vision Transformers: From Semantic Segmentation to Dense Prediction [139.15562023284187]
We explore the global context learning potentials of vision transformers (ViTs) for dense visual prediction.
Our motivation is that through learning global context at full receptive field layer by layer, ViTs may capture stronger long-range dependency information.
We formulate a family of Hierarchical Local-Global (HLG) Transformers, characterized by local attention within windows and global attention across windows in a pyramidal architecture.
arXiv Detail & Related papers (2022-07-19T15:49:35Z)
- CLEAR: Improving Vision-Language Navigation with Cross-Lingual, Environment-Agnostic Representations [98.30038910061894]
Vision-and-Language Navigation (VLN) tasks require an agent to navigate through the environment based on language instructions.
We propose CLEAR: Cross-Lingual and Environment-Agnostic Representations.
Our language and visual representations can be successfully transferred to the Room-to-Room and Cooperative Vision-and-Dialogue Navigation tasks.
arXiv Detail & Related papers (2022-07-05T17:38:59Z)
- Robust Policies via Mid-Level Visual Representations: An Experimental Study in Manipulation and Navigation [115.4071729927011]
We study the effects of using mid-level visual representations as generic and easy-to-decode perceptual state in an end-to-end RL framework.
We show that they aid generalization, improve sample complexity, and lead to a higher final performance.
In practice, this means that mid-level representations could be used to successfully train policies for tasks where domain randomization and learning-from-scratch failed.
arXiv Detail & Related papers (2020-11-13T00:16:05Z)