3M-TRANSFORMER: A Multi-Stage Multi-Stream Multimodal Transformer for
Embodied Turn-Taking Prediction
- URL: http://arxiv.org/abs/2310.14859v3
- Date: Thu, 21 Dec 2023 18:19:58 GMT
- Authors: Mehdi Fatan, Emanuele Mincato, Dimitra Pintzou, Mariella Dimiccoli
- Abstract summary: We propose a new multimodal transformer-based architecture for predicting turn-taking in embodied, synchronized multi-perspective data.
Our experimental results on the recently introduced EgoCom dataset show a substantial performance improvement of up to 14.01% on average.
- Score: 4.342241136871849
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Predicting turn-taking in multiparty conversations has many practical
applications in human-computer/robot interaction. However, the complexity of
human communication makes it a challenging task. Recent advances have shown
that synchronous multi-perspective egocentric data can significantly improve
turn-taking prediction compared to asynchronous, single-perspective
transcriptions. Building on this research, we propose a new multimodal
transformer-based architecture for predicting turn-taking in embodied,
synchronized multi-perspective data. Our experimental results on the recently
introduced EgoCom dataset show a substantial performance improvement of up to
14.01% on average compared to existing baselines and alternative
transformer-based approaches. The source code and pre-trained models of
our 3M-Transformer will be made available upon acceptance.
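As a rough illustration of the task itself (not the paper's architecture), turn-taking prediction over synchronized streams can be framed as fusing per-modality evidence into a single turn-change probability. The late-fusion sketch below is hypothetical; the function name, modality names, and weights are illustrative only:

```python
import math

def late_fusion_turn_score(modality_logits, weights=None):
    """Fuse per-modality turn-taking logits into one probability.

    modality_logits maps a modality name (e.g. 'audio', 'video', 'text')
    to a raw score for an upcoming turn change; weights optionally
    weight each stream. Hypothetical names, not the paper's API.
    """
    if weights is None:
        weights = {m: 1.0 for m in modality_logits}
    fused = sum(weights[m] * z for m, z in modality_logits.items())
    fused /= sum(weights.values())
    return 1.0 / (1.0 + math.exp(-fused))  # sigmoid -> P(turn change)

# Three synchronized streams voting on an upcoming turn change.
p = late_fusion_turn_score({'audio': 2.0, 'video': 0.5, 'text': -0.5})
```

A learned multimodal transformer replaces these fixed weights with attention over the streams, but the output framing (a probability of an imminent turn change) is the same.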
Related papers
- Multi-Transmotion: Pre-trained Model for Human Motion Prediction [68.87010221355223]
Multi-Transmotion is an innovative transformer-based model designed for cross-modality pre-training.
Our methodology demonstrates competitive performance across various datasets on several downstream tasks.
arXiv Detail & Related papers (2024-11-04T23:15:21Z) - MART: MultiscAle Relational Transformer Networks for Multi-agent Trajectory Prediction [5.8919870666241945]
We present the Multiscale Relational Transformer (MART) network for multi-agent trajectory prediction.
MART is a hypergraph transformer architecture that considers both individual and group behaviors within the transformer machinery.
In addition, we propose an Adaptive Group Estimator (AGE) designed to infer complex group relations in real-world environments.
arXiv Detail & Related papers (2024-07-31T14:31:49Z) - Towards Multi-modal Transformers in Federated Learning [10.823839967671454]
This paper explores a transfer multi-modal federated learning (MFL) scenario within the vision-language domain.
We introduce a novel framework called Federated modality complementary and collaboration (FedCola) by addressing the in-modality and cross-modality gaps among clients.
arXiv Detail & Related papers (2024-04-18T19:04:27Z) - Bi-directional Adapter for Multi-modal Tracking [67.01179868400229]
We propose a novel multi-modal visual prompt tracking model based on a universal bi-directional adapter.
We develop a simple but effective light feature adapter to transfer modality-specific information from one modality to another.
Our model achieves superior tracking performance in comparison with both the full fine-tuning methods and the prompt learning-based methods.
arXiv Detail & Related papers (2023-12-17T05:27:31Z) - Deformable Mixer Transformer with Gating for Multi-Task Learning of
Dense Prediction [126.34551436845133]
CNNs and Transformers have their own advantages, and both have been widely used for dense prediction in multi-task learning (MTL).
We present a novel MTL model that combines the merits of deformable CNN and query-based Transformer with shared gating for multi-task learning of dense prediction.
arXiv Detail & Related papers (2023-08-10T17:37:49Z) - TransFusion: A Practical and Effective Transformer-based Diffusion Model
for 3D Human Motion Prediction [1.8923948104852863]
We propose TransFusion, an innovative and practical diffusion-based model for 3D human motion prediction.
Our model leverages Transformer as the backbone with long skip connections between shallow and deep layers.
In contrast to prior diffusion-based models that utilize extra modules like cross-attention and adaptive layer normalization, we treat all inputs, including conditions, as tokens to create a more lightweight model.
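The tokens-as-conditions idea in the last sentence amounts to simple sequence construction: condition inputs are prepended to the motion tokens so a single self-attention stack sees everything, with no cross-attention module. A hypothetical sketch (names are ours, not TransFusion's API):

```python
def build_token_sequence(condition_frames, noisy_motion_frames):
    """Treat conditioning inputs as ordinary tokens: instead of routing
    conditions through cross-attention, prepend them to the motion
    token sequence so one self-attention stack attends over both.

    Each argument is a list of feature vectors; tags mark which
    positions are conditions. Illustrative only.
    """
    tokens = [('cond', f) for f in condition_frames]
    tokens += [('motion', f) for f in noisy_motion_frames]
    return tokens

# One observed condition frame plus two noisy motion frames to denoise.
seq = build_token_sequence([[0.1, 0.2]], [[0.3, 0.4], [0.5, 0.6]])
```

Folding conditions into the token sequence is what makes the model lighter: the same attention layers serve both conditioning and denoising.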
arXiv Detail & Related papers (2023-07-30T01:52:07Z) - Emergent Agentic Transformer from Chain of Hindsight Experience [96.56164427726203]
We show that a simple transformer-based model performs competitively with both temporal-difference and imitation-learning-based approaches.
To our knowledge, this is the first time a simple transformer-based model has done so.
arXiv Detail & Related papers (2023-05-26T00:43:02Z) - Optimizing Non-Autoregressive Transformers with Contrastive Learning [74.46714706658517]
Non-autoregressive Transformers (NATs) reduce the inference latency of Autoregressive Transformers (ATs) by predicting words all at once rather than in sequential order.
In this paper, we propose to ease the difficulty of modality learning via sampling from the model distribution instead of the data distribution.
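The parallel decoding that separates NATs from ATs can be illustrated with a toy argmax over per-position word scores: every output position is filled in one step rather than left to right. A didactic sketch, not a real NAT implementation:

```python
def nat_decode(position_logits):
    """Non-autoregressive decoding sketch: independently pick the
    highest-scoring word at every output position in a single parallel
    step, with no left-to-right dependence between positions.

    position_logits: one {word: score} dict per output position.
    """
    return [max(scores, key=scores.get) for scores in position_logits]

out = nat_decode([
    {'the': 2.1, 'a': 0.3},
    {'cat': 1.7, 'dog': 1.5},
    {'sat': 0.9, 'ran': 0.2},
])
# -> ['the', 'cat', 'sat']
```

The independence between positions is exactly what creates the "multi-modality" difficulty the paper targets: each position may commit to a different plausible sentence.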
arXiv Detail & Related papers (2023-05-23T04:20:13Z) - SMART: Simultaneous Multi-Agent Recurrent Trajectory Prediction [72.37440317774556]
We propose advances that address two key challenges in future trajectory prediction: multimodality in both training data and predictions, and constant-time inference regardless of the number of agents.
arXiv Detail & Related papers (2020-07-26T08:17:10Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.