SpaceJAM: a Lightweight and Regularization-free Method for Fast Joint Alignment of Images
- URL: http://arxiv.org/abs/2407.11850v2
- Date: Tue, 20 May 2025 12:50:53 GMT
- Title: SpaceJAM: a Lightweight and Regularization-free Method for Fast Joint Alignment of Images
- Authors: Nir Barel, Ron Shapira Weber, Nir Mualem, Shahaf E. Finder, Oren Freifeld
- Abstract summary: Unsupervised Joint Alignment is beset by challenges such as high complexity, geometric distortions, and convergence to poor local or even global optima. We introduce the Spatial Joint Alignment Model (SpaceJAM), a novel approach that addresses the JA task with efficiency and simplicity.
- Score: 9.099291890744201
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The unsupervised task of Joint Alignment (JA) of images is beset by challenges such as high complexity, geometric distortions, and convergence to poor local or even global optima. Although Vision Transformers (ViT) have recently provided valuable features for JA, they fall short of fully addressing these issues. Consequently, researchers frequently depend on expensive models and numerous regularization terms, resulting in long training times and challenging hyperparameter tuning. We introduce the Spatial Joint Alignment Model (SpaceJAM), a novel approach that addresses the JA task with efficiency and simplicity. SpaceJAM leverages a compact architecture with only 16K trainable parameters and uniquely operates without the need for regularization or atlas maintenance. Evaluations on SPair-71K and CUB datasets demonstrate that SpaceJAM matches the alignment capabilities of existing methods while significantly reducing computational demands and achieving at least a 10x speedup. SpaceJAM sets a new standard for rapid and effective image alignment, making the process more accessible and efficient. Our code is available at: https://bgu-cs-vil.github.io/SpaceJAM/.
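The abstract's core idea, jointly aligning a collection against its own evolving mean rather than a maintained atlas, can be illustrated with a toy congealing-style loop. The sketch below is a hypothetical 1D illustration using integer shifts, not SpaceJAM's actual spatial-transformer architecture; `joint_align_shifts` and all names are assumptions for illustration only.

```python
import numpy as np

def joint_align_shifts(signals, n_iters=10, max_shift=5):
    """Congealing-style joint alignment of 1D signals by integer shifts.

    Repeatedly shift each signal to best match the current mean; the
    mean is recomputed each iteration, never stored as an atlas."""
    aligned = [s.copy() for s in signals]
    shifts = [0] * len(signals)
    for _ in range(n_iters):
        mean = np.mean(aligned, axis=0)
        for i, s in enumerate(signals):
            # pick the cumulative shift that minimizes squared error to the mean
            best = min(range(-max_shift, max_shift + 1),
                       key=lambda d: np.sum((np.roll(s, d) - mean) ** 2))
            shifts[i] = best
            aligned[i] = np.roll(s, best)
    return np.array(aligned), shifts

# three copies of the same bump at different offsets
base = np.zeros(32)
base[10:14] = 1.0
signals = [np.roll(base, d) for d in (0, 2, -3)]
aligned, shifts = joint_align_shifts(signals)
```

After the loop, all three signals coincide and the recovered shifts invert the original offsets.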
Related papers
- Rethinking Multi-Condition DiTs: Eliminating Redundant Attention via Position-Alignment and Keyword-Scoping [61.459927600301654]
Multi-condition control is bottlenecked by the conventional "concatenate-and-attend" strategy. Our analysis reveals that much of this cross-modal interaction is spatially or semantically redundant. We propose Position-aligned and Keyword-scoped Attention (PKA), a highly efficient framework designed to eliminate these redundancies.
arXiv Detail & Related papers (2026-02-06T16:39:10Z) - FastJAM: a Fast Joint Alignment Model for Images [10.522943649619426]
Joint Alignment of images aims to align a collection of images into a unified coordinate frame, such that semantically-similar features appear at corresponding spatial locations. We introduce FastJAM, a rapid, graph-based method that drastically reduces the computational complexity of joint alignment tasks.
arXiv Detail & Related papers (2025-10-26T21:21:27Z) - Agile Tradespace Exploration for Space Rendezvous Mission Design via Transformers [22.891825351056823]
Spacecraft rendezvous enables on-orbit servicing and debris removal, forming the foundation for a scalable space economy. This paper proposes a framework that can be used to design missions for a wide range of flight times. The framework provides high-quality initial guesses that generalize to solutions in fewer iterations.
arXiv Detail & Related papers (2025-10-03T22:28:46Z) - The Curious Case of In-Training Compression of State Space Models [49.819321766705514]
State Space Models (SSMs) tackle long-sequence modeling tasks efficiently, offering both parallelizable training and fast inference. A key design challenge is striking the right balance between maximizing expressivity and limiting this computational burden. Our approach, CompreSSM, applies to Linear Time-Invariant SSMs such as Linear Recurrent Units, but is also extendable to selective models.
arXiv Detail & Related papers (2025-10-03T09:02:33Z) - Accurate and Efficient Low-Rank Model Merging in Core Space [39.05680982515462]
The Core Space merging framework enables the merging of LoRA-adapted models within a common alignment basis. We show that Core Space significantly improves existing merging techniques and achieves state-of-the-art results on both vision and language tasks.
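For intuition, merging LoRA-adapted models can be reduced to a far cruder baseline than Core Space: averaging the low-rank updates B·A directly in weight space. Everything below, including `merge_lora`, is an illustrative assumption, not the paper's alignment-basis method.

```python
import numpy as np

def merge_lora(w0, loras, weights=None):
    """Naively merge LoRA adapters: add a weighted average of their
    low-rank updates B @ A to the shared base weight w0."""
    if weights is None:
        weights = [1.0 / len(loras)] * len(loras)
    delta = sum(wt * (B @ A) for wt, (B, A) in zip(weights, loras))
    return w0 + delta

# two rank-2 adapters on a 4x4 base, each "specializing" two dimensions
lora_a = (np.eye(4)[:, :2], np.eye(4)[:2, :])
lora_b = (np.eye(4)[:, 2:], np.eye(4)[2:, :])
merged = merge_lora(np.zeros((4, 4)), [lora_a, lora_b])
```

Plain averaging like this ignores basis misalignment between adapters, which is precisely the gap an alignment-basis approach targets.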
arXiv Detail & Related papers (2025-09-22T13:48:15Z) - FLAT-LLM: Fine-grained Low-rank Activation Space Transformation for Large Language Model Compression [15.784158079414235]
FLAT-LLM is a training-free structural compression method based on fine-grained low-rank transformations in the activation space. It achieves efficient and effective weight compression without recovery fine-tuning and completes calibration within a few minutes.
arXiv Detail & Related papers (2025-05-29T19:42:35Z) - Efficient Multi-Instance Generation with Janus-Pro-Driven Prompt Parsing [53.295515505026096]
Janus-Pro-driven Prompt Parsing is a prompt-parsing module that bridges text understanding and layout generation.
MIGLoRA is a parameter-efficient plug-in integrating Low-Rank Adaptation into UNet (SD1.5) and DiT (SD3) backbones.
The proposed method achieves state-of-the-art performance on COCO and LVIS benchmarks while maintaining parameter efficiency.
arXiv Detail & Related papers (2025-03-27T00:59:14Z) - Sliding Window Attention Training for Efficient Large Language Models [55.56483740523027]
We introduce SWAT, which enables efficient long-context handling via Sliding Window Attention Training.
This paper first attributes the inefficiency of Transformers to the attention sink phenomenon, which results from the high variance of the softmax operation.
Experiments demonstrate that SWAT achieves SOTA performance compared with state-of-the-art linear recurrent architectures on eight benchmarks.
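The sliding-window idea itself, independent of SWAT's training recipe, is easy to state: each query attends only to the most recent `window` keys. A minimal numpy sketch, with all names assumed for illustration:

```python
import numpy as np

def sliding_window_attention(q, k, v, window=4):
    """Causal attention where query i attends only to keys
    i-window+1 .. i. A generic sketch, not SWAT itself."""
    T, d = q.shape
    scores = q @ k.T / np.sqrt(d)
    idx = np.arange(T)
    # keep keys that are causal and inside the sliding window
    allowed = (idx[None, :] <= idx[:, None]) & (idx[None, :] > idx[:, None] - window)
    scores = np.where(allowed, scores, -np.inf)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v

rng = np.random.default_rng(0)
q, k, v = (rng.normal(size=(6, 2)) for _ in range(3))
out = sliding_window_attention(q, k, v, window=1)
```

With `window=1` each query attends only to itself, so the output reduces to `v`; larger windows trade more context for more computation.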
arXiv Detail & Related papers (2025-02-26T05:31:44Z) - Contrast: A Hybrid Architecture of Transformers and State Space Models for Low-Level Vision [3.574664325523221]
We propose Contrast, a hybrid SR model that combines Convolutional, Transformer, and State Space components.
By integrating transformer and state space mechanisms, Contrast compensates for the shortcomings of each approach, enhancing both global context modeling and pixel-level accuracy.
arXiv Detail & Related papers (2025-01-23T03:34:14Z) - Empowering Snapshot Compressive Imaging: Spatial-Spectral State Space Model with Across-Scanning and Local Enhancement [51.557804095896174]
We introduce a State Space Model with Across-Scanning and Local Enhancement, named ASLE-SSM, that employs a Spatial-Spectral SSM for global-local balanced context encoding and to promote cross-channel interaction.
Experimental results illustrate ASLE-SSM's superiority over existing state-of-the-art methods, with an inference speed 2.4 times faster than the Transformer-based MST while saving 0.12M parameters.
arXiv Detail & Related papers (2024-08-01T15:14:10Z) - SHERL: Synthesizing High Accuracy and Efficient Memory for Resource-Limited Transfer Learning [63.93193829913252]
We propose an innovative METL strategy called SHERL for resource-limited scenarios.
In the early route, intermediate outputs are consolidated via an anti-redundancy operation.
In the late route, utilizing minimal late pre-trained layers could alleviate the peak demand on memory overhead.
arXiv Detail & Related papers (2024-07-10T10:22:35Z) - Sparser is Faster and Less is More: Efficient Sparse Attention for Long-Range Transformers [58.5711048151424]
We introduce SPARSEK Attention, a novel sparse attention mechanism designed to overcome computational and memory obstacles.
Our approach integrates a scoring network and a differentiable top-k mask operator, SPARSEK, to select a constant number of KV pairs for each query.
Experimental results reveal that SPARSEK Attention outperforms previous sparse attention methods.
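A hard top-k variant conveys the gist of selecting a constant number of KV pairs per query. Note the real SPARSEK operator is differentiable and uses a separate scoring network; the version below simply reuses attention scores and is purely illustrative.

```python
import numpy as np

def topk_sparse_attention(q, k, v, top_k=2):
    """Attend over only the top_k highest-scoring keys per query.
    Hard selection; SPARSEK's mask is differentiable instead."""
    T, d = q.shape
    scores = q @ k.T / np.sqrt(d)
    keep = np.argsort(scores, axis=-1)[:, -top_k:]  # top_k key indices per row
    masked = np.full_like(scores, -np.inf)
    np.put_along_axis(masked, keep,
                      np.take_along_axis(scores, keep, axis=-1), axis=-1)
    weights = np.exp(masked - masked.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v

rng = np.random.default_rng(1)
q, k, v = (rng.normal(size=(5, 3)) for _ in range(3))
sparse_out = topk_sparse_attention(q, k, v, top_k=2)
```

Setting `top_k` to the full sequence length recovers dense softmax attention, which makes the sketch easy to sanity-check.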
arXiv Detail & Related papers (2024-06-24T15:55:59Z) - SHMamba: Structured Hyperbolic State Space Model for Audio-Visual Question Answering [5.016335384639901]
Multi-modal input of Audio-Visual Question Answering (AVQA) makes feature extraction and fusion processes more challenging.
We propose SHMamba: Structured Hyperbolic State Space Model to integrate the advantages of hyperbolic geometry and state space models.
Our method demonstrates superiority among all current major methods and is more suitable for practical application scenarios.
arXiv Detail & Related papers (2024-06-14T08:43:31Z) - LInK: Learning Joint Representations of Design and Performance Spaces through Contrastive Learning for Mechanism Synthesis [15.793704096341523]
In this paper, we introduce LInK, a novel framework that integrates contrastive learning of performance and design space with optimization techniques.
By leveraging a multimodal and transformation-invariant contrastive learning framework, LInK learns a joint representation that captures complex physics and design representations of mechanisms.
Our results demonstrate that LInK not only advances the field of mechanism design but also broadens the applicability of contrastive learning and optimization to other areas of engineering.
arXiv Detail & Related papers (2024-05-31T03:04:57Z) - Hybrid-SORT: Weak Cues Matter for Online Multi-Object Tracking [51.16677396148247]
Multi-Object Tracking (MOT) aims to detect and associate all desired objects across frames.
In this paper, we demonstrate this long-standing challenge in MOT can be efficiently and effectively resolved by incorporating weak cues.
Our method Hybrid-SORT achieves superior performance on diverse benchmarks, including MOT17, MOT20, and especially DanceTrack.
arXiv Detail & Related papers (2023-08-01T18:53:24Z) - Parameter-efficient Tuning of Large-scale Multimodal Foundation Model [68.24510810095802]
We propose a graceful prompt framework for cross-modal transfer (Aurora) to overcome these challenges.
Considering the redundancy in existing architectures, we first utilize the mode approximation to generate 0.1M trainable parameters to implement the multimodal prompt tuning.
A thorough evaluation on six cross-modal benchmarks shows that it not only outperforms the state-of-the-art but even outperforms the full fine-tuning approach.
arXiv Detail & Related papers (2023-05-15T06:40:56Z) - Joint Spatial-Temporal and Appearance Modeling with Transformer for Multiple Object Tracking [59.79252390626194]
We propose a novel solution named TransSTAM, which leverages Transformer to model both the appearance features of each object and the spatial-temporal relationships among objects.
The proposed method is evaluated on multiple public benchmarks including MOT16, MOT17, and MOT20, and it achieves a clear performance improvement in both IDF1 and HOTA.
arXiv Detail & Related papers (2022-05-31T01:19:18Z) - SpaceNet: Make Free Space For Continual Learning [15.914199054779438]
We propose a novel architecture-based method, referred to as SpaceNet, for the class-incremental learning scenario.
SpaceNet trains sparse deep neural networks from scratch in an adaptive way that compresses the sparse connections of each task in a compact number of neurons.
Experimental results show the robustness of our proposed method against catastrophic forgetting of old tasks and the efficiency of SpaceNet in utilizing the available capacity of the model.
arXiv Detail & Related papers (2020-07-15T11:21:31Z) - Overlapping Spaces for Compact Graph Representations [17.919759296265]
Various non-trivial spaces are becoming popular for embedding structured data such as graphs, texts, or images.
We generalize the concept of product space and introduce an overlapping space that does not have the configuration search problem.
arXiv Detail & Related papers (2020-07-05T20:55:47Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information it presents and is not responsible for any consequences arising from its use.