RegCLR: A Self-Supervised Framework for Tabular Representation Learning in the Wild
- URL: http://arxiv.org/abs/2211.01165v1
- Date: Wed, 2 Nov 2022 14:39:56 GMT
- Title: RegCLR: A Self-Supervised Framework for Tabular Representation Learning in the Wild
- Authors: Weiyao Wang, Byung-Hak Kim, Varun Ganapathi
- Abstract summary: We introduce RegCLR, a new self-supervised framework that combines contrastive and regularized methods.
RegCLR is compatible with the standard Vision Transformer architecture.
Several real-world table recognition scenarios have been shown to benefit greatly from the representations learned from this new framework.
- Score: 1.5187073409911216
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recent advances in self-supervised learning (SSL) using large models to learn
visual representations from natural images are rapidly closing the gap between
the results produced by fully supervised learning and those produced by SSL on
downstream vision tasks. Inspired by this advancement and primarily motivated
by the emergence of tabular and structured document image applications, we
investigate which self-supervised pretraining objectives, architectures, and
fine-tuning strategies are most effective. To address these questions, we
introduce RegCLR, a new self-supervised framework that combines contrastive and
regularized methods and is compatible with the standard Vision Transformer
architecture. Then, RegCLR is instantiated by integrating masked autoencoders
as a representative example of a contrastive method and enhanced Barlow Twins
as a representative example of a regularized method with configurable input
image augmentations in both branches. Several real-world table recognition
scenarios (e.g., extracting tables from document images), ranging from standard
Word and Latex documents to even more challenging electronic health records
(EHR) computer screen images, have been shown to benefit greatly from the
representations learned from this new framework, with detection
average-precision (AP) improving relatively by 4.8% for Table, 11.8% for
Column, and 11.1% for GUI objects over a previous fully supervised baseline on
real-world EHR screen images.
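The digest does not include any code for RegCLR; purely as a loose illustration of how a two-branch objective of this kind could be wired together, the sketch below pairs a masked-reconstruction term with a Barlow Twins redundancy-reduction term computed on projected embeddings of two augmented views. Every name and hyperparameter here (`autoencoder`, `encoder`, `projector`, `mask_ratio`, `lambda_bt`) is an illustrative assumption, not the authors' implementation.

```python
import torch
import torch.nn.functional as F


def barlow_twins_loss(z1, z2, lambda_offdiag=5e-3, eps=1e-6):
    """Redundancy-reduction loss between two batches of embeddings (N, D)."""
    z1 = (z1 - z1.mean(0)) / (z1.std(0) + eps)
    z2 = (z2 - z2.mean(0)) / (z2.std(0) + eps)
    n = z1.shape[0]
    c = (z1.T @ z2) / n                                # (D, D) cross-correlation
    diag = torch.diagonal(c)
    on_diag = (diag - 1).pow(2).sum()                  # push diagonal toward 1
    off_diag = c.pow(2).sum() - diag.pow(2).sum()      # push off-diagonal toward 0
    return on_diag + lambda_offdiag * off_diag


def masked_reconstruction_loss(autoencoder, images, mask_ratio=0.75, patch=16):
    """MAE-flavoured branch: hide random patches, score only the hidden pixels."""
    b, c, h, w = images.shape
    keep = (torch.rand(b, 1, h // patch, w // patch, device=images.device)
            > mask_ratio).float()
    mask = F.interpolate(keep, size=(h, w), mode="nearest")   # 1 = visible pixel
    recon = autoencoder(images * mask)
    hidden = 1.0 - mask
    return ((recon - images) ** 2 * hidden).sum() / (hidden.sum() * c).clamp(min=1.0)


def regclr_style_loss(autoencoder, encoder, projector, view1, view2, lambda_bt=1.0):
    """Hypothetical combined objective: a masked-reconstruction branch on view1
    plus a Barlow Twins branch on projected embeddings of both augmented views."""
    rec = masked_reconstruction_loss(autoencoder, view1)
    z1 = projector(encoder(view1))
    z2 = projector(encoder(view2))
    return rec + lambda_bt * barlow_twins_loss(z1, z2)


if __name__ == "__main__":
    # Toy stand-ins just to exercise the tensor shapes; a real setup would use a ViT.
    import torch.nn as nn
    ae = nn.Sequential(nn.Conv2d(3, 8, 3, padding=1), nn.ReLU(),
                       nn.Conv2d(8, 3, 3, padding=1))
    enc = nn.Sequential(nn.Conv2d(3, 16, 3, stride=2, padding=1), nn.ReLU(),
                        nn.AdaptiveAvgPool2d(1), nn.Flatten())
    proj = nn.Linear(16, 32)
    v1, v2 = torch.rand(4, 3, 64, 64), torch.rand(4, 3, 64, 64)
    print(regclr_style_loss(ae, enc, proj, v1, v2).item())
```

In the paper's actual instantiation both branches are built around a standard ViT encoder with configurable input augmentations; the toy convolutional modules above exist only to make the sketch runnable.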
Related papers
- ELIP: Enhanced Visual-Language Foundation Models for Image Retrieval [83.01358520910533]
We introduce a new framework that can boost the performance of large-scale pre-trained vision-language models.
The approach, Enhanced Language-Image Pre-training (ELIP), uses the text query, via a simple mapping network, to predict a set of visual prompts.
ELIP can easily be applied to the commonly used CLIP, SigLIP and BLIP-2 networks.
arXiv Detail & Related papers (2025-02-21T18:59:57Z)
- SCOT: Self-Supervised Contrastive Pretraining For Zero-Shot Compositional Retrieval [7.248145893361865]
Compositional image retrieval (CIR) is a multimodal learning task where a model combines a query image with a user-provided text modification to retrieve a target image.
Existing methods primarily focus on fully-supervised learning, wherein models are trained on datasets of labeled triplets such as FashionIQ and CIRR.
In this work, we propose SCOT, a novel zero-shot compositional pretraining strategy that combines existing large image-text pair datasets with the generative capabilities of large language models to contrastively train an embedding composition network.
arXiv Detail & Related papers (2025-01-12T07:23:49Z)
- Ranking-aware adapter for text-driven image ordering with CLIP [76.80965830448781]
We propose an effective yet efficient approach that reframes the CLIP model into a learning-to-rank task.
Our approach incorporates learnable prompts to adapt to new instructions for ranking purposes.
Our ranking-aware adapter consistently outperforms fine-tuned CLIPs on various tasks.
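As a rough, hypothetical illustration of the learning-to-rank idea summarized above (not the paper's adapter), one could freeze CLIP-like image and text features and train only a small learnable prompt vector with a pairwise margin ranking loss; all names below (`pairwise_ranking_loss`, `prompt`, the frozen features) are assumptions made for the sketch.

```python
import torch
import torch.nn.functional as F


def pairwise_ranking_loss(scores, targets, margin=0.1):
    """scores: (N,) similarity of each image to one ranking instruction.
    targets: (N,) ground-truth ordinal values (e.g. age, count, intensity).
    Penalise every pair whose predicted order contradicts the target order."""
    diff_score = scores.unsqueeze(0) - scores.unsqueeze(1)     # (N, N)
    diff_target = targets.unsqueeze(0) - targets.unsqueeze(1)  # (N, N)
    sign = torch.sign(diff_target)
    loss = F.relu(margin - sign * diff_score)
    mask = sign != 0                                           # ignore tied pairs
    return loss[mask].mean() if mask.any() else scores.sum() * 0.0


if __name__ == "__main__":
    torch.manual_seed(0)
    img_feats = F.normalize(torch.randn(8, 512), dim=-1)      # frozen CLIP-like features
    prompt = torch.nn.Parameter(torch.zeros(512))              # learnable ranking prompt
    text_feat = F.normalize(torch.randn(512), dim=-1)          # e.g. "a photo of an old person"
    scores = img_feats @ F.normalize(text_feat + prompt, dim=-1)
    targets = torch.tensor([23., 31., 40., 18., 65., 50., 29., 70.])  # e.g. ages
    loss = pairwise_ranking_loss(scores, targets)
    loss.backward()                                            # only the prompt gets gradients
    print(float(loss), prompt.grad.norm().item())
```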
arXiv Detail & Related papers (2024-12-09T18:51:05Z)
- From Prototypes to General Distributions: An Efficient Curriculum for Masked Image Modeling [11.634154932876719]
Masked Image Modeling has emerged as a powerful self-supervised learning paradigm for visual representation learning.
We propose a prototype-driven curriculum learning framework that structures the learning process to progress from prototypical examples to more complex variations in the dataset.
Our findings suggest that carefully controlling the order of training examples plays a crucial role in self-supervised visual learning.
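One generic way such a prototype-driven, easy-to-hard ordering could be realized is sketched below, using a few k-means steps to find prototypes and sorting samples by distance to their nearest prototype. This is an illustrative interpretation under that assumption, not the paper's actual curriculum.

```python
import torch


def curriculum_order(features, num_prototypes=10, iters=20):
    """Cluster features with a few k-means steps, then sort samples by distance
    to their nearest prototype (closest = most prototypical = seen first)."""
    n, _ = features.shape
    protos = features[torch.randperm(n)[:num_prototypes]].clone()
    for _ in range(iters):
        assign = torch.cdist(features, protos).argmin(dim=1)   # (N,) cluster ids
        for k in range(num_prototypes):
            members = features[assign == k]
            if len(members) > 0:
                protos[k] = members.mean(dim=0)
    nearest = torch.cdist(features, protos).min(dim=1).values
    return torch.argsort(nearest)                              # easy-to-hard indices


if __name__ == "__main__":
    feats = torch.randn(1000, 64)
    order = curriculum_order(feats)
    first_batch = order[:32]                                   # most prototypical samples
    print(first_batch[:5].tolist())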
arXiv Detail & Related papers (2024-11-16T03:21:06Z)
- Boosting Federated Domain Generalization: Understanding the Role of Advanced Pre-Trained Architectures [27.386915138058416]
We study the efficacy of advanced pre-trained architectures, such as Vision Transformers (ViT), ConvNeXt, and Swin Transformers, in enhancing Federated Domain Generalization.
We evaluate different variants of these architectures, using extensive pre-training datasets such as ImageNet-1K, ImageNet-21K, JFT-300M, and ImageNet-22K.
We observe that certain variants of these advanced models, despite having fewer parameters, outperform larger ResNet models.
arXiv Detail & Related papers (2024-09-20T14:09:05Z)
- Pre-trained Text-to-Image Diffusion Models Are Versatile Representation Learners for Control [73.6361029556484]
Embodied AI agents require a fine-grained understanding of the physical world mediated through visual and language inputs.
We consider pre-trained text-to-image diffusion models, which are explicitly optimized to generate images from text prompts.
We show that Stable Control Representations enable learning policies that exhibit state-of-the-art performance on OVMM, a difficult open-vocabulary navigation benchmark.
arXiv Detail & Related papers (2024-05-09T15:39:54Z)
- Can Generative Models Improve Self-Supervised Representation Learning? [0.7999703756441756]
We introduce a framework that enriches the self-supervised learning (SSL) paradigm by utilizing generative models to produce semantically consistent image augmentations.
Our results show that our framework significantly enhances the quality of learned visual representations by up to 10% Top-1 accuracy in downstream tasks.
arXiv Detail & Related papers (2024-03-09T17:17:07Z)
- Image2Sentence based Asymmetrical Zero-shot Composed Image Retrieval [92.13664084464514]
The task of composed image retrieval (CIR) aims to retrieve images based on the query image and the text describing the users' intent.
Existing methods have made great progress with advanced large vision-language (VL) models on the CIR task; however, they generally suffer from two main issues: a lack of labeled triplets for model training and difficulty of deployment in resource-restricted environments.
We propose Image2Sentence based Asymmetric zero-shot composed image retrieval (ISA), which takes advantage of the VL model and only relies on unlabeled images for composition learning.
arXiv Detail & Related papers (2024-03-03T07:58:03Z)
- Improve Supervised Representation Learning with Masked Image Modeling [30.30649867772395]
We propose a simple yet effective setup that can easily integrate masked image modeling into existing supervised training paradigms.
We show with minimal change in architecture and no overhead in inference that this setup is able to improve the quality of the learned representations.
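A minimal sketch of that general setup, assuming a shared backbone feeding both a classification head and a lightweight reconstruction decoder, is given below; the module names and the `lambda_mim` weight are placeholders rather than the paper's configuration.

```python
import torch
import torch.nn.functional as F


def supervised_plus_mim_loss(backbone, classifier, decoder, images, labels,
                             mask_ratio=0.6, patch=16, lambda_mim=0.5):
    """Standard cross-entropy on masked inputs plus reconstruction of hidden patches."""
    b, c, h, w = images.shape
    keep = (torch.rand(b, 1, h // patch, w // patch, device=images.device)
            > mask_ratio).float()
    mask = F.interpolate(keep, size=(h, w), mode="nearest")    # 1 = visible pixel
    feats = backbone(images * mask)                            # shared encoder
    ce = F.cross_entropy(classifier(feats), labels)            # supervised term
    recon = decoder(feats)
    hidden = 1.0 - mask
    mim = ((recon - images) ** 2 * hidden).sum() / (hidden.sum() * c).clamp(min=1.0)
    return ce + lambda_mim * mim


if __name__ == "__main__":
    import torch.nn as nn
    backbone = nn.Sequential(nn.Conv2d(3, 16, 3, padding=1), nn.ReLU())
    classifier = nn.Sequential(nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(16, 10))
    decoder = nn.Conv2d(16, 3, 3, padding=1)
    x, y = torch.rand(4, 3, 64, 64), torch.randint(0, 10, (4,))
    print(supervised_plus_mim_loss(backbone, classifier, decoder, x, y).item())
```

At inference time only `backbone` and `classifier` would be used, which is consistent with the "no overhead in inference" claim above.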
arXiv Detail & Related papers (2023-12-01T22:03:25Z)
- Patch-Wise Self-Supervised Visual Representation Learning: A Fine-Grained Approach [4.9204263448542465]
This study introduces an innovative, fine-grained dimension by integrating patch-level discrimination into self-supervised visual representation learning.
We employ a distinctive photometric patch-level augmentation, where each patch is individually augmented, independent from other patches within the same view.
We present a simple yet effective patch-matching algorithm to find the corresponding patches across the augmented views.
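The sketch below illustrates the per-patch photometric idea in its simplest possible form: each grid cell receives its own random brightness/contrast jitter, and, since no geometric transform is applied here, patch i in one view trivially corresponds to patch i in the other. It assumes image height and width are divisible by the patch size and is not the paper's actual augmentation or matching algorithm.

```python
import torch


def patchwise_photometric_jitter(images, patch=16, brightness=0.4, contrast=0.4):
    """Apply an independent brightness/contrast jitter to every non-overlapping patch."""
    b, c, h, w = images.shape                       # assumes h and w divisible by patch
    ph, pw = h // patch, w // patch
    x = images.view(b, c, ph, patch, pw, patch)     # each grid cell is one patch
    gain = 1 + contrast * (2 * torch.rand(b, 1, ph, 1, pw, 1, device=images.device) - 1)
    bias = brightness * (2 * torch.rand(b, 1, ph, 1, pw, 1, device=images.device) - 1)
    return (x * gain + bias).view(b, c, h, w).clamp(0, 1)


if __name__ == "__main__":
    img = torch.rand(2, 3, 64, 64)
    view1 = patchwise_photometric_jitter(img)
    view2 = patchwise_photometric_jitter(img)
    # Corresponding patches share the same grid index across the two views.
    print(view1.shape, (view1 - view2).abs().mean().item())
```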
arXiv Detail & Related papers (2023-10-28T09:35:30Z)
- Prompt-based Learning for Unpaired Image Captioning [86.44188293709307]
Unpaired Image Captioning (UIC) has been developed to learn image descriptions from unaligned vision-language sample pairs.
Recent successes of Vision-Language Pre-Trained Models (VL-PTMs) have triggered the development of prompt-based learning.
We present in this paper a novel prompt-based scheme to train the UIC model, making the best use of the powerful generalization ability of VL-PTMs.
arXiv Detail & Related papers (2022-05-26T03:13:43Z)
- Robust Cross-Modal Representation Learning with Progressive Self-Distillation [7.676408770854477]
The learning objective of CLIP's vision-language approach does not effectively account for the noisy many-to-many correspondences found in web-harvested image captioning datasets.
We introduce a novel training framework based on cross-modal contrastive learning that uses progressive self-distillation and soft image-text alignments to more efficiently learn robust representations from noisy data.
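A loose sketch of the soft-alignment ingredient (not the authors' implementation) is a CLIP-style contrastive loss whose one-hot targets are blended with the model's own softened similarity distribution, with the blend weight presumably ramped up over training for the "progressive" part; `alpha` and `temperature` below are illustrative assumptions.

```python
import torch
import torch.nn.functional as F


def soft_contrastive_loss(img_emb, txt_emb, alpha=0.3, temperature=0.07):
    """InfoNCE-style loss whose targets mix one-hot pairs with self-distilled soft targets."""
    img_emb = F.normalize(img_emb, dim=-1)
    txt_emb = F.normalize(txt_emb, dim=-1)
    logits = img_emb @ txt_emb.T / temperature                 # (N, N) similarities
    n = logits.shape[0]
    hard = torch.eye(n, device=logits.device)                  # diagonal = paired captions
    with torch.no_grad():                                      # teacher = current model
        soft = logits.softmax(dim=-1)
    targets = (1 - alpha) * hard + alpha * soft
    log_probs = logits.log_softmax(dim=-1)
    return -(targets * log_probs).sum(dim=-1).mean()


if __name__ == "__main__":
    i, t = torch.randn(16, 256, requires_grad=True), torch.randn(16, 256)
    loss = soft_contrastive_loss(i, t)
    loss.backward()
    print(loss.item())
```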
arXiv Detail & Related papers (2022-04-10T03:28:18Z)
- VL-LTR: Learning Class-wise Visual-Linguistic Representation for Long-Tailed Visual Recognition [61.75391989107558]
We present a visual-linguistic long-tailed recognition framework, termed VL-LTR.
Our method can learn visual representation from images and corresponding linguistic representation from noisy class-level text descriptions.
Notably, our method achieves 77.2% overall accuracy on ImageNet-LT, which significantly outperforms the previous best method by over 17 points.
arXiv Detail & Related papers (2021-11-26T16:24:03Z)
- Towards Unsupervised Sketch-based Image Retrieval [126.77787336692802]
We introduce a novel framework that simultaneously performs unsupervised representation learning and sketch-photo domain alignment.
Our framework achieves excellent performance in the new unsupervised setting, and performs comparably to or better than the state-of-the-art in the zero-shot setting.
arXiv Detail & Related papers (2021-05-18T02:38:22Z)
- A Simple Framework for Contrastive Learning of Visual Representations [116.37752766922407]
This paper presents SimCLR: a simple framework for contrastive learning of visual representations.
We show that composition of data augmentations plays a critical role in defining effective predictive tasks.
We are able to considerably outperform previous methods for self-supervised and semi-supervised learning on ImageNet.
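For reference, a compact version of the NT-Xent contrastive loss popularized by SimCLR is shown below; this is the standard formulation rather than code taken from the paper, and the temperature value is illustrative.

```python
import torch
import torch.nn.functional as F


def nt_xent_loss(z1, z2, temperature=0.5):
    """z1, z2: (N, D) projections of two augmented views of the same N images."""
    z = F.normalize(torch.cat([z1, z2], dim=0), dim=-1)        # (2N, D)
    sim = z @ z.T / temperature                                # (2N, 2N) similarities
    n = z1.shape[0]
    mask = torch.eye(2 * n, dtype=torch.bool, device=z.device)
    sim = sim.masked_fill(mask, float("-inf"))                 # drop self-similarity
    # The positive for sample i is its other view: i+n (first half) or i-n (second half).
    targets = torch.cat([torch.arange(n, 2 * n), torch.arange(0, n)]).to(z.device)
    return F.cross_entropy(sim, targets)


if __name__ == "__main__":
    a, b = torch.randn(8, 128), torch.randn(8, 128)
    print(nt_xent_loss(a, b).item())
```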
arXiv Detail & Related papers (2020-02-13T18:50:45Z)
This list is automatically generated from the titles and abstracts of the papers on this site.