SemImage: Semantic Image Representation for Text, a Novel Framework for Embedding Disentangled Linguistic Features
- URL: http://arxiv.org/abs/2512.00088v1
- Date: Wed, 26 Nov 2025 12:58:21 GMT
- Title: SemImage: Semantic Image Representation for Text, a Novel Framework for Embedding Disentangled Linguistic Features
- Authors: Mohammad Zare
- Abstract summary: SemImage is a novel method for representing a text document as a two-dimensional semantic image to be processed by convolutional neural networks (CNNs). In a SemImage, each word is represented as a pixel in a 2D image: rows correspond to sentences and an additional boundary row is inserted between sentences to mark semantic transitions.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We propose SemImage, a novel method for representing a text document as a two-dimensional semantic image to be processed by convolutional neural networks (CNNs). In a SemImage, each word is represented as a pixel in a 2D image: rows correspond to sentences and an additional boundary row is inserted between sentences to mark semantic transitions. Each pixel is not a typical RGB value but a vector in a disentangled HSV color space, encoding different linguistic features: the Hue, represented with two components H_cos and H_sin to account for circularity, encodes the topic; Saturation encodes the sentiment; and Value encodes intensity or certainty. We enforce this disentanglement via a multi-task learning framework: a ColorMapper network maps each word embedding to the HSV space, and auxiliary supervision is applied to the Hue and Saturation channels to predict topic and sentiment labels, alongside the main task objective. The insertion of dynamically computed boundary rows between sentences yields sharp visual boundaries in the image when consecutive sentences are semantically dissimilar, effectively making paragraph breaks salient. We integrate SemImage with standard 2D CNNs (e.g., ResNet) for document classification. Experiments on multi-label datasets (with both topic and sentiment annotations) and single-label benchmarks demonstrate that SemImage can achieve competitive or better accuracy than strong text classification baselines (including BERT and hierarchical attention networks) while offering enhanced interpretability. An ablation study confirms the importance of the multi-channel HSV representation and the dynamic boundary rows. Finally, we present visualizations of SemImage that qualitatively reveal clear patterns corresponding to topic shifts and sentiment changes in the generated image, suggesting that our representation makes these linguistic features visible to both humans and machines.
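Based on the abstract alone, the construction could look roughly like the sketch below. The ColorMapper architecture, the boundary-row formula, and all names and hyperparameters (hidden_dim, max_words, dissim_scale) are assumptions introduced here for illustration, not the paper's implementation.

```python
# Minimal sketch of a SemImage-style pipeline, assuming details not given in the abstract.
import torch
import torch.nn as nn
import torch.nn.functional as F


class ColorMapper(nn.Module):
    """Maps a word embedding to a 4-channel pixel (H_cos, H_sin, S, V)."""

    def __init__(self, embed_dim: int, hidden_dim: int = 128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(embed_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, 4),  # raw outputs for H_cos, H_sin, S, V
        )

    def forward(self, word_embeddings: torch.Tensor) -> torch.Tensor:
        out = self.net(word_embeddings)
        hue = F.normalize(out[..., :2], dim=-1)   # project Hue onto the unit circle
        sv = torch.sigmoid(out[..., 2:])          # Saturation and Value in [0, 1]
        return torch.cat([hue, sv], dim=-1)


def build_semimage(sentences, color_mapper, max_words: int, dissim_scale: float = 1.0):
    """Stack one pixel row per sentence; insert a boundary row between consecutive
    sentences whose mean embeddings are semantically dissimilar.

    `sentences` is a list of (num_words_i, embed_dim) tensors.
    Returns a (4, num_rows, max_words) image tensor.
    """
    rows, sentence_means = [], []
    for emb in sentences:
        emb = emb[:max_words]                                 # truncate long sentences
        pixels = color_mapper(emb)                            # (num_words, 4)
        row = F.pad(pixels.t(), (0, max_words - pixels.size(0)))  # (4, max_words)
        if sentence_means:
            # Boundary row: brighter when consecutive sentences diverge semantically.
            sim = F.cosine_similarity(sentence_means[-1], emb.mean(0), dim=0)
            boundary_val = torch.clamp(dissim_scale * (1.0 - sim), 0.0, 1.0)
            boundary = torch.zeros(4, max_words)
            boundary[3] = boundary_val                        # encode dissimilarity in V
            rows.append(boundary)
        rows.append(row)
        sentence_means.append(emb.mean(0))
    return torch.stack(rows, dim=1)                           # (4, num_rows, max_words)


# Training (illustrative): auxiliary heads would supervise the Hue channels with topic
# labels and the Saturation channel with sentiment labels, added to the main
# document-classification loss over the CNN's output.
```

The resulting 4-channel tensor could then be fed to a standard 2D CNN such as a ResNet, provided its first convolution is adapted to accept four input channels; how the paper handles this adaptation is not specified in the abstract.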
Related papers
- UniModel: A Visual-Only Framework for Unified Multimodal Understanding and Generation [51.31795451147935]
We present a unified generative model that supports visual understanding and visual generation within a single pixel-to-pixel diffusion framework. Our goal is to achieve unification along three axes: the model, the tasks, and the representations. Experiments on text-to-image synthesis and image-to-text understanding demonstrate strong cross-modal alignment.
arXiv Detail & Related papers (2025-11-21T03:02:10Z) - Language-Guided Visual Perception Disentanglement for Image Quality Assessment and Conditional Image Generation [48.642826318384294]
Contrastive vision-language models, such as CLIP, have demonstrated excellent zero-shot capability across semantic recognition tasks. This paper presents a new multimodal disentangled representation learning framework, which leverages disentangled text to guide image disentanglement.
arXiv Detail & Related papers (2025-03-04T02:36:48Z) - InvSeg: Test-Time Prompt Inversion for Semantic Segmentation [33.60580908728705]
InvSeg is a test-time prompt inversion method that tackles open-vocabulary semantic segmentation. We introduce Contrastive Soft Clustering (CSC) to align derived masks with the image's structure information. InvSeg learns context-rich text prompts in embedding space and achieves accurate semantic alignment across modalities.
arXiv Detail & Related papers (2024-10-15T10:20:31Z) - UniGS: Unified Representation for Image Generation and Segmentation [105.08152635402858]
We use a colormap to represent entity-level masks, addressing the challenge of varying entity numbers.
Two novel modules, including the location-aware color palette and progressive dichotomy module, are proposed to support our mask representation.
arXiv Detail & Related papers (2023-12-04T15:59:27Z) - FACTUAL: A Benchmark for Faithful and Consistent Textual Scene Graph
Parsing [66.70054075041487]
Existing parsers that convert image captions into scene graphs often suffer from two types of errors.
First, the generated scene graphs fail to capture the true semantics of the captions or the corresponding images, resulting in a lack of faithfulness.
Second, the generated scene graphs have high inconsistency, with the same semantics represented by different annotations.
arXiv Detail & Related papers (2023-05-27T15:38:31Z) - ViewCo: Discovering Text-Supervised Segmentation Masks via Multi-View
Semantic Consistency [126.88107868670767]
We propose multi-View Consistent learning (ViewCo) for text-supervised semantic segmentation.
We first propose text-to-views consistency modeling to learn correspondence for multiple views of the same input image.
We also propose cross-view segmentation consistency modeling to address the ambiguity issue of text supervision.
arXiv Detail & Related papers (2023-01-31T01:57:52Z) - Target-oriented Sentiment Classification with Sequential Cross-modal
Semantic Graph [27.77392307623526]
Multi-modal aspect-based sentiment classification (MABSC) is the task of classifying the sentiment of a target entity mentioned in a sentence and an image.
Previous methods failed to account for the fine-grained semantic association between the image and the text.
We propose a new approach called SeqCSG, which enhances the encoder-decoder sentiment classification framework using sequential cross-modal semantic graphs.
arXiv Detail & Related papers (2022-08-19T16:04:29Z) - Semantic Segmentation with Generative Models: Semi-Supervised Learning
and Strong Out-of-Domain Generalization [112.68171734288237]
We propose a novel framework for discriminative pixel-level tasks using a generative model of both images and labels.
We learn a generative adversarial network that captures the joint image-label distribution and is trained efficiently using a large set of unlabeled images.
We demonstrate strong in-domain performance compared to several baselines, and are the first to showcase extreme out-of-domain generalization.
arXiv Detail & Related papers (2021-04-12T21:41:25Z) - Cross-domain Correspondence Learning for Exemplar-based Image
Translation [59.35767271091425]
We present a framework for exemplar-based image translation, which synthesizes a photo-realistic image from the input in a distinct domain.
The output has the style (e.g., color, texture) in consistency with the semantically corresponding objects in the exemplar.
We show that our method significantly outperforms state-of-the-art methods in terms of image quality.
arXiv Detail & Related papers (2020-04-12T09:10:57Z) - Text-Guided Neural Image Inpainting [20.551488941041256]
The inpainting task requires filling a corrupted image with content coherent with its context.
The goal of this paper is to fill the semantic information in corrupted images according to the provided descriptive text.
We propose a novel inpainting model named Text-Guided Dual Attention Inpainting Network (TDANet).
arXiv Detail & Related papers (2020-04-07T09:04:43Z)