The Impact of Visual Information in Chinese Characters: Evaluating Large Models' Ability to Recognize and Utilize Radicals
- URL: http://arxiv.org/abs/2410.09013v2
- Date: Thu, 17 Oct 2024 17:30:52 GMT
- Title: The Impact of Visual Information in Chinese Characters: Evaluating Large Models' Ability to Recognize and Utilize Radicals
- Authors: Xiaofeng Wu, Karl Stratos, Wei Xu
- Abstract summary: We evaluate Large Language Models' and Vision-Language Models' understanding of visual elements in Chinese characters.
Our results reveal that models surprisingly exhibit some, but still limited, knowledge of the visual information.
We observe consistent improvement in Part-Of-Speech tagging when providing additional information about radicals.
- Score: 17.24821720084663
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The glyphic writing system of Chinese incorporates information-rich visual features in each character, such as radicals that provide hints about meaning or pronunciation. However, there has been no investigation into whether contemporary Large Language Models (LLMs) and Vision-Language Models (VLMs) can harness these sub-character features in Chinese through prompting. In this study, we establish a benchmark to evaluate LLMs' and VLMs' understanding of visual elements in Chinese characters, including radicals, composition structures, strokes, and stroke counts. Our results reveal that models surprisingly exhibit some, but still limited, knowledge of the visual information, regardless of whether images of characters are provided. To incite models' ability to use radicals, we further experiment with incorporating radicals into the prompts for Chinese language processing (CLP) tasks. We observe consistent improvement in Part-Of-Speech tagging when providing additional information about radicals, suggesting the potential to enhance CLP by integrating sub-character information.
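To make the prompting idea concrete, the sketch below shows one way per-character radical hints could be appended to a Part-Of-Speech tagging prompt. The radical lookup table, prompt wording, and function names are illustrative assumptions for this page, not the benchmark prompts or data used in the paper.

```python
# Minimal sketch: augmenting a Chinese POS-tagging prompt with radical hints.
# The radical map and prompt template below are illustrative placeholders,
# not the paper's actual benchmark prompts or data.

RADICALS = {
    "河": "氵",  # water radical, hints at a water-related meaning
    "流": "氵",
    "树": "木",  # wood radical, hints at trees/plants
}

def radical_hint(token: str) -> str:
    """Return a short textual hint listing the radical of each character, if known."""
    hints = []
    for ch in token:
        rad = RADICALS.get(ch)
        if rad:
            hints.append(f"{ch} has radical {rad}")
    return "; ".join(hints)

def build_pos_prompt(tokens: list[str]) -> str:
    """Build a POS-tagging prompt that appends radical information per token."""
    lines = ["Tag each Chinese token with its part of speech."]
    for tok in tokens:
        hint = radical_hint(tok)
        suffix = f" (radical info: {hint})" if hint else ""
        lines.append(f"Token: {tok}{suffix}")
    lines.append("Answer with one POS tag per token.")
    return "\n".join(lines)

if __name__ == "__main__":
    # Example: the radical hints for 河流 mark both characters as water-related.
    print(build_pos_prompt(["河流", "很", "长"]))
```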
Related papers
- Instruction Tuning-free Visual Token Complement for Multimodal LLMs [51.138806401996696]
Multimodal large language models (MLLMs) promise an elegant bridge between vision and language.
We propose a Visual Token Complement framework (VTC) that helps MLLMs regain the missing visual features.
Our VTC integrates text-to-image generation as a guide to identifying the text-irrelevant features, and a visual selector is then developed to generate complementary visual tokens.
arXiv Detail & Related papers (2024-08-09T12:13:01Z) - CVLUE: A New Benchmark Dataset for Chinese Vision-Language Understanding Evaluation [49.41531871253317]
We present a new Chinese Vision-Language Understanding Evaluation benchmark dataset.
The selection of object categories and images is entirely driven by Chinese native speakers.
We find that fine-tuning on Chinese culture-related VL datasets effectively enhances VLMs' understanding of Chinese culture.
arXiv Detail & Related papers (2024-07-01T08:35:37Z) - Integrating Language-Derived Appearance Elements with Visual Cues in Pedestrian Detection [51.66174565170112]
We introduce a novel approach to utilize the strengths of large language models in understanding contextual appearance variations.
We propose to formulate language-derived appearance elements and incorporate them with visual cues in pedestrian detection.
arXiv Detail & Related papers (2023-11-02T06:38:19Z) - VLIS: Unimodal Language Models Guide Multimodal Language Generation [23.094728230459125]
We introduce Visual-Language models as Importance Sampling weights (VLIS).
It combines the visual conditioning capability of vision-language models with the language understanding of unimodal text-only language models without further training.
VLIS improves vision-language models on diverse tasks, including commonsense understanding and complex text generation.
arXiv Detail & Related papers (2023-10-15T07:58:52Z) - Unified Language-Vision Pretraining in LLM with Dynamic Discrete Visual Tokenization [52.935150075484074]
We introduce a well-designed visual tokenizer to translate the non-linguistic image into a sequence of discrete tokens like a foreign language.
The resulting visual tokens encompass high-level semantics worthy of a word and also support dynamic sequence length varying from the image.
This unification empowers the proposed model, LaVIT, to serve as a generalist interface that understands and generates multi-modal content simultaneously.
arXiv Detail & Related papers (2023-09-09T03:01:38Z) - Dynamic Multi-View Fusion Mechanism For Chinese Relation Extraction [12.818297160055584]
We propose a mixture-of-view-experts framework (MoVE) to dynamically learn multi-view features for Chinese relation extraction.
With both the internal and external knowledge of Chinese characters, our framework can better capture the semantic information of Chinese characters.
arXiv Detail & Related papers (2023-03-09T07:35:31Z) - Language identification as improvement for lip-based biometric visual systems [13.205817167773443]
We present a preliminary study in which we use linguistic information as a soft biometric trait to enhance the performance of a visual (auditory-free) identification system based on lip movement.
We report a significant improvement in the identification performance of the proposed visual system as a result of the integration of these data.
arXiv Detail & Related papers (2023-02-27T15:44:24Z) - Stroke-Based Autoencoders: Self-Supervised Learners for Efficient Zero-Shot Chinese Character Recognition [4.64065792373245]
We develop a stroke-based autoencoder (SAE) to model the sophisticated morphology of Chinese characters.
Our SAE architecture outperforms other existing methods in zero-shot recognition.
arXiv Detail & Related papers (2022-07-17T14:39:10Z) - Zero-shot Cross-Linguistic Learning of Event Semantics [27.997873309702225]
We look at captions of images across Arabic, Chinese, Farsi, German, Russian, and Turkish.
We show that lexical aspects can be predicted for a given language despite not having observed any annotated data for this language at all.
arXiv Detail & Related papers (2022-07-05T23:18:36Z) - Visual Clues: Bridging Vision and Language Foundations for Image Paragraph Captioning [78.07495777674747]
We argue that by using visual clues to bridge large pretrained vision foundation models and language models, we can generate image paragraph captions without any extra cross-modal training.
Thanks to the strong zero-shot capability of foundation models, we start by constructing a rich semantic representation of the image.
We use a large language model to produce a series of comprehensive descriptions of the visual content; the vision model then verifies these descriptions to select the candidate that aligns best with the image.
arXiv Detail & Related papers (2022-06-03T22:33:09Z) - From Two to One: A New Scene Text Recognizer with Visual Language Modeling Network [70.47504933083218]
We propose a Visual Language Modeling Network (VisionLAN), which views the visual and linguistic information as a union.
VisionLAN significantly improves the speed by 39% and adaptively considers the linguistic information to enhance the visual features for accurate recognition.
arXiv Detail & Related papers (2021-08-22T07:56:24Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the accuracy of the listed information and is not responsible for any consequences arising from its use.