CRAFT: Extracting and Tuning Cultural Instructions from the Wild
- URL: http://arxiv.org/abs/2405.03138v2
- Date: Wed, 10 Jul 2024 02:54:23 GMT
- Title: CRAFT: Extracting and Tuning Cultural Instructions from the Wild
- Authors: Bin Wang, Geyu Lin, Zhengyuan Liu, Chengwei Wei, Nancy F. Chen,
- Abstract summary: This paper introduces a novel pipeline for extracting high-quality, culturally-related instruction tuning datasets from vast unstructured corpora.
We utilize a self-instruction generation pipeline to identify cultural concepts and trigger instruction.
We conduct experiments across three regions: Singapore, the Philippines, and the United States, achieving performance improvement of up to 6%.
- Score: 38.255242754975654
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Large language models (LLMs) have rapidly evolved as the foundation of various natural language processing (NLP) applications. Despite their wide use cases, their understanding of culturally-related concepts and reasoning remains limited. Meantime, there is a significant need to enhance these models' cultural reasoning capabilities, especially concerning underrepresented regions. This paper introduces a novel pipeline for extracting high-quality, culturally-related instruction tuning datasets from vast unstructured corpora. We utilize a self-instruction generation pipeline to identify cultural concepts and trigger instruction. By integrating with a general-purpose instruction tuning dataset, our model demonstrates enhanced capabilities in recognizing and understanding regional cultural nuances, thereby enhancing its reasoning capabilities. We conduct experiments across three regions: Singapore, the Philippines, and the United States, achieving performance improvement of up to 6%. Our research opens new avenues for extracting cultural instruction tuning sets directly from unstructured data, setting a precedent for future innovations in the field.
Related papers
- Crossroads of Continents: Automated Artifact Extraction for Cultural Adaptation with Large Multimodal Models [22.92083941222383]
We introduce DalleStreet, a large-scale dataset generated by DALL-E 3 and validated by humans.
We find disparities in cultural understanding at geographic sub-region levels with both open-source (LLaVA) and closed-source (GPT-4V) models.
Our findings reveal a nuanced picture of the cultural competence of LMMs, highlighting the need to develop culture-aware systems.
arXiv Detail & Related papers (2024-07-02T08:55:41Z)
- Methodology of Adapting Large English Language Models for Specific Cultural Contexts [10.151487049108626]
We propose a rapid adaptation method for large models in specific cultural contexts.
The adapted LLM significantly enhances its capabilities in domain-specific knowledge and adaptability to safety values.
arXiv Detail & Related papers (2024-06-26T09:16:08Z)
- Extrinsic Evaluation of Cultural Competence in Large Language Models [53.626808086522985]
We focus on extrinsic evaluation of cultural competence in two text generation tasks.
We evaluate model outputs when an explicit cue of culture, specifically nationality, is perturbed in the prompts.
We find weak correlations between text similarity of outputs for different countries and the cultural values of these countries.
arXiv Detail & Related papers (2024-06-17T14:03:27Z)
- CulturePark: Boosting Cross-cultural Understanding in Large Language Models [63.452948673344395]
This paper introduces CulturePark, an LLM-powered multi-agent communication framework for cultural data collection.
It generates high-quality cross-cultural dialogues encapsulating human beliefs, norms, and customs.
We evaluate these models across three downstream tasks: content moderation, cultural alignment, and cultural education.
arXiv Detail & Related papers (2024-05-24T01:49:02Z)
- CultureBank: An Online Community-Driven Knowledge Base Towards Culturally Aware Language Technologies [53.2331634010413]
CultureBank is a knowledge base built upon users' self-narratives.
It contains 12K cultural descriptors sourced from TikTok and 11K from Reddit.
We offer recommendations for future culturally aware language technologies.
arXiv Detail & Related papers (2024-04-23T17:16:08Z)
- Massively Multi-Cultural Knowledge Acquisition & LM Benchmarking [48.21982147529661]
This paper introduces a novel approach for massively multicultural knowledge acquisition.
Our method strategically navigates from densely informative Wikipedia documents on cultural topics to an extensive network of linked pages.
Our work marks an important step towards deeper understanding and bridging the gaps of cultural disparities in AI.
arXiv Detail & Related papers (2024-02-14T18:16:54Z)
- Enhancing Content Moderation with Culturally-Aware Models [9.890160776193616]
This work introduces a flexible framework that enhances foundation language models with cultural knowledge.
We evaluate this framework in a case study of an online podcast platform with content spanning various regions.
arXiv Detail & Related papers (2023-12-05T00:11:09Z)
- Not All Countries Celebrate Thanksgiving: On the Cultural Dominance in Large Language Models [89.94270049334479]
This paper identifies a cultural dominance issue within large language models (LLMs).
LLMs often provide inappropriate English-culture-related answers that are not relevant to the expected culture when users ask in non-English languages.
arXiv Detail & Related papers (2023-10-19T05:38:23Z)
- EnCBP: A New Benchmark Dataset for Finer-Grained Cultural Background Prediction in English [25.38572483508948]
We augment natural language processing models with cultural background features.
We show that there are noticeable differences in linguistic expressions among five English-speaking countries and across four states in the US.
Our findings support the importance of cultural background modeling to a wide variety of NLP tasks and demonstrate the applicability of EnCBP in culture-related research.
arXiv Detail & Related papers (2022-03-28T04:57:17Z)