Taxonomic survey of Hindi Language NLP systems
- URL: http://arxiv.org/abs/2102.00214v1
- Date: Sat, 30 Jan 2021 11:53:56 GMT
- Title: Taxonomic survey of Hindi Language NLP systems
- Authors: Nikita P. Desai, Prof.(Dr.) Vipul K. Dabhi
- Abstract summary: Natural language processing (NLP) is the automatic handling of natural human language by machines.
This survey gives a report of the resources and applications available for Hindi language NLP.
- Score: 0.0
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Natural language processing (NLP) is the automatic handling
of natural human language by machines. There is a large spectrum of possible
applications of NLP that help automate tasks such as translating text from
one language to another, retrieving and summarizing data from very large
repositories, filtering spam email, identifying fake news in digital media,
finding the sentiment and feedback of people, finding political opinions and
views of people on various government policies, and providing effective medical
assistance based on patients' past history records. Hindi is the official language of
India, with nearly 691 million users in India and 366 million in the rest of the world.
At present, a number of government and private sector projects and researchers
in India and abroad are working towards developing NLP applications and
resources for Indian languages. This survey gives a report of the resources and
applications available for Hindi language NLP.
Related papers
- Decoding the Diversity: A Review of the Indic AI Research Landscape [0.7864304771129751]
Indic languages are those spoken in the Indian subcontinent, including India, Pakistan, Bangladesh, Sri Lanka, Nepal, and Bhutan.
This review paper provides a comprehensive overview of large language model (LLM) research directions within Indic languages.
arXiv Detail & Related papers (2024-06-13T19:55:20Z)
- SPRING-INX: A Multilingual Indian Language Speech Corpus by SPRING Lab, IIT Madras [1.4699314771635081]
Building speech based applications for the Indian population is a difficult problem owing to limited data and the number of languages and accents to accommodate.
We are open sourcing SPRING-INX data which has about 2000 hours of legally sourced and manually transcribed speech data for ASR system building in Assamese, Bengali, Gujarati, Hindi, Kannada, Malayalam, Marathi, Odia, Punjabi and Tamil.
arXiv Detail & Related papers (2023-10-23T07:50:10Z)
- An Overview of Indian Spoken Language Recognition from Machine Learning Perspective [7.27448284043116]
This work is one of the first attempts to present a comprehensive review of the Indian spoken language recognition research field.
An in-depth analysis is presented to emphasize the unique challenges of low resources and mutual influences in developing LID systems in the Indian context.
arXiv Detail & Related papers (2022-11-30T11:03:51Z)
- NusaX: Multilingual Parallel Sentiment Dataset for 10 Indonesian Local Languages [100.59889279607432]
We focus on developing resources for languages in Indonesia.
Most languages in Indonesia are categorized as endangered and some are even extinct.
We develop the first-ever parallel resource for 10 low-resource languages in Indonesia.
arXiv Detail & Related papers (2022-05-31T17:03:50Z)
- How can NLP Help Revitalize Endangered Languages? A Case Study and Roadmap for the Cherokee Language [91.79339725967073]
More than 43% of the languages spoken in the world are endangered.
In this work, we focus on discussing how NLP can help revitalize endangered languages.
We take Cherokee, a severely-endangered Native American language, as a case study.
arXiv Detail & Related papers (2022-04-25T18:25:57Z)
- Harnessing Cross-lingual Features to Improve Cognate Detection for Low-resource Languages [50.82410844837726]
We demonstrate the use of cross-lingual word embeddings for detecting cognates among fourteen Indian languages.
We evaluate our methods to detect cognates on a challenging dataset of twelve Indian languages.
We observe an improvement of up to 18 percentage points in F-score for cognate detection.
arXiv Detail & Related papers (2021-12-16T11:17:58Z)
- Multilingual and code-switching ASR challenges for low resource Indian languages [59.2906853285309]
We focus on building multilingual and code-switching ASR systems through two different subtasks related to a total of seven Indian languages.
We provide a total of 600 hours of transcribed speech data, comprising train and test sets, in these languages.
We also provide a baseline recipe for both tasks, with WERs of 30.73% and 32.45% on the test sets of the multilingual and code-switching subtasks, respectively.
arXiv Detail & Related papers (2021-04-01T03:37:01Z)
- The first large scale collection of diverse Hausa language datasets [0.0]
Hausa is considered a well-studied and documented language among the sub-Saharan African languages.
It is estimated that over 100 million people speak the language.
We provide an expansive collection of curated datasets consisting of both formal and informal forms of the language.
arXiv Detail & Related papers (2021-02-13T19:34:20Z)
- Vyaktitv: A Multimodal Peer-to-Peer Hindi Conversations based Dataset for Personality Assessment [50.15466026089435]
We present Vyaktitv, a novel peer-to-peer Hindi conversation dataset.
It consists of high-quality audio and video recordings of the participants, with Hinglish textual transcriptions for each conversation.
The dataset also contains a rich set of socio-demographic features, such as income and cultural orientation, among several others, for all the participants.
arXiv Detail & Related papers (2020-08-31T17:44:28Z)
- A Multilingual Parallel Corpora Collection Effort for Indian Languages [43.62422999765863]
We present sentence aligned parallel corpora across 10 Indian languages - Hindi, Telugu, Tamil, Malayalam, Gujarati, Urdu, Bengali, Oriya, Marathi, Punjabi, and English.
The corpora are compiled from online sources which have content shared across languages.
arXiv Detail & Related papers (2020-07-15T14:00:18Z)
- SIGMORPHON 2020 Shared Task 0: Typologically Diverse Morphological Inflection [81.85463892070085]
The SIGMORPHON 2020 task on morphological reinflection aims to investigate systems' ability to generalize across typologically distinct languages.
Systems were developed using data from 45 languages and just 5 language families, fine-tuned with data from an additional 45 languages and 10 language families (13 in total), and evaluated on all 90 languages.
arXiv Detail & Related papers (2020-06-20T13:24:14Z)
This list is automatically generated from the titles and abstracts of the papers in this site.