PerSHOP -- A Persian dataset for shopping dialogue systems modeling
- URL: http://arxiv.org/abs/2401.00811v1
- Date: Mon, 1 Jan 2024 16:42:56 GMT
- Title: PerSHOP -- A Persian dataset for shopping dialogue systems modeling
- Authors: Keyvan Mahmoudi, Heshaam Faili
- Abstract summary: We developed a dataset of dialogues in the Persian language through crowd-sourcing.
This dataset contains nearly 22k utterances in 15 different domains and 1061 dialogues.
We proposed some baseline models for natural language understanding tasks.
- Score: 2.3025186469300434
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Nowadays, dialogue systems are used in many fields of industry and research.
There are successful instances of these systems, such as Apple Siri, Google
Assistant, and IBM Watson. Task-oriented dialogue systems are a category of
these that are used for specific tasks, such as booking plane tickets or
making restaurant reservations. Shopping is one of the most popular
application areas for these systems: the bot replaces the human salesperson
and interacts with customers through conversation. Annotated data is needed
to train the models behind these systems. In this paper, we developed a
dataset of dialogues in the Persian language through crowd-sourcing and
annotated these dialogues for model training. The dataset contains nearly 22k
utterances in 15 different domains and 1,061 dialogues. It is the largest
Persian dataset in this field, and it is provided freely so that future
researchers can use it. We also proposed baseline models for natural
language understanding (NLU) tasks. These models perform two NLU tasks:
intent classification and entity extraction. The F1 score obtained for
intent classification is around 91% and for entity extraction around 93%,
which can serve as a baseline for future research.
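To make the two NLU tasks concrete, here is a minimal, rule-based sketch of intent classification and entity (slot) extraction on shopping utterances. The intents, keywords, and example sentences are hypothetical illustrations, not drawn from PerSHOP; the paper's actual baselines are learned models rather than rules.

```python
import re

# Hypothetical intent lexicon: keyword -> intent label.
# (Labels are illustrative; PerSHOP defines its own intent set.)
INTENT_KEYWORDS = {
    "buy": "add_to_cart",
    "add": "add_to_cart",
    "price": "ask_price",
    "cost": "ask_price",
}

def classify_intent(utterance: str) -> str:
    """Return the intent of the first matching keyword, else 'unknown'."""
    for token in utterance.lower().split():
        if token in INTENT_KEYWORDS:
            return INTENT_KEYWORDS[token]
    return "unknown"

def extract_entities(utterance: str) -> dict:
    """Extract a quantity/item slot pair with a simple regex (illustrative only)."""
    m = re.search(r"\b(\d+)\s+(\w+)", utterance)
    return {"quantity": m.group(1), "item": m.group(2)} if m else {}

print(classify_intent("I want to buy 2 apples"))   # -> add_to_cart
print(extract_entities("I want to buy 2 apples"))  # -> {'quantity': '2', 'item': 'apples'}
```

In practice, both tasks are trained jointly or separately on the annotated utterances, and performance is reported as the F1 score over intent labels and entity spans, as in the paper's 91%/93% baselines.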
Related papers
- ToolDial: Multi-turn Dialogue Generation Method for Tool-Augmented Language Models [1.82618237315022]
We release ToolDial, a dataset comprising 11,111 multi-turn dialogues, with an average of 8.95 turns per dialogue, based on APIs from RapidAPI.
We simulate dialogues where the system requests necessary information from the user based on API documentation and seeks additional APIs if the user fails to provide the required information.
We evaluate a suite of language models on their ability to predict correct actions and extract input parameter values for API calls from the dialogue history.
arXiv Detail & Related papers (2025-03-01T17:23:51Z) - SpokenWOZ: A Large-Scale Speech-Text Benchmark for Spoken Task-Oriented Dialogue Agents [72.42049370297849]
SpokenWOZ is a large-scale speech-text dataset for spoken TOD.
Cross-turn slot and reasoning slot detection are new challenges for SpokenWOZ.
arXiv Detail & Related papers (2023-05-22T13:47:51Z) - DialogZoo: Large-Scale Dialog-Oriented Task Learning [52.18193690394549]
We aim to build a unified foundation model which can solve massive diverse dialogue tasks.
To achieve this goal, we first collect a large-scale well-labeled dialogue dataset from 73 publicly available datasets.
arXiv Detail & Related papers (2022-05-25T11:17:16Z) - KETOD: Knowledge-Enriched Task-Oriented Dialogue [77.59814785157877]
Existing studies in dialogue system research mostly treat task-oriented dialogue and chit-chat as separate domains.
We investigate how task-oriented dialogue and knowledge-grounded chit-chat can be effectively integrated into a single model.
arXiv Detail & Related papers (2022-05-11T16:01:03Z) - Investigating Effect of Dialogue History in Multilingual Task Oriented Dialogue Systems [2.695466667982714]
As of Dec 2021, Alexa, one of the most popular smart speakers worldwide, supports 9 different languages.
Training a virtual assistant in other languages is often more difficult, especially for those low-resource languages.
We devise an efficient and effective training solution for multilingual task-orientated dialogue systems.
arXiv Detail & Related papers (2021-12-23T02:27:10Z) - Few-Shot Bot: Prompt-Based Learning for Dialogue Systems [58.27337673451943]
Learning to converse using only a few examples is a great challenge in conversational AI.
The current best conversational models are either good chit-chatters (e.g., BlenderBot) or goal-oriented systems (e.g., MinTL).
We propose prompt-based few-shot learning which does not require gradient-based fine-tuning but instead uses a few examples as the only source of learning.
arXiv Detail & Related papers (2021-10-15T14:36:45Z) - Recent Advances in Deep Learning-based Dialogue Systems [12.798560005546262]
We mainly focus on the deep learning-based dialogue systems.
This survey is the most comprehensive and up-to-date one at present in the area of dialogue systems and dialogue-related tasks.
arXiv Detail & Related papers (2021-05-10T14:07:49Z) - TicketTalk: Toward human-level performance with end-to-end, transaction-based dialog systems [10.659519248703273]
We present a data-driven, end-to-end approach to transaction-based dialog systems.
We show that the system performs at near-human levels in terms of verbal response quality and factual grounding accuracy.
We introduce TicketTalk, a movie ticketing dialog dataset with 23,789 annotated conversations.
arXiv Detail & Related papers (2020-12-23T02:43:37Z) - Language Models as Few-Shot Learner for Task-Oriented Dialogue Systems [74.8759568242933]
Task-oriented dialogue systems use four connected modules, namely Natural Language Understanding (NLU), Dialogue State Tracking (DST), Dialogue Policy (DP), and Natural Language Generation (NLG).
A research challenge is to learn each module with the least amount of samples given the high cost related to the data collection.
We evaluate the priming few-shot ability of language models in the NLU, DP and NLG tasks.
arXiv Detail & Related papers (2020-08-14T08:23:21Z) - A Large-Scale Chinese Short-Text Conversation Dataset [77.55813366932313]
We present a large-scale cleaned Chinese conversation dataset, LCCC, which contains a base version (6.8 million dialogues) and a large version (12.0 million dialogues).
The quality of our dataset is ensured by a rigorous data cleaning pipeline, which is built based on a set of rules.
We also release pre-training dialogue models which are trained on LCCC-base and LCCC-large respectively.
arXiv Detail & Related papers (2020-08-10T08:12:49Z) - The Gutenberg Dialogue Dataset [1.90365714903665]
Current publicly available open-domain dialogue datasets offer a trade-off between quality and size.
We build a high-quality dataset of 14.8M utterances in English, and smaller datasets in German, Dutch, Spanish, Portuguese, Italian, and Hungarian.
arXiv Detail & Related papers (2020-04-27T12:52:20Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.