When Crowd Meets Persona: Creating a Large-Scale Open-Domain Persona
Dialogue Corpus
- URL: http://arxiv.org/abs/2304.00350v1
- Date: Sat, 1 Apr 2023 16:10:36 GMT
- Authors: Won Ik Cho, Yoon Kyung Lee, Seoyeon Bae, Jihwan Kim, Sangah Park,
Moosung Kim, Sowon Hahn, Nam Soo Kim
- Abstract summary: Building a natural language dataset requires caution since word semantics is vulnerable to subtle text change or the definition of the annotated concept.
In this study, we tackle these issues when creating a large-scale open-domain persona dialogue corpus.
- Score: 13.051107304650627
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: Building a natural language dataset requires caution since word semantics is
vulnerable to subtle text change or the definition of the annotated concept.
Such a tendency can be seen in generative tasks like question-answering and
dialogue generation and also in tasks that create a categorization-based
corpus, like topic classification or sentiment analysis. Open-domain
conversations involve two or more crowdworkers freely conversing about any
topic, and collecting such data is particularly difficult for two reasons: 1)
the dataset should be "crafted" rather than "obtained" due to privacy
concerns, and 2) paid creation of such dialogues may differ from how
crowdworkers behave in real-world settings. In this study, we tackle these
issues when creating a large-scale open-domain persona dialogue corpus, where
persona implies that the conversation is performed by several actors with a
fixed persona and user-side workers from an unspecified crowd.
Related papers
- Learning From Free-Text Human Feedback -- Collect New Datasets Or Extend
Existing Ones? [57.16050211534735]
We investigate the types and frequency of free-text human feedback in commonly used dialog datasets.
Our findings provide new insights into the composition of the datasets examined, including error types, user response types, and the relations between them.
arXiv Detail & Related papers (2023-10-24T12:01:11Z)
- Multi-turn Dialogue Comprehension from a Topic-aware Perspective [70.37126956655985]
This paper proposes to model multi-turn dialogues from a topic-aware perspective.
We use a dialogue segmentation algorithm to split a dialogue passage into topic-concentrated fragments in an unsupervised way.
We also present a novel model, Topic-Aware Dual-Attention Matching (TADAM) Network, which takes topic segments as processing elements.
arXiv Detail & Related papers (2023-09-18T11:03:55Z)
- Grounding in social media: An approach to building a chit-chat dialogue
model [9.247397520986999]
Building open-domain dialogue systems capable of rich human-like conversational ability is one of the fundamental challenges in language generation.
Current work on knowledge-grounded dialogue generation primarily focuses on persona incorporation or searching a fact-based structured knowledge source such as Wikipedia.
Our method takes a broader and simpler approach, which aims to improve the raw conversation ability of the system by mimicking the human response behavior on social media.
arXiv Detail & Related papers (2022-06-12T09:01:57Z)
- SalesBot: Transitioning from Chit-Chat to Task-Oriented Dialogues [22.89699254073016]
Transitioning smoothly from social chatting to task-oriented dialogues is important for triggering business opportunities.
This paper proposes a framework to automatically generate many dialogues without human involvement.
The released data has great potential to guide future research directions and commercial activities.
arXiv Detail & Related papers (2022-04-22T09:31:13Z)
- Detecting Speaker Personas from Conversational Texts [52.4557098875992]
We study a new task, named Speaker Persona Detection (SPD), which aims to detect speaker personas based on the plain conversational text.
We build a dataset for SPD, dubbed Persona Match on Persona-Chat (PMPC).
We evaluate several baseline models and propose utterance-to-profile (U2P) matching networks for this task.
arXiv Detail & Related papers (2021-09-03T06:14:38Z)
- Linguistic Characterization of Divisive Topics Online: Case Studies on
Contentiousness in Abortion, Climate Change, and Gun Control [11.127421264715556]
Divisive topics prompt both contentious and non-contentious conversations.
We focus on conversations from highly divisive topics (abortion, climate change, and gun control).
We operationalize a set of novel linguistic and conversational characteristics and user factors, and incorporate them to build interpretable models.
arXiv Detail & Related papers (2021-08-30T23:55:38Z)
- Dialogue History Matters! Personalized Response Selection in Multi-turn
Retrieval-based Chatbots [62.295373408415365]
We propose a personalized hybrid matching network (PHMN) for context-response matching.
Our contributions are two-fold: 1) our model extracts personalized wording behaviors from user-specific dialogue history as extra matching information.
We evaluate our model on two large datasets with user identification, i.e., the personalized Ubuntu dialogue corpus (P-Ubuntu) and the personalized Weibo dataset (P-Weibo).
arXiv Detail & Related papers (2021-03-17T09:42:11Z)
- Learning to Select Context in a Hierarchical and Global Perspective for
Open-domain Dialogue Generation [15.01710843286394]
We propose a novel model with a hierarchical self-attention mechanism and distant supervision to detect relevant words and utterances at short and long distances.
Our model significantly outperforms other baselines in terms of fluency, coherence, and informativeness.
arXiv Detail & Related papers (2021-02-18T11:56:42Z)
- Multi-View Sequence-to-Sequence Models with Conversational Structure for
Abstractive Dialogue Summarization [72.54873655114844]
Text summarization is one of the most challenging and interesting problems in NLP.
This work proposes a multi-view sequence-to-sequence model by first extracting conversational structures of unstructured daily chats from different views to represent conversations.
Experiments on a large-scale dialogue summarization corpus demonstrated that our methods significantly outperformed previous state-of-the-art models via both automatic evaluations and human judgment.
arXiv Detail & Related papers (2020-10-04T20:12:44Z)
- Detecting and Classifying Malevolent Dialogue Responses: Taxonomy, Data
and Methodology [68.8836704199096]
Corpus-based conversational interfaces are able to generate more diverse and natural responses than template-based or retrieval-based agents.
With the increased generative capacity of corpus-based conversational agents comes the need to classify and filter out malevolent responses.
Previous studies on the topic of recognizing and classifying inappropriate content are mostly focused on a certain category of malevolence.
arXiv Detail & Related papers (2020-08-21T22:43:27Z)