Using Full-Text Content to Characterize and Identify Best Seller Books
- URL: http://arxiv.org/abs/2210.02334v2
- Date: Thu, 11 May 2023 12:37:00 GMT
- Title: Using Full-Text Content to Characterize and Identify Best Seller Books
- Authors: Giovana D. da Silva, Filipi N. Silva, Henrique F. de Arruda, Bárbara C. e Souza, Luciano da F. Costa and Diego R. Amancio
- Abstract summary: We consider the task of predicting whether a book will become a best seller from the standpoint of literary works.
Unlike previous approaches, we focused on the full content of books and considered both visualization and classification tasks.
Our results show that it is unfeasible to predict the success of books with high accuracy using only the full content of the texts.
- Score: 0.6442904501384817
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Artistic pieces can be studied from several perspectives, one example being
their reception among readers over time. In the present work, we approach this
interesting topic from the standpoint of literary works, particularly assessing
the task of predicting whether a book will become a best seller. Unlike
previous approaches, we focused on the full content of books and
considered visualization and classification tasks. We employed visualization
for the preliminary exploration of the data structure and properties, involving
SemAxis and linear discriminant analyses. Then, to obtain quantitative and more
objective results, we employed various classifiers. Such approaches were used
along with a dataset containing (i) books published from 1895 to 1924 and
consecrated as best sellers by the Publishers Weekly Bestseller Lists and (ii)
literary works published in the same period but not being mentioned in that
list. Our comparison of methods revealed that the best-achieved result -
combining a bag-of-words representation with a logistic regression classifier -
led to an average accuracy of 0.75 both for the leave-one-out and 10-fold
cross-validations. Such an outcome suggests that it is unfeasible to predict
the success of books with high accuracy using only the full content of the
texts. Nevertheless, our findings provide insights into the factors leading to
the relative success of a literary work.
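As a rough sketch of the best-performing setup reported above (bag-of-words features fed to a logistic regression classifier, evaluated with 10-fold and leave-one-out cross-validation), the snippet below shows one way such a pipeline could be reproduced with scikit-learn. The toy corpus and labels are placeholders, not the Publishers Weekly data used in the paper, and the SemAxis/LDA visualization step is omitted.

```python
# Minimal sketch (an assumed scikit-learn implementation) of the reported
# pipeline: bag-of-words features + logistic regression, scored with
# 10-fold and leave-one-out cross-validation.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import LeaveOneOut, StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline

# Toy stand-in corpus: the paper uses the full text of books published
# between 1895 and 1924; these dummy strings only keep the example runnable.
texts = [f"sample full text of book number {i}" for i in range(20)]
labels = [i % 2 for i in range(20)]  # 1 = best seller, 0 = not (dummy labels)

pipeline = make_pipeline(
    CountVectorizer(),                  # bag-of-words representation
    LogisticRegression(max_iter=1000),  # linear classifier
)

kfold_acc = cross_val_score(pipeline, texts, labels,
                            cv=StratifiedKFold(n_splits=10), scoring="accuracy")
loo_acc = cross_val_score(pipeline, texts, labels,
                          cv=LeaveOneOut(), scoring="accuracy")
print(f"10-fold mean accuracy:       {kfold_acc.mean():.2f}")
print(f"leave-one-out mean accuracy: {loo_acc.mean():.2f}")
```

On the real dataset, the paper reports an average accuracy of about 0.75 for this configuration under both validation schemes.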
Related papers
- A Bayesian Approach to Harnessing the Power of LLMs in Authorship Attribution [57.309390098903]
Authorship attribution aims to identify the origin or author of a document.
Large Language Models (LLMs) with their deep reasoning capabilities and ability to maintain long-range textual associations offer a promising alternative.
Our results on the IMDb and blog datasets show an impressive 85% accuracy in one-shot authorship classification across ten authors.
arXiv Detail & Related papers (2024-10-29T04:14:23Z)
- BookWorm: A Dataset for Character Description and Analysis [59.186325346763184]
We define two tasks: character description, which generates a brief factual profile, and character analysis, which offers an in-depth interpretation.
We introduce the BookWorm dataset, pairing books from the Gutenberg Project with human-written descriptions and analyses.
Our findings show that retrieval-based approaches outperform hierarchical ones in both tasks.
arXiv Detail & Related papers (2024-10-14T10:55:58Z)
- LFED: A Literary Fiction Evaluation Dataset for Large Language Models [58.85989777743013]
We collect 95 literary fictions that are either originally written in Chinese or translated into Chinese, covering a wide range of topics across several centuries.
We define a question taxonomy with 8 question categories to guide the creation of 1,304 questions.
We conduct an in-depth analysis to ascertain how specific attributes of literary fictions (e.g., novel types, character numbers, the year of publication) impact LLM performance in evaluations.
arXiv Detail & Related papers (2024-05-16T15:02:24Z)
- STONYBOOK: A System and Resource for Large-Scale Analysis of Novels [11.304581370821756]
Books have historically been the primary mechanism through which narratives are transmitted.
We have developed a collection of resources for the large-scale analysis of novels.
arXiv Detail & Related papers (2023-11-06T23:46:40Z)
- PART: Pre-trained Authorship Representation Transformer [64.78260098263489]
Authors writing documents imprint identifying information within their texts: vocabulary, registry, punctuation, misspellings, or even emoji usage.
Previous works use hand-crafted features or classification tasks to train their authorship models, leading to poor performance on out-of-domain authors.
We propose a contrastively trained model fit to learn authorship embeddings instead of semantics.
arXiv Detail & Related papers (2022-09-30T11:08:39Z)
- Whodunit? Learning to Contrast for Authorship Attribution [22.37948005237967]
Authorship attribution is the task of identifying the author of a given text.
We propose to fine-tune pre-trained language representations using a combination of contrastive learning and supervised learning.
We show that Contra-X advances the state-of-the-art on multiple human and machine authorship attribution benchmarks.
arXiv Detail & Related papers (2022-09-23T23:45:08Z)
- Weakly-Supervised Aspect-Based Sentiment Analysis via Joint Aspect-Sentiment Topic Embedding [71.2260967797055]
We propose a weakly-supervised approach for aspect-based sentiment analysis.
We learn <sentiment, aspect> joint topic embeddings in the word embedding space.
We then use neural models to generalize the word-level discriminative information.
arXiv Detail & Related papers (2020-10-13T21:33:24Z)
- A Survey on Text Classification: From Shallow to Deep Learning [83.47804123133719]
The last decade has seen a surge of research in this area due to the unprecedented success of deep learning.
This paper fills the gap by reviewing the state-of-the-art approaches from 1961 to 2021.
We create a taxonomy for text classification according to the text involved and the models used for feature extraction and classification.
arXiv Detail & Related papers (2020-08-02T00:09:03Z)
- Book Success Prediction with Pretrained Sentence Embeddings and Readability Scores [8.37609145576126]
We propose a model that leverages pretrained sentence embeddings along with various readability scores for book success prediction.
Our proposed model outperforms strong baselines on this task by as much as 6.4 F1-score points; a rough sketch of the general idea appears after this list.
arXiv Detail & Related papers (2020-07-21T20:11:18Z)
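Since this last related paper is closest in spirit to the present work, here is a hedged sketch of its general idea: concatenating pretrained sentence embeddings with readability scores and training a simple classifier on top. The sentence-transformers model name, the textstat readability metrics, and the classifier choice are illustrative assumptions, not the configuration used in that paper.

```python
# Illustrative sketch (not the cited paper's exact setup): combine pretrained
# sentence embeddings with readability scores for book success prediction.
# Assumes the sentence-transformers and textstat packages are installed.
import numpy as np
import textstat
from sentence_transformers import SentenceTransformer
from sklearn.linear_model import LogisticRegression

encoder = SentenceTransformer("all-MiniLM-L6-v2")  # any sentence encoder would do

def featurize(texts):
    """Concatenate sentence embeddings with two readability scores per text."""
    embeddings = encoder.encode(texts)  # shape: (n_texts, embedding_dim)
    readability = np.array([
        [textstat.flesch_reading_ease(t), textstat.gunning_fog(t)]
        for t in texts
    ])
    return np.hstack([embeddings, readability])

# Dummy texts and labels standing in for book contents and success labels.
train_texts = [f"an example passage taken from book number {i}" for i in range(10)]
train_labels = [i % 2 for i in range(10)]

clf = LogisticRegression(max_iter=1000).fit(featurize(train_texts), train_labels)
print(clf.predict(featurize(["an unseen passage from a new book"])))
```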
This list is automatically generated from the titles and abstracts of the papers on this site.