Machine Identification of High Impact Research through Text and Image
Analysis
- URL: http://arxiv.org/abs/2005.10321v1
- Date: Wed, 20 May 2020 19:12:24 GMT
- Title: Machine Identification of High Impact Research through Text and Image
Analysis
- Authors: Marko Stamenovic, Jiebo Luo
- Abstract summary: We present a system that automatically separates papers with a high likelihood of gaining citations from those with a low likelihood.
Our system uses both a visual classifier, useful for surmising a document's overall appearance, and a text classifier, for making content-informed decisions.
- Score: 0.4737991126491218
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The volume of academic paper submissions and publications is growing at an
ever increasing rate. While this flood of research promises progress in various
fields, the sheer volume of output inherently increases the amount of noise. We
present a system that automatically separates papers with a high likelihood of
gaining citations from those with a low likelihood, as a means of quickly finding high impact,
high quality research. Our system uses both a visual classifier, useful for
surmising a document's overall appearance, and a text classifier, for making
content-informed decisions. Current work in the field focuses on small datasets
composed of papers from individual conferences. Attempts to use similar
techniques on larger datasets generally consider only excerpts of the
documents, such as the abstract, potentially throwing away valuable data. We
rectify these issues by providing a dataset composed of PDF documents and
citation counts spanning a decade of output within two separate academic
domains: computer science and medicine. This new dataset allows us to expand on
current work in the field by generalizing across time and academic domain.
Moreover, we explore inter-domain prediction models - evaluating a classifier's
performance on a domain it was not trained on - to shed further insight on this
important problem.
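As a rough, hypothetical illustration of the kind of system the abstract describes (a text classifier plus a visual classifier whose predictions are combined, evaluated both within and across domains), the sketch below uses synthetic stand-in data; none of the feature choices, helper names, or models here are taken from the paper.

```python
# Minimal sketch, NOT the authors' implementation: a late-fusion baseline that
# pairs a bag-of-words text classifier with a simple "visual" classifier over
# flattened page thumbnails, then checks in-domain and cross-domain accuracy.
# Real PDF rendering and text extraction are replaced by synthetic data.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(0)

def synthetic_domain(n_docs, seed_word):
    """Stand-in for one domain's corpus: (texts, page_thumbnails, labels)."""
    texts = [f"{seed_word} paper {i} on topic {i % 7}" for i in range(n_docs)]
    thumbs = rng.random((n_docs, 32 * 32))       # flattened first-page images
    labels = rng.integers(0, 2, size=n_docs)     # 1 = high citations, 0 = low
    return texts, thumbs, labels

def fit_fused(texts, thumbs, labels):
    """Train one classifier per modality."""
    vec = TfidfVectorizer(max_features=2000)
    text_clf = LogisticRegression(max_iter=1000).fit(vec.fit_transform(texts), labels)
    vis_clf = LogisticRegression(max_iter=1000).fit(thumbs, labels)
    return vec, text_clf, vis_clf

def predict_fused(model, texts, thumbs):
    """Late fusion: average the two classifiers' class-1 probabilities."""
    vec, text_clf, vis_clf = model
    p_text = text_clf.predict_proba(vec.transform(texts))[:, 1]
    p_vis = vis_clf.predict_proba(thumbs)[:, 1]
    return ((p_text + p_vis) / 2 > 0.5).astype(int)

# Train on one domain ("computer science"), evaluate in-domain and
# cross-domain ("medicine"), mirroring the inter-domain experiments above.
cs_texts, cs_thumbs, cs_labels = synthetic_domain(400, "cs")
med_texts, med_thumbs, med_labels = synthetic_domain(400, "medicine")
model = fit_fused(cs_texts, cs_thumbs, cs_labels)
print("in-domain accuracy:", accuracy_score(cs_labels, predict_fused(model, cs_texts, cs_thumbs)))
print("cross-domain accuracy:", accuracy_score(med_labels, predict_fused(model, med_texts, med_thumbs)))
```

A real pipeline would render actual PDF pages and extract full document text, score a held-out test split rather than the training set, and likely replace the thumbnail classifier with a convolutional model; this snippet only makes the text-plus-vision, cross-domain structure concrete.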
Related papers
- Web-Scale Visual Entity Recognition: An LLM-Driven Data Approach [56.55633052479446]
Web-scale visual entity recognition presents significant challenges due to the lack of clean, large-scale training data.
We propose a novel methodology to curate such a dataset, leveraging a multimodal large language model (LLM) for label verification, metadata generation, and rationale explanation.
Experiments demonstrate that models trained on this automatically curated data achieve state-of-the-art performance on web-scale visual entity recognition tasks.
arXiv Detail & Related papers (2024-10-31T06:55:24Z)
- ACLSum: A New Dataset for Aspect-based Summarization of Scientific Publications [10.529898520273063]
ACLSum is a novel summarization dataset carefully crafted and evaluated by domain experts.
In contrast to previous datasets, ACLSum facilitates multi-aspect summarization of scientific papers.
arXiv Detail & Related papers (2024-03-08T13:32:01Z)
- Query of CC: Unearthing Large Scale Domain-Specific Knowledge from Public Corpora [104.16648246740543]
We propose an efficient data collection method based on large language models.
The method bootstraps seed information through a large language model and retrieves related data from public corpora.
It not only collects knowledge-related data for specific domains but also unearths data with potential reasoning procedures.
arXiv Detail & Related papers (2024-01-26T03:38:23Z)
- Interactive Distillation of Large Single-Topic Corpora of Scientific Papers [1.2954493726326113]
A more robust but time-consuming approach is to build the dataset constructively, with a subject matter expert handpicking documents.
Here we showcase a new tool, based on machine learning, for constructively generating targeted datasets of scientific literature.
arXiv Detail & Related papers (2023-09-19T17:18:36Z)
- Modeling Entities as Semantic Points for Visual Information Extraction in the Wild [55.91783742370978]
We propose an alternative approach to precisely and robustly extract key information from document images.
We explicitly model entities as semantic points, i.e., center points of entities are enriched with semantic information describing the attributes and relationships of different entities.
The proposed method can achieve significantly enhanced performance on entity labeling and linking, compared with previous state-of-the-art models.
arXiv Detail & Related papers (2023-03-23T08:21:16Z)
- Analyzing the State of Computer Science Research with the DBLP Discovery Dataset [0.0]
We conduct a scientometric analysis to uncover the implicit patterns hidden in CS metadata.
We introduce the CS-Insights system, an interactive web application to analyze CS publications with various dashboards, filters, and visualizations.
Both D3 (the DBLP Discovery Dataset) and CS-Insights are open-access, and CS-Insights can be easily adapted to other datasets in the future.
arXiv Detail & Related papers (2022-12-01T16:27:42Z)
- Open Domain Question Answering over Virtual Documents: A Unified Approach for Data and Text [62.489652395307914]
We use the data-to-text method as a means for encoding structured knowledge for knowledge-intensive applications, i.e., open-domain question answering (QA).
Specifically, we propose a verbalizer-retriever-reader framework for open-domain QA over data and text where verbalized tables from Wikipedia and triples from Wikidata are used as augmented knowledge sources.
We show that our Unified Data and Text QA, UDT-QA, can effectively benefit from the expanded knowledge index, leading to large gains over text-only baselines.
arXiv Detail & Related papers (2021-10-16T00:11:21Z)
- Paperswithtopic: Topic Identification from Paper Title Only [5.025654873456756]
We present a dataset of papers paired by title and sub-field from the field of artificial intelligence (AI).
We also present results on how to predict a paper's AI sub-field from a given paper title only.
For the transformer models, we also present gradient-based, attention visualizations to further explain the model's classification process.
arXiv Detail & Related papers (2021-10-09T06:32:09Z)
- Small data problems in political research: a critical replication study [5.698280399449707]
We show that the small dataset causes the classification model to be highly sensitive to variations in the random train-test split.
We also show that the applied preprocessing causes the data to be extremely sparse.
Based on our findings, we argue that A&W's conclusions regarding the automated classification of organizational reputation tweets cannot be maintained.
arXiv Detail & Related papers (2021-09-27T09:55:58Z)
- From Standard Summarization to New Tasks and Beyond: Summarization with Manifold Information [77.89755281215079]
Text summarization is the research area that aims to create a short, condensed version of the original document.
In real-world applications, most of the data is not in a plain text format.
This paper surveys these new summarization tasks and approaches as they appear in real-world applications.
arXiv Detail & Related papers (2020-05-10T14:59:36Z)
- SciREX: A Challenge Dataset for Document-Level Information Extraction [56.83748634747753]
It is challenging to create a large-scale information extraction dataset at the document level.
We introduce SciREX, a document level IE dataset that encompasses multiple IE tasks.
We develop a neural model as a strong baseline that extends previous state-of-the-art IE models to document-level IE.
arXiv Detail & Related papers (2020-05-01T17:30:10Z)
This list is automatically generated from the titles and abstracts of the papers on this site.