Abstract: Paragraphs are an important class of document entities. We propose a new
approach to paragraph identification using spatial graph convolutional networks
(GCNs) applied to OCR text boxes. Two steps, namely line splitting and line
clustering, are performed to extract paragraphs from the lines in OCR results.
Each step uses a beta-skeleton graph constructed from bounding boxes, where the
graph edges provide efficient support for graph convolution operations. With
purely layout-based input features, the GCN models are 3 to 4 orders of magnitude
smaller than R-CNN based models, while achieving comparable or better
accuracy on PubLayNet and other datasets. Furthermore, the GCN models
generalize well from synthetic training data to real-world images and adapt
well to varying document styles.
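
To make the graph construction concrete, here is a minimal sketch of a beta-skeleton with beta = 1 (the Gabriel graph) built over OCR line boxes. It is an illustration under simplifying assumptions, not the construction used in this work: each box is reduced to its center point, the value beta = 1 is chosen only for simplicity, the search is brute force, and the names `box_center`, `sq_dist`, and `gabriel_edges` are hypothetical.

```python
from itertools import combinations


def box_center(box):
    """Center of an axis-aligned box given as (x0, y0, x1, y1)."""
    x0, y0, x1, y1 = box
    return ((x0 + x1) / 2.0, (y0 + y1) / 2.0)


def sq_dist(p, q):
    """Squared Euclidean distance between two 2D points."""
    return (p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2


def gabriel_edges(boxes):
    """Return edges (i, j) of the beta = 1 skeleton over box centers.

    Assumption: boxes are represented by their centers only. An edge is
    kept when no third center lies strictly inside the circle whose
    diameter is the segment between the two centers.
    """
    centers = [box_center(b) for b in boxes]
    edges = []
    for i, j in combinations(range(len(centers)), 2):
        d_ij = sq_dist(centers[i], centers[j])
        # k is strictly inside the diameter circle iff the angle at k is
        # obtuse, i.e. |ki|^2 + |kj|^2 < |ij|^2.
        blocked = any(
            sq_dist(centers[k], centers[i]) + sq_dist(centers[k], centers[j]) < d_ij
            for k in range(len(centers))
            if k not in (i, j)
        )
        if not blocked:
            edges.append((i, j))
    return edges


# Three vertically stacked text lines: only neighboring lines are linked.
lines = [(0, 0, 100, 10), (0, 20, 100, 30), (0, 40, 100, 50)]
edges = gabriel_edges(lines)
```

In this sketch the resulting edge list would define the neighborhoods over which a graph convolution aggregates box features. The graph stays sparse because a box between two others blocks the direct edge between them.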