Automated table parsing: PDF2Excel2JSON

Tables are one of the natural ways to organize content, be it in our minds or on paper. Hence, tabular data are ubiquitous and come in different formats (e.g., PDF, Excel, HTML, XML, images), depending on how the tables were created in the first place.

Many downstream data processing tasks, such as knowledge base construction (KBC), require tables to be represented in a structured format. However, common formats for storing tabular information, such as PDF, contain only the information needed for correct rendering and no underlying machine-readable structure.

Depending on how the file was created, PDF files can be grouped into [1]:

  • "true"/digitally created PDFs,
  • scanned/"image-only" PDFs (without an underlying text layer, not searchable),
  • searchable PDFs (text layer added by OCR).

Recently, the state of the art in table extraction [3] has shifted towards deep learning-based approaches [4,5]. A major limiting factor for such systems is the lack of available training data.

In previous work, we built a tool for automatic visual table structure detection and a GUI for manual inspection and ground truth annotation. The machine learning model was trained on a corpus of TeX files and PDF files from arXiv.

The goal of this master's thesis is to adapt the above-mentioned baseline system to a new corpus of tabular data. We aim to combine PDF documents with an existing set of corresponding well-defined structured sources (e.g., Excel and HTML) to automatically generate document annotations.

In the proposed automatic supervision framework, we leverage the existing source files to generate document renderings (left), tabular annotations for such renderings (center) and the corresponding structured representation (right).

To support the research that motivates this work, we aim to build an end-to-end system that generates a structured table representation (e.g., JSON, XML) from a given PDF file.
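To make the target output more concrete, here is a minimal sketch of what a JSON table representation could look like. The `Cell` fields, the `table_to_json` helper, and the JSON layout are all assumptions made for illustration; the project does not prescribe a schema.

```python
import json
from dataclasses import dataclass, asdict


@dataclass
class Cell:
    """Hypothetical cell record; field names are illustrative only."""
    row: int      # zero-based row index in the table grid
    col: int      # zero-based column index in the table grid
    text: str     # cell content as extracted text
    bbox: tuple   # (x0, y0, x1, y1) on the page, in PDF points


def table_to_json(cells):
    """Serialize a list of cells into a JSON string, ordered by grid position."""
    ordered = sorted(cells, key=lambda c: (c.row, c.col))
    # Grid size is derived from the largest row/column index that occurs.
    n_rows = max((c.row for c in ordered), default=-1) + 1
    n_cols = max((c.col for c in ordered), default=-1) + 1
    payload = {
        "n_rows": n_rows,
        "n_cols": n_cols,
        "cells": [asdict(c) for c in ordered],
    }
    return json.dumps(payload, indent=2)
```

Storing the bounding box next to the grid position keeps the link between the structured output and the page rendering, which may be useful when checking results in an annotation GUI.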

The automatic supervision procedure is structured as follows:

  • a PDF page is processed by PDFMiner [2] to generate initial cell location candidates on the page (text content + bounding boxes for individual words/numbers);
  • an Excel sheet (cell location) is preprocessed so that the position (row/column) and content of each table cell can be matched with the PDFMiner bounding boxes;
  • a system learns from the mappings and infers a possible table layout given a PDF input;
  • heuristics and manual adjustments are applied with the annotation GUI to correct potential systematic errors and to create validation/test data;
  • the resulting system converts tabular structures in "true" PDFs (created via Excel) or scanned PDFs into a structured table representation in JSON.
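The matching step between Excel cells and PDF word boxes could, in its simplest form, be sketched as follows. This is an illustrative baseline that matches on exact normalized text only; it is not the project's method. The function names and the input tuples (standing in for PDFMiner output and a preprocessed Excel sheet) are hypothetical.

```python
from collections import defaultdict


def normalize(text):
    """Lowercase and collapse whitespace so minor formatting differences do not block a match."""
    return " ".join(str(text).lower().split())


def match_cells_to_boxes(excel_cells, pdf_words):
    """Assign each Excel cell the bounding box of a PDF word with the same text.

    excel_cells: list of (row, col, text) tuples (assumed preprocessing output)
    pdf_words:   list of (text, (x0, y0, x1, y1)) tuples (assumed extractor output)
    Returns a dict mapping (row, col) -> bbox, or None when no word matches.
    """
    # Index PDF words by normalized text, keeping input order for duplicates.
    boxes_by_text = defaultdict(list)
    for text, bbox in pdf_words:
        boxes_by_text[normalize(text)].append(bbox)

    mapping = {}
    for row, col, text in excel_cells:
        candidates = boxes_by_text.get(normalize(text))
        # Consume candidates in order so repeated values map to distinct boxes.
        mapping[(row, col)] = candidates.pop(0) if candidates else None
    return mapping
```

This sketch assumes both inputs are listed in the same reading order, so that repeated values pair up correctly. Cells that span several words, or numbers that are formatted differently in Excel and in the PDF, would stay unmatched here; those cases are where learned mappings and manual correction would be needed.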

More than 10,000 documents in PDF and Excel format could be provided to build a training pipeline. Some feature engineering is expected when preprocessing the Excel documents, e.g., obtaining bounding boxes of different categories (title, cell) using VB, see example 1, example 2.
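As a rough illustration of the geometry behind such bounding boxes, the sketch below derives one box per cell from column widths and row heights. It is written in Python rather than VB, and the function name, the input lists, and the top-left origin are assumptions; reading the actual widths and heights from an Excel sheet is not shown.

```python
from itertools import accumulate


def grid_bounding_boxes(col_widths, row_heights, origin=(0.0, 0.0)):
    """Return {(row, col): (x0, y0, x1, y1)} for a regular grid.

    Coordinates start at `origin` (assumed to be the top-left corner of the
    table), with x growing to the right and y growing downward.
    """
    x0, y0 = origin
    # Cumulative sums give the positions of the grid lines.
    xs = [x0] + [x0 + w for w in accumulate(col_widths)]
    ys = [y0] + [y0 + h for h in accumulate(row_heights)]
    return {
        (r, c): (xs[c], ys[r], xs[c + 1], ys[r + 1])
        for r in range(len(row_heights))
        for c in range(len(col_widths))
    }
```

PDF page coordinates usually have their origin at the bottom-left corner, so the y values would need to be flipped and the table offset on the page added before comparing these boxes with boxes extracted from a PDF.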

References

  [1] "Types of PDFs" (last accessed: Nov. 15, 2018).
  [2] PDFMiner (last accessed: Nov. 15, 2018).
  [3] ICDAR 2013 Table Competition (Note: The problem of limited training data was not solved.)
  [4] DeepDeSRT: Deep Learning for Detection and Structure Recognition of Tables in Document Images
  [5] DeCNT: Deep Deformable CNN for Table Detection