What is the best way to extract tabular data from a PDF?

The open-source approach to this task usually involves the pdftotext command-line tool from the poppler-utils package (that is the package name in Debian Linux; see http://poppler.freedesktop.org for the source code).

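This invocation works well for me (-layout keeps the physical layout of the page as closely as possible, and -nopgbrk leaves out the page-break characters between pages):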
pdftotext -nopgbrk -layout input.pdf output.txt
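
If there are many PDFs to convert, the same invocation can be scripted. The following is only a minimal sketch in Python, assuming pdftotext is installed and on the PATH; the function names and the choice to write each .txt file next to its PDF are my own illustration, not part of poppler.

```python
import subprocess
from pathlib import Path


def build_pdftotext_command(pdf_path, txt_path):
    """Return the argument list for the invocation shown above."""
    return ["pdftotext", "-nopgbrk", "-layout", str(pdf_path), str(txt_path)]


def convert_pdf(pdf_path):
    """Convert one PDF to layout-preserving text; return the path of the .txt file.

    Assumption: the output is written next to the input, with a .txt suffix.
    """
    pdf_path = Path(pdf_path)
    txt_path = pdf_path.with_suffix(".txt")
    # check=True raises CalledProcessError if pdftotext exits with an error.
    subprocess.run(build_pdftotext_command(pdf_path, txt_path), check=True)
    return txt_path


def convert_directory(directory):
    """Convert every *.pdf file in a directory (hypothetical helper)."""
    return [convert_pdf(pdf) for pdf in sorted(Path(directory).glob("*.pdf"))]
```

Calling convert_pdf("input.pdf") would then produce input.txt in the same directory.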

The resulting file, here called output.txt, contains plain text with the original layout approximately intact. You can then save the tables from this file (manually or with a script) into files with .csv, .tsv or .dat extensions. With any luck, R’s read.table() function or the software of your choice will accept the formatting as it is; otherwise, you will need to do some post-processing.
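
As a rough illustration of the post-processing step, here is a small Python sketch that turns layout-preserved text into CSV. It rests on an assumption that will not hold for every PDF: that table columns are separated by at least two spaces, while the text inside a cell contains only single spaces. The function names and the sample text are made up for this example.

```python
import csv
import io
import re

# Assumption: two or more whitespace characters mark a column boundary.
COLUMN_GAP = re.compile(r"\s{2,}")


def split_layout_line(line):
    """Split one line of layout-preserved text into cells."""
    return COLUMN_GAP.split(line.strip())


def layout_text_to_rows(text, min_columns=2):
    """Keep only the lines that split into at least min_columns cells.

    Blank lines and one-cell lines (titles, captions, running prose
    without wide gaps) are dropped.
    """
    rows = []
    for line in text.splitlines():
        if not line.strip():
            continue
        cells = split_layout_line(line)
        if len(cells) >= min_columns:
            rows.append(cells)
    return rows


def rows_to_csv(rows):
    """Serialize the rows as CSV text; the csv module handles quoting."""
    buffer = io.StringIO()
    csv.writer(buffer, lineterminator="\n").writerows(rows)
    return buffer.getvalue()


# Made-up sample in the style of pdftotext -layout output.
sample = """Table 1: Yields

Plot    Crop             Yield
A1      Wheat            3.2
B2      Spring barley    2.9
"""

print(rows_to_csv(layout_text_to_rows(sample)), end="")
```

For a real file, read output.txt into a string, pass it through the same two functions and write the result to a .csv file. Tables with empty cells or with single-space column gaps will come out misaligned under this heuristic and still need manual correction.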
