How do you extract text data from PDF files?

Check out Apache Tika.

The Apache Tika™ toolkit detects and extracts metadata and text from over a thousand different file types (such as PPT, XLS, and PDF).

For Tika, PDF is just one of the more than a thousand document types it can handle. It extracts both the textual content and the metadata of a document. So the effort you invest in learning it pays off for many other tasks: if you need to do the same thing with PPT, DOC, or some other format tomorrow, you won't have to go looking for a new library.

I see this question is also tagged with Web Crawling. Apache Nutch uses Tika internally to extract content from the various documents it finds on the web.

Tika's strengths in brief:

  • It has a command-line interface for quick tests.
    Example:
    java -jar target/tika-app-1.13-SNAPSHOT.jar -t ~/ebooks/Machine\ Learning\ in\ Action.pdf
  • It is written in Java and available as a library from the Maven repository.
  • It has a REST API.
  • It has a Python client.
  • It has a very active mailing list where you can ask questions.
  • It is licensed under the Apache License 2.0, which permits free use, modification, and redistribution, including commercially.
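
To make the REST API point concrete, here is a minimal sketch in Python that uses only the standard library. It assumes a Tika server has been started separately and is listening on http://localhost:9998 (its usual default); adjust the URL if yours differs. The helper names build_tika_request and extract_text are made up for this example and are not part of Tika or its Python client. The response is assumed to be UTF-8 text.

```python
import urllib.request

# Assumption: a Tika server is already running locally on its default port.
TIKA_SERVER = "http://localhost:9998"


def build_tika_request(pdf_bytes, server_url=TIKA_SERVER):
    """Build a PUT request to the server's /tika endpoint asking for plain text.

    Hypothetical helper for illustration; not part of any Tika library.
    """
    return urllib.request.Request(
        url=server_url.rstrip("/") + "/tika",
        data=pdf_bytes,
        method="PUT",
        headers={
            "Content-Type": "application/pdf",
            "Accept": "text/plain",  # ask for extracted text rather than XHTML
        },
    )


def extract_text(pdf_path, server_url=TIKA_SERVER):
    """Send the PDF at pdf_path to the Tika server and return the extracted text."""
    with open(pdf_path, "rb") as f:
        request = build_tika_request(f.read(), server_url)
    with urllib.request.urlopen(request) as response:
        # Assumption: the server answers in UTF-8.
        return response.read().decode("utf-8")
```

With the server running, calling extract_text("/path/to/some.pdf") should return the document's text as a string. Building the request in a separate function keeps the network call isolated, so the request shape can be checked without a live server.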