What is the way to convert a PDF document to CSV format using Python?
There are several ways to convert a PDF document to CSV format using Python. Three of them are described here:
METHOD 1:
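- If the PDF has no readable text layer (for example, a scanned document), OCR it first with Tesseract, the open source OCR engine, through its Python wrapper pytesseract.
- Read the PDF content using the PyPDF2 or pdfminer libraries.
- Clean up the extracted text using BeautifulSoup if necessary (for example, when the extractor outputs HTML).
- Load the data into a pandas DataFrame.
- Export the data to CSV using pandas.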
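The steps above can be sketched roughly as follows. This is a minimal sketch under several assumptions: the PDF already has a text layer (the OCR step is skipped), columns in the extracted text are separated by tabs or by two or more spaces (a heuristic that will not suit every PDF), and the function names extract_text, text_to_rows and save_csv are made up for illustration. PyPDF2 and pandas are imported inside the functions that need them, so the text-splitting part can be tried without either library installed.

```python
import re


def extract_text(pdf_path):
    """Return the text of every page in the PDF, joined by newlines."""
    from PyPDF2 import PdfReader  # third-party: pip install PyPDF2

    reader = PdfReader(pdf_path)
    # extract_text() can return an empty result for pages with no text layer
    return "\n".join(page.extract_text() or "" for page in reader.pages)


def text_to_rows(text):
    """Split each non-empty line into cells.

    Assumption: cells are separated by a tab or by 2+ whitespace characters.
    """
    rows = []
    for line in text.splitlines():
        line = line.strip()
        if line:
            rows.append(re.split(r"\s{2,}|\t", line))
    return rows


def save_csv(rows, csv_path):
    """Load the rows into a pandas DataFrame and export them to CSV."""
    import pandas as pd  # third-party: pip install pandas

    pd.DataFrame(rows).to_csv(csv_path, index=False, header=False)


# Example usage (file names are placeholders):
# save_csv(text_to_rows(extract_text("input.pdf")), "output.csv")
```

How well this works depends heavily on how the PDF was produced. Text extraction does not preserve table structure, so the splitting rule in text_to_rows usually needs adjusting for each document layout.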
METHOD 2:
You can use the PDFTables API to convert PDF to CSV using Python. The tool uses an algorithm to "see" tables, and so it outputs tabular data from PDFs accurately. For details, see "PDF to Excel API: How it Works" on the PDFTables site.
METHOD 3:
Step 1
In your terminal/command line, install the PDFTables Python library with:
pip install git+https://github.com/pdftables/python-pdftables-api.git
If git is not recognized, install Git first. Then, run the above command again.
Or if you’d prefer to install it manually, you can download it from python-pdftables-api then install it with:
python setup.py install
Step 2
Create a new Python script then add the following code:
import pdftables_api

c = pdftables_api.Client('my-api-key')
c.xlsx('input.pdf', 'output')  # replace c.xlsx with c.csv to convert to CSV
Now, you’ll need to make the following changes to the script:
- Replace my-api-key with your PDFTables API key, which you can get from the PDFTables site.
- Replace input.pdf with the PDF you would like to convert.
- Replace output with the name you'd like to give the converted document.
Now, save your finished script as convert-pdf.py in the same directory as the PDF document you want to convert.
Step 3
Open your command line/terminal and change your directory (e.g. cd C:/Users/Bob) to the folder you saved your convert-pdf.py script and PDF in, then run the following command:
python convert-pdf.py
To find your converted spreadsheet, navigate to the folder in your file explorer and hey presto, you’ve converted a PDF to Excel or CSV with Python!
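If you would rather not hard-code the API key and file names in the script, a variation along these lines may help. It is only a sketch: the helper names csv_path_for and convert_to_csv and the environment variable PDFTABLES_API_KEY are made up for illustration. The only library calls used are pdftables_api.Client and its csv method, the same calls the script in Step 2 relies on.

```python
import os
from pathlib import Path


def csv_path_for(pdf_path):
    """Return the output path: same folder and name as the PDF, .csv extension."""
    return Path(pdf_path).with_suffix(".csv")


def convert_to_csv(pdf_path, api_key):
    """Send the PDF to PDFTables and save the result next to it as CSV."""
    import pdftables_api  # installed in Step 1

    out_path = csv_path_for(pdf_path)
    client = pdftables_api.Client(api_key)
    client.csv(str(pdf_path), str(out_path))
    return out_path


# Example usage (PDFTABLES_API_KEY is a hypothetical variable name):
# convert_to_csv("input.pdf", os.environ["PDFTABLES_API_KEY"])
```

Keeping the key in an environment variable means the script can be shared or committed without exposing it.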