What is the way to convert a PDF document to CSV format using Python?
There are several ways to convert a PDF document to CSV using Python. One general approach is:
- If the PDF has no extractable text (for example, a scanned document), OCR it with Tesseract, the open-source OCR engine, through a Python wrapper such as pytesseract.
- Read the PDF content using the PyPDF2 or pdfminer libraries.
- Clean up the extracted text if necessary, for example with BeautifulSoup when the content was extracted as HTML.
- Load the data into a pandas DataFrame.
- Export the data to CSV using pandas.
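The read, clean, and export steps above can be sketched roughly as follows. This is a minimal illustration, not a complete solution: it assumes the PDF already has a text layer (no OCR), uses pypdf (the successor to PyPDF2) for reading, and swaps pandas for the standard-library csv module to keep dependencies small. The helper names and the rule "columns are separated by two or more spaces or a tab" are assumptions made for this example; real PDFs often need a more careful splitting rule.

```python
import csv
import io
import re


def extract_text(pdf_path):
    """Return the text of every page joined by newlines (needs `pip install pypdf`)."""
    from pypdf import PdfReader  # imported lazily so the helpers below work without it

    reader = PdfReader(pdf_path)
    return "\n".join((page.extract_text() or "") for page in reader.pages)


def text_to_rows(text):
    """Split each non-empty line into cells.

    Assumption for this sketch: cells are separated by a tab or by
    two or more whitespace characters.
    """
    rows = []
    for line in text.splitlines():
        line = line.strip()
        if line:
            rows.append(re.split(r"\s{2,}|\t", line))
    return rows


def rows_to_csv(rows):
    """Serialize rows as CSV text; cells containing commas are quoted automatically."""
    buffer = io.StringIO()
    csv.writer(buffer, lineterminator="\n").writerows(rows)
    return buffer.getvalue()
```

To use it, you could call rows_to_csv(text_to_rows(extract_text('input.pdf'))) and write the result to a file. If you prefer pandas, pandas.DataFrame(rows).to_csv('output.csv', index=False, header=False) should give an equivalent file.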
Alternatively, you can use the PDFTables API to convert PDF to CSV using Python. The tool uses an algorithm to ‘see’ tables and so extracts tabular data from PDFs accurately:
In your terminal/command line, install the PDFTables Python library with:
pip install git+https://github.com/pdftables/python-pdftables-api.git
If git is not recognized as a command, install Git first. Then, run the above command again.
Or if you’d prefer to install it manually, you can download it from the python-pdftables-api repository on GitHub, then install it with:
python setup.py install
Create a new Python script then add the following code:
import pdftables_api
c = pdftables_api.Client('my-api-key')
c.xlsx('input.pdf', 'output')
# replace c.xlsx with c.csv to convert to CSV
Now, you’ll need to make the following changes to the script:
- Replace my-api-key with your PDFTables API key, which you can get from the PDFTables website.
- Replace input.pdf with the PDF you would like to convert.
- Replace output with the name you’d like to give the converted document.
Now, save your finished script as convert-pdf.py in the same directory as the PDF document you want to convert.
Open your command line/terminal and change your directory (e.g. cd C:/Users/Bob) to the folder where you saved your convert-pdf.py script and PDF, then run the following command:
python convert-pdf.py
To find your converted spreadsheet, navigate to the folder in your file explorer and hey presto, you’ve converted a PDF to Excel or CSV with Python!
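If you would rather not edit the script for every file, a small variation like the one below may help. It is only a sketch: the environment variable name PDFTABLES_API_KEY and the helper default_output are assumptions made up for this example, not part of the PDFTables library. The only library calls used are pdftables_api.Client and its csv method, the CSV counterpart of the xlsx call in the script above.

```python
import os
from pathlib import Path


def default_output(pdf_path):
    """Hypothetical helper: input.pdf -> input.csv in the same folder."""
    return str(Path(pdf_path).with_suffix(".csv"))


def convert(pdf_path, csv_path=None):
    """Convert one PDF to CSV through the PDFTables API and return the output path."""
    import pdftables_api  # imported here so default_output works without the library

    # Assumption: the API key is stored in an environment variable
    # instead of being hard-coded in the script.
    api_key = os.environ["PDFTABLES_API_KEY"]
    client = pdftables_api.Client(api_key)

    csv_path = csv_path or default_output(pdf_path)
    client.csv(pdf_path, csv_path)
    return csv_path
```

With the key set in your environment, calling convert('input.pdf') should write input.csv next to the PDF. Keeping the key out of the script also makes it safer to share the file.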