Python — How to extract text from PDF files easily

Camila Pozas García
2 min read · Sep 5, 2022


A step-by-step guide to how I convert PDFs to text.

Do you need to extract text from a PDF file, but don't want to set the limits manually or write a lot of code? The pdf2image and tesserocr libraries may be what you need: pdf2image turns each page into an image, and tesserocr reads the text from those images with OCR.

In my case, I had to read a lot of PDFs, many of them unstructured, so some libraries didn't work for me. I also like to use temporary folders, and this approach works very well with them.

Install steps

1. Install pdf2image

pip install pdf2image

2. Install poppler. The steps depend on your OS.

a. Mac:

brew install poppler

b. Windows: you will have to build or download poppler for Windows, then add its bin folder to your PATH.

c. Linux: Most distros ship with pdftoppm and pdftocairo. If they are not installed, refer to your package manager to install poppler-utils.
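To confirm the poppler tools are reachable before moving on, a quick check like the following may help. It is only a sketch: the helper name `missing_poppler_tools` is made up for illustration, and it looks only for the two tools mentioned above.

```python
import shutil

# The two poppler command-line tools named in the install notes.
POPPLER_TOOLS = ("pdftoppm", "pdftocairo")


def missing_poppler_tools(tools=POPPLER_TOOLS):
    """Return the names of the given tools that are not found on PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


if __name__ == "__main__":
    missing = missing_poppler_tools()
    if missing:
        print("Not found on PATH:", ", ".join(missing))
    else:
        print("poppler tools found")
```

If anything is reported as missing, install poppler (or fix your PATH) before continuing.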

3. Install tesserocr

a. On Debian/Ubuntu, install these requirements first:

apt-get install tesseract-ocr libtesseract-dev libleptonica-dev pkg-config

b. Installation

pip install tesserocr

c. You may need the tessdata files (Tesseract's trained language data) for better accuracy. They are available for many languages.
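To see which Tesseract version and which languages tesserocr can find, something like this should work. It is a sketch, and `describe_tesseract` is a hypothetical helper name; it relies on tesserocr's `tesseract_version()` and `get_languages()` functions.

```python
def describe_tesseract():
    """Return version and language info, or None if tesserocr is not importable."""
    try:
        import tesserocr
    except ImportError:
        return None

    # get_languages() returns the tessdata path and the list of installed languages.
    tessdata_path, languages = tesserocr.get_languages()
    return {
        "version": tesserocr.tesseract_version(),
        "tessdata_path": tessdata_path,
        "languages": languages,
    }


if __name__ == "__main__":
    info = describe_tesseract()
    if info is None:
        print("tesserocr is not installed")
    else:
        print(info["version"])
        print("tessdata path:", info["tessdata_path"])
        print("languages:", ", ".join(info["languages"]) or "none found")
```

An empty language list usually means the tessdata files are not where Tesseract expects them.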

Steps to get it up and running

1. Convert the PDF at a given path into images stored in a temporary folder
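A minimal sketch of this step, assuming pdf2image's `convert_from_path` with its `output_folder` argument. The wrapper name `pdf_page_images` and the `dpi` default are my own choices for illustration. Because the images live in the temporary folder, they should be used inside the `with` block, before the folder is deleted.

```python
import tempfile
from contextlib import contextmanager
from pathlib import Path


@contextmanager
def pdf_page_images(pdf_path, dpi=200):
    """Yield the pages of a PDF as PIL images stored in a temporary folder.

    The folder and its image files are deleted when the `with` block exits.
    """
    pdf_path = Path(pdf_path)
    if not pdf_path.is_file():
        raise FileNotFoundError(f"No such PDF: {pdf_path}")

    # Imported here so this module loads even where pdf2image is missing.
    from pdf2image import convert_from_path

    with tempfile.TemporaryDirectory() as tmp_dir:
        images = convert_from_path(str(pdf_path), dpi=dpi, output_folder=tmp_dir)
        try:
            yield images
        finally:
            # Release the file handles so the folder can be removed cleanly.
            for image in images:
                image.close()


# Usage (the file name is a placeholder):
#
# with pdf_page_images("example.pdf") as images:
#     print(len(images), "pages")
```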

Where:

2. Read the text from the images produced in the previous step
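One way to do this with tesserocr is to reuse a single `PyTessBaseAPI` instance for every page. The sketch below assumes `images` is a list of PIL images (for example, the result of `convert_from_path`) and that the `eng` language data is installed. The function names are hypothetical.

```python
def images_to_text(images, lang="eng"):
    """Run OCR on each PIL image and return one string per page."""
    # Imported here so this module loads even where tesserocr is missing.
    from tesserocr import PyTessBaseAPI

    texts = []
    # One Tesseract instance handles all pages instead of starting one per image.
    with PyTessBaseAPI(lang=lang) as api:
        for image in images:
            api.SetImage(image)
            texts.append(api.GetUTF8Text())
    return texts


def join_pages(texts, separator="\n\n"):
    """Join per-page strings into one document, trimming whitespace around each page."""
    return separator.join(text.strip() for text in texts)
```

If the images come from a temporary folder, call `images_to_text` while that folder still exists.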

And that's it. If you want to apply it as a function, check this out.
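As a rough idea of what such a function could look like, here is a sketch that combines both steps. The name `pdf_to_text` and its defaults are assumptions for illustration, not a fixed recipe.

```python
import tempfile
from pathlib import Path


def pdf_to_text(pdf_path, lang="eng", dpi=200):
    """Convert a PDF to images in a temporary folder, OCR each page, and return the text."""
    pdf_path = Path(pdf_path)
    if not pdf_path.is_file():
        raise FileNotFoundError(f"No such PDF: {pdf_path}")

    # Imported here so this module loads even where the libraries are missing.
    from pdf2image import convert_from_path
    from tesserocr import PyTessBaseAPI

    page_texts = []
    with tempfile.TemporaryDirectory() as tmp_dir, PyTessBaseAPI(lang=lang) as api:
        images = convert_from_path(str(pdf_path), dpi=dpi, output_folder=tmp_dir)
        for image in images:
            api.SetImage(image)
            page_texts.append(api.GetUTF8Text().strip())
            image.close()  # release the file before the folder is removed

    # Pages are separated by a blank line.
    return "\n\n".join(page_texts)


# Usage (the file name is a placeholder):
#
# print(pdf_to_text("example.pdf"))
```

A higher `dpi` tends to help with small print, at the cost of slower conversion.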

If you have questions, do not hesitate to contact me. I'm happy to answer.

