Python — How to extract text from PDF files easily
A step-wise guide on how I convert PDF to text.
Do you need to extract text from a PDF file, but you don’t want to set the limits manually or write a lot of code? The pdf2image and tesserocr libraries may be what you need.
In my case, I had to read a lot of PDFs, many of them unstructured, so some libraries didn’t work for me. I also like working with temporary folders, and this approach fits that well.
Install steps
1. Install pdf2image
pip install pdf2image
2. Install poppler. How you do this depends on your OS.
a. Mac:
brew install poppler
b. Windows: you will have to build poppler yourself or download a prebuilt version for Windows, then make sure its bin/ folder is on your PATH.
c. Linux: Most distros ship with pdftoppm and pdftocairo. If they are not installed, refer to your package manager to install poppler-utils.
3. Install tesserocr
a. On Debian/Ubuntu, install these requirements first:
apt-get install tesseract-ocr libtesseract-dev libleptonica-dev pkg-config
b. Installation
pip install tesserocr
c. For better accuracy you may also need the tessdata language files, which are available for many languages.
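Before moving on, it can help to confirm that the pieces installed above are actually visible to Python. The snippet below is only a sketch: the helper name `check_ocr_setup` is made up for this example, and it only checks that the two packages can be found and that poppler's `pdftoppm` binary is on the PATH. It does not verify versions or tessdata.

```python
import importlib.util
import shutil


def check_ocr_setup():
    """Report which parts of the toolchain can be found on this machine."""
    return {
        # Python packages installed with pip
        "pdf2image": importlib.util.find_spec("pdf2image") is not None,
        "tesserocr": importlib.util.find_spec("tesserocr") is not None,
        # poppler command-line tool that pdf2image calls under the hood
        "pdftoppm": shutil.which("pdftoppm") is not None,
    }


for name, found in check_ocr_setup().items():
    print(f"{name}: {'found' if found else 'MISSING'}")
```

If any line prints MISSING, go back to the matching install step before continuing.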
Steps to get it up and running
1. Read the PDF from its path and convert its pages to images in a temporary folder
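A minimal sketch of this step, assuming pdf2image's `convert_from_path` together with the standard `tempfile` module. The helper names `render_pages` and `count_pages`, the PNG format, and the `dpi` default are illustrative choices, not something the original post prescribes.

```python
import tempfile


def render_pages(pdf_path, output_folder, dpi=200):
    """Render every page of the PDF as a PNG inside output_folder.

    Returns a list of PIL images, one per page.
    """
    # Imported inside the function so this file can be imported
    # even on a machine where pdf2image is not installed yet.
    from pdf2image import convert_from_path

    return convert_from_path(
        pdf_path, dpi=dpi, output_folder=output_folder, fmt="png"
    )


def count_pages(pdf_path):
    """Example use: render into a temporary folder and count the pages."""
    # The folder and the page images in it are deleted when the block exits,
    # so anything that needs the images has to happen inside the `with`.
    with tempfile.TemporaryDirectory() as folder:
        pages = render_pages(pdf_path, folder)
        return len(pages)
```

A higher `dpi` usually gives the OCR step sharper input at the cost of slower rendering and larger files.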
2. Read the text from the images created in step 1
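For the OCR step, tesserocr offers the convenience function `image_to_text`, which accepts a PIL image. The sketch below assumes the images come from pdf2image. The helper name `images_to_text` and the `lang` default are assumptions made for this example.

```python
def images_to_text(images, lang="eng"):
    """Run OCR on each page image and return one string per page."""
    # Imported inside the function so this file can be imported
    # even on a machine where tesserocr is not installed yet.
    import tesserocr

    return [tesserocr.image_to_text(image, lang=lang) for image in images]
```

If the images were written to a temporary folder, call this while that folder still exists, because the image files may be read from disk only when they are first used.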
And that’s it. If you want to apply it as a function, check this out.
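One possible way to wrap both steps in a single function is sketched below. It is not necessarily the author's version: the name `pdf_to_text` and its parameters are assumptions, and it uses tesserocr's `PyTessBaseAPI` so that one Tesseract engine is reused for all pages instead of being started once per image.

```python
import tempfile


def pdf_to_text(pdf_path, lang="eng", dpi=200):
    """Convert a PDF to images in a temporary folder and OCR every page."""
    # Imported inside the function so this file can be imported
    # even on a machine where the two libraries are not installed yet.
    from pdf2image import convert_from_path
    from tesserocr import PyTessBaseAPI

    page_texts = []
    # Both the temporary folder and the OCR engine are cleaned up on exit.
    with tempfile.TemporaryDirectory() as folder, PyTessBaseAPI(lang=lang) as api:
        for image in convert_from_path(pdf_path, dpi=dpi, output_folder=folder):
            api.SetImage(image)
            page_texts.append(api.GetUTF8Text().strip())
    # Separate pages with a blank line.
    return "\n\n".join(page_texts)
```

Calling `pdf_to_text("my_file.pdf")`, where the file name is only a placeholder, would return the recognized text of all pages as one string.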
If you have questions, do not hesitate to contact me. I’m happy to answer.