We have launched EaseAnnotate!

Try Out Now!
Pythonify logo
Python

Python PDF Packages comparison, All you need to know (Updated 2024)

11 min read
pythonify
Python PDF Packages comparison, All you need to know (Updated 2024)

Portable Document Format (PDF) is a file format developed by Adobe in 1992 to present documents, including text formatting and images, in a manner independent of application software, hardware, and operating systems

To understand PDFs properly we need to know that there can be three types of PDF documents:

  • Digitally-born PDF files: The file are created digitally on the computer. It can contain images, texts, links, outline items (a.k.a., bookmarks), JavaScript etc. If you Zoom in a lot, the text still looks sharp.
  • Scanned PDF files: The pages in this pdf contains scanned image. The images are then stored in a PDF file. Hence the file is just a container for those images. You cannot copy the text, you don’t have links, outline items, JavaScript.
  • OCRed PDF files: The scanner ran OCR software and put the recognized text in the background of the image. Hence you can copy the text, but it still looks like a scan. If you zoom in enough, you can recognize pixels.

Fun Fact About PDFs

Most PDF files look like they contain well-structured text. But the reality is that a PDF file does not contain anything that resembles paragraphs, sentences, or even words. When it comes to text, a PDF file is only aware of the characters and their placement.

This makes extracting meaningful pieces of text from PDF files difficult. The characters that compose a paragraph are no different from those that compose the table, the page footer or the description of a figure. Unlike other document formats, like a .txt file or a word document, the PDF format does not contain a stream of text.

A PDF document consists of a collection of objects that together describe the appearance of one or more pages, possibly accompanied by additional interactive elements and higher-level application data. A PDF file contains the objects making up a PDF document along with associated structural information, all represented as a single self-contained sequence of bytes.

In this blog, we will discuss the following 5 libraries, the last one i.e. OCRmyPDF serves a different purpose but is a very useful tool for converting scanned pdf to searchable pdf.

  1. PyPDF
  2. PyMuPDF
  3. PDFminer.six
  4. PDFplumber
  5. OCRmyPDF

Summary of Comparison

FeaturesPyPDFPyMuPDFPDFminer.sixPDFplumber
Main featurePDF processing, Reading and writing PDFs, Applying transformations, Delete pages merging pages, etcPDF manipulation, including advanced rendering and comprehensive data extractionextracting text and other elements from PDFs with a focus on accurate layout analysis and conversion to other text formatsUses PDFminer.six, to extract text, PDFPlumber excels at extracting and analyzing visual and tabular data. provides a more intuitive interface for exploring the layout and structure of a page
Text Extraction(Digital PDF)✔️✔️✔️✔️
Text Extraction(Scanned PDF)Requires external OCR like TesseractRequires external OCR like TesseractRequires external OCR like TesseractRequires external OCR like Tesseract
Table Extraction✔️✔️
Merge PDF, Page Deletion, Addition, Reorder✔️✔️
Transformation like Rotation, Scaling✔️✔️
ComplexityEasiest and simpleFeature-rich but takes time to get things rightModerateEasier than pdfminer
SpeedSlowestFastest and performantModerateModerate

PyPDF

pypdf is a free and open-source pure-python PDF library capable of splitting, merging, cropping, and transforming(rotation, scaling, translation) the pages of PDF files. It can also add custom data, viewing options, and passwords to PDF files. pypdf can retrieve text and metadata from PDFs as well.

Here are some basic uses of this library

Merging and Transforming PDF files

python
from pypdf import PdfWriter merger = PdfWriter() for pdf in ["file1.pdf", "file2.pdf", "file3.pdf"]: merger.append(pdf) #transformation can be applied in page level merger.pages[0].rotate(90) #Transformation can also be applied using the add_transformation method transformation = Transformation().rotate(45) merger.pages[0].add_transformation(transformation) merger.write("merged-pdf.pdf") merger.close()

Text Extraction

We can extract text content from Digitally-born PDF files. Its limitation is that it does not work with scanned pdf.

python
from pypdf import PdfReader reader = PdfReader("example.pdf") page = reader.pages[0] print(page.extract_text())

Pdfminer.six

Pdfminer.six is a Python package for extracting information from PDF documents. As we already know pdf is a binary format that consists of objects without proper structure unlike text files or other more familiar HTML files. pdfminer can help us extract different segments from pdf files like, text, figure, line, rect, and image.

python
from pdfminer.high_level import extract_pages for page_layout in extract_pages("test.pdf"): for element in page_layout: print(element)

Each element will be an LTTextBoxLTFigureLTLineLTRect or an LTImage. Some of these can be iterated further, for example iterating through an LTTextBox will give you an LTTextLine, and these, in turn, can be iterated through to get an LTChar. See the diagram here:

 The output of the layout analysis is a hierarchy of layout objects.

Let’s say we want to extract all of the text. We could do:

python
from pdfminer.high_level import extract_pages from pdfminer.layout import LTTextContainer for page_layout in extract_pages("test.pdf"): for element in page_layout: if isinstance(element, LTTextContainer): print(element.get_text())

Or, we could extract the fontname or size of each individual character:

python
from pdfminer.high_level import extract_pages from pdfminer.layout import LTTextContainer, LTChar for page_layout in extract_pages("test.pdf"): for element in page_layout: if isinstance(element, LTTextContainer): for text_line in element: for character in text_line: if isinstance(character, LTChar): print(character.fontname) print(character.size)

Extract all images from PDF into directory

This command extracts all the images from the PDF and saves them into the my_images_dir directory. If you have properly installed pdfminer.six, you should be able to run this command from the command line.

shell
pdf2txt.py example.pdf --output-dir my_images_dir

PdfPlumber

PDFplumber Plumbs PDF for detailed information about each text character, rectangle, and line. Plus: Table extraction and visual debugging. it works best on machine-generated, rather than scanned, PDFs. It uses pdfminer.six under the hood. It adds Table extraction on top of pdfminer.six

Let’s extract a table from this sample pdf file that looks like this:

Untitled

Here. is the example code to extract the Table from the above pdf using PdfPlumber:

python
import pdfplumber import pandas as pd pdf = pdfplumber.open("./sample_pdf_with_table.pdf") page = pdf.pages[0] table = page.extract_table() # Create a dataframe from the table data # table[1:]-> Skip the first row of the table # table[0]-> use the first row as the column names df = pd.DataFrame(table[1:], columns=table[0]) # Remove spaces from the column for column in ["Name", "Age", "Occupation", "City"]: df[column] = df[column].str.replace(" ", "") # Save to CSV df.to_csv("sample_pdf_with_table.csv", index=False)

We get the sample_pdf_with_table.csv file where we extract the table data from the pdf as follows:

shell
Name,Age,Occupation,City John,28,Engineer,NewYork Emily,22,Teacher,LosAngeles Mike,35,Doctor,Chicago Sarah,30,Designer,SanFrancisco Alex,25,Developer,Boston

PyMuPDF

PyMuPDF is a high-performance Python library for data extraction, analysis, conversion & manipulation of PDF (and other) documents. It is based on MuPDF which is a C library for pdf data extraction.

Some of the features it supports are Text extraction, Merging PDFs, Transforming PDFs, Inserting pages, Deleting pages, rearranging pages, etc.PyMuPDF does all this way faster than other libraries. Below I have shown a comparison of the most important features listed on the pymupdf’s official website, to see the full list visit this link.

FeaturePyMuPDFPyPDF2pdfplumber / pdfminer
Supports Multiple Document FormatsPDF XPS EPUB MOBI FB2 CBZ SVG TXT ImagePDFPDF
ImplementationC and PythonPythonPython
Render Document PagesAll document typesNo renderingNo rendering
Extract TextAll document typesPDF onlyPDF only
Extract Tables✔️PDF plumber can
Extract Vector GraphicsAll document typesLimited
Encrypted PDFs✔️LimitedLimited
Integrates with Jupyter and IPython Notebooks✔️✔️
Joining / Merging PDF with other Document TypesAll document typesPDF onlyPDF only
OCR API for Seamless Integration with TesseractAll document types
PDF Form FieldsCreate, read, updateLimited, no creation

Below I have provided snippets for basic features supported by PyMuPDF. To use PyMuPDF, you need to install it using pip. All the features are accessed using the fitz module.

Merging PDF files

To merge PDF files, do the following:

python
import fitz doc_a = fitz.open("a.pdf") # open the 1st document doc_b = fitz.open("b.pdf") # open the 2nd document doc_a.insert_pdf(doc_b) # merge the docs doc_a.save("a+b.pdf") # save the merged document with a new filename

Rotating a PDF

To add a rotation to a page, do the following:

python
import fitz doc = fitz.open("test.pdf") # open document page = doc[0] # get the 1st page of the document page.set_rotation(90) # rotate the page doc.save("rotated-page-1.pdf")

Deleting Pages

To delete a page from a document, do the following:

python
import fitz doc = fitz.open("test.pdf") # open a document doc.delete_page(0) # delete the 1st page of the document doc.save("test-deleted-page-one.pdf") # save the document

Extracting Tables from a Page

Tables can be found and extracted from any document Page

python
import fitz from pprint import pprint doc = fitz.open("test.pdf") # open document page = doc[0] # get the 1st page of the document tabs = page.find_tables() # locate and extract any tables on page print(f"{len(tabs.tables)} found on {page}") # display number of found tables if tabs.tables: # at least one table found? pprint(tabs[0].extract()) # print content of first table/

OCRmyPDF

OCRmyPDF is a Python package and command line application that adds text “layers” to images in PDFs, making scanned image PDFs searchable and copyable/pasteable. It uses OCR to guess the text contained in images. OCRmyPDF also supports plugins that enable customization of its processing steps, and it is highly tolerant of PDFs containing scanned images and “born digital” content that doesn’t require text recognition.

shell
ocrmypdf # it's a scriptable command line program -l eng+fra # it supports multiple languages --rotate-pages # it can fix pages that are misrotated --deskew # it can deskew crooked PDFs! --title "My PDF" # it can change output metadata --jobs 4 # it uses multiple cores by default --output-type pdfa # it produces PDF/A by default input_scanned.pdf # takes PDF input (or images) output_searchable.pdf # produces validated PDF output

Features

  • Converts regular PDFs into searchable PDF/A format.
  • Accurately overlays OCR text on images to facilitate copying and pasting often without altering other content.
  • Maintains original image resolutions.
  • Reduces file sizes through image optimization.
  • Offers image deskewing and cleaning before OCR, if needed.
  • Utilizes all CPU cores for efficient processing.
  • Employs Tesseract OCR to recognize over 100 languages.

Limitations

OCRmyPDF is subject to limitations imposed by the Tesseract OCR engine. These limitations are inherent to any software relying on Tesseract:

  • OCR accuracy may not be perfect and could be inferior to commercial OCR tools.
  • Handwriting recognition is not supported.
  • Accuracy decreases for languages not included in the -l LANG setting.
  • Tesseract may misinterpret the layout, like failing to recognize multiple columns.
  • Tesseract doesn't identify font types or structure text into paragraphs or headings, only providing text and its location.

Conclusion

In conclusion, this comprehensive blog has navigated through the intricate world of PDF processing in Python, showcasing the strengths and particularities of five key libraries: PyPDF, PyMuPDF, PDFminer.six, PDFplumber, and OCRmyPDF. While PyPDF and PyMuPDF excel in general PDF manipulation and rendering, PDFminer.six and PDFplumber are more adept at extracting text and other elements like lines,rects and figures, with PDFplumber offering an intuitive interface for layout exploration on top of pdfminer. OCRmyPDF, on the other hand, stands out for its ability to add searchable text layers to scanned PDFs, enhancing their accessibility and functionality. Each of these tools, with its unique features and capabilities, caters to various needs within the PDF processing spectrum, making them invaluable assets for developers and data analysts dealing with diverse PDF-related challenges. This guide not only illuminates the capabilities of each library but also serves as a navigational beacon for selecting the right tool for specific PDF handling tasks, ensuring efficient and effective outcomes in the realm of digital document processing.

More Reads

Python for Java Programmers
Python

Python for Java Programmers

December 30, 2023