Testing PDFs with Python

Overview

When you generate PDFs, you need a way to test their integrity—not only must they be valid, but they should behave correctly and display consistently, even on different platforms. This article describes how you can use the PyPDF2 library to test your PDF files for broken links (both internal and external), and how to find fonts that are not embedded in the PDF.

Note that some PDF readers are ‘smart’ and will create a live hyperlink from a string of text that looks like a URL, even though the text is not coded as a URL in the PDF file. The technique described in this article does not address this issue—it only tests the actual URLs present in the PDF file.

Code from this article is available as a gist on GitHub. Get a copy and play along.

Testing Internal and External Links

To test the links, we’ll create a function that checks the URLs found inside the PDF file. This function uses the requests library, which you can install with pip.

The PDF may contain some FTP links, which the requests library does not support directly. You can either use the urllib package, as in the following code, or use the requests-ftp package, available on PyPI.

from PyPDF2 import PdfFileReader
from pprint import pprint
import requests
import sys
import urllib

The check_ftp function checks a given URL for a response. If the request fails, or if the response is empty, it returns False along with the reason; otherwise it returns True.

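# urlopen lives in urllib.request on Python 3 and in urllib on Python 2.
try:
    from urllib.request import urlopen
except ImportError:
    from urllib import urlopen


def check_ftp(url):
    try:
        response = urlopen(url)
    except IOError as e:
        # Covers unreachable hosts, missing files, and refused logins.
        result, reason = False, e
    else:
        if response.read():
            result, reason = True, 'okay'
        else:
            result, reason = False, 'Empty Page'
    return result, reason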

The check_url function is also simple: if the URL starts with ftp://, it delegates to the check_ftp function. Otherwise, it attempts to get the URL with a timeout and typical header values. The function returns a result along with the reason it succeeded or failed.

def check_url(url, auth=None):
    headers = {'User-Agent': 'Mozilla/5.0', 'Accept': '*/*'}
    if url.startswith('ftp://'):
        result, reason = check_ftp(url)
    else:
        try:
            response = requests.get(url, timeout=6, auth=auth, headers=headers)
        except requests.RequestException as e:
            # Base class for connection errors, timeouts, and
            # unsupported schemes such as mailto:.
            result, reason = False, e
        else:
            if not response.ok:
                # 4xx and 5xx responses do not raise, so flag them here.
                result = False
                reason = '%s %s' % (response.status_code, response.reason)
            elif response.text:
                result, reason = response.status_code, response.reason
            else:
                result, reason = False, 'Empty Page'

    return result, reason
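
The same URL often appears on many pages of a PDF, so it may be worth checking each distinct URL only once. The following is a minimal sketch of one way to do that; it is not part of the original script. The names make_cached_checker and fake_checker are hypothetical, and fake_checker stands in for check_url so that the example runs without network access.

```python
def make_cached_checker(checker):
    """Wrap a checker such as check_url so each URL is checked only once.

    `checker` is any callable that takes a URL and returns (result, reason).
    """
    cache = {}

    def cached(url):
        if url not in cache:
            cache[url] = checker(url)
        return cache[url]

    return cached


# Hypothetical stand-in for check_url, used here to avoid network access.
calls = []

def fake_checker(url):
    calls.append(url)
    if url.endswith('/missing'):
        return False, 'Not Found'
    return 200, 'OK'


check = make_cached_checker(fake_checker)
results = [check(u) for u in ['http://example.com/',
                              'http://example.com/missing',
                              'http://example.com/']]
print(results)     # three results
print(len(calls))  # but only two real checks
```

In a real run, you would wrap check_url the same way and call the wrapped function wherever the URL is checked.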

Now that we have this utility, we can check the PDF file. We will create four lists:

  • links: The internal PDF links in the file; for example, a reference to a section or figure.

  • badlinks: The internal links that target a missing destination (broken links).

  • urls: The links from the PDF to an external location; for example, a hyperlink to a web site.

  • badurls: The external links that target a missing destination (broken URLs).

Now for the PyPDF2 goodies. The following check_pdf function loops over the pages in the PDF file object. For each page, it walks through the annotations in the /Annots array. If an annotation has an action (/A) with a /D (destination) key, it is an internal link, so the function adds the destination to the links list.

If the action has a /URI key, it is an external link. The function checks each external link with the check_url function and updates the urls and badurls lists.

After checking each page, the function gets all the anchors in the PDF from the namedDestinations property and compares them to the list of internal links it just created. If a link has no matching anchor, that link belongs in the badlinks list.

def check_pdf(pdf):
    links = list()
    urls = list()
    badurls = list()

    for page in pdf.pages:
        obj = page.getObject()
        for annot in [x.getObject() for x in obj.get('/Annots', [])]:
            # Not every annotation carries an action; skip those that don't.
            if '/A' not in annot:
                continue
            action = annot['/A']
            dst = action.get('/D')
            url = action.get('/URI')
            if dst:
                links.append(dst)
            elif url:
                urls.append(url)
                result, reason = check_url(url)
                if not result:
                    badurls.append({'url': url, 'reason': '%r' % reason})

    # A list (rather than a keys view) lets us compare any destination type.
    anchors = list(pdf.namedDestinations.keys())
    badlinks = [x for x in links if x not in anchors]
    return links, badlinks, urls, badurls
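
To see the classification logic without a PDF on hand, you can exercise the same idea on plain dictionaries. The following is only a sketch: classify_annotations is a hypothetical helper, and the pages and anchors data are invented stand-ins for the objects that PyPDF2 would return. It skips the network check and only separates internal links from external URLs.

```python
def classify_annotations(pages, anchors):
    """Split link annotations into internal links and external URLs.

    `pages` is a list of dicts shaped like PDF page dictionaries;
    `anchors` is a collection of known named destinations.
    """
    links, urls = [], []
    for page in pages:
        for annot in page.get('/Annots', []):
            action = annot.get('/A', {})
            if action.get('/D'):
                links.append(action['/D'])
            elif action.get('/URI'):
                urls.append(action['/URI'])
    badlinks = [x for x in links if x not in anchors]
    return links, badlinks, urls


# Hypothetical data: three link annotations and one note without an action.
pages = [
    {'/Annots': [
        {'/Subtype': '/Link', '/A': {'/S': '/GoTo', '/D': 'section-1'}},
        {'/Subtype': '/Link', '/A': {'/S': '/URI', '/URI': 'http://example.com/'}},
    ]},
    {'/Annots': [
        {'/Subtype': '/Link', '/A': {'/S': '/GoTo', '/D': 'figure-9'}},
        {'/Subtype': '/Text'},
    ]},
]
anchors = ['section-1', 'section-2']

links, badlinks, urls = classify_annotations(pages, anchors)
print(links)     # ['section-1', 'figure-9']
print(badlinks)  # ['figure-9']
print(urls)      # ['http://example.com/']
```

Here figure-9 is reported as a bad link because no anchor with that name exists, which mirrors the comparison that check_pdf performs.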

Finally, make the code into a callable script that takes a single argument, the path to the PDF file. Then print the results of the check_pdf function on stdout.

if __name__ == '__main__':
    fname = sys.argv[1]
    print('Checking %s' % fname)
    pdf = PdfFileReader(fname)
    links, badlinks, urls, badurls = check_pdf(pdf)
    print('urls: %s' % urls)
    print('')
    print('bad links: %s' % badlinks)
    print('')
    print('bad urls: %s' % badurls)
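
If the script runs as part of an automated build, printed output alone may go unnoticed; a nonzero exit status lets the build fail when something is broken. The following is a minimal sketch, assuming the badlinks and badurls lists that check_pdf returns. The exit_status helper is hypothetical and not part of the original script.

```python
def exit_status(badlinks, badurls):
    """Return 0 when both lists are empty, 1 otherwise."""
    return 1 if (badlinks or badurls) else 0


# Hypothetical results, shaped like the lists check_pdf returns.
clean = exit_status([], [])
broken = exit_status(['figure-9'], [])
print(clean, broken)
```

At the end of the script you could then call sys.exit(exit_status(badlinks, badurls)).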

Test for Embedded Fonts

Test to make sure that the fonts used in the PDF file are embedded. If a font is not embedded, your PDF file may display differently on different machines, even if it is a font that is putatively “standard”, like Times Roman or Helvetica. To ensure that your PDF displays as intended on any machine, all fonts must be embedded.

In the following code, the walk function is a recursive function that takes a dictionary-like object (obj) and two sets (fnt and emb). It walks the given dictionary object: for every key in the given dictionary, the function calls itself on the corresponding value (if that value is a nested dictionary).

If the dictionary has a key called BaseFont, the value corresponding to that key is the name of a font used in the PDF; add that font name to the fnt set of fonts used.

If the dictionary has a key called FontName, the dictionary is a descriptor for that font, so check for another key in the same font descriptor dictionary that begins with FontFile (the key could be FontFile, FontFile2, or FontFile3). If that key exists, the font is embedded; add that font name to the set of fonts embedded.

If the two sets are not identical, there are unembedded fonts in the PDF.

fontkeys = set(['/FontFile', '/FontFile2', '/FontFile3'])

def walk(obj, fnt, emb):
    if '/BaseFont' in obj:
        fnt.add(obj['/BaseFont'])

    elif '/FontName' in obj and fontkeys.intersection(set(obj)):
        emb.add(obj['/FontName'])

    for k in obj:
        if hasattr(obj[k], 'keys'):
            walk(obj[k], fnt, emb)

    return fnt, emb
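
To get a feel for what walk collects, you can run it on an ordinary nested dictionary. The sketch below repeats the walk function so that it is self-contained; the resources dictionary is invented for illustration and only imitates the shape of a page’s /Resources entry, with one embedded font and one font that has no font file.

```python
fontkeys = set(['/FontFile', '/FontFile2', '/FontFile3'])

def walk(obj, fnt, emb):
    # Same logic as the walk function in the article.
    if '/BaseFont' in obj:
        fnt.add(obj['/BaseFont'])
    elif '/FontName' in obj and fontkeys.intersection(set(obj)):
        emb.add(obj['/FontName'])
    for k in obj:
        if hasattr(obj[k], 'keys'):
            walk(obj[k], fnt, emb)
    return fnt, emb


# Hypothetical /Resources dictionary: F1 is embedded, F2 is not.
resources = {
    '/Font': {
        '/F1': {
            '/Type': '/Font',
            '/BaseFont': '/ABCDEF+DejaVuSans',
            '/FontDescriptor': {
                '/FontName': '/ABCDEF+DejaVuSans',
                '/FontFile2': {'/Length': 1234},
            },
        },
        '/F2': {
            '/Type': '/Font',
            '/BaseFont': '/Helvetica',
            '/FontDescriptor': {'/FontName': '/Helvetica'},
        },
    },
}

fonts, embedded = walk(resources, set(), set())
unembedded = fonts - embedded
print(sorted(fonts))       # ['/ABCDEF+DejaVuSans', '/Helvetica']
print(sorted(unembedded))  # ['/Helvetica']
```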

Finally, make the code into a callable script that takes a single argument, the path to the PDF file.

Start with two empty sets, fonts and embedded. Open the file with PyPDF2. The library gives us access to the internal structure of the PDF. We loop over each page in the PDF, passing the page’s Resources dictionary to the walk function, described above. Add the corresponding results to the two sets and calculate the unembedded fonts by differencing the sets.

Print the fonts used in the PDF file and, if there are unembedded fonts, print their names as well. Of course, you can do anything you want with this information, such as saving it to a test database or printing a report.

if __name__ == '__main__':
    fname = sys.argv[1]
    pdf = PdfFileReader(fname)
    fonts = set()
    embedded = set()

    for page in pdf.pages:
        obj = page.getObject()
        f, e = walk(obj['/Resources'], fonts, embedded)
        fonts = fonts.union(f)
        embedded = embedded.union(e)

    unembedded = fonts - embedded
    print('Font List')
    pprint(sorted(fonts))
    if unembedded:
        print('\nUnembedded Fonts')
        pprint(unembedded)

Using PyPDF2 Methods

Obviously, the more you can specify about the PDFs you produce, the more you can test. For example, you may know that your PDF should have specific metadata, should be encrypted, contain a certain number of pages, and so on.

You can test for those conditions with the built-in tools that the PdfFileReader class in PyPDF2 provides. If you have a PdfFileReader instance, you can use the following properties for testing:

  • documentInfo: Returns the document metadata, such as author, creator, producer, subject, and title.

  • isEncrypted: Returns a Boolean value specifying whether the document is encrypted.

  • numPages: Returns the number of pages in the document.
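
As a rough illustration, these properties can be compared against the values you expect. In the sketch below, check_properties is a hypothetical helper, and fake_reader is a stand-in object (built with SimpleNamespace, so Python 3 only) that exposes the same three properties, which lets the example run without a PDF file. With a real document you would pass a PdfFileReader instance instead.

```python
from types import SimpleNamespace


def check_properties(reader, pages, title, encrypted=False):
    """Return a list of problems; an empty list means every check passed."""
    problems = []
    if reader.numPages != pages:
        problems.append('expected %d pages, found %d' % (pages, reader.numPages))
    if reader.isEncrypted != encrypted:
        problems.append('expected isEncrypted to be %r' % encrypted)
    if reader.documentInfo.title != title:
        problems.append('unexpected title: %r' % reader.documentInfo.title)
    return problems


# Hypothetical stand-in for a PdfFileReader instance.
fake_reader = SimpleNamespace(
    numPages=12,
    isEncrypted=False,
    documentInfo=SimpleNamespace(title='User Guide', author='Docs Team'),
)

print(check_properties(fake_reader, pages=12, title='User Guide'))
# []
print(check_properties(fake_reader, pages=10, title='User Guide'))
# ['expected 10 pages, found 12']
```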

Summary

If you produce PDF documents, you need to test them. The more you can specify about your PDFs, the more you can test. This article describes how you can test that the links (internal and external) are valid and that the fonts used in the document are embedded.

Get a copy of this code and play: Code from this article is available as a gist on GitHub.

Do you test other things in your own PDF documents? Leave a comment!
