Include HTML in LaTeX

Introduction

When I first started supporting multiple users of LaTeX in my day job, a question I often heard was “How do I include an HTML table in my document?” The answer was, of course, “It doesn’t work that way; you can’t do that.”

But I found that, with some restrictions, you actually can include HTML by converting it to PDF and including that in your LaTeX document.

In this article I’ll show how I do that and how you can include HTML pages that keep their original styles, colors, and so forth once they have been converted to PDF. If you need them, you can also keep the internal links from the HTML so that they work in your final PDF.

wkhtmltopdf

The tool wkhtmltopdf converts HTML to PDF. The wkhtmltopdf website summarizes it:

wkhtmltopdf and wkhtmltoimage are open source (LGPLv3) command line tools to render HTML into PDF and various image formats using the Qt WebKit rendering engine. These run entirely “headless” and do not require a display or display service.

There is also a C library, if you’re into that kind of thing.

It’s a great tool. Here’s the command I use (the trailing backslashes continue it across lines in a Unix shell; you can also type it all on one line):

/usr/local/bin/wkhtmltopdf -s Letter --no-background \
    --minimum-font-size 20 --no-outline --no-pdf-compression \
    --disable-external-links --enable-internal-links filename.html filename.pdf

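These options render the HTML at Letter page size with no background, leave out the PDF outline (bookmarks), turn off PDF compression, and keep internal links while disabling external ones. Adjust --minimum-font-size until the text matches your document’s font size.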

Run this and you will have a single PDF with possibly multiple pages, using the same style, colors, and links that are in the HTML.
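If you convert pages from a script rather than by hand, passing the options as a list avoids shell quoting and line-continuation concerns. The following is a minimal Python sketch, assuming wkhtmltopdf is on your PATH. The helper names build_command and convert_html are made up for this illustration; the options are the ones from the command above.

```python
import subprocess
from pathlib import Path


def build_command(html_path, font_size=20, executable="wkhtmltopdf"):
    """Return the wkhtmltopdf argument list for one HTML file (hypothetical helper)."""
    html = Path(html_path)
    pdf = html.with_suffix(".pdf")  # filename.html -> filename.pdf
    return [
        executable,
        "-s", "Letter",
        "--no-background",
        "--minimum-font-size", str(font_size),
        "--no-outline",
        "--no-pdf-compression",
        "--disable-external-links",
        "--enable-internal-links",
        str(html),
        str(pdf),
    ]


def convert_html(html_path, font_size=20):
    """Run the conversion; raises CalledProcessError if wkhtmltopdf exits with an error."""
    subprocess.run(build_command(html_path, font_size), check=True)
```

Calling convert_html("filename.html") should then write filename.pdf next to the source file. Keeping the font size as a parameter makes it easy to experiment until the rendering matches your document.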

If You Need to Create an Appendix

If you want the HTML you are including to be a separate chapter or appendix, use these options in place of --no-outline:

--outline-depth 2 toc --xsl-style-sheet wk_toc.xsl --toc-header-text "Appendix A"

This asks for a table of contents using an XSL stylesheet with the specified header text. Get my version of the wk_toc.xsl stylesheet here:

gist.github.com/tiarno

Alternatively, you can generate the default stylesheet yourself and, if you know a little XSL, modify it for your needs. wkhtmltopdf prints it to standard output, so redirect it into a file:

wkhtmltopdf --dump-default-toc-xsl > wk_toc.xsl

You can read more about the options in the wkhtmltopdf help documentation.

If You Need to Preserve Internal Links

Skip this section if you don’t need to preserve internal links in the original HTML.

If you do need internal links, you’ll find that there is already a package to help with that, the pax package. From the documentation:

If PDF files are included using pdfTeX, PDF annotations are stripped. The pax project offers a solution without altering pdfTeX. A Java program (pax.jar) parses the PDF file that will later be included. The program then writes the data of the annotations into a file that can be read by TeX.

You can find it on CTAN.

The pax package retains the internal links from an included PDF. Unfortunately, wkhtmltopdf writes its PDFs in a way that is incompatible with the Java program mentioned in the pax documentation.

I mention it here to save you from going down the rabbit hole of trying to get it to work. You still need the pax LaTeX package itself, but although the Java jar file works well for other PDFs, PDFs from wkhtmltopdf need a different solution.

Enter PAXMaker.

PAXMaker

This little tool is available on my GitHub repository: PAXMaker

It requires the PyPDF2 Python package. As described in that repository, you use it as follows:

1. Convert your HTML file to PDF with wkhtmltopdf.
2. Run paxmaker.py on the PDF (the program takes a single argument, the name of the PDF file).
3. In the LaTeX file where you want the PDF inserted, load the pdfpages and pax packages, and include the PDF as described in the next section.

The paxmaker.py script does for wkhtmltopdf-generated PDFs what the pax Java program does for other PDFs: it writes a .pax file that the LaTeX pax package reads when you compile your final document.
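When you have several converted PDFs, the PAXMaker step can be scripted as well. This is a rough sketch, assuming paxmaker.py sits in the current directory and takes the PDF name as its only argument, as described above. The helper names paxmaker_command and write_pax_files are hypothetical and not part of PAXMaker.

```python
import subprocess
import sys


def paxmaker_command(pdf_path, script="paxmaker.py"):
    """Argument list that runs paxmaker.py on one PDF (the script location is an assumption)."""
    return [sys.executable, script, str(pdf_path)]


def write_pax_files(pdf_paths, script="paxmaker.py"):
    """Run paxmaker.py once per PDF, stopping at the first failure."""
    for pdf_path in pdf_paths:
        subprocess.run(paxmaker_command(pdf_path, script), check=True)
```

Using sys.executable runs the script with the same Python interpreter as the caller, which is usually the one that has PyPDF2 installed.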

pdfpages

So far you have generated a PDF file from the HTML, possibly preserving internal links, and possibly creating it as an appendix. Now let’s get that generated PDF into your LaTeX document.

You include external PDFs in a LaTeX document with the pdfpages package, by Andreas Matthias. Get it on CTAN.

This package simplifies the insertion of external multi-page PDF documents into LaTeX documents. Pages can be freely selected and similar to psnup it is possible to put several logical pages onto each sheet of paper. Furthermore a lot of hypertext features like hyperlinks and article threads are provided. This package supports pdfTeX, VTeX, and XeTeX. With VTeX it is even possible to use this package to insert PostScript files, in addition to PDF files.

Then, at the point where you want the original HTML pages to appear, use this command:

\IfFileExists{filename.pdf}
    {\includepdf[pages=-,addtotoc={1,subsection,2,title,htmllabel}]{filename}} 
    {\par\textcolor{red}{MISSING FILE!}}

The \IfFileExists command executes its first argument if the file exists or its second argument if it doesn’t exist.

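So if the file is present, you invoke the \includepdf command from the pdfpages package. The option pages=- includes all pages. The addtotoc option takes five comma-separated values: the page of the included PDF that the entry points to (1), the sectioning level (subsection), its depth (2), the heading text shown in the table of contents (title), and a label you can use with \ref or \pageref (htmllabel).

If the file does not exist, LaTeX puts a paragraph in your document with a big red “MISSING FILE!” warning. The \textcolor command requires the color or xcolor package.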
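If you include many HTML pages, typing this block once per file gets tedious. One option is to generate the blocks and \input the result. The sketch below is only an illustration: include_snippet is a hypothetical helper, and it assumes that titles and labels contain no commas or other characters that are special to LaTeX.

```python
def include_snippet(pdf_stem, title, label, level="subsection", depth=2):
    """Return the \\IfFileExists block for one converted PDF (hypothetical helper)."""
    template = (
        "\\IfFileExists{%s.pdf}\n"
        "    {\\includepdf[pages=-,addtotoc={1,%s,%d,%s,%s}]{%s}}\n"
        "    {\\par\\textcolor{red}{MISSING FILE!}}\n"
    )
    # Field order follows addtotoc: page, section name, depth, heading, label.
    return template % (pdf_stem, level, depth, title, label, pdf_stem)


def include_all(entries):
    """Join the blocks for a list of (pdf_stem, title, label) tuples."""
    return "\n".join(include_snippet(*entry) for entry in entries)
```

Writing the string returned by include_all to a file such as htmlpages.tex (an arbitrary name) lets you pull every page into your document with a single \input.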

Summary

You can include HTML in your LaTeX document as follows:

1. Convert your HTML to PDF with wkhtmltopdf.
2. If you need internal links, use PAXMaker to write a helper .pax file, and load the pax package in your LaTeX document.
3. Use the pdfpages LaTeX package to include the PDF rendering of the HTML file.
4. Compile your final LaTeX document.

This is a powerful method to include possibly many HTML pages into a LaTeX document. You can use CSS with the HTML to make the included pages match your document.