Splitting Two Column PDFs Using Python

Oct. 15, 2017, 11:29 p.m.

I find it more pleasant to read long documents on the screen of my mobile phone than on my laptop. However, pdf files are usually designed for A4 or Letter sized pages and do not reflow the text, which makes them very cumbersome to view on a small screen.

Many academic papers split the document into two columns. This usually makes the text legible if you zoom into one column, but it still involves a lot of cumbersome zooming and panning every time you reach the end of a column. The view also tends to shift sideways every time you try to scroll down.

To fix this frustration, I created a very simple script that chops up a pdf document into its two separate columns. It saves each column as a separate image and then embeds all those images into a single HTML page that stacks the columns one on top of the other. This allows you to just scroll down continuously, never having to pan sideways.

The script also allows you to chop a portion off the left, right, top and bottom edges of the document so that you get a nice tight fit when viewing on your screen. I should emphasize that it is a very basic script. You will need to do some trial and error to find the settings that work well for the document you are reading and the screen you plan on viewing it on. It also naively chops every page in half at the horizontal centre. So if there is any content that spans more than one column, e.g. the abstract, or images and diagrams, it too will be split in half indiscriminately.
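To make the trimming and the naive split concrete, here is a minimal sketch of the arithmetic involved, written without any image library. The function name column_boxes and the page size used in the example are my own assumptions for illustration, they are not part of the script.

```python
def column_boxes(width, height, left=0, right=0, top=0, bottom=0):
    """Return (x0, y0, x1, y1) pixel boxes for the left and right columns.

    Hypothetical helper for illustration only. The margins are given in
    pixels and are trimmed from the outer edges of the page.
    """
    mid = width // 2  # naive split: exactly half the page width
    left_box = (left, top, mid, height - bottom)
    right_box = (mid, top, width - right, height - bottom)
    return left_box, right_box


# Assumed page size: roughly an A4 page rendered at 100 dpi.
left_box, right_box = column_boxes(827, 1169, left=50, right=50, top=40, bottom=40)
print(left_box)   # (50, 40, 413, 1129)
print(right_box)  # (413, 40, 777, 1129)
```

Note that the split point depends only on the page width, not on the margins. If the left and right margins of the document are very different, the columns will not be centred in their images.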

Code

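Dependencies: the Wand library (imported below as wand.image), which in turn needs ImageMagick to be installed. Reading pdf files through ImageMagick generally also requires Ghostscript.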

#! /usr/bin/python
"""
# ==============================================================================
# SPLIT TWO COLUMN PDF INTO HTML PAGE
# Author: Ronny Restrepo
# Origin: http://ronny.rest/blog/post_2017_10_15_two_column_pdfs
# License: MIT License
# ==============================================================================
"""
from __future__ import print_function, division, unicode_literals
import wand.image
import os

# ==============================================================================
#                                                                 PDF2SPLIT_HTML
# ==============================================================================
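def pdf2split_html(pdf, saveto, left=0, right=0, top=0, bottom=0, res=100):
    """Split each page of a two column pdf into a left and a right column
    image, and write an index.html in `saveto` that stacks them vertically.
    left, right, top, bottom are the number of pixels to trim from each edge.
    res is the resolution used to render the pdf pages.
    """
    print("- Opening pdf file: ", pdf)
    with wand.image.Image(filename=pdf, resolution=res) as document: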
        print("- getting pages")
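        pages = document.sequence
        # Page dimensions are taken from the first page, so all pages are
        # assumed to be the same size.
        width, height, _, _ = pages[0].page
        mid = width // 2  # naive split point: the horizontal centre of the page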
        html = []

        print("- creating output dir")
        if not os.path.exists(saveto):
            os.makedirs(saveto)

        print("- splitting pages")
        for i, page in enumerate(pages):
            left_side = page[left:mid, top:height-bottom]
            right_side = page[mid:width-right, top:height-bottom]
            left_side.save(filename=os.path.join(saveto, "{:03d}_a.jpg".format(i)))
            right_side.save(filename=os.path.join(saveto, "{:03d}_b.jpg".format(i)))

            # Append these two images to the html page
            html.append("<img src='{0:03d}_a.jpg'/><br><img src='{0:03d}_b.jpg'/><br>".format(i))

        print("- creating html page")
        with open(os.path.join(saveto, "index.html"), mode = "w") as textFile:
            html = "\n".join(html)
            textFile.write(html)
        print("- DONE!")

if __name__ == '__main__':
    import argparse
    p = argparse.ArgumentParser(description="Converts a 2-column pdf document to an html with images of each column stacked on top of each other")
    p.add_argument("pdf", type=str, help="Input PDF file")
    p.add_argument("out", type=str, help="Output path for document")
    p.add_argument("--res", type=int, default=150, help="resolution to load pdf images as")
    p.add_argument("-l", "--left", type=int, default=0, help="Trim this much from the left border")
    p.add_argument("-r", "--right", type=int, default=0, help="Trim this much from the right border")
    p.add_argument("-t", "--top", type=int, default=0, help="Trim this much from the top border")
    p.add_argument("-b", "--bottom", type=int, default=0, help="Trim this much from the bottom border")
    opt = p.parse_args()

    pdf2split_html(opt.pdf,
                   saveto=opt.out,
                   res=opt.res,
                   left=opt.left,
                   right=opt.right,
                   top=opt.top,
                   bottom=opt.bottom
                   )

Running:

Assuming you saved the script as pdf2split_html.py, you can run it with the default values as follows:

python pdf2split_html.py "/path/to/doc.pdf" "/path/to/output"

The script creates a new directory "/path/to/output", with the following structure (where NNN is the index of the last page, counting from zero, so it is one less than the total number of pages in the pdf document):

/path/to/output
    - index.html
    - 000_a.jpg
    - 000_b.jpg
    - 001_a.jpg
    - 001_b.jpg
    ...
    - NNN_a.jpg
    - NNN_b.jpg
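If you want to check that a run produced everything it should, the naming scheme above can be reproduced in a few lines. This is only a sketch: expected_files and missing_files are hypothetical helpers of my own, not functions from the script, and they assume the zero-based, three digit naming that the script uses when saving images.

```python
import os


def expected_files(n_pages):
    """List the file names the script should produce for n_pages pages."""
    names = ["index.html"]
    for i in range(n_pages):  # pages are numbered from zero
        names.append("{:03d}_a.jpg".format(i))  # left column
        names.append("{:03d}_b.jpg".format(i))  # right column
    return names


def missing_files(saveto, n_pages):
    """Return the expected file names that do not exist in saveto."""
    return [name for name in expected_files(n_pages)
            if not os.path.exists(os.path.join(saveto, name))]


print(expected_files(2))
# ['index.html', '000_a.jpg', '000_b.jpg', '001_a.jpg', '001_b.jpg']
```

For a ten page document the last image is 009_b.jpg, since the numbering starts at 000.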

Additional arguments you can pass to the script:

--res 300  # set the resolution to 300 (the command line default is 150)
-l 100     # clip 100 pixels from the left margin
-r 100     # clip 100 pixels from the right margin
-t 100     # clip 100 pixels from the top margin
-b 100     # clip 100 pixels from the bottom margin
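Because the margins are given in pixels, the right values change whenever you change --res. The sketch below converts a margin measured in inches into pixels, on the assumption that the resolution is in dots per inch (which is how Wand documents its resolution argument). The helper margin_px is hypothetical and not part of the script.

```python
def margin_px(inches, res):
    """Convert a margin in inches to pixels at `res` dots per inch.

    Hypothetical helper for illustration; assumes res is in dpi.
    """
    return int(round(inches * res))


# The same half inch margin at three different resolutions.
for res in (100, 150, 300):
    print(res, margin_px(0.5, res))
# 100 50
# 150 75
# 300 150
```

So if you find margins that work at one resolution and then change --res, scaling the margins by the same factor should be a reasonable starting point for the next round of trial and error.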

Example

These are the settings I used on the SegNet academic paper (arXiv:1511.00561) to get a tight fit.

python pdf2split_html.py /tmp/1511.00561.pdf /tmp/segnet_paper --res 300 -l 200 -r 180 -t 100 -b 100
