October 2009 – stephenhucker.com

PDFMiner is a suite of Python programs that helps extract and analyze text data from PDF documents. Unlike other PDF-related tools, it can report the exact location of text on a page, along with extra information such as fonts or ruled lines. It includes a PDF converter that can transform PDF files into other text formats (such as HTML). It also has an extensible PDF parser that can be used for purposes other than text analysis.

I downloaded a PDF from work that has my call stats in tabular form, and PDFMiner converted it without trouble.

[25] 21:56:47–> pdf2txt.py -t html test.pdf > test.htm
/Library/Python/2.6/site-packages/pdfminer-20091004-py2.6.egg/pdfminer/pdfparser.py:8: DeprecationWarning: the md5 module is deprecated; use hashlib instead
  import md5, struct
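
The warning comes from PDFMiner's own pdfparser.py rather than from my command, and the conversion still ran. For reference, this is roughly what the suggested switch to hashlib looks like. It is only a small sketch of the standard-library call, not a patch to PDFMiner, and the input bytes are just an example.

```python
import hashlib

# Deprecated Python 2 style, as flagged by the warning:
#     import md5
#     digest = md5.new(data).hexdigest()

# Replacement using hashlib, which provides the same MD5 algorithm.
data = b"hello"  # example input only
digest = hashlib.md5(data).hexdigest()
print(digest)
```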

My project is to:
1. Email my weekly stats from work.
2. Write a Python program to read the data into a temporary text file.
3. Extract the data into a SQLite database.
4. Produce graphs of my average call times.
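
Steps 2 to 4 could be sketched along these lines. This is a rough outline under assumptions: the column layout (date, number of calls, average seconds), the sample rows, the table name call_stats, and the function names are all made up for illustration, since the real report's format is not shown here. It starts from text that has already been extracted from the PDF and stops at computing the average, leaving the graphing out.

```python
import sqlite3

# Made-up sample of what the extracted text might look like.
# The real report's columns are an assumption.
SAMPLE_TEXT = """\
Date        Calls  AvgSeconds
2009-10-05  42     310
2009-10-06  38     295
2009-10-07  45     325
"""

def parse_rows(text):
    """Yield (date, calls, avg_seconds) tuples, skipping the header line."""
    for line in text.splitlines()[1:]:
        parts = line.split()
        if len(parts) != 3:
            continue  # ignore blank or malformed lines
        date, calls, avg_seconds = parts
        yield date, int(calls), int(avg_seconds)

def load_into_sqlite(conn, rows):
    """Create the (hypothetical) call_stats table and store the rows."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS call_stats ("
        "day TEXT PRIMARY KEY, calls INTEGER, avg_seconds INTEGER)"
    )
    conn.executemany(
        "INSERT OR REPLACE INTO call_stats VALUES (?, ?, ?)", rows
    )
    conn.commit()

def weighted_average_seconds(conn):
    """Average call time over all days, weighted by calls per day."""
    total_seconds, total_calls = conn.execute(
        "SELECT SUM(calls * avg_seconds), SUM(calls) FROM call_stats"
    ).fetchone()
    return total_seconds / float(total_calls)

# An in-memory database keeps the sketch self-contained; a file path
# would be used for the real thing.
conn = sqlite3.connect(":memory:")
load_into_sqlite(conn, parse_rows(SAMPLE_TEXT))
print(weighted_average_seconds(conn))
```

Weighting by the number of calls keeps a quiet day from counting as much as a busy one. A plain average of the daily averages would be simpler if that is all the graph needs.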

