October 2009

PDF to text in Python

PDFMiner is a suite of Python programs that help extract and analyze text data from PDF documents. Unlike other PDF-related tools, it lets you obtain the exact location of text on a page, as well as extra information such as font details or ruled lines. It includes a PDF converter that can transform PDF files into other text formats (such as HTML). It also has an extensible PDF parser that can be used for purposes other than text analysis.

I downloaded a PDF from work that has my call stats in tabular form, and ran PDFMiner's pdf2txt.py on it. It worked great.

[25] 21:56:47–> pdf2txt.py -t html test.pdf > test.htm
/Library/Python/2.6/site-packages/pdfminer-20091004-py2.6.egg/pdfminer/pdfparser.py:8: DeprecationWarning: the md5 module is deprecated; use hashlib instead
  import md5, struct
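
The warning is harmless, but it does point at the fix: hashlib replaces the old md5 module. How PDFMiner uses md5 internally isn't shown here, so this is only a sketch of the general replacement pattern, written for Python 3; the helper name md5_digest is made up for illustration.

```python
import hashlib


def md5_digest(data: bytes) -> bytes:
    """Return the 16-byte MD5 digest of data using hashlib."""
    # Old, deprecated style (Python 2 only): md5.new(data).digest()
    return hashlib.md5(data).digest()


# Print the digest of an empty byte string as hex.
print(md5_digest(b"").hex())
```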

My project is to:
1. Email my weekly stats from work.
2. Write a Python program that reads the data into a temporary text file.
3. Load the data into a SQLite database.
4. Produce graphs of my average call times.
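
Steps 2 to 4 could look roughly like the sketch below. It assumes the extracted text boils down to one call per line, as a date followed by a duration in seconds; that layout, the table name calls, and all the function names are my own placeholders, not what the real PDF contains. It stops at computing the averages, since the plotting depends on whichever graphing library gets picked.

```python
import sqlite3


def parse_rows(text):
    """Yield (day, seconds) pairs from lines like '2009-10-05 120'."""
    for line in text.splitlines():
        parts = line.split()
        if len(parts) != 2:
            continue  # blank or unexpected line
        day, secs = parts
        try:
            yield day, float(secs)
        except ValueError:
            continue  # header row or malformed number


def load_calls(conn, rows):
    """Create the (hypothetical) calls table and insert the parsed rows."""
    conn.execute("CREATE TABLE IF NOT EXISTS calls (day TEXT, seconds REAL)")
    conn.executemany("INSERT INTO calls (day, seconds) VALUES (?, ?)", rows)
    conn.commit()


def average_by_day(conn):
    """Return a list of (day, average seconds) tuples, ordered by day."""
    cur = conn.execute(
        "SELECT day, AVG(seconds) FROM calls GROUP BY day ORDER BY day"
    )
    return cur.fetchall()


# Made-up sample data standing in for the text extracted from the PDF.
sample = """day seconds
2009-10-05 120
2009-10-05 180
2009-10-06 90
"""

conn = sqlite3.connect(":memory:")  # use a file path to keep the data
load_calls(conn, parse_rows(sample))
averages = average_by_day(conn)
print(averages)
```

The (day, average) pairs from average_by_day are what would be handed to the graphing step.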
