robust pdf title extraction

I end up with a lot of PDF documents lying around – at last count, a few thousand files. Unfortunately, most of these documents have rather obscure names, which makes it annoying to find what I want, or what might be interesting. For example, these are the documents I’ve recently downloaded:

wodet3-paper12.pdf
jong_afst.pdf
tut_gpu_2012_03.pdf
lecture1-1.pdf
natella_binary_sfi_edcc_2012.pdf
TR-Farrukh-58.pdf
730959.pdf
NLSEmagic_Paper.pdf
M23584378H1770Q2.pdf
G89T37P10W263075.pdf
journal_online.pdf
manus_Jour-INFORMATION-Camera.pdf
12011.VitekJan.Paper.pdf
R3X8722476T2X278.pdf
1203.0321.pdf

I previously tried to organize everything using something like Papers, which is a lovely product, but still required effort from me and isn’t very useful now that I no longer have a Mac.

I’ve also tried to rectify this situation via half-hearted attempts at using pdftotext and grabbing the first 10 words of text, but more often than not I was left with even more incomprehensible garbage.

Today, I had some spare time and far too much interest in this problem, and I managed to come up with an easy and fairly effective solution. It also resembles a Rube Goldberg machine. After digging around for various PDF conversion utilities, I discovered that pdftohtml not only generated reasonable output, but it could also be set to output to an easily parsed XML format. From there it was a simple bit of BeautifulSoup to get nice titles for most of my documents:
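Something along these lines works for me – a minimal sketch, assuming pdftohtml (from poppler-utils) is on the PATH and the bs4 package is installed; the guess_title helper and the exact flags are my own choices, and the heuristic is simply "the title is whatever text uses the largest font on the first page":

```python
#!/usr/bin/env python
# Sketch: guess a PDF's title by taking the largest-font text on page 1.
# Assumes pdftohtml (poppler-utils) is installed and on the PATH.
import subprocess
import sys

from bs4 import BeautifulSoup


def guess_title(pdf_path):
    # Convert only the first page to pdftohtml's XML format, on stdout.
    xml = subprocess.check_output(
        ['pdftohtml', '-xml', '-stdout', '-i', '-f', '1', '-l', '1', pdf_path])
    soup = BeautifulSoup(xml, 'html.parser')

    # Map each fontspec id to its point size.
    font_sizes = {
        spec['id']: float(spec['size']) for spec in soup.find_all('fontspec')
    }

    texts = soup.find_all('text')
    if not texts or not font_sizes:
        return None

    # Keep the text runs set in the largest font and join them together.
    biggest = max(font_sizes.values())
    title_words = [
        t.get_text(strip=True) for t in texts
        if font_sizes.get(t.get('font')) == biggest
    ]
    return ' '.join(w for w in title_words if w) or None


if __name__ == '__main__':
    for path in sys.argv[1:]:
        print(path, '->', guess_title(path))
```

Run against a directory of papers (e.g. `python guess_title.py *.pdf`), this turns most of the cryptic filenames above into something human-readable; documents whose title is an image, or that use the same font size throughout, still need a manual fallback.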
