email habits over time

I was curious if my sleeping/waking habits had really changed over the years – I definitely don’t feel I work as late now as when I was 22, but it’s hard to tell. To test this, I looked over all of the timestamps of mail I’ve sent in the past few years and tried to make a pretty graph.

I’m not sure how meaningful it is, but thanks to ggplot, it is pretty, at least.

The plotting code is straightforward — try it out!
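The data-gathering step described above can be sketched in Python. This is only a rough sketch, not the original code: it assumes the sent mail is available as an mbox file, and the function names (`dates_from_mbox`, `sent_hours`, `hourly_histogram`) are made up for illustration. It reduces each `Date:` header to a year and an hour of day, which is roughly what a plot of sending habits needs.

```python
import mailbox
from collections import Counter
from email.utils import parsedate_to_datetime


def dates_from_mbox(path):
    """Collect the raw Date: headers from an mbox file (path is hypothetical)."""
    return [msg["Date"] for msg in mailbox.mbox(path) if msg["Date"]]


def sent_hours(date_headers):
    """Turn raw Date: header strings into (year, hour_of_day) pairs.

    The hour is taken in the timezone recorded in the header, i.e. the
    sender's local time when the mail was written.
    """
    rows = []
    for raw in date_headers:
        try:
            when = parsedate_to_datetime(raw)
        except (TypeError, ValueError):
            continue  # skip malformed or missing dates
        rows.append((when.year, when.hour + when.minute / 60.0))
    return rows


def hourly_histogram(rows):
    """Count messages per (year, whole hour) bucket."""
    return Counter((year, int(hour)) for year, hour in rows)
```

The resulting rows can be written out as CSV and handed to ggplot, or anything else that draws a scatter plot of hour against year.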

robust pdf title extraction

I end up with a lot of PDF documents lying around – at last glance, this amounted to a few thousand files. Unfortunately, most of these documents end up with obscure names, which makes it annoying to find what I want, or what is interesting. For example, these are the documents I’ve recently downloaded:

wodet3-paper12.pdf
jong_afst.pdf
tut_gpu_2012_03.pdf
lecture1-1.pdf
natella_binary_sfi_edcc_2012.pdf
TR-Farrukh-58.pdf
730959.pdf
NLSEmagic_Paper.pdf
M23584378H1770Q2.pdf
G89T37P10W263075.pdf
journal_online.pdf
manus_Jour-INFORMATION-Camera.pdf
12011.VitekJan.Paper.pdf
R3X8722476T2X278.pdf
1203.0321.pdf

I previously tried to organize everything using something like Papers, which is a lovely product, but still required effort from me and isn’t very useful now that I no longer have a Mac.

I’ve also tried to rectify this situation via half-hearted attempts at using pdftotext and grabbing the first 10 words of text, but more often than not I was left with more incomprehensible garbage.

Today I had some spare time and far too much interest in this problem, and I managed to come up with an easy and fairly effective solution. It also resembles a Rube Goldberg machine. After digging around for various PDF conversion utilities, I discovered that pdftohtml not only generates reasonable output, but can also be set to output an easily parsed XML format. From there it was a simple bit of BeautifulSoup to get nice titles for most of my documents:
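The pipeline can be sketched roughly as follows. This is a minimal sketch under several assumptions and not the original script: it uses the standard library's `xml.etree.ElementTree` in place of BeautifulSoup to stay dependency-free, it assumes the poppler version of pdftohtml (whose XML output has `<fontspec>` and `<text font="...">` elements), and it guesses that the title is whatever text on the first page is set in the largest font. The function names `pdf_to_xml` and `guess_title` are invented for illustration.

```python
import subprocess
import xml.etree.ElementTree as ET


def pdf_to_xml(path):
    """Run pdftohtml on the first page only and return its XML output."""
    # -xml: XML output, -i: skip images, -stdout: write to standard output,
    # -f 1 -l 1: restrict the conversion to page 1.
    cmd = ["pdftohtml", "-xml", "-i", "-stdout", "-f", "1", "-l", "1", path]
    return subprocess.run(cmd, capture_output=True, check=True).stdout


def guess_title(xml_data):
    """Guess a title: the first-page text set in the largest font (a heuristic)."""
    root = ET.fromstring(xml_data)
    page = root.find("page")
    if page is None:
        return None
    # Map each font id to its point size.
    sizes = {f.get("id"): int(f.get("size", "0")) for f in root.iter("fontspec")}
    runs = []
    for node in page.iter("text"):
        # itertext() also picks up text nested in <b> or <i> tags.
        text = " ".join("".join(node.itertext()).split())
        if text:
            runs.append((sizes.get(node.get("font"), 0), text))
    if not runs:
        return None
    biggest = max(size for size, _ in runs)
    # Keep document order so multi-line titles are joined sensibly.
    return " ".join(text for size, text in runs if size == biggest)
```

Calling `guess_title(pdf_to_xml("1203.0321.pdf"))` would then return a candidate title for one of the files listed above. The heuristic fails on scanned PDFs with no text layer, and on pages where a logo or journal banner is larger than the title.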

tcp timelines with ggplot2

I’ve come across the need to analyze TCP flows from time to time, and while scripts like flowtime and EasyTimeline are nice, they aren’t really, well, pretty. ggplot2, on the other hand, is, and it turns out to be really easy to get nice, somewhat useful plots.

Here’s an example conversation between my local browser and nytimes.com: (warning, gigantic)

You can easily see the importance of fast DNS resolution, with almost 2 seconds of time spent idle waiting for the first resolver hit. Then we see a large number of connections opened up, as modern browsers and sites try to work around the small TCP initial congestion window. Finally there’s the petering out of the connections and the final FIN packets as the browser finishes the page.

It’s at least slightly more informative than staring at wireshark dumps, and it provides another excuse to practice my R. The code is pretty straightforward, and mostly dedicated to munging the tshark field output to make streams show up in a reasonable way:
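The original code is in R; as a stand-in, here is a small Python sketch of the munging step only. It is an illustration under assumptions rather than the script itself: it expects tshark to have been run with something like `tshark -r capture.pcap -T fields -e frame.time_relative -e tcp.stream -E separator=,` (the capture file name is hypothetical), and the function name `stream_spans` is made up. It collapses the per-packet rows into one row per TCP stream, which is the shape a timeline plot needs.

```python
import csv


def stream_spans(lines):
    """Collapse per-packet rows into (stream, start, end, packets) tuples.

    Each input line is "<frame.time_relative>,<tcp.stream>". The result is
    ordered by each stream's first packet, so streams stack up in the order
    they were opened.
    """
    spans = {}
    for row in csv.reader(lines):
        if len(row) < 2 or not row[0] or not row[1]:
            continue  # non-TCP frames have an empty tcp.stream field
        t, stream = float(row[0]), int(row[1])
        start, end, count = spans.get(stream, (t, t, 0))
        spans[stream] = (min(start, t), max(end, t), count + 1)
    return [(s, *spans[s]) for s in sorted(spans, key=lambda s: spans[s][0])]
```

Each tuple maps naturally onto one horizontal segment in a plot, from `start` to `end` at a height given by the stream's position in the list.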
