apt-get build-dep

This could have saved me a lot of time building packages from source, had I known about it earlier. It’s mentioned on the Firefox build page:

apt-get build-dep mypackage

Fetches and installs the build dependencies for a package. So much easier than repeating this over and over:

./configure
...
error: missing libFoo

sudo apt-get install libfoo-dev

./configure
...
error: missing libBar
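
One gotcha: build-dep needs source-package entries enabled in your apt configuration. A minimal sketch, assuming an Ubuntu-style /etc/apt/sources.list where the deb-src lines are present but commented out:

# enable the deb-src entries, refresh the index, then grab the build deps
sudo sed -i 's/^# deb-src/deb-src/' /etc/apt/sources.list
sudo apt-get update
sudo apt-get build-dep mypackage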

quickcite: the lazy academic’s friend

Every academic is familiar with the joy of maintaining correct citations: hunting down the paper you meant to cite, constructing the BibTeX entry, naming it, then fixing up the three other places in your paper where you referenced the same paper under a slightly different key. It’s an annoyance.

QuickCite relieves some of that annoyance by looking up references for you and updating your .bib file automatically. It slips handily into your Makefile:

paper.pdf: ...
    quickcite -b paper.bib *.tex
    latex ...

Any missing citation entries are automatically resolved (currently, just via DBLP). For example, somewhere in my TeX file I have:

This was a really awesome paper \cite{PowerPiccolo}.

But this was even better \cite{Mapreduce}.

QuickCite will hunt down the appropriate entries for you:

quickcite -b paper.bib paper.tex
...
Missing reference for PowerPiccolo
Result to use for {PowerPiccolo}:
  (0) Skip this citation
  (1) A novel fuzzy system for wind turbines reactive power control.
      Geev Mokryani, Pierluigi Siano, Antonio Piccolo, Vito Calderaro, Carlo Cecati
  (2) Piccolo: Building Fast, Distributed Programs with Partitioned Tables.
      Russell Power, Jinyang Li
  (3) Impact of Chosen Error Criteria in RSS-based Localization: Power vs Distance vs Relative Distance Error Minimization.
      Giuseppe Bianchi, Nicola Blefari-Melazzi, Francesca Lo Piccolo

Missing reference for Mapreduce
Result to use for {Mapreduce}:
...

That’s it! Your BibTeX file will be updated with the appropriate entries, so you won’t be queried again.
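
For instance, picking option (2) above should leave an entry along these lines in paper.bib (a hand-written sketch; the exact fields DBLP returns will differ):

@inproceedings{PowerPiccolo,
  author    = {Russell Power and Jinyang Li},
  title     = {Piccolo: Building Fast, Distributed Programs with Partitioned Tables},
  booktitle = {OSDI},
  year      = {2010}
}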

The source is available on GitHub.

Making HDFS balancing faster

HDFS by default likes to stash one copy of each block on the machine it originated from. This is nice in that you avoid a network copy, but not so great when all of your data originates from a single server (you end up wildly unbalanced).

Thankfully, a balancing mechanism is provided with Hadoop, though it is mysteriously not enabled by default. You can start it running in the background via:

/path/to/hadoop/dir/bin/start-balancer.sh -threshold 5

The threshold argument tells the balancer to keep going until each datanode’s disk utilization is within 5% of the cluster-wide average.
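
When the cluster is balanced, or you simply want it to stop, the matching stop script shuts the daemon down:

/path/to/hadoop/dir/bin/stop-balancer.sh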

An important detail that I missed originally was the dfs.balance.bandwidthPerSec setting. By default this limits the amount of bandwidth used for balancing to 1MB/s! No wonder my balancing went so slowly. Setting this to a more aggressive 100MB/s in hdfs-site.xml:

<property>
  <name>dfs.balance.bandwidthPerSec</name>
  <value>100000000</value>
</property>

Greatly reduces the amount of time it takes to balance.
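
If you’re running a more recent Hadoop release, you may also be able to push a new limit to the datanodes at runtime, with no restart required; a quick sketch, assuming your version ships the dfsadmin -setBalancerBandwidth command:

# raise the balancer bandwidth cap to ~100MB/s (the value is in bytes/sec)
/path/to/hadoop/dir/bin/hadoop dfsadmin -setBalancerBandwidth 100000000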

running a pdf crawler with heritrix

I’ve used the Heritrix web crawler quite a few times in the past.  It’s a great piece of software, and it has enough features to handle most crawling tasks with ease.

Recently, I wanted to crawl a whole bunch of PDFs, and since I didn’t know where the PDFs were going to come from, Heritrix seemed like a natural fit to help me out.  I’ll go over some of the less intuitive steps:

Download the right version of the crawler

That is to say, version 1.*.  Version 2 seems to have been abandoned, and version 3 has not yet caught up to the feature set of version 1 (not to mention, the user interface seems to have gone downhill).

For your convenience, here’s a link to the download page.

Make sure you’re rejecting almost everything

You almost certainly don’t want all the web has to offer.  You only want a tiny fraction of it.  For instance, I use a MatchesRegexpDecideRule to drop media and other bulky content with the following expression:

.*\.(jpg|jpeg|gif|png|mpg|mpeg|txt|css|js|ppt|JPG|tar\.gz|flv|MPG|zip|exe|avi|tvd)$

Similarly, you’ll want to drop pesky calendar-like applications:

.*(calendar|/api|lecture).*

And any dynamic pages (anything with a ? in the URL) that want to suck up your bandwidth:

.*\?.*

Save only what you need

Heritrix has the nice property of allowing decision rules to be placed almost anywhere, including just before a file gets written to disk. To avoid writing files you’re uninterested in, you can request that only certain MIME types are allowed through: add a default reject rule, and then accept only the files you want (in my case, PDF or PostScript files):

.*(pdf|postscript).*

Regular expressions are full, not partial matches

You need to ensure your regular expression matches the entire item, not just part of it. This means prepending and appending

.*

to your normal patterns.
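
For instance, a rule meant to catch any URL mentioning a lecture page has to be wrapped on both sides, as in the patterns above (a toy example):

.*lecture.*

A bare lecture would fail to match, since it only covers part of the URL.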

If you’re feeling lazy, you can download the crawl order I used and use it as a base for your crawl. Good luck!

java profiling

The Java profiling world can be a somewhat arcane maze of GUIs, most of which seem to make things more complex.

Fortunately, it’s actually quite simple to get a usable, sample-based CPU profile from any modern JVM. Simply run your program with the additional flag:

-agentlib:hprof=cpu=samples
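
In context, the full invocation looks something like this (MyProgram standing in for your own main class and arguments):

java -agentlib:hprof=cpu=samples MyProgram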

Now, when your program exits, the JVM will also emit a java.hprof.txt file with a listing of where time was spent. If you pore over that file, you’ll eventually find out where your program was wasting its time.

But it turns out there is a much simpler option: gprof2dot.py. This lovely little utility can convert your grungy hprof output into a beautiful dot graph. (N.B. I wrote the hprof format importer for gprof2dot, so blame me if it’s wrong.)

For example:

gprof2dot.py < java.hprof.txt | dot -Tpng > profile.png
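
If your copy of gprof2dot doesn’t detect the input format on its own, you can name it explicitly; I believe the format flag and name are as below, but check gprof2dot.py --help on your version:

gprof2dot.py -f hprof < java.hprof.txt | dot -Tpng > profile.png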

Gives us back:

[profile.png: the call graph rendered by dot]

In this case, it appears that:

  • Using Java regular expressions to split things is slow
  • I need to speed up my cosine similarity calculation

xdot.py is another useful program for interactively viewing these graphs: simply feed it the output of gprof2dot (or any other Graphviz generator):

gprof2dot.py < java.hprof.txt | xdot.py

And now you can scan around your profile image directly.