too much configuration, or: you shouldn’t have to tweak parameters

Whenever I encounter a paper like this – Towards automatic optimization of MapReduce programs – and there are a lot of them, I find myself sighing inwardly. (Heck, one of our own students even ended up tweaking a bunch of these knobs: http://www.news.cs.nyu.edu/node/146.)

This seems to be a common refrain in Java programs, and in Hadoop especially – rather than either choosing a sensible constant or adapting a value at runtime, let’s foist all of the work onto the user. But the way it’s phrased is clever – we’re not avoiding the decision, we’re just making it so the user can configure things however they want. I’ve done this a lot myself – it’s just so easy to add a flag to your command line or your config file and pride yourself on a job well (not) done.

The key issue here is that, as a user, I don’t know what to put in for these values or which ones are important to change, so I’m the absolute worst person to be responsible for them.

Seriously, why are you giving the user these parameters to tweak?

  • io.sort.record.percent
  • io.sort.factor
  • mapreduce.job.split.metainfo.maxsize
  • mapred.heartbeats.in.second

What inevitably happens is that we don’t know what any of these things actually mean for performance, so we end up searching the internet for magic numbers to plug in, rerunning our jobs a whole bunch, and wasting a crap-load of time.

This is not a desirable user experience. I mean, here’s the interface a car exposes to me:

There’s a “go faster” pedal and a “go slower” pedal. These correspond to all sorts of complicated, dynamic magic inside of the engine compartment, but I don’t need to know about them – the system handles it for me. Moreover, it can adjust parameters at runtime, in response to the behavior of the car – unlike most of our lazy computer programs!
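As a toy illustration of what “adapting a value at runtime” could look like in code – the class and heuristic below are entirely made up for the sake of the sketch, not anything Hadoop actually does:

class AdaptiveSortBuffer(object):
    """Hypothetical self-tuning buffer: grows when it spills too often,
    shrinks when it mostly sits idle - no user-facing knob required."""
    def __init__(self, initial_mb=100, min_mb=16, max_mb=1024):
        self.size_mb = initial_mb
        self.min_mb, self.max_mb = min_mb, max_mb

    def on_spill(self, records_spilled, records_total):
        spill_fraction = float(records_spilled) / max(records_total, 1)
        if spill_fraction > 0.5:      # spilling most records: buffer too small
            self.size_mb = min(self.size_mb * 2, self.max_mb)
        elif spill_fraction < 0.05:   # barely spilling: give the memory back
            self.size_mb = max(self.size_mb // 2, self.min_mb)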

If only our programs could be more like cars (though hopefully with better gas mileage).

personal package management

Occasionally I find I need a package that isn’t in my distribution, or I need to rebuild from source for whatever reason. In the past, I’ve always been conflicted about that age-old question:

Where do I install this bugger?

The default for most packages, /usr/local, is fine for most purposes, but there are annoyances – if I want to use the package on other machines, I’d be better off putting it under ~ (/home/power, since our cluster has a shared NFS mount). But then I’m filling up my home directory with random bins and sbins and includes, and if/when I need to uninstall something I always get confused and blow away the wrong thing (since all of the binaries end up in /home/power/bin).

Installing into subdirectories (/home/power/my-package) has its own annoyances – I have to make sure to update my $PATH every time, and I start to get confused when there are too many things in my home directory (I don’t know why, I just do!).

Fortunately, I’ve come across a nice solution to all this. I install everything into /home/power/pkg:

power@kermit> ls -l pkg                                                                                                                                                                                
...
drwxrwxr-x  7 power power 4096 Nov 19 19:42 openmpi
drwxrwxr-x  6 power power 4096 Jul 24 18:48 oprofile
drwxr-xr-x  6 power power 4096 Feb 19  2011 paperpile
drwxrwxr-x  4 power power 4096 Nov  6  2011 parallel
drwxrwxr-x  5 power power 4096 May  9  2012 perl5
drwxrwxr-x  6 power power 4096 Apr 15  2012 postgresql
drwxr-xr-x  7 power power 4096 Feb  9  2012 pypy-1.8
drwxr-xr-x  7 power power 4096 Jun  7 12:50 pypy-1.9
drwxr-xr-x  6 power power 4096 Apr 27  2012 python-2.7

And add the following to my bashrc:

for d in /home/power/pkg/*/bin; do
  export PATH=$PATH:$d
done

for d in /home/power/pkg/*/lib; do
  export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:$d
done

Now every time I install a package under ‘pkg’, my PATH is automatically updated to discover it. And if I do need to remove a package, I just blow away the directory.

I won’t claim that this is brilliant or original. But it works for me, so… it’s nice.

bash-isms i wish i knew earlier

I’ve been using bash for far too long (> 10 years now), and I was reminded the other day (after helping a colleague with some coding) of some of the random tricks you pick up over time. I’ve put them into an (almost completely wrong) timeline. Are there any other common idioms that I’m forgetting to put in here?

Day 1:

“Oh man, look at that, I can hit up-arrow and repeat the previous command!”

You’re so excited by the novelty that you don’t pay attention to the fact that it’s slower than retyping the old command.

Day 30:

After again hitting up-arrow 30+ times to find an old command, you accidentally come upon reverse i-search (CTRL+R) and realize what it means.

You start using quick substitution:

echo foo
^foo^bar

And are feeling pretty much like a master.

Month 6:

HERE documents are now completely under your control, and you have started writing scripts to try to automate everything, even operations that you know you’ll only have to do once.

You spend at least one entire day fiddling with themes, in the name of “productivity”.

Year 1:

You’re feeling more confident, and moreover, the reckless abandon of your Bash youth seems to have passed. After a brief spell with the vi key bindings, you’re back to the emacs bindings, but feeling invigorated by your exploration. You realize that there is a separation between inputrc and bashrc, but you don’t really have time to investigate further. After all, you just added

set completion-query-items 10000
set completion-ignore-case On

to your .inputrc and are far too excited about the idea of never being asked:

Display all 3425 possibilities? (y or n)

ever again.

Year 2:

export HISTSIZE=1000000
export HISTFILESIZE=1000000
shopt -s histappend

How could it have taken you so long to search for this? No longer will having multiple terminals open cause you to lose your hard earned history. You anticipate the point in time where you will have accumulated so many commands in your history file that you will never have to type a new one.

Year 3:

You dabble with zsh after seeing a friend’s super-colorful console. You give up after you realize that zsh is missing most of the awesome TAB-completions you are by now accustomed to. By accident, you try tab-completing an scp command and are floored by the fact that it’s actually based on the remote filesystem. You start trying to write your own completion scripts, but realize these are things better left to experts. It is slowly dawning on you that you are not an expert.

You’ve also become more confident in your escapes – you feel not the slightest bit scared about using arithmetic now:

MYPORT=$((PORTBASE + 1))

And you use command substitution – result=$(echo foo) – to differentiate yourself from those silly backtick users who don’t know what they’re missing:

ROOT=$(dirname $(dirname $(readlink -f $0)))

Year 5:

You accidentally hit CTRL+X CTRL+E again, but this time you notice the magic keystrokes that got you here. An $EDITOR window for modifying the command line? How cool is this? Now your awk scripts will become even more powerful (and ever more incomprehensible).

Year 10:

Your scripts have started to become Zen-like koans of existential beauty. Your full knowledge of the power of trap EXIT allows you to impress your neighbors, whose adulation you accept with a wry smile. You know when to CTRL-\ (SIGQUIT) and when to CTRL-C (SIGINT) – you use force only when required.

You have come to the realization that you are just a beginner, and have so much more to learn.

mycloud – cluster computing in the small

I spent a little time cleaning up mycloud recently – it’s a Python library that gives you a simple map/mapreduce interface without any setup (just SSH access).

I’ve been using it a lot for little processing tasks – it saves me a lot of time over running things using just my machine, or having to switch over to writing Hadoop code.

Source is on github, and the package is available on PyPI for easy installation:

pip install [--user] mycloud

Begin verbatim README dump

MyCloud makes parallelizing your existing Python code using local machines easy – all you need is SSH access to a machine and you too can be part of this whole cloud revolution!

Usage

Starting your cluster:

import mycloud

cluster = mycloud.Cluster(['machine1', 'machine2'])

# or use defaults from ~/.config/mycloud
# cluster = mycloud.Cluster()

Map over a list:

result = cluster.map(compute_factors, range(1000))

Use the MapReduce interface to easily handle processing of larger datasets:

from mycloud.mapreduce import MapReduce, group
from mycloud.resource import CSV

input_desc = [CSV('/path/to/my_input_%d.csv' % i) for i in range(100)]
output_desc = [CSV('/path/to/my_output_file.csv')]

def map_identity(kv_iter, output):
  for k, v in kv_iter:
    output(k, int(v[0]))

def reduce_sum(kv_iter, output):
  for k, values in group(kv_iter):
    output(k, sum(values))

mr = MapReduce(cluster, map_identity, reduce_sum,
               input_desc, output_desc)

result = mr.run()
for k, v in result[0].reader():
  print k, v

Performance

It is, keep in mind, written entirely in Python.

Some simple operations I’ve used it for (6 machines, 96 cores):

  • Sorting a billion numbers: ~5 minutes
  • Preprocessing 1.3 million images (resizing and SIFT feature extraction): ~1 hour

Input formats

Mycloud has built-in support for processing the following file types:

  • LevelDB
  • CSV
  • Text (lines)
  • Zip

Adding support for your own is simple – just write a resource class describing how to get a reader and a writer (see resource.py for details).
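As a rough sketch (the method names here are an assumption for illustration – resource.py has the real contract), a custom resource might look something like:

class JSONLines(object):
  """Hypothetical resource: newline-delimited JSON, one record per line."""
  def __init__(self, path):
    self.path = path

  def reader(self):
    # Yield (key, value) pairs to feed into the map stage.
    import json
    with open(self.path) as f:
      for lineno, line in enumerate(f):
        yield lineno, json.loads(line)

  def writer(self):
    # Return an object that accepts output (key, value) pairs.
    import json
    out = open(self.path, 'w')

    class _Writer(object):
      def add(self, k, v):
        out.write(json.dumps([k, v]) + '\n')

      def close(self):
        out.close()

    return _Writer()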

Why?!?

Sometimes you’re developing something in Python (because that’s what you do), and you decide you’d like it to be parallelized. The current options are multiprocessing (which limits you to a single machine) and Hadoop streaming (which limits you to strings and Hadoop’s input formats).

Also, because I could.

Credits

MyCloud builds on the phenomenally useful cloud serialization, SSH/Paramiko, and LevelDB libraries.

office buildings in the US: cont’d

Tip for future office developers: save money and make people happier by not trying to completely override Mother Nature!

That’s right – when it’s 40-50 degrees F outside, please please please stop heating my office to 80! And when it’s 90 outside, please stop cooling my office to 50… the planet and I will thank you.

bigtable as a spreadsheet

I’m toying with the idea of creating a BigTable/HBase extension that exposes tables as a gigantic virtual spreadsheet.

Then, following the spreadsheet paradigm, I should be able to enter formulas for columns/cells and have them be calculated dynamically based on the data. This would be similar in concept to a database view.
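As a very rough sketch of the shape of the idea (nothing below is a real HBase API – it just illustrates a “formula column” whose value is computed on read, like a spreadsheet cell or a view):

class FormulaColumn(object):
  """Hypothetical derived column: recomputed from the row's stored cells
  every time it is read, never materialized."""
  def __init__(self, formula):
    self.formula = formula  # a function of the row's stored cells

  def value(self, row):
    return self.formula(row)

# e.g. a 'total' column defined as price * quantity:
total = FormulaColumn(lambda row: row['price'] * row['quantity'])
print total.value({'price': 3, 'quantity': 7})  # 21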

Now, if someone could just go ahead and make an open-source version of Spanner it would simplify this a lot for me…

log-spaced values with numpy

I knew this had to exist, since otherwise generating logarithmic plots in matplotlib would be a pain in the butt. Still, it took a bit of searching, although perhaps just the name should have clued me in.
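(For reference, numpy.logspace(start, stop, num) returns num points spaced evenly on a log scale between 10**start and 10**stop:)

import numpy as N

print N.logspace(0, 2, 3)   # -> [   1.   10.  100.]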

import numpy as N
import matplotlib.pyplot as plt

fig, ax = plt.subplots()
steps = N.log10(N.logspace(0.9, 1 - 1e-5))
ax.set_yscale('log', basey=10)   # newer matplotlib spells this base=10
ax.plot(steps, f(steps), '-')    # f: whatever function you're plotting

Also, a shout-out for the ipython inline graphs (ipython notebook --pylab inline). Beautiful, and I can copy-paste them into emails and google docs!

speedy – a fast python rpc library

I just pushed the source for speedy, a non-blocking rpc library, to github. While there’s nothing super-new about it, I’ve found it useful in a few projects I’ve worked on, and so I pulled it out for others to make use of.

Speedy is non-blocking on both the client and server side, relatively fast (~50k requests/s), and straightforward to use. (The name is incidental – it was originally called HTTP/RPC, but it didn’t use HTTP; then jsonrpc, but it didn’t use JSON; so it ended up being speedy, which at least is somewhat accurate.)

Installation is easy via pip or easy_install:

pip install speedy

Usage is straightforward. On the server side, you initialize the server with a handler object: methods are called on the handler object in response to RPC calls. Each method call on the server side takes a ‘handle’ as the first argument. When a result is ready, simply call handle.done() with the value to return.

On the client side, speedy adopts the futures model found in some other libraries. Client-side methods return immediately with a future object. The future object can be queried for completion (Future.done()) and waited on (Future.wait()). The non-blocking behavior is very convenient when you want to execute a few hundred things in parallel and then wait for them all.
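For instance, something like this fan-out pattern (using the client from the example below – the method name and payloads are just placeholders):

import rpc.client

client = rpc.client.RPCClient('localhost', 9999)

# Fire off a few hundred calls without blocking on any of them...
futures = [client.foo('request-%d' % i, 'payload') for i in range(500)]

# ...then wait for each one, collecting the results.
results = [f.wait() for f in futures]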

Server:

class MyHandler(object):
    def foo(self, handle, arg1, arg2):
        # do_something here stands in for your actual application logic
        handle.done(do_something(arg1, arg2))

import rpc.server
s = rpc.server.RPCServer('localhost', 9999, handler=MyHandler())
s.start()

Client:

import rpc.client
c = rpc.client.RPCClient('localhost', 9999)
future = c.foo('Some data', 'would go here')
assert future.wait() == 'Expected result.'

Why not eventlet/greenlet?

Eventlet adds user-level threading to existing code, and theoretically would be a perfect way to implement non-blocking RPC’s in a blocking style. Unfortunately, in my experience, it just doesn’t work enough of the time. Existing RPC libraries are not necessarily designed with massively multi-threaded systems, and the various issues that crop up when using eventlet make debugging just too painful. It’s better and faster just to handle non-blocking calls explicitly – if you as a user want a blocking interface, then by all means feel free to put eventlet on top of speedy.