mycloud – cluster computing in the small

I spent a little time cleaning up mycloud recently – it’s a Python library that gives you a simple map/mapreduce interface without any setup (just SSH access).

I’ve been using it a lot for little processing tasks – it saves me a lot of time compared to running things on just my machine, or switching over to writing Hadoop code.

Source is on GitHub, and the package is available on PyPI for easy installation:

pip install [--user] mycloud

Begin verbatim README dump:

MyCloud makes parallelizing your existing Python code using local machines easy – all you need is SSH access to a machine and you too can be part of this whole cloud revolution!

Usage

Starting your cluster:

import mycloud

cluster = mycloud.Cluster(['machine1', 'machine2'])

# or use defaults from ~/.config/mycloud
# cluster = mycloud.Cluster()

Map over a list:

result = cluster.map(compute_factors, range(1000))
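
The README leaves compute_factors undefined; a minimal stand-in (this particular body is my assumption, not part of mycloud) could be:

def compute_factors(n):
  # Return the divisors of n -- a deliberately CPU-bound toy workload.
  return [i for i in range(1, n + 1) if n % i == 0]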

Use the MapReduce interface to easily handle processing of larger datasets:

from mycloud.mapreduce import MapReduce, group
from mycloud.resource import CSV

input_desc = [CSV('/path/to/my_input_%d.csv' % i) for i in range(100)]
output_desc = [CSV('/path/to/my_output_file.csv')]

def map_identity(kv_iter, output):
  # Pass each key through, parsing the first CSV column as an integer.
  for k, v in kv_iter:
    output(k, int(v[0]))

def reduce_sum(kv_iter, output):
  # Sum every value seen for a given key.
  for k, values in group(kv_iter):
    output(k, sum(values))

mr = MapReduce(cluster, map_identity, reduce_sum,
               input_desc, output_desc)

result = mr.run()
for k, v in result[0].reader():
  print(k, v)
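
For reference, the group helper used above collates values that share a key. A rough pure-Python equivalent, assuming the reducer sees key-sorted pairs (the usual map/reduce shuffle contract), would be:

from itertools import groupby
from operator import itemgetter

def group(kv_iter):
  # Collapse a key-sorted (k, v) stream into (k, list-of-values).
  # Assumes kv_iter is sorted by key; see mycloud.mapreduce for the
  # real implementation.
  for k, pairs in groupby(kv_iter, key=itemgetter(0)):
    yield k, [v for _, v in pairs]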

Performance

Keep in mind that it’s written entirely in Python, so set your performance expectations accordingly.

Some simple operations I’ve used it for (6 machines, 96 cores):

  • Sorting a billion numbers: ~5 minutes
  • Preprocessing 1.3 million images (resizing and SIFT feature extraction): ~1 hour

Input formats

MyCloud has built-in support for processing the following file types:

  • LevelDB
  • CSV
  • Text (lines)
  • Zip

Adding support for your own is simple – just write a resource class describing how to get a reader and a writer (see resource.py for details).
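
As a sketch of what that might look like: the method names below mirror the reader()/output() usage in the examples above, but the exact interface lives in resource.py, so treat this hypothetical JSON-lines resource as an assumption rather than the real contract.

import json

class JSONLines(object):
  # Hypothetical resource: one JSON value per line, keyed by line number.
  # reader()/writer() mirror how CSV is used above; check resource.py
  # for the actual interface before copying this.
  def __init__(self, path):
    self.path = path

  def reader(self):
    # Yield (key, value) pairs, like result[0].reader() in the example.
    with open(self.path) as f:
      for i, line in enumerate(f):
        yield i, json.loads(line)

  def writer(self):
    # Return a callable shaped like the output() argument passed to
    # map and reduce functions above.
    f = open(self.path, 'w')
    def emit(k, v):
      f.write(json.dumps(v) + '\n')
    return emit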

Why?!?

Sometimes you’re developing something in Python (because that’s what you do), and you decide you’d like it to be parallelized. The existing options are multiprocessing (which limits you to a single machine) and Hadoop streaming (which limits you to strings and Hadoop’s input formats).

Also, because I could.

Credits

MyCloud builds on the phenomenally useful cloud serialization, SSH/Paramiko, and LevelDB libraries.
