Transaction Chain Visualization

We had a paper at the last SOSP on transaction chains.  Our original analysis of chains was done by hand, which is quite a silly way to do it.  We then wrote a simple script to do the graph analysis, but it’s still difficult to picture the interaction of chains (a script telling you that you have an S-C cycle is great, but what should you do about it?)

To make this a bit easier, I made up a little webpage that lets  you enter in a list of chains and indicate commutative links.  This page very effectively illustrates 3 things:

  • My ineptness at Javascript
  • My lack of graph theory knowledge
  • That there are some neat Javascript libraries out there (hello Dagre!)

Try it out here: http://rjpower.org/transaction-chain/

Creating fancy images with Matplotlib

I have to give a short presentation at SOSP next week, and for it, I needed to have some nice pictures representing a distributed array. After trying out several tools for trying to create these, I began to lament and cry over the state of Linux drawing software. But that’s a different story. I ended up writing a simple matplotlib script to generate the pictures I needed, and since it worked out pretty well, I thought I’d share it here.

Here’s the kind of picture I’m referring to:

array

It turns out this is pretty straightforward using matplotlib. Here’s the basic function:

def draw_array(a, target=None):
    fig = pylab.gcf()
    fig.frameon = False

    ax = fig.gca()
    #ax.set_axis_off()

    ax.patch.set_facecolor('white')
    ax.set_aspect('equal', 'box')
    ax.xaxis.set_major_locator(plt.NullLocator())
    ax.yaxis.set_major_locator(plt.NullLocator())

    size = 1.0
    z_scale = 1.4
    i = 0
    for z in reversed(range(a.shape[2])):
        for (x,y),v in np.ndenumerate(a[:, :, z]):
            i += 2
            alpha = a['transparency'][x,y,z]
            color = tuple(a['color'][x,y,z])
            off_x = 0.01 + x + size + z / z_scale
            off_y = y + size + z / z_scale

            rect = pylab.Rectangle([off_x, off_y], size, size,
                                   facecolor=color, edgecolor=(0,0,0),
                                   zorder = i, alpha = alpha)
            ax.add_patch(rect)

            cx = off_x + size/2
            cy = off_y + size/2

            # sigh
            label = str(a['name'][x,y,z])
            w, h = pylab.matplotlib.text.TextPath((0,0), label).get_extents().size / 30

            #print w, h

            text = pylab.Text(cx - w / 2, cy - h / 2, label, zorder=i+1)
            ax.add_artist(text)

    ax.autoscale_view()
    if target is not None:
        pylab.savefig(target)
    return ax

The first part of this just turns off the various lines for the axes. We then iterate through the elements of the array and create a Rectangle() for each one; each “layer” (z-axis) is shifted off to the right a little bit from the previous, to give our illusion of depth. (We don’t want a normal perspective projection, as it would hide too much of the deeper layers).

The “sigh” comment is where I’m using a hack to determine the size of the text we’re going to put in so I can center it in the array cell. I couldn’t find an easier way to do this, and no, I don’t know why I have to divide the result by 30.

The input array has 3 fields which specify how to render each rectangle:

dtype=([('color', 'f,f,f'), ('name', 'i'), ('transparency', 'f')]))

Now we can construct an arbitrary array and feed it into our function:

shape = (3,3,5)
a = np.ndarray(shape, dtype=([('color', 'f,f,f'), ('name', 'i'), ('transparency', 'f')]))
a['name'] = np.arange(np.prod(shape)).reshape(shape)
a['transparency'] = 1.0
a['color'] = (1,1,1)
return a

draw_array(a, target='array.pdf')

Once we have the basics out of the way, we can do some fancy rendering really easily. First, let’s make a little helper class to draw slices:

class draw_slice(object):
    def __init__(self, a, target=None):
        self.a = a
        self.target = target

    def __getitem__(self, slc):
        slice_z = np.copy(self.a)
        slice_z['color'][slc] = (0.9, 0.5, 0.3)
        slice_z['transparency'] = 0.9
        draw_array(slice_z, self.target)

We can wrap an array in draw_slice() to make it easy to construct pictures of slices:

draw_slice(a)[:,:,1]

slice-z

We can be fancier if we like too, drawing the results of a filter operation:,

draw_slice(a)[a['name'] <= 1]

filter

If you are interested, the full code for creating these figures is here: https://gist.github.com/rjpower/7249729. All you need is matplotlib and numpy.

statically linking shared libraries with libtool

I run a lot of experiments on our local cluster.  Unfortunately, over time, the library versions on the cluster tend to diverge from those on my local machine. As a result, I’ve gotten used to seeing this:

/usr/bin/python: /lib/x86_64-linux-gnu/libc.so.6: version `GLIBC_2.17' not found 
(required by /home/power/w/spartan/build/.libs/libspartan.so.0)

If I was using Go this wouldn’t be a problem, as they statically link everything.  Despite the crowd of people who think shared libraries are the bees knees, I agree with this approach — it’s just far simpler than trying to deal with errors like the above. You can solve this most of the time by simply statically linking everything (–enable-static).  Sadly, if you’re trying to build a shared library (in my case, an extension module for Python), you can’t really go this route. (Statically linking Python API calls into a library which is then hoisted into Python is going to end very, very badly). What am I supposed to do with this error? If you trawl around the web, you find that depending on the exact error, you should either:

  • find an old version of GLIBC and link against that
  • insert assembly directives to indicate the old symbol

If you happen to have a single symbol that’s pulling the newer version, the latter is an easy fix (though it’s a bit annoying to ensure you’ve always got the directive declared before you use a function). If you’ve somehow acquired a dependency on the whole library things become more annoying. In my case, this dependency seemed to result from the chain:

_spartan_wrap.so -> libspartan.so -> libstdc++.so -> libc.so

Oddly enough, depending directly on libstdc++ isn’t the problem. If I remove the dependency on libspartan (just linking directly against all of the objects), we’re fine:

ldd -v .libs/_spartan_wrap.so
...

        Version information:
        .libs/_spartan_wrap.so:
                librt.so.1 (GLIBC_2.2.5) => /lib/x86_64-linux-gnu/librt.so.1
                libgcc_s.so.1 (GCC_3.0) => /lib/x86_64-linux-gnu/libgcc_s.so.1
                libm.so.6 (GLIBC_2.2.5) => /lib/x86_64-linux-gnu/libm.so.6
                libc.so.6 (GLIBC_2.15) => /lib/x86_64-linux-gnu/libc.so.6
                libc.so.6 (GLIBC_2.14) => /lib/x86_64-linux-gnu/libc.so.6
                libc.so.6 (GLIBC_2.4) => /lib/x86_64-linux-gnu/libc.so.6
                libc.so.6 (GLIBC_2.3.2) => /lib/x86_64-linux-gnu/libc.so.6
                libc.so.6 (GLIBC_2.3.4) => /lib/x86_64-linux-gnu/libc.so.6
                libc.so.6 (GLIBC_2.2.5) => /lib/x86_64-linux-gnu/libc.so.6
                libpthread.so.0 (GLIBC_2.3.2) => /lib/x86_64-linux-gnu/libpthread.so.0
                libpthread.so.0 (GLIBC_2.2.5) => /lib/x86_64-linux-gnu/libpthread.so.0
                libstdc++.so.6 (GLIBCXX_3.4.14) => /usr/lib/x86_64-linux-gnu/libstdc++.so.6
                libstdc++.so.6 (GLIBCXX_3.4.15) => /usr/lib/x86_64-linux-gnu/libstdc++.so.6
                libstdc++.so.6 (GLIBCXX_3.4.10) => /usr/lib/x86_64-linux-gnu/libstdc++.so.6
                libstdc++.so.6 (CXXABI_1.3) => /usr/lib/x86_64-linux-gnu/libstdc++.so.6
                libstdc++.so.6 (GLIBCXX_3.4) => /usr/lib/x86_64-linux-gnu/libstdc++.so.6

Notice, no reference to GLIBC_2.17. Luckily in this case there’s a simple solution: make all of the helper libraries into convenience libraries. This causes those libraries to be statically linked and avoids pulling in the extra dependencies:

# old
# lib_LTLIBRARIES = _wrap.la liba.la libb.la

# new 
noinst_LTLIBRARIES = liba.la libb.la
lib_LTLIBRARIES = _wrap.la

If we weren’t lucky — we’re actually using the symbol that’s from a later version — than we can force static linking of your dependent library; you do this by listing it explicitly by name in your LIBADD variable:

_spartan_wrap_la_LIBADD = -module -lrt /usr/lib/gcc/x86_64-linux-gnu/4.8/libstdc++.a

libtool will complain that linking against a static library isn’t portable (which is true), but it should work correctly as long as the static library was built with -fPIC.

the Onion

We used to have the Onion (America’s Finest News Source) available for free here in New York, in those little weekly newspaper boxes you see lying around. After the hurricane last year, the supply seems to have dried up. This was a sad event. Not only did I get high quality journalism from the paper, they also had a very nice crossword puzzle for lazy Saturday mornings.

Enterprising individuals have begun reusing the boxes for temporary clothing storage, and some of them have been commandeered for other, not as interesting papers. (New York real estate is valuable, after all).

I don’t often check their website, as it adds to the already crippling amount of distraction that I experience in a day, but this article caught my eye. As a person who finds comfort in the knowledge that the universe will eventually empty out into a cold cinder, I can’t help but approve when others notice the same.

a blast from the past

While reading some EE articles, I happened across a link to a Webring site. Remember Webrings? Probably not. They were the rage back in the 90’s. Not so popular anymore, as all content gets swallowed up into the Facebook/Tumblr/Google ecosystems.

The nostalgia hit of clicking through the various pages in the ring was quite impressive. All of those pages “made with notepad”. The blink tags, marquees and counters, all used without irony. Ahh, those were the days.

For the curious, start here.

Citibike: Bike Flow

I’m a big fan of the Citibike bike share program that started here recently.  One common issue I and others seem to suffer from is the lack of bikes (when starting a trip) or docks (when ending a trip).  Our neighborhood tends to be a very popular destination in the evening, so when I try to ride a bike in, I often end up a few blocks away from my desired station.  Similarly, if I get a late start in the morning I often find there are no bikes left to ride.

I was curious about how the flow of bikes works around the city — where do the bikes go to when they leave the East Village?  I crawled the Citibike web site and created a simple website to visualize the flow of bikes around the city; the results are pretty interesting:

citibike

With some more work on the data, it might be possible to use it for predictions (“will I be able to return this bike to my station”) and to aid in balancing (choosing which stations to move bikes between, and at what time).

The source code for the application is available on Github.

Making a JIT interpreter with LuaJIT

(N.B. The code for all of these experiments is on Github).

I recently read this post by François Perrad regarding interpreters, where he compared interpreter loops written in Lua, LuaJit and Pypy.  (I think the original toy interpreter example comes from PyPy).   After some suggestions, he ended up with a new PyPy version which performed very well — close to what you’d see from a static compiler.

The bytecode ‘program’ being used for all of these examples is simply calculating, in a round-about fashion, the square of an input number:

MOV_A_R,    0,
MOV_A_R,    1,
MOV_R_A,    0, 
DECR_A,
MOV_A_R,    0,
MOV_R_A,    2, 
ADD_R_TO_A, 1,
MOV_A_R,    2,
MOV_R_A,    0, 
JUMP_IF_A,  4,
MOV_R_A,    2,
RETURN_A

I made a slight modification to this interpreter to force PyPy to load the bytecode at runtime (to ensure it doesn’t “cheat” during translation by just statically optimizing for this particular program).    This version runs quickly, but not as fast as the version that has the bytecode baked in.  It evaluates 100M iterations of the bytecode loop in 1.6seconds; still this is roughly a hundred times faster then the CPython equivalent.  This is what you’d expect, after all — it’s what PyPy is designed for.

The lua based interpreter, when run with Luajit, takes 5.5 seconds; not bad, but it’s 4 times slower then PyPy. Can we do better?  What’s causing Luajit to run slowly?  If we turn on jit debugging for luajit, we see the problem immediately:

luajit -jdump toy-jit.lua 100000000

There’s no output!  The JIT compiler never activated.  What’s going on?

It turns out that our interpreter loop (like all interpreter loops), is unpredictable, as the path of the execution is very data dependent (‘data’ here meaning the bytecode we’re interpreting):

while true do
  local opcode = bytecode[pc]
  pc = pc + 1 
  if opcode == JUMP_IF_A then
    local target = bytecode[pc]
    pc = pc + 1 
    if a ~= 0 then
      pc = target
    end
  elseif opcode == MOV_A_R then
    ...
  elseif opcode == MOV_R_A then
    ...
  elseif opcode == ADD_R_TO_A then
    ...
  elseif opcode == DECR_A then
    ...
  elseif opcode == RETURN_A then 

After executing a bytecode, the interpreter goes back up to the top of the while, and jumps to a different place. A tracing JIT never gets a chance to see the pattern, and so you end up running in the interpreter the whole time.  PyPy solves this problem by using magic meta-tracing.

It turns out we can get a similar effect in Luajit, without too much effort, using partial evaluation.  That is, given a chunk of bytecode, we’ll generate a specialized version of our interpreter for that bytecode.  We do this, in time-honored fashion, by copy-pasting. We step through each opcode, and instead of evaluating it, we build up a Lua string to evaluate it (A much cleaner approach would be to write our interpreter in some structured fashion, and generate the JIT interpreter from that):


if opcode == JUMP_IF_A then
      local target = bytecode[pc]
      pc = pc + 1
      f_str = f_str .. string.format([[
if a == 0 then
  goto op_%d
end
goto op_%d
]], pc, target)
    elseif opcode == MOV_R_A then
      local n = bytecode[pc]
      pc = pc + 1
    f_str = f_str .. string.format([[
a = reg_%d
]], n)

For our test program, this creates a Lua string like this:

function _jit(a)
  local reg = {0, 0, 0, 0, 0, 0, 0, 0}
  ::op_1::
reg[1] = a
::op_3::
reg[2] = a
::op_5::
a = reg[1]
::op_7::
a = a - 1
::op_8::
reg[1] = a
::op_10::
a = reg[3]
::op_12::
a = a + reg[2]
::op_14::
reg[3] = a
::op_16::
a = reg[1]
::op_18::
if a == 0 then
  goto op_20
end
goto op_5
::op_20::
a = reg[3]
::op_22::
return a
end

If we eval() this string, we get back an interpreter that’s been specialized for just this bytecode. What does our performance look like now?

time pypy-jit-c /home/power/tmp/bytecode.str 100000000
1.63s user 0.01s system 99% cpu 1.649 total

time luajit toy.lua 100000000       
5.51s user 0.01s system 99% cpu 5.549 total

time luajit toy-jit.lua 100000000
0.12s user 0.00s system 97% cpu 0.128 total

We’re now much faster then PyPy! Obviously this trick is easier to play with such a simple interpreter (we’re also using the native numeric type of our JIT, which isn’t always correct). Amore complex, dynamically typed systems might prove to be more difficult to do partial evaluation on. There also could be extra hints I could give to PyPy to make it work better (if you have any ideas, please tell me!).

Still, it’s somewhat surprising how easy it was to generate our ‘JIT’ interpreter — the code isn’t much bigger then the original version. Perhaps with some more scaffolding/helper libraries, this could be a viable way to create fast interpreters for new languages?

scidb performance

We searched around trying to find any reasonable comparison of Scidb performance to existing systems (specifically, we’re looking at doing straightforward bulk-parallel operations like logistic regression/k-means).  So far, we’ve found the performance is very poor — an order of magnitude or 2 worse then single machine runs or systems like Spark.

Any ideas what we’re doing wrong? The code is below. We’re running this across 4 machines with 32GB of memory each. The example code is just trying to do the regression on 100000 samples with 10 dimensions.

The insert(multiply(X, W)) query itself seems to take several seconds. What’s going on here? On a single machine, this operation is less then a millisecond; even accounting for disk reads/network issues I’d expect this to be a hundred times faster then it is.

Trying to run with more data causes the system to run out of memory.


#!/bin/bash

function create_randoms(){
  x=$(($1 - 1))
  y=$(($2 - 1))
  echo "Creating random array $3[$1, $2]..."
  iquery -anq "store(build([x=0:$x,$CHUNK_SIZE,0,
y=0:$y,$CHUNK_SIZE,0], double(1)/(random() % 10 + 1)), $3)" >/dev/null
}

function create_zeros(){
  x=$(($1 - 1))
  y=$(($2 - 1))
  echo "Creating zero array $3[$1, $2]..."
  iquery -anq "store(build([x=0:$x,$CHUNK_SIZE,0,y=0:$y,$CHUNK_SIZE,0],0),$3)" >/dev/null
}

function create_template(){
  x=$(($1 - 1))
  iquery -nq "create array Template [x=0:$x,$CHUNK_SIZE,0]" >/dev/null
}

function exe(){
  iquery -o csv+ -q  "$1"
}

function exe_silent(){
  iquery -nq "$1" >/dev/null
}

function clean(){
  exe_silent "drop array W"
  exe_silent "drop array X"
  exe_silent "drop array Y"
  exe_silent "drop array Pred"
  exe_silent "drop array Diff"
  exe_silent "drop array Grad"
  exe_silent "drop array Temp"
  exe_silent "drop array Template"
}

function init(){
  create_randoms $N $D X
  create_randoms $N 1 Y
  create_randoms $D 1 W
  create_zeros $N 1 Pred
  create_zeros $D 1 Grad
  create_zeros $N 1 Diff
  create_template $N
}

D=10 #Dimensions.
N=100000 #Points.
CHUNK_SIZE=10000

clean
init #Create arrays.

for i in {1..5}
do
iquery <<HERE
set no fetch;

set lang afl;
insert(multiply(X, W), Pred);

set lang aql;
update Pred set val= double(1)/(double(1) + exp(-val));
select Pred.val - Y.val into Diff from Pred, Y;
select sum(new_val) as val into Temp from (select X.val * R.val as
new_val from X join reshape(Diff, Template) as R on X.x = R.x) group
by y;
select val into Grad from reshape(substitute(Temp, build(Temp, 0)), Grad);

set fetch;
select W.val + 0.001 * Grad.val into W from W, Grad;
HERE
done

clean


a simple watchdog for subprocesses

A problem I often encounter when writing code that deals with subprocesses is the issue of orphans. These are perfectly useful in many situations, such as keeping program running via `disown`. They’re also real annoyances when you’re trying to fork worker processes; every time your parent process has a bug, you end up with orphaned workers that you have to manually terminate.

For instance, I might want to launch a bunch of worker processes, each listening on a separate port:


for i in range(num_procs):
  subprocess.Popen('python worker.py %d' % 9999 + i)

The orphan problem crops up if the parent dies without terminating the worker process (gracefully or otherwise). After the parent process is killed, the workers are left running in the background; holding onto both resources and their port. We now have to manually terminate them in order to try running again:


pkill -9 -f worker.py

You can use the `atexit` functionality from most languages to try to cleanup, but this doesn’t help if your process segfaults, or has some other ignominious end.

In a “real” environment, this is typically solved by having a cluster manager which schedules jobs and can start and kill processes as needed (for example SLURM or Torque. But often we’re running in an environment where this is unavailable (either our local machine, or we just haven’t bothered to set something up). What do we do?

The solution here is watchdog timer which terminates the workers if the master process dies. This can be accomplished in a number of ways, including via the RPC system, but I’ve found a simple mechanism that works well for me: polling the stdin channel.


class FileWatchdog(threading.Thread):
  """Watchdog for a file (typically `sys.stdin`).
 
  When the file closes, terminate the process.
  (This typically occurs when the parent process is lost.)
  """
  def __init__(self, file_handle):
    threading.Thread.__init__(self, name='WatchdogThread')
    self.setDaemon(True)
    self.file_handle = file_handle
    # self.log = open('/tmp/watchdog.%d' % os.getpid(), 'w')
 
  def run(self):
    f = [self.file_handle]
    while 1:
      r, w, x = select.select(f, f, f, 1.0)
      # print >>self.log, 'Watchdog running: %s %s %s' % (r,w,x)
      # self.log.flush()
      if w:
        # print >>self.log, 'Watchdog: file closed.  Shutting down.'
        # self.log.flush()
        os._exit(1)
      time.sleep(1)

I just have my worker processes spawn a watchdog thread at startup time, and they will terminate themselves as soon as they lose their connection to the master process. This simple mechanism falls apart if you’re using stdin to communicate to your child processes, but it works well for most other simple needs.