coderrr

December 10, 2008

MRI IO deadlock

Filed under: bug, network, ruby — Tags: , , — coderrr @ 8:30 am

Update: This issue has been fixed in revision 21165 of 1.8

After a long debugging session from hell I finally determined what was completely deadlocking my app. And by deadlocking I mean EVERY thread stops running. Even simple threads that only loop and print something.

The problem occurs when you try to read from the same socket (and probably happens with some other IO subclasses as well) with two different threads (at the same time). It comes down to what seems to be a race condition in an internal C function (io_getpartial) used in IO’s read methods.

Obviously reading from the same socket with two different threads is probably not something you should be doing. But at the same time the interpreter probably shouldn’t ever be deadlocking all threads like that.

So here’s some code to reproduce the deadlock:

require 'socket'
# if this ever stops printing we're screwed
Thread.new{loop{p 1;sleep 0.5}}

s = Thread.new do
  cli = TCPServer.new(3020).accept
  loop { cli.write 'hi' }
end

sock = TCPSocket.new('localhost', 3020)
2.times do
  Thread.new do
    loop { sock.readpartial 1024 }
  end
end

s.join

The deadlock occurs in the latest 1.8.6 and 1.8.7 MRIs. It doesn’t occur in 1.9 or jruby. The simplest solution to protect you from this is to just synchronize whatever read methods you’re using:

class IO
  alias_method :locking_readpartial, :readpartial
  @@mutex_guard = Mutex.new
  def readpartial(size)
    @@mutex_guard.synchronize { @readpartial_guard ||= Mutex.new }.synchronize do
      locking_readpartial(size)
    end
  end
end

You can/need to do the same for IO#read if that’s what you’re using.

If you care about the actual specifics, here’s the function (io.c:5946) where the interpreter hangs, line 49:

static VALUE
io_getpartial(int argc, VALUE *argv, VALUE io, int nonblock)
{
    OpenFile *fptr;
    VALUE length, str;
    long n, len;

    rb_scan_args(argc, argv, "11", &length, &str);

    if ((len = NUM2LONG(length)) < 0) {
        rb_raise(rb_eArgError, "negative length %ld given", len);
    }

    if (NIL_P(str)) {
        str = rb_str_new(0, len);
    }
    else {
        StringValue(str);
        rb_str_modify(str);
        rb_str_resize(str, len);
    }
    OBJ_TAINT(str);

    GetOpenFile(io, fptr);
    rb_io_check_readable(fptr);

    if (len == 0)
        return str;

    if (!nonblock) {
        READ_CHECK(fptr->f);
    }
    if (RSTRING(str)->len != len) {
      modified:
        rb_raise(rb_eRuntimeError, "buffer string modified");
    }
    n = read_buffered_data(RSTRING(str)->ptr, len, fptr->f);
    if (n <= 0) {
      again:
        if (RSTRING(str)->len != len) goto modified;
        if (nonblock) {
            rb_io_set_nonblock(fptr);
            n = read(fileno(fptr->f), RSTRING(str)->ptr, len);
        }
        else {
            TRAP_BEG;
            // First thread completes call w/o blocking, and since it has now read
            // the available data the second thread blocks on this call.
            n = read(fileno(fptr->f), RSTRING(str)->ptr, len);
            TRAP_END;
        }
        if (n < 0) {
            if (!nonblock && rb_io_wait_readable(fileno(fptr->f)))
                goto again;
            rb_sys_fail(fptr->path);
        }
    }
    rb_str_resize(str, n);

    if (n == 0)
        return Qnil;
    else
        return str;
}

The READ_CHECK() macro will actually switch between threads. So it gets into some situation where it thinks there is data to read, but then schedules another thread to run which actually reads the data so there is no longer data to read, so when it gets to the actual reading of the data it blocks forever. At least that’s my minimal understanding of the issue.

Update: bug filed

5 Comments »

  1. It would be nice if you file an issue on http://redmine.ruby-lang.org/.

    Comment by Radarek — December 10, 2008 @ 9:48 am

  2. Will do, I’ve been trying to get subscribed to ruby-core so I can send it there but the ML still hasn’t registered me after a few hrs…

    Comment by coderrr — December 10, 2008 @ 9:50 am

  3. Of course, there’s also a race in

    (@readpartial_guard ||= Mutex.new)

    :-)

    Comment by Anon — December 10, 2008 @ 11:51 am

  4. derrr, of course, lemme fix

    thx!

    Comment by coderrr — December 10, 2008 @ 11:55 am

  5. [...] of people making these mistakes recently, even in code which touts being thread safe (even in my own [...]

    Pingback by ||= is NOT thread safe, neither is Hash.new {|h,k| … } « coderrr — April 29, 2009 @ 12:32 pm


RSS feed for comments on this post. TrackBack URI

Leave a Reply

Please log in using one of these methods to post your comment:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s

The Silver is the New Black Theme. Blog at WordPress.com.

Follow

Get every new post delivered to your Inbox.

Join 27 other followers

%d bloggers like this: