Update: This issue has been fixed in revision 21165 of 1.8
After a long debugging session from hell I finally determined what was completely deadlocking my app. And by deadlocking I mean EVERY thread stops running. Even simple threads that only loop and print something.
The problem occurs when you try to read from the same socket (and probably happens with some other IO subclasses as well) with two different threads (at the same time). It comes down to what seems to be a race condition in an internal C function (io_getpartial) used in IO’s read methods.
Obviously reading from the same socket with two different threads is probably not something you should be doing. But at the same time the interpreter probably shouldn’t ever be deadlocking all threads like that.
So here’s some code to reproduce the deadlock:
require 'socket'
# if this ever stops printing we're screwed
Thread.new{loop{p 1;sleep 0.5}}
s = Thread.new do
cli = TCPServer.new(3020).accept
loop { cli.write 'hi' }
end
sock = TCPSocket.new('localhost', 3020)
2.times do
Thread.new do
loop { sock.readpartial 1024 }
end
end
s.join
The deadlock occurs in the latest 1.8.6 and 1.8.7 MRIs. It doesn’t occur in 1.9 or jruby. The simplest solution to protect you from this is to just synchronize whatever read methods you’re using:
class IO
alias_method :locking_readpartial, :readpartial
@@mutex_guard = Mutex.new
def readpartial(size)
@@mutex_guard.synchronize { @readpartial_guard ||= Mutex.new }.synchronize do
locking_readpartial(size)
end
end
end
You can/need to do the same for IO#read if that’s what you’re using.
If you care about the actual specifics, here’s the function (io.c:5946) where the interpreter hangs, line 49:
static VALUE
io_getpartial(int argc, VALUE *argv, VALUE io, int nonblock)
{
OpenFile *fptr;
VALUE length, str;
long n, len;
rb_scan_args(argc, argv, "11", &length, &str);
if ((len = NUM2LONG(length)) < 0) {
rb_raise(rb_eArgError, "negative length %ld given", len);
}
if (NIL_P(str)) {
str = rb_str_new(0, len);
}
else {
StringValue(str);
rb_str_modify(str);
rb_str_resize(str, len);
}
OBJ_TAINT(str);
GetOpenFile(io, fptr);
rb_io_check_readable(fptr);
if (len == 0)
return str;
if (!nonblock) {
READ_CHECK(fptr->f);
}
if (RSTRING(str)->len != len) {
modified:
rb_raise(rb_eRuntimeError, "buffer string modified");
}
n = read_buffered_data(RSTRING(str)->ptr, len, fptr->f);
if (n <= 0) {
again:
if (RSTRING(str)->len != len) goto modified;
if (nonblock) {
rb_io_set_nonblock(fptr);
n = read(fileno(fptr->f), RSTRING(str)->ptr, len);
}
else {
TRAP_BEG;
// First thread completes call w/o blocking, and since it has now read
// the available data the second thread blocks on this call.
n = read(fileno(fptr->f), RSTRING(str)->ptr, len);
TRAP_END;
}
if (n < 0) {
if (!nonblock && rb_io_wait_readable(fileno(fptr->f)))
goto again;
rb_sys_fail(fptr->path);
}
}
rb_str_resize(str, n);
if (n == 0)
return Qnil;
else
return str;
}
The READ_CHECK() macro will actually switch between threads. So it gets into some situation where it thinks there is data to read, but then schedules another thread to run which actually reads the data so there is no longer data to read, so when it gets to the actual reading of the data it blocks forever. At least that’s my minimal understanding of the issue.
Update: bug filed

It would be nice if you file an issue on http://redmine.ruby-lang.org/.
Comment by Radarek — December 10, 2008 @ 9:48 am
Will do, I’ve been trying to get subscribed to ruby-core so I can send it there but the ML still hasn’t registered me after a few hrs…
Comment by coderrr — December 10, 2008 @ 9:50 am
Of course, there’s also a race in
(@readpartial_guard ||= Mutex.new)
:-)
Comment by Anon — December 10, 2008 @ 11:51 am
derrr, of course, lemme fix
thx!
Comment by coderrr — December 10, 2008 @ 11:55 am
[...] of people making these mistakes recently, even in code which touts being thread safe (even in my own [...]
Pingback by ||= is NOT thread safe, neither is Hash.new {|h,k| … } « coderrr — April 29, 2009 @ 12:32 pm