coderrr

January 11, 2009

Ruby and mysqlplus select() deadlock

Filed under: bug, c, concurrency, patch, ruby. Posted by coderrr @ 7:07 pm

After setting up mysqlplus in my Rails project I ran into an interpreter-wide deadlock in certain situations. I isolated it and tracked it down to two things…

Before I continue: the simple solution is to use the C implementation of async_query instead of the Ruby version.

class Mysql
  alias_method :query, :c_async_query
  # instead of
  # alias_method :query, :async_query
end

If you want more details read on…

1) Mysqlplus’ default async_query, which is implemented in Ruby:

  def async_query(sql, timeout = nil)
    send_query(sql)
    select [ (@sockets ||= {})[socket] ||= IO.new(socket) ], nil, nil, nil
    get_result
  end

The send_query, get_result, and socket methods are C functions.
2) ActiveRecord::Base.clear_reloadable_connections!, which is called after every Rails request in development mode:

# actually implemented in connection_pool.rb
      def clear_reloadable_connections!
        @reserved_connections.each do |name, conn|
          checkin conn
        end
        @reserved_connections = {}
        @connections.each do |conn|
          conn.disconnect! if conn.requires_reloading?
        end
        @connections = []
      end

The code we care about here is conn.disconnect!, which calls mysqlplus’ disconnect method, itself implemented as a C function. If we have Rails skip the call to disconnect!, there is no deadlock. If we use the C implementation of async_query, named c_async_query, there is no deadlock either.
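To make the disconnect! path easier to follow, here is a toy sketch of the second half of clear_reloadable_connections!. FakeConnection, ToyPool and the skip_disconnect option are hypothetical stand-ins invented for illustration; they are not ActiveRecord or mysqlplus classes, and nothing here touches a real socket.

```ruby
# Toy stand-ins for illustration; these are not the real ActiveRecord classes.
class FakeConnection
  attr_reader :disconnected

  def initialize(reloadable)
    @reloadable = reloadable
    @disconnected = false
  end

  def requires_reloading?
    @reloadable
  end

  # In the real adapter this ends up in mysqlplus' C-level disconnect.
  def disconnect!
    @disconnected = true
  end
end

class ToyPool
  def initialize(connections)
    @connections = connections
  end

  # Mirrors the second half of clear_reloadable_connections!.
  # skip_disconnect: true models the "have Rails skip disconnect!" experiment.
  def clear_reloadable_connections!(skip_disconnect: false)
    @connections.each do |conn|
      conn.disconnect! if conn.requires_reloading? && !skip_disconnect
    end
    @connections = []
  end
end

reloadable = FakeConnection.new(true)
static = FakeConnection.new(false)
ToyPool.new([reloadable, static]).clear_reloadable_connections!
p [reloadable.disconnected, static.disconnected] # => [true, false]
```

Only connections that report requires_reloading? are disconnected, so in development mode the native disconnect runs after every request while other threads may still hold Ruby-side handles to the same socket.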

The issue has something to do with calling Ruby’s IO.select on a file descriptor that is also being manipulated by native functions. While looking into it I found a separate but related issue. The bug and fix I show below do not actually resolve the deadlock I was running into, but they solve a similar one. The mysqlplus deadlock does not actually occur during async_query’s select call but at some later point which I couldn’t pin down.
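For readers unfamiliar with the pattern, here is a minimal sketch of wrapping a raw descriptor number in a Ruby IO and selecting on it, roughly what async_query does with the value returned by socket. A pipe stands in for the MySQL connection (an assumption, so that it runs without a database), select_on_wrapped_fd is a hypothetical helper name, and IO.for_fd is used as the equivalent of IO.new(fd).

```ruby
# Illustrative only: a pipe stands in for the MySQL connection's socket.
def select_on_wrapped_fd(message)
  reader, writer = IO.pipe
  raw_fd = reader.fileno # plain integer, like the value mysqlplus' `socket` returns

  # Second Ruby handle onto the same descriptor. autoclose: false stops the
  # wrapper from closing the descriptor when it is garbage collected.
  wrapper = IO.for_fd(raw_fd, autoclose: false)

  writer.write(message)
  writer.flush

  # Block (for at most one second) until the descriptor is readable.
  ready, = IO.select([wrapper], nil, nil, 1)
  data = ready ? wrapper.read_nonblock(message.bytesize) : nil

  { ready: !ready.nil?, data: data, same_fd: wrapper.fileno == raw_fd }
ensure
  writer.close if writer && !writer.closed?
  reader.close if reader && !reader.closed?
end

p select_on_wrapped_fd("result")
```

The wrapper and the original handle are two independent Ruby objects over one kernel descriptor. Ruby only knows about changes made through its own objects, so anything native code does to the descriptor happens behind the interpreter's back.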

The condition can be reproduced by calling the native C function close() on a file descriptor that Ruby is currently waiting on in select.

require 'socket'
require 'rubygems'
require 'inline'

module C
  class << self
    inline do |builder|
      builder.c %q{
        static VALUE native_close(int s) {
          close(s);
          return Qnil;
        }
      }
    end
  end
end

Thread.new { loop { sleep 1; p 1 } }

Thread.new do
  loop do
    io = TCPSocket.new('google.com', 80)
    fd = io.to_i
    Thread.new { sleep 0.5; C.native_close(fd) }
#   Thread.new { sleep 0.5; io.close }
    p :selecting!
    rdy = select [io]
    p :selected!
  end
end

sleep 99999

This example deadlocks in select once the helper thread calls native_close. If you swap the close lines so that the socket is closed with Ruby’s close method instead of the native one, you won’t get a deadlock. After lots of debugging I narrowed it down to what seems to be a bug in Ruby’s rb_thread_schedule function:

        n = select(max+1, &readfds, &writefds, &exceptfds, delay_ptr);
        if (n < 0) {
            // select is returning -1 indicating an error
            int e = errno;

The deadlock is actually Ruby calling rb_thread_schedule over and over for the selecting thread instead of letting other threads run. I stuffed in a call to perror() and saw that the error is EBADF, a bad file descriptor. For some reason Ruby doesn’t handle that error by removing the offending fd from the fd_set, so select keeps failing on every pass. I fixed it by checking the sets for bad file descriptors and removing any that are found:

Update: Simplified remove_bad_fds function thanks to costan recommending fcntl() over select()

        n = select(max+1, &readfds, &writefds, &exceptfds, delay_ptr);
        if (n < 0) {
            int e = errno;
// ...
            if (e == EBADF)
              remove_bad_fds(&th->readfds, &th->writefds, &th->exceptfds, max);

// ...

#include <fcntl.h>
static void
remove_bad_fds(fd_set *r, fd_set *w, fd_set *e, int max) {
  int fd;

  for (fd = 0; fd <= max; fd++)
    if (FD_ISSET(fd, r) || FD_ISSET(fd, w) || FD_ISSET(fd, e))
      if (fcntl(fd, F_GETFD) < 0 && errno == EBADF) {
        FD_CLR(fd, r);
        FD_CLR(fd, w);
        FD_CLR(fd, e);
      }
}

remove_bad_fds calls fcntl(fd, F_GETFD) on every fd present in any of the three sets. If the call fails with EBADF, the fd is cleared from all three sets.
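The same probing idea can be sketched in plain Ruby. This is only a rough analogue under stated assumptions, not part of the patch: it relies on IO.for_fd raising Errno::EBADF for a number that is not an open descriptor, and fd_open?, remove_bad_fds (the Ruby version) and demo_remove_bad_fds are hypothetical helper names.

```ruby
# Rough Ruby analogue of the fcntl(fd, F_GETFD) probe: IO.for_fd raises
# Errno::EBADF when the number does not refer to an open descriptor.
def fd_open?(fd)
  IO.for_fd(fd, autoclose: false) # autoclose: false so the probe never closes fd
  true
rescue Errno::EBADF
  false
end

# Hypothetical helper: keep only the descriptor numbers that are still open.
def remove_bad_fds(fds)
  fds.select { |fd| fd_open?(fd) }
end

def demo_remove_bad_fds
  reader, writer = IO.pipe
  reader_fd = reader.fileno
  writer_fd = writer.fileno
  reader.close # reader_fd is now a stale number
  { stale: reader_fd, live: writer_fd, remaining: remove_bad_fds([reader_fd, writer_fd]) }
ensure
  writer.close if writer && !writer.closed?
end

p demo_remove_bad_fds
```

As with the C version, the check is only a snapshot: a descriptor number can be closed or reused by another thread right after it has been probed.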

bug filed

8 Comments »

  1. [...] I ran into a deadlock with the Ruby implementation of async_query, so use the C one instead. Then add this somewhere so [...]

    Pingback by ActiveRecord threading issues and resolutions « coderrr — January 11, 2009 @ 7:08 pm

  2. k I updated it to default to the c_async_query as you described. Let me know if it doesn’t work.

    I’d recommend the bug listed be submitted to ruby-core, too :)

    -=r

    Comment by roger — January 12, 2009 @ 6:46 pm

  3. does 1.9 have this bug? Does the patch work on windoze :) ?
    -=r

    Comment by roger — January 12, 2009 @ 7:58 pm

  4. Haven’t tried on 1.9, is mysqlplus 1.9 compat?

    Windows? lol

    Comment by coderrr — January 12, 2009 @ 8:01 pm

  5. yeah mysqlplus will compile on 1.9.
    I was wondering more of the
    remove_bad_fds
    stuff is necessary for 1.9

    I almost wonder if, like Python, the select function should raise on the thread that passed it a bad descriptor.

    Yeah windows is definitely the step child of Ruby land, though I hear jruby actually runs on it with reasonable speed :)
    -=r

    Comment by roger — January 12, 2009 @ 8:09 pm

  6. [...] about a contraption like the MySqlPlus driver. But personally, I’m a bit scared to run it on [...]

    Pingback by MySQL vs PostgreSQL в ActiveRecord | Uniвсячина — February 24, 2009 @ 11:08 pm

  7. I am wondering if the problem was that there was no concurrency lock around the assignment to @sockets ?

    Comment by roger — April 18, 2009 @ 11:09 pm

  8. Yea both those unsynchronized ||=’s could be it

    Comment by coderrr — April 18, 2009 @ 11:13 pm

