coderrr

January 10, 2011

String#slice!, Overflow Buffers, and Optimization Math

Filed under: ruby19 — Tags: — coderrr @ 5:57 pm
s="abc"
s.slice!(0,2) # => "ab"
s # => "c"

String#slice! is often used when parsing protocols: 1) read some data into a buffer string, 2) read bytes indicating how much data is in the current message, 3) slice! off that amount of data from the buffer, 4) do something with it, loop back to step 2 until buffer has no more full messages. In most situations the buffer always stays small because you parse every message out of the buffer each time you read data from the stream and usually you aren’t reading that many bytes from the stream at a time. But in some rare cases you may have to deal with larger buffers.

String#slice!’s runtime is approximately proportional to the length of the string being slice!d (the length of the resulting string is pretty much irrelevant). This can be a problem when you’re parsing many small messages out of a large string. For example, calling #slice! on a 1 meg string is about 10x slower than calling it on a 10k string. This means your parsing is going to be 10x slower if you’re parsing a single 1 meg buffer rather than a bunch of smaller 10k buffers. So how do we fix this?

An “overflow buffer”. Keep most of the data in a large overflow buffer which you transfer to a smaller active buffer as needed:

data = read_from_stream()
overflow_buffer << data
loop do
  buffer << overflow_buffer.slice!(0,BUFFER_SIZE)  if buffer.size < BUFFER_MIN_SIZE
  # parse buffer with buffer.slice!
end

This way you only call slice! on a large string a few times and most of the calls are on a relatively small string. The question now is what values should you chose for BUFFER_SIZE and BUFFER_MIN_SIZE. BUFFER_MIN_SIZE is easy, it should be the size of your largest possible message. Alternately you can use something smarter which copies data from the overflow buffer as needed based on your parsing logic.

Choosing BUFFER_SIZE is a little more difficult. It depends on the average amount you’re slice!ing off the buffer each time (SLICE_SIZE) which determines how many times you slice! it. It also depends on the average size of the overflow buffer (OVERFLOW_SIZE).

Let’s simply call these variables b (BUFFER_SIZE), s (SLICE_SIZE), and o (OVERFLOW_SIZE). Given s and o we want to find the optimal value of b. So we need to model the runtime of parsing the entire buffer in terms of those 3 variables. Since the runtime of slice! is approximately proportional to string length we can simply use string length as a replacement for runtime.

The runtime of parsing the entire buffer can be represented by two summations. First is the sum of the times you slice the overflow buffer:

sum i=0 to o/b, (o - b*i)

That is, we will slice the overflow buffer o/b times and each time the size of the overflow buffer will be b less than before.

The second is the sum of the times you slice the small buffer multiplied by the number of times you copy from the overflow buffer into the small buffer:

(o/b) * sum i=0 to b/s, (b - s*i)

Now we just take the derivative of these summations with respect to b, plug in whatever your values are for s and o and solve for 0.

d/db (sum (o-b*i), i=0 to o/b) + d/db (sum o/b*(b-s*i), i=0 to b/s) = 0, o = 1000000, s = 1000

So if your average maximum buffer size is 1 meg and your average slice size is 1k your optimal BUFFER_SIZE is about 10000*sqrt(10) or 31600. Doing some actual benchmarks I found the optimal to be around 40000 so the math comes pretty close to practice.

WARNING: Don’t actually try to optimize your string slicing unless you know you actually have a bottleneck on String#slice! and that the bottleneck is due to the strings you’re slicing being large. Otherwise people will just laugh at you and call you a noob.

5 Comments »

  1. nice post :-)

    Comment by raggi — January 10, 2011 @ 6:06 pm

  2. Well-placed warning. Premature optimisation is a killer.

    Comment by Arlen — January 11, 2011 @ 1:39 am

  3. Hey Steve,

    Awesome content – I guess I can pin the use case to a particular repo :-)

    Mad math representation there as well.

    – Lourens

    Comment by Lourens Naude — January 11, 2011 @ 12:13 pm

  4. Thanks, interesting :)

    Comment by rbjl — January 11, 2011 @ 5:51 pm

  5. This rather good thought is necessary just by the way

    Comment by Ералы — March 9, 2011 @ 2:57 am


RSS feed for comments on this post. TrackBack URI

Leave a Reply

Please log in using one of these methods to post your comment:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s

The Silver is the New Black Theme. Create a free website or blog at WordPress.com.

Follow

Get every new post delivered to your Inbox.

Join 27 other followers

%d bloggers like this: