s="abc" s.slice!(0,2) # => "ab" s # => "c"
String#slice! is often used when parsing protocols: 1) read some data into a buffer string, 2) read bytes indicating how much data is in the current message, 3) slice! off that amount of data from the buffer, 4) do something with it, loop back to step 2 until buffer has no more full messages. In most situations the buffer always stays small because you parse every message out of the buffer each time you read data from the stream and usually you aren’t reading that many bytes from the stream at a time. But in some rare cases you may have to deal with larger buffers.
String#slice!’s runtime is approximately proportional to the length of the string being slice!d (the length of the resulting string is pretty much irrelevant). This can be a problem when you’re parsing many small messages out of a large string. For example, calling #slice! on a 1 meg string is about 10x slower than calling it on a 10k string. This means your parsing is going to be 10x slower if you’re parsing a single 1 meg buffer rather than a bunch of smaller 10k buffers. So how do we fix this?
An “overflow buffer”. Keep most of the data in a large overflow buffer which you transfer to a smaller active buffer as needed:
data = read_from_stream() overflow_buffer << data loop do buffer << overflow_buffer.slice!(0,BUFFER_SIZE) if buffer.size < BUFFER_MIN_SIZE # parse buffer with buffer.slice! end
This way you only call slice! on a large string a few times and most of the calls are on a relatively small string. The question now is what values should you chose for BUFFER_SIZE and BUFFER_MIN_SIZE. BUFFER_MIN_SIZE is easy, it should be the size of your largest possible message. Alternately you can use something smarter which copies data from the overflow buffer as needed based on your parsing logic.
Choosing BUFFER_SIZE is a little more difficult. It depends on the average amount you’re slice!ing off the buffer each time (SLICE_SIZE) which determines how many times you slice! it. It also depends on the average size of the overflow buffer (OVERFLOW_SIZE).
Let’s simply call these variables b (BUFFER_SIZE), s (SLICE_SIZE), and o (OVERFLOW_SIZE). Given s and o we want to find the optimal value of b. So we need to model the runtime of parsing the entire buffer in terms of those 3 variables. Since the runtime of slice! is approximately proportional to string length we can simply use string length as a replacement for runtime.
The runtime of parsing the entire buffer can be represented by two summations. First is the sum of the times you slice the overflow buffer:

That is, we will slice the overflow buffer o/b times and each time the size of the overflow buffer will be b less than before.
The second is the sum of the times you slice the small buffer multiplied by the number of times you copy from the overflow buffer into the small buffer:

Now we just take the derivative of these summations with respect to b, plug in whatever your values are for s and o and solve for 0.


So if your average maximum buffer size is 1 meg and your average slice size is 1k your optimal BUFFER_SIZE is about 10000*sqrt(10) or 31600. Doing some actual benchmarks I found the optimal to be around 40000 so the math comes pretty close to practice.
WARNING: Don’t actually try to optimize your string slicing unless you know you actually have a bottleneck on String#slice! and that the bottleneck is due to the strings you’re slicing being large. Otherwise people will just laugh at you and call you a noob.
nice post :-)
Comment by raggi — January 10, 2011 @ 6:06 pm
Well-placed warning. Premature optimisation is a killer.
Comment by Arlen — January 11, 2011 @ 1:39 am
Hey Steve,
Awesome content – I guess I can pin the use case to a particular repo :-)
Mad math representation there as well.
- Lourens
Comment by Lourens Naude — January 11, 2011 @ 12:13 pm
Thanks, interesting :)
Comment by rbjl — January 11, 2011 @ 5:51 pm
This rather good thought is necessary just by the way
Comment by Ералы — March 9, 2011 @ 2:57 am