coderrr

May 8, 2008

Substring queries with Solr (acts_as_solr)

Filed under: java, rails, solr | coderrr @ 10:39 am


Synopsis: To search by substring with acts_as_solr, replace the text field type definition in your vendor/plugins/acts_as_solr/solr/solr/conf/schema.xml file with the NGramTokenizerFactory-based definition given in this post.

The acts_as_solr Rails plugin allows you to interface with the Solr search server. It makes indexing your models extremely simple: usually all you need is a single line in your model file.

Acts_as_solr didn’t seem to provide a simple way to have Solr index my fields so that I could search by substring. After a lot of pain I figured it out.

Solr has a few different types of tokenizers which it uses to split up data before it indexes it. If you look in Solr’s schema.xml file you can see the default tokenizer for a “text” field is solr.WhitespaceTokenizerFactory. This allows you to search by individual words (or anything delimited by whitespace) from your text, but not by a substring.

To be able to search by substring you must use the NGramTokenizerFactory. The NGramTokenizer indexes all possible substrings of your text (within the limits you set). So if the text is “sam” and the limits are 1 and 3 characters, you’d end up with the following strings indexed: s, sa, sam, a, am, m.

It takes two parameters: minGramSize and maxGramSize. These specify the shortest and longest substrings, respectively, that Solr will index.
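To make the effect of the two parameters concrete, here is a minimal sketch in plain Java. It is not Solr or Lucene code; the class and method names (NGramSketch, ngrams) are made up for illustration. It simply lists every substring whose length falls between the minimum and maximum gram size.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch only: enumerates the substrings an n-gram tokenizer
// would index for one input string. Not Solr or Lucene code.
class NGramSketch {

    // Returns every substring of text whose length is between minGram and
    // maxGram (inclusive), shortest grams first.
    static List<String> ngrams(String text, int minGram, int maxGram) {
        List<String> grams = new ArrayList<>();
        for (int size = minGram; size <= maxGram; size++) {
            for (int pos = 0; pos + size <= text.length(); pos++) {
                grams.add(text.substring(pos, pos + size));
            }
        }
        return grams;
    }

    public static void main(String[] args) {
        System.out.println(ngrams("sam", 1, 3));    // [s, a, m, sa, am, sam]
        System.out.println(ngrams("solr", 3, 15));  // [sol, olr, solr]
    }
}
```

Note how a minimum of 3 means that one- and two-character substrings of “solr” are never produced, and a maximum larger than the text length simply has no effect.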

Here’s an example of how to use the NGramTokenizerFactory in your schema.xml:

        <fieldType name="text" class="solr.TextField" >
            <analyzer type="index">
                <tokenizer class="solr.NGramTokenizerFactory" minGramSize="3" maxGramSize="15" />
                <filter class="solr.LowerCaseFilterFactory"/>
            </analyzer>
            <analyzer type="query">
                <tokenizer class="solr.WhitespaceTokenizerFactory" />
                <filter class="solr.LowerCaseFilterFactory"/>
            </analyzer>
        </fieldType>

Be sure to put this in the correct schema.xml. Acts_as_solr ships with two schema.xml files. The correct one is at this path: vendor/plugins/acts_as_solr/solr/solr/conf/schema.xml. After you add this to your schema.xml you will need to restart the Solr server and reindex: rake solr:stop; rake solr:start; rake solr:reindex
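One thing worth noticing in the field type above: the index analyzer stores lowercased n-grams of 3 to 15 characters, while the query analyzer only splits on whitespace and lowercases. As far as I can tell, that means a query word matches only if it is itself one of the indexed grams, so words shorter than 3 or longer than 15 characters find nothing. The sketch below imitates that in plain Java. The names (SubstringMatchSketch, termMatches) are hypothetical, and it ignores everything else Solr does (query parsing, scoring, multi-word queries).

```java
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

// Rough imitation of the index/query analyzers from the schema.xml example.
// Not Solr code; the gram limits are copied from that example.
class SubstringMatchSketch {
    static final int MIN_GRAM = 3;
    static final int MAX_GRAM = 15;

    // Index side: every lowercased substring of 3 to 15 characters.
    static Set<String> indexTerms(String fieldValue) {
        String text = fieldValue.toLowerCase(Locale.ROOT);
        Set<String> terms = new HashSet<>();
        for (int size = MIN_GRAM; size <= MAX_GRAM; size++) {
            for (int pos = 0; pos + size <= text.length(); pos++) {
                terms.add(text.substring(pos, pos + size));
            }
        }
        return terms;
    }

    // Query side: a single whitespace-free word is lowercased and looked up
    // as-is, so it must equal one of the indexed grams.
    static boolean termMatches(String fieldValue, String queryWord) {
        return indexTerms(fieldValue).contains(queryWord.toLowerCase(Locale.ROOT));
    }

    public static void main(String[] args) {
        System.out.println(termMatches("Samantha Jones", "mant")); // true
        System.out.println(termMatches("Samantha Jones", "MAN"));  // true
        System.out.println(termMatches("Samantha Jones", "sa"));   // false
    }
}
```

If you need two-character searches to work, lowering minGramSize is the obvious knob, at the cost of a larger index.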

To see how it actually works, here is an excerpt from Lucene’s NGramTokenizer class.

  public final Token next() throws IOException {
    if (!started) {
      started = true;
      gramSize = minGram;
      char[] chars = new char[1024];
      input.read(chars);
      inStr = new String(chars).trim();  // remove any trailing empty strings 
      inLen = inStr.length();
    }

    if (pos+gramSize > inLen) {            // if we hit the end of the string
      pos = 0;                           // reset to beginning of string
      gramSize++;                        // increase n-gram size
      if (gramSize > maxGram)            // we are done
        return null;
      if (pos+gramSize > inLen)
        return null;
    }
    String gram = inStr.substring(pos, pos+gramSize);
    int oldPos = pos;
    pos++;
    return new Token(gram, oldPos, oldPos+gramSize);
  }

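As you can see, it first iterates over the string creating tokens (grams) of the minimum length. It then increments the gram size and iterates again, until the gram size exceeds either the maximum or the length of the string. So in the case of “sam” with a minimum of 1 and a maximum of 3, it would create the tokens in this order: s, a, m, sa, am, sam.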
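If you want to play with that loop without pulling in Lucene, here is a standalone imitation of the excerpt. It follows the same steps but returns plain strings instead of Lucene Token objects; the class name NGramOrderSketch and the helper all() are made up for this post. It also makes one detail of the excerpt visible: the input is read once into a 1024-character buffer, so (at least in the version excerpted here) text past that point never produces grams. That may be relevant to the behaviour reported in the comments below, though I have not verified it.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

// Standalone imitation of the excerpt's control flow. Plain strings stand in
// for Lucene's Token class; this is not the real NGramTokenizer.
class NGramOrderSketch {
    private final Reader input;
    private final int minGram;
    private final int maxGram;
    private boolean started = false;
    private int pos = 0;
    private int gramSize;
    private int inLen;
    private String inStr;

    NGramOrderSketch(Reader input, int minGram, int maxGram) {
        this.input = input;
        this.minGram = minGram;
        this.maxGram = maxGram;
    }

    // Returns the next gram, or null once all grams have been produced.
    String next() throws IOException {
        if (!started) {
            started = true;
            gramSize = minGram;
            char[] chars = new char[1024];       // same fixed buffer as the excerpt
            input.read(chars);                   // single read: at most 1024 chars
            inStr = new String(chars).trim();    // drops the unused (NUL) tail
            inLen = inStr.length();
        }
        if (pos + gramSize > inLen) {            // end of string for this size
            pos = 0;
            gramSize++;
            if (gramSize > maxGram || gramSize > inLen) {
                return null;
            }
        }
        String gram = inStr.substring(pos, pos + gramSize);
        pos++;
        return gram;
    }

    // Convenience helper: collects every gram for a string, in emission order.
    static List<String> all(String text, int minGram, int maxGram) {
        NGramOrderSketch tokenizer =
            new NGramOrderSketch(new StringReader(text), minGram, maxGram);
        List<String> grams = new ArrayList<>();
        try {
            for (String g = tokenizer.next(); g != null; g = tokenizer.next()) {
                grams.add(g);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return grams;
    }

    public static void main(String[] args) {
        System.out.println(all("sam", 1, 3));  // [s, a, m, sa, am, sam]

        // 1024 filler characters followed by a word: the word is never seen.
        String longText = new String(new char[1024]).replace('\0', 'a') + "zebra";
        System.out.println(all(longText, 5, 5).contains("zebra"));  // false
    }
}
```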

There is also an EdgeNGramTokenizerFactory class which only indexes substrings at either the beginning or end of a string. It takes an extra argument of side="front" or side="back".
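For comparison, here is a small sketch of what edge n-grams look like. Again this is plain Java written for illustration, not the Lucene implementation; the class and method names are hypothetical, and I am assuming the same minimum and maximum gram size limits apply as with the regular n-gram tokenizer.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch only: grams anchored at one edge of the string.
class EdgeNGramSketch {

    // side "front" yields prefixes, side "back" yields suffixes,
    // each between minGram and maxGram characters long.
    static List<String> edgeNGrams(String text, int minGram, int maxGram, String side) {
        List<String> grams = new ArrayList<>();
        int longest = Math.min(maxGram, text.length());
        for (int size = minGram; size <= longest; size++) {
            if ("back".equals(side)) {
                grams.add(text.substring(text.length() - size));
            } else {
                grams.add(text.substring(0, size));
            }
        }
        return grams;
    }

    public static void main(String[] args) {
        System.out.println(edgeNGrams("sam", 1, 3, "front"));  // [s, sa, sam]
        System.out.println(edgeNGrams("sam", 1, 3, "back"));   // [m, am, sam]
    }
}
```

This produces far fewer terms than the full n-gram approach, which makes it a reasonable fit when you only need prefix (or suffix) matching, such as autocomplete.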

4 Comments »

  1. When you added the gram indexer to the analyzer node, did you remove the WhitespaceTokenizerFactory that was already there?

    Comment by Peer Allan — May 27, 2008 @ 3:24 pm

  2. I was using the ngram filter, but noticed an issue where I would index a document and try to query it back out. Words appearing before the ~ 800-1000 character mark would match and the document would get returned.

    When I turned off the ngram filter and just used a whitespace filter the search would work again.

    It appears that the ngram filter is returning a LOT of data and is overfilling the solr.TextField and the over-flow is just getting dropped.

    Have you experienced this before?

    Comment by Trevor Rowe — December 3, 2008 @ 10:27 pm

  3. Wow, I meant to say that words after the 800-1000 character mark would not match and the document would fail to return.

    This is what made me think the ngram filter was creating too much data to fit within the solr.TextField.

    Comment by Trevor Rowe — December 3, 2008 @ 10:37 pm

  4. Sorry Trevor I don’t think I have run into this. And it’s been quite a while since I worked on the project that was using this so I don’t have any helpful advice for ya :(

    Comment by coderrr — December 4, 2008 @ 7:54 pm

