Solr for Drupal Developers, Part 1: Intro to Apache Solr

It's common knowledge in the Drupal community that Apache Solr (and other text-optimized search engines like Elasticsearch) blow database-backed search out of the water in terms of speed, relevance, and functionality. But most developers don't really know why, or just how much an engine like Solr can help them.

I'm going to be writing a series of blog posts on Apache Solr and Drupal. While some parts of the series will be very Drupal-centric, I hope to illuminate why Solr itself (and other search engines like it) is so effective, and why you should use it instead of simple database-backed search (which Drupal core's Search module uses by default), even for small sites where search isn't a primary feature.

As an aside, I am writing this series of blog posts from the perspective of a Drupal developer who has worked with large-scale, highly customized Solr search for Mercy (example), and with a variety of small-to-medium sites that use Hosted Apache Solr, a service I've been running as part of Midwestern Mac since early 2011.

Why not Database?

Apache Solr's wiki leads off its Why Use Solr page with the following:

If your use case requires a person to type words into a search box, you want a text search engine like Solr.

At a basic level, databases are optimized for storing and retrieving bits of data, usually either a record at a time, or in batches. And relational databases like MySQL, MariaDB, PostgreSQL, and SQLite are set up in such a way that data is stored in various tables and fields, rather than in one large bucket per record.

In Drupal, a typical node entity will have a title in the node table, a body in the field_data_body table, maybe an image with a description in another table, an author whose name is in the users table, etc. Usually, you want to allow users of your site to enter a keyword in a search box and search through all the data stored across all those fields.

Drupal's Search module avoids making ugly and slow search queries by building an index of all the search terms on the site, and storing that index inside a separate database table, which is then used to map keywords to entities that match those keywords. Drupal's venerable Views module will even enable you to bypass the search indexing and search directly in multiple tables for a certain keyword.
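The keyword-to-entity mapping described above is essentially what's called an inverted index. Here's a minimal Python sketch of the idea, assuming a handful of made-up node IDs and titles (this is not Drupal's actual search schema, just an illustration of the data structure):

```python
from collections import defaultdict

# Hypothetical content: node ID -> text. Not Drupal's real schema.
nodes = {
    1: "Baseball Hall of Fame",
    2: "Hall rental information",
    3: "Youth baseball schedule",
}

def build_index(docs):
    """Map each lowercased word to the set of document IDs containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.lower().split():
            index[word].add(doc_id)
    return index

index = build_index(nodes)

# Looking up a keyword is now a single dictionary access
# instead of a scan over every node's text.
print(sorted(index["baseball"]))  # [1, 3]
print(sorted(index["hall"]))      # [1, 2]
```

The work of reading through all the text happens once, at indexing time, so each search only has to look up the words it cares about.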

So what's the downside to database-backed search? Mainly, performance. Databases are built to be efficient query engines: provide a specific set of parameters, and the database returns a specific set of data. Most databases are not optimized for arbitrary string-based search. Queries that use LIKE '%keyword%' generally can't take advantage of an ordinary index, because the leading wildcard means a match could start anywhere in the text, so the database ends up checking every row. That gets slow, especially if the query is being used across multiple JOINed tables! And even if you use the Search module or some other method of pre-indexing all the keyword data, relational databases will still be less efficient (and require much more work on a developer's part) for arbitrary text searches.
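If you'd like to see the difference for yourself, here's a small sketch using Python's built-in sqlite3 module. The table and index names are made up for the example (a drastically simplified stand-in for a node table), and the exact wording of the plan varies between SQLite versions, so treat this as an illustration rather than a benchmark:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical, simplified table; a real Drupal schema has many more columns.
conn.execute("CREATE TABLE node (nid INTEGER PRIMARY KEY, title TEXT)")
conn.execute("CREATE INDEX node_title ON node (title)")
conn.executemany(
    "INSERT INTO node (title) VALUES (?)",
    [("Baseball Hall of Fame",), ("Hall rental information",)],
)

def query_plan(sql):
    """Return SQLite's plan descriptions for a query as one string."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql).fetchall()
    # The human-readable description is the fourth column of each row.
    return "; ".join(row[3] for row in rows)

exact_plan = query_plan("SELECT nid FROM node WHERE title = 'Baseball Hall of Fame'")
like_plan = query_plan("SELECT nid FROM node WHERE title LIKE '%hall%'")

print(exact_plan)
print(like_plan)
```

The exact-match query should be reported as a SEARCH that uses the index to jump straight to the matching row, while the LIKE query should be reported as a SCAN, meaning every entry gets examined. With two rows that doesn't matter; with a few hundred thousand nodes and several JOINs, it does.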

If you're simply building lists of data based on very specific parameters (especially where the conditions for your query all utilize speedy indexes in the database), a relational database like MySQL will be highly effective. But usually, for search, you don't just have a couple options and maybe a custom sort—you have a keyword field (primarily), and end users have high expectations that they'll find what they're looking for by simply entering a few keywords and clicking 'Search'.

Why Solr?

What makes Solr different? Well, Solr is optimized specifically for text-based search. The Lucene text search engine that runs behind Apache Solr is built to be incredibly efficient and also offers some other really useful tools for searching. Apache Solr adds some cool features on top of Lucene, like:

Efficient and fast search indexing.
Simple search sorting on any field.
Search ranking based on some simple rules (over which you have complete control).
Multiple-index searching.
Features like facets, text highlighting, grouping, and document indexing (PDF, Word, etc.).
Geospatial search (searching based on location).

Some of these things may seem a little obscure, and it's likely that you don't need every one of these features on your site, but it's nice to know that Solr is flexible enough to allow you to do almost anything you want with your site search.

These general ideas are great, but in order to really understand what benefits Solr offers, let's look at what happens with a basic search in Apache Solr.

Simple Explanation of how Solr performs a search

This is a very basic overview, leaving out many technical details, but I hope it will help you understand what's going on behind the scenes at a basic level.

When searching with a database-backed search, the database says, "give me a few keywords, and I'll find exact matches for those words," and it only covers a few very specific bits of data (like title, body, and author). Searching with Solr is more nuanced, flexible, and powerful.

Step 1 - Indexing search data

First, when Solr builds an index of all the content on your site, it gathers all the content's data—each entity's title, body, tags, and any other textual information related to the entity. While reading through all this textual information, Solr does some neat things, like:

Stemming: taking a word like "baseballs" and adding in 'word stems' like "baseball".
Stop Word filtering: Removing words with little search relevance like "a", "the", "of", etc.
Normalization: Converting special characters to simpler forms (like ü to u and ê to e so search can work more intuitively).
Synonym expansion: Adding synonyms to words, so the words "doctor" and "practitioner" could be equivalent in a search, even if only one word appears in the content.

These functions are collectively known as analysis (tokenization, strictly speaking, is just the step that splits text into individual words), and are actually performed by Lucene, the engine running under Solr. You don't need to know what all this means right now, but basically, if your content has the word "baseball" in it, and a user searches for "baseballs" (or for "stickball", if that has been set up as a synonym), the "baseball" result will be returned.
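To make the four steps above a bit more concrete, here's a toy version in Python. Everything in it is an assumption for illustration: the stop word list, the synonym table, and especially the "stemmer", which just chops off a trailing "s". Lucene's real filters are configurable and far more careful than this (and this sketch replaces a word with its stem rather than adding the stem alongside it):

```python
import unicodedata

STOP_WORDS = {"a", "an", "the", "of"}
# Hypothetical synonym table, purely for illustration.
SYNONYMS = {"doctor": ["practitioner"], "baseball": ["stickball"]}

def normalize(word):
    """Lowercase and strip accents, e.g. 'ü' -> 'u'."""
    decomposed = unicodedata.normalize("NFKD", word.lower())
    return "".join(c for c in decomposed if not unicodedata.combining(c))

def stem(word):
    """Very naive stemmer: drops a single trailing 's'."""
    if len(word) > 3 and word.endswith("s") and not word.endswith("ss"):
        return word[:-1]
    return word

def analyze(text):
    """Return the list of index terms for a piece of text."""
    terms = []
    for raw in text.split():
        word = normalize(raw.strip(".,;:!?\"'"))
        if not word or word in STOP_WORDS:
            continue  # stop word filtering
        stemmed = stem(word)
        terms.append(stemmed)
        terms.extend(SYNONYMS.get(stemmed, []))  # synonym expansion
    return terms

print(analyze("The Baseballs of Zürich"))
# ['baseball', 'stickball', 'zurich']
```

Because the same processing is applied to the search keywords later on, "Baseballs", "baseball", and "stickball" all end up pointing at the same content.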

Step 2 - Searching with keywords

Second, when someone enters keywords to perform a search, Solr does a few things before it starts the actual search. We'll take the example below and run through what happens:

Baseball hall of fame

The first thing Solr does is split the search into groupings: first the entire string, then all but one word in every combination, then all but two words in every combination, and so on, until it gets to individual words. Just like with indexing, Solr will even take individual words like "hall" and split that word out into "halls", "hall", etc. (basically any kind of related term/plural/singular/etc.).

So now, at this point, your above search looks kind of like you actually searched for:

"baseball hall of fame"
"baseball hall"
"baseball fame"
"baseballs"
"halls"
...
"baseball"

I've skipped many derivatives for clarity, but basically Solr does a little work on the entered keywords to make sure you're going to get results that are relevant for the terms you entered.
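The groupings listed above can be sketched in a few lines of Python. To be clear, this models the simplified explanation in this post, not the internals of Solr's actual query parsers, and the stop word list is an assumption:

```python
from itertools import combinations

STOP_WORDS = {"a", "an", "the", "of"}

def expand_query(query):
    """List word groupings from longest to shortest, keeping word order."""
    words = query.lower().split()
    groupings = [" ".join(words)]  # the entire string comes first
    keywords = [w for w in words if w not in STOP_WORDS]
    for size in range(len(keywords), 0, -1):
        for combo in combinations(keywords, size):
            phrase = " ".join(combo)
            if phrase not in groupings:
                groupings.append(phrase)
    return groupings

for grouping in expand_query("Baseball hall of fame"):
    print(grouping)
```

For the example search this prints the full phrase first, then "baseball hall fame", the two-word groupings, and finally the individual words. Stemmed and plural variants like "baseballs" and "halls" are left out here to keep the sketch short.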

Step 3 - Executing the search

Finally, the search engine takes every one of the parsed keywords, and scores them against every piece of content in the index. Each piece of content then gets a score (higher for the number of possible matches, zero if no terms were matched). Then your search result shows all those results, ranked by how relevant they are to the current search.

If you had an entity with the title "Baseball Hall of Fame", it's likely that would be the top result. But some other content may match on parts or combinations of the keywords, so they'll also show up in the search.
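Here's a deliberately simple sketch of that ranking idea in Python, where the score is just the number of search terms a piece of content contains. The titles and terms are made up, and Lucene's real scoring is a lot more sophisticated (for example, it weights rare terms more heavily than common ones), so take this as the general shape only:

```python
def score(query_terms, doc_terms):
    """Count how many query terms occur in the document (0 means no match)."""
    return sum(1 for term in query_terms if term in doc_terms)

# Hypothetical indexed content: title -> set of analyzed terms.
docs = {
    "Baseball Hall of Fame": {"baseball", "hall", "fame"},
    "Town hall meeting": {"town", "hall", "meeting"},
    "Gardening tips": {"gardening", "tip"},
}
query_terms = ["baseball", "hall", "fame"]

scored = [(score(query_terms, terms), title) for title, terms in docs.items()]
# Keep only matching documents, highest score first.
results = [title for s, title in sorted(scored, reverse=True) if s > 0]
print(results)  # ['Baseball Hall of Fame', 'Town hall meeting']
```

The content matching all three terms comes first, the partial match comes second, and the content with a score of zero never shows up at all.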

If you know better than the search engine, and only want results that exactly match your search, you can enclose your keywords in quotes, so you would only get results with the exact string baseball hall of fame, and nothing that mentions 'hall of fame' or 'baseball' independently.
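If you're curious what that difference looks like when the search is sent to Solr over HTTP, here's a small Python sketch that builds the two request URLs. The host and the core name ("drupal") are assumptions for the example; the q parameter on the select handler is the standard way to pass the search keywords:

```python
from urllib.parse import urlencode

# Hypothetical Solr location and core name; adjust for your own setup.
SOLR_SELECT_URL = "http://localhost:8983/solr/drupal/select"

def build_search_url(keywords, exact=False):
    """Build a select URL; wrap the keywords in quotes for a phrase search."""
    q = f'"{keywords}"' if exact else keywords
    return SOLR_SELECT_URL + "?" + urlencode({"q": q})

print(build_search_url("baseball hall of fame"))
# http://localhost:8983/solr/drupal/select?q=baseball+hall+of+fame
print(build_search_url("baseball hall of fame", exact=True))
# http://localhost:8983/solr/drupal/select?q=%22baseball+hall+of+fame%22
```

The only difference is the pair of (URL-encoded) quotes around the keywords, which tells Solr to treat them as one phrase. On a Drupal site, the search module builds this request for you.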

Solr also adds in a few nifty features when it returns the search results (or lack thereof). It can give back spelling suggestions, which are based on whether any words in the search index are very close matches to the words or phrase you entered in the keywords, and it can also highlight the matched words or word parts in the actual search result.

Summary

In a nutshell, this post explained how Apache Solr works by indexing, tokenizing, and searching your content. If you read through the entire post, you even have a basic understanding of Levenshtein distance, approximate string matching, and concept search, and can get started building your own Google :)

I'll be diving much more deeply into Apache Solr as time allows, highlighting especially the past, present, and future of Apache Solr and Drupal, as well as ways you can make Apache Solr integrate more seamlessly and effectively with your site, perform better, and do exactly what you want it to do.