Symfony2: elasticsearch custom analyzers

In my previous posts I looked at integrating elasticsearch into a Symfony2 app and at how to use an alternative analyzer.

One thing we did not do last time was index the url field of the Site entity. The reason is that if you index urls and email addresses using the default settings they are not split up for indexing, so you cannot search on part of them. For example, if we index a site whose url contains limethinking, a search for limethinking will not return the indexed document.

The reason for this is the way that strings are analyzed to decide how to index them. An analyzer is a combination of a tokenizer and filters. The tokenizer decides how to split the string into individual tokens; the filters are then applied to those tokens before they are indexed. The default Standard Analyzer uses the Standard Tokenizer which, using language-specific rules, splits the string on whitespace and punctuation. It has three filters: the Lowercase Token filter, which lowercases each token so that search is not case dependent; the Stop Token filter, which removes a specified set of words so that words like and, or, etc. are not indexed; and the Standard Token filter.
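To make the tokenizer-plus-filters idea concrete, here is a minimal sketch in plain Python. It is a toy model only: the whitespace tokenizer and the three-word stop list are stand-ins I have made up for illustration, not the rules the Standard Tokenizer or the Stop Token filter actually apply.

```python
def whitespace_tokenizer(text):
    # Toy tokenizer: split on whitespace only (NOT the Standard Tokenizer's rules).
    return text.split()


def lowercase_filter(tokens):
    # Lowercase every token so matching is case independent.
    return [token.lower() for token in tokens]


def make_stop_filter(stopwords):
    # Build a filter that drops any token found in the given stop word set.
    stopwords = frozenset(stopwords)

    def stop_filter(tokens):
        return [token for token in tokens if token not in stopwords]

    return stop_filter


def analyze(text, tokenizer, filters):
    # An analyzer is a tokenizer followed by filters applied in order.
    tokens = tokenizer(text)
    for token_filter in filters:
        tokens = token_filter(tokens)
    return tokens


# Hypothetical, deliberately tiny stop list for the example.
toy_filters = [lowercase_filter, make_stop_filter({"and", "or", "the"})]

tokens = analyze("The Quick and the Dead", whitespace_tokenizer, toy_filters)
print(tokens)  # ['quick', 'dead']
```

The order of the filters matters here: lowercasing runs first, so "The" is reduced to "the" before the stop filter sees it.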

Unfortunately for us this does not work very well for the urls we want to index, because the standard tokenizer only splits on a dot when it is followed by whitespace, so urls are not split up but indexed whole. Fortunately for us it is easy to change the analyzer used in the config.
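On the elasticsearch side, a config like this amounts to index settings plus a mapping. As a rough sketch, here is that structure in elasticsearch's own JSON form, built as a Python dict so it can be printed and checked. The filter name url_stop is made up for illustration, and the typed mapping with a string field assumes the older elasticsearch versions this post was written against; the bundle's YAML config expresses the same ideas under its own keys.

```python
import json

# Sketch of the index settings and mapping for a custom url analyzer.
# "url_stop" is a hypothetical name for the custom stop filter.
index_body = {
    "settings": {
        "analysis": {
            "analyzer": {
                "url_analyzer": {
                    "type": "custom",
                    # The lowercase tokenizer splits on non-letters and lowercases.
                    "tokenizer": "lowercase",
                    # Custom stop filter first, then the standard stop filter.
                    "filter": ["url_stop", "stop"],
                }
            },
            "filter": {
                "url_stop": {"type": "stop", "stopwords": ["http", "https"]}
            },
        }
    },
    "mappings": {
        "site": {
            "properties": {
                # Index the url field with the custom analyzer.
                "url": {"type": "string", "analyzer": "url_analyzer"}
            }
        }
    },
}

print(json.dumps(index_body, indent=2))
```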

So we are defining a custom analyzer under the settings section, using the name url_analyzer. Under the site type we have added url to the mappings, specifying that it should use this analyzer. The analyzer uses the Lowercase Tokenizer, which splits on any non-letter character and lowercases the resulting tokens. We are also using a couple of filters with this: a custom stop filter, which removes http and https so that they are not matched in searches, and the standard stop filter. Because http and https are removed from the index for urls, we still get meaningful results if we search for these terms and have added sites with http or https in the title.
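To see what tokens an analyzer like this would produce for a url, the behaviour described above can be imitated in a few lines of Python. This is only an approximation of what elasticsearch does: the url is a made-up example, and the English stop word set is a small sample rather than the exact list the standard stop filter uses.

```python
import re

# Stop words for the custom filter described above.
URL_STOPWORDS = frozenset({"http", "https"})

# A small sample of common English stop words (assumed, not the filter's exact list).
ENGLISH_STOPWORDS = frozenset({"a", "an", "and", "in", "of", "or", "the", "to"})


def lowercase_tokenizer(text):
    # [^\W\d_]+ matches runs of letters, so any non-letter acts as a separator.
    return [token.lower() for token in re.findall(r"[^\W\d_]+", text)]


def stop_filter(tokens, stopwords):
    # Drop every token that appears in the stop word set.
    return [token for token in tokens if token not in stopwords]


def url_analyzer(text):
    # Tokenize, then apply the custom stop filter followed by the English one.
    tokens = lowercase_tokenizer(text)
    tokens = stop_filter(tokens, URL_STOPWORDS)
    return stop_filter(tokens, ENGLISH_STOPWORDS)


print(url_analyzer("https://www.Example.com/about-us"))
# ['www', 'example', 'com', 'about', 'us']
```

Each part of the url ends up as its own token, which is what makes a search on part of the url possible.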

We now need to add the url field to the search query from the previous post.

This adds another Text query, which searches the url field using the custom url_analyzer. All three fields of the Site entity are now searched.
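For reference, here is a sketch of the kind of request body such a query produces, again as a Python dict in elasticsearch's JSON form. It rests on several assumptions: that the existing queries are combined as should clauses of a bool query, that the title field is searched with the snowball analyzer (a placeholder, since the previous post's query is not shown here), and that the elasticsearch version in use still has the text query, which later versions renamed to match.

```python
import json


def text_clause(field, term, analyzer):
    # One text query on a single field, analyzed with the named analyzer.
    return {"text": {field: {"query": term, "analyzer": analyzer}}}


def build_site_query(term, existing_clauses):
    # Append the url clause to the existing ones and match on any of them.
    clauses = list(existing_clauses)
    clauses.append(text_clause("url", term, "url_analyzer"))
    return {"query": {"bool": {"should": clauses}}}


# Placeholder for the clauses from the previous post (assumed field and analyzer).
existing = [text_clause("title", "limethinking", "snowball")]

body = build_site_query("limethinking", existing)
print(json.dumps(body, indent=2))
```

Passing the analyzer on the url clause means the search term is broken up in the same way as the indexed urls.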