Symfony2: elasticsearch custom analyzers

In my previous posts I looked at integrating elasticsearch into a Symfony2 app and at how to use an alternative analyzer.

One thing we did not do last time was index the url field of the Site entity. The reason is that if you index urls and email addresses using the default settings they will not be split up for indexing, meaning that you cannot search on part of them. For example, if we have indexed http://www.limethinking.co.uk then a search for limethinking will not return the indexed document.

The reason for this is the way the strings are analyzed to decide how to index them. An analyzer is a combination of a tokenizer and filters. The tokenizer decides how to split the string to be indexed into individual tokens; the filters are then applied to those tokens before they are indexed. The default Standard Analyzer uses the Standard Tokenizer which, using language-specific rules, splits up the string on whitespace and punctuation. It applies the Lowercase Token filter, which lower cases each token so that searching is not case dependent; the Stop Token filter, which removes a specified set of words so that words like and, or etc. are not indexed; and the Standard Token filter.

Unfortunately for us this does not work very well for the urls we want to index, because the standard tokenizer does not split on dots unless they are followed by whitespace, so urls are not split up but indexed whole. Fortunately it is easy to change the analyzer used in the config:
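Something like the following should do it; the Acme\BookmarkBundle namespace is a placeholder for your own, and the option names follow the bundle's configuration format as I recall it, so treat this as a sketch:

```yaml
# app/config/config.yml
foq_elastica:
    clients:
        default: { host: localhost, port: 9200 }
    indexes:
        bookmarks:
            client: default
            settings:
                index:
                    analysis:
                        analyzer:
                            # custom analyzer used for the url field
                            url_analyzer:
                                type: custom
                                tokenizer: lowercase
                                filter: [url_stop, stop]
                        filter:
                            # custom stop filter removing the url scheme tokens
                            url_stop:
                                type: stop
                                stopwords: [http, https]
            types:
                site:
                    mappings:
                        name: { analyzer: snowball }
                        keywords: { analyzer: snowball }
                        url: { analyzer: url_analyzer }
                    persistence:
                        driver: orm
                        model: Acme\BookmarkBundle\Entity\Site
                        provider: ~
                        listener: ~
                        finder: ~
```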

So we are defining a custom analyzer, named url_analyzer, under the settings section, and under the site type we have added url to the mappings, specifying that it should use this analyzer. The analyzer uses the Lowercase Tokenizer, which splits on any non-letter character as well as lower casing the resulting tokens. We are also using a couple of filters with it: a custom stop filter which removes http and https so they are not matched in searches, as well as the standard stop filter. By removing http and https from the index for urls we can still get meaningful results if we search for these terms and have added sites with http or https in the name.

We now need to add searching the url field to our query from the previous post:
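A sketch of what this looks like, building on the bool query from the previous post (variable names are just illustrative):

```php
// text query against the url field, analyzed with our custom analyzer
$urlQuery = new \Elastica_Query_Text();
$urlQuery->setFieldQuery('url', $searchTerm);
$urlQuery->setFieldParam('url', 'analyzer', 'url_analyzer');

// add it as a third "should" clause alongside the name and keywords queries
$boolQuery->addShould($urlQuery);

$finder = $this->get('foq_elastica.finder.bookmarks.site');
$sites  = $finder->find($boolQuery);
```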

This adds another Text query searching the url field using the custom url_analyzer. Now all three fields of the Site entity are being searched on.

Symfony2: improving elasticsearch results

In my previous post I looked at integrating elasticsearch into a Symfony2 app using Elastica and the FOQElasticaBundle. By the end we were indexing a Site entity and performing basic searches against the index. In this post I will look at improving how we index and search the Site entities.

We can improve the indexing of the name and keywords by switching to a different analyzer. Currently we are only going to find whole word matches; for example, if we index Lime Thinking as a site name then it will be found by a search for thinking, but not think or thinks. We can change this by instead using the snowball analyzer. This is a built-in analyzer which is the same as the standard analyzer but with the addition of the snowball filter, which stems the tokens. This means that words are indexed as their stems, so thinking will be indexed as think. We can then find it with searches for words such as think, thinks and thinking. I will have a more detailed look at analyzers and filters in a future post.

We just need to make a small config change to start using this analyzer for indexing:
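The change is just to specify the analyzer against each field in the mappings, something like this (shown without the surrounding client and index settings):

```yaml
            types:
                site:
                    mappings:
                        name: { analyzer: snowball }
                        keywords: { analyzer: snowball }
                    # persistence settings as before
```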

We need to make some further changes though to get the benefit of this: we also need to make sure that the search terms are analyzed with the same analyzer as the indexed field. If they are not, we will only get matches when we search for the stemmed token, e.g. think will find Lime Thinking but thinking will not. Our simple query does not specify which field we are searching, which means it searches the built-in _all field which, unsurprisingly, contains all the fields. This means we cannot use different analyzers for searching different fields. We are going to want to add the url at some point using a different analyzer, so we need to specify each field we want to search separately.

So we now need to split up our query into several parts. For this we need to use Elastica’s query builder objects. To search on a specific field we can use a Text query, so to search on the name field we use:
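Roughly like this, assuming the finder service from the previous post and a $searchTerm variable holding the user's input:

```php
$searchTerm = 'thinking';

// text query against just the name field
$nameQuery = new \Elastica_Query_Text();
$nameQuery->setFieldQuery('name', $searchTerm);

$finder = $this->get('foq_elastica.finder.bookmarks.site');
$sites  = $finder->find($nameQuery);
```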

Notice that we pass the query object into the same method on the finder as before; this method accepts both simple search strings and queries built through objects. According to the elasticsearch documentation the analyzer will default to the field-specific analyzer or the default one, which to me suggests that the above query will automatically use the analyzer set for the field. However, this did not work for me; fortunately, it is easy to specify the analyzer to use for the field:
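A minimal sketch of that:

```php
$nameQuery = new \Elastica_Query_Text();
$nameQuery->setFieldQuery('name', $searchTerm);
// analyze the search terms with the same analyzer as the indexed field
$nameQuery->setFieldParam('name', 'analyzer', 'snowball');
```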

Our current query will of course only search the name field; what we want to do is search both the name field and the keywords field using the snowball analyzer. This is done by creating another query as above for the keywords field and then using a boolean query to combine the two individual queries into one:
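Something along these lines:

```php
// second text query for the keywords field, analyzed the same way
$keywordsQuery = new \Elastica_Query_Text();
$keywordsQuery->setFieldQuery('keywords', $searchTerm);
$keywordsQuery->setFieldParam('keywords', 'analyzer', 'snowball');

// combine the two queries; a document matching either clause will be returned
$boolQuery = new \Elastica_Query_Bool();
$boolQuery->addShould($nameQuery);
$boolQuery->addShould($keywordsQuery);

$sites = $finder->find($boolQuery);
```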

Whilst this looks complicated, each constituent part is simple, and this is a good way to build up more complicated queries.

A really helpful recent addition to the bundle is logging to the web profiler toolbar, so you can see the parsed JSON query that is sent to elasticsearch. The combined query from above looks like this:
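Roughly like this (reconstructed rather than copied straight from the profiler, so treat the exact structure as indicative):

```json
{
    "query": {
        "bool": {
            "should": [
                { "text": { "name":     { "query": "thinking", "analyzer": "snowball" } } },
                { "text": { "keywords": { "query": "thinking", "analyzer": "snowball" } } }
            ]
        }
    }
}
```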

We have seen the Text query and the Boolean query here; these are just two of the available query types. There is more information on each in the elasticsearch documentation. There is little in the way of documentation for the Elastica objects for creating these query types, but the test suite provides quite a lot of examples of putting them to use.

Symfony2: Integrating elasticsearch

Over a short series of posts I am going to have a look at using elasticsearch with Symfony2.

Elasticsearch is built on top of Lucene and indexes data as JSON documents, in a similar way to how MongoDB stores data. This means that, as with Mongo, it is schemaless and creates fields on the fly. It is queried over HTTP using queries which are themselves defined in JSON. I am not going to go into detail about using elasticsearch in this way; there is plenty of information in its online documentation.

Reading through the documentation makes it look as though there is a steep learning curve to getting started with elasticsearch. What I want to do is look at how you can avoid having to deal with issuing JSON queries over HTTP from a Symfony2 app and actually get started using elasticsearch in a very simple way. This is possible by using Elastica, a PHP library which abstracts the details of the queries, along with the FOQElasticaBundle which integrates Elastica into Symfony2 applications. This is not just a basic wrapper to expose Elastica as a Symfony2 service though; its integration with Doctrine for indexing ORM entities and ODM documents is fantastic, and that is what I am going to look at here.

To get started you need to install elasticsearch itself, as well as installing Elastica and the FOQElasticaBundle in the usual way.

As an example of how easy the integration is I will look at a very basic application for bookmarking sites and searching for them. For simplicity's sake we are just going to have a single entity to model each site; it just has a name, the URL and some keywords stored as a comma-separated list. So here it is as a Doctrine entity class:
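Something like this, with the getters and setters omitted (the Acme\BookmarkBundle namespace is just a placeholder):

```php
<?php

namespace Acme\BookmarkBundle\Entity;

use Doctrine\ORM\Mapping as ORM;

/**
 * @ORM\Entity
 */
class Site
{
    /**
     * @ORM\Id
     * @ORM\Column(type="integer")
     * @ORM\GeneratedValue(strategy="AUTO")
     */
    private $id;

    /**
     * @ORM\Column(type="string")
     */
    private $name;

    /**
     * @ORM\Column(type="string")
     */
    private $url;

    /**
     * @ORM\Column(type="string")
     */
    private $keywords;

    // ... getters and setters
}
```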

We can then set up the bundle to index the fields of our entity. By choosing to use the integration with Doctrine we can make this very simple:
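The configuration looks something like this; the option names follow the bundle's README as I remember it, so do check against the current version:

```yaml
# app/config/config.yml
foq_elastica:
    clients:
        default: { host: localhost, port: 9200 }
    indexes:
        bookmarks:
            client: default
            types:
                site:
                    mappings:
                        name: ~
                        keywords: ~
                    persistence:
                        driver: orm
                        model: Acme\BookmarkBundle\Entity\Site
                        provider: ~
```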

Whilst there are quite a few settings here, it is fairly straightforward. The client just sets the port to use for the HTTP communication. The bookmarks setting under indexes is the name of the index we will create. Within each index you can have types for each of your entity types; we just have the one type (site) here at the moment.

We have specified that we are using the ORM, the entity class, and which fields to map; for now just the name and keywords (I will return to indexing the url in my next post). That is enough to get any existing Sites stored in the database into the search index. Running the following console command will do this:
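The command is the bundle's populate task; if memory serves, it is:

```bash
php app/console foq:elastica:populate
```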

It is as easy as that! All the sites already stored in the database are now indexed without even needing to write any code, just a small amount of configuration. Great as that is, it would be even better if we could automatically index any new entities, as well as update and remove indexed entities as they are updated and removed from the database, without having to rerun the console command. This is just as easy to achieve with only one extra item added to the configuration:
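The extra item is the listener line under the persistence settings, something like:

```yaml
                    persistence:
                        driver: orm
                        model: Acme\BookmarkBundle\Entity\Site
                        provider: ~
                        listener: ~
```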

This enables the bundle's built-in Doctrine event listeners, which will do just that: keep the search index up to date with any changes we make to the entities, again without any additional code needed in typical CRUD controllers.

Before looking at searching the index there is one more bit of config which can be added to make integration easy:
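Again under the persistence settings, this time the finder line:

```yaml
                    persistence:
                        driver: orm
                        model: Acme\BookmarkBundle\Entity\Site
                        provider: ~
                        listener: ~
                        finder: ~
```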

By adding the finder line we activate the support for returning the search results as Doctrine entities, so the bundle will do the work of fetching the relevant entities from the database after querying the elasticsearch index.

So how do we query the index? The bundle dynamically creates a finder service you can request from the container, named in the format foq_elastica.finder.index-name.type-name. These match the values in our config, so the service we need is foq_elastica.finder.bookmarks.site. We can now issue queries using this service:
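For example, from a controller action (the search term here is obviously just an illustration):

```php
// fetch the finder service generated for our bookmarks index and site type
$finder = $this->get('foq_elastica.finder.bookmarks.site');

// a plain search string is analyzed and run against the index,
// and the matching Site entities are returned
$sites = $finder->find('lime thinking');
```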

Elastica provides an OO query builder for creating more complicated queries, but I will leave that for another day. Hopefully I have shown just how straightforward it is to get started using elasticsearch with a Symfony2 app. As always, it is not limited to such simplicity, and you can override these built-in services to provide your own providers, finders and listeners if you have more complex requirements.