Sitecore Multiple Partial Word Search with SOLR

In this article I want to demonstrate how to implement a full-phrase search that also matches partial words when using Sitecore with the Solr search engine.

For a full-content search, in most cases it is enough to use the .Where(item => item.Content.Contains(query)) query, where “query” is the phrase from the search textbox and can be one word, two words, or an entire sentence.

Imagine we have an item with the content “Very simple Sitecore item” and try to search for it. If we search for “site”, “simple”, “Sitecore item” or even “Very simple Sitecore item”, Solr returns the item. It works fine! But if we search for “simple sit”, it returns nothing. Doesn’t Solr support searching by part of a word? Yes and no at the same time. Let's explore...

As mentioned in the documentation, Solr does not support wildcard (that is, partial) matching for search phrases, and a phrase query is exactly what .Contains("simple sit") is converted to:

Solr’s standard query parser supports single and multiple character wildcard searches within single terms. Wildcard characters can be applied to single terms, but not to search phrases. https://lucene.apache.org/solr/guide/6_6/the-standard-query-parser.html

When you search for "sit", your query is parsed as <str name="parsedquery">_content:sit</str>. Solr looks for "sit" as part of a word or as an entire word, and it works fine. But when you search for "simple sit", your query is parsed as <str name="parsedquery">PhraseQuery(_content:"simple sit")</str>. In this case Solr looks for "simple" and "sit" only as entire words in the document, and you get no results, because no document contains the word "sit".
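The difference between the two cases can be illustrated with a small stand-alone sketch. It is a toy model written for this article, not Solr's actual implementation: the class and method names are hypothetical, and tokenization is reduced to lowercasing and splitting on spaces.

```csharp
using System;
using System.Linq;

public static class PhraseVsTermDemo
{
    // Toy stand-in for the analyzer (assumption): lowercase, then split on spaces.
    public static string[] Tokenize(string text) =>
        text.ToLowerInvariant().Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries);

    // Single-term wildcard match: some indexed token contains the term.
    public static bool MatchesWildcardTerm(string[] indexedTokens, string term) =>
        indexedTokens.Any(token => token.Contains(term.ToLowerInvariant()));

    // Phrase match: the query words must equal whole indexed tokens,
    // consecutively and in the same order.
    public static bool MatchesPhrase(string[] indexedTokens, string phrase)
    {
        var words = Tokenize(phrase);
        if (words.Length == 0) return false;
        for (var start = 0; start + words.Length <= indexedTokens.Length; start++)
        {
            if (indexedTokens.Skip(start).Take(words.Length).SequenceEqual(words))
                return true;
        }
        return false;
    }

    public static void Main()
    {
        var tokens = Tokenize("Very simple Sitecore item");
        Console.WriteLine(MatchesWildcardTerm(tokens, "sit"));     // True
        Console.WriteLine(MatchesPhrase(tokens, "Sitecore item")); // True
        Console.WriteLine(MatchesPhrase(tokens, "simple sit"));    // False
    }
}
```

The last call fails for the same reason as the real query: "sit" is compared against whole tokens, and no token equals "sit".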

The easiest way to support a multi-word search that also matches parts of words is to use code like the following:

   
using (var searchContext = ContentSearchManager.GetIndex(new SitecoreIndexableItem(startItem)).CreateSearchContext()) 
{

var querySplitted = queryItem.Split(' '); //Split the queryItem by the white space (or any other symbols if you need) 

var predicate = PredicateBuilder.True <TIndexModel>(); 

foreach (var query in querySplitted) 
{ 
    predicate  = predicate.And(item => item.Content.Contains(query)); 
} 

var query = searchContext.GetQueryable<SearchResultItem>().Filter(predicate);
}
 

The code above works as expected.

There is only one thing you need to know: the code above does not take the word order of the search request into account.
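The effect of ignoring word order can be seen in a small stand-alone sketch. It only mimics the logic of the AND-of-Contains predicate on a plain string; the class and method names are hypothetical, and no Sitecore or Solr API is involved.

```csharp
using System;
using System.Linq;

public static class WordOrderDemo
{
    // Mirrors the predicate logic: every query word must occur somewhere
    // in the content, regardless of position.
    public static bool MatchesAllWords(string content, string query)
    {
        var lowered = content.ToLowerInvariant();
        return query.ToLowerInvariant()
            .Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries)
            .All(word => lowered.Contains(word));
    }

    public static void Main()
    {
        const string content = "Very simple Sitecore item";
        Console.WriteLine(MatchesAllWords(content, "simple sit")); // True
        Console.WriteLine(MatchesAllWords(content, "sit simple")); // True, the order is ignored
        Console.WriteLine(MatchesAllWords(content, "simple xyz")); // False
    }
}
```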

Another way is to change the tokenizer.

Currently, the StandardTokenizerFactory is used for both the "index" and "query" analyzers. This tokenizer splits our "Very simple sitecore item" into the following keywords when the field is indexed: "very", "simple", "sitecore", "item".

If you want to search by parts of a word, you need to apply, for example, the N-Gram Tokenizer, <tokenizer class="solr.NGramTokenizerFactory" minGramSize="3" maxGramSize="5"/>, which breaks our "Very simple sitecore item" phrase into parts such as the following at indexing time: ver very ery sim simp simpl imp impl imple mpl mple ple sit site sitec ite itec iteco tec teco tecor eco ecor ecore cor core ore item tem. (The parts are listed per word for readability. This tokenizer runs over the whole field value without splitting on white space, so it also emits parts that contain spaces.) Now you can search by part of a word, because the index contains every part that is three to five characters long. You need to apply the NGramTokenizerFactory to both analyzers:

   

<fieldType name="text_general" class="solr.TextField" positionIncrementGap="100" multiValued="false"> 
    <analyzer type="index"> 
        <tokenizer class="solr.NGramTokenizerFactory" minGramSize="3" maxGramSize="5"/> 
        <filter class="solr.StopFilterFactory" words="stopwords.txt" ignoreCase="true"/> 
        <filter class="solr.LowerCaseFilterFactory"/> 
    </analyzer> 
    <analyzer type="query"> 
        <tokenizer class="solr.NGramTokenizerFactory" minGramSize="3" maxGramSize="5"/> 
        <filter class="solr.StopFilterFactory" words="stopwords.txt" ignoreCase="true"/> 
        <filter class="solr.SynonymFilterFactory" expand="true" ignoreCase="true" synonyms="synonyms.txt"/> 
        <filter class="solr.LowerCaseFilterFactory"/> 
    </analyzer> 
</fieldType>   
 

If you use NGramTokenizerFactory, you need to use the == operator instead of .Contains(). For example: predicate.Or(item => item.Content == query).

If you change the analyzer to NGramTokenizerFactory, be careful: the index size will increase, which can cause performance issues. Make sure you know what you are doing before changing the analyzer.
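To make the index growth concrete, here is a minimal sketch that generates the grams for a single word with minGramSize="3" and maxGramSize="5". It is an illustration written for this article, not the Lucene implementation, and the names are hypothetical. It is applied to one word only; over a whole field the real tokenizer would also emit grams that contain spaces.

```csharp
using System;
using System.Collections.Generic;

public static class NGramDemo
{
    // Emits every substring whose length is between minGram and maxGram,
    // sliding one character at a time over the input.
    public static List<string> NGrams(string input, int minGram, int maxGram)
    {
        var grams = new List<string>();
        for (var start = 0; start < input.Length; start++)
        {
            for (var size = minGram; size <= maxGram && start + size <= input.Length; size++)
            {
                grams.Add(input.Substring(start, size));
            }
        }
        return grams;
    }

    public static void Main()
    {
        var grams = NGrams("sitecore", 3, 5);
        Console.WriteLine(string.Join(" ", grams));
        // sit site sitec ite itec iteco tec teco tecor eco ecor ecore cor core ore
        Console.WriteLine(grams.Count); // 15
    }
}
```

A single eight-letter word that the standard tokenizer stores as one term becomes fifteen terms here, which is where the extra index size comes from.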

I chose NGramTokenizerFactory only as an example. Solr has many other tokenizers, such as the Edge N-Gram Tokenizer, that are also suitable for partial search. You can find the full list of tokenizers here: https://lucene.apache.org/solr/guide/6_6/tokenizers.html
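As a rough comparison, an edge n-gram approach keeps only the grams anchored at the start of its input, which yields far fewer terms but supports only "starts with" matching. The sketch below is a simplified illustration with hypothetical names, applied to a single word on the assumption that the text has already been split into words.

```csharp
using System;
using System.Collections.Generic;

public static class EdgeNGramDemo
{
    // Emits only the prefixes of the input whose length is between minGram and maxGram.
    public static List<string> EdgeNGrams(string input, int minGram, int maxGram)
    {
        var grams = new List<string>();
        for (var size = minGram; size <= maxGram && size <= input.Length; size++)
        {
            grams.Add(input.Substring(0, size));
        }
        return grams;
    }

    public static void Main()
    {
        Console.WriteLine(string.Join(" ", EdgeNGrams("sitecore", 3, 5))); // sit site sitec
    }
}
```

With these grams a search for "sit" would still match, while a search for "core" would not, because "core" is not a prefix of "sitecore".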

Artsem Prashkovich - Sitecore MVP/ Lead Developer

Artsem is a Sitecore MVP and Lead developer at Brimit working on Sitecore projects since 2012. Sitecore certified developer. Certified Scrum Master.