How to Find Items That Are Semantically Close to the Current Item in Sitecore Cortex

Problem: How to find items that are semantically close to the current item in Sitecore?

What does “semantic” mean?

Semantic search denotes search with meaning, as distinguished from lexical search (where the search engine looks for literal matches of the query words or variants of them, without understanding the overall meaning of the query). Semantic search systems consider various points including context of search, intent, variation of words, synonyms, generalized and specialized queries, concept matching and natural language queries to provide relevant search results.

Sitecore 9 and higher includes the Content Tagging feature. With this feature, you can assign semantic tags to Sitecore items and later use these tags for item search and comparison.

But if you use Content Tagging for these purposes, you can run into the following problems:

  1. Unique tags: Some items with semantically close content have different tags (no intersection between the tags of these items). Sometimes items have different tags, but the tags themselves are semantically close to one another (like “fruit” and “juice”).
  2. Zero results: For example, if you always need to return the top 5 semantically related items, but no other item shares a tag with the current item, you don't know what to return.
  3. Boosting: If several items have the same number of matching tags, there is no way to order them by semantic similarity.
  4. Third-party services: Content Tagging relies on a third-party service (Open Calais), which can be a problem.
  5. Non-English content: Content Tagging works fine if you have an English version of your content, but if you don't, it can be a problem.

Technical background for semantic search:

How to compare texts? What is a vector vocabulary, and what is it needed for?

Vector Vocabulary

A vector vocabulary is a model, trained by a neural network, that represents each word from a large dataset as a vector in a multidimensional space (300 dimensions or more). Each word is a point in this 300-dimensional space. A text is treated as the sum of its words, so it is also a single point in the same space (the sum of its word vectors).

The more semantically similar two texts are, the closer their points are in this 300-dimensional space, and vice versa. For each text we can calculate its point in this space (a float[300]) and compare two texts by the distance between their points.
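
The idea above can be sketched in a few lines of C#. This is only an illustration under stated assumptions: the `TextVectors` class, the in-memory `vocabulary` dictionary, and the simple tokenizer are hypothetical and not part of the module, and cosine similarity is used here as one common way to measure how close two points are (the module itself may measure distance differently).

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper, not part of the module: shows how a text becomes
// a single vector and how two such vectors can be compared.
public static class TextVectors
{
    private static readonly char[] Separators =
        { ' ', '\t', '\r', '\n', '.', ',', ';', ':', '!', '?' };

    // Sums the vectors of all words found in the vocabulary.
    // Words missing from the vocabulary are skipped.
    public static float[] Embed(
        string text,
        IReadOnlyDictionary<string, float[]> vocabulary,
        int dimensions)
    {
        var sum = new float[dimensions];
        var words = text.ToLowerInvariant()
            .Split(Separators, StringSplitOptions.RemoveEmptyEntries);

        foreach (var word in words)
        {
            if (!vocabulary.TryGetValue(word, out var vector))
            {
                continue;
            }

            for (var i = 0; i < dimensions; i++)
            {
                sum[i] += vector[i];
            }
        }

        return sum;
    }

    // Cosine similarity: 1 means the same direction, 0 means unrelated.
    // Returns 0 when either vector has zero length.
    public static double CosineSimilarity(float[] a, float[] b)
    {
        if (a.Length != b.Length)
        {
            throw new ArgumentException("Vectors must have the same length.");
        }

        double dot = 0, normA = 0, normB = 0;
        for (var i = 0; i < a.Length; i++)
        {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }

        if (normA == 0 || normB == 0)
        {
            return 0;
        }

        return dot / (Math.Sqrt(normA) * Math.Sqrt(normB));
    }
}
```

With a real vocabulary, `dimensions` would be 300, and calling `CosineSimilarity(Embed(textA, ...), Embed(textB, ...))` gives a score that grows as the two texts get semantically closer.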

Solution:

Let's implement our own module with behavior similar to Content Tagging, but adapted to our requirements.

  1. First of all, we need a vector vocabulary to use for semantic analysis. You can find many of them on the internet, but I recommend fastText by Facebook. There you can find many open-source datasets provided by Facebook, including language-specific ones. We use the Google News dataset because it is lightweight and works fine for our purposes.
  2. Next, we need a UI for Sitecore. We create a UI similar to the one for Content Tagging.

      - We want to be able to choose the templates among which related items will be searched.

      - Optionally, we want to set a “Minimum similarity” for related items, as a percentage from 0 to 100, to exclude weakly related items when no strongly related items exist. I recommend 70% or higher.
  3. We need to choose how and where we will process and store the information. A Solr index is the best candidate for this. We create a separate index, “sitecore_related_content_index”, for storing item semantics.

    For each item we need to store only a float[300] value instead of tons of text. This value holds the item's coordinates in the 300-dimensional space. To implement this logic we need a computed field:

    
    <fields hint="raw:AddComputedIndexField">
          <field fieldName="vector"  returnType="floatCollection">Semantic.Foundation.RelatedContentTagging.Indexing.Vector, Semantic.Foundation.RelatedContentTagging</field>
    </fields>
    
    

    In the computed field logic we need to extract all text content from the item (excluding standard fields, and including datasource items if needed; this is configurable in our module's config). Then we recalculate the “vector” for the item and save it in Solr:

    
        using Microsoft.Extensions.DependencyInjection;
        using Sitecore.Abstractions;
        using Sitecore.ContentSearch;
        using Sitecore.ContentSearch.ComputedFields;
        using Sitecore.Data.Items;
        using Sitecore.DependencyInjection;
        // The remaining types (IMessageBusFactory, the *Args classes,
        // RelatedItemContentTaggingProvidersSet) need using directives
        // for the namespaces that define them.

        // Computed index field that stores an item's semantic vector in the index.
        public class Vector : IComputedIndexField
        {
            public string FieldName { get; set; }
            public string ReturnType { get; set; }

            public object ComputeFieldValue(IIndexable indexable)
            {
                var item = (Item)(indexable as SitecoreIndexableItem);

                // Nothing to compute for indexables that are not Sitecore items.
                if (item == null)
                {
                    return null;
                }

                return GetItemVector(item);
            }

            public float[] GetItemVector(Item item)
            {
                var messageBusFactory = ServiceLocator.ServiceProvider.GetService<IMessageBusFactory>();
                var messageBus = messageBusFactory.Create();
                var pipelineManager = ServiceLocator.ServiceProvider.GetService<BaseCorePipelineManager>();
                // Pipeline domain (group) under which the module's pipelines are registered.
                const string pipelineDomain = "RealtedContentTagging";

                // Resolve the configured content providers, taggers and discovery provider.
                var configurationArgs = new GetRelatedContentTaggingConfigurationArgs
                {
                    MessageBus = messageBus
                };
                pipelineManager.Run("getRelatedTaggingConfiguration", configurationArgs, pipelineDomain);

                var tagContentArgs = new RelatedContentTagArgs
                {
                    Configuration = new RelatedItemContentTaggingProvidersSet
                    {
                        ContentProviders = configurationArgs.ProvidersSet.ContentProviders,
                        Taggers = configurationArgs.ProvidersSet.Taggers,
                        DiscoveryProvider = configurationArgs.ProvidersSet.DiscoveryProvider
                    },
                    ContentItem = item,
                    MessageBus = messageBus
                };

                // The getContent pipeline processes the item and fills in its vector.
                pipelineManager.Run("getContent", tagContentArgs, pipelineDomain);
                return tagContentArgs.Vector;
            }
        }
    
    

    Now an item's vector is recalculated whenever the item is saved, so vectors stay up to date in real time.

  4. To find semantically related items, select an item in the Sitecore tree and click the "Related for item" or "Related for item with subitems" command. The related items field will be populated automatically.
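
To make the selection step more concrete, here is a minimal sketch of how related items could be ranked once every item has a vector. It is not the module's actual code: the `RelatedItemRanker` class, its parameters, and the use of cosine similarity scaled to a percentage are assumptions for illustration, with the “Minimum similarity” setting from step 2 applied as a cut-off.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical ranking helper, not taken from the module's source.
public static class RelatedItemRanker
{
    // Returns up to `top` candidate ids with their similarity in percent,
    // best match first, dropping anything below the minimum similarity.
    public static List<KeyValuePair<string, double>> FindRelated(
        float[] current,
        IReadOnlyDictionary<string, float[]> candidates,
        double minimumSimilarityPercent = 70,
        int top = 5)
    {
        return candidates
            .Select(c => new KeyValuePair<string, double>(
                c.Key, CosineSimilarity(current, c.Value) * 100))
            .Where(scored => scored.Value >= minimumSimilarityPercent)
            .OrderByDescending(scored => scored.Value)
            .Take(top)
            .ToList();
    }

    // Assumed similarity measure: cosine of the angle between two vectors.
    private static double CosineSimilarity(float[] a, float[] b)
    {
        if (a.Length != b.Length)
        {
            throw new ArgumentException("Vectors must have the same length.");
        }

        double dot = 0, normA = 0, normB = 0;
        for (var i = 0; i < a.Length; i++)
        {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }

        return normA == 0 || normB == 0
            ? 0
            : dot / (Math.Sqrt(normA) * Math.Sqrt(normB));
    }
}
```

Unlike tag intersection, this always yields an ordering: two candidates almost never get exactly the same score, and lowering `minimumSimilarityPercent` to 0 keeps every candidate that is not scored as opposite to the current item.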

Important: if you use item serializers such as Unicorn or TDS, make sure that an index rebuild after sync is configured in their config files. Alternatively, rebuild "sitecore_related_content_index" manually after synchronization.

Source code:

The full source code of the module is available on GitHub.

If you want to know more about semantic search, see my meetup presentation with demo: