noizeramp

Aleksey's programming, engineering and electronics log

Scripted sorting in Elasticsearch

Aleksey Gureev
22 October 2020 ⋅ rails

In some of our projects where we need flexible fuzzy search capabilities with weighted results we use Elasticsearch. It indexes everything you throw at it, then gives you tools to search it flexibly.

In this short article I share the approach we used to apply an additional sort, based on certain criteria, on top of the weighted output.

In our product we have articles. A lot of them. They are authored by users and some of them relate to a certain product. The idea was to prioritize the search results so that the user’s own articles come first, then product-related ones and finally the rest of the flock. The relevance score of your own article can be much lower than the rest, but since it’s in the results and it’s yours we show it at the top.

  • ES has filters to cut out unnecessary records that are definitely invalid, so it doesn’t even evaluate the score for them.

  • It has queries that are used to evaluate the score of each record. There can be several queries that you can boost to give them more or less weight.

  • There are compound queries that you use to add a layer on top of your ordinary queries. They combine, switch and change the behaviour of the queries they wrap.

  • You can also use sort to order the final scored result. As you would expect, it can be ascending or descending and there can be multiple clauses.
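To make the four building blocks above more concrete, here is a rough sketch of a search body that uses all of them, written as a Ruby hash. The `build_search_body` helper and the `published` field are made up for illustration and are not part of our actual index.

```ruby
# Hypothetical helper: combines a filter, a scored query and a sort clause
# inside a bool compound query. Field names are illustrative only.
def build_search_body(query)
  {
    query: {
      bool: {
        # filter context: non-matching documents are dropped and never scored
        filter: [{ term: { published: true } }],
        # query context: contributes to the relevance score, weighted by boost
        must: [{ match: { title: { query: query, boost: 2.0 } } }]
      }
    },
    # final ordering of the scored results: most relevant first
    sort: [{ _score: :desc }]
  }
end

body = build_search_body('sorting')
```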

Applying this to the task, we have a query that we run over our index to find the articles whose titles best match the user-provided query string, like this:

dis_max: {
  queries: [
    { match: { title: { query: query, fuzziness: 'AUTO' } } },
    { prefix: { title: { value: query, boost: 0.001 } } }
  ]
}

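Here we have two queries wrapped in a disjunction max query. It takes the highest score among its sub-queries and adds the scores of the other matching sub-queries multiplied by a tie-breaker value (0 by default, and we don’t set it here). We have a fuzzy match on whole words and a prefix match with a far lower boost.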
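The arithmetic behind the disjunction max query can be sketched in a few lines of Ruby. This only mirrors the documented formula; the `dis_max_score` helper and the sample scores are invented for illustration.

```ruby
# Combined score of a dis_max query for a single document:
# the best sub-query score plus tie_breaker times the sum of the others.
# `scores` holds the scores of the sub-queries that matched the document.
def dis_max_score(scores, tie_breaker = 0.0)
  return 0.0 if scores.empty?

  best = scores.max
  best + tie_breaker * (scores.sum - best)
end

# With the default tie_breaker only the best sub-query counts.
without_tie_breaker = dis_max_score([2.4, 0.003])

# With a tie_breaker the weaker match adds a fraction of its score.
with_tie_breaker = dis_max_score([2.4, 0.5], 0.3)
```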

Now we need to order the results so that the user’s own articles come first. We need a user_id field in the index, but how do we use it? Elastic does not normalize the score. It is a float with no definite range, so no fixed boost can guarantee that your own article outranks everything else, and boosting is out of the question. The next candidate is sorting.

Luckily Elastic has scripting support in many areas. Here’s what I came up with:

{
  sort: [
    {
      _script: {
        type: 'number',
        script: {
          lang: 'painless',
          source: """
            int priority = 0;

            if (doc['category'].value == 'product') {
              priority += 1;
            }

            if (doc['user_id'].value == params.user_id) {
              priority += 2;
            }

            return priority;
          """,
          params: {
            user_id: user&.id || ''
          }
        },
        order: 'desc'
      }
    },
    { _score: :desc }
  ]
}

In the scripted sort clause we have a Painless script where the default priority is 0, product articles get 1 additional point and own articles get 2 additional points. We sort by this priority first and then by the relevance score, both in descending order.
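To see the resulting order without a running cluster, the same logic can be mirrored in plain Ruby. This is only an illustration: the `priority` method copies the script, while the hits, user ids and scores are made up.

```ruby
# Plain-Ruby mirror of the Painless script, for illustration only.
def priority(article, user_id)
  priority = 0
  priority += 1 if article[:category] == 'product'
  priority += 2 if article[:user_id] == user_id
  priority
end

# Hypothetical hits with invented relevance scores.
hits = [
  { title: 'Unrelated, best match', category: 'other',   user_id: 'u2', score: 9.1 },
  { title: 'Product article',       category: 'product', user_id: 'u3', score: 4.0 },
  { title: 'My own article',        category: 'other',   user_id: 'u1', score: 0.7 },
  { title: 'My product article',    category: 'product', user_id: 'u1', score: 0.2 }
]

# Sort by priority descending, then by relevance score descending.
sorted = hits.sort_by { |hit| [-priority(hit, 'u1'), -hit[:score]] }
titles = sorted.map { |hit| hit[:title] }
```

For user `u1` the own articles end up on top despite their low scores, followed by the product article and then the best-scoring unrelated one.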

NOTE 1: In the index mapping, the user_id and category fields must be defined as keyword, while title is text. A keyword field is stored as a single unanalyzed value, which is more memory-efficient and saves the time spent on analyzing and indexing individual words. We use UUIDs for user_id. Analyzed as text, a UUID would be split into an array of words that would not match the full value later in the script.
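A minimal sketch of such a mapping as a Ruby hash, assuming only the three fields mentioned in this article. How it is sent to Elasticsearch depends on the client you use; typically it would go under the `mappings` key when the index is created.

```ruby
# Sketch of the field mappings discussed above; a real index
# would most likely contain more fields than these three.
article_mappings = {
  properties: {
    title:    { type: 'text' },    # analyzed, used for fuzzy and prefix matching
    user_id:  { type: 'keyword' }, # stored as one exact value, compared in the script
    category: { type: 'keyword' }  # stored as one exact value, compared in the script
  }
}
```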

NOTE 2: Don’t inline parameter values into the script source, pass them through params. Elastic compiles scripts and caches the result for efficiency. If it sees that it has to compile a lot of different scripts quickly, bad things may happen. Read more in the Elasticsearch documentation: How to use scripts.
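The difference is easy to show by counting distinct script sources. In this sketch both helpers and the shortened one-line script are hypothetical; the point is only that the inlined version produces a new source for every user, while the parameterized one stays constant.

```ruby
# Anti-pattern: the id is baked into the source, so every user produces
# a different script that has to be compiled separately.
def inlined_script(user_id)
  { lang: 'painless', source: "doc['user_id'].value == '#{user_id}' ? 1 : 0" }
end

# Preferred: the source never changes, only the params do.
def parameterized_script(user_id)
  {
    lang: 'painless',
    source: "doc['user_id'].value == params.user_id ? 1 : 0",
    params: { user_id: user_id }
  }
end

user_ids = %w[u1 u2 u3]

# Number of distinct sources Elasticsearch would have to compile.
inlined_sources       = user_ids.map { |id| inlined_script(id)[:source] }.uniq.size
parameterized_sources = user_ids.map { |id| parameterized_script(id)[:source] }.uniq.size
```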


