Elasticsearch – Ignore special characters in query with pattern replace filter and custom analyzer

Using Elasticsearch 5, we had a field, a driver's license number, whose values could include special characters and inconsistent letter case because users entered them with limited validation. For example, these are hypothetical values:

  • CA-123-456-789
  • WI.12345.6789
  • tx123456789
  • az-123-xyz-456

In our application, end users need to search by that field. A business requirement was that users should not have to enter special characters such as hyphens and periods to get the record back. So for the first example above, the user should be able to type any of these values and see that record:

  • CA-123-456-789 (an exact match)
  • CA123456789 (no special chars)
  • ca123456789  (lower-case letters and no special chars)
  • Ca.123.456-789 (mixed case letters and mixed special chars)

Our approach was to write a custom analyzer that strips special characters and lowercases the value, index the field with that analyzer, and then query against that field.
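The intended effect of such an analyzer can be approximated in plain Python. This is only a sketch of the normalization, not Elasticsearch itself, and the function name normalize is made up for illustration:

```python
import re

# Same character class as the pattern replace character filter used in this post.
NON_ALPHANUMERIC = re.compile(r"[^A-Za-z0-9]")


def normalize(value: str) -> str:
    """Approximate the analyzer: strip non-alphanumerics, then lowercase."""
    return NON_ALPHANUMERIC.sub("", value).lower()


# The four inputs the user should be able to type for the first example.
queries = ["CA-123-456-789", "CA123456789", "ca123456789", "Ca.123.456-789"]
normalized = {normalize(q) for q in queries}
print(normalized)  # {'ca123456789'}
```

All four inputs collapse to the same value, which is why they can all match the same indexed record.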

Step 1: Create a pattern replace character filter and custom analyzer

In the index settings, we defined a pattern replace character filter that removes every non-alphanumeric character:

"char_filter": {
    "specialCharactersFilter": {
        "pattern": "[^A-Za-z0-9]",
        "type": "pattern_replace",
        "replacement": ""
    }
}

Then we used that filter to create a custom analyzer that we named “alphanumericStringAnalyzer” on the index:

"analyzer": {
    "alphanumericStringAnalyzer": {
        "filter": ["lowercase"],
        "char_filter": [
            "specialCharactersFilter"
        ],
        "type": "custom",
        "tokenizer": "standard"
    }
}
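Both fragments belong under "settings" → "analysis" in the body used to create the index. A minimal sketch of that nesting, written as a Python dict and printed as JSON; it only builds the body and does not send it to a cluster:

```python
import json

# Index-creation body: the char filter and the analyzer sit side by side
# under settings.analysis.
index_settings = {
    "settings": {
        "analysis": {
            "char_filter": {
                "specialCharactersFilter": {
                    "type": "pattern_replace",
                    "pattern": "[^A-Za-z0-9]",
                    "replacement": "",
                }
            },
            "analyzer": {
                "alphanumericStringAnalyzer": {
                    "type": "custom",
                    # Char filters run first, on the raw string.
                    "char_filter": ["specialCharactersFilter"],
                    "tokenizer": "standard",
                    # Token filters run last; written here as a list.
                    "filter": ["lowercase"],
                }
            },
        }
    }
}

print(json.dumps(index_settings, indent=2))
```

The order matters: the character filter strips hyphens and periods before the tokenizer sees the text, so a value like CA-123-456-789 reaches the tokenizer as CA123456789.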

Step 2: Define field mapping using the custom analyzer

The next step was to define a field mapping with a sub-field, “alphanumeric”, that uses the new “alphanumericStringAnalyzer” analyzer, alongside a “raw” keyword sub-field:

"driversLicenseNumber": {
    "type": "text",
    "fields": {
        "alphanumeric": {
            "type": "text",
            "analyzer": "alphanumericStringAnalyzer"
        },
        "raw": {
            "type": "keyword"
        }
    }
}
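For context, in Elasticsearch 5 a field mapping like this sits under "mappings", then a mapping type, then "properties". A minimal sketch of that placement follows; the type name licenseHolder is hypothetical, since the post does not say which type we used:

```python
import json

# "licenseHolder" is a hypothetical mapping type name, not taken from the post.
MAPPING_TYPE = "licenseHolder"

mappings = {
    "mappings": {
        MAPPING_TYPE: {
            "properties": {
                "driversLicenseNumber": {
                    "type": "text",
                    "fields": {
                        # Analyzed with the custom analyzer from Step 1.
                        "alphanumeric": {
                            "type": "text",
                            "analyzer": "alphanumericStringAnalyzer",
                        },
                        # Untouched value, kept as a keyword.
                        "raw": {"type": "keyword"},
                    },
                }
            }
        }
    }
}

# Sub-fields are addressed as <field>.<sub-field> in queries.
field = mappings["mappings"][MAPPING_TYPE]["properties"]["driversLicenseNumber"]
sub_fields = sorted("driversLicenseNumber." + name for name in field["fields"])
print(sub_fields)
print(json.dumps(mappings, indent=2))
```

The same source value is indexed three ways: with the default analyzer on the parent field, with the custom analyzer on driversLicenseNumber.alphanumeric, and unanalyzed on driversLicenseNumber.raw.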

Step 3: Run query against new field

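In our case, this match query sits in the “should” clause of a bool query. A match query analyzes its input with the analyzer of the field it targets, so the query text is stripped and lowercased the same way as the indexed value: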

{
    "match" : {
        "driversLicenseNumber.alphanumeric" : {
            "query" : "Ca.123.456-789",
            "operator" : "OR",
            "boost" : 10.0
        }
    }
}
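A minimal sketch of how that clause might be wrapped in a full search body. The helper name build_license_query is made up for illustration, and our real bool query contained other clauses that are not shown here:

```python
import json


def build_license_query(user_input: str) -> dict:
    """Build a search body whose bool query has the match clause in "should"."""
    return {
        "query": {
            "bool": {
                "should": [
                    {
                        "match": {
                            "driversLicenseNumber.alphanumeric": {
                                # Passed through as typed; Elasticsearch
                                # analyzes it at search time.
                                "query": user_input,
                                "operator": "OR",
                                "boost": 10.0,
                            }
                        }
                    }
                ]
            }
        }
    }


body = build_license_query("Ca.123.456-789")
print(json.dumps(body, indent=2))
```

The application does not need to clean up the user's input before building the query, because the analyzer on the field does that work.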

 

About stevewall123

I am a Lead Software Engineer in Minneapolis working for Thomson Reuters. I am currently working on projects using Java, JavaScript, Spring, Elasticsearch, Hazelcast, Liquibase and Tomcat. Previously, I used C#, GWT, Grails, Groovy, JMS and JBoss Drools Guvnor. In the past I have worked on projects using J2EE, Swing, Webwork, Hibernate, Spring, Spring-WS, JMS, JUnit and Ant.