7x performance improvement for my dead slow SQL Server + Django pyodbc queries

In one of my Django projects, I'm connecting to a SQL Server database with django-pyodbc-azure. I noticed that query performance was incredibly slow in a lot of places.

For a simple query that just selected 50 rows, it would take something like 11 seconds to complete. That was after making sure all of the relevant columns were indexed. At first I thought that maybe Django REST Framework performance was a lot worse than I remembered. Pagination, maybe? But digging in, it became clear that it was just the query execution.
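One quick way to separate framework overhead from query time is to run the raw SQL through Django's cursor, skipping DRF entirely. A minimal sketch of that check (not exactly what I ran, but the same idea, using the example query from below):

import time

from django.db import connection

with connection.cursor() as cursor:
    start = time.perf_counter()
    # Same shape as the slow query below; Django's cursor uses %s placeholders
    cursor.execute("SELECT TOP 50 * FROM Customer WHERE email LIKE %s", ['%you@gmail.com%'])
    rows = cursor.fetchall()
print(f"{time.perf_counter() - start:.2f}s for {len(rows)} rows")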

Alright, so here's an example query:

SELECT TOP 50 * FROM Customer WHERE email LIKE '%you@gmail.com%'  

The email itself was actually parameterized in Django/pyodbc, so it would be sent more like the following pseudocode:

params = ('you@gmail.com',)
# pyodbc uses the qmark paramstyle, so the placeholder is ?
query = "SELECT TOP 50 * FROM Customer WHERE email LIKE '%' + ? + '%'"
rows = cursor.execute(query, params).fetchall()
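For context, this is just what you get from a plain contains lookup in the ORM, something like the following (assuming a Customer model for the table above):

from myapp.models import Customer  # hypothetical path; wherever the model lives

# __contains renders as LIKE '%...%' with the value parameterized
Customer.objects.filter(email__contains='you@gmail.com')[:50]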

The strange part was that if I hardcoded the parameter into the query, like the first version above, performance was fine. So it had to be something about how the parameters were sent.

And it was! A Google search led me to some suspicious-looking unicode talk in the pyodbc wiki. My database columns use a latin1-ish encoding (VARCHAR), while pyodbc sends string parameters as unicode (NVARCHAR) by default. Since NVARCHAR has higher data type precedence, SQL Server implicitly converts the column side of the comparison, which kills the performance of the indexes. This article pretty much gives the fix. Essentially:

Check your SQL Server collation using:

SELECT SERVERPROPERTY('Collation')

If it is something like "SQL_Latin1_General_CP1_CI_AS", you may try:

cnxn.setdecoding(pyodbc.SQL_CHAR, encoding='latin1')
cnxn.setencoding(encoding='latin1')
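Applied to a bare pyodbc connection, that looks something like this (the connection string is just illustrative):

import pyodbc

cnxn = pyodbc.connect(
    'DRIVER={ODBC Driver 17 for SQL Server};'
    'SERVER=myserver;DATABASE=mydb;UID=me;PWD=secret'
)
# Decode CHAR/VARCHAR data as latin1, and send str parameters as latin1
cnxn.setdecoding(pyodbc.SQL_CHAR, encoding='latin1')
cnxn.setencoding(encoding='latin1')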

And that was it, really. But how do you apply that to django-pyodbc-azure? I ended up overriding the database engine. Django expects the engine to live in its own package, with the wrapper defined in a base.py module (so here, utils/db_engine/__init__.py and utils/db_engine/base.py).

# utils/db_engine/base.py

import pyodbc
from sql_server.pyodbc.base import DatabaseWrapper as PyodbcDatabaseWrapper


# Inspired by https://github.com/michiya/django-pyodbc-azure/issues/160

class DatabaseWrapper(PyodbcDatabaseWrapper):

    def get_new_connection(self, conn_params):
        """
        Need to do this because pyodbc sends query parameters as unicode by default,
        whereas our server is latin1.

        Refs:
        [1]: https://github.com/mkleehammer/pyodbc/wiki/Unicode
        [2]: https://github.com/mkleehammer/pyodbc/issues/376
        [3]: https://github.com/mkleehammer/pyodbc/issues/112
        """
        connection = super().get_new_connection(conn_params)
        connection.setdecoding(pyodbc.SQL_CHAR, encoding='latin1')
        connection.setencoding(encoding='latin1')

        return connection

And then change your settings to point ENGINE at the new package:

# settings.py

DATABASES = {  
    'default': {
        'ENGINE': 'utils.db_engine',
        ...
    }
}
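To double-check that the override actually took, you can inspect the connection class from python manage.py shell:

from django.db import connections

# Should be the DatabaseWrapper from utils/db_engine/base.py
print(connections['default'].__class__)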

And that was pretty good: it brought lots of queries down to milliseconds. But a few of them were still disturbingly slow, particularly the ones with few results. Some of those still took about 3 seconds.

Can you guess what the issue is?

Let's look at the query again:

SELECT TOP 50 * FROM Customer WHERE email LIKE '%you@gmail.com%'  

The leading wildcard also kills the index: SQL Server can't seek to a prefix, so the query becomes a scan instead of a seek, which isn't nearly as good. That also explains why the low-result queries were the slowest: with TOP 50, a scan can stop early once it finds 50 matches, but when there are few or no matches it has to scan everything. There aren't a lot of great options for this in SQL Server. You can create a full-text index on the fields where you'll be doing these kinds of searches. In my case, I decided it was probably fine to just do a startswith query and drop the leading wildcard. In Django REST Framework, this was pretty easy: in my viewset, I just added a caret to the beginning of each of my search_fields.

from rest_framework import filters, viewsets

class CustomerViewSet(viewsets.ReadOnlyModelViewSet):
    # The '^' prefix makes SearchFilter emit istartswith (prefix) lookups
    filter_backends = [filters.SearchFilter]
    search_fields = ['^email', '^first_name', '^last_name']
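The caret maps to DRF's istartswith lookup, so the generated predicate becomes roughly LIKE 'you@gmail.com%' with no leading wildcard, and the index can seek again. (Note that search_fields only has an effect when SearchFilter is among the filter backends, hence the filter_backends line.)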

And then I was happy.