Turbocharging Django: A Deep Dive into Optimization Strategies

Whenever we're building something, a lot of questions come to mind. One that comes to my mind quite often is: 'What if my database ends up with a huge number of entries? What would happen to the APIs that fetch data from it? Would they take ages to filter and return the data? How do I handle all of this?' I'm sure plenty of developers out there run into the same questions while building something. So today, let's tackle them. We'll discuss some basic strategies we can leverage to optimise the performance of our Django applications.

Performance optimisation is the process of improving the speed, responsiveness, efficiency, and overall resource utilisation of a software application, so that it can handle a large number of requests and still provide a responsive user experience. It can include reducing the number of database queries, using efficient query techniques, caching, and so on. So, let's see how we can bring this not-so-complex definition to life in our favourite Django! Let's go Django!

Let's quickly start a Django project.

python -m venv env  # Using a virtual environment is good practice, peeps
source env/bin/activate  # On Windows: env\Scripts\activate
pip install django Pillow  # Pillow is needed to handle image files (ImageField)
django-admin startproject djangooptimisation
cd djangooptimisation
python manage.py startapp home

Now, we'll add our new and shiny app named 'home' to INSTALLED_APPS in the settings.py file:

INSTALLED_APPS = [
    ...
    'home',
]

Now, let's quickly create some models in the home app's models.py file.

# home/models.py
from django.db import models

class Artist(models.Model):
    name = models.CharField(max_length=100)
    bio = models.TextField()

    def __str__(self):
        return self.name

class Album(models.Model):
    title = models.CharField(max_length=200)
    artist = models.ForeignKey(Artist, on_delete=models.CASCADE)
    genre = models.CharField(max_length=50)
    release_date = models.DateField()
    cover_image = models.ImageField(upload_to='album_covers/')

    def __str__(self):
        return self.title

Let's migrate the changes to the database.

python manage.py makemigrations
python manage.py migrate

So, we have two models, Artist and Album, where Album has a ForeignKey to Artist. Now, let's say we have 50,000 artists in our database, which is indeed a huge number. How long is it going to take to retrieve some of them? Hmm, I can't guess, can you? Why not ingest 50,000 artists into the database and see how it goes? We'll write a simple test case to do exactly that and measure the time taken.

# home/tests.py
from django.test import TestCase
from .models import Artist
import datetime

class ArtistTestCase(TestCase):
    def setUp(self):
        start_time = datetime.datetime.now()
        artists = []
        for i in range(50000):
            artists.append(Artist(name=f'Artist {i}', bio=f'Artist {i} bio goes like this'))
        Artist.objects.bulk_create(artists)
        end_time = datetime.datetime.now()
        print(f'Ingesting 50000 Artists took us: {end_time - start_time} without any optimisation')

    def test_get_artists(self):
        start_time = datetime.datetime.now()
        for i in range(12000, 14500):
            artist = Artist.objects.get(name=f'Artist {i}')
        end_time = datetime.datetime.now()
        print(f'Fetching 2500 Artists took us: {end_time - start_time} without any optimisation')

Now, we'll run the tests

python manage.py test

So, here's what I got: ingesting 50,000 artists took 1.065 seconds, and fetching 2,500 of them took 9.008 seconds. The whole test ran in 10.079 seconds. The ingestion time is kind of fine, but the filtering and retrieval time is not at all good. Imagine we had 1,000,000 entries; our APIs would go crazy, man. It'd feel like waiting in a government office to get some documents. We need to optimise this, right? So, the first technique we'll be using is Database Indexing.

Database Indexing: Imagine you have a large book with many pages, and you need to find information about a specific topic. Without an index, you would have to manually flip through each page, reading every line until you find the relevant information. This process can be time-consuming and inefficient, especially if the book is extensive.

Now, consider the book's index, which lists keywords along with the page numbers where those keywords appear. The index allows you to quickly locate the relevant pages related to a particular topic without scanning the entire book. Similarly, in a database, an index works as a structured reference to the data, enabling the database engine to locate specific rows efficiently.

So, database indexing is a technique used to optimize the performance of database queries by providing a faster way to look up and retrieve data from database tables. It involves creating an index data structure, which is a separate structure that holds a subset of the data from the main table, organized in a way that allows for efficient search and retrieval operations. This index structure serves as a roadmap to quickly locate specific rows in the table, thereby reducing the time it takes to execute queries. Without an index, a database has to perform a full table scan, examining every row, to find the requested data. With an index, the database engine can use the index structure to locate the relevant rows directly, significantly speeding up the data retrieval process.

Time to bring this theory to code. Django makes database indexing really simple: a tiny change in the code has a huge impact. We'll just add db_index=True to the name field of the Artist model and measure the performance again.

from django.db import models

class Artist(models.Model):
    name = models.CharField(max_length=100, db_index=True)
    bio = models.TextField()

    def __str__(self):
        return self.name

Then we generate and apply the migration for the new index:

python manage.py makemigrations
python manage.py migrate
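
By the way, db_index=True is the shorthand; if you'd rather declare indexes explicitly (say, to name them, or to cover multiple fields together), Django also supports a Meta.indexes list. Here's a minimal sketch of the equivalent declaration:

from django.db import models

class Artist(models.Model):
    name = models.CharField(max_length=100)
    bio = models.TextField()

    class Meta:
        # Named single-column index on `name`; multi-field indexes
        # just take more names in `fields`.
        indexes = [
            models.Index(fields=['name'], name='artist_name_idx'),
        ]

    def __str__(self):
        return self.name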

Now, let's run our tests again and see if there's any change

python manage.py test

Damn, I'm surprised. The fetch time has dropped to 0.79 seconds. Really! This is super-awesome. We can calculate the performance improvement with the formula (|New time - Old time| / Old time) * 100%, which in our case gives (9.008 - 0.79) / 9.008 * 100%, roughly a 91.2% improvement. That is damn awesome, but that's not all.

The second strategy that we'll use is Query Optimisation.

Query Optimisation: So, here, we'll simply try to optimise the database queries we write. How? Let's see the methods.

  • Using select_related: This is primarily used when we're dealing with models linked by a ForeignKey (or OneToOne) relation. In our case, Album has a many-to-one relationship with Artist. So, imagine we want to fetch some Albums along with their respective Artist names. We have multiple ways to do that, but select_related is the optimal one and improves the performance. Let's see how to use it and how it improves things. For this, we'll write some more test cases and examine the performance of each case.

      from django.test import TestCase
      from .models import Artist, Album
      import datetime
    
      class AlbumTestCase(TestCase):
          def setUp(self):
              start_time = datetime.datetime.now()
              albums = []
              for i in range(50000):
                  artist = Artist.objects.create(name=f'Artist {i}', bio=f'Artist {i} bio')
                  albums.append(Album(title=f'Album {i}', artist=artist, genre=f'Genre {i}', release_date=datetime.date.today(), cover_image=f'album_covers/album_{i}.jpg'))
              Album.objects.bulk_create(albums)
              end_time = datetime.datetime.now()
              print(f'Ingesting 50000 Albums took us: {end_time - start_time}')
    
          def test_get_albums_without_select_related(self):
              start_time = datetime.datetime.now()
              for i in range(12000, 14500):
                  album = Album.objects.get(title=f'Album {i}')
                  album_artist = album.artist.name
              end_time = datetime.datetime.now()
              print(f'Fetching 2500 Albums and their Artist name without using select related took us: {end_time - start_time}')
    
          def test_get_albums_with_select_related(self):
              start_time = datetime.datetime.now()
              for i in range(12000, 14500):
                  album = Album.objects.select_related('artist').get(title=f'Album {i}')
                  album_artist = album.artist.name
              end_time = datetime.datetime.now()
              print(f'Fetching 2500 Albums and their Artist name using select related took us: {end_time - start_time}')
    

    Now, let's run the tests and see the result.

    Clearly, the query using select_related took less time than the one that didn't use it. What's the logic behind it? It's pretty simple: the select_related version performs a SQL JOIN and fetches the related Artist in the same query, so album_artist = album.artist.name doesn't hit the database again. The fewer the database hits, the lower the execution time. A 4.8% optimisation might not feel big here, but as the database grows, it starts to look like a significant change. Let's move ahead.
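
    If you'd rather verify the database-hit claim than just time it, Django's TestCase offers assertNumQueries. Here's a minimal sketch of an extra test method we could add to the AlbumTestCase above:

      # Sketch: count the queries instead of timing them.
      def test_album_query_counts(self):
          album_id = Album.objects.first().id

          # Without select_related: one query for the album, plus one
          # more when album.artist is touched.
          with self.assertNumQueries(2):
              album = Album.objects.get(id=album_id)
              _ = album.artist.name

          # With select_related: a single JOINed query does it all.
          with self.assertNumQueries(1):
              album = Album.objects.select_related('artist').get(id=album_id)
              _ = album.artist.name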

  • Using prefetch_related: While select_related() works for ForeignKey and OneToOne relations, prefetch_related() is meant for ManyToMany fields and reverse ForeignKey relations. Assuming our Album model had a ManyToMany field called songs, prefetch_related() would be used like this (a fuller sketch follows after this list):

      albums = Album.objects.prefetch_related('songs').all()
    

    This way, we don't hit the database once per Album to get its songs; Django fetches all the related songs in a single extra query and stitches them together in Python.

  • Retrieve only what you need: The simplest yet effective technique. Say I need to fetch an Album, but the requirement is just the title and the artist. The other fields aren't needed by that specific query, so why fetch them? There's no complex logic behind it: the less data we move around, the less time it takes to send it across services, right? But how do we do it? Simple.

      albums_data = Album.objects.only("title", "artist")
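
As promised in the prefetch_related item above, here's a fuller sketch of that example. Keep in mind the Song model and its albums field are hypothetical; our project doesn't have them, and they're only here to illustrate the pattern:

from django.db import models
from .models import Album

# Hypothetical model, purely for illustration; not part of our project.
class Song(models.Model):
    title = models.CharField(max_length=200)
    albums = models.ManyToManyField(Album, related_name='songs')

# Two queries in total: one for the albums, one for all their songs.
albums = Album.objects.prefetch_related('songs')
for album in albums:
    # No extra query here; the songs were prefetched above.
    titles = [song.title for song in album.songs.all()]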
    

There are other simple techniques too, like using exists() (or, on Django 4.0+, contains()) when you only need to check for the presence of rows, instead of evaluating a whole queryset inside a condition.
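
Here's a quick sketch of what I mean, using our Album model (and assuming the table isn't empty); exists() issues a lightweight EXISTS query instead of pulling rows into memory just to test truthiness:

from .models import Album

# Wasteful: evaluates the whole queryset just to check if it's non-empty.
if Album.objects.filter(genre='Rock'):
    print('found some rock albums')

# Better: a cheap SELECT EXISTS query, no rows transferred.
if Album.objects.filter(genre='Rock').exists():
    print('found some rock albums')

# Django 4.0+ adds contains() for membership checks without
# evaluating the whole queryset.
some_album = Album.objects.first()
if Album.objects.filter(genre='Rock').contains(some_album):
    print('that album is rock')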

The next most important strategy we'll be using is something you've probably heard about a lot but haven't used that much. You might have studied it in Operating Systems or System Design, but now is the time to put it into code. Yes, I'm talking about Caching. So, let's go over some theory to understand how it works.

Caching:

Let's discuss how exactly it works. We all know that a cache is a small yet very fast memory layer used to store and retrieve data more quickly than fetching it from its original source, in our case the database. Django provides its own cache framework, which lets us store various kinds of data: database query results, HTML fragments, or the results of any other computationally expensive or time-consuming operation. Django ships several cache backends: a database cache, which creates a cache table in the database itself; an in-memory cache, which keeps cached data in the server's memory; and a file-based cache, which stores cached data in files on the filesystem and is useful when you want caching without relying on a separate server or external service. We'll be using an in-memory cache. Let's see how to set it up.

In our settings.py file, we'll add the following code.

CACHES = {
    'default': {
        'BACKEND': 'django.core.cache.backends.locmem.LocMemCache',
        'LOCATION': 'musaibs-cache',  # Use any unique name, my name has a copyright!
    }
}
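
For reference, since we mentioned them above, here's roughly what the other two backends look like; treat the LOCATION values as placeholders:

# Database cache: also run `python manage.py createcachetable` once.
CACHES = {
    'default': {
        'BACKEND': 'django.core.cache.backends.db.DatabaseCache',
        'LOCATION': 'my_cache_table',
    }
}

# File-based cache: LOCATION must be a directory the server can write to.
CACHES = {
    'default': {
        'BACKEND': 'django.core.cache.backends.filebased.FileBasedCache',
        'LOCATION': '/var/tmp/django_cache',
    }
}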

Now, we'll write a simple helper function to get our Albums, assuming we have a lot of them stored in the database.

from django.core.cache import cache
from .models import Album

def get_albums():
    # Try the cache first.
    cached_data = cache.get('albums')
    if cached_data is not None:
        return cached_data
    # Cache miss: fetch from the database and cache the result for an hour.
    data = Album.objects.all()
    cache.set('albums', data, 3600)

    return data

Let's discuss how it actually works.

  1. cached_data = cache.get('albums'):

    • This line attempts to retrieve data from the cache using the key 'albums'. If the data is found in the cache, it is assigned to the variable cached_data. If not, cached_data will be None.
  2. if cached_data is not None::

    • This condition checks whether data was successfully retrieved from the cache. If cached_data is not None, it means the data was found in the cache, and the function returns the cached data.
  3. data = Album.objects.all():

    • If the data is not found in the cache, this line fetches all records from the Album model using Album.objects.all().
  4. cache.set('albums', data, 3600):

    • After fetching the data from the database, the function stores it in the cache with the key 'albums' and a timeout of 3600 seconds (1 hour). This means that the data will be kept in the cache for 1 hour before it is considered stale.
  5. return data:

    • Finally, the function returns the fetched data, whether it was retrieved from the cache or fetched from the database.

This way, we can reduce our database hits by caching our data for some time. If the client requests the same data again, we can serve it without a database lookup, which significantly improves the performance.
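
One caveat before we wrap up: with a plain timeout, clients can see stale data for up to an hour after an Album changes. A common fix, and just one option among several, is to invalidate the cache key whenever an Album is saved or deleted, for example with signals. Here's a minimal sketch:

# home/signals.py (sketch)
from django.core.cache import cache
from django.db.models.signals import post_save, post_delete
from django.dispatch import receiver

from .models import Album

@receiver(post_save, sender=Album)
@receiver(post_delete, sender=Album)
def invalidate_albums_cache(sender, **kwargs):
    # Drop the cached list; the next get_albums() call repopulates it.
    cache.delete('albums')

Remember to import this module (for example, in the app's AppConfig.ready()) so the receivers actually get registered.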

In conclusion, optimizing a Django application is a multifaceted journey that involves thoughtful design, efficient querying, and strategic use of caching mechanisms. By delving into the intricacies of database indexing and harnessing the power of caching, we've explored ways to significantly enhance the performance of our application.

Remember, performance optimization is an ongoing process, and the strategies discussed here are just the tip of the iceberg. Continuously monitoring and analysing the application's performance, staying updated on Django best practices, and adapting to evolving requirements will ensure a robust and efficient web application.

Keep Learning, Keep Sharing, Keep Djangoing!

Go Django!

Peace, Dot!
