med ∩ ml

The unreasonable developer experience of SQLite

In this post, I want to share some reasons why using SQLite turns into a pretty convenient developer experience.

This is a “live” post that may be updated in the future with more contents.

Moving and sharing data, backups

Sharing SQLite data is straightforward, just rsync the file to a new VM or upload it to any file storage service. When you want to share data from other databases, it usually requires generating an SQL dump of the data. Then, as a consumer of the data, you’re responsible for starting a DB server and loading the dump. You also need to know if the DB versions are compatible. The SQLite format has been stable for years, so older versions of SQLite can open databases generated with newer versions.

As an example, let’s say the machine learning team in your company needs to query a big chunk of the database in a way that will require lots of resources. In a “normal” database deployment, you would need to:

  1. Give the team access to the database (usually with written instructions to access it once the necessary permissions have been given)

  2. Ensure their queries don’t consume computing resources which may affect users/clients

  3. Avoid changing the data/schema while the experiments are running With an SQLite database, the steps are:

  4. Attach a volume (e.g: EBS) to a VM

  5. Copy the database file to the new volume/VM Now the machine-learning team can do whatever they want, and their work won’t affect live users. Again, this can be done with other server databases, but the complexity and work required to make such a process easy and stable is not trivial.

User-defined functions

SQLite is an in-process, you can define and redefine functions. But the main advantage lies on the fact that functions are yours and won’t affect other connections. This enables separating the computation work from the actual storage. Two different connections to the same SQLite database can register functions with the same names and different code without affecting each other. And if you want a function that can be shared across connections/programming languages, you can just write a loadable extension.

Some databases like Postgres can create temporary functions, which may be used with similar purposes (I haven’t tried this). Non-temporary functions means modifying the database. This means database migrations, possible function names’ conflicts, etc. None of those are problems with SQLite, I can just query a database without paying much attention to what other people querying the same file are doing.

Example use case. Your main application has a clean_text() function that you create on the database. Later on, you need to run some analytics queries, but those require cleaning text differently. You can have a different clean_text() function per connection without conflicts:

import sqlite3

conn_main = sqlite3.connect("test.db", isolation_level=None)
conn_analytics = sqlite3.connect("test.db", isolation_level=None)

def clean_text(t):
    return t.strip().lower()

def clean_text_analytics(t):
    t = t.replace("[", "")
    t = t.replace("]", "")
    return t.strip().lower()

conn_main.create_function("clean_text", 1, clean_text)
conn_analytics.create_function("clean_text", 1, clean_text_analytics)

print(conn_main.execute("select clean_text('Foo Bar [AsD] ')").fetchall())
# [('foo bar [asd]',)]
print(conn_analytics.execute("select clean_text('Foo Bar [AsD] ')").fetchall())
# [('foo bar asd',)]

Development environment

Thanks to docker-compose and others, reproducing a production environment is a bit easier if you don’t have many external services (e.g: a single Postgres database). But nothing beats (*) having a single file you can use to hold test data and which doesn’t require running a Docker container to do any work or run some tests.

You can configure your application to generate some test data on every startup if it’s running in “development mode”. However, don’t forget to be extra careful not to drop and replace your production DB with the sample data because you forgot to turn “developer mode” off. Doing this also makes tests slower. Having a database file, already holding the test data and ready to run tests, is way faster (*). You can also forget about “developer mode”. During development, just point the connection to the “test” database, if you forget to change it in production you know you’re not dropping data.

(*): No timings are provided here, just observations based on experience. Maybe running docker-compose up and generating test data only takes 3 minutes. But now the system is more complex, it will break more often, and it will require more maintenance time, documentation, knowledge sharing, etc.