
Using SQLite for logging and ad-hoc profiling

I was recently reading the documentation of nnethercote/counts, in which Nicholas explains a few pain points of using a table as a profiling output. But using SQLite as the “table” can relieve some of those pains. I frequently use SQLite for debugging, and I wanted to share some techniques that have been helpful (and relate them to some of the arguments mentioned in nnethercote/counts).

Defining the table and exposing it to multiple modules

These two points are not a problem with SQLite: the “table” is a file outside your program, and any module can write to it or read from it.

Deciding where to initialize the table

Anywhere is ok. I usually have a Python object which generates the file and schema on initialization. Adding IF NOT EXISTS to the necessary SQL statements will also let us initialize the object in multiple places without issues.

Deciding where to print the contents of the table

This is one of the main reasons why I like logging to an SQLite database. As someone who loves “print debugging”, I found that sometimes I wanted to re-check some print messages, only to find that I either had to scroll through thousands of output lines or remember to redirect the output to a file (which means re-running the program if I forget to do it).

But the biggest advantage is doing complex post-processing on the output. I am comfortable using and mixing jq/awk/rg/sort/uniq, etc., but once I start chaining more than 4 different commands, I prefer running SQL queries.

The database file will be available after our program has finished, so we can print or explore the logs at any time.
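
For example, here is a minimal sketch of exploring the logs after the program has exited (assuming the database is called logs.db and uses the single-column log table defined in the next section):

import json
import sqlite3

# Open the existing log database and print the most recent messages.
conn = sqlite3.connect("logs.db")
for (msg,) in conn.execute("select msg from log order by rowid desc limit 10"):
    print(json.loads(msg))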

Writing the custom table (collection and printing)

We just need a single relation with a JSON column:

CREATE TABLE IF NOT EXISTS log(msg TEXT);

Depending on the shape of your log data, you may want to have a JSON key to differentiate events, e.g. {"event_type": "foobar", ...}. If that’s the case, you can create an index on the SQL expression used to extract that key:

CREATE INDEX IF NOT EXISTS evnt_type_idx ON log(json_extract(msg, '$.event_type'));

If you don’t need to query the database while your program is running, you can create the index after the program finishes. This will reduce the overhead of writing to the table since the index doesn’t have to be updated.
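
For example, a small sketch of building the index as a separate step once the run is done (the index name and expression match the statement above):

import sqlite3

# Run this after the logging program has exited, so inserts during the run
# don't pay the cost of keeping the index up to date.
conn = sqlite3.connect("logs.db")
conn.execute(
    "CREATE INDEX IF NOT EXISTS evnt_type_idx "
    "ON log(json_extract(msg, '$.event_type'))"
)
conn.close()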

There are other alternatives:

  1. Use VIRTUAL generated columns:
CREATE TABLE IF NOT EXISTS log(
    msg TEXT,
    event_type TEXT GENERATED ALWAYS AS (json_extract(msg, '$.event_type')) VIRTUAL
);

VIRTUAL columns don’t take space on disk and are only generated when reading from the table. We can also add the virtual column and an index on it after our program has finished running:

ALTER TABLE log ADD COLUMN event_type TEXT
    GENERATED ALWAYS AS (json_extract(msg, '$.event_type')) VIRTUAL;

CREATE INDEX evnt_type_idx ON log(event_type);
  • Note: Generated column support was added with SQLite version 3.31.0 (2020-01-22)
  2. Use STORED generated columns:
CREATE TABLE IF NOT EXISTS log(
    msg TEXT,
    event_type TEXT GENERATED ALWAYS AS (json_extract(msg, '$.event_type')) STORED
);

STORED columns take space on disk and, unlike VIRTUAL columns, can’t be added to an existing table with ALTER TABLE.

  3. Explicitly create a column for the event type.
CREATE TABLE IF NOT EXISTS log(
    event_type TEXT,
    msg TEXT
);

CREATE INDEX evnt_type_idx ON log(event_type);

This option requires that our logging/profiling function fills in the event_type column correctly, as sketched below.
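
A minimal sketch of what the insert could look like in that case (the helper class shown later in the post only uses the single-column table, so this variant is hypothetical):

import json
import sqlite3

conn = sqlite3.connect("logs.db", isolation_level=None)

def log(data):
    # The event_type column has to be filled in explicitly by the caller.
    conn.execute(
        "insert into log(event_type, msg) values (?, ?)",
        (data["event_type"], json.dumps(data)),
    )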

I think the simplest approach is using a table with one column and then using VIEWs. We can add indexes on expressions if we want to speed up queries.

Storing messages

We will use the following list of messages as an example. The event types are:

  • hashmap_num_keys: We intend to check the number of keys in a hash map that is modified inside a loop. We log the number of keys and the current loop iteration.
  • func_integrate: A function for which we are measuring the execution time.
  • func_alloc: A function for which we want to measure the execution time as well as allocated bytes.
{"event_type": "hashmap_num_keys", "keys": 20, "loop_iteration": 4}
{"event_type": "hashmap_num_keys", "keys": 22, "loop_iteration": 5}
{"event_type": "hashmap_num_keys", "keys": 25, "loop_iteration": 6}
{"event_type": "hashmap_num_keys", "keys": 18, "loop_iteration": 7}
{"event_type": "func_integrate", "time": 0.542}
{"event_type": "func_integrate", "time": 0.342}
{"event_type": "func_integrate", "time": 0.642}
{"event_type": "func_integrate", "time": 0.2342}
{"event_type": "func_alloc", "time": 0.00342, "n_bytes": 5433}
{"event_type": "func_alloc", "time": 0.01342, "n_bytes": 54343}
{"event_type": "func_alloc", "time": 0.02342, "n_bytes": 74343}
{"event_type": "func_alloc", "time": 0.00042, "n_bytes": 743}

Creating views

create view if not exists hashmap_keys as
select
    json_extract(msg, '$.keys') as keys,
    json_extract(msg, '$.loop_iteration') as loop_iteration
from log
where json_extract(msg, '$.event_type') = 'hashmap_num_keys';

select * from hashmap_keys where loop_iteration > 5;

Extending SQLite with Python

SQLite is a very flexible database (I will write more about this in the future): we can register custom functions to do more complex tasks. For example, we may want to calculate the standard deviation. We can create an aggregate function, or just aggregate in SQLite (as a JSON array) and then disaggregate and compute in Python (hacky, but simple):

from statistics import stdev
import sqlite3
import json

conn = sqlite3.connect("logs.db", isolation_level=None, check_same_thread=False)

conn.executescript(
    """
create view if not exists func_call as
  select json_extract(msg, '$.time') as time
from log
where json_extract(msg, '$.event_type') = 'func_integrate';
"""
)

def calc_stdev(values):
    return stdev(json.loads(values))

conn.create_function(name="stdev", narg=1, func=calc_stdev, deterministic=True)

row = conn.execute("select stdev(json_group_array(time)) from func_call").fetchone()

print(row[0])

Now we can make our queries as complex as we want:

from statistics import stdev
import sqlite3
import json

conn = sqlite3.connect("logs.db", isolation_level=None, check_same_thread=False)

conn.executescript(
    """
create view if not exists func_call as
  select json_extract(msg, '$.time') as time
from log
where json_extract(msg, '$.event_type') = 'func_integrate';
"""
)

def calc_stdev(values):
    return stdev(json.loads(values))

conn.create_function(name="stdev", narg=1, func=calc_stdev, deterministic=True)

time_stdev = conn.execute(
    "select stdev(json_group_array(time)) from func_call"
).fetchone()[0]
avg_time = conn.execute("select avg(time) from func_call").fetchone()[0]
min_time = conn.execute("select min(time) from func_call").fetchone()[0]
max_time = conn.execute("select max(time) from func_call").fetchone()[0]
num_samples = conn.execute("select count(time) from func_call").fetchone()[0]

print(
    f"{avg_time:.5f} seconds +- stdev({stdev:.5f}) min({min_time}) max({max_time}). Samples: {num_samples}"
)

# Prints:
# 0.44005 seconds +- stdev(0.18544) min(0.2342) max(0.642). Samples: 4
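
Frequency counts per event type are also a single GROUP BY away; a sketch, reusing the same database:

import sqlite3

conn = sqlite3.connect("logs.db")

# How many times was each event type logged? Most frequent first.
rows = conn.execute(
    """
    select json_extract(msg, '$.event_type') as event_type, count(*) as n
    from log
    group by event_type
    order by n desc
    """
).fetchall()

for event_type, n in rows:
    print(f"{n:>4}  {event_type}")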

Complexity vs. a print function

My approach is to have a self-contained Python class that I can just copy-paste anywhere in my application.

class L:

    import sqlite3

    conn = sqlite3.connect("logs.db", isolation_level=None, check_same_thread=False)
    conn.executescript(
        """
PRAGMA journal_mode=wal;    -- write-ahead logging: readers don't block the writer
PRAGMA synchronous=normal;  -- fewer fsyncs; acceptable durability for a debug log
PRAGMA temp_store=memory;   -- keep temporary tables/indices in memory
CREATE TABLE IF NOT EXISTS log(msg TEXT);
"""
    )

    @classmethod
    def log(cls, data):
        import json

        cls.conn.execute("insert into log(msg) values (?)", (json.dumps(data),))

Now we can use our class like this:

L.log(dict(event_type="hashmap_num_keys", keys=20, loop_iteration=4))
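
For the timing events, the call site can look something like this (integrate is a hypothetical stand-in for whatever function is being measured; perf_counter is just one way to take the timestamps):

import time

def integrate():
    # Stand-in for the real function being profiled (hypothetical).
    sum(x * x for x in range(100_000))

start = time.perf_counter()
integrate()
L.log(dict(event_type="func_integrate", time=time.perf_counter() - start))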

Keeping the print function

You can keep using print statements and load the JSON lines into an SQLite database afterwards:

print(
    json.dumps(dict(event_type="hashmap_num_keys", keys=20, loop_iteration=4)),
    file=open("logs.jsonl", mode="a"),
    flush=True,
)

Now we can load the jsonl (JSON-lines) file into SQLite and do the same operations we were doing before.

import sqlite3

conn = sqlite3.connect("logs.db", isolation_level=None)
conn.execute("CREATE TABLE IF NOT EXISTS log(msg TEXT);")

conn.execute("BEGIN;")

with open("logs.jsonl") as f:
    for line in f:
        conn.execute("insert into log(msg) values (?)", (line.strip(),))

conn.execute("COMMIT;")
conn.close()

Alternatives and update

I published this post on 12th March, 2023, but I wrote it a year before that. Since then, I’ve also started using DuckDB for this same purpose. The built-in JSON extension makes it easy to write JSON lines to a file and query the file directly, and it’s also really fast. This is convenient because I can keep a plain-text representation of the logs. On the other hand, using SQLite lets you write the logs from different threads/processes without having to worry about locks and data races.
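
For reference, a rough sketch of querying the JSON-lines file directly with DuckDB (assuming a recent version, where the JSON reader ships by default):

import duckdb

# Query the JSON-lines log file directly, no loading step required.
avg_time = duckdb.sql(
    "select avg(time) from read_json_auto('logs.jsonl') "
    "where event_type = 'func_integrate'"
).fetchone()[0]
print(avg_time)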