rand[om]

med ∩ ml

DuckDB as a vector database

DuckDB 0.10.0 has been released, bringing some new array functions. You can use them to turn DuckDB into a vector database. (To follow along, make sure you have at least version 0.10.0 of DuckDB installed; you can check with python3 -c "import duckdb; print(duckdb.__version__)")

Insert data from numpy arrays

import duckdb
import numpy as np

conn = duckdb.connect(database=":memory:", read_only=False)

conn.execute("CREATE TABLE data (id INTEGER, vector FLOAT4[768]);")


def normalize(vec: np.ndarray) -> np.ndarray:
    return vec / np.linalg.norm(vec)


# insert 3000 random vectors, each vector has 768 dimensions
for i in range(3000):
    vector = np.random.rand(768).astype("float32")
    # normalize the vector to unit length before inserting
    norm_vector = normalize(vector)
    # insert the normalized vector, not the raw one
    conn.execute("INSERT INTO data VALUES (?, ?)", (i, norm_vector))
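
Why normalize at all? Cosine similarity ignores vector length, so normalizing does not change the ranking. For unit-length vectors, however, cosine similarity reduces to a plain dot product. The following is a minimal NumPy-only check of that relationship; the `cosine_similarity` helper, the seed, and the variable names are illustrative assumptions, not part of the code above.

```python
import numpy as np


def normalize(vec: np.ndarray) -> np.ndarray:
    # scale the vector to unit (L2) length
    return vec / np.linalg.norm(vec)


def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    # hypothetical helper: textbook definition of cosine similarity
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


rng = np.random.default_rng(0)  # fixed seed, only for reproducibility
a = rng.random(768).astype("float32")
b = rng.random(768).astype("float32")
a_unit, b_unit = normalize(a), normalize(b)

# the normalized vector has length 1 (up to float32 rounding)
print(np.linalg.norm(a_unit))

# for unit vectors, the dot product matches the cosine similarity
print(np.isclose(cosine_similarity(a, b), np.dot(a_unit, b_unit), atol=1e-4))
```

DuckDB 0.10.0 also ships array_inner_product, which should produce the same ranking as array_cosine_similarity when every stored vector and the query vector are normalized first.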

Some notes about datatypes:

  • DuckDB FLOAT4: single-precision floating-point number (4 bytes)
  • NumPy float32: single-precision (32-bit) floating-point number type

So a DuckDB FLOAT4 corresponds to numpy.float32.
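
This is also why the vectors are cast with .astype("float32"): np.random.rand returns float64 values by default. A quick sketch to confirm the sizes involved (the variable names `vec64` and `vec32` are just for illustration):

```python
import numpy as np

vec64 = np.random.rand(768)        # NumPy's default dtype is float64
vec32 = vec64.astype("float32")    # matches DuckDB's 4-byte FLOAT4

print(vec64.dtype, vec64.itemsize)  # float64 8
print(vec32.dtype, vec32.itemsize)  # float32 4
print(vec32.nbytes)                 # 3072 bytes per 768-dimensional vector
```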

Query the data

We need to explicitly cast the numpy array to FLOAT4[768].


query_vector = np.random.rand(768).astype("float32")
norm_query_vector = normalize(query_vector)

result = conn.execute(
    "select id, array_cosine_similarity(vector, $query_vector::FLOAT4[768]) as cosim from data order by cosim desc limit 10",
    {"query_vector": norm_query_vector},
).fetchdf()

This will give us a table of the top 10 most similar vectors, sorted by cosine similarity.

print(result.to_string())
     id     cosim
0   903  0.791789
1  1158  0.787012
2   822  0.783325
3  1266  0.782855
4  1178  0.782746
5  1728  0.782400
6   473  0.781506
7  2116  0.780277
8  2938  0.780260
9   388  0.780072
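
To sanity-check a result like this, the same kind of top-10 ranking can be reproduced with a brute-force search in NumPy. This is only a sketch under assumptions: the vectors live in a 2-D array called `vectors` instead of being read back from DuckDB, `top_k_cosine` is a hypothetical helper, and the seed is arbitrary, so the ids and scores will differ from the table above.

```python
import numpy as np


def top_k_cosine(vectors: np.ndarray, query: np.ndarray, k: int = 10):
    """Return (row indices, similarities) of the k rows most similar to query."""
    row_norms = np.linalg.norm(vectors, axis=1)
    # cosine similarity of every row against the query in one matrix product
    sims = vectors @ query / (row_norms * np.linalg.norm(query))
    top = np.argsort(-sims)[:k]  # indices sorted by descending similarity
    return top, sims[top]


rng = np.random.default_rng(42)  # fixed seed, only for reproducibility
vectors = rng.random((3000, 768)).astype("float32")
query = rng.random(768).astype("float32")

ids, sims = top_k_cosine(vectors, query)
for i, s in zip(ids, sims):
    print(f"{int(i):5d}  {float(s):.6f}")
```

Because np.random.rand draws every component from [0, 1), all vectors point into the same positive region of the space, and any two of them have a cosine similarity of roughly 0.75. That is why the top scores are so close together; real embeddings would usually spread out much more.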