DuckDB as a vector database
DuckDB version 0.10.0 has been released, bringing some new array functions. You can use them to turn DuckDB into a vector database. To follow along, make sure you have at least version 0.10.0 of DuckDB installed; you can check with:
python3 -c "import duckdb; print(duckdb.__version__)"
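If you would rather check the version from inside a script, the comparison has to be numeric: compared as plain strings, "0.9.2" sorts after "0.10.0". The following is a minimal sketch of such a check; `version_tuple` and `meets_minimum` are hypothetical helper names made up for this example, not part of DuckDB.

```python
import re


def version_tuple(version: str) -> tuple:
    """Turn a version string such as '0.10.0' into a tuple such as (0, 10, 0).

    Only the leading digits of each dot-separated piece are used, so a
    suffix like '-dev5' is ignored. Missing components are padded with 0.
    """
    parts = []
    for piece in version.split("."):
        match = re.match(r"\d+", piece)
        if match is None:
            break
        parts.append(int(match.group()))
    while len(parts) < 3:
        parts.append(0)
    return tuple(parts[:3])


def meets_minimum(version: str, minimum: tuple = (0, 10, 0)) -> bool:
    """Return True if `version` is at least `minimum` (numeric comparison)."""
    return version_tuple(version) >= minimum


# String comparison gets this wrong, tuple comparison does not.
print("0.9.2" >= "0.10.0")     # True, which is misleading
print(meets_minimum("0.9.2"))  # False
print(meets_minimum("0.10.0")) # True
```

In a real script you would pass `duckdb.__version__` to `meets_minimum` and stop early if it returns False.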
Insert data from numpy arrays
import duckdb
import numpy as np

conn = duckdb.connect(database=":memory:", read_only=False)
conn.execute("CREATE TABLE data (id INTEGER, vector FLOAT4[768]);")

def normalize(vec: np.ndarray) -> np.ndarray:
    # scale the vector to unit length (L2 norm of 1)
    return vec / np.linalg.norm(vec)

# insert 3000 random vectors, each vector has 768 dimensions
for i in range(3000):
    vector = np.random.rand(768).astype("float32")
    # normalize vector before inserting
    norm_vector = normalize(vector)
    conn.execute("INSERT INTO data VALUES (?, ?)", (i, norm_vector))
Some notes about datatypes:
- DuckDB FLOAT4: single precision floating-point number (4 bytes) (source)
- Numpy float32: single-precision floating-point number type / 32-bit-precision floating-point number type (source)
So a DuckDB FLOAT4 corresponds to a numpy.float32.
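To make the "4 bytes" point concrete, here is a small sketch that uses only Python's standard `struct` module (chosen here to avoid extra dependencies; it is not what the post itself uses). It assumes the `"<f"` format code, which is the IEEE 754 single-precision layout, and the helper name `to_float32` is made up for this example.

```python
import struct

# "<f" is a little-endian IEEE 754 single-precision float: 4 bytes.
FLOAT32_SIZE = struct.calcsize("<f")


def to_float32(x: float) -> float:
    """Round a 64-bit Python float to the nearest 32-bit float value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


# 0.1 is not exactly representable in single precision, so a round trip
# through 4 bytes changes the value slightly.
rounded = to_float32(0.1)
print(rounded == 0.1)             # False
print(abs(rounded - 0.1) < 1e-7)  # True, the error is tiny

# Raw storage for one 768-dimensional vector, and for 3000 of them.
bytes_per_vector = 768 * FLOAT32_SIZE
print(bytes_per_vector)           # 3072
print(3000 * bytes_per_vector)    # 9216000
```

The raw payload for the table in this post is therefore roughly 9 MB, before any overhead the database adds.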
Query the data
We need to explicitly cast the numpy array to FLOAT4[768].
query_vector = np.random.rand(768).astype("float32")
norm_query_vector = normalize(query_vector)

result = conn.execute(
    "select id, array_cosine_similarity(vector, $query_vector::FLOAT4[768]) as cosim from data order by cosim desc limit 10",
    {"query_vector": norm_query_vector},
).fetchdf()
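This will give us a table of the 10 most similar vectors, sorted by cosine similarity in descending order. Because the vectors are random, the ids and scores you see will differ from the ones below.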
print(result.to_string())
     id     cosim
0   903  0.791789
1  1158  0.787012
2   822  0.783325
3  1266  0.782855
4  1178  0.782746
5  1728  0.782400
6   473  0.781506
7  2116  0.780277
8  2938  0.780260
9   388  0.780072
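For intuition about the cosim column, here is a plain-Python sketch of the standard cosine similarity formula, dot(a, b) / (|a| * |b|). It assumes array_cosine_similarity follows this textbook definition; it is an illustration, not DuckDB's actual implementation, and the function names are made up for this example.

```python
import math


def cosine_similarity(a, b):
    """Standard cosine similarity: dot(a, b) / (|a| * |b|)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def normalize(vec):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


a = [1.0, 2.0, 3.0]
b = [3.0, 1.0, 2.0]

# Scaling a vector does not change the result, so normalizing first
# gives the same similarity as using the raw vectors.
raw = cosine_similarity(a, b)
unit = cosine_similarity(normalize(a), normalize(b))
print(math.isclose(raw, unit))  # True

# For unit-length vectors the formula reduces to a plain dot product.
dot_of_units = sum(x * y for x, y in zip(normalize(a), normalize(b)))
print(math.isclose(raw, dot_of_units))  # True
```

This also suggests why every score in the table sits close to 0.78: all components of the random vectors lie between 0 and 1, so any two of them point in broadly the same direction, and the expected similarity for such vectors works out to roughly 0.75.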