rand[om]

rand[om]

med ∩ ml

Understanding Bloom Filters by building one

What is a Bloom Filter?

A Bloom filter is a probabilistic data structure. It tells you if an element is in a set or not in a very fast and memory-efficient way. A Bloom filter can tell if an element is not in the set (“being 100% sure”) or that it may be in the set, but not “being 100% sure”. It only has 2 operations: add, to add an element, and query, to check if an element exists in the set or not.

Why are they useful?

Bloom filters are some kind of data compressing. With them, you can check if an element exists in a set, but using a lot less space than if you had to store the complete dataset. For example, your browser may check if the URL you are trying to visit is a malicious one. You can use a bloom filter instead of storing all the URLs, when a user tries to visit a page, it checks if the URL is inside the set using a Bloom filter (set of malicious URLs). If the answer is False (remember, that’s with 100% probability), we can safely let the user visit that page.

How does a Bloom filter work?

First, we need a Bit Vector. That Bit Vector will hold the information about or data. We must decide how big/long the vector will be (we will explain more about this later). We will call the vector size: length, and we will use a length = 10.

Initially, every element in our vector is set to 0.

vector = [0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
12345678900000000000

To store data inside the bit vector we need to preprocess it. We need to pass the data through X number hash functions, those functions will return integers. The hash functions we use must be as fast as possible. Some common hash functions that we can use are MurMurHash or xxHash. Both have Python bindings that we can use for our implementation.

In this example we will use X = 3 hash functions.

Note: Since this post is about explaining how Bloom Filters work, I will use the hash functions included in the Python standard library, that way it will be easier to play with the examples without having to install anything. Those hash functions are not as fast as we need, but are good to code the examples.

Anyway, here’s a short snippet if you want to use the fast functions mentioned before:

# MurMurHash, xxHash and the built-in `hash()` function in Python.
import murmurhash
import xxhash

data = "this is a message with some data"

hash_value1 = xxhash.xxh32_intdigest(data)
hash_value2 = murmurhash.hash(data)
hash_value3 = xxhash.xxh64_intdigest(data)


hash_value1, hash_value2, hash_value3
(773648684, 2013144662, 11112426289249758424)

Python comes with the hashlib module. We can use some functions found in there.

from hashlib import sha1, md5, sha256
data = "this is a message with some data"

To get an integer from our hash function we need to get the hash digest (the bytes that the hash function returns), and convert those to integer. We can do it like this.

def sha1_int(data: bytes):
    return int(sha1(data).hexdigest(), base=16)


def md5_int(data: bytes):
    return int(md5(data).hexdigest(), base=16)


def sha256_int(data: bytes):
    return int(sha256(data).hexdigest(), base=16)

Note: the functions in the hashlib module need the input data as bytes, not string. We can use data.encode() to do the conversion.

hash_value1 = sha1_int(data.encode())
hash_value2 = md5_int(data.encode())
hash_value3 = sha256_int(data.encode())  # built-in python hash() function


hash_value1, hash_value2, hash_value3
(745951622377321548266496312092986415888481579279,
 42524137162357419646727732885791601796,
 27838692698195609171558989627466754261699905594803567920934777340154558766262)

We have 3 integers, the outputs of each one of the hash functions. However, for a Bloom filter to be efficient, the outputs of the hash functions should be evenly distributed between 0 and our vector length (10). To achieve that, we take the modulo (m) of each integer. The modulo value (m) is equal to the vector length. In our example:

length = 10
modulo = 10

We will get 3 new integers, each one will represent the index in our vector.

vector_lenght = 10
modulo = 10

hash_values = [hash_value1, hash_value2, hash_value3]

indexes = [h % modulo for h in hash_values]

print(indexes)
[9, 6, 2]

Now that we have 3 vector indexes, we need to set those indexes to 1.

for index in indexes:
    vector[index] = 1


print(vector)
[0, 0, 1, 0, 0, 0, 1, 0, 0, 1]
12345678900100010010

Using the Bloom filter

Now that we have added an element to our filter, we can query it to check if an element exists or not. To do that, we need some data as an input, then we will repeat the same operations we did before to get 3 indexes. If all the indexes in our vector are == 1, then we can say that maybe the element is in the set, however, as soon as one of those indexes is == 0, we can say that there’s a 100% probability that the element is not in the set. To make things easier we will write a function to get some indexes.

from typing import List
def get_idx_from_data(data: bytes) -> List[int]:
    hash_value1 = sha1_int(data)
    hash_value2 = md5_int(data)
    hash_value3 = sha256_int(data)

    hash_values = [hash_value1, hash_value2, hash_value3]

    indexes = [h % modulo for h in hash_values]

    return indexes

Let’s say we want to check it the element "this is a message with some data" is in our set. First, we get the indexes, and then we check if all of them are == 1 in our vector.

element = "this is a message with some data"

# remember our function needs bytes, not strings
indexes = get_idx_from_data(element.encode())

print(indexes)

maybe_found = True

for index in indexes:
    if vector[index] == 0:
        maybe_found = False

if maybe_found is False:
    print("The element is definitely NOT in the set (100% probability)")
else:
    print("The element may be in the set")
[9, 6, 2]
The element may be in the set
element = "another message"

# remember our function needs bytes, not strings
indexes = get_idx_from_data(element.encode())

print(indexes)

maybe_found = True

for index in indexes:
    if vector[index] == 0:
        maybe_found = False

if maybe_found is False:
    print("The element is definitely NOT in the set (100% probability)")
else:
    print("The element may be in the set")
[5, 4, 5]
The element is definitely NOT in the set (100% probability)

Improving the code

Now that we know how a bloom filter works, we can build a better abstraction around it.

from typing import Callable


class BloomFilter:
    def __init__(self, vector_length: int, hash_functions: List[Callable]):
        self.vector_length = vector_length
        self.modulo = vector_length
        self.vector = [0 for _ in range(self.vector_length)]
        self.hash_functions = hash_functions

    def add(self, data: bytes):
        hash_values = []

        for func in self.hash_functions:
            hash_values.append(func(data))

        indexes = [h % self.modulo for h in hash_values]

        for idx in indexes:
            self.vector[idx] = 1

        return indexes

    def query(self, data: bytes) -> bool:
        """
        This method returns `False` if the element
        is not in the set. Otherwise it returns
        `True` (the element may or may not be in the set)
        """

        hash_values = []

        for func in self.hash_functions:
            hash_values.append(func(data))

        indexes = [h % self.modulo for h in hash_values]

        for idx in indexes:
            if self.vector[idx] == 0:
                return False

        return True
bloom = BloomFilter(vector_length=10, hash_functions=[sha1_int, md5_int, sha256_int])
bloom.add("hello bloom".encode())
[0, 8, 7]
bloom.query("hello bloom".encode())
True
bloom.query("hello cuckoo".encode())
False
bloom.vector
[1, 0, 0, 0, 0, 0, 0, 1, 1, 0]

Fantastic! Now we have our bloom filter and we can build multiple filters one using different vector lenghts of a different number/type of hash functions. But we are not done. We have seen that when the bloom.query() method returns True the element may or may not be in the set. If the element was not in the set, it is called a false positive (the filter returned True but the element is not in the set). Bloom filters never have false negatives, if it says the element is not in the set, it’s definitely not there. But the opposite is not true. Another problem we have is that if our vector is very short, at some point, all the elements will be 1. If we use a lot of hash functions, our bloom filter will fill up faster too.

We can calculate the probability of false positives using this formula:

$$P_{F P}=\left(1-P_{0}\right)^{k}=\left(1-\left(1-\frac{1}{m}\right)^{k n}\right)^{k}$$

We can implement the formula in Python

def proba_false_positive(set_size, vector_length, num_hash_functions):

    n = set_size
    m = vector_length
    k = num_hash_functions

    return (1 - (1 - (1 / m)) ** (k * n)) ** k

With that formula, for a set of size n, we can derive the number k of hashing functions and the length m of a vector to use if we want to have a false positive probability \(P_{F P}\).

$$k = -\frac{\ln{P_{FP}}}{\ln{2}}$$

$$m = -\frac{n\ln{P_{FP}}}{(\ln2)^2}$$

For the explanation of how to get these formulas, this blog post explains it with great detail.

We can also implement those last 2 formulas in Python.

import math
def optimal_values(set_size, proba_false_positives):
    n = set_size
    ln_pfp = math.log(proba_false_positives)
    ln_2 = math.log(2)

    optimal_num_hash_funcs = -(ln_pfp / ln_2)
    optimal_vector_legth = -(n * ln_pfp / ln_2**2)

    # we will use math.ceil to approximate the values to the next integer
    return math.ceil(optimal_num_hash_funcs), math.ceil(optimal_vector_legth)
set_size=4_000_000
proba_false_positives=0.001

k, m = optimal_values(set_size=set_size, proba_false_positives=proba_false_positive)

print(f"""
For a set with {set_size} elements, if we want a probability of false positives = {proba_false_positives},
we need a bloom filter with a vector length = {m} and use {k} hash functions.
""".strip())
For a set with 4000000 elements, if we want a probability of false positives = 0.001,
we need a bloom filter with a vector length = 57510351 and use 10 hash functions.

Now we know how to make our bloom filters a bit better. We can add some parameters to our BloomFilter class so that we can tune it to our requirements. But we still have another problem. We can now calculate how many hash functions we need, but so far we have manually given our bloom filter a list of hash functions. We need some way to generate as many hash functions as we need. In this paper the authors prove that we can generate any number of hash functions \(g_{i}(x)\) by just using 2 functions.

$$g_{i}(x)=h_{1}(x)+i h_{2}(x)$$

Where \(i\) goes from 0 to k - 1 (remember k is the number of hash functions that we need to generate).

In Python it would be something like:

def generate_hash_func(n: int) -> Callable:
    h1 = sha1_int
    h2 = md5_int

    def new_func(data: bytes):

        return h1(data) + (n * h2(data))

    return new_func

With everything we have learned, we can now improve our implementation. The updated class will need a set_size (how much data we want to store) and a proba_false_positives (what’s the maximum probability of false positives that we can tolerate). Then we will use the new functions to set the appropriate vector length and generate the necessary hash functions.

Note: I will also add a new method called .add_faster() that is slightly more optimized. The method .add() is easier to understand.

class BloomFilter:
    def __init__(self, set_size, proba_false_positives):

        self.optimal_hash_funcs, self.optimal_vec_len = self.calculate_optim_values(
            set_size, proba_false_positives
        )

        self.vector = [0 for _ in range(self.optimal_vec_len)]

        self.vector_length = len(self.vector)
        self.modulo = self.vector_length

        self.h1 = sha1_int
        self.h2 = md5_int

    def calculate_optim_values(self, set_size, proba_false_positives):
        n = set_size
        ln_pfp = math.log(proba_false_positives)
        ln_2 = math.log(2)

        optimal_num_hash_funcs = -(ln_pfp / ln_2)
        optimal_vector_legth = -(n * ln_pfp / ln_2 ** 2)

        # we will use math.ceil to approximate the values to the next integer
        return math.ceil(optimal_num_hash_funcs), math.ceil(optimal_vector_legth)

    def generate_hash_func(self, n: int) -> Callable:
        """
        Generate a new hash function based
        on the 2 root ones: self.h1() and self.h2()
        """

        def new_func(data: bytes):

            return self.h1(data) + (n * self.h2(data))

        return new_func

    def add(self, data: bytes):
        hash_values = []

        for n in range(self.optimal_hash_funcs):

            hash_func = generate_hash_func(n)

            hash_values.append(hash_func(data))

        indexes = [h % self.modulo for h in hash_values]

        for idx in indexes:
            self.vector[idx] = 1

        return indexes

    def add_faster(self, data: bytes) -> None:

        for n in range(self.optimal_hash_funcs):

            # I will do the modulo operation here to make
            # it a bit faster (avoid generating another list)

            idx = generate_hash_func(n)(data) % self.modulo

            self.vector[idx] = 1

        return

    def query(self, data: bytes) -> bool:
        """
        This method returns `False` if the element
        is not in the set. Otherwise it returns
        `True` (the element may or may not be in the set)
        """

        hash_values = []

        for n in range(self.optimal_hash_funcs):

            hash_func = generate_hash_func(n)

            idx = hash_func(data) % self.modulo
            if self.vector[idx] == 0:
                return False

        return True
bloom = BloomFilter(set_size=1_000_000, proba_false_positives=0.001)
%%timeit
bloom.add("hello bloom".encode())
50.4 µs ± 5.06 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
%%timeit
bloom.add_faster("hello bloom".encode())
54.2 µs ± 4.78 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
%%timeit
bloom.query("hello bloom".encode())
62.1 µs ± 18 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
%%timeit
bloom.query("hello cuckoo".encode())
5.66 µs ± 645 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)