Running regexes on memory-mapped files

Here are some benchmarks I ran to compare the speed of running a list of regexes on all the Markdown files in my Obsidian folder (210 when I wrote this). The benchmark compares running the regexes on memory-mapped files versus reading each file's contents into memory as bytes and running the regexes on those bytes. I’m using hyperfine to run the benchmarks.

Considerations

When we memory-map a file, we work with bytes. Python can run regexes over those bytes too, but the pattern itself has to be bytes, so each pattern string has to be encoded with .encode() before it is compiled.
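As a minimal sketch of what that looks like (the pattern `TAG_PAT` and the sample `data` are made up for illustration and are not part of the benchmark), the pattern is written as a normal string and encoded, and the captured groups come back as bytes that need decoding:

```python
import re

# Hypothetical pattern, written as str and encoded so it can match bytes.
TAG_PAT = re.compile(r"tag:\s+([\w-]+)\n".encode())

# Hypothetical file contents, as the bytes a memory map would expose.
data = "title: Hello\ntag: python-mmap\n".encode("utf-8")

match = TAG_PAT.search(data)
tag = match.group(1).decode("utf-8")  # groups are bytes, so decode them

# Mixing a str pattern with bytes data raises TypeError.
try:
    re.search(r"tag:", data)
    mixed_ok = True
except TypeError:
    mixed_ok = False
```

One thing to be aware of: in bytes patterns, classes such as \w and \s only match ASCII characters, so non-ASCII letters in a filename or slug will not match \w.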

Here’s a Python context manager to memory-map a file:

import mmap
from contextlib import contextmanager
from pathlib import Path


@contextmanager
def mmap_file(file: Path):
    # mmap only needs the file descriptor, so open the file in binary mode
    with open(file, mode="rb") as f:
        with mmap.mmap(f.fileno(), length=0, access=mmap.ACCESS_READ) as mmap_obj:
            yield mmap_obj
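Here is a hedged usage sketch for that context manager. The temporary files and the slug pattern are invented for the demo, and the copy of `mmap_file` below opens the file in binary mode, which is all mmap needs. It also shows one caveat: mmap raises ValueError for an empty file, so empty notes need to be skipped or handled separately.

```python
import mmap
import re
import tempfile
from contextlib import contextmanager
from pathlib import Path


@contextmanager
def mmap_file(file: Path):
    # Copy of the context manager, opened in binary mode.
    with open(file, mode="rb") as f:
        with mmap.mmap(f.fileno(), length=0, access=mmap.ACCESS_READ) as mmap_obj:
            yield mmap_obj


# Hypothetical notes written to a temporary directory for the demo.
with tempfile.TemporaryDirectory() as tmp:
    note = Path(tmp) / "note.md"
    note.write_bytes(b"---\nslug: my-post\n---\nBody text\n")

    with mmap_file(note) as mm:
        match = re.search(rb"slug:\s+([\w-]+)\n", mm)
        slug = match.group(1).decode("utf-8")
        span = match.span(1)   # byte offsets into the file
        first_bytes = mm[:3]   # slicing a map returns bytes

    empty = Path(tmp) / "empty.md"
    empty.touch()
    try:
        with mmap_file(empty):
            pass
        empty_ok = True
    except ValueError:
        empty_ok = False  # mmap refuses to map a zero-length file
```

The match offsets are byte offsets, not character offsets, which matters once a file contains multi-byte UTF-8 characters.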

Commented code

import mmap
import re
import sys
from contextlib import contextmanager
from pathlib import Path

# Contains 210 markdown files
OBSIDIAN_FOLDER = Path("<MY iCloud OBSIDIAN FOLDER>")

# A bunch of regex patterns
SLUG_PAT = re.compile(
    r"""
    slug:              # word `slug:`
    \s+                # one or more spaces
    ([a-zA-Z0-9\-\_]+) # CAPTURE. Alphanumeric characters, `-` and `_`
    \n                 # newline
    """.encode(),
    re.VERBOSE,
)

WIKILINK_MEDIA_PAT = re.compile(
    r"""
    \!               # Exclamation mark
    \[\[             # Double open square brackets
    ([\w\_\s\-\.]+)  # CAPTURE. Filename: alphanumeric characters, spaces, underscores, hyphens, dots
    \]\]             # double closing square brackets
    """.encode(),
    re.VERBOSE,
)

THUMBNAIL_PAT = re.compile(
    rf"""
    thumbnail:                            # Word `thumbnail:`
    \s+                                   # One or more spaces
    {WIKILINK_MEDIA_PAT.pattern.decode()} # Wikilink media pattern
    \n                                    # newline
    """.encode(),
    re.VERBOSE,
)

FRONTMATTER_PAT = re.compile(r"---(.+?)---".encode(), re.DOTALL)


WIKILINK_DOC_PAT = re.compile(
    r"""
        (?:[^\!])   # NOT an exclamation mark
        \[\[    # double open square brackets
        (       # CAPTURE
          [^\s] # Not a space
          .*?   # Anything, non-greedy
          [^\s] # Not a space
        )
        \]\]    # double closing square brackets
    """.encode(),
    re.VERBOSE,
)


SELF_POST_LINK_PAT = re.compile(
    r"""
    [^\!]               # NOT exclamation mark
    \[                  # Start MD link
    (?P<link_name>.+?)  # link name (named capture group)
    \]                  # close square brackets
    \(                  # link
    [\/]?posts\/        # `posts/` or '/posts/'
    (?P<link_ref>.+?)   # link content (named capture group)
    \)                  # close link
    """.encode(),
    re.VERBOSE,
)

SELF_MEDIA_LINK_PAT = re.compile(
    r"""
    \!                  # exclamation mark
    \[                  # start link name
    (?P<link_name>.*?)  # link name (named capture group)
    \]                  # end link name
    \(                  # start link ref
    [\/]?img\/s\/       # `img/s/` or `/img/s/`
    (?P<link_ref>.+?)   # link ref (named capture group)
    \)                  # end link ref
    """.encode(),
    re.VERBOSE | re.DOTALL,
)


@contextmanager
def mmap_file(file: Path):
    # mmap only needs the file descriptor, so open the file in binary mode
    with open(file, mode="rb") as f:
        with mmap.mmap(f.fileno(), length=0, access=mmap.ACCESS_READ) as mmap_obj:
            yield mmap_obj

# benchmark mmap
def run_mmap():
    for file in OBSIDIAN_FOLDER.rglob("*.md"):
        with mmap_file(file) as mm:
            for pattern in [
                SLUG_PAT,
                WIKILINK_MEDIA_PAT,
                THUMBNAIL_PAT,
                FRONTMATTER_PAT,
                WIKILINK_DOC_PAT,
                SELF_POST_LINK_PAT,
                SELF_MEDIA_LINK_PAT,
            ]:
                for match in pattern.finditer(mm):
                    pass

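# benchmark reading file contents into memory as bytes
# (note: read_bytes() is called inside the pattern loop, so each file is read once per pattern)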
def run_read():
    for file in OBSIDIAN_FOLDER.rglob("*.md"):
        for pattern in [
            SLUG_PAT,
            WIKILINK_MEDIA_PAT,
            THUMBNAIL_PAT,
            FRONTMATTER_PAT,
            WIKILINK_DOC_PAT,
            SELF_POST_LINK_PAT,
            SELF_MEDIA_LINK_PAT,
        ]:
            for match in pattern.finditer(file.read_bytes()):
                pass


if __name__ == "__main__":
    if "mmap" in sys.argv:
        run_mmap()
    elif "read" in sys.argv:
        run_read()
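One thing to keep in mind when reading the numbers: run_read calls file.read_bytes() inside the pattern loop, so each file is read once per pattern, while run_mmap maps each file once. A variant that reads each file once could look like the sketch below. This is not the code that produced the results in this post; the two stand-in patterns, the function names and the tiny temporary vault are assumptions made so the example runs on its own.

```python
import mmap
import re
import tempfile
from pathlib import Path

# Two stand-in patterns; the real script uses the seven patterns defined above.
PATTERNS = [
    re.compile(rb"slug:\s+([\w-]+)\n"),
    re.compile(rb"!\[\[([\w\s.-]+)\]\]"),
]


def count_read_once(folder: Path) -> int:
    """Read each file into memory once, then run every pattern on it."""
    total = 0
    for file in folder.rglob("*.md"):
        data = file.read_bytes()  # one read per file, not one per pattern
        for pattern in PATTERNS:
            total += len(pattern.findall(data))
    return total


def count_mmap(folder: Path) -> int:
    """Memory-map each file once, then run every pattern on the map."""
    total = 0
    for file in folder.rglob("*.md"):
        with open(file, "rb") as f:
            with mmap.mmap(f.fileno(), length=0, access=mmap.ACCESS_READ) as mm:
                for pattern in PATTERNS:
                    total += len(pattern.findall(mm))
    return total


# Tiny made-up vault to check that both variants see the same matches.
with tempfile.TemporaryDirectory() as tmp:
    vault = Path(tmp)
    (vault / "a.md").write_bytes(b"---\nslug: a-post\n---\n![[cat.png]] ![[dog.png]]\n")
    (vault / "b.md").write_bytes(b"---\nslug: b-post\n---\nNo media here\n")
    read_total = count_read_once(vault)
    mmap_total = count_mmap(vault)
```

Counting matches instead of discarding them also makes it easy to confirm that both approaches find the same things before comparing their speed.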

Results

Clearing disk caches before running

Here I’m purging the disk page cache before each timing run, so the OS has not cached the files when the script starts. hyperfine executes the --prepare command before every timing run.

hyperfine --prepare 'sync && sudo purge' 'python3 bench.py mmap' 'python3 bench.py read' --export-markdown "bench-res-nocache.md"

Benchmark 1: python3 bench.py mmap
  Time (mean ± σ):     228.7 ms ±  47.7 ms    [User: 28.7 ms, System: 35.5 ms]
  Range (min … max):   177.7 ms … 350.6 ms    10 runs

Benchmark 2: python3 bench.py read
  Time (mean ± σ):     263.6 ms ±  56.8 ms    [User: 32.0 ms, System: 44.7 ms]
  Range (min … max):   186.1 ms … 371.6 ms    10 runs

Summary
  'python3 bench.py mmap' ran
    1.15 ± 0.35 times faster than 'python3 bench.py read'

| Command | Mean [ms] | Min [ms] | Max [ms] | Relative |
|:---|---:|---:|---:|---:|
| `python3 bench.py mmap` | 228.7 ± 47.7 | 177.7 | 350.6 | 1.00 |
| `python3 bench.py read` | 263.6 ± 56.8 | 186.1 | 371.6 | 1.15 ± 0.35 |

This benchmark varied a lot from one invocation to the next. The mmap method was generally faster, but the margin ranged from 1.24 times faster down to 1.01 times faster, and I once saw the read method come out ahead.

Warming up disk caches before running

Here I’m doing the opposite. With --warmup 3, hyperfine runs each command 3 times before it starts measuring, which gives the OS a chance to cache the file contents.

hyperfine --warmup 3 'python3 bench.py mmap' 'python3 bench.py read' --export-markdown "bench-res-warm.md"

Benchmark 1: python3 bench.py mmap
  Time (mean ± σ):      34.7 ms ±   1.7 ms    [User: 26.3 ms, System: 6.5 ms]
  Range (min … max):    33.2 ms …  43.9 ms    80 runs

  Warning: Statistical outliers were detected. Consider re-running this benchmark on a quiet PC without any interferences from other programs. It might help to use the '--warmup' or '--prepare' options.

Benchmark 2: python3 bench.py read
  Time (mean ± σ):      46.1 ms ±   1.4 ms    [User: 29.2 ms, System: 14.8 ms]
  Range (min … max):    44.5 ms …  50.9 ms    62 runs

Summary
  'python3 bench.py mmap' ran
    1.33 ± 0.08 times faster than 'python3 bench.py read'

| Command | Mean [ms] | Min [ms] | Max [ms] | Relative |
|:---|---:|---:|---:|---:|
| `python3 bench.py mmap` | 34.5 ± 1.9 | 33.1 | 47.3 | 1.00 |
| `python3 bench.py read` | 45.9 ± 1.2 | 44.4 | 49.0 | 1.33 ± 0.08 |

Conclusion

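In these runs mmap was faster: about 1.33 times faster with warm caches, and about 1.15 times faster on average with cold caches, although that second result is within the noise (± 0.35). Memory-mapping the file also has other advantages:

The OS pages the file in on demand, so you won’t run out of memory if the file contents are bigger than your available memory.
If the file is mapped using flags=mmap.MAP_SHARED (the default in Python on Unix), the memory can be shared across processes. That can be a big performance boost if multiple processes are reading the same files at once.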

Extra TIL: ripgrep also uses mmap.

--mmap

Search using memory maps when possible. This is enabled by default when ripgrep
thinks it will be faster.
