Running regexes on memory-mapped files
Here are some benchmarks I ran to compare the speed of running a list of regexes on all the Markdown files in my Obsidian folder (210 files when I wrote this). The benchmark compares running the regexes on memory-mapped files versus reading each file's contents into memory (as bytes) and running the regexes on that. I’m using hyperfine to run the benchmarks.
Considerations
When we memory-map a file, we work with bytes. Python can run regexes over those bytes too, but the pattern itself has to be a bytes object, so each pattern string has to be .encode()’ed for it to work.
Here’s a Python context manager to memory-map a file:
import mmap
from contextlib import contextmanager
from pathlib import Path


@contextmanager
def mmap_file(file: Path):
    with open(file, mode="r", encoding="utf-8") as f:
        with mmap.mmap(f.fileno(), length=0, access=mmap.ACCESS_READ) as mmap_obj:
            yield mmap_obj
Commented code
import mmap
import re
import sys
from contextlib import contextmanager
from pathlib import Path
# Contains 210 markdown files
OBSIDIAN_FOLDER = Path("<MY iCloud OBSIDIAN FOLDER")
# A bunch of regex patterns
SLUG_PAT = re.compile(
r"""
slug: # word `slug:`
\s+ # one or more spaces
([a-zA-Z0-9\-\_]+) # CAPTURE. Any alphanumeric characters, `-` and `_`
\n # newline
""".encode(),
re.VERBOSE,
)
WIKILINK_MEDIA_PAT = re.compile(
r"""
\! # Exclamation mark
\[\[ # Double open square brackets
([\w\_\s\-\.]+) # CAPTURE. Filename: alphanumeric characters, spaces, underscores, hyphens, dots
\]\] # double closing square brackets
""".encode(),
re.VERBOSE,
)
THUMBNAIL_PAT = re.compile(
rf"""
thumbnail: # Word `thumbnail:`
\s+ # One or more spaces
{WIKILINK_MEDIA_PAT.pattern.decode()} # Wikilink media pattern
\n # newline
""".encode(),
re.VERBOSE,
)
FRONTMATTER_PAT = re.compile(r"---(.+?)---".encode(), re.DOTALL)
WIKILINK_DOC_PAT = re.compile(
r"""
(?:[^\!]) # NOT an exclamation mark
\[\[ # double open square brackets
( # CAPTURE
[^\s] # Not a space
.*? # Anything, non-greedy
[^\s] # Not a space
)
\]\] # double closing square brackets
""".encode(),
re.VERBOSE,
)
SELF_POST_LINK_PAT = re.compile(
r"""
[^\!] # NOT exclamation mark
\[ # Start MD link
(?P<link_name>.+?) # link name (named capture group)
\] # close square brackets
\( # link
[\/]?posts\/ # `posts/` or '/posts/'
(?P<link_ref>.+?) # link content (named capture group)
\) # close link
""".encode(),
re.VERBOSE,
)
SELF_MEDIA_LINK_PAT = re.compile(
r"""
\! # exclamation mark
\[ # start link name
(?P<link_name>.*?) # link name (named capture group)
\] # end link name
\( # start link ref
[\/]?img\/s\/ # `img/s/` or '/img/s/'
(?P<link_ref>.+?) # link ref (named capture group)
\) # end link ref
""".encode(),
re.VERBOSE | re.DOTALL,
)
@contextmanager
def mmap_file(file: Path):
    with open(file, mode="r", encoding="utf-8") as f:
        with mmap.mmap(f.fileno(), length=0, access=mmap.ACCESS_READ) as mmap_obj:
            yield mmap_obj


# benchmark mmap
def run_mmap():
    for file in OBSIDIAN_FOLDER.rglob("*.md"):
        with mmap_file(file) as mm:
            for pattern in [
                SLUG_PAT,
                WIKILINK_MEDIA_PAT,
                THUMBNAIL_PAT,
                FRONTMATTER_PAT,
                WIKILINK_DOC_PAT,
                SELF_POST_LINK_PAT,
                SELF_MEDIA_LINK_PAT,
            ]:
                # Only iterate over the matches; nothing is done with them
                for match in pattern.finditer(mm):
                    pass


# benchmark reading the file contents into memory
def run_read():
    for file in OBSIDIAN_FOLDER.rglob("*.md"):
        for pattern in [
            SLUG_PAT,
            WIKILINK_MEDIA_PAT,
            THUMBNAIL_PAT,
            FRONTMATTER_PAT,
            WIKILINK_DOC_PAT,
            SELF_POST_LINK_PAT,
            SELF_MEDIA_LINK_PAT,
        ]:
            # read_bytes() is called here, so the file is read once per pattern
            for match in pattern.finditer(file.read_bytes()):
                pass


if __name__ == "__main__":
    if "mmap" in sys.argv:
        run_mmap()
    elif "read" in sys.argv:
        run_read()
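One thing to keep in mind about the read benchmark above: read_bytes() is called inside the pattern loop, so each file is read once per pattern. A variant that reads each file once might look like the sketch below. The function name run_read_once, its parameters, and the two simplified patterns are all hypothetical, and this variant is not what produced the numbers in this post.

```python
import re
import tempfile
from pathlib import Path


def run_read_once(folder: Path, patterns: list) -> int:
    """Read each Markdown file once, then run every pattern over the same bytes."""
    total = 0
    for file in folder.rglob("*.md"):
        data = file.read_bytes()  # one read per file, reused for all patterns
        for pattern in patterns:
            for _ in pattern.finditer(data):
                total += 1
    return total


# Simplified stand-ins for the patterns in the benchmark script.
demo_patterns = [
    re.compile(rb"slug:\s+([\w-]+)\n"),
    re.compile(rb"!\[\[([\w\s.-]+)\]\]"),
]

with tempfile.TemporaryDirectory() as tmp:
    folder = Path(tmp)
    (folder / "a.md").write_bytes(b"slug: one\n![[img.png]]\n")
    (folder / "b.md").write_bytes(b"slug: two\n")
    total_matches = run_read_once(folder, demo_patterns)

print(total_matches)
```

Whether this changes the comparison is something I'd have to measure; with warm caches the repeated reads are served from the OS page cache, but each one still copies the file into a new bytes object.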
Results
Clearing disk caches before running
Here I’m purging the disk page cache before running each benchmark. This ensures that the files are not already cached by the OS when each script runs. The --prepare command is executed before each timing run.
hyperfine --prepare 'sync && sudo purge' 'python3 bench.py mmap' 'python3 bench.py read' --export-markdown "bench-res-nocache.md"
Benchmark 1: python3 bench.py mmap
Time (mean ± σ): 228.7 ms ± 47.7 ms [User: 28.7 ms, System: 35.5 ms]
Range (min … max): 177.7 ms … 350.6 ms 10 runs
Benchmark 2: python3 bench.py read
Time (mean ± σ): 263.6 ms ± 56.8 ms [User: 32.0 ms, System: 44.7 ms]
Range (min … max): 186.1 ms … 371.6 ms 10 runs
Summary
'python3 bench.py mmap' ran
1.15 ± 0.35 times faster than 'python3 bench.py read'
| Command | Mean [ms] | Min [ms] | Max [ms] | Relative |
|---|---|---|---|---|
| python3 bench.py mmap | 228.7 ± 47.7 | 177.7 | 350.6 | 1.00 |
| python3 bench.py read | 263.6 ± 56.8 | 186.1 | 371.6 | 1.15 ± 0.35 |
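The results of this benchmark varied a lot between runs. Generally, the mmap method was faster, but sometimes it was 1.24 times faster, sometimes 1.01 times faster. I once saw the read method being faster. With a relative speed of 1.15 ± 0.35, the difference here is within the noise.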
Warming up disk caches before running
Here I’m doing the opposite. Before each scenario is measured, the script runs 3 times. This makes sure the OS can cache the file contents.
hyperfine --warmup 3 'python3 bench.py mmap' 'python3 bench.py read' --export-markdown "bench-res-warm.md"
Benchmark 1: python3 bench.py mmap
Time (mean ± σ): 34.7 ms ± 1.7 ms [User: 26.3 ms, System: 6.5 ms]
Range (min … max): 33.2 ms … 43.9 ms 80 runs
Warning: Statistical outliers were detected. Consider re-running this benchmark on a quiet PC without any interferences from other programs. It might help to use the '--warmup' or '--prepare' options.
Benchmark 2: python3 bench.py read
Time (mean ± σ): 46.1 ms ± 1.4 ms [User: 29.2 ms, System: 14.8 ms]
Range (min … max): 44.5 ms … 50.9 ms 62 runs
Summary
'python3 bench.py mmap' ran
1.33 ± 0.08 times faster than 'python3 bench.py read'
| Command | Mean [ms] | Min [ms] | Max [ms] | Relative |
|---|---|---|---|---|
| python3 bench.py mmap | 34.5 ± 1.9 | 33.1 | 47.3 | 1.00 |
| python3 bench.py read | 45.9 ± 1.2 | 44.4 | 49.0 | 1.33 ± 0.08 |
Conclusion
mmap seems faster, most clearly when the OS already has the files cached (about 1.33 times faster in my runs). Memory-mapping the file also has other advantages:
- You won’t run out of memory if the file contents are bigger than your available memory, because the OS pages the file in on demand instead of loading it all at once
- If the file is mapped using flags=mmap.MAP_SHARED (the default in Python on Unix), the memory can be shared across processes. That can be a big performance boost if multiple processes are reading the same files at once.
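As a rough illustration of the first point, here's a sketch of a helper that counts matches in a file through a memory map, so the contents never have to be copied into a Python bytes object. The name count_matches and the generated test file are assumptions for the demo. It also guards against empty files, since mmap raises ValueError when asked to map a file of length zero.

```python
import mmap
import re
import tempfile
from pathlib import Path


def count_matches(path: Path, pattern: re.Pattern) -> int:
    """Count matches of a bytes pattern in a file via a read-only memory map."""
    # An empty file cannot be memory-mapped, so bail out early.
    if path.stat().st_size == 0:
        return 0
    count = 0
    with open(path, "rb") as f:
        with mmap.mmap(f.fileno(), length=0, access=mmap.ACCESS_READ) as mm:
            # finditer walks the mapping; the OS pages data in as it is touched.
            for _ in pattern.finditer(mm):
                count += 1
    return count


slug_pattern = re.compile(rb"slug:\s+([\w-]+)\n")

with tempfile.TemporaryDirectory() as tmp:
    big = Path(tmp) / "big.md"
    big.write_bytes(b"".join(b"slug: post-%d\n" % i for i in range(1000)))
    empty = Path(tmp) / "empty.md"
    empty.write_bytes(b"")

    big_count = count_matches(big, slug_pattern)
    empty_count = count_matches(empty, slug_pattern)

print(big_count, empty_count)
```

The test file here is tiny, so this only shows the shape of the approach, not the memory behaviour on a file that is actually larger than RAM.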
Extra TIL: ripgrep also uses mmap. From its documentation:
--mmap
Search using memory maps when possible. This is enabled by default when ripgrep thinks it will be faster.