rand[om]

med ∩ ml

Peeking and backtracking Python generators

Python generators are mighty, but they lack a couple of useful features. One of them is peeking the next item without consuming the generator. Even better, what if we could peek any number of items? Another feature lacking from generators is rewinding/backtracking. We will implement both of those features in a couple of different ways. Note: if you need a better tested implementation that lets you peek 1 item, check out more_itertools.peekable.

We will do 2 different implementations, one using the itertools.tee function and another one using collections.deque.

itertools.tee

This function can take an iterator as input and will return n copies of the iterator. But now each one of those copies can advance (next()) independently. It’s crucial that we don’t advance the original iterator used as input for the tee function.

The following Python class does:

  • Split the original generator in 2 copies:
    • root_gen, used as the main generator
    • original, used for rewinding the generator
  • The peek function will make a copy of root_gen, we will peek (and consume) only one of the generators. Then we can use the other copy to replace root_gen. Now root_gen is at the same state as before.
  • The peekn(n) function is the same as peek. This method uses the itertools.islice function to take the next n items from an iterator. Then, as before, we replace root_gen with the copy of our generator that we didn’t advance.
  • get_range(). This method will return a tuple with a range of the generator, even if that range was already consumed. This is why we created the original generator. We will use the same technique as before(”backup” the generator using itertools.islice, consume what we need, and finally replace our original generator with our “backup”). Notice that when we created our class, we also stored a self.consumed attribute. This is a number we can increase every time we .consume() a value so that we know at which position we are. We can then use it to get a range of values relative to our current position in the generator. The .consume() method will start generating values from the start. This can be a performance issue if generating the values is expensive. It will advance the iterator until the start of the range we want. Then it will use itertools.islice again to retrieve n items from our generator.
  • .consume() just consumes the root_gen normally and returns the next value.
import itertools
from typing import Generator, Tuple, Optional

IntGenerator = Generator[int, None, None]

class IntStream:
    def __init__(self, gen: IntGenerator):
        self.root_gen, self.original = itertools.tee(gen, 2)
        self.consumed = 0

    def peek(self) -> int:
        current, backup = itertools.tee(self.root_gen, 2)
        try:
            peeked = next(current)
        except StopIteration:
            raise StopIteration(
                "Can't peek more values, the generator has been fully consumed."
            )

        # replace self.root_gen with the generator copy that we didn't
        # consume
        self.root_gen = backup
        return peeked

    def peekn(self, n=1) -> Tuple[int, ...]:
        current, backup = itertools.tee(self.root_gen, 2)
        peeked = tuple(itertools.islice(current, n))
        self.root_gen = backup
        return peeked

def get_range(self, pos_from, n: Optional[int] = None) -> Tuple[int, ...]:
        current, backup = itertools.tee(self.original, 2)
        self.original = backup
        # advance iterator to wanted starting position - 1
				# to do it, we will just advance the iterator ignoring
				# ignoring the returned values
        for value in itertools.islice(current, pos_from - 1):
            continue

        return tuple(current) if not n else tuple(itertools.islice(current, n))

    def consume(self) -> int:
        self.consumed += 1
        return next(self.root_gen)

    def __repr__(self) -> str:
        return f"IntStream(\n{self.peekn(10)}\n...)"

To put it into action. First, we will make an example generator:

g = (num for num in range(100))

Now we have a generator g, let’s wrap it into a custom class that will let us .peek(), .peekn(n) and .consume() the generator.

s = IntStream(g)

print(s.peek())  # Will print: 0
print(s.peek())  # Will print: 0
print(s.consume())  # Will print: 0
print(s.peek())  # Will print: 1
print(s.consume())  # Will print: 1

# Peek multiple values
print(s.peekn(10))  # Will print: (2, 3, 4, 5, 6, 7, 8, 9, 10, 11)

print(s.consume())  # Will print: 2

Now we will consume some values and get a range from a position we have already consumed:

# Consume the next 20 values
for _ in range(20):
    s.consume()

print(s.peek())  # Will print: 23

current_position = s.consumed

print(current_position)  # Will print: 23

# Will print (18, 19, 20, 21, 22, 23, 24, 25, 26, 27)
print(s.get_range(pos_from=current_position - 4, n=10))

collections.deque

The tl;dr of this version is that we will store “peeked” values. Then when we want to .consume() a new value, we will first return from the peeked ones, then from the original generator.

We are using a deque instead of a list because we want to append values to the right of the queue as we peek them, but then we need to pop values from the left of the queue when .consume()‘ing them. Some implementation details:

  • get_range is the same as before
  • peekn will first check if we have enough values in the self.peeked queue before consuming the original generator.
import itertools
from typing import Generator, Tuple, Optional
from collections import deque

g = (num for num in range(100))

IntGenerator = Generator[int, None, None]


class IntStream:
    def __init__(self, gen: IntGenerator):
        self.root_gen, self.original = itertools.tee(gen, 2)
        self.consumed = 0
        self.peeked = deque()

    def peek(self) -> int:
        if self.peeked:
            return self.peeked[0]

        try:
            v = next(self.root_gen)
        except StopIteration:
            raise StopIteration(
                "Can't peek more values, the generator has been fully consumed."
            )

        # store the value just peeked
        self.peeked.append(v)
        return v

    def peekn(self, n=1) -> Tuple[int, ...]:
        num_already_peeked_values = len(self.peeked)
        if n < num_already_peeked_values:
            return self.peeked[:n]

        num_not_peeked = n - num_already_peeked_values
        next_values = tuple(itertools.islice(self.root_gen, n))
        self.peeked.extend(next_values)
        return tuple(self.peeked)

    def get_range(self, pos_from, n: Optional[int] = None) -> Tuple[int, ...]:
        current, backup = itertools.tee(self.original, 2)
        self.original = backup
        # advance iterator to wanted starting position - 1
        for value in itertools.islice(current, pos_from - 1):
            continue

        return tuple(current) if not n else tuple(itertools.islice(current, n))

    def consume(self) -> int:
        self.consumed += 1
        if self.peeked:
            return self.peeked.popleft()
        return next(self.root_gen)

    def __repr__(self) -> str:
        return f"IntStream(\n{self.peekn(10)}\n...)"

We can use the same code as before to test it:

s = IntStream(g)

print(s.peek())  # Will print: 0
print(s.peek())  # Will print: 0
print(s.consume())  # Will print: 0
print(s.peek())  # Will print: 1
print(s.consume())  # Will print: 1

# Peek multiple values
print(s.peekn(10))  # Will print: (2, 3, 4, 5, 6, 7, 8, 9, 10, 11)

print(s.consume())  # Will print: 2

# Consume the next 20 values
for _ in range(20):
    s.consume()

print(s.peek())  # Will print: 23

current_position = s.consumed

print(current_position)  # Will print: 23


# Will print (18, 19, 20, 21, 22, 23, 24, 25, 26, 27)
print(s.get_range(pos_from=current_position - 4, n=10))

Some real-world use cases

I got these ideas of implementing a “peekable” iterator while taking David Beazley’s Write a Compiler course (if you’re interested in compilers, I highly recommend that course).

When iterating over a stream of tokens, you may want to know if the next thing to parse is a variable assignment. In this case, the .peekn() method could be useful to check if the next 2 tokens are a NAME + = sign (e.g: num = ...). Using a single .peek() may not be enough because a NAME token could also be the beginning of a function call.

In the compilers case, the peek_range method could be helpful to get some information for error messages. I was mostly using it for debugging purposes, when my parser failed it was useful to know the previous and next tokens in my iterator.