py-14-regex-engine

1.000

19/19 tests· architecture

Challenge · difficulty 5/5

# Backtracking regex matcher

Implement a file **`solution.py`** containing a function `fullmatch` that decides
whether a regular expression matches a string. You must implement the matching
yourself with a **backtracking** algorithm — **do not** call Python's `re` module.

```python
def fullmatch(pattern: str, text: str) -> bool:
    """Return True iff `pattern` matches the ENTIRE `text`."""
```

`fullmatch` is **anchored**: the pattern must consume the entire `text`, as if it
were wrapped in an implicit `^...$`. Matching only a prefix of `text` is **not**
a match.

## Supported syntax

An *element* is the smallest matchable unit. It is one of:

- A **literal character** — matches exactly itself (e.g. `a` matches `"a"`).
- `.` — matches **any single character**.
- An **escaped character** `\x` — the backslash strips any special meaning from
  the next character, which then matches **literally**. So `\.` matches a literal
  `.`, `\*` a literal `*`, `\\` a literal backslash, `\[` a literal `[`. Escaping
  an ordinary character (e.g. `\a`) just matches that character.
- A **character class** `[...]`:
  - `[abc]` matches any **one** of the listed characters.
  - `[^abc]` matches any one character **not** listed (a *negated* class). The `^`
    is special only as the **first** character inside the brackets.
  - **Ranges** like `[a-z]`, `[0-9]`, `[A-Za-z0-9]` match any character whose code
    point falls in the (inclusive) range.
  - A `-` that appears **first or last** inside the class (right after `[` or
    `[^`, or right before `]`) is a **literal** `-`, not a range.
  - A class always matches **exactly one** character of `text` (so a class can
    never match against the empty string at a position).

## Quantifiers

A quantifier applies to the **single preceding element**:

- `*` — **zero or more** of the preceding element.
- `+` — **one or more** of the preceding element.
- `?` — **zero or one** of the preceding element.

Quantifiers are **greedy**: they consume as many repetitions as possible, but they
**must backtrack** — giving repetitions back one at a time — so that the rest of
the pattern can still match and the **whole** text can be consumed.

A quantifier always binds to the whole element immediately to its left, where an
element is a literal char, an escaped char `\x`, `.`, or an entire `[...]` class.
For example, in `a[0-9]*b` the `*` applies to the class `[0-9]`, and in `\.*` the
`*` applies to the escaped literal dot.

## Not supported

There are **no** groups `(...)` and **no** alternation `|`. You do not need to
handle them.

## Semantics / edge cases

- The **empty pattern** matches **only** the empty string.
- `a*` matches `""` (zero repetitions). `a+` does **not** match `""`.
- Greedy backtracking: `a*a` matches `"aaa"` — the `*` first grabs all three
  `a`s, then gives one back so the trailing `a` can match.
- `.*` matches **anything** (including `""`); `.+` matches any non-empty string.
- Anchoring: `a` does **not** match `"ab"` (trailing `b` unconsumed), and `b` does
  not match `"ab"` (leading `a` unconsumed).

## Worked example

```python
assert fullmatch("a*a", "aaa") is True      # '*' backtracks, gives back one 'a'
assert fullmatch("a+", "") is False         # '+' needs at least one
assert fullmatch("a?b", "b") is True        # '?' matches zero
assert fullmatch("[a-z]+[0-9]*", "abc12") is True
assert fullmatch("[^0-9]+", "abc") is True  # negated class
assert fullmatch("h\\.t", "h.t") is True    # escaped dot is literal
assert fullmatch("h.t", "hat") is True      # '.' matches any char
assert fullmatch("a.c", "ac") is False      # '.' needs one char
assert fullmatch("colou?r", "color") is True
assert fullmatch("", "") is True
assert fullmatch("", "x") is False
```

tests/test_regex_engine.py

from solution import fullmatch


def test_empty_pattern_matches_only_empty():
    assert fullmatch("", "") is True
    assert fullmatch("", "x") is False
    assert fullmatch("", "abc") is False


def test_plain_literals():
    assert fullmatch("abc", "abc") is True
    assert fullmatch("abc", "abd") is False
    assert fullmatch("a", "a") is True
    assert fullmatch("a", "") is False


def test_anchoring_requires_full_consume():
    # leading and trailing extra chars must cause failure
    assert fullmatch("a", "ab") is False
    assert fullmatch("b", "ab") is False
    assert fullmatch("ab", "abc") is False
    assert fullmatch("bc", "abc") is False
    assert fullmatch("abc", "ab") is False


def test_dot_matches_any_single_char():
    assert fullmatch("h.t", "hat") is True
    assert fullmatch("h.t", "h9t") is True
    assert fullmatch(".", "x") is True
    assert fullmatch(".", "") is False     # needs exactly one char
    assert fullmatch("a.c", "ac") is False  # dot must consume one
    assert fullmatch("...", "abc") is True
    assert fullmatch("...", "ab") is False


def test_star_zero_or_more():
    assert fullmatch("a*", "") is True
    assert fullmatch("a*", "a") is True
    assert fullmatch("a*", "aaaa") is True
    assert fullmatch("a*", "aaab") is False
    assert fullmatch("ba*", "b") is True
    assert fullmatch("ba*c", "bc") is True
    assert fullmatch("ba*c", "baaac") is True


def test_plus_one_or_more():
    assert fullmatch("a+", "") is False
    assert fullmatch("a+", "a") is True
    assert fullmatch("a+", "aaa") is True
    assert fullmatch("a+", "aaab") is False
    assert fullmatch("ba+c", "bac") is True
    assert fullmatch("ba+c", "bc") is False


def test_question_zero_or_one():
    assert fullmatch("a?", "") is True
    assert fullmatch("a?", "a") is True
    assert fullmatch("a?", "aa") is False
    assert fullmatch("colou?r", "color") is True
    assert fullmatch("colou?r", "colour") is True
    assert fullmatch("colou?r", "colouur") is False
    assert fullmatch("a?b", "b") is True
    assert fullmatch("a?b", "ab") is True


def test_greedy_backtracking_star():
    # '*' must give characters back so the trailing 'a' can match
    assert fullmatch("a*a", "aaa") is True
    assert fullmatch("a*a", "a") is True
    assert fullmatch("a*a", "") is False     # need at least one 'a'
    assert fullmatch("a*ab", "aaab") is True
    assert fullmatch("a*ab", "ab") is True


def test_dotstar_matches_anything():
    assert fullmatch(".*", "") is True
    assert fullmatch(".*", "anything at all 123 !@#") is True
    assert fullmatch(".+", "") is False
    assert fullmatch(".+", "x") is True
    # backtracking through dot-star
    assert fullmatch(".*b", "aaabxb") is True
    assert fullmatch(".*b", "aaabx") is False


def test_dotstar_backtracking_with_suffix():
    assert fullmatch("a.*z", "az") is True
    assert fullmatch("a.*z", "abcz") is True
    assert fullmatch("a.*z", "abc") is False
    assert fullmatch("x.*y.*z", "xyz") is True
    assert fullmatch("x.*y.*z", "x111y222z") is True


def test_char_class_basic():
    assert fullmatch("[abc]", "a") is True
    assert fullmatch("[abc]", "b") is True
    assert fullmatch("[abc]", "c") is True
    assert fullmatch("[abc]", "d") is False
    assert fullmatch("[abc]", "") is False
    assert fullmatch("[abc]", "ab") is False  # class matches exactly one


def test_char_class_ranges():
    assert fullmatch("[a-z]", "m") is True
    assert fullmatch("[a-z]", "M") is False
    assert fullmatch("[0-9]", "7") is True
    assert fullmatch("[0-9]", "a") is False
    assert fullmatch("[A-Za-z0-9]", "Q") is True
    assert fullmatch("[A-Za-z0-9]", "q") is True
    assert fullmatch("[A-Za-z0-9]", "5") is True
    assert fullmatch("[A-Za-z0-9]", "_") is False


def test_char_class_with_quantifiers():
    assert fullmatch("[a-z]+", "hello") is True
    assert fullmatch("[a-z]+", "Hello") is False
    assert fullmatch("[a-z]+[0-9]*", "abc12") is True
    assert fullmatch("[a-z]+[0-9]*", "abc") is True
    assert fullmatch("[a-z]+[0-9]*", "123") is False
    assert fullmatch("[0-9]*", "") is True


def test_negated_char_class():
    assert fullmatch("[^0-9]", "a") is True
    assert fullmatch("[^0-9]", "5") is False
    assert fullmatch("[^0-9]+", "abc") is True
    assert fullmatch("[^0-9]+", "ab3c") is False
    assert fullmatch("[^abc]", "d") is True
    assert fullmatch("[^abc]", "a") is False


def test_literal_dash_in_class():
    # '-' first or last in the class is a literal dash
    assert fullmatch("[-a]", "-") is True
    assert fullmatch("[-a]", "a") is True
    assert fullmatch("[a-]", "-") is True
    assert fullmatch("[a-]", "a") is True
    assert fullmatch("[a-]", "b") is False
    # literal dash in a negated class
    assert fullmatch("[^-]", "-") is False
    assert fullmatch("[^-]", "x") is True


def test_escaped_metacharacters():
    assert fullmatch("h\\.t", "h.t") is True
    assert fullmatch("h\\.t", "hat") is False   # escaped dot is literal
    assert fullmatch("a\\*b", "a*b") is True
    assert fullmatch("a\\*b", "aab") is False
    assert fullmatch("a\\\\b", "a\\b") is True   # \\ -> literal backslash
    assert fullmatch("\\[x\\]", "[x]") is True
    assert fullmatch("a\\+", "a+") is True
    assert fullmatch("a\\?", "a?") is True


def test_escaped_element_with_quantifier():
    # quantifier applies to the escaped element
    assert fullmatch("\\.*", "") is True
    assert fullmatch("\\.*", "...") is True
    assert fullmatch("\\.*", ".a.") is False
    assert fullmatch("\\*+", "***") is True
    assert fullmatch("\\*+", "") is False


def test_combined_realistic_patterns():
    # identifier-ish: letter then letters/digits/underscore-via-class
    assert fullmatch("[a-z][a-z0-9]*", "x") is True
    assert fullmatch("[a-z][a-z0-9]*", "var123") is True
    assert fullmatch("[a-z][a-z0-9]*", "1var") is False
    # crude decimal: digits . digits
    assert fullmatch("[0-9]+\\.[0-9]+", "3.14") is True
    assert fullmatch("[0-9]+\\.[0-9]+", "3.") is False
    assert fullmatch("[0-9]+\\.[0-9]+", ".14") is False
    assert fullmatch("[0-9]+\\.[0-9]+", "10.005") is True


def test_clear_non_matches():
    assert fullmatch("abc", "") is False
    assert fullmatch("a+b+", "aaa") is False
    assert fullmatch("[xyz]+", "xya") is False
    assert fullmatch("a.c.e", "abcd") is False
    assert fullmatch("a*b*c*", "cba") is False

System prompt

You are an expert programmer. Solve the task exactly as specified. Output your solution as fenced code blocks using the required file name(s) and the exact function/type signatures requested. Prefer correctness; do not include prose outside code unless asked.

Proposed solution

I'll implement a backtracking regex matcher that supports the specified syntax. Let me break this down step by step.

```python
def fullmatch(pattern: str, text: str) -> bool:
    """Return True iff `pattern` matches the ENTIRE `text`."""
    # Convert pattern to tokens for easier processing
    try:
        tokens = parse_pattern(pattern)
    except:
        return False
    
    # Use backtracking to match
    return match_tokens(tokens, 0, text, 0)


def parse_pattern(pattern: str):
    """Parse pattern into a list of (element, quantifier) tuples."""
    if not pattern:
        return []
    
    tokens = []
    i = 0
    n = len(pattern)
    
    while i < n:
        # Check for quantifiers that apply to previous element
        if pattern[i] in '*+?':
            if not tokens:
                raise ValueError("Quantifier without preceding element")
            # Apply quantifier to last token
            last_elem, last_q = tokens[-1]
            tokens[-1] = (last_elem, pattern[i])
            i += 1
            continue
        
        # Parse an element
        if pattern[i] == '.':
            element = ('dot',)
            i += 1
        elif pattern[i] == '\\':
            if i + 1 >= n:
                raise ValueError("Trailing backslash")
            # Escaped character
            element = ('literal', pattern[i+1])
            i += 2
        elif pattern[i] == '[':
            # Character class
            j = i + 1
            negated = False
            if j < n and pattern[j] == '^':
                negated = True
                j += 1
            
            # Find closing bracket
            while j < n and pattern[j] != ']':
                j += 1
            
            if j >= n:
                raise ValueError("Unclosed character class")
            
            class_content = pattern[i+1:j]
            element = ('class', negated, parse_class(class_content))
            i = j + 1
        else:
            # Regular literal character
            element = ('literal', pattern[i])
            i += 1
        
        # Default quantifier is None (exactly one)
        tokens.append((element, None))
    
    return tokens


def parse_class(class_content: str):
    """Parse the content of a character class into a list of ranges."""
    ranges = []
    i = 0
    n = len(class_content)
    
    while i < n:
        # Check if this is a range (not first or last char, and not after '^')
        if i + 2 < n and class_content[i+1] == '-' and class_content[i+2] != ']':
            # This is a range: [start]-[end]
            start = class_content[i]
            end = class_content[i+2]
            ranges.append((ord(start), ord(end)))
            i += 3
        else:
            # Single character
            char = class_content[i]
            ranges.append((ord(char), ord(char)))
            i += 1
    
    return ranges


def match_tokens(tokens, token_idx, text, text_idx):
    """Backtracking matcher for tokens."""
    # Base case: no more tokens to match
    if token_idx == len(tokens):
        return text_idx == len(text)
    
    element, quantifier = tokens[token_idx]
    
    # Handle quantifiers
    if quantifier == '*':
        # Zero or more - try matching zero first (greedy means we'll try more later)
        # Try matching zero occurrences first
        if match_tokens(tokens, token_idx + 1, text, text_idx):
            return True
        
        # Try matching one or more (greedy: try as many as possible first)
        max_matches = len(text) - text_idx
        for count in range(1, max_matches + 1):
            if not matches_element(element, text, text_idx, count):
                break  # Can't match count times, so can't match more
            
            # Try matching the rest of the pattern after consuming count characters
            if match_tokens(tokens, token_idx + 1, text, text_idx + count):
                return True
        
        return False
    
    elif quantifier == '+':
        # One or more - must match at least once
        max_matches = len(text) - text_idx
        for count in range(1, max_matches + 1):
            if not matches_element(element, text, text_idx, count):
                break
            
            # Try matching the rest of the pattern after consuming count characters
            if match_tokens(tokens, token_idx + 1, text, text_idx + count):
                return True
        
        return False
    
    elif quantifier == '?':
        # Zero or one - try zero first (greedy means we'll try one if zero fails)
        # Try matching zero occurrences
        if match_tokens(tokens, token_idx + 1, text, text_idx):
            return True
        
        # Try matching one occurrence
        if matches_element(element, text, text_idx, 1):
            return match_tokens(tokens, token_idx + 1, text, text_idx + 1)
        
        return False
    
    else:  # No quantifier (exactly one)
        if matches_element(element, text, text_idx, 1):
            return match_tokens(tokens, token_idx + 1, text, text_idx + 1)
        return False


def matches_element(element, text, text_idx, count):
    """Check if element matches count characters starting at text_idx."""
    if text_idx + count > len(text):
        return False
    
    if element[0] == 'dot':
        # . matches any character
        return count > 0
    
    elif element[0] == 'literal':
        # Match literal character
        target = element[1]
        for i in range(count):
            if text_idx + i >= len(text) or text[text_idx + i] != target:
                return False
        return True
    
    elif element[0] == 'class':
        negated, ranges = element[1], element[2]
        for i in range(count):
            if text_idx + i >= len(text):
                return False
            
            char_code = ord(text[text_idx + i])
            # Check if character is in the class
            in_class = any(start <= char_code <= end for start, end in ranges)
            
            if negated:
                if in_class:
                    return False
            else:
                if not in_class:
                    return False
        
        return True
    
    return False
```

Test output (stdout)

...................                                                      [100%]
19 passed in 0.04s