algo-string-01

1.000

14/14 tests· algorithms

Challenge · difficulty 4/5

# Longest k-repeated substring

Implement a file **`solution.py`** containing a function `longest_k_repeated`:

```python
def longest_k_repeated(s: str, k: int) -> int:
    """Return the length of the longest substring of `s` that occurs
    at least `k` times."""
```

Given a string `s` and an integer `k >= 1`, return the **length of the longest
non-empty substring** of `s` that occurs **at least `k` times** in `s`. If no
non-empty substring occurs at least `k` times, return `0`.

## What counts as an occurrence

- An occurrence is a **distinct starting position** in `s`. A substring of
  length `L` occurs at position `i` iff `s[i:i+L]` equals it.
- **Overlaps are allowed.** For example, in `"aaaa"` the substring `"aaa"`
  occurs at positions `0` and `1`, so it occurs **2** times.
- The string is compared **exactly**, character by character; matching is
  **case-sensitive** and works over arbitrary Unicode characters.

## Precise definition

Return the largest `L >= 1` such that there exists a string `w` of length `L`
for which the number of indices `i` with `s[i:i+L] == w` is `>= k`. If no such
`L` exists, return `0`.

## Edge cases

- `k == 1`: every non-empty substring occurs at least once, so for a non-empty
  `s` the answer is `len(s)` (the whole string occurs once). For the empty
  string the answer is `0`.
- The empty string returns `0` for **every** `k`.
- If `k` exceeds the length of the longest single-character run and the string
  has no repeats at all (e.g. all-distinct characters), smaller substrings may
  still fail the threshold — return `0` when nothing qualifies.

## Worked examples

```python
assert longest_k_repeated("banana", 2) == 3    # "ana" occurs at 1 and 3 (overlap)
assert longest_k_repeated("banana", 3) == 1    # only single chars occur >= 3 times
assert longest_k_repeated("banana", 4) == 0
assert longest_k_repeated("aaa", 2) == 2       # "aa" at 0 and 1
assert longest_k_repeated("aaa", 3) == 1       # "a" x3
assert longest_k_repeated("abcabc", 1) == 6    # whole string, k==1
assert longest_k_repeated("abcdef", 2) == 0    # all distinct, nothing repeats
assert longest_k_repeated("", 5) == 0
```

## Efficiency

Inputs can be **large**: `len(s)` up to about `100000`. A naive approach that
enumerates every substring is `O(n^2)` in time and memory and will time out.
Aim for roughly `O(n)` or `O(n log n)`. (A suffix automaton, or a suffix array
with LCP, or binary-search-plus-hashing all work.)

## Constraints

- `1 <= k`
- `0 <= len(s) <= 100000`
- `s` consists of arbitrary characters (tests use printable ASCII).

tests/test_longest_k_repeated.py

import random

from solution import longest_k_repeated


def brute(s, k):
    """O(n^2) reference oracle for small strings."""
    n = len(s)
    if n == 0 or k < 1:
        return 0
    for L in range(n, 0, -1):
        seen = {}
        for i in range(n - L + 1):
            sub = s[i:i + L]
            c = seen.get(sub, 0) + 1
            seen[sub] = c
            if c >= k:
                return L
    return 0


def test_empty_string():
    assert longest_k_repeated("", 1) == 0
    assert longest_k_repeated("", 2) == 0
    assert longest_k_repeated("", 5) == 0


def test_k_one_is_whole_string():
    assert longest_k_repeated("a", 1) == 1
    assert longest_k_repeated("abc", 1) == 3
    assert longest_k_repeated("abcabc", 1) == 6


def test_no_repeat_returns_zero():
    # All distinct characters: nothing occurs twice.
    assert longest_k_repeated("abcdef", 2) == 0
    assert longest_k_repeated("a", 2) == 0
    assert longest_k_repeated("xyz", 3) == 0


def test_single_char_runs():
    # "aaa": 'a' x3, 'aa' x2, 'aaa' x1
    assert longest_k_repeated("aaa", 1) == 3
    assert longest_k_repeated("aaa", 2) == 2
    assert longest_k_repeated("aaa", 3) == 1
    assert longest_k_repeated("aaa", 4) == 0


def test_banana():
    # classic: "ana" occurs at positions 1 and 3 (overlapping)
    assert longest_k_repeated("banana", 2) == 3
    # "a" occurs 3 times, "an"/"na" twice, "ana" twice
    assert longest_k_repeated("banana", 3) == 1
    assert longest_k_repeated("banana", 4) == 0


def test_overlapping_counts():
    # "aaaa": "aaa" occurs at 0 and 1 -> length 3 for k=2
    assert longest_k_repeated("aaaa", 2) == 3
    assert longest_k_repeated("aaaa", 3) == 2
    assert longest_k_repeated("aaaa", 4) == 1


def test_disjoint_repeat():
    # "abcXabc": "abc" occurs twice, no overlap
    assert longest_k_repeated("abcXabc", 2) == 3
    assert longest_k_repeated("abcXabc", 3) == 0


def test_mixed_case_sensitive():
    # 'A' and 'a' are different characters.
    # "AaAa": "Aa" occurs at 0 and 2 (len 2); "AaA" occurs once, "aAa" once.
    assert longest_k_repeated("AaAa", 2) == 2


def test_period_two_medium():
    s = "ab" * 50
    n = len(s)
    # periodic with period 2: s[0..n-3] == s[2..n-1]
    assert longest_k_repeated(s, 2) == n - 2


def test_matches_brute_small_random():
    rng = random.Random(1234)
    for _ in range(400):
        n = rng.randint(0, 12)
        alpha = "ab" if rng.random() < 0.5 else "abc"
        s = "".join(rng.choice(alpha) for _ in range(n))
        for k in range(1, 6):
            assert longest_k_repeated(s, k) == brute(s, k), (s, k)


def test_matches_brute_larger_alphabet():
    rng = random.Random(99)
    for _ in range(150):
        n = rng.randint(0, 40)
        s = "".join(rng.choice("abcde") for _ in range(n))
        for k in range(1, 4):
            assert longest_k_repeated(s, k) == brute(s, k), (s, k)


def test_large_all_same():
    s = "a" * 100000
    assert longest_k_repeated(s, 2) == 99999
    assert longest_k_repeated(s, 100000) == 1
    assert longest_k_repeated(s, 100001) == 0


def test_large_period_two():
    s = "ab" * 50000
    n = len(s)  # 100000
    assert longest_k_repeated(s, 2) == n - 2


def test_large_no_long_repeat():
    # Random over a 10-char alphabet: the longest repeat must be short.
    rng = random.Random(7)
    s = "".join(rng.choice("abcdefghij") for _ in range(60000))
    ans = longest_k_repeated(s, 2)
    # There must be some repeat (pigeonhole on length-1 substrings), and it
    # should be far shorter than the whole string.
    assert 1 <= ans < 1000

System prompt

You are an expert programmer. Solve the task exactly as specified. Output your solution as fenced code blocks using the required file name(s) and the exact function/type signatures requested. Prefer correctness; do not include prose outside code unless asked.

Proposed solution

```python
def longest_k_repeated(s: str, k: int) -> int:
    """Return the length of the longest substring of `s` that occurs
    at least `k` times."""
    if k == 1:
        return len(s)
    if len(s) == 0:
        return 0
    
    def check(length: int) -> bool:
        if length == 0:
            return False
        if length > len(s):
            return False
        
        # Double hash to minimize collisions
        MOD1 = 10**9 + 7
        MOD2 = 10**9 + 9
        BASE1 = 31
        BASE2 = 37
        
        n = len(s)
        
        # Precompute powers for rolling hash
        pow1 = pow(BASE1, length - 1, MOD1)
        pow2 = pow(BASE2, length - 1, MOD2)
        
        # Compute initial hash for first substring
        h1 = 0
        h2 = 0
        for i in range(length):
            h1 = (h1 * BASE1 + ord(s[i])) % MOD1
            h2 = (h2 * BASE2 + ord(s[i])) % MOD2
        
        # Count hash occurrences
        from collections import defaultdict
        hash_counts = defaultdict(int)
        hash_counts[(h1, h2)] = 1
        
        # Rolling hash for remaining substrings
        for i in range(1, n - length + 1):
            # Remove leading character
            h1 = (h1 - ord(s[i - 1]) * pow1) % MOD1
            h2 = (h2 - ord(s[i - 1]) * pow2) % MOD2
            
            # Add new character
            h1 = (h1 * BASE1 + ord(s[i + length - 1])) % MOD1
            h2 = (h2 * BASE2 + ord(s[i + length - 1])) % MOD2
            
            hash_counts[(h1, h2)] += 1
        
        # Check if any hash appears >= k times
        for count in hash_counts.values():
            if count >= k:
                return True
        
        return False
    
    # Binary search for the longest length
    left, right = 1, len(s)
    result = 0
    
    while left <= right:
        mid = (left + right) // 2
        if check(mid):
            result = mid
            left = mid + 1
        else:
            right = mid - 1
    
    return result
```

Test output (stdout)

..............                                                           [100%]
14 passed in 1.41s

Reasoning (chain-of-thought)

The problem asks me to find the length of the longest substring that occurs at least `k` times in a given string `s`.

Let me think about the approaches:

1. **Naive approach**: Enumerate all substrings and count occurrences. This would be O(n²) or worse, which won't work for n=100,000.

2. **Binary search on answer**: If a substring of length L occurs at least k times, then a substring of length L-1 also occurs at least k times (by taking any substring of length L-1 from those occurrences). This monotonicity allows binary search on the answer.

   For a fixed length L, we need to check if any substring of length L occurs at least k times. We can use rolling hash to count occurrences in O(n) time.

3. **Suffix structures**: Suffix array + LCP array, or suffix automaton can solve this in O(n) or O(n log n).

Given the constraints (n up to 100,000), I should aim for O(n log n) or O(n).

Let me implement the binary search + rolling hash approach:

1. Binary search on the answer L from 1 to len(s)
2. For each L, check if any substring of length L appears at least k times using rolling hash
3. Return the maximum L that satisfies the condition

For the rolling hash check:
- Compute hash of all substrings of length L
- Count occurrences of each hash
- If any hash appears >= k times, return True

However, there's a collision issue with rolling hash. I should use double hashing or a large modulus to minimize collisions.

Actually, let me think about this more carefully. For n=100,000, O(n log n) should be acceptable.

Let me implement:

```python
def longest_k_repeated(s: str, k: int) -> int:
    if k == 1:
        return len(s)
    if len(s) == 0:
        return 0
    
    def check(length: int) -> bool:
        if length == 0:
            return False
        if length > len(s):
            return False
        
        # Use rolling hash to count substring occurrences
        # Double hash to avoid collisions
        MOD1 = 10**9 + 7
        MOD2 = 10**9 + 9
        BASE1 = 31
        BASE2 = 37
        
        n = len(s)
        if length > n:
            return False
        
        # Compute initial hashes
        h1 = 0
        h2 = 0
        for i in range(length):
            h1 = (h1 * BASE1 + ord(s[i])) % MOD1
            h2 = (h2 * BASE2 + ord(s[i])) % MOD2
        
        # Store hash pairs and count
        from collections import defaultdict
        hash_counts = defaultdict(int)
        hash_counts[(h1, h2)] = 1
        
        # Rolling hash for remaining substrings
        for i in range(1, n - length + 1):
            # Remove leading character
            h1 = (h1 - ord(s[i-1]) * pow(BASE1, length - 1, MOD1)) % MOD1
            h2 = (h2 - ord(s[i-1]) * pow(BASE2, length - 1, MOD2)) % MOD2
            
            # Add new character
            h1 = (h1 * BASE1 + ord(s[i + length - 1])) % MOD1
            h2 = (h2 * BASE2 + ord(s[i + length - 1])) % MOD2
            
            hash_counts[(h1, h2)] += 1
        
        # Check if any hash appears >= k times
        for count in hash_counts.values():
            if count >= k:
                return True
        
        return False
    
    # Binary search for the longest length
    left, right = 1, len(s)
    result = 0
    
    while left <= right:
        mid = (left + right) // 2
        if check(mid):
            result = mid
            left = mid + 1
        else:
            right = mid - 1
    
    return result
```

I'm carefully considering the edge cases and implementation details. The binary search approach allows me to efficiently find the longest substring that appears at least k times. Double hashing helps minimize collision risks, making the solution more robust for large input strings.

The key challenge is handling the rolling hash computation efficiently while avoiding potential hash collisions. I'll need to carefully manage the hash calculations to ensure accuracy.

The implementation requires careful handling of edge cases like empty strings, single character strings, and scenarios where no substring meets the k-occurrence requirement.

I'll focus on creating a robust solution that can handle various input scenarios while maintaining computational efficiency.

The binary search approach allows me to systematically explore substring lengths, checking each potential length against the k-occurrence constraint. By using rolling hash techniques, I can efficiently count substring occurrences without exhaustive enumeration.

Key considerations include managing hash collisions, handling edge cases like empty strings, and ensuring the algorithm performs well within the 100,000 character limit.

The implementation will leverage Python's built-in collections and mathematical operations to create a concise yet powerful substring analysis method.