algo-string-01
1.000
Challenge · difficulty 4/5
# Longest k-repeated substring
Implement a file **`solution.py`** containing a function `longest_k_repeated`:
```python
def longest_k_repeated(s: str, k: int) -> int:
"""Return the length of the longest substring of `s` that occurs
at least `k` times."""
```
Given a string `s` and an integer `k >= 1`, return the **length of the longest
non-empty substring** of `s` that occurs **at least `k` times** in `s`. If no
non-empty substring occurs at least `k` times, return `0`.
## What counts as an occurrence
- An occurrence is a **distinct starting position** in `s`. A substring of
length `L` occurs at position `i` iff `s[i:i+L]` equals it.
- **Overlaps are allowed.** For example, in `"aaaa"` the substring `"aaa"`
occurs at positions `0` and `1`, so it occurs **2** times.
- The string is compared **exactly**, character by character; matching is
**case-sensitive** and works over arbitrary Unicode characters.
## Precise definition
Return the largest `L >= 1` such that there exists a string `w` of length `L`
for which the number of indices `i` with `s[i:i+L] == w` is `>= k`. If no such
`L` exists, return `0`.
## Edge cases
- `k == 1`: every non-empty substring occurs at least once, so for a non-empty
`s` the answer is `len(s)` (the whole string occurs once). For the empty
string the answer is `0`.
- The empty string returns `0` for **every** `k`.
- If `k` exceeds the length of the longest single-character run and the string
has no repeats at all (e.g. all-distinct characters), smaller substrings may
still fail the threshold — return `0` when nothing qualifies.
## Worked examples
```python
assert longest_k_repeated("banana", 2) == 3 # "ana" occurs at 1 and 3 (overlap)
assert longest_k_repeated("banana", 3) == 1 # only single chars occur >= 3 times
assert longest_k_repeated("banana", 4) == 0
assert longest_k_repeated("aaa", 2) == 2 # "aa" at 0 and 1
assert longest_k_repeated("aaa", 3) == 1 # "a" x3
assert longest_k_repeated("abcabc", 1) == 6 # whole string, k==1
assert longest_k_repeated("abcdef", 2) == 0 # all distinct, nothing repeats
assert longest_k_repeated("", 5) == 0
```
## Efficiency
Inputs can be **large**: `len(s)` up to about `100000`. A naive approach that
enumerates every substring is `O(n^2)` in time and memory and will time out.
Aim for roughly `O(n)` or `O(n log n)`. (A suffix automaton, or a suffix array
with LCP, or binary-search-plus-hashing all work.)
## Constraints
- `1 <= k`
- `0 <= len(s) <= 100000`
- `s` consists of arbitrary characters (tests use printable ASCII).tests/test_longest_k_repeated.py
import random
from solution import longest_k_repeated
def brute(s, k):
"""O(n^2) reference oracle for small strings."""
n = len(s)
if n == 0 or k < 1:
return 0
for L in range(n, 0, -1):
seen = {}
for i in range(n - L + 1):
sub = s[i:i + L]
c = seen.get(sub, 0) + 1
seen[sub] = c
if c >= k:
return L
return 0
def test_empty_string():
assert longest_k_repeated("", 1) == 0
assert longest_k_repeated("", 2) == 0
assert longest_k_repeated("", 5) == 0
def test_k_one_is_whole_string():
assert longest_k_repeated("a", 1) == 1
assert longest_k_repeated("abc", 1) == 3
assert longest_k_repeated("abcabc", 1) == 6
def test_no_repeat_returns_zero():
# All distinct characters: nothing occurs twice.
assert longest_k_repeated("abcdef", 2) == 0
assert longest_k_repeated("a", 2) == 0
assert longest_k_repeated("xyz", 3) == 0
def test_single_char_runs():
# "aaa": 'a' x3, 'aa' x2, 'aaa' x1
assert longest_k_repeated("aaa", 1) == 3
assert longest_k_repeated("aaa", 2) == 2
assert longest_k_repeated("aaa", 3) == 1
assert longest_k_repeated("aaa", 4) == 0
def test_banana():
# classic: "ana" occurs at positions 1 and 3 (overlapping)
assert longest_k_repeated("banana", 2) == 3
# "a" occurs 3 times, "an"/"na" twice, "ana" twice
assert longest_k_repeated("banana", 3) == 1
assert longest_k_repeated("banana", 4) == 0
def test_overlapping_counts():
# "aaaa": "aaa" occurs at 0 and 1 -> length 3 for k=2
assert longest_k_repeated("aaaa", 2) == 3
assert longest_k_repeated("aaaa", 3) == 2
assert longest_k_repeated("aaaa", 4) == 1
def test_disjoint_repeat():
# "abcXabc": "abc" occurs twice, no overlap
assert longest_k_repeated("abcXabc", 2) == 3
assert longest_k_repeated("abcXabc", 3) == 0
def test_mixed_case_sensitive():
# 'A' and 'a' are different characters.
# "AaAa": "Aa" occurs at 0 and 2 (len 2); "AaA" occurs once, "aAa" once.
assert longest_k_repeated("AaAa", 2) == 2
def test_period_two_medium():
s = "ab" * 50
n = len(s)
# periodic with period 2: s[0..n-3] == s[2..n-1]
assert longest_k_repeated(s, 2) == n - 2
def test_matches_brute_small_random():
rng = random.Random(1234)
for _ in range(400):
n = rng.randint(0, 12)
alpha = "ab" if rng.random() < 0.5 else "abc"
s = "".join(rng.choice(alpha) for _ in range(n))
for k in range(1, 6):
assert longest_k_repeated(s, k) == brute(s, k), (s, k)
def test_matches_brute_larger_alphabet():
rng = random.Random(99)
for _ in range(150):
n = rng.randint(0, 40)
s = "".join(rng.choice("abcde") for _ in range(n))
for k in range(1, 4):
assert longest_k_repeated(s, k) == brute(s, k), (s, k)
def test_large_all_same():
s = "a" * 100000
assert longest_k_repeated(s, 2) == 99999
assert longest_k_repeated(s, 100000) == 1
assert longest_k_repeated(s, 100001) == 0
def test_large_period_two():
s = "ab" * 50000
n = len(s) # 100000
assert longest_k_repeated(s, 2) == n - 2
def test_large_no_long_repeat():
# Random over a 10-char alphabet: the longest repeat must be short.
rng = random.Random(7)
s = "".join(rng.choice("abcdefghij") for _ in range(60000))
ans = longest_k_repeated(s, 2)
# There must be some repeat (pigeonhole on length-1 substrings), and it
# should be far shorter than the whole string.
assert 1 <= ans < 1000
System prompt
You are an expert programmer. Solve the task exactly as specified. Output your solution as fenced code blocks using the required file name(s) and the exact function/type signatures requested. Prefer correctness; do not include prose outside code unless asked.
Proposed solution
```python
def longest_k_repeated(s: str, k: int) -> int:
"""Return the length of the longest substring of `s` that occurs
at least `k` times."""
if k == 1:
return len(s)
if len(s) == 0:
return 0
def check(length: int) -> bool:
if length == 0:
return False
if length > len(s):
return False
# Double hash to minimize collisions
MOD1 = 10**9 + 7
MOD2 = 10**9 + 9
BASE1 = 31
BASE2 = 37
n = len(s)
# Precompute powers for rolling hash
pow1 = pow(BASE1, length - 1, MOD1)
pow2 = pow(BASE2, length - 1, MOD2)
# Compute initial hash for first substring
h1 = 0
h2 = 0
for i in range(length):
h1 = (h1 * BASE1 + ord(s[i])) % MOD1
h2 = (h2 * BASE2 + ord(s[i])) % MOD2
# Count hash occurrences
from collections import defaultdict
hash_counts = defaultdict(int)
hash_counts[(h1, h2)] = 1
# Rolling hash for remaining substrings
for i in range(1, n - length + 1):
# Remove leading character
h1 = (h1 - ord(s[i - 1]) * pow1) % MOD1
h2 = (h2 - ord(s[i - 1]) * pow2) % MOD2
# Add new character
h1 = (h1 * BASE1 + ord(s[i + length - 1])) % MOD1
h2 = (h2 * BASE2 + ord(s[i + length - 1])) % MOD2
hash_counts[(h1, h2)] += 1
# Check if any hash appears >= k times
for count in hash_counts.values():
if count >= k:
return True
return False
# Binary search for the longest length
left, right = 1, len(s)
result = 0
while left <= right:
mid = (left + right) // 2
if check(mid):
result = mid
left = mid + 1
else:
right = mid - 1
return result
```Test output (stdout)
.............. [100%] 14 passed in 1.41s
Reasoning (chain-of-thought)
The problem asks me to find the length of the longest substring that occurs at least `k` times in a given string `s`.
Let me think about the approaches:
1. **Naive approach**: Enumerate all substrings and count occurrences. This would be O(n²) or worse, which won't work for n=100,000.
2. **Binary search on answer**: If a substring of length L occurs at least k times, then a substring of length L-1 also occurs at least k times (by taking any substring of length L-1 from those occurrences). This monotonicity allows binary search on the answer.
For a fixed length L, we need to check if any substring of length L occurs at least k times. We can use rolling hash to count occurrences in O(n) time.
3. **Suffix structures**: Suffix array + LCP array, or suffix automaton can solve this in O(n) or O(n log n).
Given the constraints (n up to 100,000), I should aim for O(n log n) or O(n).
Let me implement the binary search + rolling hash approach:
1. Binary search on the answer L from 1 to len(s)
2. For each L, check if any substring of length L appears at least k times using rolling hash
3. Return the maximum L that satisfies the condition
For the rolling hash check:
- Compute hash of all substrings of length L
- Count occurrences of each hash
- If any hash appears >= k times, return True
However, there's a collision issue with rolling hash. I should use double hashing or a large modulus to minimize collisions.
Actually, let me think about this more carefully. For n=100,000, O(n log n) should be acceptable.
Let me implement:
```python
def longest_k_repeated(s: str, k: int) -> int:
if k == 1:
return len(s)
if len(s) == 0:
return 0
def check(length: int) -> bool:
if length == 0:
return False
if length > len(s):
return False
# Use rolling hash to count substring occurrences
# Double hash to avoid collisions
MOD1 = 10**9 + 7
MOD2 = 10**9 + 9
BASE1 = 31
BASE2 = 37
n = len(s)
if length > n:
return False
# Compute initial hashes
h1 = 0
h2 = 0
for i in range(length):
h1 = (h1 * BASE1 + ord(s[i])) % MOD1
h2 = (h2 * BASE2 + ord(s[i])) % MOD2
# Store hash pairs and count
from collections import defaultdict
hash_counts = defaultdict(int)
hash_counts[(h1, h2)] = 1
# Rolling hash for remaining substrings
for i in range(1, n - length + 1):
# Remove leading character
h1 = (h1 - ord(s[i-1]) * pow(BASE1, length - 1, MOD1)) % MOD1
h2 = (h2 - ord(s[i-1]) * pow(BASE2, length - 1, MOD2)) % MOD2
# Add new character
h1 = (h1 * BASE1 + ord(s[i + length - 1])) % MOD1
h2 = (h2 * BASE2 + ord(s[i + length - 1])) % MOD2
hash_counts[(h1, h2)] += 1
# Check if any hash appears >= k times
for count in hash_counts.values():
if count >= k:
return True
return False
# Binary search for the longest length
left, right = 1, len(s)
result = 0
while left <= right:
mid = (left + right) // 2
if check(mid):
result = mid
left = mid + 1
else:
right = mid - 1
return result
```
I'm carefully considering the edge cases and implementation details. The binary search approach allows me to efficiently find the longest substring that appears at least k times. Double hashing helps minimize collision risks, making the solution more robust for large input strings.
The key challenge is handling the rolling hash computation efficiently while avoiding potential hash collisions. I'll need to carefully manage the hash calculations to ensure accuracy.
The implementation requires careful handling of edge cases like empty strings, single character strings, and scenarios where no substring meets the k-occurrence requirement.
I'll focus on creating a robust solution that can handle various input scenarios while maintaining computational efficiency.
The binary search approach allows me to systematically explore substring lengths, checking each potential length against the k-occurrence constraint. By using rolling hash techniques, I can efficiently count substring occurrences without exhaustive enumeration.
Key considerations include managing hash collisions, handling edge cases like empty strings, and ensuring the algorithm performs well within the 100,000 character limit.
The implementation will leverage Python's built-in collections and mathematical operations to create a concise yet powerful substring analysis method.