The true measure of the value of any business leader and manager is performance.
String matching is a very common problem, given a text $T$ and a pattern $P$ find the occurrences of $P$ in $T$. This problem doesn't need too much introduction.
The most easy (and naive) algorithm to solve this problem is to slide off $P$ through $T$ and see if there is a match. Something like this:
for (int i = 0; i + P.length <= T.length; i++) {
int j = 0;
while (P[j] == T[i+j])
j++;
if (j == P.length)
// We found a match!
}
The $Z_{i}(T)$ function
For example, let be T = "cabacadcab", $Z_{3} = 0$, $Z_{4} = 2$ and $Z_{7} = 3$. Obviously $Z_{0}$ is always equal to the length of the string. See figure ? .
Z Boxes
Assume that we have already computed the values of $Z$ upto some $k-1$ and now we need to compute $Z_{k}$. There are four possible cases. In the following pictures $l$ and $r$ denote de start and the end of the last Z-box.
Case 1: $ k $ is out of the last Z-box
Position $k$ is not contained in the last Z-box. We need to compare character by character to find $Z_{k}$.
Case 2: $ k $ is within the last Z-box
We'll denote this last Z-box as $\alpha$ and as $\beta$ the box starting at position $k$ and ending at $r$. Since every Z-box matches a prefix, the figure ? depicts our situation.
As you can see, $k'$ corresponds to position $k$ in the prefix and we already computed $Z_{k'}$ and so we can leverage this fact. There are three more cases.
Case 2a: $ Z_{k'} < |\beta| $
In this case $Z_{k} = Z_{k'}$.
Case 2b: $ Z_{k'} > |\beta| $
Let be $x$ the first character that is not contained in the last Z-box and $y$ the first character that is not contained in the prefix, we know one thing, $x \neq y$ and therefore $Z_{k}$ cannot be greater than $Z_{k'}$. Result $Z_{k} = |\beta|$.
Case 2c: $ Z_{k'} = |eta| $
Here we know two things, $x \neq y$ and $y \neq w$. How about $x$ and $w$? We don't know, they may be equal or not. In this case it's necessary to verify.
Okay, here is an implementation to complement the explanation:
Z[0] = n;
int l = 0, r = 0;
for (int k = 1; k < n; k++) {
if (r < k) {
l = r = k;
while (S[r] == S[r-l])
r++;
Z[k] = r - l;
} else {
int b = r - k;
int j = k - l; // j is k'
if (Z[j] < b) {
Z[k] = Z[j];
} else if (Z[j] > b) {
Z[k] = b;
} else {
l = k;
r = k + b;
while (S[r] == S[r-l])
r++;
Z[k] = r - l;
}
}
}
Since $l$ and $r$ never decrease the complexity of this algorithm is $O(n)$.
String matching using the Z function
Now that we know how to compute the Z function let's see how to use it to find the occurrences of a pattern $P$ in a text $T$. The idea is easy:
- Concatenate $P$ with $T$ and compute the Z function in the resulting string ($S$ = $P$ + $T$ )
- There is an occurrence of $P$ that start at position $i >= |P|$ if $Z_{i}(S) >= |P|$.
Practice problems
Here are some problems to put into practice this algorithm:
- Codeforces Round #246 Div 2 Problem D: Prefixes and Suffixes | My solution
- Light OJ: 1255 - Substring Frequency | My solution
References
[1] | DAN GUSFIELD, Linear-time pattern matching. Z-values and Z-algorithm, http://www.cs.ucdavis.edu/~gusfield/cs122f10/videolist.html |