KMP算法

Published on	7/29/2025
Last updated on	12/4/2022
Created on	12/17/2021

KMP算法（Knuth–Morris–Pratt algorithm）是个高效的字符串匹配算法。

方法

假设我们要做以下匹配：

\begin{aligned} &\text{文本串}A:\quad \mathtt{ \underset{1}{A}\ \, \underset{2}{B}\ \, \underset{3}{A}\ \, \underset{4}{B}\ \, \underset{5}{A}\ \, \underset{6}{B}\ \, \underset{7}{A}\ \, \underset{8}{A}\ \, \underset{9}{B}\ \underset{10}{A}\ \underset{11}{B}\ \underset{12}{A}\ \underset{13}{C}\ \underset{14}{B} }\\ &\text{模式串}B:\quad \mathtt{ \underset{1}{A}\ \, \underset{2}{B}\ \, \underset{3}{A}\ \, \underset{4}{B}\ \, \underset{5}{A}\ \, \underset{6}{C}\ \, \underset{7}{B} } \end{aligned}

很明显，一开始匹配到第 $\mathtt{6}$ 位时字符就不同了，传统匹配方法会将模式串往前移一位，再重新一位一位匹配，但我们知道，文本串前 $\mathtt{5}$ 位是和模式串相同的，如果将模式串往前移动一位后能够匹配成功，肯定有个前提是模式串的第 $\mathtt{1}$ 位和第 $\mathtt{2}$ 位相同，不然是肯定匹配不成的，这几次检查就白费了。那我们该移动几位呢？模式串和文本串的前 $\mathtt{5}$ 位是相同的，我们把这一部分称为模式串的前缀 $P$ ，我们往前移动模式串，肯定要移动到一个有可能匹配成功的位置，我们发现，如果相同部分 $P$ 的前 $k$ 位和后 $k$ 位相同，那我们可以移动模式串使得模式串的前 $k$ 位与文本串相同部分的后 $k$ 位匹配，比如，在我们例子里的 $P$ 中：前 $\mathtt{1}$ 位和后 $\mathtt{1}$ 位相同：

\mathtt{ \textcolor{Green}{ \underset{1}{A}\ \, } \textcolor{SpringGreen}{ \underset{2}{B}\ \, \underset{3}{A}\ \, \underset{4}{B}\ \, } \textcolor{Green}{ \underset{5}{A}\ \, } \textcolor{Gray}{ \underset{6}{C}\ \, \underset{7}{B} } }

前 $\mathtt{3}$ 位和后 $\mathtt{3}$ 位相同：

\mathtt{ \textcolor{Green}{ \underset{1}{A}\ \, \underset{2}{B}\ \, } \textcolor{OliveGreen}{ \underset{3}{A}\ \, } \textcolor{Green}{ \underset{4}{B}\ \, \underset{5}{A}\ \, } \textcolor{Gray}{ \underset{6}{C}\ \, \underset{7}{B} } }

因此我们可以直接把这 $\mathtt{3}$ 位对齐，然后直接依次往后比较 $A[6]$ 与 $B[4]$ 、 $A[7]$ 与 $B[5]$ 、 $A[8]$ 与 $B[6]$ 等等

\begin{aligned} &\text{文本串}A:\quad \mathtt{ \textcolor{SpringGreen}{ \underset{1}{A}\ \, \underset{2}{B}\ \, } \textcolor{Green}{ \underset{3}{A}\ \, \underset{4}{B}\ \, \underset{5}{A}\ \, } \underset{6}{B}\ \, \underset{7}{A}\ \, \underset{8}{A}\ \, \underset{9}{B}\ \underset{10}{A}\ \underset{11}{B}\ \underset{12}{A}\ \underset{13}{C}\ \underset{14}{B} }\\ &\text{模式串}B:\quad \mathtt{ \qquad \textcolor{Green}{ \underset{1}{A}\ \, \underset{2}{B}\ \, \underset{3}{A}\ \, } \textcolor{SpringGreen}{ \underset{4}{B}\ \, \underset{5}{A}\ \, } \underset{6}{C}\ \, \underset{7}{B} } \end{aligned}

我们对齐尽可能多的位数，比如这里对齐 $\mathtt{3}$ 位，而不是 $\mathtt{1}$ 位，这是因为要对齐 $\mathtt{1}$ 位我们需要往前移动模式串 $\mathtt{4}$ 位，中途跳过了对齐 $\mathtt{3}$ 位的情况，这会有漏解的隐患。 KMP 算法的核心是匹配失败后，算法会根据已匹配的模式串字符数量（ $P$ 的长度）来向前移动模式串，使其前缀与文本串中已匹配部分的相同后缀重合，再开始新一轮的比较。算法开始时，我们会预处理出一个数组 $\mathtt{kmp[]}$ ，长度与模式串相同， $\mathtt{kmp[k]}$ 表示已经匹配成功 $\mathtt{k}$ 位了，这时匹配失败，我们可以向前移动模式串，使模式串的前 $\mathtt{kmp[k]}$ 位保持与文本串的对应位置匹配（上述其实跟文本串是啥没关系，因为我们知道当前处理的部分是匹配的），再继续匹配。这个对齐的过程在 KMP 算法中经常出现，模式串的位置会经常改变，所以我们用两个索引来记录：

\begin{aligned} \mathtt{i}:\ &\text{正在匹配的文本串字符位置}\\ \mathtt{j}:\ &\text{模式串中我们已匹配成功的位置}\\ \end{aligned}

我们可以表示出来：

\begin{aligned} \text{文本串}A&:\quad \mathtt{ \textcolor{SpringGreen}{ \underset{1}{A}\ \, \underset{2}{B}\ \, } \textcolor{Green}{ \underset{3}{A}\ \, \underset{4}{B}\ \, \underset{5}{A}\ \, } }\mathtt{ \overset{ \textcolor{Orange}{\displaystyle{i}} }{ \underset{6}{B} }\ \, }\mathtt{ \underset{7}{A}\ \, \underset{8}{A}\ \, \underset{9}{B}\ \underset{10}{A}\ \underset{11}{B}\ \underset{12}{A}\ \underset{13}{C}\ \underset{14}{B} }\\ \text{模式串}B&:\quad \mathtt{ \qquad \textcolor{Green}{ \underset{1}{A}\ \, \underset{2}{B}\ \, } } \mathtt{ \underset{ \textcolor{Orange}{\displaystyle{j}} }{\textcolor{Green}{ \underset{3}{A} } }}\ \, \mathtt{ \textcolor{SpringGreen}{ \underset{4}{B}\ \, \underset{5}{A}\ \, } \underset{6}{C}\ \, \underset{7}{B} } \end{aligned}

每次对齐后，我们会检查 $A[i]$ 是否与 $B[j+1]$ 相同，是则继续比较，否则就没必要再比下去了，再对齐一次

以下内容来自KMP2.md

$\Sigma$ 表示字符集， $\Sigma^*$ 表示字符集中字符组成的字符串， $c$ 表示任意字符， $S$ 表示任意字符串， $M$ 表示自动机

对于自动机 $M$ ， $Q$ 表示状态集， $q$ 表示任意状态， $q_0$ 表示初始状态， $A\subseteq Q$ 表示接受状态集 $\delta(q,c):Q\times\Sigma\to Q$ 表示转移函数 $\phi(S):\Sigma^*\to Q$ 表示终态函数

对于模式串 $P$ ： $\sigma_P(S):\Sigma^*\to\N$ 表示后缀函数，表示： $P$ 从头匹配， $S$ 从尾匹配，最大的匹配长度。严谨地说，是与 $P$ 某一前缀匹配的 $S$ 最长后缀之长度。形式化地说：

\begin{align*} \sigma_P(S):=&\max\{k|\ P[k\!:]=S[:\!k]\ \}\\ =&\max\{k|\ P_k\sqsupset S\ \} \end{align*}

当 $\sigma_P(S[k\!:])=|P|$ 时，我们便能知道 $S[k-|P|+1\!:\!k]=P$ ，即 $k-|P|$ 为一个有效偏移。

我们想要在自动机上转移时，时刻保证：

^\#\phi(S[k\!:])=\sigma_P(S[k\!:])\qquad(1)

由此，我们便可定义状态集和接受状态集：

\begin{align*} Q&=\{q_i|i\in[0,|P|]\}\\ A&=\{q_{|P|}\}\\ \end{align*}

由 $(1)$ 可推知：

\begin{align*} ^\#\phi(S[k\!:]c)&=\sigma_P(S[k\!:]c)\\ ^\#\delta(\phi(S[k\!:]),c)&=\sigma_P(S[k\!:]c)\\ ^\#\delta(\sigma_P(S[k\!:]),c)&=\sigma_P(S[k\!:]c)\\ \end{align*}

我们有：

\begin{align*} \sigma_P(P[\sigma_P(S[k\!:])\!:]c)&=\sigma_P(S[k\!:]c)\\ \end{align*}

图形化证明：

P: \overbrace{ \fcolorbox{grey}{black}{\!\phantom{c}\!} \fcolorbox{grey}{black}{\!\phantom{c}\!} \fcolorbox{grey}{black}{\!\phantom{c}\!} }^{P[:\sigma]} \fcolorbox{b}{white}{\!c\!} \fcolorbox{grey}{white}{\!\phantom{c}\!} \dots \fcolorbox{grey}{white}{\!\phantom{c}\!} \quad S:\underbrace{ \fcolorbox{grey}{white}{\!\phantom{c}\!} \dots \fcolorbox{grey}{white}{\!\phantom{c}\!} \overbrace{ \fcolorbox{grey}{black}{\!\phantom{c}\!} \fcolorbox{grey}{black}{\!\phantom{c}\!} \fcolorbox{grey}{black}{\!\phantom{c}\!} }^{S[k:][:\sigma]} }_{S[k:]} \fcolorbox{b}{white}{\!c\!}

黑色部分为追加字符前的 $\sigma_P$ 相等部分（见定义）。

追加字符后， $\sigma_P(S[k\!:]c)$ 的取值不会超出黑色部分的左侧，否则 $\sigma_P(S[k\!:])$ 会有更优解（即黑色部分可以更长），与定义矛盾。

由此，我们定义状态转移函数：

\delta(q,c):=\sigma_P(P[q\!:]c)

其取值表可以通过预处理得出。

引理 $1$ ： $\sigma_P(Sc)\le \sigma_P(S)+1$

引理 $2$ ：

\forall P:\Sigma^*,k:\N,c:\Sigma,\\ \sigma_P(Sc)=\sigma_P(P[\sigma_P(S)\!:]c)

设 $q:=\sigma_P(S)\quad r:=\sigma_P(Sc)$ ，即目标变为：

r=\sigma_P(P[q:]c)

由定义， $P[q\!:]=S[:\! q]$ ，即 $P[q\!:]\sqsupset S$ 。

由后缀的传递性， $P[q\!:]c\sqsupset Sc\qquad(\rm{a})$

由定义， $P[r\!:]=Sc[:\! r]$ ，即 $P[r\!:]\sqsupset Sc$

由引理 $1$ ， $r\le q+1$ ，即 $|P[r\!:]|\le|P[q\!:]c|$ ，由后缀重合定理，前述两者均为 $Sc$ 的后缀，因此能推出 $P[r\!:]\sqsupset P[q\!:]c$ 。

由后缀函数的定义： $S_1\sqsupset S_2\Rarr \sigma_P(S_1)\le\sigma_P(S_2)$

因此， $\sigma_P(P[r\!:])\le\sigma_P(P[q\!:]c)$ ，即 $r\le\sigma_P(P[q\!:]c)$ 也因此，由 $(\rm{a})$ ， $\sigma_P(P[q\!:]c)\le\sigma_P(Sc)$ ，即 $\sigma_P(P[q\!:]c)\le r$

即， $r=\sigma_P(P[q\!:]c)$ ，引理得证。