本文介绍后缀数组的定义与构建的过程。首先,文章介绍什么是后缀数组,随后讲解了最自然的朴素算法。为了引出更高效的算法,文章提及了倍增思想与基数排序的背景基础知识。接着,通过模拟演练的方式一步一步地演示如何创建后缀数组。将构建的抽象过程形象地展示出来,使读者更易理解。
定义
给定字符串S="banana",其所有后缀有s1,s2...s6。(6为字符串S的长度)。如下所示:
S = "banana" s1 = "banana" s2 = "anana" s3 = "nana" s4 = "ana" s5 = "na" s6 = "a"
后缀数组即为由s1到s6构成的有序的(字典序升序排列的)字符串数组vector<string> sa。
vector<string> sa{"a", "ana", "anana", "banana", "na", "nana"};
构建后缀数组的动态过程演示动画: https://visualgo.net/zh/suffixarray
思路
如何构建后缀数组?
若字符串S的长度为n,那么有n个后缀。先考虑朴素算法,将n个字符串进行排序,需要O(n\log n),此外每一次比较两个字符串大小时需要枚举字符串的长度n。故:
时间复杂度:O(n^{2}\log n)
倍增思想
朴素算法是将所有后缀字符串进行排序,其中每一个都是完整的后缀。那么,这里能否从后缀字符串的长度入手,以后缀字符串的长度为规模将问题进行分解——先排序后缀字符串长度较小的情况,再排序长度较大的情况,依次递推。
那么,在递增后缀字符串的长度时,如果以线性方式进行递推,那么在时间上仍然不会有很好的改善。此时,可以通过成倍增长(倍增)的方式进行递推,只递推状态空间中在2的整数次幂位置上的值。在计算某一个值时,可以利用之前计算过的状态空间的值拼成所需的值即可。
在计算每一个后缀字符串长度时,两两比较字符串只需要O(1)的时间,将当前长度的后缀进行排序需要的时间为O(n\log n)。长度递增\log n次,所以一共需要的时间为O(n(\ log n)^{2})。
基数排序
我们知道基数排序是一种稳定的(大小相同的元素在排序后,其相对位置不变)线性时间复杂度——O(n)的排序算法,可以将其融入到上述中的字符串排序中。可以将时间复杂度整体降到O(n\log n)。
模拟演算
排序长度为2(2^{1})的后缀数组
使用ASCII计算每一个后缀数组的第一个字符的权重(Rank)。
Index Suffix Rank 0 banana 2 1 anana 1 2 nana 14 3 ana 1 4 na 14 5 a 1
为了能够计算出长度为2的后缀数组的顺序,还需要保留右边相邻字符的权重(Next Rank)。如果,当前字符已为最后一个,那么相邻字符的权重可认为是最小值0。
Index Suffix Rank Next Rank 0 banana 2 1 1 anana 1 14 2 nana 14 1 3 ana 1 14 4 na 14 1 5 a 1 0
根据权重进行基数排序,时间为O(n)
Index Suffix Rank Next Rank 5 a 1 0 1 anana 1 14 3 ana 1 14 0 banana 2 1 2 nana 14 1 4 na 14 1
排序长度为4(2^{2})的后缀数组
根据之前的排序结果,重新计算后缀长度为2的权重(Rank),从1开始计数。
Index Suffix Rank 5 a 1 [Assign 1 to first] 1 anana 2 (1, 14) is different from previous 3 ana 2 (1, 14) is same as previous 0 banana 3 (2, 1) is different from previous 2 nana 4 (14, 1) is different from previous 4 na 5 (14, 1) is same as previous
同时,计算下一个相邻的长度为2的权重(Next Rank)。如果相邻的长度不足2,那么可以认为是最小值0。
Index Suffix Rank Next Rank 5 a 1 0 1 anana 2 2 3 ana 2 1 0 banana 3 4 2 nana 4 4 4 na 4 0
基数排序
Index Suffix Rank Next Rank 5 a 1 0 3 ana 2 1 1 anana 2 2 0 banana 3 4 4 na 4 0 2 nana 4 4
代码
// C++ program for building suffix array of a given text
#include <iostream>
#include <cstring>
#include <algorithm>
using namespace std;
const int N = 1e5 + 5;
int c[N] = {0};
// Structure to store information of a suffix
struct suffix {
int index; // To store original index
int rank[2]; // To store ranks and next rank pair
};
void radixSort(suffix suffixes[], int ri, int m, int n) {
fill(c, c + m + 1, 0);
struct suffix output[n];
for (int i = 0; i < n; ++i) c[suffixes[i].rank[ri]]++;
for (int i = 1; i < m + 1; ++i) c[i] += c[i - 1];
for (int i = n - 1; i >= 0; i--) {
int index = --c[suffixes[i].rank[ri]];
output[index] = suffixes[i];
}
memcpy(suffixes, output, sizeof(output));
}
// This is the main function that takes a string 'txt' of size n as an
// argument, builds and return the suffix array for the given string
int *buildSuffixArray(char *txt, int n) {
// A structure to store suffixes and their indexes
struct suffix suffixes[n];
// Store suffixes and their indexes in an array of structures.
// The structure is needed to sort the suffixes alphabatically
// and maintain their old indexes while sorting
for (int i = 0; i < n; i++) {
suffixes[i].index = i;
suffixes[i].rank[0] = txt[i] - 'a' + 1;
suffixes[i].rank[1] = ((i+1) < n)? (txt[i + 1] - 'a' + 1): 0;
}
// Sort the suffixes using redixSort
radixSort(suffixes, 1, 26, n);
radixSort(suffixes, 0, 26, n);
// At this point, all suffixes are sorted according to first
// 2 characters. Let us sort suffixes according to first 4
// characters, then first 8 and so on
int ind[n]; // This array is needed to get the index in suffixes[]
// from original index. This mapping is needed to get
// next suffix.
for (int k = 4; k < 2*n; k = k*2) {
// Assigning rank and index values to first suffix
int rank = 0;
int prev_rank = suffixes[0].rank[0];
suffixes[0].rank[0] = rank;
ind[suffixes[0].index] = 0;
// Assigning rank to suffixes
for (int i = 1; i < n; i++) {
// If first rank and next ranks are same as that of previous
// suffix in array, assign the same new rank to this suffix
if (suffixes[i].rank[0] == prev_rank &&
suffixes[i].rank[1] == suffixes[i-1].rank[1]) {
prev_rank = suffixes[i].rank[0];
suffixes[i].rank[0] = rank;
} else {
prev_rank = suffixes[i].rank[0];
suffixes[i].rank[0] = ++rank;
}
ind[suffixes[i].index] = i;
}
// Assign next rank to every suffix
for (int i = 0; i < n; i++) {
int nextindex = suffixes[i].index + k/2;
suffixes[i].rank[1] = (nextindex < n)?
suffixes[ind[nextindex]].rank[0]: 0;
}
// Sort the suffixes according to first k characters
radixSort(suffixes, 1, rank + 5, n);
radixSort(suffixes, 0, rank + 5, n);
}
// Store indexes of all sorted suffixes in the suffix array
int *suffixArr = new int[n];
for (int i = 0; i < n; i++)
suffixArr[i] = suffixes[i].index;
// Return the suffix array
return suffixArr;
}
// A utility function to print an array of given size
void printArr(int arr[], int n) {
for (int i = 0; i < n; i++)
cout << arr[i] << " ";
cout << endl;
}
// Driver program to test above functions
int main() {
char txt[] = "banana";
int n = strlen(txt);
int *suffixArr = buildSuffixArray(txt, n);
cout << "Following is suffix array for " << txt << endl;
printArr(suffixArr, n);
return 0;
}
复杂度
- 时间复杂度:O(n\log n)
- 空间复杂度:O(n)
参考
- https://www.geeksforgeeks.org/pattern-searching-using-suffix-tree/
- https://www.geeksforgeeks.org/suffix-array-set-1-introduction/
- https://www.geeksforgeeks.org/suffix-array-set-2-a-nlognlogn-algorithm/
- https://en.wikipedia.org/wiki/Suffix_array
- 《算法竞赛进阶指南》
原文链接:后缀数组(Suffix Array)