后缀数组（Suffix Array）

本文介绍后缀数组的定义与构建的过程。首先，文章介绍什么是后缀数组，随后讲解了最自然的朴素算法。为了引出更高效的算法，文章提及了倍增思想与基数排序的背景基础知识。接着，通过模拟演练的方式一步一步地演示如何创建后缀数组。将构建的抽象过程形象地展示出来，使读者更易理解。

定义

给定字符串 $S="banana"$ ，其所有后缀有 $s1，s2...s6$ 。（6为字符串 $S$ 的长度）。如下所示：

S  = "banana"
s1 = "banana"
s2 = "anana"
s3 = "nana"
s4 = "ana"
s5 = "na"
s6 = "a"

后缀数组即为由 $s1到s6$ 构成的有序的（字典序升序排列的）字符串数组 $vector<string> sa$ 。

vector<string> sa{"a", "ana", "anana", "banana", "na", "nana"};

构建后缀数组的动态过程演示动画： https://visualgo.net/zh/suffixarray

思路

如何构建后缀数组？

若字符串 $S$ 的长度为 $n$ ，那么有 $n$ 个后缀。先考虑朴素算法，将 $n$ 个字符串进行排序，需要 $O(n\log n)$ ，此外每一次比较两个字符串大小时需要枚举字符串的长度 $n$ 。故：

时间复杂度： $O(n^{2}\log n)$

倍增思想

朴素算法是将所有后缀字符串进行排序，其中每一个都是完整的后缀。那么，这里能否从后缀字符串的长度入手，以后缀字符串的长度为规模将问题进行分解——先排序后缀字符串长度较小的情况，再排序长度较大的情况，依次递推。

那么，在递增后缀字符串的长度时，如果以线性方式进行递推，那么在时间上仍然不会有很好的改善。此时，可以通过成倍增长（倍增）的方式进行递推，只递推状态空间中在2的整数次幂位置上的值。在计算某一个值时，可以利用之前计算过的状态空间的值拼成所需的值即可。

在计算每一个后缀字符串长度时，两两比较字符串只需要 $O(1)$ 的时间，将当前长度的后缀进行排序需要的时间为 $O(n\log n)$ 。长度递增 $\log n$ 次，所以一共需要的时间为 $O(n(\ log n)^{2})$ 。

基数排序

我们知道基数排序是一种稳定的（大小相同的元素在排序后，其相对位置不变）线性时间复杂度—— $O(n)$ 的排序算法，可以将其融入到上述中的字符串排序中。可以将时间复杂度整体降到 $O(n\log n)$ 。

模拟演算

排序长度为 $2（2^{1}）$ 的后缀数组

使用ASCII计算每一个后缀数组的第一个字符的权重（Rank）。

Index     Suffix            Rank
 0        banana             2   
 1        anana              1 
 2        nana               14 
 3        ana                1
 4        na                 14
 5        a                  1

为了能够计算出长度为2的后缀数组的顺序，还需要保留右边相邻字符的权重（Next Rank）。如果，当前字符已为最后一个，那么相邻字符的权重可认为是最小值0。

Index    Suffix            Rank          Next Rank 
 0       banana             2              1
 1       anana              1              14    
 2       nana               14             1
 3       ana                1              14
 4       na                 14             1 
 5       a                  1              0

根据权重进行基数排序，时间为 $O(n)$

Index    Suffix            Rank          Next Rank 
 5        a                  1               0
 1        anana              1               14    
 3        ana                1               14
 0        banana             2               1
 2        nana               14              1
 4        na                 14              1

排序长度为 $4（2^{2}）$ 的后缀数组

根据之前的排序结果，重新计算后缀长度为2的权重（Rank），从1开始计数。

Index       Suffix          Rank       
  5          a               1     [Assign 1 to first]        
  1          anana           2     (1, 14) is different from previous
  3          ana             2     (1, 14) is same as previous     
  0          banana          3     (2, 1) is different from previous      
  2          nana            4     (14, 1) is different from previous      
  4          na              5     (14, 1) is same as previous

同时，计算下一个相邻的长度为2的权重（Next Rank）。如果相邻的长度不足2，那么可以认为是最小值0。

Index       Suffix          Rank        Next Rank
  5          a               1              0
  1          anana           2              2      
  3          ana             2              1 
  0          banana          3              4
  2          nana            4              4 
  4          na              4              0

基数排序

Index       Suffix          Rank        Next Rank
  5          a               1              0
  3          ana             2              1 
  1          anana           2              2      
  0          banana          3              4
  4          na              4              0
  2          nana            4              4

代码

// C++ program for building suffix array of a given text
#include <iostream>
#include <cstring>
#include <algorithm>

using namespace std;

const int N = 1e5 + 5;
int c[N] = {0};
 
// Structure to store information of a suffix
struct suffix {
    int index; // To store original index
    int rank[2]; // To store ranks and next rank pair
};
 
void radixSort(suffix suffixes[], int ri, int m, int n) {
    fill(c, c + m + 1, 0);
    struct suffix output[n];
    for (int i = 0; i < n; ++i) c[suffixes[i].rank[ri]]++;
    for (int i = 1; i < m + 1; ++i) c[i] += c[i - 1];
    for (int i = n - 1; i >= 0; i--) {
        int index = --c[suffixes[i].rank[ri]];
        output[index] = suffixes[i];
    }
    memcpy(suffixes, output, sizeof(output));
}
 
// This is the main function that takes a string 'txt' of size n as an
// argument, builds and return the suffix array for the given string
int *buildSuffixArray(char *txt, int n) {
    // A structure to store suffixes and their indexes
    struct suffix suffixes[n];
 
    // Store suffixes and their indexes in an array of structures.
    // The structure is needed to sort the suffixes alphabatically
    // and maintain their old indexes while sorting
    for (int i = 0; i < n; i++) {
        suffixes[i].index = i;
        suffixes[i].rank[0] = txt[i] - 'a' + 1;
        suffixes[i].rank[1] = ((i+1) < n)? (txt[i + 1] - 'a' + 1): 0;
    }
 
    // Sort the suffixes using redixSort
    radixSort(suffixes, 1, 26, n);
    radixSort(suffixes, 0, 26, n);

    // At this point, all suffixes are sorted according to first
    // 2 characters.  Let us sort suffixes according to first 4
    // characters, then first 8 and so on
    int ind[n];  // This array is needed to get the index in suffixes[]
                 // from original index.  This mapping is needed to get
                 // next suffix.
    for (int k = 4; k < 2*n; k = k*2) {
        // Assigning rank and index values to first suffix
        int rank = 0;
        int prev_rank = suffixes[0].rank[0];
        suffixes[0].rank[0] = rank;
        ind[suffixes[0].index] = 0;
 
        // Assigning rank to suffixes
        for (int i = 1; i < n; i++) {
            // If first rank and next ranks are same as that of previous
            // suffix in array, assign the same new rank to this suffix
            if (suffixes[i].rank[0] == prev_rank &&
                    suffixes[i].rank[1] == suffixes[i-1].rank[1]) {
                prev_rank = suffixes[i].rank[0];
                suffixes[i].rank[0] = rank;
            } else {
                prev_rank = suffixes[i].rank[0];
                suffixes[i].rank[0] = ++rank;
            }
            ind[suffixes[i].index] = i;
        }
 
        // Assign next rank to every suffix
        for (int i = 0; i < n; i++) {
            int nextindex = suffixes[i].index + k/2;
            suffixes[i].rank[1] = (nextindex < n)?
                                  suffixes[ind[nextindex]].rank[0]: 0;
        }
 
        // Sort the suffixes according to first k characters
        radixSort(suffixes, 1, rank + 5, n);
        radixSort(suffixes, 0, rank + 5, n);
    }
 
    // Store indexes of all sorted suffixes in the suffix array
    int *suffixArr = new int[n];
    for (int i = 0; i < n; i++)
        suffixArr[i] = suffixes[i].index;
 
    // Return the suffix array
    return  suffixArr;
}
 
// A utility function to print an array of given size
void printArr(int arr[], int n) {
    for (int i = 0; i < n; i++)
        cout << arr[i] << " ";
    cout << endl;
}
 
// Driver program to test above functions
int main() {
    char txt[] = "banana";
    int n = strlen(txt);
    int *suffixArr = buildSuffixArray(txt,  n);
    cout << "Following is suffix array for " << txt << endl;
    printArr(suffixArr, n);
    return 0;
}

复杂度

时间复杂度： $O(n\log n)$
空间复杂度： $O(n)$