字符串搜索算法

字符串搜索是指在一个字符串中查找另一个字符串（子串）是常用的基本编程操作。协议解析、文本编辑都会用到大量字符串搜索操作。从1977年至今，人们对此进行了大量研究，本文介绍本领域典型算法：暴力搜索、two-way、KMP、BM、BMH、BMS，以及x86 CPU SSE2 硬件加速实现。

1 问题定义

字符串搜索问题，又称为文本匹配问题。其定义如下：

定义长度为\(n\)的字符数组\(text\)；
定义长度为\(m\)的字符数组\(pattern\)，且\(m\le{}n\)；
定义有限字符集\(\Sigma\)，\(text\)和\(pattern\)由\(\Sigma\)元素组成；
在\(text\)中查找\(pattern\)出现的位置。

2 测试结果

测试Linux系统glibc strstr，native, KMP, BM, BMH算法。使用大量随机字符串测试，其性能排名是：strstr、BMH、BM、native、KMP。KMP的实现应该是有问题，后期分析更正。

3 glibc C库字符串算法

3.1 概要介绍

GLIBC库提供two-way搜索算法，以及针对各平台的硬件优化实现(X86平台是SSE2)。TWO WAY 算法由Maxime Crochemore和Dominique Perrin发表于1991年ACM。主要包括一个关键分解理论(critical factorization theorem pdf)。

代码实现参考Github string/strstr.c、x86_64/multiarch strstr.c和 strstr-sse2-unaligned.S。

3.2 参考资料

TWO WAY ALGORITHM: http://www-igm.univ-mlv.fr/~lecroq/string/node26.html
TWO WAY STRING MATCHING (Maxime Crochemore, Dominique Perrin): https://dl.acm.org/citation.cfm?id=116845

4 Boyer-Moore

[BM 77]

4.1 概要介绍

BM算法由Robert S. Boyer和J Strother Moore于1977年发表。BM算法使用坏字符和好后缀规则，在失配后一次跳动多个字符，进行匹配加速。

4.2 参考资料

WIKI: https://en.wikipedia.org/wiki/Boyer%E2%80%93Moore_string_search_algorithm
ACM: http://dl.acm.org/citation.cfm?doid=359842.359859

4.3 代码实现

2012/2/29实现此算法，发布于GitHub：

/*
 * [BM 77] A fast string search algorithm,
 *         R. S. Boyer and J. S. Moore,
 *         Comm. ACM, 20, 1977, pp. 762–772.
 * [Smit 1982] A Comparison of Three String Matching Algorithms,
 *             Smit G. De V.,
 *             Software-Practice and Experience, 12(1), 57-66, 1982
 */

#include <config-os.h>
#include <ycc/common/debug.h>
#include <ycc/algos/string.h>

void strbm_init(const char *needle, size_t n,
                size_t *table_sgs, size_t *table_ebc)
{
        /*
         * table_sgs: strong good suffix
         * ebc: extended bad character
         */

        size_t i, j, k;
        size_t *backup = table_sgs + n;

        /* init */
        memset(table_sgs, 0, n*sizeof(*table_sgs));

        /* suffix */
        i = n - 1;
        j = n;
        while ((size_t)-1 != i) {
                backup[i] = j;

                while (j < n && needle[i] != needle[j]) {
                        if (!table_sgs[j])
                                table_sgs[j] = j - i;

                        j = backup[j];
                }

                --i;
                --j;
        }

        /* prefix */
        k = j + 1;
        for (i = 0; i < k; ++i) {
                if (!table_sgs[i])
                        table_sgs[i] = k;
        }

        i = backup[j];
        while (j < n) {
                while (j < i) {
                        if (!table_sgs[j])
                                table_sgs[j] = i;
                        ++j;
                }
                i = backup[i];
        }

        /* ebc ... */
        strbmh_init(needle, n, table_ebc);
}

char *strbm_find(const char *haystack, const char *needle,
                 size_t h, size_t n,
                 const size_t *table_sgs, const size_t *table_ebc)
{
        size_t i, j, n1 = n - 1;

        if (!n)
                return (char*)haystack;

        while (h > n1) {
                for (i = n1; i != (size_t)-1 && haystack[i] == needle[i]; --i);

                if (i == (size_t)-1)    /* [0, n1] are equal. */
                        return (char*)haystack;

                if (table_sgs[i] > table_ebc[(u_char)haystack[n1]])
                        j = table_sgs[i];
                else
                        j = table_ebc[(u_char)haystack[n1]];
                DBG_PR("jump offset = %zu, h,n = %zu, %zu", j, h, n);
                h -= j;
                haystack += j;
        }

        return NULL;
}

5 Boyer-Moore-Horspool

[Hor 80]

5.1 概要介绍

BMH算法由Nigel Horspool发表于1980年。在随机测试结果中，比BM有更好的表现，性能提升10%左右。BMH是BM算法的简化。

5.2 参考资料

WIKI: https://en.wikipedia.org/wiki/Boyer%E2%80%93Moore%E2%80%93Horspool_algorithm
Horsen: http://www.iti.fh-flensburg.de/lang/algorithmen/pattern/horsen.htm

5.3 代码实现

2012/2/29发布于GitHub：

/*
 * [Hor 80] Practical Fast Searching in Strings,
 *          R. Nigel Horspool,
 *          Softw. Pratt. Exp., 10, 501–506 (1980).
 */

#include <assert.h>
#include <limits.h>
#include <stddef.h>
#include <sys/types.h>

#include <config-os.h>

#include <ycc/algos/string.h>

void strbmh_init(const char *needle, size_t n, size_t *table)
{
        size_t i;

        for (i = 0; i <= UCHAR_MAX; ++i)
                table[i] = n;

        --needle;
        for (i = 1; i < n; ++i)
                table[(u_char)needle[i]] = n - i;
}

char *strbmh_find(const char *haystack, const char *needle,
                  size_t h, size_t n,
                  const size_t *table)
{
        size_t i, n1 = n - 1;

        if (!n)
                return (char*)haystack;

        while (h > n1) {
                for (i = n1; i != (size_t)-1 && haystack[i] == needle[i]; --i);

                if (i == (size_t)-1)    /* [0, n1] are equal. */
                        return (char*)haystack;

                h -= table[(u_char)haystack[n1]];
                haystack += table[(u_char)haystack[n1]];
        }

        return NULL;
}

6 Boyer-Moore-Sunday

[Sun 90]

7 Knuth-Morris-Pratt

[KMP 77]

7.1 概要介绍

7.2 参考资料

WIKI: https://en.wikipedia.org/wiki/Knuth%E2%80%93Morris%E2%80%93Pratt_algorithm

7.3 代码实现

此算法测试结果十分糟糕，怀疑实现有问题。以后更正分析。 2012/2/29发布于GitHub：

/*
 * [KMP 77] Fast pattern matching in strings,
 *          D. E. Knuth, J. H. Morris, Jr and V. B. Pratt, SIAM J.
 *          Computing, 6, 1977, pp. 323–350.
 */

#include <assert.h>
#include <stddef.h>

#include <config-os.h>

#include <ycc/algos/string.h>

void strkmp_init(const char *patn, size_t *table)
{
        size_t i, *t = table;
        const char *s = patn + 1;

        assert(patn && table);

        *t++ = 0;
        for(i = 0; *s; ++s) {
                while (i && patn[i] != *s)
                        i = table[i-1];

                if (patn[i] == *s)
                        ++i;

                *t++ = i;
        }
}

char *strkmp_find(const char *text, const char *patn, const size_t *table)
{
        size_t i;
        const char *s;

        assert(text && patn && table);

        if (!*patn)
                return (char*)text;

        for (i = 0, s = text; *s; ++s) {
                while (i && *s != patn[i])
                        i = table[i-1];

                if (*s == patn[i]) {
                        if (!patn[i+1])
                                return (char*)s - i;
                        ++i;
                }
        }

        return NULL;
}

8 References

why grep so fast: https://lists.freebsd.org/pipermail/freebsd-current/2010-August/019310.html
grep code: https://savannah.gnu.org/git/?group=grep
grep git url: git clone https://git.savannah.gnu.org/git/grep.git
(no term): http://www.ruanyifeng.com/blog/2013/05/boyer-moore_string_search_algorithm.html
WIKI: https://en.wikipedia.org/wiki/String_searching_algorithm
Stackoverflow Discuss: https://stackoverflow.com/questions/7586990/strstr-faster-than-algorithms
EXACT STRING MATCHING ALGORITHMS: http://www-igm.univ-mlv.fr/~lecroq/string/index.html