Skip to content

Project1

Bae jiun, Maybe edited this page Oct 7, 2017 · 4 revisions

This page for documentation Project1: Signal

Review

There's three versions of my implementation.

Of course, the multi thread implementation has the best performance. but it return query mismatch randomly on server (tiny and small set work well). And mismatched query is change every time with same code (if remove race-condition;just single thread). In the end, I switched to a lower performance implementation(master).

Features

How it works?

First of all, main idea is aho-corasick algorithm. But my implementation is little bit different. Check from below.

Aho-corasick Algorithm

This for find matching patterns in various patterns from query. There is KMP Algorithm for find matching single pattern from query. But it takes O(m + zn) in set of patterns (m as total length of patterns, z as count of patterns and n as size of query). Aho-corasick can solve this problem in O(m + n + k) (k as number of patterns in query).

There is three major parts in Aho-corasick Algorithm.

  • Construct Trie for patterns.
    • Trace trie nodes and create node if there are no matches.
  • Make Failure Link and Output Link
    • Failure Link for prevent looking back from the beginning.
    • Output Link ensure that works correctly even when sub string exists.
  • Find matching
    • Follow the trie nodes, find the pattern if there is an output link.

My Implementation

Redesign for performance.

  • Change Trie structure to 2d array(maximum state size × char size(a-z))
  • Define Final, Normal and Init states as automata.
  • Design efficient add / deletion algorithm.

Process

  1. Make 2d array for Aho-corasick accept automata. (Table::update_table)
    1. Define Init State as size of patterns.
    2. From Init State, trace char of pattern and make node.
    3. It is Final State if State < Init State else Normal State.
  2. Lazy update when add / deletion.
    1. Insert pattern to buffer when add/deletion occurs.
    2. Update table collectively, when match requested.
    3. Resize 2d array if pattern size overflow and clean up.
    4. Rebuild table.
  3. Find match
    1. Trace from Init State to until meet -1(this is out state)
    2. Found matched pattern if State < Init State
    3. Append to result list if no exists in results.

Example

Example Input as

3 // init pattern count
abc
ab
bbc
Q abc
Q bbca
A bc
Q abbc

After init patterns, we can make 2d array as below

a b c
0 - - -
1 - - 0
2 - - -
3 4 6 -
4 - 1 -
5 - - -
6 - 7 -
7 - - 2

follow update_table()

And abc requested then start at 3(Init State), and follow node with 'a'. [3][a] is 4, so next state is 4. Find next char with next state, [4][b] is 1. 1 is Final State(< Init State). So, there is matched pattern marked as 1(ab). And next [1][c] is 0 and it is Final State too. Finally, find two accept pattern(0, 1). And It is ab, abc.

Waiting contributes!

​you can comments and make merge requests to fix my codes. 😄

Clone this wiki locally