In Search Ads, there are many systems you may know, such as Google Ads, CocCoc Ads, and so on. Each system has different components, but in my view a Search Ads system should have 4 main parts:
- Indexing: Load everything (keywords, adverts, campaigns, ...) into the system before serving any request to show ads for a user's query.
- Matching: Focus on how a keyword can match a search query.
- Ranking: Sort the list of matching keywords using strategies that improve the quality of the ads.
- Pricing: Use an auction price model to charge for the ads that are displayed.
For a large system like CocCoc, which has 30 million users, or Google, which has billions of users, the number of keywords is huge; we cannot check every one of them against the search query. That is why we add a filtering component before matching, to list only the keywords that can possibly match the query.
The filtering component simply removes, within a reasonable time, the keywords that we are sure cannot match the search query. The fewer keywords that have to be checked against the search query, the lower the latency.
This repo is an implementation of a simple filtering and indexing component of a Search Ads system, using multithreading in C++.
- Runs on Linux only
- Indexing loads 10M keywords and reloads them every 30 seconds (a sketch of one way to do such a reload appears after this list). The original data, which I got from a Vietnamese search engine, contains around 84k keywords, so I expanded it with random keywords derived from the original data until it reached 10M
- More details about the server here
- The number of queries used to benchmark is ~34k
- The number of threads used in filtering is 4, and the number of tasks is 100
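To illustrate the 30-second reload without blocking requests, here is a minimal C++ sketch. The `Index` struct and `buildIndex` function are hypothetical names for illustration (the repo's actual structures may differ), and the atomic swap relies on C++20's `std::atomic<std::shared_ptr>`:

```cpp
#include <atomic>
#include <chrono>
#include <map>
#include <memory>
#include <string>
#include <thread>
#include <vector>

// Hypothetical index layout; the repo's real structures may differ.
struct Index {
    std::map<std::string, std::vector<std::string>> tokenToKeywords;
    std::map<std::string, int> keywordTokenCount;
};

// Stub: the real loader would read keywords.txt and build both maps.
std::shared_ptr<Index> buildIndex() { return std::make_shared<Index>(); }

int main() {
    // Serving threads read through this pointer; the reloader swaps it
    // atomically (requires C++20 for std::atomic<std::shared_ptr<T>>).
    std::atomic<std::shared_ptr<Index>> current{buildIndex()};

    // Background reloader: rebuild every 30 seconds and swap, so requests
    // in flight keep using the old index until they finish with it.
    std::thread reloader([&current] {
        for (;;) {
            std::this_thread::sleep_for(std::chrono::seconds(30));
            current.store(buildIndex());
        }
    });
    reloader.detach();

    // ... start the HTTP server; each request does: auto idx = current.load();
}
```

Because each request holds its own `shared_ptr`, a reload never invalidates an index a request is still reading; the old index is freed once the last reader releases it.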
To run this server, follow these steps:
- Clone this repository:
git clone [email protected]:truongcongthanh2000/Search-Ads-Filtering.git
- Navigate to the repository directory:
cd Search-Ads-Filtering
- Build the source code:
mkdir build
cd build
cmake .. && make
- Copy the file keywords.txt from the data directory to the build directory:
cp ../data/keywords.txt .
- Start the server (ex: ./apiGateway 4 100):
./apiGateway number_threads number_tasks
- The server will start on port 8080 by default.
- Access the demo API via localhost:8080/filter_single_thread/?query=.... or localhost:8080/filter_multi_thread/?query=....; just copy any query from data/queries.txt and paste it into ?query=....
First, and crucial to my filtering implementation, is the definition of when a keyword can match a search query. Based on the matching logic I have seen (ex: Google Ads), a keyword cannot match a search query if the keyword contains any token that does not belong to the search query, where a token is a collection of one or more words that forms a meaningful phrase. For example:
- keyword = ghế massage can match query = ghế massage giá tốt nhất because tokenize(ghế massage) = {ghế, massage} $\subset$ tokenize(ghế massage giá tốt nhất) = {ghế, massage, giá, tốt nhất}
- keyword = iphone 15 cannot match query = giá samsung s24 ultra because iphone or 15 does not belong to the query
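To make this definition concrete, here is a minimal C++ sketch of the subset check. It assumes the simplest possible tokenizer (split on whitespace); the real tokenizer may group words into multi-word phrase tokens such as "tốt nhất":

```cpp
#include <iostream>
#include <sstream>
#include <string>
#include <unordered_set>

// Simplified tokenizer: split on whitespace. The real one may also build
// multi-word phrase tokens, e.g. {"tốt nhất"} instead of {"tốt", "nhất"}.
std::unordered_set<std::string> tokenize(const std::string& text) {
    std::unordered_set<std::string> tokens;
    std::istringstream iss(text);
    std::string token;
    while (iss >> token) tokens.insert(token);
    return tokens;
}

// A keyword matches iff every keyword token appears among the query tokens.
bool canMatch(const std::string& keyword, const std::string& query) {
    auto queryTokens = tokenize(query);
    for (const auto& token : tokenize(keyword))
        if (!queryTokens.count(token)) return false;
    return true;
}

int main() {
    std::cout << canMatch("ghế massage", "ghế massage giá tốt nhất") << "\n";  // 1
    std::cout << canMatch("iphone 15", "giá samsung s24 ultra") << "\n";       // 0
}
```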
In the indexing step, for each keyword I create a map<token, list<keywords>> that lists all keywords containing that token. When serving the API with a search query, I list all tokens of the query and use the map built during indexing to count, for each keyword, how many of the query's tokens it appears under. Finally, I list every keyword for which all of its tokens have been counted, i.e. the count equals the keyword's number of tokens. Here is the pseudocode:
    n keywords:
        keyword[i] -> {token_1, token_2, ..., token_m}
        map_token_keywords[token_1] = [keyword[i], ...]
        map_token_keywords[token_2] = [keyword[i], ...]
        map_keyword_tokens_size[keyword[i]] = m

    query_user -> query_tokens = {token_1, token_2, ..., token_t}

    Filter(map_token_keywords, map_keyword_tokens_size, query_tokens) -> list[keyword]:
        count = map[keyword][int]  // stores up to n keywords
        for token in query_tokens:
            for keyword in map_token_keywords[token]:
                count[keyword] += 1
        ans = []
        for keyword, cnt in count:
            if cnt == map_keyword_tokens_size[keyword]:
                ans = append(ans, keyword)
        return ans
Time complexity: O(n * t) in the worst case, where n is the number of keywords and t is the number of query tokens
Space complexity: O(n)
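For reference, here is a runnable C++ version of the pseudocode above; the type aliases and function names are my own, not necessarily the repo's:

```cpp
#include <string>
#include <unordered_map>
#include <vector>

using TokenIndex = std::unordered_map<std::string, std::vector<std::string>>;
using TokenSizes = std::unordered_map<std::string, int>;

// Returns every keyword whose tokens all appear among the query tokens.
std::vector<std::string> filter(const TokenIndex& mapTokenKeywords,
                                const TokenSizes& mapKeywordTokensSize,
                                const std::vector<std::string>& queryTokens) {
    std::unordered_map<std::string, int> count;  // may hold up to n keywords
    for (const auto& token : queryTokens) {
        auto it = mapTokenKeywords.find(token);
        if (it == mapTokenKeywords.end()) continue;
        for (const auto& keyword : it->second) ++count[keyword];
    }
    std::vector<std::string> ans;
    for (const auto& [keyword, cnt] : count)
        if (cnt == mapKeywordTokensSize.at(keyword)) ans.push_back(keyword);
    return ans;
}
```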
Optimize 1: The code above has a drawback: the count map may store a huge number of keywords, up to all n keywords in the worst case. So I divide the n keywords into sqrt(n) blocks, each containing sqrt(n) keywords. While traversing each block, I use a sliding-window technique for each token of the query to keep track of the current block, so a counting array of size sqrt(n) is enough to count the occurrences of each keyword.
Time complexity: O(n * t) in the average case
Space complexity: O(sqrt(n))
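Below is a sketch of how I read optimize 1, under two assumptions of mine: keywords are numbered 0..n-1, and each token's posting list stores keyword ids in sorted order, so one forward-only pointer per query token can slide through the blocks:

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <string>
#include <unordered_map>
#include <vector>

// Sqrt decomposition over keyword ids 0..n-1 (my numbering assumption).
// postings[token] holds the sorted ids of the keywords containing that token.
std::vector<int> filterSqrt(
        const std::unordered_map<std::string, std::vector<int>>& postings,
        const std::vector<int>& keywordTokenCount,   // tokens per keyword id
        const std::vector<std::string>& queryTokens) {
    const int n = static_cast<int>(keywordTokenCount.size());
    const int blockSize = std::max(1, static_cast<int>(std::sqrt(n)));

    // One forward-only (sliding-window) pointer per query token's posting list.
    std::vector<const std::vector<int>*> lists;
    for (const auto& token : queryTokens) {
        auto it = postings.find(token);
        if (it != postings.end()) lists.push_back(&it->second);
    }
    std::vector<std::size_t> pos(lists.size(), 0);

    std::vector<int> ans;
    std::vector<int> count(blockSize);               // only O(sqrt(n)) memory
    for (int start = 0; start < n; start += blockSize) {
        const int end = std::min(start + blockSize, n);
        std::fill(count.begin(), count.end(), 0);
        // Advance each pointer past the ids that fall in the current block.
        for (std::size_t i = 0; i < lists.size(); ++i) {
            const auto& list = *lists[i];
            while (pos[i] < list.size() && list[pos[i]] < end)
                ++count[list[pos[i]++] - start];
        }
        for (int id = start; id < end; ++id)
            if (count[id - start] == keywordTokenCount[id]) ans.push_back(id);
    }
    return ans;
}
```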
Optimize 2: As you can see, the keywords are completely independent of each other, which is what made the square root (sqrt) decomposition in optimize 1 possible. For the same reason, we can also apply parallel programming. Here I use a thread pool with 4 threads and 100 tasks, with each task handling about n/100 keywords (roughly 100k keywords for n = 10M).
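Here is a sketch of one way such a pool can be structured (my assumption, not the repo's exact implementation): workers pull task indices from an atomic counter, each task filters one contiguous range of keyword ids (e.g. with the sqrt filter above, passed in as `filterRange`), and the results are merged under a mutex:

```cpp
#include <algorithm>
#include <atomic>
#include <functional>
#include <mutex>
#include <thread>
#include <vector>

// numThreads workers pull task indices from an atomic counter; each task
// filters one contiguous range of keyword ids, and results are merged.
std::vector<int> filterParallel(
        int n, int numThreads, int numTasks,
        const std::function<std::vector<int>(int, int)>& filterRange) {
    std::atomic<int> nextTask{0};
    std::mutex mergeMutex;
    std::vector<int> ans;
    const int chunk = (n + numTasks - 1) / numTasks;  // ~n/numTasks keywords per task

    auto worker = [&] {
        for (int task; (task = nextTask.fetch_add(1)) < numTasks; ) {
            const int start = task * chunk;
            const int end = std::min(start + chunk, n);
            if (start >= end) break;                  // past the last keyword
            std::vector<int> local = filterRange(start, end);
            std::lock_guard<std::mutex> lock(mergeMutex);
            ans.insert(ans.end(), local.begin(), local.end());
        }
    };

    std::vector<std::thread> threads;
    for (int i = 0; i < numThreads; ++i) threads.emplace_back(worker);
    for (auto& t : threads) t.join();
    return ans;
}
```

Running ./apiGateway 4 100 would correspond to numThreads = 4 and numTasks = 100 here.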
I wrote a simple benchmark in Python; the code is located in the data folder, along with the queries.txt file used to send HTTP requests.
- On a laptop with 16 cores and 16 GB RAM, here are the results:

| Type | Total requests | Total time (s) | Requests per second (RPS) |
|---|---|---|---|
| single_thread | 34303 | 31689.58 | 1.08 |
| multi_thread | 34303 | 8419.94 | 4.07 |
Essentially, multithreading is roughly 4 times faster than a single thread (31689.58 / 8419.94 ≈ 3.76, close to the ideal 4x speedup with 4 threads).
It would be hard to have this repo without contributions from public repositories. I especially thank: