Skip to content

Commit

Permalink
WiscKey: Separating Keys from Values in SSD-Conscious Storage
Browse files Browse the repository at this point in the history
Signed-off-by: Zhao Junwang <[email protected]>
  • Loading branch information
zhjwpku committed Mar 13, 2024
1 parent a01ccaa commit 865a84c
Show file tree
Hide file tree
Showing 6 changed files with 63 additions and 0 deletions.
1 change: 1 addition & 0 deletions src/SUMMARY.md
Original file line number Diff line number Diff line change
Expand Up @@ -13,6 +13,7 @@
- [columnstores vs rowstores](./databases/column-stores-vs-row-stores.md)
- [kv](./databases/kv/README.md)
- [rocksdb cidr17](./databases/kv/optimizing-space-amplification-in-rocksdb.md)
- [wisckey](./databases/kv/wisckey.md)
- [mmdb](./databases/mmdb/README.md)
- [mmdb overview](./databases/mmdb/overview.md)
- [oltp](./databases/oltp/README.md)
Expand Down
Binary file added src/assets/images/wisckey_lsmtree_leveldb.png
Loading
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.
Binary file added src/assets/pdfs/WiscKey.pdf
Binary file not shown.
2 changes: 2 additions & 0 deletions src/databases/README.md
Original file line number Diff line number Diff line change
Expand Up @@ -5,6 +5,7 @@
- **[Main Memory Database Systems: An Overview][mmdb-overview]**
- **[KV databases](kv/index.html)**
- **[Optimizing Space Amplification in RocksDB][rocksdb-cidr17]**
- **[WiscKey: Separating Keys from Values in SSD-Conscious Storage][wisckey]**
- **[OLTP](oltp/index.html)**
- **[OLTP Through the Looking Glass, and What We Found There][lookingglass]**
- **[Staring into the Abyss: An Evaluation of Concurrency Control with One Thousand Cores][staring-abyss]**
Expand Down Expand Up @@ -51,3 +52,4 @@
[hnsw]: vectordb/hnsw.md
[pq]: vectordb/pq.md
[ivf-hnsw]: vectordb/ivf-hnsw.md
[wisckey]: kv/wisckey.md
2 changes: 2 additions & 0 deletions src/databases/kv/README.md
Original file line number Diff line number Diff line change
Expand Up @@ -2,6 +2,8 @@

- **[Optimizing Space Amplification in RocksDB][rocksdb-cidr17], 2017**
- **[Evolution of Development Priorities in Key-value Stores Serving Large-scale Applications: The RocksDB Experience][fast21-dong], 2021 [a slightly more detailed version](/assets/pdfs/rocksdb-evolution-2021.pdf)**
- **[WiscKey: Separating Keys from Values in SSD-Conscious Storage][wisckey], 2016**

[rocksdb-cidr17]: optimizing-space-amplification-in-rocksdb.md
[fast21-dong]: /assets/pdfs/fast21-dong.pdf
[wisckey]: wisckey.md
58 changes: 58 additions & 0 deletions src/databases/kv/wisckey.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,58 @@
### [WiscKey: Separating Keys from Values in SSD-Conscious Storage](/assets/pdfs/WiscKey.pdf)

> FAST ’16
>
> https://dl.acm.org/doi/10.1145/3033273
I/O 放大是 LSM-tree 固有的特性之一。新写入的数据被追加到 WAL 日志并写入内存 memtable,这种操作避免了对已有数据的读取和修改,因而非常高效。当内存中的结构达到一定大小或达到一定时间间隔时,LSM-tree 会将这些数据批量写入到磁盘上的持久化存储结构中(SSTable),低层文件通过 compaction 生成更高层的文件。

> To maintain the size limit, once the total size of a level **Li** exceeds its limit, the compaction thread will choose **one** file from **Li**, merge
> sort with **all the overlapped files of Li+1**, and generate new **Li+1** SSTable files.
![Figure 1: LSM-tree and LevelDB Architecture](/assets/images/wisckey_lsmtree_leveldb.png)

**写放大(Write Amplification)**

由于 Li 的大小限制是 Li-1 的 10 倍,因此 Li-1 层的文件合并到 Li 时,最坏情况下可能会读取 Li 中的 10 个文件,并在排序后将这些文件写回 Li,因此,将一个文件从一个层级移动到另一个层级的写放大系数最高可达到 10 倍。

> For a large dataset, since any newly generated table file can eventually migrate from L0 to L6 through a series of compaction steps, write amplification
> can be over 50 (10 for each gap between L1 to L6).
**读放大(Read Amplification)**

读放大的源头有两点:

- 首先,为了查找一个键值对 LevelDB 可能需要检查多个层级(L0 8个 + L1~L6 各一个 = 14)
- 其次,为了在一个 SSTable 文件中找到一个键值对,LevelDB 需要读取文件中的多个元数据块(index block + bloom-filter blocks)

假设查询一个不存在的 key,且 bloom-filter 过滤的结果是 `false positive`,对于查询 1KB 键值对的操作,需要读取一个 SSTable 的数据量为: 16-KB index block + 4-KB bloom-filter block + 4-KB data block = 24K。

> Considering the 14 SSTable files in the worst case, the read amplification of LevelDB is 24 × 14 = 336. Smaller key-value pairs will lead to an even higher read amplification.
对于 HDD 盘,假设随机写延迟/顺序写延迟=1000,那么 LSM-tree 在写放大小于 1000 的情况下仍会比 B-Tree 快。这也是 LSM-tree 在传统 HDD 盘取得成功的重要原因。

SSD(固态硬盘)则有所不同,在设计键值存储系统时需要考虑三个重要的因素:

- First, the difference between random and sequential performance is not nearly as large as with HDDs
- Second, SSDs have a large degree of internal parallelism
- Third, SSDs can wear out through repeated writes; the high write amplification in LSM-trees can significantly reduce device lifetime

### WiscKey

> The central idea behind WiscKey is the separation of keys and values; only keys are kept sorted in the LSM-tree, while values are stored separately in a log.
WiscKey 包含四个关键的设计要点:

- First, WiscKey separates keys from values, keeping only keys in the LSM-tree and the values in a separate log file
- Second, to deal with unsorted values (which necessitate random access during range queries), WiscKey uses the parallel random-read characteristic of SSD devices
- Third, WiscKey utilizes unique crash-consistency and garbage-collection techniques to efficiently manage the value log
- Finally, WiscKey optimizes performance by removing the LSM-tree log without sacrificing consistency, thus reducing system-call overhead from small writes

一个很直观的感觉,LSM-tree 只保存 key 和 value 的位置(<vLog-offset, value-size>),在 value 很大的情况下,能避免对 value 进行 compaction,进而减少了写放大,节省了带宽,增加了 SSD 的使用周期。同时,LSM-tree 占用的存储也更少了,能更充分利用缓存和较少的层级减少读放大。

> Readers might assume that WiscKey will be slower than LevelDB for lookups, due to its extra I/O to retrieve the value.
> However, since the LSM-tree of WiscKey is **much smaller** than LevelDB (for the same database size), a lookup may search **fewer levels** of table
> files in the LSM-tree and a significant portion of the LSM-tree can be easily **cached in memory**.
当然 k/v 分离也带来了一些挑战,最直观的挑战是当写入随机的键值对,然后进行 Range Query,导致查询到的 value 分布在 log file 的不同位置,文章提出用并发读来提高带宽。作者对其它挑战( Garbage Collection & )也都提出了相应的解决方案,感兴趣的读者请阅读论文。

0 comments on commit 865a84c

Please sign in to comment.