diff --git a/src/SUMMARY.md b/src/SUMMARY.md
index 8cf4de4..be5396c 100644
--- a/src/SUMMARY.md
+++ b/src/SUMMARY.md
@@ -13,6 +13,7 @@
  - [columnstores vs rowstores](./databases/column-stores-vs-row-stores.md)
  - [kv](./databases/kv/README.md)
    - [rocksdb cidr17](./databases/kv/optimizing-space-amplification-in-rocksdb.md)
+    - [wisckey](./databases/kv/wisckey.md)
  - [mmdb](./databases/mmdb/README.md)
    - [mmdb overview](./databases/mmdb/overview.md)
  - [oltp](./databases/oltp/README.md)
diff --git a/src/assets/images/wisckey_lsmtree_leveldb.png b/src/assets/images/wisckey_lsmtree_leveldb.png
new file mode 100644
index 0000000..371a29d
Binary files /dev/null and b/src/assets/images/wisckey_lsmtree_leveldb.png differ
diff --git a/src/assets/pdfs/WiscKey.pdf b/src/assets/pdfs/WiscKey.pdf
new file mode 100644
index 0000000..7866e52
Binary files /dev/null and b/src/assets/pdfs/WiscKey.pdf differ
diff --git a/src/databases/README.md b/src/databases/README.md
index f945ce0..b29bff7 100644
--- a/src/databases/README.md
+++ b/src/databases/README.md
@@ -5,6 +5,7 @@
  - **[Main Memory Database Systems: An Overview][mmdb-overview]**
- **[KV databases](kv/index.html)**
  - **[Optimizing Space Amplification in RocksDB][rocksdb-cidr17]**
+  - **[WiscKey: Separating Keys from Values in SSD-Conscious Storage][wisckey]**
- **[OLTP](oltp/index.html)**
  - **[OLTP Through the Looking Glass, and What We Found There][lookingglass]**
  - **[Staring into the Abyss: An Evaluation of Concurrency Control with One Thousand Cores][staring-abyss]**
@@ -51,3 +52,4 @@
[hnsw]: vectordb/hnsw.md
[pq]: vectordb/pq.md
[ivf-hnsw]: vectordb/ivf-hnsw.md
+[wisckey]: kv/wisckey.md
diff --git a/src/databases/kv/README.md b/src/databases/kv/README.md
index 4cf1943..514b4d2 100644
--- a/src/databases/kv/README.md
+++ b/src/databases/kv/README.md
@@ -2,6 +2,8 @@
- **[Optimizing Space Amplification in RocksDB][rocksdb-cidr17], 2017**
- **[Evolution of Development Priorities in Key-value Stores Serving Large-scale Applications: The RocksDB Experience][fast21-dong], 2021 [a slightly more detailed version](/assets/pdfs/rocksdb-evolution-2021.pdf)**
+- **[WiscKey: Separating Keys from Values in SSD-Conscious Storage][wisckey], 2016**

[rocksdb-cidr17]: optimizing-space-amplification-in-rocksdb.md
[fast21-dong]: /assets/pdfs/fast21-dong.pdf
+[wisckey]: wisckey.md
diff --git a/src/databases/kv/wisckey.md b/src/databases/kv/wisckey.md
new file mode 100644
index 0000000..f070e12
--- /dev/null
+++ b/src/databases/kv/wisckey.md
@@ -0,0 +1,58 @@
### [WiscKey: Separating Keys from Values in SSD-Conscious Storage](/assets/pdfs/WiscKey.pdf)

> FAST ’16
>
> https://dl.acm.org/doi/10.1145/3033273

I/O amplification is inherent to the LSM-tree design. New writes are appended to the WAL and inserted into the in-memory memtable; because this never reads or modifies existing data, it is very efficient. When the in-memory structure reaches a size threshold (or after a certain time interval), the LSM-tree flushes it in bulk to a persistent on-disk structure, the SSTable; files at lower levels are then merged into higher levels via compaction.

> To maintain the size limit, once the total size of a level **Li** exceeds its limit, the compaction thread will choose **one** file from **Li**, merge sort with **all the overlapped files of Li+1**, and generate new **Li+1** SSTable files.
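To make the write path above concrete, here is a minimal Python sketch of the append-to-WAL / insert-into-memtable / flush-to-SSTable flow. It is an illustration only, not LevelDB's code; the class name `MemTableLSM`, the `flush_threshold` parameter, and the on-disk record format are assumptions made for the example.

```python
class MemTableLSM:
    """Toy LSM-tree front end: WAL + memtable, flushed to sorted SSTables."""

    def __init__(self, wal_path: str, flush_threshold: int = 4 * 1024 * 1024):
        self.memtable = {}                      # in-memory buffer, sorted only at flush time
        self.memtable_bytes = 0
        self.flush_threshold = flush_threshold  # flush threshold, e.g. 4 MB
        self.sstable_seq = 0
        self.wal = open(wal_path, "ab")         # write-ahead log, append-only

    def put(self, key: bytes, value: bytes) -> None:
        # 1. Append to the WAL: purely sequential I/O, no read-modify-write of existing data.
        self.wal.write(len(key).to_bytes(4, "little") + key +
                       len(value).to_bytes(4, "little") + value)
        self.wal.flush()
        # 2. Insert into the memtable.
        self.memtable[key] = value
        self.memtable_bytes += len(key) + len(value)
        # 3. Once the memtable is large enough, write it out in bulk as one sorted SSTable.
        if self.memtable_bytes >= self.flush_threshold:
            self._flush()

    def _flush(self) -> None:
        path = f"sstable-{self.sstable_seq:06d}.sst"
        self.sstable_seq += 1
        with open(path, "wb") as sst:
            for key in sorted(self.memtable):   # keys land on disk in sorted order
                value = self.memtable[key]
                sst.write(len(key).to_bytes(4, "little") + key +
                          len(value).to_bytes(4, "little") + value)
        self.memtable.clear()
        self.memtable_bytes = 0
        # Later compaction merges this file into the next level (the quoted step above),
        # rewriting the same data repeatedly -- that is where write amplification comes from.
```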
![Figure 1: LSM-tree and LevelDB Architecture](/assets/images/wisckey_lsmtree_leveldb.png)

**Write Amplification**

Because the size limit of Li is 10× that of Li-1, merging a file from Li-1 into Li may in the worst case read 10 files of Li and, after sorting, write them all back to Li. Moving a single file from one level to the next can therefore have a write amplification as high as 10.

> For a large dataset, since any newly generated table file can eventually migrate from L0 to L6 through a series of compaction steps, write amplification can be over 50 (10 for each gap between L1 to L6).

**Read Amplification**

Read amplification has two sources:

- First, to look up one key-value pair LevelDB may need to check several levels: up to 8 files in L0 plus one file in each of L1–L6, i.e. 14 files in the worst case.
- Second, to find a key-value pair inside one SSTable file, LevelDB has to read several metadata blocks (the index block and the bloom-filter blocks).

Suppose we look up a key that does not exist and the bloom filter returns a `false positive`: to check a 1-KB key-value pair, the data read from a single SSTable is a 16-KB index block + a 4-KB bloom-filter block + a 4-KB data block = 24 KB.

> Considering the 14 SSTable files in the worst case, the read amplification of LevelDB is 24 × 14 = 336. Smaller key-value pairs will lead to an even higher read amplification.

For HDDs, assuming random-write latency is about 1000× sequential-write latency, an LSM-tree is still faster than a B-tree as long as its write amplification stays below 1000. This is a major reason LSM-trees have been so successful on traditional hard drives.

SSDs are different: three factors matter when designing a key-value store for them:

- First, the difference between random and sequential performance is not nearly as large as with HDDs
- Second, SSDs have a large degree of internal parallelism
- Third, SSDs can wear out through repeated writes; the high write amplification in LSM-trees can significantly reduce device lifetime

### WiscKey

> The central idea behind WiscKey is the separation of keys and values; only keys are kept sorted in the LSM-tree, while values are stored separately in a log.

WiscKey rests on four key design points:

- First, WiscKey separates keys from values, keeping only keys in the LSM-tree and the values in a separate log file
- Second, to deal with unsorted values (which necessitate random access during range queries), WiscKey uses the parallel random-read characteristic of SSD devices
- Third, WiscKey utilizes unique crash-consistency and garbage-collection techniques to efficiently manage the value log
- Finally, WiscKey optimizes performance by removing the LSM-tree log without sacrificing consistency, thus reducing system-call overhead from small writes

The intuition is straightforward: the LSM-tree stores only the key and the value's location (its address in the value log). When values are large, this avoids compacting the values themselves, which reduces write amplification, saves bandwidth, and extends the SSD's lifetime. The LSM-tree also becomes much smaller, so it caches better and has fewer levels, both of which reduce read amplification (a toy sketch of this separation appears at the end of this note).

> Readers might assume that WiscKey will be slower than LevelDB for lookups, due to its extra I/O to retrieve the value.
>
> However, since the LSM-tree of WiscKey is **much smaller** than LevelDB (for the same database size), a lookup may search **fewer levels** of table files in the LSM-tree and a significant portion of the LSM-tree can be easily **cached in memory**.

Of course, key/value separation brings its own challenges. The most obvious one: if key-value pairs are inserted in random order and a range query is then issued, the matching values end up scattered across different positions of the log file; the paper's answer is to issue the reads in parallel to drive up bandwidth. The authors also give solutions for the remaining challenges (garbage collection and crash consistency); interested readers should consult the paper.
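To make the key/value separation concrete, here is a toy Python sketch of the idea: `put` appends the record to a value log and stores only `<key, (offset, size)>` in the (simulated) LSM-tree, and `get` does one extra read into the log. A plain dict stands in for the LSM-tree, and crash consistency and garbage collection are omitted; the names `WiscKeySketch` and `vlog_path` are invented for this example, not the paper's API.

```python
import os


class WiscKeySketch:
    """Toy key/value separation: values in an append-only log, addresses in the 'LSM-tree'."""

    def __init__(self, vlog_path: str = "value.log"):
        self.vlog_path = vlog_path
        self.vlog = open(vlog_path, "ab")
        self.vlog.seek(0, os.SEEK_END)
        self.lsm = {}                           # key -> (offset, size); a real system uses an LSM-tree

    def put(self, key: bytes, value: bytes) -> None:
        # The vlog record also contains the key, which the paper's garbage collection relies on.
        record = (len(key).to_bytes(4, "little") + key +
                  len(value).to_bytes(4, "little") + value)
        offset = self.vlog.tell()
        self.vlog.write(record)
        self.vlog.flush()
        # Only the small <key, address> pair goes into the LSM-tree, so compaction
        # never has to rewrite the (potentially large) value.
        self.lsm[key] = (offset, len(record))

    def get(self, key: bytes) -> bytes | None:
        addr = self.lsm.get(key)                # search the small, well-cached LSM-tree first
        if addr is None:
            return None
        offset, size = addr
        with open(self.vlog_path, "rb") as f:   # then one extra random read into the value log
            f.seek(offset)
            record = f.read(size)
        key_len = int.from_bytes(record[:4], "little")
        return record[4 + key_len + 4:]         # skip key-length, key, value-length headers
```

Because only the small key/address pair ever reaches the LSM-tree, compaction never rewrites values; that is exactly where the savings in write amplification, bandwidth, and SSD endurance come from.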
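And, continuing from the `WiscKeySketch` class above, a sketch of the range-query mitigation just mentioned: keys (and value addresses) come out of the LSM-tree already sorted, while the values are fetched from the value log with parallel random reads to exploit the SSD's internal parallelism. The function name `range_query`, the thread count, and the thread-pool approach are illustrative stand-ins for the paper's prefetching scheme.

```python
from concurrent.futures import ThreadPoolExecutor


def range_query(store: WiscKeySketch, start: bytes, end: bytes, num_threads: int = 32):
    """Scan [start, end): keys come back sorted, values are fetched with parallel random reads."""
    # 1. Walk the LSM-tree part: keys and value addresses are already in sorted order.
    keys = sorted(k for k in store.lsm if start <= k < end)

    # 2. The values sit at scattered offsets in the value log, so issue the random
    #    reads concurrently -- the SSD's internal parallelism hides much of the cost.
    with ThreadPoolExecutor(max_workers=num_threads) as pool:
        values = list(pool.map(store.get, keys))
    return list(zip(keys, values))
```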