Chapter 8: Slab Allocator

  • The kernel's 'general purpose' allocator is a 'slab allocator'. The basic concept is to have caches of commonly used objects kept in an initialised state, available for use by the kernel.

  • Without an object-based allocator, the kernel would spend a lot of time allocating, initialising and freeing the same object over and over again - the slab allocator aims to cache freed objects so they can be reused.

  • The slab allocator consists of a variable number of caches linked together on a doubly-linked circular list called a 'cache chain'.

  • In the context of the slab allocator, a cache is a manager for a number of objects of a particular type, e.g. a struct mm_struct cache. Each cache is managed by a struct kmem_cache_s.

  • Each cache maintains blocks of contiguous pages in memory called 'slabs' that are carved into small chunks for the objects that the cache manages:

-------------          ---------          -------------
| lastcache |--------->| cache |--------->| nextcache |
-------------          ---------          -------------
                        |  |  |
                        |  |  |
          /-------------/  |  \--------------\
          |                |                 |
          v                v                 v
   --------------  -----------------  --------------
   | slabs_full |  | slabs_partial |  | slabs_free |
   --------------  -----------------  --------------
          |                |                 |
          |                |                 |
          |                |                 |
          |                |                 |
          v                v                 v
      ---------        ---------         ---------
      | slabs |        | slabs |         | slabs |
      ---------        ---------         ---------
          |                |                 |
          |                |                 |
          |                |                 |
          |                |                 |
          v                v                 v
      ---------        ---------         ---------
      | pages |        | pages |         | pages |
      ---------        ---------         ---------
          |                |                 |
          |                |                 |
      /-------\        /-------\         /-------\
      |       |        |       |         |       |
      v       v        v       v         v       v
   ------- -------  ------- -------   ------- -------
   | obj | | obj |  | obj | | obj |   | obj | | obj |
   ------- -------  ------- -------   ------- -------
  • The slab allocator has three principal aims:
  1. The allocation of small blocks of memory to help eliminate internal fragmentation that would otherwise be caused by the buddy system.

  2. The caching of commonly used objects so the system does not waste time allocating, initialising and destroying objects.

  3. Better use of the hardware cache by aligning objects to the L1 or L2 caches.

  • To help eliminate internal fragmentation normally caused by a binary buddy allocator, two sets of caches of small memory buffers ranging from 2^5 (32) bytes to 2^17 (131,072) bytes are maintained.

  • One cache set is suitable for use with DMA devices, the other is not - these are called 'size-N(DMA)' and 'size-N' respectively, where N is the size of the allocation.

  • The function kmalloc() is provided for allocating from these caches (see 8.4.1.) The sizes of the caches are discussed in further detail in 8.4.

  • The slab allocator also maintains caches of commonly used objects - for many structures used in the kernel, the time needed to initialise an object is comparable with, or exceeds, the cost of allocating space for it.

  • When a new slab is created, a number of objects are packed into it and initialised using a constructor if available. When an object is freed, it is left in its initialised state so that object allocation will be quick.

  • The final task of the slab allocator is to optimise the use of hardware caches. In order to do this, 'slab colouring' is performed - if there is space left over after objects are packed into a slab, the remaining space is used to colour the slab.

  • Slab colouring is a scheme that attempts to ensure that objects in different slabs use different lines in the cache. By placing objects at different starting offsets within the slab, they will likely use different lines in the CPU cache which helps ensure that objects from the same slab cache will be unlikely to flush each other.

  • With this scheme, space that would otherwise be wasted is used to fulfil a new function. Diagrammatically:

                             ------------------
                             |                |
                             |                |
                             |                |
                             |                |
                             |  Rest of page  |
 ----------------------      |  filled with   |
 |   kmem_getpages()  |      |  objects and   |
 ----------------------      |    padding     |
            |                |                |
            |                |                |  ----------------------
            |                |                |  | kmem_cache_alloc() |
            v                |                |  ----------------------
 ----------------------      |----------------|             |
 | __get_free_pages() |      | Colour padding |             |
 ----------------------      |----------------|             |
            |                |                |<------------/
            |                |  Initialised   |                 -----------------------
            |                |     Object     |                 | Object allocated to |
            v                |                |---------------->| requesting process  |
 ----------------------      |----------------| ^               -----------------------
 |   Allocates pages  |      | Colour padding | |
 |      of order      |      |----------------| |
 | cachep->gfp_order  |      |                | |   L1 CPU
 ----------------------      |  Initialised   | | Cache Size
            |                |     Object     | |
            |                |                | |
            \--------------->|----------------- v
  • Linux does not attempt to colour page allocations based on the physical address, or to order where objects are placed. The details of cache colouring are discussed further in 8.1.5.

  • On an SMP system, a further step is taken to help cache utilisation where each cache has a small array of objects reserved for each CPU - this is discussed in more detail in 8.5.

  • The slab allocator allows for slab debugging if CONFIG_SLAB_DEBUG is set. This adds 'red zoning' and 'object poisoning'.

  • Red zoning adds a marker at either end of the object - if it is disturbed, the allocator knows which object encountered a buffer overflow, and reports it.

  • Poisoning an object will fill it with a predefined bit pattern (POISON_BYTE which is defined as 0x5a) at slab creation and after free. On each allocation this pattern is checked - if it is changed, the allocator knows the object was used before it was allocated, and flags it.
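
  • A sketch of the poison check, paraphrased from kmem_check_poison_obj() in 2.4's mm/slab.c (the pattern is terminated by a single POISON_END byte, 0xa5):

static inline int kmem_check_poison_obj(kmem_cache_t *cachep, void *objp)
{
        int size = cachep->objsize;
        void *end;

        if (cachep->flags & SLAB_RED_ZONE) {
                /* Skip the red-zone words at either end of the object. */
                objp += BYTES_PER_WORD;
                size -= 2 * BYTES_PER_WORD;
        }
        /* The first POISON_END must be the final byte - a stray POISON_END
         * earlier in the object, or a clobbered final byte, is an error. */
        end = memchr(objp, POISON_END, size);
        if (end != (objp + size - 1))
                return 1;
        return 0;
}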

  • Taking a look at the slab allocator's API (a short usage sketch follows the list):

  1. kmem_cache_create() - Creates a new cache and adds it to the cache chain.

  2. kmem_cache_reap() - Scans at most REAP_SCANLEN caches and selects one from which to reap all per-CPU objects and free slabs. This is called when memory is tight.

  3. kmem_cache_shrink() - Deletes all per-CPU objects associated with a cache and deletes all slabs in the kmem_cache_s->slabs_free list. It returns the number of freed pages.

  4. kmem_cache_alloc() (and subsequently __kmem_cache_alloc()) - Allocates a single object from the cache and returns it to the caller.

  5. kmem_cache_free() - Frees an object and returns it to the cache.

  6. kmalloc() - Allocates a block of memory from the appropriate sizes cache.

  7. kfree() - Frees a block of memory allocated with kmalloc().

  8. kmem_cache_destroy() - Destroys all objects in all slabs and frees up all associated memory before removing the cache from the chain.
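
  • A minimal usage sketch (the cache name, object type and constructor here are hypothetical, but the function signatures are the 2.4 ones):

#include <linux/init.h>
#include <linux/errno.h>
#include <linux/slab.h>

/* A hypothetical object type, used only for illustration. */
struct foo {
        int bar;
};

static kmem_cache_t *foo_cachep;

/* Hypothetical constructor - each new object is left initialised. */
static void foo_ctor(void *objp, kmem_cache_t *cachep, unsigned long flags)
{
        struct foo *f = objp;
        f->bar = 0;
}

static int __init foo_init(void)
{
        /* Word alignment (offset 0), no special flags, no destructor. */
        foo_cachep = kmem_cache_create("foo_cache", sizeof(struct foo),
                                       0, 0, foo_ctor, NULL);
        if (!foo_cachep)
                return -ENOMEM;
        return 0;
}

static void foo_use(void)
{
        struct foo *f = kmem_cache_alloc(foo_cachep, SLAB_KERNEL);

        if (f) {
                /* ... use the already-initialised object ... */
                kmem_cache_free(foo_cachep, f);
        }
}

static void __exit foo_exit(void)
{
        /* A module must destroy its caches on unload (see 8.1.9.) */
        kmem_cache_destroy(foo_cachep);
}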

8.1 Caches

  • One cache exists for each type of object that is to be cached. To see a list of caches available on a running system, /proc/slabinfo can be queried, e.g. from my local 2.4.22 VM system:
morgoth:~# cat /proc/slabinfo
slabinfo - version: 1.1 (SMP)
kmem_cache            64     64    244    4    4    1 :  252  126
tcp_tw_bucket          0      0     96    0    0    1 :  252  126
tcp_bind_bucket        0      0     32    0    0    1 :  252  126
tcp_open_request       0      0     64    0    0    1 :  252  126
inet_peer_cache        0      0     64    0    0    1 :  252  126
ip_fib_hash          113    113     32    1    1    1 :  252  126
ip_dst_cache          48     48    160    2    2    1 :  252  126
arp_cache             30     30    128    1    1    1 :  252  126
blkdev_requests     3080   3080     96   77   77    1 :  252  126
nfs_write_data         0      0    352    0    0    1 :  124   62
nfs_read_data          0      0    352    0    0    1 :  124   62
nfs_page               0      0     96    0    0    1 :  252  126
journal_head         234    234     48    3    3    1 :  252  126
revoke_table         126    253     12    1    1    1 :  252  126
revoke_record          0      0     32    0    0    1 :  252  126
dnotify_cache          0      0     20    0    0    1 :  252  126
file_lock_cache       80     80     96    2    2    1 :  252  126
fasync_cache           0      0     16    0    0    1 :  252  126
uid_cache            226    226     32    2    2    1 :  252  126
skbuff_head_cache    384    384    160   16   16    1 :  252  126
sock                  54     54    832    6    6    2 :  124   62
sigqueue              58     58    132    2    2    1 :  252  126
kiobuf                 0      0     64    0    0    1 :  252  126
cdev_cache           118    118     64    2    2    1 :  252  126
bdev_cache            59     59     64    1    1    1 :  252  126
mnt_cache            118    118     64    2    2    1 :  252  126
inode_cache          413    413    512   59   59    1 :  124   62
dentry_cache         570    570    128   19   19    1 :  252  126
filp                 150    150    128    5    5    1 :  252  126
names_cache            4      4   4096    4    4    1 :   60   30
buffer_head         2360   2360     96   59   59    1 :  252  126
mm_struct             48     48    160    2    2    1 :  252  126
vm_area_struct       400    400     96   10   10    1 :  252  126
fs_cache             118    118     64    2    2    1 :  252  126
files_cache           36     36    416    4    4    1 :  124   62
signal_act            27     27   1312    9    9    1 :   60   30
size-131072(DMA)       0      0 131072    0    0   32 :    0    0
size-131072            0      0 131072    0    0   32 :    0    0
size-65536(DMA)        0      0  65536    0    0   16 :    0    0
size-65536             0      0  65536    0    0   16 :    0    0
size-32768(DMA)        0      0  32768    0    0    8 :    0    0
size-32768             0      0  32768    0    0    8 :    0    0
size-16384(DMA)        0      0  16384    0    0    4 :    0    0
size-16384             0      0  16384    0    0    4 :    0    0
size-8192(DMA)         0      0   8192    0    0    2 :    0    0
size-8192              4      4   8192    4    4    2 :    0    0
size-4096(DMA)         0      0   4096    0    0    1 :   60   30
size-4096            267    267   4096  267  267    1 :   60   30
size-2048(DMA)         0      0   2048    0    0    1 :   60   30
size-2048              8      8   2048    4    4    1 :   60   30
size-1024(DMA)         0      0   1024    0    0    1 :  124   62
size-1024            108    108   1024   27   27    1 :  124   62
size-512(DMA)          0      0    512    0    0    1 :  124   62
size-512              56     56    512    7    7    1 :  124   62
size-256(DMA)          0      0    256    0    0    1 :  252  126
size-256             105    105    256    7    7    1 :  252  126
size-128(DMA)          0      0    128    0    0    1 :  252  126
size-128             510    510    128   17   17    1 :  252  126
size-64(DMA)           0      0     64    0    0    1 :  252  126
size-64              177    177     64    3    3    1 :  252  126
size-32(DMA)           0      0     32    0    0    1 :  252  126
size-32              565    565     32    5    5    1 :  252  126
  • Each of the column fields corresponds to a field in struct kmem_cache_s. Looking at each one in order:
  1. cache-name - A human-readable name e.g. 'tcp_bind_bucket'

  2. num-active-objs - Number of objects that are in use.

  3. total-objs - Number of available objects, including unused.

  4. obj-size - The size of each object, typically small.

  5. num-active-slabs - Number of slabs containing objects that are active.

  6. total-slabs - Total number of slabs.

  7. num-pages-per-slab - The pages required to create one slab, typically 1.

  8. limit (SMP systems, after the colon) - The number of free objects the pool can have before half of it is given to the global free pool.

  9. batchcount (SMP systems, after the colon) - The number of objects allocated for the processor in a block when no objects are free.

  • Columns 8 and 9 are SMP-specific (shown after the colon in the /proc/slabinfo output) and refer to the per-CPU cache which is explored further in 8.5.

  • To speed allocation and freeing of objects and slabs, they are arranged into 3 lists - slabs_full, slabs_partial and slabs_free. Entries in slabs_full have all of their objects in use. Entries in slabs_partial have free objects available, so are prime candidates for object allocation. Finally, slabs_free entries have no allocated objects, so are candidates for slab destruction.

8.1.1 Cache Descriptor

  • All information describing a cache is stored in a struct kmem_cache_s. This is a rather large data structure:
struct kmem_cache_s {
/* 1) each alloc & free */
        /* full, partial first, then free */
        struct list_head        slabs_full;
        struct list_head        slabs_partial;
        struct list_head        slabs_free;
        unsigned int            objsize;
        unsigned int            flags;  /* constant flags */
        unsigned int            num;    /* # of objs per slab */
        spinlock_t              spinlock;
#ifdef CONFIG_SMP
        unsigned int            batchcount;
#endif

/* 2) slab additions /removals */
        /* order of pgs per slab (2^n) */
        unsigned int            gfporder;

        /* force GFP flags, e.g. GFP_DMA */
        unsigned int            gfpflags;

        size_t                  colour;         /* cache colouring range */
        unsigned int            colour_off;     /* colour offset */
        unsigned int            colour_next;    /* cache colouring */
        kmem_cache_t            *slabp_cache;
        unsigned int            growing;
        unsigned int            dflags;         /* dynamic flags */

        /* constructor func */
        void (*ctor)(void *, kmem_cache_t *, unsigned long);

        /* de-constructor func */
        void (*dtor)(void *, kmem_cache_t *, unsigned long);

        unsigned long           failures;

/* 3) cache creation/removal */
        char                    name[CACHE_NAMELEN];
        struct list_head        next;
#ifdef CONFIG_SMP
/* 4) per-cpu data */
        cpucache_t              *cpudata[NR_CPUS];
#endif
#if STATS
        unsigned long           num_active;
        unsigned long           num_allocations;
        unsigned long           high_mark;
        unsigned long           grown;
        unsigned long           reaped;
        unsigned long           errors;
#ifdef CONFIG_SMP
        atomic_t                allochit;
        atomic_t                allocmiss;
        atomic_t                freehit;
        atomic_t                freemiss;
#endif
#endif
};
  • Looking at each field in turn:
  1. slabs_[full, partial, free] - These are the three lists where the slabs are stored as discussed above.

  2. objsize - Size of each object packed into the slab.

  3. flags - Flags that determine how parts of the allocator will behave when dealing with the cache. Discussed in 8.1.2

  4. num - This is the number of objects contained in each slab.

  5. spinlock - Protects the structure from concurrent accesses.

  6. batchcount - Number of objects that will be allocated in batch for the per-cpu caches, as discussed above.

  7. gfporder - This indicates the size of the slab in pages. Each slab consumes 2^gfporder pages, because these are the allocation sizes that the buddy allocator provides.

  8. gfpflags - The GFP flags used when calling the buddy allocator to allocate pages. See 6.4 for more details.

  9. colour - The number of different offsets that can be used for cache colouring (cache colouring is discussed further in 8.1.5.)

  10. colour_off - The byte alignment to keep slabs at. For example, slabs for the size-X caches are aligned on the L1 cache.

  11. colour_next - The next colour line to use. Wraps back to 0 when it reaches colour.

  12. slabp_cache - Cache for off-slab slab management objects (see 8.2.1.)

  13. growing - Set to indicate if the cache is growing or not. If so, it is much less likely that this cache will be selected to reap free slabs under memory pressure.

  14. dflags - Dynamic flags that change during cache lifetime (see 8.1.3.)

  15. ctor - Callback that can be called to initialise each new object, defaults to NULL meaning no such function is called.

  16. dtor - Similar to ctor but for object destruction.

  17. failures - Unused - Set to 0 then ignored. We have no failures, bitches!

  18. name - Human-readable name for the cache.

  19. next - struct list_head field for the next cache on the cache chain.

  20. cpudata - Per-CPU data. Discussed further in 8.5.

  21. num_active - (debug) - The current number of active objects

  22. num_allocations - (debug) - Running total of the number of objects that have been allocated on this cache.

  23. high_mark - (debug) - The highest value num_active has had, to date.

  24. grown - (debug) - The number of times kmem_cache_grow() has been called.

  25. reaped - (debug) - Number of times this cache has been reaped.

  26. errors - (debug, unused)

  27. allochit - (debug) - Total number of times an allocation has used the per-CPU cache.

  28. allocmiss - (debug) - Total number of times an allocation has missed the per-CPU cache.

  29. freehit - (debug) - Total number of times a freed object was placed on a per-CPU cache.

  30. freemiss - (debug) - Total number of times an object was freed and placed in the global pool.

  • The statistical fields (num_active, etc. behind the #if STATS) are only available if CONFIG_SLAB_DEBUG is enabled. /proc/slabinfo does not rely on these being present, rather it calculates the values when the proc entry is read by another process by examining every slab used by each cache.

8.1.2 Cache Static Flags

  • A number of flags are set at cache creation time that remain the same for the lifetime of the cache and affect how the slab is structured and how objects are stored within it.

  • All of the flags are stored in a bitmask in the flags field of the cache descriptor.

  • Looking at each of the internal flags defined in mm/slab.c:

  1. CFLGS_OFF_SLAB - Indicates that the slab managers for this cache are kept off-slab. Discussed further in 8.2.1.

  2. CFLGS_OPTIMIZE - Only set, but never used. We always optimise, baby!

  • Looking at each of the flags that are set by the cache creator:
  1. SLAB_HWCACHE_ALIGN - Align objects to the L1 CPU cache.

  2. SLAB_MUST_HWCACHE_ALIGN - Force alignment to the L1 CPU cache even if it is very wasteful or slab debugging is enabled.

  3. SLAB_NO_REAP - Never reap slabs in this cache.

  4. SLAB_CACHE_DMA - Allocate slabs with memory from ZONE_DMA.

  • Finally, let's consider some debug flags, which require CONFIG_SLAB_DEBUG to be set:
  1. SLAB_DEBUG_FREE - Perform expensive checks on free.

  2. SLAB_DEBUG_INITIAL - On free, call the constructor as a verifier to ensure the object is still initialised correctly.

  3. SLAB_RED_ZONE - Places a marker at either end of objects to trap overflows.

  4. SLAB_POISON - Poison objects with a known pattern for trapping changes made to objects not allocated or initialised (see above)

  • To prevent callers from using the wrong flags, a CREATE_MASK is provided that 'masks in' all the allowable flags. When a cache is being created, the requested flags are compared against the CREATE_MASK and reported as a bug if invalid flags are used.

8.1.3 Cache Dynamic Flags

  • The dflags field has only one flag - DFLGS_GROWN - but it is important. It is set during kmem_cache_grow() so that kmem_cache_reap() will be unlikely to choose the cache for reaping.

  • When kmem_cache_reap() finds a cache with DFLGS_GROWN set, it skips it, removing the flag in the process.

8.1.4 Cache Allocation Flags

  • These flags correspond to the GFP page flag options for allocating pages for slabs. Some callers use GFP_... and SLAB_... flags interchangeably; however, this is incorrect and only SLAB_... flags should be used:
  1. SLAB_ATOMIC
  2. SLAB_DMA
  3. SLAB_KERNEL
  4. SLAB_NFS
  5. SLAB_NOFS
  6. SLAB_NOHIGHIO
  7. SLAB_NOIO
  8. SLAB_USER
  • A small number of flags may be passed to constructor and destructor functions:
  1. SLAB_CTOR_CONSTRUCTOR - Set if the function is being called as a constructor for caches that use the same function for constructors and destructors.

  2. SLAB_CTOR_ATOMIC - Indicates that the constructor may not sleep.

  3. SLAB_CTOR_VERIFY - Indicates that the constructor should just verify that the object is initialised properly.

8.1.5 Cache Colouring

  • As discussed previously, the slab allocator attempts to better make use of the hardware cache by offsetting objects in different slabs by differing amounts depending on the amount of space left over in the slab.

  • The offset is in units of BYTES_PER_WORD unless the SLAB_HWCACHE_ALIGN flag is set, in which case it is aligned to blocks of L1_CACHE_BYTES for alignment to the L1 hardware cache.

  • During cache creation the number of objects that can fit on a slab (see 8.2.7) and how many bytes would be wasted are calculated. Based on wastage, two figures are calculated for the cache descriptor:

  1. colour - The number of different offsets that can be used.

  2. colour_off - The multiple to offset each object in the slab.

  • With the objects offset, they will use different lines on the associative hardware cache. This makes it less likely that objects from different slabs will flush each other's cache lines.

  • Consider an example - Let's say that s_mem (the address of the first object on the slab) is 0, alignment is to 32 bytes for the L1 hardware cache, and 100 bytes are wasted on the slab.

  • In this scenario, the first slab created will have its objects start at 0. The second will start at 32, the third at 64, the fourth at 96 and the fifth will start back at 0.

  • This makes it unlikely objects from each of the slabs will hit the same hardware cache line on the CPU.

  • In this example, colour would be 3 (there are 3 offsets - 32, 64, 96), and colour_off would be 32.
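
  • The offset selection itself is simple. A sketch, paraphrased from kmem_cache_grow() and kmem_cache_slabmgmt() in mm/slab.c:

/* Choose this slab's colour and convert it to a byte offset -
 * colour_next cycles through 0 .. colour-1 and then wraps. */
offset = cachep->colour_next;
cachep->colour_next++;
if (cachep->colour_next >= cachep->colour)
        cachep->colour_next = 0;
offset *= cachep->colour_off;

...

/* objp is the base of the slab's pages. For an on-slab descriptor,
 * the aligned size of the slab_t and kmem_bufctl_t array is added
 * to offset before these assignments. */
slabp->colouroff = offset;
slabp->s_mem = objp + offset;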

8.1.6 Cache Creation

  • The function kmem_cache_create() is responsible for creating new caches and adding them to the cache chain. The tasks that are performed to create a cache are:
  1. Perform basic sanity checks to avoid bad usage.

  2. Perform debugging checks if CONFIG_SLAB_DEBUG is set.

  3. Allocate a kmem_cache_t (typedef'd from a struct kmem_cache_s) from the cache_cache slab cache.

  4. Align the object size to the word size.

  5. Calculate how many objects will fit on a slab.

  6. Align the object to the hardware cache.

  7. Calculate colour offsets.

  8. Initialise remaining fields in the cache descriptor.

  9. Add the new cache to the cache chain.

8.1.7 Cache Reaping

  • When a slab is freed, it is placed on the slabs_free list for future use. Caches do not automatically shrink themselves, so, when kswapd notices memory is tight, it calls kmem_cache_reap() to free some memory.

  • kmem_cache_reap() is responsible for selecting a cache that will be required to shrink its memory usage.

  • Note that cache reaping does not take into account what memory node or zone is under pressure. This means that with a NUMA or high memory machine it is possible the kernel will spend a lot of time freeing memory from regions that are under no memory pressure. This is not a problem for an architecture like i386 which has only one bank of memory.

  • In the event that the system has a large number of caches, only REAP_SCANLEN (10 by default) caches are examined in each call.

  • The last cache to be scanned is stored in the variable clock_searchp to avoid examining the same caches repeatedly.

  • For each scanned cache, the reaper:

  1. Checks the cache's flags for SLAB_NO_REAP and skips it if set.

  2. If the cache is growing, skip it.

  3. If the cache has grown recently, or is currently growing, DFLGS_GROWN will be set. If this flag is set, the slab is skipped, but the flag is cleared so the cache will be a reap candidate the next time round.

  4. Count the number of free slabs in the slabs_free field and calculate how many pages reaping this cache would free, storing in the local variable pages.

  5. If the cache has constructors or large slabs, adjust pages to make it less likely for the cache to be selected.

  6. If the number of pages that would be freed exceeds REAP_PERFECT (defaults to 10), free half of the slabs in slabs_free.

  7. Otherwise, scan the rest of the caches and select the one that would free the most pages if the function was to free half of the slabs in slabs_free.

8.1.8 Cache Shrinking

  • When a cache is selected to shrink itself, the steps it takes are simple and brutal:
  1. Delete all objects in the per-CPU caches.

  2. Delete all slabs from the slabs_free field unless the growing flag gets set.

  • Linux is nothing if not subtle :) Two varieties of shrink functions are provided with confusingly similar names - kmem_cache_shrink() and __kmem_cache_shrink().

  • kmem_cache_shrink() removes all slabs from the cache's slabs_free field and returns the number of pages freed as a result - it's the principal shrinking function exported for use by slab allocator users.

  • __kmem_cache_shrink() frees all slabs from the cache's slabs_free field and then verifies that slabs_partial and slabs_full are empty. This is for internal use only, and used during cache destruction when it doesn't matter how many pages are freed, just that the cache is empty.

8.1.9 Cache Destroying

  • When a module is unloaded, it is responsible for destroying any caches it created via kmem_cache_destroy(). It is vital that a cache is properly destroyed, because two caches with the same human-readable name are not allowed to exist.

  • Core kernel systems often don't bother to destroy their caches because they last for the entire life of the system.

  • The steps taken to destroy a cache are:

  1. Delete the cache from the cache chain.

  2. Shrink the cache to delete all slabs.

  3. Free any per-CPU caches via kfree().

  4. Delete the cache descriptor from the cache_cache.

8.2 Slabs

  • struct slab_s (typedef'd to slab_t) describes a slab. It is much simpler than the cache descriptor (though the way it's arranged is considerably more complex):
/*
 * slab_t
 *
 * Manages the objs in a slab. Placed either at the beginning of mem allocated
 * for a slab, or allocated from an general cache.
 * Slabs are chained into three list: fully used, partial, fully free slabs.
 */
typedef struct slab_s {
        struct list_head        list;
        unsigned long           colouroff;
        void                    *s_mem;         /* including colour offset */
        unsigned int            inuse;          /* num of objs active in slab */
        kmem_bufctl_t           free;
} slab_t;
  • Looking at each field:
  1. list - The linked list the slab belongs to - it'll be one of slabs_[full|partial|free] from the cache manager.

  2. colouroff - The offset of s_mem from the base address of the page frame the slab resides in. Note from the code comment above that s_mem already includes the colour offset, so the first object is located at s_mem itself.

  3. s_mem - The starting address of the first object within the slab.

  4. inuse - The number of active objects in the slab.

  5. free - The index of the next free object in the array of kmem_bufctl_t's (kmem_bufctl_t is typedef'd to unsigned int) that starts after the end of slab_t (slab_t is always kept at the start of a page frame so there is space for more data afterwards.) Discussed further in 8.2.3.

  • There is no obvious way to determine which cache a slab belongs to or vice-versa. It turns out that we can use the list field of the struct page (see 2.5 - the list field is a generic field that can be used for different purposes, which is quite surprising :)

  • In order to set/get the page's cache/slab you can use the SET_PAGE_CACHE(), SET_PAGE_SLAB(), GET_PAGE_CACHE(), and GET_PAGE_SLAB() macros which simply place caches in the next field of the list and slabs in the prev field:

/* Macros for storing/retrieving the cachep and or slab from the
 * global 'mem_map'. These are used to find the slab an obj belongs to.
 * With kfree(), these are used to find the cache which an obj belongs to.
 */
#define SET_PAGE_CACHE(pg,x)  ((pg)->list.next = (struct list_head *)(x))
#define GET_PAGE_CACHE(pg)    ((kmem_cache_t *)(pg)->list.next)
#define SET_PAGE_SLAB(pg,x)   ((pg)->list.prev = (struct list_head *)(x))
#define GET_PAGE_SLAB(pg)     ((slab_t *)(pg)->list.prev)
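  • This is exactly how kfree() (see 8.4.2) finds the cache an arbitrary pointer belongs to - abridged from mm/slab.c:

void kfree(const void *objp)
{
        unsigned long flags;
        kmem_cache_t *c;

        local_irq_save(flags);
        CHECK_PAGE(virt_to_page(objp));
        /* Map the pointer back to its struct page, then to its cache. */
        c = GET_PAGE_CACHE(virt_to_page(objp));
        __kmem_cache_free(c, (void *)objp);
        local_irq_restore(flags);
}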
  • Diagrammatically (for a case where a slab is on the slabs_partial list):
      struct kmem_cache_s (=kmem_cache_t)
  /-->-----------------------------------
  |   |   struct list_head *slabs_full  |
  |   |---------------------------------|
  |   | struct list_head *slabs_partial |-------------\
  |   |---------------------------------|             |
  |   /                                 /             v
  |   \               ...               \      ---------------
  |   /                                 /      | Linked list | More slab_t's...
  |   -----------------------------------      |  of slab_t  |- - >
  |                                            ---------------
  |                                                   |
  |                                                   |
  |                                                   |
  |               struct page                         v
  |   next ------------------------- prev        ----------
  \--------| struct list_head list |------------>| slab_t |
           |-----------------------|             ----------
           /                       /
           \          ...          \
           /                       /
           |-----------------------|
           |   flags with PG_slab  |
           |        bit set        |
           |-----------------------|
           /                       /
           \          ...          \
           /                       /
           -------------------------
  • The last issue is where the slab management data structure slab_t is kept. They are either kept on-slab or off-slab depending on whether CFLGS_OFF_SLAB is set in the static flags.

  • The size of the object determines whether slab_t is stored on or off-slab. Though the above diagram implies the slab_t is always kept separate, for on-slab descriptors it is kept at the beginning of the slab's first page frame.

8.2.1 Storing the Slab Descriptor

  • If the objects are larger than a threshold (PAGE_SIZE>>3 so 512 bytes on i386), CFLGS_OFF_SLAB is set in the cache flags and the slab descriptor/manager/management data structure is kept off-slab in one of the 'sizes caches' (see 8.4.)

  • The selected cache will be large enough to contain the slab_t and kmem_cache_slabmgmt() allocates from it as necessary.

  • Keeping the slab_t off-slab limits the number of objects that can be stored on the slab because there is limited space for the kmem_bufctl_t's (see 8.2.3) - however that shouldn't matter because the objects are large so there should be fewer stored on a single slab.

  • If the slab manager is kept on-slab it is reserved at the beginning of the slab. Enough space is kept to store the slab_t and the kmem_bufctl_ts.

  • The kmem_bufctl_ts are an array of unsigned integers used for tracking the index of the next free object available for use, see 8.2.3 for more details.

  • The actual objects in the slab are stored after the kmem_bufctl_t array.
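
  • The decision is made at cache creation time. A sketch of the check, paraphrased from kmem_cache_create():

/* Size is large, so assume it is best to place the slab management
 * object off-slab to allow better packing of objects.
 * PAGE_SIZE>>3 is 512 bytes on i386. */
if (size >= (PAGE_SIZE >> 3))
        flags |= CFLGS_OFF_SLAB;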

  • A slab with on-slab descriptor, diagrammatically (again, assuming the slab we're looking at lives in slabs_partial):

struct kmem_cache_s (=kmem_cache_t)
-----------------------------------
|   struct list_head *slabs_full  |
|---------------------------------|
| struct list_head *slabs_partial |-------------\
|---------------------------------|             |
/                                 /             v
\               ...               \      ---------------
/                                 /      | Linked list | More slab_t's...
-----------------------------------      |  of slab_t  | - - >
                                         ---------------
                                                |
                                                |
   /--------------------------------------------/
   |
   |
   v        Page frame
   ---------------------------
   | struct list_head list   |
   |-------------------------|
   | unsigned long colouroff |
   |-------------------------|
/--|       void *s_mem       |
|  |-------------------------|
|  |   unsigned int inuse    |
|  |-------------------------|
|  |    kmem_bufctl_t free   |
|  |-------------------------|
|  /                         /
|  \  kmem_bufctl_t array... \
|  /                         /
\->|-------------------------|
   |         Object          |
   |-------------------------|
   |         Object          |
   |-------------------------|
   |         Object          |
   |-------------------------|
   /                         /
   \       More objects...   \
   /                         /
   ---------------------------
  • A slab with off-slab descriptor, diagrammatically (again, assuming the slab we're looking at lives in slabs_partial):
struct kmem_cache_s (=kmem_cache_t)
-----------------------------------
|   struct list_head *slabs_full  |
|---------------------------------|
| struct list_head *slabs_partial |-------------\
|---------------------------------|             |
/                                 /             v
\               ...               \      ---------------
/                                 /      | Linked list | More slab_t's...
-----------------------------------      |  of slab_t  | - - >
                                         ---------------
                                                |
                                                |
   /--------------------------------------------/
   |
   |
   v  Page frame for slab_t                                   Page frame for objects
   ---------------------------               /------------->---------------------------
   | struct list_head list   |               |              |         Object          |
   |-------------------------|               |              |-------------------------|
   | unsigned long colouroff |               |              |         Object          |
   |-------------------------|               |              |-------------------------|
   |       void *s_mem       |---------------/              |         Object          |
   |-------------------------|                              |-------------------------|
   |   unsigned int inuse    |                              /                         /
   |-------------------------|                              \       More objects...   \
   |    kmem_bufctl_t free   |                              /                         /
   |-------------------------|                              ---------------------------
   /                         /
   \  kmem_bufctl_t array... \
   /                         /
   ---------------------------

8.2.2 Slab Creation

  • So far we've seen how the cache is created, but not how the slabs_full, slabs_partial and slabs_free lists are grown, as they all start empty - i.e., how slabs are allocated.

  • New slabs are allocated to a cache via kmem_cache_grow() - this is referred to as 'cache growing'.

  • Cache growing occurs when no objects are left in the slabs_partial list and when there are no slabs in slabs_free.

  • kmem_cache_grow() does the following:

  1. Performs basic sanity checks to guard against bad usage.

  2. Calculates the colour offset for objects in the slab.

  3. Allocates memory for the slab and acquires a slab descriptor.

  4. Links the pages used for the slab to the slab and cache descriptors (see 8.2.)

  5. Initialises objects in the slab.

  6. Adds the slab to the cache.

8.2.3 Tracking Free Objects

  • The slab allocator needs to have a quick and simple means of tracking where free objects are located on the partially-filled slabs. This is achieved by using an array of unsigned integers of type kmem_bufctl_t that is associated with each 'slab manager'.

  • The number of objects in the array is the same as the number of objects on the slab.

  • The array of kmem_bufctl_ts is kept after the slab descriptor in the page frame in which it is kept. In order to make life easier there is a macro which gives access to the array:

#define slab_bufctl(slabp) \
        ((kmem_bufctl_t *)(((slab_t*)slabp)+1))
  • The free field of the slab management structure contains the index of the next free object in the slab. When objects are allocated or freed, this is updated based on information in the kmem_bufctl_t array.

8.2.4 Initialising the kmem_bufctl_t Array

  • When a cache is grown, all the objects and the kmem_bufctl_t array on the slab are initialised.

  • The array is filled with the index of each object, beginning with 1 and ending with the marker BUFCTL_END, e.g. for a slab with five objects:

------------------------------
| 1 | 2 | 3 | 4 | BUFCTL_END |
------------------------------
  • The value 0 is stored in slab_t->free because the 0th object is the first free object to be used. The idea here is that for a given object i, the index of the next free object will be stored in slab_bufctl(slabp)[i].

  • Using the array this way makes it act like a LIFO (stack) for free objects.

8.2.5 Finding the Next Free Object

  • When allocating an object, kmem_cache_alloc() updates the kmem_bufctl_t array by calling kmem_cache_alloc_one_tail().

  • As discussed in 8.2.4, slab_t->free contains the index of the first free object and the index of the next free object is at slab_bufctl(slabp)[slab_t->free]. The code for navigating this (taken from kmem_cache_alloc_one_tail()) looks like:

void *objp;

...

objp = slabp->s_mem + slabp->free*cachep->objsize;
slabp->free=slab_bufctl(slabp)[slabp->free];
  • This is in effect 'popping' an index off the stack.

  • Note that the array element for an allocated object is not cleared - it simply becomes unreachable (free only ever leads to free objects) until the object is freed again, at which point the element is rewritten as described in 8.2.6.

8.2.6 Updating kmem_bufctl_t

  • When an object is freed, kmem_cache_free_one() updates the array and the free field (objp is the object being freed):
unsigned int objnr = (objp-slabp->s_mem)/cachep->objsize;

slab_bufctl(slabp)[objnr] = slabp->free;
slabp->free = objnr;
  • This is in effect 'pushing' an index back on to the stack.

8.2.7 Calculating the Number of Objects on a Slab

  • During cache creation kmem_cache_estimate() is called to calculate how many objects may be stored on a single slab, taking into account whether the slab descriptor is to be stored on or off-slab and the size taken up by each kmem_bufctl_t needed to track whether an object is free or not.

  • kmem_cache_estimate() returns the number of objects that may be stored and how many bytes are wasted. The number of wasted bytes is important to know for cache colouring.

  • The calculation is fairly straightforward:

/* Cal the num objs, wastage, and bytes left over for a given slab size. */
static void kmem_cache_estimate (unsigned long gfporder, size_t size,
                 int flags, size_t *left_over, unsigned int *num)
{
        int i;
        size_t wastage = PAGE_SIZE<<gfporder;
        size_t extra = 0;
        size_t base = 0;

        if (!(flags & CFLGS_OFF_SLAB)) {
                base = sizeof(slab_t);
                extra = sizeof(kmem_bufctl_t);
        }
        i = 0;
        while (i*size + L1_CACHE_ALIGN(base+i*extra) <= wastage)
                i++;
        if (i > 0)
                i--;

        if (i > SLAB_LIMIT)
                i = SLAB_LIMIT;

        *num = i;
        wastage -= i*size;
        wastage -= L1_CACHE_ALIGN(base+i*extra);
        *left_over = wastage;
}
  • Step-by-step (a worked example follows the list):
  1. Initialise wastage to the total size of the slab - PAGE_SIZE<<gfporder bytes, i.e. 2^gfporder pages.

  2. Subtract the amount of space needed to store the slab descriptor, if this is kept on-slab.

  3. Count the number of objects that may be stored, taking into account the kmem_bufctl_t required for each object if kept on-slab. Keep incrementing i until the slab is filled.

  4. Return the number of objects and bytes wasted.
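
  • As a worked example, assume illustrative i386 values of PAGE_SIZE = 4096, L1_CACHE_BYTES = 32, sizeof(slab_t) = 24 and sizeof(kmem_bufctl_t) = 4, with gfporder = 0, 96-byte objects and an on-slab descriptor. The loop accepts i = 40 because 40*96 + L1_CACHE_ALIGN(24 + 40*4) = 3840 + 192 = 4032 <= 4096, but rejects i = 41 (which would need 4128 bytes). So num = 40 and left_over = 4096 - 3840 - 192 = 64 bytes - enough for 2 colour offsets at 32-byte alignment.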

8.2.8 Slab Destroying

  • When a cache is being shrunk or destroyed, the slabs will be deleted and if the objects have destructors these must be called.

  • kmem_slab_destroy() performs the following steps:

  1. Call the destructors for every object in the slab.

  2. If debugging is enabled - check the red zone markers and poison pattern.

  3. Free the pages the slab uses.

8.3 Objects

  • Most of the really hard work of the slab allocator is handled by the cache and slab managers. Now we take a look at how objects themselves are managed.

8.3.1 Initialising Objects in a Slab

  • When a slab is created, all the objects in it are put in an initialised state. If a constructor is available, it is called for each object and it is expected that objects are left in an initialised state upon free.

  • Conceptually, the initialisation is very simple - cycle through each object, call its constructor and initialise the kmem_bufctl for it. This is performed by kmem_cache_init_objs().
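
  • A sketch of the loop, paraphrased from kmem_cache_init_objs():

int i;

for (i = 0; i < cachep->num; i++) {
        void *objp = slabp->s_mem + cachep->objsize * i;

        /* Construct the object so it is already initialised by the
         * time it is allocated. */
        if (cachep->ctor)
                cachep->ctor(objp, cachep, ctor_flags);
        /* Chain the free indices: 1, 2, 3, ... (see 8.2.4.) */
        slab_bufctl(slabp)[i] = i + 1;
}
slab_bufctl(slabp)[i - 1] = BUFCTL_END;
slabp->free = 0;        /* object 0 is the first to be allocated */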

8.3.2 Object Allocation

  • kmem_cache_alloc() is a simple wrapper around __kmem_cache_alloc(), which does the real work (note that the alloc_new_slab label below is jumped to from inside the kmem_cache_alloc_one() macro):
static inline void * __kmem_cache_alloc (kmem_cache_t *cachep, int flags)
{
        unsigned long save_flags;
        void* objp;

        kmem_cache_alloc_head(cachep, flags);
try_again:
        local_irq_save(save_flags);
#ifdef CONFIG_SMP
        {
                cpucache_t *cc = cc_data(cachep);

                if (cc) {
                        if (cc->avail) {
                                STATS_INC_ALLOCHIT(cachep);
                                objp = cc_entry(cc)[--cc->avail];
                        } else {
                                STATS_INC_ALLOCMISS(cachep);
                                objp = kmem_cache_alloc_batch(cachep,cc,flags);
                                if (!objp)
                                        goto alloc_new_slab_nolock;
                        }
                } else {
                        spin_lock(&cachep->spinlock);
                        objp = kmem_cache_alloc_one(cachep);
                        spin_unlock(&cachep->spinlock);
                }
        }
#else
        objp = kmem_cache_alloc_one(cachep);
#endif
        local_irq_restore(save_flags);
        return objp;
alloc_new_slab:
#ifdef CONFIG_SMP
        spin_unlock(&cachep->spinlock);
alloc_new_slab_nolock:
#endif
        local_irq_restore(save_flags);
        if (kmem_cache_grow(cachep, flags))
                /* Someone may have stolen our objs.  Doesn't matter, we'll
                 * just come back here again.
                 */
                goto try_again;
        return NULL;
}
  • This function consists of 4 basic steps along with a 5th if the system is SMP:
  1. Calls kmem_cache_alloc_head() to perform some basic sanity checks (i.e. whether the allocation is allowable.)

  2. (SMP) If there is an object available on the per-CPU cache, return it, otherwise allocate cachep->batchcount objects in bulk and put them on the per-CPU cache (see 8.5 for more details on per-CPU caches.)

  3. Select which slab to allocate from - either slabs_partial or slabs_free (see the sketch after this list.)

  4. If slabs_free is chosen and has no available slabs, grow the cache (see 8.2.2.).

  5. Allocate the object from the selected slab.
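
  • Step 3 is handled by the kmem_cache_alloc_one() macro - it is also what jumps to the alloc_new_slab label seen above. A sketch, paraphrased from mm/slab.c:

struct list_head *entry;
slab_t *slabp;

/* Prefer a partially-full slab; fall back to a free slab, moving it
 * onto slabs_partial; if neither exists, grow the cache. */
entry = cachep->slabs_partial.next;
if (entry == &cachep->slabs_partial) {          /* list is empty */
        entry = cachep->slabs_free.next;
        if (entry == &cachep->slabs_free)       /* list is empty */
                goto alloc_new_slab;
        list_del(entry);
        list_add(entry, &cachep->slabs_partial);
}
slabp = list_entry(entry, slab_t, list);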

8.3.3 Object Freeing

  • kmem_cache_free() frees objects and is relatively straightforward - differing only depending on whether the system is SMP or UP.

  • If the system is SMP, the object is returned to the per-CPU cache.

  • Regardless of whether the system is SMP or not, the destructor for an object will be called if one is available. It is responsible for returning the object to its initialised state.

8.4 Sizes Cache

  • Linux keeps two sets of slab caches for small memory allocations for which the physical page allocator is unsuitable.

  • One set is for use with DMA, and the other is suitable for normal use. The human-readable names for these caches are 'size-N cache' and 'size-N(DMA) cache' which are viewable from /proc/slabinfo.

  • Each sized cache is stored in struct cache_sizes (typedef'd to cache_sizes_t):

/* Size description struct for general caches. */
typedef struct cache_sizes {
        size_t          cs_size;
        kmem_cache_t    *cs_cachep;
        kmem_cache_t    *cs_dmacachep;
} cache_sizes_t;
  • Looking at each field:
  1. cs_size - The size of the memory block.

  2. cs_cachep - The cache of blocks for normal memory use.

  3. cs_dmacachep - The cache of blocks for use with DMA.

  • Because a limited number of these caches exist, a null-terminated static array, cache_sizes, is initialised at compile time, starting with 32 bytes on a 4KiB-page machine (64 bytes for systems with larger page sizes) and reaching 2^17 bytes (131,072 bytes, i.e. 128KiB):
static cache_sizes_t cache_sizes[] = {
#if PAGE_SIZE == 4096
        {    32,        NULL, NULL},
#endif
        {    64,        NULL, NULL},
        {   128,        NULL, NULL},
        {   256,        NULL, NULL},
        {   512,        NULL, NULL},
        {  1024,        NULL, NULL},
        {  2048,        NULL, NULL},
        {  4096,        NULL, NULL},
        {  8192,        NULL, NULL},
        { 16384,        NULL, NULL},
        { 32768,        NULL, NULL},
        { 65536,        NULL, NULL},
        {131072,        NULL, NULL},
        {     0,        NULL, NULL}
};

8.4.1 kmalloc()

  • Given the existence of these caches, the slab allocator is able to offer a new function, kmalloc(), for use when small memory buffers are required.

  • When a request is received, the appropriate sizes cache is selected and an object is assigned from it.

  • The hard work was done in object/cache/slab allocation, so kmalloc() itself is fairly simple.
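
  • A sketch of the whole function, lightly commented but otherwise as it appears in mm/slab.c - it simply walks the cache_sizes array until it finds a cache big enough:

void * kmalloc (size_t size, int flags)
{
        cache_sizes_t *csizep = cache_sizes;

        for (; csizep->cs_size; csizep++) {
                /* Walk until we find the first cache big enough. */
                if (size > csizep->cs_size)
                        continue;
                /* GFP_DMA requests come from the size-N(DMA) caches. */
                return __kmem_cache_alloc(flags & GFP_DMA ?
                         csizep->cs_dmacachep : csizep->cs_cachep, flags);
        }
        return NULL;
}

  • A request such as kmalloc(56, GFP_KERNEL) is therefore served from the size-64 cache.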

8.4.2 kfree()

  • Since there's a kmalloc() there needs to be an equivalent free function, kfree().

  • As with kmalloc() the hard work is done in object/cache/slab freeing, so the kfree() function itself is fairly simple too.

8.5 Per-CPU Object Cache

  • One of the tasks of the slab allocator is to improve hardware cache use. One important aspect of doing this is to try to use data on the same CPU for as long as possible.

  • Linux achieves this by trying to keep objects in the same CPU cache via a per-CPU object cache (known as a 'cpucache') for each CPU in the system.

  • When allocating or freeing objects on SMP systems, they are taken from/placed in the cpucache. When no objects are free on allocation (see 8.3.2), a batch of objects is placed into the pool.

  • When the cpucache pool gets too large, half of the objects are removed and placed into the global cache. This ensures the hardware cache will be used for as long as possible on the same CPU.

  • Another major benefit is that spinlocks do not have to be held when accessing the CPU pool because we are guaranteed that another CPU won't access the local data. This is important because without the caches, spinlocks would have to be acquired for every allocation and free which would be very expensive.

8.5.1 Describing the Per-CPU Object Cache

  • Each cache descriptor has an array of pointers to cpucaches (struct cpucache_s, typedef'd to cpucache_t), declared in the cache descriptor as cpucache_t *cpudata[NR_CPUS]; (where NR_CPUS is the maximum number of CPUs a system might have.)

  • The cpucache_t structure is simple:

typedef struct cpucache_s {
        unsigned int avail;
        unsigned int limit;
} cpucache_t;
  • Looking at each field:
  1. avail - The number of free objects available on this cpucache.

  2. limit - The total number of free objects that can exist.

  • A helper macro cc_data() is available to retrieve the cpucache for a given cache and processor:
#define cc_data(cachep) \
        ((cachep)->cpudata[smp_processor_id()])
  • Given a cache descriptor cachep this returns the appropriate pointer from the cpucache array cpudata via smp_processor_id().

  • Pointers to the objects in the cpucache are placed immediately after the cpucache_t struct, similar to how the kmem_bufctl_t array is stored after a slab descriptor in the on-slab case.

8.5.2 Adding/Removing Objects From the Per-CPU Cache

  • To prevent fragmentation, objects are always added or removed from the end of the array.

  • cc_entry() returns a pointer to the first object in the cpucache. It is similar in operation to slab_bufctl():

#define cc_entry(cpucache) \
        ((void **)(((cpucache_t*)(cpucache))+1))
  • To use cc_entry() to add an object to the CPU cache (cc):
void *objp;

...

cc_entry(cc)[cc->avail++] = objp;
  • To use cc_entry() to remove an object from the CPU cache (cc):
void *objp;

...

objp = cc_entry(cc)[--cc->avail];

8.5.3 Enabling Per-CPU Caches

  • When a cache is created, its CPU cache has to be enabled and memory has to be allocated to it via kmalloc() - this is performed by enable_cpucache().

  • enable_cpucache() calls kmem_tune_cpucache() to allocate memory for each CPU cache.

  • A CPU cache cannot exist until the various sizes caches have been enabled, so a global variable g_cpucache_up is used to prevent CPU caches from being enabled prematurely.

  • The function enable_all_cpucaches() cycles through all caches in the cache chain and enables their cpucache and is called via kmem_cpucache_init().

  • After the CPU cache has been set up, it can be accessed without locking because a CPU will never access the wrong cpucache, so access to it is guaranteed to be safe.

8.5.4 Updating Per-CPU Information

  • When the per-CPU caches have been created or changed, each CPU is signalled by an IPI. It is not sufficient to simply change all the values in the cache descriptor, because that would lead to cache coherency issues and we'd be back to having to use spinlocks.

  • A struct ccupdate_struct_s (typedef'd to ccupdate_struct_t) is populated with all the information that each CPU needs:

typedef struct ccupdate_struct_s
{
        kmem_cache_t *cachep;
        cpucache_t *new[NR_CPUS];
} ccupdate_struct_t;
  • Looking at each field:
  1. cachep - The cache being updated.

  2. new - An array of the cpucache descriptors for each CPU on the system.

  • smp_call_function_all_cpus() is used to get each CPU to call do_ccupdate_local() which swaps the information from ccupdate_struct_t with the information in the cache descriptor. After the information is swapped, the old data can be deleted.
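
  • The swap itself is tiny - paraphrased from do_ccupdate_local() in mm/slab.c:

static void do_ccupdate_local(void *info)
{
        ccupdate_struct_t *new = (ccupdate_struct_t *)info;
        cpucache_t *old = cc_data(new->cachep);

        /* Each CPU swaps its own cpucache pointer with the new one
         * prepared for it - the old pointer is left in the array for
         * the caller to free. */
        cc_data(new->cachep) = new->new[smp_processor_id()];
        new->new[smp_processor_id()] = old;
}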

8.5.5 Draining a Per-CPU Cache

  • When a cache is being shrunk, the first step that needs to be performed is draining the cpucaches of any objects they might hold, via drain_cpu_caches().

  • This is done so that the slab allocator will have a clearer view of which slabs can be freed or not. It matters because if just one object in a slab is placed in a per-CPU cache then the whole slab is not able to be freed.

  • If the system is tight on memory, saving a few milliseconds on allocations has a low priority.

8.6 Slab Allocator Initialisation

  • When the slab allocator creates a new cache, it allocates the kmem_cache_t from the cache_cache (which appears as 'kmem_cache' in the /proc/slabinfo output above.)

  • This is chicken-and-egg so the cache_cache is statically initialised:

/* internal cache of cache description objs */
static kmem_cache_t cache_cache = {
        slabs_full:     LIST_HEAD_INIT(cache_cache.slabs_full),
        slabs_partial:  LIST_HEAD_INIT(cache_cache.slabs_partial),
        slabs_free:     LIST_HEAD_INIT(cache_cache.slabs_free),
        objsize:        sizeof(kmem_cache_t),
        flags:          SLAB_NO_REAP,
        spinlock:       SPIN_LOCK_UNLOCKED,
        colour_off:     L1_CACHE_BYTES,
        name:           "kmem_cache",
};
  • Looking at the assignment of each field:
  1. slabs_[full|partial|free] - Initialises the 3 lists.

  2. objsize - Set to the size of a cache descriptor.

  3. flags - The creating and deleting of caches is extremely rare, so don't even consider it for reaping.

  4. spinlock - Initialised unlocked.

  5. colour_off - Aligned to the L1 cache.

  6. name - A sane human-readable name.

  • The code statically defines all the fields that can be determined at compile-time, the rest are initialised via kmem_cache_init() which is called from start_kernel().
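
  • The run-time half of the initialisation is small - a sketch, paraphrased from kmem_cache_init():

void __init kmem_cache_init(void)
{
        size_t left_over;

        init_MUTEX(&cache_chain_sem);
        INIT_LIST_HEAD(&cache_chain);

        /* Fill in the fields that depend on run-time values. */
        kmem_cache_estimate(0, cache_cache.objsize, 0,
                            &left_over, &cache_cache.num);
        if (!cache_cache.num)
                BUG();

        cache_cache.colour = left_over / cache_cache.colour_off;
        cache_cache.colour_next = 0;
}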

8.7 Interfacing with the Buddy Allocator

  • The slab allocator uses kmem_getpages() and kmem_freepages() to interact with the physical page allocator to obtain the actual pages of memory it needs.

  • kmem_getpages() and kmem_freepages() are essentially wrappers around the buddy allocator API so that slab flags will be taken into account for allocations.

  • For allocations, the default flags are taken from cachep->gfpflags and the order is taken from cachep->gfporder (assuming cachep is the kmem_cache_t in question.)

  • When freeing pages, PageClearSlab() will be called for every page being freed before calling free_pages().
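
  • Both wrappers are thin - a sketch, paraphrased from mm/slab.c:

/* Allocation ORs in the cache's forced GFP flags (e.g. GFP_DMA) and
 * asks the buddy allocator for 2^gfporder pages. */
static inline void *kmem_getpages(kmem_cache_t *cachep, unsigned long flags)
{
        flags |= cachep->gfpflags;
        return (void *)__get_free_pages(flags, cachep->gfporder);
}

/* Freeing clears the PG_slab bit on every page before handing the
 * block back to the buddy allocator. */
static inline void kmem_freepages(kmem_cache_t *cachep, void *addr)
{
        unsigned long i = (1 << cachep->gfporder);
        struct page *page = virt_to_page(addr);

        while (i--) {
                PageClearSlab(page);
                page++;
        }
        free_pages((unsigned long)addr, cachep->gfporder);
}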