-
The kernel's 'general purpose' allocator is a 'slab allocator'. The basic concept is to have caches of commonly used objects kept in an initialised state, available for use by the kernel.
-
Without an object-based allocator, the kernel would spend a lot of the time allocating, initialising and freeing the same object over and over again - the slab allocator aims to cache freed objects so they can be reused.
-
The slab allocator consists of a variable number of caches linked together on a doubly-linked circular list called a 'cache chain'.
-
In the context of the slab allocator, a cache is a manager for a number of objects of a particular type, e.g. a struct mm_struct cache; each cache is managed by a struct kmem_cache_s.
-
Each cache maintains blocks of contiguous pages in memory called 'slabs' that are carved into small chunks for the objects that the cache manages:
------------- --------- -------------
| lastcache |--------->| cache |--------->| nextcache |
------------- --------- -------------
| | |
| | |
/-------------/ | \--------------\
| | |
v v v
-------------- ----------------- --------------
| slabs_full | | slabs_partial | | slabs_free |
-------------- ----------------- --------------
| | |
| | |
| | |
| | |
v v v
--------- --------- ---------
| slabs | | slabs | | slabs |
--------- --------- ---------
| | |
| | |
| | |
| | |
v v v
--------- --------- ---------
| pages | | pages | | pages |
--------- --------- ---------
| | |
| | |
/-------\ /-------\ /-------\
| | | | | |
v v v v v v
------- ------- ------- ------- ------- -------
| obj | | obj | | obj | | obj | | obj | | obj |
------- ------- ------- ------- ------- -------
- The slab allocator has three principal aims:
-
The allocation of small blocks of memory to help eliminate internal fragmentation that would otherwise be caused by the buddy system.
-
The caching of commonly used objects so the system does not waste time allocating, initialising and destroying objects.
-
Better use of the hardware cache by aligning objects to the L1 or L2 caches.
-
To help eliminate internal fragmentation normally caused by a binary buddy allocator, two sets of caches of small memory buffers ranging from 2^5 (32) bytes to 2^17 (131,072) bytes are maintained.
-
One cache set is suitable for use with DMA devices and the other is not; these are called 'size-N(DMA)' and 'size-N' respectively, where N is the size of the allocation.
-
The function kmalloc() is provided for allocating from these caches (see 8.4.1). Additionally, the sizes of the caches are discussed in further detail in 8.4.
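As a hedged illustration (the structure name and sizes below are invented for the example), a driver-style use of these caches might look like this - a 40-byte request would be satisfied from the size-64 cache, or size-64(DMA) if GFP_DMA is also passed:
#include <linux/slab.h>

/* A hypothetical 40-byte structure - kmalloc() satisfies this from the
 * size-64 cache, the smallest sizes cache of at least 40 bytes. */
struct foo_request {
        int id;
        char payload[36];
};

static struct foo_request *alloc_foo(void)
{
        /* GFP_KERNEL may sleep; GFP_ATOMIC would be needed in interrupt
         * context, and OR-ing in GFP_DMA selects the size-64(DMA) cache. */
        struct foo_request *req = kmalloc(sizeof(*req), GFP_KERNEL);

        if (req)
                req->id = 0;
        return req;
}

static void free_foo(struct foo_request *req)
{
        kfree(req);
}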
-
The slab allocator also maintains caches of commonly used objects - for many structures used in the kernel, the time needed to initialise an object is comparable with, or exceeds, the cost of allocating space for it.
-
When a new slab is created, a number of objects are packed into it and initialised using a constructor if available. When an object is freed, it is left in its initialised state so that object allocation will be quick.
-
The final task of the slab allocator is to optimise the use of hardware caches. In order to do this, 'slab colouring' is performed - if there is space left over after objects are packed into a slab, the remaining space is used to colour the slab.
-
Slab colouring is a scheme that attempts to ensure that objects in different slabs use different lines in the cache. By placing objects at different starting offsets within the slab, they will likely use different lines in the CPU cache, which makes it less likely that objects from the same cache will flush each other.
-
With this scheme, space that would otherwise be wasted is used to fulfil a new function. Diagrammatically:
------------------
| |
| |
| |
| |
| Rest of page |
---------------------- | filled with |
| kmem_getpages() | | objects and |
---------------------- | padding |
| | |
| | | ----------------------
| | | | kmem_cache_alloc() |
v | | ----------------------
---------------------- |----------------| |
| __get_free_pages() | | Colour padding | |
---------------------- |----------------| |
| | |<------------/
| | Initialised | -----------------------
| | Object | | Object allocated to |
v | |---------------->| requesting process |
---------------------- |----------------| ^ -----------------------
| Allocates pages | | Colour padding | |
| of order | |----------------| |
| cachep->gfp_order | | | | L1 CPU
---------------------- | Initialised | | Cache Size
| | Object | |
| | | |
\--------------->|----------------- v
-
Linux does not attempt to colour page allocations based on the physical address, or to order where objects are placed. The details of cache colours are discussed further in 8.1.5.
-
On an SMP system, a further step is taken to help cache utilisation where each cache has a small array of objects reserved for each CPU - this is discussed in more detail in 8.5.
-
The slab allocator allows for slab debugging if CONFIG_SLAB_DEBUG is set. This adds 'red zoning' and 'object poisoning'.
-
Red zoning adds a marker at either end of the object - if it is disturbed, the allocator knows which object encountered a buffer overflow, and reports it.
-
Poisoning an object will fill it with a predefined bit pattern (POISON_BYTE, which is defined as 0x5a) at slab creation and after free. On each allocation this pattern is checked - if it is changed, the allocator knows the object was used before it was allocated, and flags it.
-
Taking a look at the slab allocator's API:
-
kmem_cache_create() - Creates a new cache and adds it to the cache chain.
-
kmem_cache_reap() - Scans at most REAP_SCANLEN caches and selects one for reaping all per-CPU objects and free slabs from. This is called when memory is tight.
-
kmem_cache_shrink() - Deletes all per-CPU objects associated with a cache and deletes all slabs in the kmem_cache_s->slabs_free list. It returns the number of freed pages.
-
kmem_cache_alloc() (and subsequently __kmem_cache_alloc()) - Allocates a single object from the cache and returns it to the caller.
-
kmem_cache_free() - Frees an object and returns it to the cache.
-
kmalloc() - Allocates a block of memory from the appropriate sizes cache.
-
kfree() - Frees a block of memory allocated with kmalloc().
-
kmem_cache_destroy() - Destroys all objects in all slabs and frees up all associated memory before removing the cache from the chain.
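To make the API above concrete, here is a hedged sketch of how a module might use it - the cache name, object type and constructor are invented for illustration and are not taken from the kernel source:
#include <linux/errno.h>
#include <linux/init.h>
#include <linux/list.h>
#include <linux/slab.h>

/* Hypothetical object type to be cached. */
struct my_object {
        struct list_head list;
        int state;
};

static kmem_cache_t *my_cachep;

/* Constructor - invoked for each object when a new slab is grown. */
static void my_ctor(void *obj, kmem_cache_t *cachep, unsigned long flags)
{
        struct my_object *o = obj;

        INIT_LIST_HEAD(&o->list);
        o->state = 0;
}

static int __init my_init(void)
{
        my_cachep = kmem_cache_create("my_object_cache",
                                      sizeof(struct my_object), 0,
                                      SLAB_HWCACHE_ALIGN, my_ctor, NULL);
        return my_cachep ? 0 : -ENOMEM;
}

static void my_use(void)
{
        struct my_object *o = kmem_cache_alloc(my_cachep, SLAB_KERNEL);

        if (o) {
                /* ... use the object, then return it still initialised ... */
                kmem_cache_free(my_cachep, o);
        }
}

static void __exit my_exit(void)
{
        /* Modules must destroy their caches on unload. */
        kmem_cache_destroy(my_cachep);
}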
- One cache exists for each type of object that is to be cached. To see a list of caches available on a running system, /proc/slabinfo can be queried, e.g. (from my local 2.4.22 VM system):
morgoth:~# cat /proc/slabinfo
slabinfo - version: 1.1 (SMP)
kmem_cache 64 64 244 4 4 1 : 252 126
tcp_tw_bucket 0 0 96 0 0 1 : 252 126
tcp_bind_bucket 0 0 32 0 0 1 : 252 126
tcp_open_request 0 0 64 0 0 1 : 252 126
inet_peer_cache 0 0 64 0 0 1 : 252 126
ip_fib_hash 113 113 32 1 1 1 : 252 126
ip_dst_cache 48 48 160 2 2 1 : 252 126
arp_cache 30 30 128 1 1 1 : 252 126
blkdev_requests 3080 3080 96 77 77 1 : 252 126
nfs_write_data 0 0 352 0 0 1 : 124 62
nfs_read_data 0 0 352 0 0 1 : 124 62
nfs_page 0 0 96 0 0 1 : 252 126
journal_head 234 234 48 3 3 1 : 252 126
revoke_table 126 253 12 1 1 1 : 252 126
revoke_record 0 0 32 0 0 1 : 252 126
dnotify_cache 0 0 20 0 0 1 : 252 126
file_lock_cache 80 80 96 2 2 1 : 252 126
fasync_cache 0 0 16 0 0 1 : 252 126
uid_cache 226 226 32 2 2 1 : 252 126
skbuff_head_cache 384 384 160 16 16 1 : 252 126
sock 54 54 832 6 6 2 : 124 62
sigqueue 58 58 132 2 2 1 : 252 126
kiobuf 0 0 64 0 0 1 : 252 126
cdev_cache 118 118 64 2 2 1 : 252 126
bdev_cache 59 59 64 1 1 1 : 252 126
mnt_cache 118 118 64 2 2 1 : 252 126
inode_cache 413 413 512 59 59 1 : 124 62
dentry_cache 570 570 128 19 19 1 : 252 126
filp 150 150 128 5 5 1 : 252 126
names_cache 4 4 4096 4 4 1 : 60 30
buffer_head 2360 2360 96 59 59 1 : 252 126
mm_struct 48 48 160 2 2 1 : 252 126
vm_area_struct 400 400 96 10 10 1 : 252 126
fs_cache 118 118 64 2 2 1 : 252 126
files_cache 36 36 416 4 4 1 : 124 62
signal_act 27 27 1312 9 9 1 : 60 30
size-131072(DMA) 0 0 131072 0 0 32 : 0 0
size-131072 0 0 131072 0 0 32 : 0 0
size-65536(DMA) 0 0 65536 0 0 16 : 0 0
size-65536 0 0 65536 0 0 16 : 0 0
size-32768(DMA) 0 0 32768 0 0 8 : 0 0
size-32768 0 0 32768 0 0 8 : 0 0
size-16384(DMA) 0 0 16384 0 0 4 : 0 0
size-16384 0 0 16384 0 0 4 : 0 0
size-8192(DMA) 0 0 8192 0 0 2 : 0 0
size-8192 4 4 8192 4 4 2 : 0 0
size-4096(DMA) 0 0 4096 0 0 1 : 60 30
size-4096 267 267 4096 267 267 1 : 60 30
size-2048(DMA) 0 0 2048 0 0 1 : 60 30
size-2048 8 8 2048 4 4 1 : 60 30
size-1024(DMA) 0 0 1024 0 0 1 : 124 62
size-1024 108 108 1024 27 27 1 : 124 62
size-512(DMA) 0 0 512 0 0 1 : 124 62
size-512 56 56 512 7 7 1 : 124 62
size-256(DMA) 0 0 256 0 0 1 : 252 126
size-256 105 105 256 7 7 1 : 252 126
size-128(DMA) 0 0 128 0 0 1 : 252 126
size-128 510 510 128 17 17 1 : 252 126
size-64(DMA) 0 0 64 0 0 1 : 252 126
size-64 177 177 64 3 3 1 : 252 126
size-32(DMA) 0 0 32 0 0 1 : 252 126
size-32 565 565 32 5 5 1 : 252 126
- Each of the column fields corresponds to a field in struct kmem_cache_s. Looking at each one in order:
-
cache-name - A human-readable name, e.g. 'tcp_bind_bucket'.
-
num-active-objs - Number of objects that are in use.
-
total-objs - Number of available objects, including unused.
-
obj-size - The size of each object, typically small.
-
num-active-slabs - Number of slabs containing objects that are active.
-
total-slabs - Total number of slabs.
-
num-pages-per-slab - The pages required to create one slab, typically 1.
-
limit (SMP systems, after the colon) - The number of free objects the pool can have before half of it is given to the global free pool.
-
batchcount (SMP systems, after the colon) - The number of objects allocated for the processor in a block when no objects are free.
-
Columns 8 and 9 are SMP-specific (shown after the colon in the /proc/slabinfo output) and refer to the per-CPU cache, which is explored further in 8.5.
-
To speed allocation and freeing of objects and slabs, they are arranged into 3 lists - slabs_full, slabs_partial and slabs_free. Entries in slabs_full have all of their objects in use. Entries in slabs_partial have free objects available, so are prime candidates for object allocation. Finally, slabs_free entries have no allocated objects, so are candidates for slab destruction.
- All information describing a cache is stored in a struct kmem_cache_s. This is a rather large data structure:
struct kmem_cache_s {
/* 1) each alloc & free */
/* full, partial first, then free */
struct list_head slabs_full;
struct list_head slabs_partial;
struct list_head slabs_free;
unsigned int objsize;
unsigned int flags; /* constant flags */
unsigned int num; /* # of objs per slab */
spinlock_t spinlock;
#ifdef CONFIG_SMP
unsigned int batchcount;
#endif
/* 2) slab additions /removals */
/* order of pgs per slab (2^n) */
unsigned int gfporder;
/* force GFP flags, e.g. GFP_DMA */
unsigned int gfpflags;
size_t colour; /* cache colouring range */
unsigned int colour_off; /* colour offset */
unsigned int colour_next; /* cache colouring */
kmem_cache_t *slabp_cache;
unsigned int growing;
unsigned int dflags; /* dynamic flags */
/* constructor func */
void (*ctor)(void *, kmem_cache_t *, unsigned long);
/* de-constructor func */
void (*dtor)(void *, kmem_cache_t *, unsigned long);
unsigned long failures;
/* 3) cache creation/removal */
char name[CACHE_NAMELEN];
struct list_head next;
#ifdef CONFIG_SMP
/* 4) per-cpu data */
cpucache_t *cpudata[NR_CPUS];
#endif
#if STATS
unsigned long num_active;
unsigned long num_allocations;
unsigned long high_mark;
unsigned long grown;
unsigned long reaped;
unsigned long errors;
#ifdef CONFIG_SMP
atomic_t allochit;
atomic_t allocmiss;
atomic_t freehit;
atomic_t freemiss;
#endif
#endif
};
- Looking at each field in turn:
-
slabs_[full, partial, free] - These are the three lists where the slabs are stored, as discussed above.
-
objsize - Size of each object packed into the slab.
-
flags - Flags that determine how parts of the allocator will behave when dealing with the cache. Discussed in 8.1.2.
-
num - This is the number of objects contained in each slab.
-
spinlock - Protects the structure from concurrent accesses.
-
batchcount - Number of objects that will be allocated in batch for the per-CPU caches, as discussed above.
-
gfporder - This indicates the size of the slab in pages. Each slab consumes 2^gfporder pages, because these are the allocation sizes that the buddy allocator provides.
-
gfpflags - The GFP flags used when calling the buddy allocator to allocate pages. See 6.4 for more details.
-
colour - The number of different offsets that can be used for cache colouring (cache colouring is discussed further in 8.1.5).
-
colour_off - The byte alignment to keep slabs at. For example, slabs for the size-X caches are aligned on the L1 cache.
-
colour_next - The next colour line to use. Wraps back to 0 when it reaches colour.
-
slabp_cache - Cache for off-slab slab management objects (see 8.2.1).
-
growing - Set to indicate if the cache is growing or not. If so, it is much less likely that this cache will be selected to reap free slabs under memory pressure.
-
dflags - Dynamic flags that change during cache lifetime (see 8.1.3).
-
ctor - Callback that can be called to initialise each new object; defaults to NULL, meaning no such function is called.
-
dtor - Similar to ctor, but for object destruction.
-
failures - Unused - set to 0 then ignored. We have no failures!
-
name - Human-readable name for the cache.
-
next - struct list_head field for the next cache on the cache chain.
-
cpudata - Per-CPU data. Discussed further in 8.5.
-
num_active - (debug) - The current number of active objects.
-
num_allocations - (debug) - Running total of the number of objects that have been allocated on this cache.
-
high_mark - (debug) - The highest value num_active has had to date.
-
grown - (debug) - The number of times kmem_cache_grow() has been called.
-
reaped - (debug) - Number of times this cache has been reaped.
-
errors - (debug, unused)
-
allochit - (debug) - Total number of times an allocation has used the per-CPU cache.
-
allocmiss - (debug) - Total number of times an allocation has missed the per-CPU cache.
-
freehit - (debug) - Total number of times a freed object was placed on a per-CPU cache.
-
freemiss - (debug) - Total number of times an object was freed and placed in the global pool.
- The statistical fields (num_active, etc. behind the #if STATS) are only available if CONFIG_SLAB_DEBUG is enabled. /proc/slabinfo does not rely on these being present; rather, it calculates the values when the proc entry is read by another process by examining every slab used by each cache.
-
A number of flags are set at cache creation time that remain the same for the lifetime of the cache and affect how the slab is structured and how objects are stored within it.
-
All of the flags are stored in a bitmask in the flags field of the cache descriptor.
-
Looking at each of the internal flags defined in mm/slab.c:
-
CFLGS_OFF_SLAB - Indicates that the slab managers for this cache are kept off-slab. Discussed further in 8.2.1.
-
CFLGS_OPTIMIZE - Only set, but never used. We always optimise, baby!
- Looking at each of the flags that are set by the cache creator:
-
SLAB_HWCACHE_ALIGN - Align objects to the L1 CPU cache.
-
SLAB_MUST_HWCACHE_ALIGN - Force alignment to the L1 CPU cache even if it is very wasteful or slab debugging is enabled.
-
SLAB_NO_REAP - Never reap slabs in this cache.
-
SLAB_CACHE_DMA - Allocate slabs with memory from ZONE_DMA.
- Finally, let's consider some debug flags, which require CONFIG_SLAB_DEBUG to be set:
-
SLAB_DEBUG_FREE - Perform expensive checks on free.
-
SLAB_DEBUG_INITIAL - On free, call the constructor as a verifier to ensure the object is still initialised correctly.
-
SLAB_RED_ZONE - Places a marker at either end of objects to trap overflows.
-
SLAB_POISON - Poison objects with a known pattern for trapping changes made to objects not allocated or initialised (see above).
- To prevent callers from using the wrong flags, a CREATE_MASK is provided that 'masks in' all the allowable flags. When a cache is being created, the requested flags are compared against the CREATE_MASK and reported as a bug if invalid flags are used.
-
The dflags field has only one flag - DFLGS_GROWN - but it is important. It is set during kmem_cache_grow() so that kmem_cache_reap() will be unlikely to choose the cache for reaping.
-
When kmem_cache_reap() finds a cache with DFLGS_GROWN set, it skips it, removing the flag in the process.
- These flags correspond to the GFP page flag options for allocating pages for slabs. Some callers use GFP_... and SLAB_... flags interchangeably; however, this is incorrect and only SLAB_... flags should be used:
SLAB_ATOMIC
SLAB_DMA
SLAB_KERNEL
SLAB_NFS
SLAB_NOFS
SLAB_NOHIGHIO
SLAB_NOIO
SLAB_USER
- A small number of flags may be passed to constructor and destructor functions:
-
SLAB_CTOR_CONSTRUCTOR - Set if the function is being called as a constructor, for caches that use the same function for constructors and destructors.
-
SLAB_CTOR_ATOMIC - Indicates that the constructor may not sleep.
-
SLAB_CTOR_VERIFY - Indicates that the constructor should just verify that the object is initialised properly.
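A hedged sketch of a combined constructor/destructor callback interpreting these flags (the object type and its state field are hypothetical):
struct my_obj {
        int state;
};

static void my_ctor_dtor(void *p, kmem_cache_t *cachep, unsigned long flags)
{
        struct my_obj *o = p;

        if (flags & SLAB_CTOR_VERIFY) {
                /* Only check the object is still correctly initialised. */
                if (o->state != 0)
                        BUG();
                return;
        }

        if (flags & SLAB_CTOR_CONSTRUCTOR) {
                /* Called as a constructor - SLAB_CTOR_ATOMIC would mean
                 * we must not sleep while doing this. */
                o->state = 0;
        } else {
                /* Called as a destructor - nothing to tear down here. */
        }
}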
-
As discussed previously, the slab allocator attempts to make better use of the hardware cache by offsetting objects in different slabs by differing amounts, depending on the amount of space left over in the slab.
-
The offset is in units of BYTES_PER_WORD unless the SLAB_HWCACHE_ALIGN flag is set, in which case it is aligned to blocks of L1_CACHE_BYTES for alignment to the L1 hardware cache.
-
During cache creation the number of objects that can fit on a slab (see 8.2.7) and how many bytes would be wasted are calculated. Based on wastage, 2 figures are calculated for the cache descriptor:
-
colour - The number of different offsets that can be used.
-
colour_off - The multiple to offset each object in the slab.
-
With the objects offset, they will use different lines on the associative hardware cache. This makes it less likely that objects from different slabs will flush one another from the cache.
-
Consider an example - let's say that s_mem (the address of the first object on the slab) is 0, alignment is to 32 bytes for the L1 hardware cache, and 100 bytes are wasted on the slab.
-
In this scenario, the first slab created will have its objects start at 0. The second will start at 32, the third at 64, the fourth at 96 and the fifth will start back at 0.
-
This makes it unlikely objects from each of the slabs will hit the same hardware cache line on the CPU.
-
In this example, colour would be 3 (there are 3 offsets - 32, 64, 96), and colour_off would be 32.
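A minimal sketch of the offset arithmetic, approximating what kmem_cache_grow() does with these fields each time a new slab is created (cachep is assumed to be the cache descriptor):
/* Pick the colour for the new slab and advance colour_next, wrapping
 * when it reaches cachep->colour. */
unsigned int offset = cachep->colour_next;

cachep->colour_next++;
if (cachep->colour_next >= cachep->colour)
        cachep->colour_next = 0;
offset *= cachep->colour_off;   /* offset in bytes from the slab start */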
- The function kmem_cache_create() is responsible for creating new caches and adding them to the cache chain. The tasks that are performed to create a cache are:
-
Perform basic sanity checks to avoid bad usage.
-
Perform debugging checks if CONFIG_SLAB_DEBUG is set.
-
Allocate a kmem_cache_t (typedef'd from a struct kmem_cache_s) from the cache_cache slab cache.
-
Align the object size to the word size.
-
Calculate how many objects will fit on a slab.
-
Align the object to the hardware cache.
-
Calculate colour offsets.
-
Initialise remaining fields in the cache descriptor.
-
Add the new cache to the cache chain.
-
When a slab is freed, it is placed on the slabs_free list for future use. Caches do not automatically shrink themselves, so, when kswapd notices memory is tight, it calls kmem_cache_reap() to free some memory.
-
kmem_cache_reap() is responsible for selecting a cache which will be required to shrink its memory usage.
-
Note that cache reaping does not take into account what memory node or zone is under pressure. This means that with a NUMA or high memory machine it is possible the kernel will spend a lot of time freeing memory from regions that are under no memory pressure. This is not a problem for an architecture like i386 which has only 1 bank of memory.
-
In the event that the system has a large number of caches, only REAP_SCANLEN (10 by default) caches are examined in each call.
-
The last cache to be scanned is stored in the variable clock_searchp to avoid examining the same caches repeatedly.
-
For each scanned cache, the reaper:
-
Checks flags for SLAB_NO_REAP and skips the cache if it is set.
-
If the cache is growing, skip it.
-
If the cache has grown recently, or is currently growing, DFLGS_GROWN will be set. If this flag is set, the slab is skipped, but the flag is cleared so the cache will be a reap candidate the next time round.
-
Count the number of free slabs in the slabs_free field and calculate how many pages reaping this cache would free, storing the result in the local variable pages.
-
If the cache has constructors or large slabs, adjust pages to make it less likely for the cache to be selected.
-
If the number of pages that would be freed exceeds REAP_PERFECT (defaults to 10), free half of the slabs in slabs_free.
-
Otherwise, scan the rest of the caches and select the one that would free the most pages if the function were to free half of the slabs in slabs_free.
- When a cache is selected to shrink itself, the steps it takes are simple and brutal:
-
Delete all objects in the per-CPU caches.
-
Delete all slabs from the slabs_free field unless the growing flag gets set.
-
Linux is nothing, if not subtle :) Two varieties of shrink functions are provided with confusingly similar names - kmem_cache_shrink() and __kmem_cache_shrink().
-
kmem_cache_shrink() removes all slabs from the cache's slabs_free field and returns the number of pages freed as a result - it's the principal shrinking function exported for use by slab allocator users.
-
__kmem_cache_shrink() frees all slabs from the cache's slabs_free field and then verifies that slabs_partial and slabs_full are empty. This is for internal use only, and is used during cache destruction when it doesn't matter how many pages are freed, just that the cache is empty.
-
When a module is unloaded, it is responsible for destroying any caches it created via kmem_cache_destroy(). It is vital that each cache is properly destroyed, because two caches with the same human-readable name are not allowed to exist.
-
Core kernel systems often don't bother to destroy their caches because they last for the entire life of the system.
-
The steps taken to destroy a cache are:
-
Delete the cache from the cache chain.
-
Shrink the cache to delete all slabs.
-
Free any per-CPU caches via kfree().
-
Delete the cache descriptor from the cache_cache.
- struct slab_s (typedef'd to slab_t) describes a slab. It is much simpler than the cache descriptor (though the way it's arranged is considerably more complex):
/*
* slab_t
*
* Manages the objs in a slab. Placed either at the beginning of mem allocated
* for a slab, or allocated from an general cache.
* Slabs are chained into three list: fully used, partial, fully free slabs.
*/
typedef struct slab_s {
struct list_head list;
unsigned long colouroff;
void *s_mem; /* including colour offset */
unsigned int inuse; /* num of objs active in slab */
kmem_bufctl_t free;
} slab_t;
- Looking at each field:
-
list - The linked list the slab belongs to - it'll be one of slabs_[full|partial|free] from the cache manager.
-
colouroff - The colour offset of s_mem from the base address of the pages the slab occupies (the struct comment notes that s_mem already includes the colour offset), so the slab's pages start at s_mem - colouroff.
-
s_mem - The starting address of the first object within the slab.
-
inuse - The number of active objects in the slab.
-
free - The index of the next free object in the array of kmem_bufctl_t's (kmem_bufctl_t is typedef'd to unsigned int) that starts after the end of slab_t (slab_t is always kept at the start of a page frame so there is space for more data afterwards). Discussed further in 8.2.3.
-
There is no obvious way to determine which cache a slab belongs to or vice-versa. It turns out that we can use the list field of the struct page (see 2.5 - the list field is a generic field that can be used for different purposes, which is quite surprising :)
-
In order to set/get the page's cache/slab you can use the SET_PAGE_CACHE(), SET_PAGE_SLAB(), GET_PAGE_CACHE(), and GET_PAGE_SLAB() macros, which simply place caches in the next field of the list and slabs in the prev field:
/* Macros for storing/retrieving the cachep and or slab from the
* global 'mem_map'. These are used to find the slab an obj belongs to.
* With kfree(), these are used to find the cache which an obj belongs to.
*/
#define SET_PAGE_CACHE(pg,x) ((pg)->list.next = (struct list_head *)(x))
#define GET_PAGE_CACHE(pg) ((kmem_cache_t *)(pg)->list.next)
#define SET_PAGE_SLAB(pg,x) ((pg)->list.prev = (struct list_head *)(x))
#define GET_PAGE_SLAB(pg) ((slab_t *)(pg)->list.prev)
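A hedged sketch of how these are used - when kmem_cache_grow() links pages to their slab, each struct page is tagged approximately as below, and kfree()/kmem_cache_free_one() can later recover the owners (cachep, slabp, slab_base and objp are assumed to be in scope):
/* Tag each page backing the new slab with its cache and slab. */
struct page *page = virt_to_page(slab_base);
int i = 1 << cachep->gfporder;

do {
        SET_PAGE_CACHE(page, cachep);
        SET_PAGE_SLAB(page, slabp);
        PageSetSlab(page);      /* set the PG_slab bit */
        page++;
} while (--i);

/* Later, any object pointer can be mapped back to its owners: */
kmem_cache_t *owner_cache = GET_PAGE_CACHE(virt_to_page(objp));
slab_t *owner_slab = GET_PAGE_SLAB(virt_to_page(objp));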
- Diagrammatically (for a case where a slab is on the slabs_partial list):
struct kmem_cache_s (=kmem_cache_t)
/-->-----------------------------------
| | struct list_head *slabs_full |
| |---------------------------------|
| | struct list_head *slabs_partial |-------------\
| |---------------------------------| |
| / / v
| \ ... \ ---------------
| / / | Linked list | More slab_t's...
| ----------------------------------- | of slab_t |- - >
| ---------------
| |
| |
| |
| struct page v
| next ------------------------- prev ----------
\--------| struct list_head list |------------>| slab_t |
|-----------------------| ----------
/ /
\ ... \
/ /
|-----------------------|
| flags with PG_slab |
| bit set |
|-----------------------|
/ /
\ ... \
/ /
-------------------------
-
The last issue is where the slab management data structure slab_t is kept. It is either kept on-slab or off-slab depending on whether CFLGS_OFF_SLAB is set in the static flags.
-
The size of the object determines whether slab_t is stored on or off-slab. In the above diagram it's possible for slab_t to be kept at the beginning of the page frame - though the diagram implies it is kept separate, this is not necessarily the case.
-
If the objects are larger than a threshold (PAGE_SIZE>>3, so 512 bytes on i386), CFLGS_OFF_SLAB is set in the cache flags and the slab descriptor/manager/management data structure is kept off-slab in one of the 'sizes caches' (see 8.4).
-
The selected cache will be large enough to contain the slab_t and kmem_cache_slabmgmt() allocates from it as necessary.
-
Keeping the slab_t off-slab limits the number of objects that can be stored on the slab because there is limited space for the kmem_bufctl_t's (see 8.2.3) - however, that shouldn't matter because the objects are large, so there should be fewer stored on a single slab.
-
If the slab manager is kept on-slab, it is reserved at the beginning of the slab. Enough space is kept to store the slab_t and the kmem_bufctl_ts.
-
The kmem_bufctl_ts are an array of unsigned integers used for tracking the index of the next free object available for use; see 8.2.3 for more details.
-
The actual objects in the slab are stored after the kmem_bufctl_t array.
-
A slab with on-slab descriptor, diagrammatically (again, assuming the slab we're looking at lives in slabs_partial):
struct kmem_cache_s (=kmem_cache_t)
-----------------------------------
| struct list_head *slabs_full |
|---------------------------------|
| struct list_head *slabs_partial |-------------\
|---------------------------------| |
/ / v
\ ... \ ---------------
/ / | Linked list | More slab_t's...
----------------------------------- | of slab_t | - - >
---------------
|
|
/--------------------------------------------/
|
|
v Page frame
---------------------------
| struct list_head list |
|-------------------------|
| unsigned long colouroff |
|-------------------------|
/--| void *s_mem |
| |-------------------------|
| | unsigned int inuse |
| |-------------------------|
| | kmem_bufctl_t free |
| |-------------------------|
| / /
| \ kmem_bufctl_t array... \
| / /
\->|-------------------------|
| Object |
|-------------------------|
| Object |
|-------------------------|
| Object |
|-------------------------|
/ /
\ More objects... \
/ /
---------------------------
- A slab with off-slab descriptor, diagrammatically (again, assuming the slab we're looking at lives in slabs_partial):
struct kmem_cache_s (=kmem_cache_t)
-----------------------------------
| struct list_head *slabs_full |
|---------------------------------|
| struct list_head *slabs_partial |-------------\
|---------------------------------| |
/ / v
\ ... \ ---------------
/ / | Linked list | More slab_t's...
----------------------------------- | of slab_t | - - >
---------------
|
|
/--------------------------------------------/
|
|
v Page frame for slab_t Page frame for objects
--------------------------- /------------->---------------------------
| struct list_head list | | | Object |
|-------------------------| | |-------------------------|
| unsigned long colouroff | | | Object |
|-------------------------| | |-------------------------|
| void *s_mem |---------------/ | Object |
|-------------------------| |-------------------------|
| unsigned int inuse | / /
|-------------------------| \ More objects... \
| kmem_bufctl_t free | / /
|-------------------------| ---------------------------
/ /
\ kmem_bufctl_t array... \
/ /
---------------------------
-
So far we've seen how the cache is created, but not how the slabs_full, slabs_partial and slabs_free lists are grown, as they all start empty - i.e., how slabs are allocated.
-
New slabs are allocated to a cache via kmem_cache_grow(); this is referred to as 'cache growing'.
-
Cache growing occurs when no objects are left in the slabs_partial list and when there are no slabs in slabs_free.
-
kmem_cache_grow() does the following:
-
Performs basic sanity checks to guard against bad usage.
-
Calculates the colour offset for objects in the slab.
-
Allocates memory for the slab and acquires a slab descriptor.
-
Links the pages used for the slab to the slab and cache descriptors (see 8.2.)
-
Initialises objects in the slab.
-
Adds the slab to the cache.
-
The slab allocator needs to have a quick and simple means of tracking where free objects are located on the partially-filled slabs. This is achieved by using an array of unsigned integers of type kmem_bufctl_t that is associated with each 'slab manager'.
-
The number of objects in the array is the same as the number of objects on the slab.
-
The array of kmem_bufctl_ts is kept after the slab descriptor, in the same page frame. In order to make life easier there is a macro which gives access to the array:
#define slab_bufctl(slabp) \
((kmem_bufctl_t *)(((slab_t*)slabp)+1))
- The free field of the slab management structure contains the index of the next free object in the slab. When objects are allocated or freed, this is updated based on information in the kmem_bufctl_t array.
-
When a cache is grown, all the objects and the kmem_bufctl_t array on the slab are initialised.
-
The array is filled with the index of each object, beginning with 1 and ending with the marker BUFCTL_END, e.g. for a slab with five objects:
------------------------------
| 1 | 2 | 3 | 4 | BUFCTL_END |
------------------------------
-
The value 0 is stored in slab_t->free because the 0th object is the first free object to be used. The idea here is that for a given object i, the index of the next free object will be stored in slab_bufctl(slabp)[i].
-
Using the array this way makes it act like a LIFO (stack) for free objects.
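Approximately how the array and free index are set up when a slab is initialised (mirroring the tail of kmem_cache_init_objs(); cachep and slabp are assumed to be in scope):
int i;

/* Thread a free list through the array: entry i holds the index of the
 * next free object after object i. */
for (i = 0; i < cachep->num; i++)
        slab_bufctl(slabp)[i] = i + 1;
slab_bufctl(slabp)[cachep->num - 1] = BUFCTL_END;
slabp->free = 0;        /* object 0 is the first to be handed out */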
-
When allocating an object, kmem_cache_alloc() updates the kmem_bufctl_t array by calling kmem_cache_alloc_one_tail().
-
As discussed in 8.2.4, slab_t->free contains the index of the first free object and the index of the next free object is at slab_bufctl(slabp)[slab_t->free]. The code for navigating this (taken from kmem_cache_alloc_one_tail()) looks like:
void *objp;
...
objp = slabp->s_mem + slabp->free*cachep->objsize;
slabp->free=slab_bufctl(slabp)[slabp->free];
-
This is in effect 'popping' an index off the stack.
-
Note that the kmem_bufctl_t array is not written to when an object is allocated - only slab_t->free is updated. The array entry belonging to the object just allocated simply goes stale until the object is freed, at which point it is overwritten (see below).
- The kmem_bufctl_t array is only updated when an object is freed, via kmem_cache_free_one():
unsigned int objnr = (objp-slabp->s_mem)/cachep->objsize;
slab_bufctl(slabp)[objnr] = slabp->free;
slabp->free = objnr;
- This is in effect 'pushing' an index back on to the stack.
-
During cache creation kmem_cache_estimate() is called to calculate how many objects may be stored on a single slab, taking into account whether the slab descriptor is to be stored on or off-slab and the size taken up by each kmem_bufctl_t needed to track whether an object is free or not.
-
kmem_cache_estimate() returns the number of objects that may be stored and how many bytes are wasted. The number of wasted bytes is important to know for cache colouring.
-
The calculation is fairly straightforward:
/* Cal the num objs, wastage, and bytes left over for a given slab size. */
static void kmem_cache_estimate (unsigned long gfporder, size_t size,
int flags, size_t *left_over, unsigned int *num)
{
int i;
size_t wastage = PAGE_SIZE<<gfporder;
size_t extra = 0;
size_t base = 0;
if (!(flags & CFLGS_OFF_SLAB)) {
base = sizeof(slab_t);
extra = sizeof(kmem_bufctl_t);
}
i = 0;
while (i*size + L1_CACHE_ALIGN(base+i*extra) <= wastage)
i++;
if (i > 0)
i--;
if (i > SLAB_LIMIT)
i = SLAB_LIMIT;
*num = i;
wastage -= i*size;
wastage -= L1_CACHE_ALIGN(base+i*extra);
*left_over = wastage;
}
- Step-by-step:
-
Initialise wastage to the total size of the slab, i.e. PAGE_SIZE << gfporder bytes (2^gfporder pages).
-
Subtract the amount of space needed to store the slab descriptor, if this is kept on-slab.
-
Count the number of objects that may be stored, taking into account the kmem_bufctl_t required for each object if kept on-slab. Keep incrementing i until the slab is filled.
-
Return the number of objects and bytes wasted.
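As a worked example (assuming a 4,096-byte page, 32-byte L1 cache lines, a 96-byte object and an on-slab descriptor, with sizeof(slab_t) taken to be 24 bytes and sizeof(kmem_bufctl_t) 4 bytes on i386):
wastage = PAGE_SIZE << 0 = 4096

Largest i with i*96 + L1_CACHE_ALIGN(24 + i*4) <= 4096 is i = 40:
    40*96 + L1_CACHE_ALIGN(184) = 3840 + 192 = 4032 <= 4096
    41*96 + L1_CACHE_ALIGN(188) = 3936 + 192 = 4128 >  4096

*num       = 40 objects per slab
*left_over = 4096 - 3840 - 192 = 64 bytes (so colour = 64/32 = 2)
Under these assumptions this matches the vm_area_struct line in the /proc/slabinfo output above: 400 objects of 96 bytes spread across 10 single-page slabs, i.e. 40 objects per slab.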
-
When a cache is being shrunk or destroyed, the slabs will be deleted and if the objects have destructors these must be called.
-
For example, kmem_slab_destroy() performs the following steps:
-
Call the destructors for every object in the slab.
-
If debugging is enabled - check the red marking and poison pattern.
-
Free the pages the slab uses.
- Most of the really hard work of the slab allocator is handled by the cache and slab managers. Now we take a look at how objects themselves are managed.
-
When a slab is created, all the objects in it are put in an initialised state. If a constructor is available, it is called for each object and it is expected that objects are left in an initialised state upon free.
-
Conceptually, the initialisation is very simple - cycle through each object, call its constructor and initialise the kmem_bufctl for it. This is performed by kmem_cache_init_objs().
- kmem_cache_alloc() (and thus __kmem_cache_alloc()) is responsible for allocating a single object from the specified cache:
static inline void * __kmem_cache_alloc (kmem_cache_t *cachep, int flags)
{
unsigned long save_flags;
void* objp;
kmem_cache_alloc_head(cachep, flags);
try_again:
local_irq_save(save_flags);
#ifdef CONFIG_SMP
{
cpucache_t *cc = cc_data(cachep);
if (cc) {
if (cc->avail) {
STATS_INC_ALLOCHIT(cachep);
objp = cc_entry(cc)[--cc->avail];
} else {
STATS_INC_ALLOCMISS(cachep);
objp = kmem_cache_alloc_batch(cachep,cc,flags);
if (!objp)
goto alloc_new_slab_nolock;
}
} else {
spin_lock(&cachep->spinlock);
objp = kmem_cache_alloc_one(cachep);
spin_unlock(&cachep->spinlock);
}
}
#else
objp = kmem_cache_alloc_one(cachep);
#endif
local_irq_restore(save_flags);
return objp;
alloc_new_slab:
#ifdef CONFIG_SMP
spin_unlock(&cachep->spinlock);
alloc_new_slab_nolock:
#endif
local_irq_restore(save_flags);
if (kmem_cache_grow(cachep, flags))
/* Someone may have stolen our objs. Doesn't matter, we'll
* just come back here again.
*/
goto try_again;
return NULL;
}
- This function consists of 4 basic steps along with a 5th if the system is SMP:
-
Calls kmem_cache_alloc_head() to perform some basic sanity checks (i.e. whether the allocation is allowable.)
-
(SMP) If there is an object available on the per-CPU cache, return it; otherwise allocate cachep->batchcount objects in bulk and put them on the per-CPU cache (see 8.5 for more details on per-CPU caches).
-
Select which slab to allocate from - either slabs_partial or slabs_free.
-
If slabs_free is chosen and has no available slabs, grow the cache (see 8.2.2).
-
Allocate the object from the selected slab.
-
kmem_cache_free() frees objects and is relatively straightforward - differing only depending on whether the system is SMP or UP.
-
If the system is SMP, the object is returned to the per-CPU cache.
-
Regardless of whether the system is SMP or not, the destructor for an object will be called if one is available. It is responsible for returning the object to its initialised state.
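A hedged sketch of that SMP/UP split, approximating __kmem_cache_free() with the interrupt disabling, debug checks and destructor handling omitted (free_block() is the mm/slab.c helper that returns a batch of objects to their slabs):
static inline void __kmem_cache_free(kmem_cache_t *cachep, void *objp)
{
#ifdef CONFIG_SMP
        cpucache_t *cc = cc_data(cachep);

        if (cc) {
                if (cc->avail < cc->limit) {
                        /* Room in the per-CPU cache - just stash the object. */
                        cc_entry(cc)[cc->avail++] = objp;
                        return;
                }
                /* Per-CPU cache is full - return a batch of objects to the
                 * global pool (the slab lists) before stashing this one. */
                cc->avail -= cachep->batchcount;
                free_block(cachep, &cc_entry(cc)[cc->avail], cachep->batchcount);
                cc_entry(cc)[cc->avail++] = objp;
        } else {
                free_block(cachep, &objp, 1);
        }
#else
        /* UP - push the object straight back onto its slab. */
        kmem_cache_free_one(cachep, objp);
#endif
}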
-
Linux keeps two sets of slab caches for small memory allocations for which the physical page allocator is unsuitable.
-
One set is for use with DMA, and the other is suitable for normal use. The human-readable names for these caches are 'size-N cache' and 'size-N(DMA) cache', which are viewable from /proc/slabinfo.
-
Each sized cache is stored in a struct cache_sizes (typedef'd to cache_sizes_t):
/* Size description struct for general caches. */
typedef struct cache_sizes {
size_t cs_size;
kmem_cache_t *cs_cachep;
kmem_cache_t *cs_dmacachep;
} cache_sizes_t;
- Looking at each field:
-
cs_size - The size of the memory block.
-
cs_cachep - The cache of blocks for normal memory use.
-
cs_dmacachep - The cache of blocks for use with DMA.
- Because a limited number of these caches exist, a null-terminated static array, cache_sizes, is initialised at compile time, starting with 32 bytes on a 4KiB machine or 64 bytes for those systems with larger page sizes, and reaching 2^17 (131,072 bytes, i.e. 128KiB):
static cache_sizes_t cache_sizes[] = {
#if PAGE_SIZE == 4096
{ 32, NULL, NULL},
#endif
{ 64, NULL, NULL},
{ 128, NULL, NULL},
{ 256, NULL, NULL},
{ 512, NULL, NULL},
{ 1024, NULL, NULL},
{ 2048, NULL, NULL},
{ 4096, NULL, NULL},
{ 8192, NULL, NULL},
{ 16384, NULL, NULL},
{ 32768, NULL, NULL},
{ 65536, NULL, NULL},
{131072, NULL, NULL},
{ 0, NULL, NULL}
};
- This array is initialised via kmem_cache_sizes_init() at startup.
-
Given the existence of this cache, the slab allocator is now able to offer a new function, kmalloc(), for use when small memory buffers are required.
-
When a request is received, the appropriate sizes cache is selected and an object is assigned from it.
-
The hard work was done in object/cache/slab allocation, so kmalloc() itself is fairly simple.
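A hedged sketch of kmalloc() along those lines - walk cache_sizes[] until a cache at least as big as the request is found, then allocate an object from it (or from its DMA twin if GFP_DMA is set in the flags):
void *kmalloc(size_t size, int flags)
{
        cache_sizes_t *csizep = cache_sizes;

        for (; csizep->cs_size; csizep++) {
                if (size > csizep->cs_size)
                        continue;
                /* Allocate an object from the first cache big enough. */
                return __kmem_cache_alloc(flags & GFP_DMA ?
                        csizep->cs_dmacachep : csizep->cs_cachep, flags);
        }
        return NULL;    /* request larger than the biggest sizes cache */
}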
-
Since there's a kmalloc() there needs to be an equivalent free function, kfree().
-
As with kmalloc(), the hard work is done in object/cache/slab freeing, so the kfree() function itself is fairly simple too.
-
One of the tasks of the slab allocator is to improve hardware cache use. One important aspect of doing this is to try to use data on the same CPU for as long as possible.
-
Linux achieves this by trying to keep objects in the same CPU cache via a per-CPU object cache (known as a 'cpucache') for each CPU in the system.
-
When allocating or freeing objects on SMP systems, they are taken from/placed in the cpucache. When no objects are free on allocation (see 8.3.2), a batch of objects is placed into the pool.
-
When the cpucache pool gets too large, half of them are removed and placed into the global cache. This ensures the hardware cache will be used as long as is possible on the same CPU.
-
Another major benefit is that spinlocks do not have to be held when accessing the CPU pool because we are guaranteed that another CPU won't access the local data. This is important because without the caches, spinlocks would have to be acquired for every allocation and free which would be very expensive.
-
Each cache descriptor has a pointer to an array of cpucaches (struct cpucache_s, typedef'd to cpucache_t), described in the cache descriptor as cpucache_t *cpudata[NR_CPUS]; (where NR_CPUS is the maximum number of CPUs a system might have).
-
The cpucache_t structure is simple:
typedef struct cpucache_s {
unsigned int avail;
unsigned int limit;
} cpucache_t;
- Looking at each field:
-
avail - The number of free objects available on this cpucache.
-
limit - The total number of free objects that can exist.
- A helper macro cc_data() is available to retrieve the cpucache for a given cache and processor:
#define cc_data(cachep) \
((cachep)->cpudata[smp_processor_id()])
-
Given a cache descriptor cachep, this returns the appropriate pointer from the cpucache array cpudata via smp_processor_id().
-
Pointers to the objects held in the cpucache are placed immediately after the cpucache_t struct, similar to how objects are stored after a slab descriptor in the on-slab case.
-
To prevent fragmentation, objects are always added or removed from the end of the array.
-
cc_entry() returns a pointer to the first object in the cpucache. It is similar in operation to slab_bufctl():
#define cc_entry(cpucache) \
((void **)(((cpucache_t*)(cpucache))+1))
- To use cc_entry() to add an object to the CPU cache (cc):
void *objp;
...
cc_entry(cc)[cc->avail++] = objp;
- To use cc_entry() to remove an object from the CPU cache (cc):
void *objp;
...
objp = cc_entry(cc)[--cc->avail];
-
When a cache is created, its CPU cache has to be enabled and memory has to be allocated to it via kmalloc(); this is performed by enable_cpucache().
-
enable_cpucache() calls kmem_tune_cpucache() to allocate memory for each CPU cache.
-
A CPU cache cannot exist until the various sizes caches have been enabled, so a global variable g_cpucache_up is used to prevent CPU caches from being enabled prematurely.
-
The function enable_all_cpucaches() cycles through all caches in the cache chain and enables their cpucache and is called via kmem_cpucache_init().
-
After the CPU cache has been set up it can be accessed without locking because a CPU will never access the wrong cpucache, so access to it is guaranteed to be safe.
-
When the per-CPU caches have been created or changed, each CPU is signalled by an IPI. It's not sufficient to change all the values in the cache descriptor because that would lead to cache coherency issues and we'd be back to having to use spinlocks.
-
A struct ccupdate_struct_s (typedef'd to ccupdate_struct_t) is populated with all the information that each CPU needs:
typedef struct ccupdate_struct_s
{
kmem_cache_t *cachep;
cpucache_t *new[NR_CPUS];
} ccupdate_struct_t;
- Looking at each field:
-
cachep - The cache being updated.
-
new - An array of the cpucache descriptors for each CPU on the system.
- smp_call_function_all_cpus() is used to get each CPU to call do_ccupdate_local(), which swaps the information from the ccupdate_struct_t with the information in the cache descriptor. After the information is swapped, the old data can be deleted.
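A hedged sketch of that per-CPU swap, approximating do_ccupdate_local() - the old cpucache pointer is left behind in the ccupdate_struct_t so the caller can free it afterwards:
static void do_ccupdate_local(void *info)
{
        ccupdate_struct_t *new = (ccupdate_struct_t *)info;
        cpucache_t *old = cc_data(new->cachep);

        /* Install the new cpucache for this CPU and hand back the old one. */
        cc_data(new->cachep) = new->new[smp_processor_id()];
        new->new[smp_processor_id()] = old;
}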
-
When a cache is being shrunk, the first step that needs to be performed is draining the cpucaches of any objects they might hold, via drain_cpu_caches().
-
This is done so that the slab allocator will have a clearer view of which slabs can be freed or not. It matters because if just one object in a slab is placed in a per-CPU cache, then the whole slab cannot be freed.
-
If the system is tight on memory, saving a few milliseconds on allocations has a low priority.
-
When the slab allocator creates a new cache, it allocates the kmem_cache_t from the cache_cache (whose human-readable name is 'kmem_cache').
-
This is a chicken-and-egg problem, so the cache_cache is statically initialised:
/* internal cache of cache description objs */
static kmem_cache_t cache_cache = {
slabs_full: LIST_HEAD_INIT(cache_cache.slabs_full),
slabs_partial: LIST_HEAD_INIT(cache_cache.slabs_partial),
slabs_free: LIST_HEAD_INIT(cache_cache.slabs_free),
objsize: sizeof(kmem_cache_t),
flags: SLAB_NO_REAP,
spinlock: SPIN_LOCK_UNLOCKED,
colour_off: L1_CACHE_BYTES,
name: "kmem_cache",
};
- Looking at the assignment of each field:
-
slabs_[full|partial|free] - Initialises the 3 lists.
-
objsize - Set to the size of a cache descriptor.
-
flags - The creating and deleting of caches is extremely rare, so don't even consider it for reaping.
-
spinlock - Initialised unlocked.
-
colour_off - Aligned to the L1 cache.
-
name - A sane human-readable name.
- The code statically defines all the fields that can be determined at compile-time, the rest are initialised via kmem_cache_init() which is called from start_kernel().
-
The slab allocator uses kmem_getpages() and kmem_freepages() to interact with the physical page allocator to obtain the actual pages of memory it needs.
-
kmem_getpages() and kmem_freepages() are essentially wrappers around the buddy allocator API so that slab flags will be taken into account for allocations.
-
For allocations, the default flags are taken from cachep->gfpflags and the order is taken from cachep->gfporder (assuming cachep is the kmem_cache_t in question).
-
When freeing pages, PageClearSlab() will be called for every page being freed before calling free_pages().
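Hedged sketches of the two wrappers, approximating their definitions in mm/slab.c:
static inline void *kmem_getpages(kmem_cache_t *cachep, unsigned long flags)
{
        /* Combine the caller's flags with the cache's own GFP flags
         * (e.g. GFP_DMA for the size-N(DMA) caches). */
        flags |= cachep->gfpflags;
        return (void *)__get_free_pages(flags, cachep->gfporder);
}

static inline void kmem_freepages(kmem_cache_t *cachep, void *addr)
{
        unsigned long i = (1 << cachep->gfporder);
        struct page *page = virt_to_page(addr);

        /* Clear the PG_slab bit on each page before handing them back
         * to the buddy allocator. */
        while (i--) {
                PageClearSlab(page);
                page++;
        }
        free_pages((unsigned long)addr, cachep->gfporder);
}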