- NUMA - Non-Uniform Memory Access. Memory is arranged into banks which incur a different access cost depending on their distance from the processor.
- Each of these banks is called a 'node', represented by `struct pglist_data` even if the arch is UMA.
- The struct is always referenced by a typedef, `pg_data_t`.
- Every node is kept on a `NULL`-terminated linked list, `pgdat_list`, and linked by `pg_data_t->node_next`.
- On UMA arches, only one `pg_data_t` structure called `contig_page_data` is used.
- Each node is divided into blocks called zones, which represent ranges of memory and are described by `struct zone_struct`, typedef'd to `zone_t` - one of `ZONE_DMA`, `ZONE_NORMAL` or `ZONE_HIGHMEM`.
- `ZONE_DMA` is kept within the lower physical memory ranges that certain ISA devices need.
- `ZONE_NORMAL` memory is directly mapped into the upper region of the linear address space.
- `ZONE_HIGHMEM` is what is left.
- In a 32-bit kernel the mappings are:
  - `ZONE_DMA` - First 16MiB of memory
  - `ZONE_NORMAL` - 16MiB - 896MiB
  - `ZONE_HIGHMEM` - 896MiB - End
- Many kernel operations can only take place in `ZONE_NORMAL`, making it the most performance-critical zone.
- Memory is divided into fixed-size chunks called page frames, represented by `struct page` (typedef'd to `mem_map_t`), all of which are kept in a global `mem_map` array, usually stored at the beginning of `ZONE_NORMAL` or, on low-memory machines, just after the area reserved for the loaded kernel image (`mem_map_t` is a convenient name for accessing elements of this array.)
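As a rough illustration (the helper names here are made up for the example, not kernel functions), on a UMA x86 system a page frame number (PFN) is simply an index into `mem_map`:

```c
/* Illustrative sketch only - pfn_to_struct_page() and struct_page_to_pfn()
 * are hypothetical helpers. On a UMA x86 system the global mem_map array
 * describes physical memory starting at frame 0, so a PFN is just an index. */
static struct page *pfn_to_struct_page(unsigned long pfn)
{
        return &mem_map[pfn];        /* PFN indexes mem_map directly */
}

static unsigned long struct_page_to_pfn(struct page *page)
{
        return page - mem_map;       /* pointer arithmetic recovers the PFN */
}
```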
- Because the amount of memory directly accessible by the kernel (`ZONE_NORMAL`) is limited in size, Linux has the concept of high memory.
- Each node is described by a `pg_data_t`, which is a typedef for `struct pglist_data`:

```c
typedef struct pglist_data {
zone_t node_zones[MAX_NR_ZONES];
zonelist_t node_zonelists[GFP_ZONEMASK+1];
int nr_zones;
struct page *node_mem_map;
unsigned long *valid_addr_bitmap;
struct bootmem_data *bdata;
unsigned long node_start_paddr;
unsigned long node_start_mapnr;
unsigned long node_size;
int node_id;
struct pglist_data *node_next;
} pg_data_t;
```

- `node_zones` - `ZONE_HIGHMEM`, `ZONE_NORMAL`, `ZONE_DMA`.
- `node_zonelists` - The order of zones that allocations are preferred from. build_zonelists() in `mm/page_alloc.c` sets up the order, called by free_area_init_core(). A failed allocation in `ZONE_HIGHMEM` may fall back to `ZONE_NORMAL` or further back to `ZONE_DMA`.
- `nr_zones` - Number of zones in this node, between 1 and 3 (not all nodes will have all zones.)
- `node_mem_map` - First page of the `struct page` array that represents each physical frame in the node. It will be placed somewhere within the global `mem_map` array.
- `valid_addr_bitmap` - Used by sparc/sparc64.
- `bdata` - Boot memory allocator information only.
- `node_start_paddr` - Starting physical address of the node.
- `node_start_mapnr` - Page offset within the global `mem_map`. Calculated in free_area_init_core() by determining the number of pages between `mem_map` and the local `mem_map` for this node, called `lmem_map` (see the sketch after this list.)
- `node_size` - Total number of pages in this node.
- `node_id` - Node ID (NID) of the node, starting at 0.
- `node_next` - Pointer to the next node; `NULL`-terminated list.
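A sketch of the `node_start_mapnr` calculation described above - roughly what free_area_init_core() does, shown standalone here purely for illustration:

```c
/* Offset of this node's local map (lmem_map) within the global mem_map,
 * in pages - i.e. the node's starting page number within the global
 * struct page array. */
pgdat->node_start_mapnr = lmem_map - mem_map;
```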
- Nodes are maintained on a list called `pgdat_list`. Nodes are placed on the list as they are initialised by the init_bootmem_core() function. They can be iterated over using for_each_pgdat(), e.g.:
```c
pg_data_t *pgdat;

for_each_pgdat(pgdat)
        pr_debug("node %d: size=%lu\n", pgdat->node_id, pgdat->node_size);
```
- Each zone is described by a `struct zone_struct` (typedef'd to `zone_t`):

```c
typedef struct zone_struct {
/*
* Commonly accessed fields:
*/
spinlock_t lock;
unsigned long free_pages;
unsigned long pages_min, pages_low, pages_high;
int need_balance;
/*
* free areas of different sizes
*/
free_area_t free_area[MAX_ORDER];
/*
* wait_table -- the array holding the hash table
* wait_table_size -- the size of the hash table array
* wait_table_shift -- wait_table_size
* == BITS_PER_LONG (1 << wait_table_bits)
*
* The purpose of all these is to keep track of the people
* waiting for a page to become available and make them
* runnable again when possible. The trouble is that this
* consumes a lot of space, especially when so few things
* wait on pages at a given time. So instead of using
* per-page waitqueues, we use a waitqueue hash table.
*
* The bucket discipline is to sleep on the same queue when
* colliding and wake all in that wait queue when removing.
* When something wakes, it must check to be sure its page is
* truly available, a la thundering herd. The cost of a
* collision is great, but given the expected load of the
* table, they should be so rare as to be outweighed by the
* benefits from the saved space.
*
* __wait_on_page() and unlock_page() in mm/filemap.c, are the
* primary users of these fields, and in mm/page_alloc.c
* free_area_init_core() performs the initialization of them.
*/
wait_queue_head_t * wait_table;
unsigned long wait_table_size;
unsigned long wait_table_shift;
/*
* Discontig memory support fields.
*/
struct pglist_data *zone_pgdat;
struct page *zone_mem_map;
unsigned long zone_start_paddr;
unsigned long zone_start_mapnr;
/*
* rarely used fields:
*/
char *name;
unsigned long size;
} zone_t;
```

- `lock` - Spinlock which protects the zone from concurrent accesses.
- `free_pages` - Total number of free pages in the zone.
- `pages_min`, `pages_low`, `pages_high` - Watermarks. If `free_pages < pages_low`, `kswapd` is woken up and swaps pages out asynchronously. If page consumption doesn't slow down quickly enough from this, `kswapd` switches into a mode where pages are freed synchronously in order to return the system to health (see 2.2.1.)
- `need_balance` - Indicates to `kswapd` that it needs to balance the zone, i.e. `free_pages` has hit one of the watermarks.
- `free_area` - Free area bitmaps used by the buddy allocator.
- `wait_table` - Hash table of wait queues of processes waiting on a page to be freed. This is meaningful to wait_on_page() and unlock_page(). A 'wait table' is used because, if processes all waited on a single queue, there'd be a big race between processes for pages which are locked on wake up (known as a 'thundering herd'.)
- `wait_table_size` - Number of queues in the hash table (a power of 2.)
- `wait_table_shift` - The number of bits in a long minus the binary logarithm of `wait_table_size`.
- `zone_pgdat` - Points to the parent `pg_data_t`.
- `zone_mem_map` - First page in the global `mem_map` that this zone refers to.
- `zone_start_paddr` - Starting physical address of the zone.
- `zone_start_mapnr` - Page offset within the global `mem_map`.
- `name` - String name of the zone - `"DMA"`, `"Normal"` or `"HighMem"`.
- `size` - Size of the zone in pages.
- When system memory is low, the pageout daemon `kswapd` is woken up to free pages.
- If memory pressure is high, `kswapd` will free memory synchronously - the direct-reclaim path.
- Each zone has three watermarks - `pages_min`, `pages_low` and `pages_high`.
- `pages_min` is determined by free_area_init_core() during memory initialisation and is based on a ratio of the size of the zone in pages, initially calculated as `zone_size_in_pages/128` and bounded between 20 and 255 pages (80KiB - 1MiB on x86.) When this watermark is reached it's time to get serious - memory is synchronously freed.
- `pages_low = 2*pages_min` by default. When this amount of free memory is reached, `kswapd` is woken up by the 'buddy allocator' in order to start freeing pages.
- `pages_high = 3*pages_min` by default. After `kswapd` has been woken to start freeing pages, the zone won't be considered 'balanced' until `pages_high` pages are free again.
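A simplified sketch of the watermark logic just described - illustrative code written against the `zone_t` fields above, not the actual 2.4 allocator:

```c
/* Illustrative only: how the watermarks drive reclaim. The real checks
 * live in the page allocator and kswapd (mm/page_alloc.c, mm/vmscan.c). */
static int zone_low_on_memory(zone_t *zone)
{
        if (zone->free_pages < zone->pages_low) {
                zone->need_balance = 1;   /* ask kswapd to reclaim asynchronously */
                return 1;
        }
        return 0;
}

static int zone_balanced(zone_t *zone)
{
        /* kswapd keeps freeing pages until pages_high is reached again. */
        return zone->free_pages >= zone->pages_high;
}
```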
- The size of each zone is calculated during setup_memory().
- PFN - Page Frame Number - is an offset in pages within the physical memory map.
- The PFN variables mentioned below are kept in `mm/bootmem.c`.
- `min_low_pfn` - the first PFN usable by the system - is located in the first page after the global variable `_end` (this variable represents the end of the loaded kernel image.)
- `max_pfn` - the last page frame in the system - is determined in a very architecture-specific fashion. On x86 the function find_max_pfn() reads through the whole e820 map (a table provided by the BIOS describing what physical memory is available, reserved or non-existent) in order to find the highest page frame.
- `max_low_pfn` is calculated on x86 with find_max_low_pfn(), and marks the end of `ZONE_NORMAL`. This is the highest page of physical memory directly accessible by the kernel, and is related to the kernel/userspace split in the linear address space determined by `PAGE_OFFSET`. On low-memory machines `max_pfn = max_low_pfn`.
Once we have these values we can determine the start and end of high memory (
highstart_pfn
andhighend_pfn
) very simply:
```c
highstart_pfn = highend_pfn = max_pfn;
if (max_pfn > max_low_pfn) {
        highstart_pfn = max_low_pfn;
}
```
- These values are used later to initialise the high memory pages for the physical page allocator (see section 5.6)
- When I/O is being performed on a page, such as during page-in or page-out, the page is locked to avoid exposing inconsistent data.
- Processes that want to use a page undergoing I/O have to join a wait queue, by calling wait_on_page(), before the page can be accessed.
- When the I/O is complete the page will be unlocked with UnlockPage() (`#define`'d as unlock_page()) and any processes waiting on the queue will be woken up.
- If every page had its own wait queue it would use a lot of memory, so instead the wait queues are stored within the relevant `zone_t`.
- The process of sleeping on a locked page can be described as follows:
  - Process A wants to lock the page.
  - The kernel calls __wait_on_page()...
  - ...which calls page_waitqueue() to get the page's wait queue...
  - ...which calls page_zone() to obtain the page's zone's `zone_t` structure using the page's `flags` field shifted by `ZONE_SHIFT`...
  - ...page_waitqueue() will then hash the page address to read into the zone's `wait_table` field and retrieve the appropriate `wait_queue_head_t`.
  - This is used by add_wait_queue() to add the process to the wait queue, at which point it goes beddy byes!
- As described above, a hash table is used rather than simply keeping a single wait list. This is done because a single list could result in a serious 'thundering herd' problem.
- In the event of a hash collision processes might still get woken up unnecessarily, but collisions aren't expected that often.
- The `wait_table` field is allocated during free_area_init_core(). Its size is calculated by wait_table_size() and stored in the `wait_table_size` field, with a maximum size of 4,096 wait queues.
- For smaller tables, the size of the table is the minimum power of 2 required to store `NoPages / PAGES_PER_WAITQUEUE` queues, where `NoPages` is the number of pages in the zone and `PAGES_PER_WAITQUEUE` is defined as 256. In other words, the table holds `2^floor(log2(2 * NoPages/PAGES_PER_WAITQUEUE - 1))` queues.
- The field `zone_t->wait_table_shift` is the number of bits a page address has to be shifted right to return an index within the table (this is how the hashing described above is implemented - see the sketch below.)
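A sketch of how the table sizing and hashing fit together, modelled loosely on the 2.4 implementation in `mm/page_alloc.c` and `mm/filemap.c` - treat the hashing constant, the helper name `queue_for_page()` and the exact details as illustrative rather than verbatim kernel code:

```c
#define PAGES_PER_WAITQUEUE 256

/* Smallest power of 2 giving one queue per 256 pages, capped at 4096. */
static unsigned long wait_table_size(unsigned long pages)
{
        unsigned long size = 1;

        pages /= PAGES_PER_WAITQUEUE;
        while (size < pages)
                size <<= 1;

        if (size > 4096)
                size = 4096;

        return size;
}

/* Hashing a page into the table, page_waitqueue()-style: scramble the
 * struct page address with a multiplicative hash and keep the top bits
 * by shifting right by wait_table_shift. */
static wait_queue_head_t *queue_for_page(zone_t *zone, struct page *page)
{
        unsigned long hash = (unsigned long)page * 2654435761UL; /* Knuth's constant, for illustration */

        return &zone->wait_table[hash >> zone->wait_table_shift];
}
```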
- Zones are initialised after the kernel page tables have been fully set up by paging_init(). The idea is to determine what parameters to send to free_area_init() for UMA architectures (where the only parameter required is `zones_size`), or to free_area_init_node() for NUMA (see the sketch after the parameter list below.)
- The parameters are as follows:

```c
void __init free_area_init_node(int nid, pg_data_t *pgdat, struct page *pmap,
        unsigned long *zones_size, unsigned long zone_start_paddr,
        unsigned long *zholes_size)
```
- `nid` - The node ID.
- `pgdat` - The node's `pg_data_t` being initialised; in UMA this will be `contig_page_data`.
- `pmap` - Set later by free_area_init_core() to point to the beginning of the locally defined `lmem_map` array. In NUMA this is ignored, because NUMA treats `mem_map` as a virtual array starting at `PAGE_OFFSET`; in UMA, this pointer is the global `mem_map` variable.
- `zones_size` - An array containing the size of each zone in pages.
- `zone_start_paddr` - Starting physical address for the first zone.
- `zholes_size` - An array containing the total size of memory holes in the zones.
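As referenced above, a sketch of how a UMA x86 arch might fill in `zones_size` and hand it to free_area_init(), modelled on 2.4's `arch/i386` paging_init() path - the function name and exact constants are illustrative:

```c
/* Hypothetical helper (the real work happens inside paging_init()):
 * carve the node into ZONE_DMA / ZONE_NORMAL / ZONE_HIGHMEM by PFN. */
static void __init setup_zone_sizes(void)
{
        unsigned long zones_size[MAX_NR_ZONES] = { 0, 0, 0 };
        unsigned long max_dma = virt_to_phys((char *)MAX_DMA_ADDRESS) >> PAGE_SHIFT;

        if (max_low_pfn < max_dma) {
                zones_size[ZONE_DMA] = max_low_pfn;
        } else {
                zones_size[ZONE_DMA] = max_dma;
                zones_size[ZONE_NORMAL] = max_low_pfn - max_dma;
#ifdef CONFIG_HIGHMEM
                zones_size[ZONE_HIGHMEM] = highend_pfn - max_low_pfn;
#endif
        }

        free_area_init(zones_size);  /* UMA: uses contig_page_data and the global mem_map */
}
```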
- free_area_init_core() is responsible for filling in each `zone_t` with the relevant information and for allocating the `mem_map` for the node. Information on which pages are free for the zones is not determined at this point; that isn't known until the boot memory allocator is being retired (discussed in chapter 5.)
- The `mem_map` area (of type `mem_map_t`, a typedef for `struct page`) is created during system startup in one of two ways. On NUMA systems it is treated as a virtual array starting at `PAGE_OFFSET`, and free_area_init_node() is called for each active node in the system to allocate the portion of this array for the node being initialised.
- On UMA systems, free_area_init() uses `contig_page_data` as the node and the global `mem_map` as the local `mem_map` for this node.
- free_area_init_core() allocates a local `lmem_map` for the node being initialised. The memory for this array is allocated from the boot memory allocator via alloc_bootmem_node() (which in turn calls __alloc_bootmem_node()) - for UMA this newly allocated memory becomes the global `mem_map`, but for NUMA things are slightly different.
- In NUMA, architectures allocate the memory for `lmem_map` within each node's own memory. The global `mem_map` is never explicitly allocated; it is set to `PAGE_OFFSET`, which is treated as a virtual array.
- The address of the local map is stored in `pg_data_t->node_mem_map`, which exists somewhere within the virtual `mem_map`. For each zone that exists in the node, the address within the virtual `mem_map` is stored in `zone_t->zone_mem_map`. All the rest of the code then treats `mem_map` as a real array, because only valid regions within it will be used by nodes.
- Every physical page 'frame' in the system has an associated `struct page` used to keep track of its status:

```c
typedef struct page {
struct list_head list; /* ->mapping has some page lists. */
struct address_space *mapping; /* The inode (or ...) we belong to. */
unsigned long index; /* Our offset within mapping. */
struct page *next_hash; /* Next page sharing our hash bucket in
the pagecache hash table. */
atomic_t count; /* Usage count, see below. */
unsigned long flags; /* atomic flags, some possibly
updated asynchronously */
struct list_head lru; /* Pageout list, eg. active_list;
protected by pagemap_lru_lock !! */
struct page **pprev_hash; /* Complement to *next_hash. */
struct buffer_head * buffers; /* Buffer maps us to a disk block. */
/*
* On machines where all RAM is mapped into kernel address space,
* we can simply calculate the virtual address. On machines with
* highmem some memory is mapped into kernel virtual memory
* dynamically, so we need a place to store that address.
* Note that this field could be 16 bits on x86 ... ;)
*
* Architectures with slow multiplication can define
* WANT_PAGE_VIRTUAL in asm/page.h
*/
#if defined(CONFIG_HIGHMEM) || defined(WANT_PAGE_VIRTUAL)
void *virtual; /* Kernel virtual address (NULL if
not kmapped, ie. highmem) */
#endif /* CONFIG_HIGMEM || WANT_PAGE_VIRTUAL */
} mem_map_t;
```

- `list` - Pages might belong to many lists, and this field is used as the `list_head` field for those (kernel linked lists work using an embedded field.) For example, pages in a mapping will be in one of the `clean_pages`, `dirty_pages` or `locked_pages` lists kept by an `address_space`. In the slab allocator, the field is used to store pointers to the slab and cache structures managing the page once it's been allocated by the slab allocator. Additionally, it's used to link blocks of free pages together.
- `mapping` - When files or devices are memory mapped, their inode has an associated `address_space`. This field will point to this address space if the page belongs to the file. If the page is anonymous and `mapping` is set, the `address_space` is `swapper_space`, which manages the swap address space.
- `index` - If the page is part of a file mapping, it is the offset within the file. If the page is part of the swap cache, then this will be the offset within the `address_space` for the swap address space (`swapper_space`.) Alternatively, if a block of pages is being freed for a particular process, the order (power-of-two number of pages being freed) of the block is stored here, set in __free_pages_ok().
- `next_hash` - Pages that are part of a file mapping are hashed on the inode and offset. This field links together pages that share the same hash bucket.
- `count` - The reference count of the page - if it drops to zero, the page can be freed. If it is any greater, it is in use by one or more processes or by the kernel (e.g. waiting for I/O.)
- `flags` - Flags describing the status of the page, as declared in `linux/mm.h`. Macros exist for testing, setting and clearing the individual bits (listed below); the only really interesting helper is SetPageUptodate(), which calls an architecture-specific function, arch_set_page_uptodate() (this seems to only actually do something for the S390 and S390X architectures.)
  - `PG_active` - This bit is set if a page is on the `active_list` LRU and cleared when it is removed. It indicates that the page is 'hot'. Set: SetPageActive(). Test: PageActive(). Clear: ClearPageActive().
  - `PG_arch_1` - An architecture-specific page state bit. The generic code guarantees that this bit is cleared for a page when it is first entered into the page cache. This allows an architecture to defer the flushing of the D-cache (see section 3.9) until the page is mapped by a process. Set: None. Test: None. Clear: None.
  - `PG_checked` - Used by ext2. Set: SetPageChecked(). Test: PageChecked(). Clear: None.
  - `PG_dirty` - Does the page need to be flushed to disk? This bit ensures a dirty page is not freed before being written out. Set: SetPageDirty(). Test: PageDirty(). Clear: ClearPageDirty().
  - `PG_error` - Set if an error occurs during disk I/O. Set: SetPageError(). Test: PageError(). Clear: ClearPageError().
  - `PG_fs_1` - Reserved for a file system to use for its own purposes, e.g. NFS uses this to indicate if a page is in sync with the remote server. Set: None. Test: None. Clear: None.
  - `PG_highmem` - Pages in high memory cannot be mapped permanently by the kernel, so these pages are flagged with this bit during mem_init(). Set: None. Test: PageHighMem(). Clear: None.
  - `PG_launder` - Useful only for the page replacement policy. When the VM wants to swap out a page, it'll set the bit and call writepage(). When scanning, if it encounters a page with `PG_launder|PG_locked` set it will wait for the I/O to complete. Set: SetPageLaunder(). Test: PageLaunder(). Clear: ClearPageLaunder().
  - `PG_locked` - Set when the page must be locked in memory for disk I/O. When the I/O starts, this bit is set; when it completes, it is cleared. Set: LockPage(). Test: PageLocked(). Clear: UnlockPage().
  - `PG_lru` - If a page is on either the `active_list` or the `inactive_list`, this bit is set. Set: TestSetPageLRU(). Test: PageLRU(). Clear: TestClearPageLRU().
  - `PG_referenced` - If a page is mapped and referenced through the mapping/index hash table, this bit is set. It's used during page replacement for moving the page around the LRU lists. Set: SetPageReferenced(). Test: PageReferenced(). Clear: ClearPageReferenced().
  - `PG_reserved` - Set for pages that can never be swapped out. It is set by the boot memory allocator for pages allocated during system startup. Later, it's used to flag empty pages or ones that don't exist. Set: SetPageReserved(). Test: PageReserved(). Clear: ClearPageReserved().
  - `PG_slab` - Indicates the page is being used by the slab allocator. Set: PageSetSlab(). Test: PageSlab(). Clear: PageClearSlab().
  - `PG_skip` - Defunct. Used to be used by some Sparc architectures to skip over parts of the address space, but is no longer used. Completely removed in 2.6. Set: None. Test: None. Clear: None.
  - `PG_unused` - Does what it says on the tin. Set: None. Test: None. Clear: None.
  - `PG_uptodate` - When a page is read from disk without error, this bit will be set. Set: SetPageUptodate(). Test: Page_Uptodate(). Clear: ClearPageUptodate().
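To show how a few of these macros fit together, here is a hedged sketch of a read path using the bit helpers named above - the function name is made up and this is illustrative, not a real kernel I/O path:

```c
/* Illustrative only: lock a page around disk I/O and record the result
 * using the flag macros described above. */
static void read_page_sketch(struct page *page, int error)
{
        LockPage(page);                  /* sets PG_locked before the I/O starts */

        /* ... submit the read and wait for it to complete ... */

        if (!error)
                SetPageUptodate(page);   /* PG_uptodate: the data is now valid */
        else
                SetPageError(page);      /* PG_error: the I/O failed */

        UnlockPage(page);                /* clears PG_locked and wakes any waiters */
}
```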
- `lru` - For the page replacement policy, pages that may be swapped out will exist on either the `active_list` or the `inactive_list` declared in `page_alloc.c`. This is the `struct list_head` field for these LRU lists (discussed in chapter 10.)
- `pprev_hash` - The complement to `next_hash`, making the hash list doubly-linked.
- `buffers` - If a page has buffers for a block device associated with it, this field is used to keep track of the `struct buffer_head`. An anonymous page mapped by a process may also have an associated `buffer_head` if it's backed by a swap file. This is necessary because the page has to be synced with backing storage in block-sized chunks defined by the underlying file system.
- `virtual` - Normally only pages from `ZONE_NORMAL` are directly mapped by the kernel. To address pages in `ZONE_HIGHMEM`, kmap() (which in turn calls __kmap()) is used to map the page for the kernel (described further in chapter 9.) Only a fixed number of pages may be mapped. When a page is mapped, this field holds its virtual address.
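A minimal sketch of using kmap()/kunmap() to touch a `ZONE_HIGHMEM` page from the kernel - the helper name here is hypothetical:

```c
/* Illustrative helper: zero a page that may live in ZONE_HIGHMEM. While
 * mapped, page->virtual holds the temporary kernel virtual address. */
static void zero_page_sketch(struct page *page)
{
        void *vaddr = kmap(page);       /* establish a temporary kernel mapping */

        memset(vaddr, 0, PAGE_SIZE);    /* use it like any directly-mapped page */
        kunmap(page);                   /* release the mapping again */
}
```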
- As recently as 2.4.18, a `struct page` stored a reference to its zone in `page->zone`. This is wasteful, as with thousands of pages these pointers add up.
- In 2.4.22 the `zone` field is gone and `page->flags` is shifted by `ZONE_SHIFT` to determine the zone the page belongs to.
- In order for this to be used to determine the zone, we start by declaring `zone_table` (`EXPORT_SYMBOL` makes `zone_table` accessible to loadable modules.) This is treated like a multi-dimensional array of nodes and zones:

```c
zone_t *zone_table[MAX_NR_ZONES*MAX_NR_NODES];
EXPORT_SYMBOL(zone_table);
```

- `MAX_NR_ZONES` is the maximum number of zones that can be in a node (i.e. 3.) `MAX_NR_NODES` is the maximum number of nodes that can exist.
- During free_area_init_core() all the pages in a node are initialised. First it sets the value for the table, where `nid` is the node ID, `j` is the zone index and `zone` is the `zone_t` struct:

```c
zone_table[nid * MAX_NR_ZONES + j] = zone;
```

- For each page, the function set_page_zone() is called via:

```c
set_page_zone(page, nid * MAX_NR_ZONES + j);
```
- set_page_zone() is defined as follows, which shows how this is used in conjunction with ZONE_SHIFT to encode a page's zone:
```c
static inline void set_page_zone(struct page *page, unsigned long zone_num)
{
        page->flags &= ~(~0UL << ZONE_SHIFT);
        page->flags |= zone_num << ZONE_SHIFT;
}
```
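The reverse lookup, page_zone(), then recovers the zone from those same bits - roughly as defined in `include/linux/mm.h` for 2.4.22:

```c
/* The top bits of page->flags (above ZONE_SHIFT) index zone_table,
 * giving back the zone_t the page belongs to. */
static inline zone_t *page_zone(struct page *page)
{
        return zone_table[page->flags >> ZONE_SHIFT];
}
```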
- Because memory in the `ZONE_NORMAL` zone is limited in size, the kernel supports the concept of 'high memory'.
- Two thresholds of high memory exist on 32-bit x86 systems, one at 4GiB and a second at 64GiB. 32-bit systems can only address 4GiB of RAM directly, but with PAE (Physical Address Extension) enabled up to 64GiB of physical memory can be addressed, though not all at once of course.
- Each `struct page` describing a page frame uses 44 bytes of memory in `ZONE_NORMAL`. This means 4GiB of RAM requires roughly 44MiB of kernel memory to describe it. At 16GiB, 176MiB is consumed, and once you factor in other structures, even smaller ones like Page Table Entries (PTEs), which require 16MiB in the worst case, it adds up. This makes 16GiB the maximum practical limit on a 32-bit system; at that point you should switch to 64-bit if you want access to more memory.
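- For illustration, with 4KiB pages: 4GiB of RAM is 1,048,576 page frames, and 1,048,576 * 44 bytes = 46,137,344 bytes = 44MiB of `mem_map`; at 16GiB this becomes 4,194,304 * 44 bytes = 176MiB.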