- NUMA - Non-Uniform Memory Access. Memory is arranged into banks which incur a different access cost depending on their distance from the processor.
- Each of these banks is called a 'node', represented by `struct pglist_data` even if the arch is UMA.
- The struct is always referenced by a typedef, `pg_data_t`.
- Every node is kept on a `NULL`-terminated linked list, `pgdat_list`, and linked by `pg_data_t->node_next`.
- On UMA arches, only one `pg_data_t` structure called `contig_page_data` is used.
- Each node is divided into blocks called zones, which represent ranges of memory and are described by `struct zone_struct`, typedef'd to `zone_t` - one of `ZONE_DMA`, `ZONE_NORMAL` or `ZONE_HIGHMEM`.
- `ZONE_DMA` is kept within the lower physical memory ranges that certain ISA devices need.
- `ZONE_NORMAL` memory is directly mapped into the upper region of the linear address space.
- `ZONE_HIGHMEM` is what is left.
- In a 32-bit kernel the mappings are:
  - `ZONE_DMA` - First 16MiB of memory
  - `ZONE_NORMAL` - 16MiB - 896MiB
  - `ZONE_HIGHMEM` - 896MiB - End
- Many kernel operations can only take place in `ZONE_NORMAL`, making it the most performance-critical zone.
- Memory is divided into fixed-size chunks called page frames, represented by `struct page` (typedef'd to `mem_map_t`), all of which are kept in a global `mem_map` array, usually stored at the beginning of `ZONE_NORMAL` or, on low-memory machines, just after the area reserved for the loaded kernel image (`mem_map_t` is a convenient name for accessing elements of this array.)
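As a rough illustration (the helper names here are made up for the example, not kernel functions), on a UMA x86 system a page frame number (PFN) is simply an index into `mem_map`:

```c
/* Illustrative sketch only - pfn_to_struct_page() and struct_page_to_pfn()
 * are hypothetical helpers. On a UMA x86 system the global mem_map array
 * describes physical memory starting at frame 0, so a PFN is just an index. */
static struct page *pfn_to_struct_page(unsigned long pfn)
{
        return &mem_map[pfn];        /* PFN indexes mem_map directly */
}

static unsigned long struct_page_to_pfn(struct page *page)
{
        return page - mem_map;       /* pointer arithmetic recovers the PFN */
}
```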
- Because the amount of memory directly accessible by the kernel (`ZONE_NORMAL`) is limited in size, Linux has the concept of high memory.
- Each node is described by a `pg_data_t`, which is a typedef for `struct pglist_data`:

```c
typedef struct pglist_data {
zone_t node_zones[MAX_NR_ZONES];
zonelist_t node_zonelists[GFP_ZONEMASK+1];
int nr_zones;
struct page *node_mem_map;
unsigned long *valid_addr_bitmap;
struct bootmem_data *bdata;
unsigned long node_start_paddr;
unsigned long node_start_mapnr;
unsigned long node_size;
int node_id;
struct pglist_data *node_next;
} pg_data_t;
```

- `node_zones` - `ZONE_HIGHMEM`, `ZONE_NORMAL`, `ZONE_DMA`.
- `node_zonelists` - The order of zones that allocations are preferred from. build_zonelists() in `mm/page_alloc.c` sets up the order, called by free_area_init_core(). A failed allocation in `ZONE_HIGHMEM` may fall back to `ZONE_NORMAL` or further back to `ZONE_DMA`.
- `nr_zones` - Number of zones in this node, between 1 and 3 (not all nodes will have all zones.)
- `node_mem_map` - First page of the `struct page` array that represents each physical frame in the node. It will be placed somewhere within the global `mem_map` array.
- `valid_addr_bitmap` - Used by sparc/sparc64.
- `bdata` - Boot memory allocator information only.
- `node_start_paddr` - Starting physical address of the node.
- `node_start_mapnr` - Page offset within the global `mem_map`. Calculated in free_area_init_core() by determining the number of pages between `mem_map` and the local `mem_map` for this node, called `lmem_map` (see the sketch after this list.)
- `node_size` - Total number of pages in this node.
- `node_id` - Node ID (NID) of the node, starting at 0.
- `node_next` - Pointer to the next node; `NULL`-terminated list.
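A sketch of the `node_start_mapnr` calculation described above - roughly what free_area_init_core() does, shown standalone here purely for illustration:

```c
/* Offset of this node's local map (lmem_map) within the global mem_map,
 * in pages - i.e. the node's starting page number within the global
 * struct page array. */
pgdat->node_start_mapnr = lmem_map - mem_map;
```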
- Nodes are maintained on a list called `pgdat_list`. Nodes are placed on the list as they are initialised by the init_bootmem_core() function. They can be iterated over using for_each_pgdat(), e.g.:
```c
pg_data_t *pgdat;

for_each_pgdat(pgdat)
        pr_debug("node %d: size=%lu\n", pgdat->node_id, pgdat->node_size);
```
- Each zone is described by a `struct zone_struct` (typedef'd to `zone_t`):

```c
typedef struct zone_struct {
/*
* Commonly accessed fields:
*/
spinlock_t lock;
unsigned long free_pages;
unsigned long pages_min, pages_low, pages_high;
int need_balance;
/*
* free areas of different sizes
*/
free_area_t free_area[MAX_ORDER];
/*
* wait_table -- the array holding the hash table
* wait_table_size -- the size of the hash table array
* wait_table_shift -- wait_table_size
* == BITS_PER_LONG (1 << wait_table_bits)
*
* The purpose of all these is to keep track of the people
* waiting for a page to become available and make them
* runnable again when possible. The trouble is that this
* consumes a lot of space, especially when so few things
* wait on pages at a given time. So instead of using
* per-page waitqueues, we use a waitqueue hash table.
*
* The bucket discipline is to sleep on the same queue when
* colliding and wake all in that wait queue when removing.
* When something wakes, it must check to be sure its page is
* truly available, a la thundering herd. The cost of a
* collision is great, but given the expected load of the
* table, they should be so rare as to be outweighed by the
* benefits from the saved space.
*
* __wait_on_page() and unlock_page() in mm/filemap.c, are the
* primary users of these fields, and in mm/page_alloc.c
* free_area_init_core() performs the initialization of them.
*/
wait_queue_head_t * wait_table;
unsigned long wait_table_size;
unsigned long wait_table_shift;
/*
* Discontig memory support fields.
*/
struct pglist_data *zone_pgdat;
struct page *zone_mem_map;
unsigned long zone_start_paddr;
unsigned long zone_start_mapnr;
/*
* rarely used fields:
*/
char *name;
unsigned long size;
} zone_t;
```

- `lock` - Spinlock which protects the zone from concurrent accesses.
- `free_pages` - Total number of free pages in the zone.
- `pages_min`, `pages_low`, `pages_high` - Watermarks. If `free_pages < pages_low`, `kswapd` is woken up and swaps pages out asynchronously. If page consumption doesn't slow down quickly enough from this, `kswapd` switches into a mode where pages are freed synchronously in order to return the system to health (see 2.2.1.)
- `need_balance` - Indicates to `kswapd` that it needs to balance the zone, i.e. `free_pages` has hit one of the watermarks.
- `free_area` - Free area bitmaps used by the buddy allocator.
- `wait_table` - Hash table of wait queues of processes waiting on a page to be freed. This is meaningful to wait_on_page() and unlock_page(). A 'wait table' is used because, if processes all waited on a single queue, there'd be a big race between processes for pages which are locked on wake up (known as a 'thundering herd'.)
- `wait_table_size` - Number of queues in the hash table (a power of 2.)
- `wait_table_shift` - The number of bits in a long minus the binary logarithm of `wait_table_size`.
- `zone_pgdat` - Points to the parent `pg_data_t`.
- `zone_mem_map` - First page in the global `mem_map` that this zone refers to.
- `zone_start_paddr` - Starting physical address of the zone.
- `zone_start_mapnr` - Page offset within the global `mem_map`.
- `name` - String name of the zone - `"DMA"`, `"Normal"` or `"HighMem"`.
- `size` - Size of the zone in pages.
- When system memory is low, the pageout daemon `kswapd` is woken up to free pages.
- If memory pressure is high, `kswapd` will free memory synchronously - the direct-reclaim path.
- Each zone has three watermarks - `pages_min`, `pages_low` and `pages_high`.
- `pages_min` is determined by free_area_init_core() during memory initialisation and is based on a ratio of the size of the zone in pages, initially calculated as `zone_size_in_pages/128` and bounded between 20 and 255 pages (80KiB - 1MiB on x86.) When this watermark is reached it's time to get serious - memory is synchronously freed.
- `pages_low = 2*pages_min` by default. When this amount of free memory is reached, `kswapd` is woken up by the 'buddy allocator' in order to start freeing pages.
- `pages_high = 3*pages_min` by default. After `kswapd` has been woken to start freeing pages, the zone won't be considered 'balanced' until `pages_high` pages are free again.
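A simplified sketch of the watermark logic just described - illustrative code written against the `zone_t` fields above, not the actual 2.4 allocator:

```c
/* Illustrative only: how the watermarks drive reclaim. The real checks
 * live in the page allocator and kswapd (mm/page_alloc.c, mm/vmscan.c). */
static int zone_low_on_memory(zone_t *zone)
{
        if (zone->free_pages < zone->pages_low) {
                zone->need_balance = 1;   /* ask kswapd to reclaim asynchronously */
                return 1;
        }
        return 0;
}

static int zone_balanced(zone_t *zone)
{
        /* kswapd keeps freeing pages until pages_high is reached again. */
        return zone->free_pages >= zone->pages_high;
}
```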
- The size of each zone is calculated during setup_memory().
- PFN - Page Frame Number - is an offset in pages within the physical memory map.
- The PFN variables mentioned below are kept in `mm/bootmem.c`.
- `min_low_pfn` - the first PFN usable by the system - is located in the first page after the global variable `_end` (this variable represents the end of the loaded kernel image.)
- `max_pfn` - the last page frame in the system - is determined in a very architecture-specific fashion. On x86 the function find_max_pfn() reads through the whole e820 map (a table provided by the BIOS describing what physical memory is available, reserved or non-existent) in order to find the highest page frame.
- `max_low_pfn` is calculated on x86 with find_max_low_pfn(), and marks the end of `ZONE_NORMAL`. This is the highest page of physical memory directly accessible by the kernel, and is related to the kernel/userspace split in the linear address space determined by `PAGE_OFFSET`. On low-memory machines `max_pfn = max_low_pfn`.
Once we have these values we can determine the start and end of high memory (
highstart_pfn
andhighend_pfn
) very simply:
```c
highstart_pfn = highend_pfn = max_pfn;
if (max_pfn > max_low_pfn) {
        highstart_pfn = max_low_pfn;
}
```
- These values are used later to initialise the high memory pages for the physical page allocator (see section 5.6)
- When I/O is being performed on a page, such as during page-in or page-out, the page is locked to avoid exposing inconsistent data.
- Processes that want to use a page undergoing I/O have to join a wait queue, by calling wait_on_page(), before the page can be accessed.
- When the I/O is complete the page will be unlocked with UnlockPage() (`#define`'d as unlock_page()) and any processes waiting on the queue will be woken up.
- If every page had its own wait queue it would use a lot of memory, so instead the wait queues are stored within the relevant `zone_t`.
- The process of sleeping on a locked page can be described as follows:
  - Process A wants to lock the page.
  - The kernel calls __wait_on_page()...
  - ...which calls page_waitqueue() to get the page's wait queue...
  - ...which calls page_zone() to obtain the page's zone's `zone_t` structure using the page's `flags` field shifted by `ZONE_SHIFT`...
  - ...page_waitqueue() will then hash the page address to read into the zone's `wait_table` field and retrieve the appropriate `wait_queue_head_t`.
  - This is used by add_wait_queue() to add the process to the wait queue, at which point it goes beddy byes!
- As described above, a hash table is used rather than simply keeping a single wait list. This is done because a single list could result in a serious 'thundering herd' problem.
- In the event of a hash collision processes might still get woken up unnecessarily, but collisions aren't expected that often.
- The `wait_table` field is allocated during free_area_init_core(). Its size is calculated by wait_table_size() and stored in the `wait_table_size` field, with a maximum size of 4,096 wait queues.
- For smaller tables, the size of the table is the minimum power of 2 required to store `NoPages / PAGES_PER_WAITQUEUE` queues, where `NoPages` is the number of pages in the zone and `PAGES_PER_WAITQUEUE` is defined as 256. In other words, the table holds `2^floor(log2(2 * NoPages/PAGES_PER_WAITQUEUE - 1))` queues.
- The field `zone_t->wait_table_shift` is the number of bits a page address has to be shifted right to return an index within the table (this is how the hashing described above is implemented - see the sketch below.)
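A sketch of how the table sizing and hashing fit together, modelled loosely on the 2.4 implementation in `mm/page_alloc.c` and `mm/filemap.c` - treat the hashing constant, the helper name `queue_for_page()` and the exact details as illustrative rather than verbatim kernel code:

```c
#define PAGES_PER_WAITQUEUE 256

/* Smallest power of 2 giving one queue per 256 pages, capped at 4096. */
static unsigned long wait_table_size(unsigned long pages)
{
        unsigned long size = 1;

        pages /= PAGES_PER_WAITQUEUE;
        while (size < pages)
                size <<= 1;

        if (size > 4096)
                size = 4096;

        return size;
}

/* Hashing a page into the table, page_waitqueue()-style: scramble the
 * struct page address with a multiplicative hash and keep the top bits
 * by shifting right by wait_table_shift. */
static wait_queue_head_t *queue_for_page(zone_t *zone, struct page *page)
{
        unsigned long hash = (unsigned long)page * 2654435761UL; /* Knuth's constant, for illustration */

        return &zone->wait_table[hash >> zone->wait_table_shift];
}
```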
- Zones are initialised after the kernel page tables have been fully set up by paging_init(). The idea is to determine what parameters to send to free_area_init() for UMA architectures (where the only parameter required is `zones_size`), or to free_area_init_node() for NUMA (see the sketch after the parameter list below.)
- The parameters are as follows:

```c
void __init free_area_init_node(int nid, pg_data_t *pgdat, struct page *pmap,
        unsigned long *zones_size, unsigned long zone_start_paddr,
        unsigned long *zholes_size)
```
- `nid` - The node ID.
- `pgdat` - The node's `pg_data_t` being initialised; in UMA this will be `contig_page_data`.
- `pmap` - Set later by free_area_init_core() to point to the beginning of the locally defined `lmem_map` array. In NUMA this is ignored, because NUMA treats `mem_map` as a virtual array starting at `PAGE_OFFSET`; in UMA, this pointer is the global `mem_map` variable.
- `zones_size` - An array containing the size of each zone in pages.
- `zone_start_paddr` - Starting physical address for the first zone.
- `zholes_size` - An array containing the total size of memory holes in the zones.
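As referenced above, a sketch of how a UMA x86 arch might fill in `zones_size` and hand it to free_area_init(), modelled on 2.4's `arch/i386` paging_init() path - the function name and exact constants are illustrative:

```c
/* Hypothetical helper (the real work happens inside paging_init()):
 * carve the node into ZONE_DMA / ZONE_NORMAL / ZONE_HIGHMEM by PFN. */
static void __init setup_zone_sizes(void)
{
        unsigned long zones_size[MAX_NR_ZONES] = { 0, 0, 0 };
        unsigned long max_dma = virt_to_phys((char *)MAX_DMA_ADDRESS) >> PAGE_SHIFT;

        if (max_low_pfn < max_dma) {
                zones_size[ZONE_DMA] = max_low_pfn;
        } else {
                zones_size[ZONE_DMA] = max_dma;
                zones_size[ZONE_NORMAL] = max_low_pfn - max_dma;
#ifdef CONFIG_HIGHMEM
                zones_size[ZONE_HIGHMEM] = highend_pfn - max_low_pfn;
#endif
        }

        free_area_init(zones_size);  /* UMA: uses contig_page_data and the global mem_map */
}
```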
- free_area_init_core() is responsible for filling in each `zone_t` with the relevant information and for allocating the `mem_map` for the node. Information on which pages are free for the zones is not determined at this point; that isn't known until the boot memory allocator is being retired (discussed in chapter 5.)
- The `mem_map` area (of type `mem_map_t`, a typedef for `struct page`) is created during system startup in one of two ways. On NUMA systems it is treated as a virtual array starting at `PAGE_OFFSET`, and free_area_init_node() is called for each active node in the system to allocate the portion of this array for the node being initialised.
- On UMA systems, free_area_init() uses `contig_page_data` as the node and the global `mem_map` as the local `mem_map` for this node.
- free_area_init_core() allocates a local `lmem_map` for the node being initialised. The memory for this array is allocated from the boot memory allocator via alloc_bootmem_node() (which in turn calls __alloc_bootmem_node()) - for UMA this newly allocated memory becomes the global `mem_map`, but for NUMA things are slightly different.
- In NUMA, architectures allocate the memory for `lmem_map` within each node's own memory. The global `mem_map` is never explicitly allocated; it is set to `PAGE_OFFSET`, which is treated as a virtual array.
- The address of the local map is stored in `pg_data_t->node_mem_map`, which exists somewhere within the virtual `mem_map`. For each zone that exists in the node, the address within the virtual `mem_map` is stored in `zone_t->zone_mem_map`. All the rest of the code then treats `mem_map` as a real array, because only valid regions within it will be used by nodes.
- Every physical page 'frame' in the system has an associated `struct page` used to keep track of its status:

```c
typedef struct page {
struct list_head list; /* ->mapping has some page lists. */
struct address_space *mapping; /* The inode (or ...) we belong to. */
unsigned long index; /* Our offset within mapping. */
struct page *next_hash; /* Next page sharing our hash bucket in
the pagecache hash table. */
atomic_t count; /* Usage count, see below. */
unsigned long flags; /* atomic flags, some possibly
updated asynchronously */
struct list_head lru; /* Pageout list, eg. active_list;
protected by pagemap_lru_lock !! */
struct page **pprev_hash; /* Complement to *next_hash. */
struct buffer_head * buffers; /* Buffer maps us to a disk block. */
/*
* On machines where all RAM is mapped into kernel address space,
* we can simply calculate the virtual address. On machines with
* highmem some memory is mapped into kernel virtual memory
* dynamically, so we need a place to store that address.
* Note that this field could be 16 bits on x86 ... ;)
*
* Architectures with slow multiplication can define
* WANT_PAGE_VIRTUAL in asm/page.h
*/
#if defined(CONFIG_HIGHMEM) || defined(WANT_PAGE_VIRTUAL)
void *virtual; /* Kernel virtual address (NULL if
not kmapped, ie. highmem) */
#endif /* CONFIG_HIGMEM || WANT_PAGE_VIRTUAL */
} mem_map_t;
```

- `list` - Pages might belong to many lists, and this field is used as the `list_head` field for those (kernel linked lists work using an embedded field.) For example, pages in a mapping will be in one of the `clean_pages`, `dirty_pages` or `locked_pages` lists kept by an `address_space`. In the slab allocator, the field is used to store pointers to the slab and cache structures managing the page once it's been allocated by the slab allocator. Additionally, it's used to link blocks of free pages together.
- `mapping` - When files or devices are memory mapped, their inode has an associated `address_space`. This field will point to this address space if the page belongs to the file. If the page is anonymous and `mapping` is set, the `address_space` is `swapper_space`, which manages the swap address space.
- `index` - If the page is part of a file mapping, it is the offset within the file. If the page is part of the swap cache, then this will be the offset within the `address_space` for the swap address space (`swapper_space`.) Alternatively, if a block of pages is being freed for a particular process, the order (power-of-two number of pages being freed) of the block is stored here, set in __free_pages_ok().
- `next_hash` - Pages that are part of a file mapping are hashed on the inode and offset. This field links together pages that share the same hash bucket.
- `count` - The reference count of the page - if it drops to zero, the page can be freed. If it is any greater, it is in use by one or more processes or by the kernel (e.g. waiting for I/O.)
- `flags` - Flags describing the status of the page, as declared in `linux/mm.h`. Macros exist for testing, setting and clearing the individual bits (listed below); the only really interesting helper is SetPageUptodate(), which calls an architecture-specific function, arch_set_page_uptodate() (this seems to only actually do something for the S390 and S390X architectures.)
  - `PG_active` - This bit is set if a page is on the `active_list` LRU and cleared when it is removed. It indicates that the page is 'hot'. Set: SetPageActive(). Test: PageActive(). Clear: ClearPageActive().
  - `PG_arch_1` - An architecture-specific page state bit. The generic code guarantees that this bit is cleared for a page when it is first entered into the page cache. This allows an architecture to defer the flushing of the D-cache (see section 3.9) until the page is mapped by a process. Set: None. Test: None. Clear: None.
  - `PG_checked` - Used by ext2. Set: SetPageChecked(). Test: PageChecked(). Clear: None.
  - `PG_dirty` - Does the page need to be flushed to disk? This bit ensures a dirty page is not freed before being written out. Set: SetPageDirty(). Test: PageDirty(). Clear: ClearPageDirty().
  - `PG_error` - Set if an error occurs during disk I/O. Set: SetPageError(). Test: PageError(). Clear: ClearPageError().
  - `PG_fs_1` - Reserved for a file system to use for its own purposes, e.g. NFS uses this to indicate if a page is in sync with the remote server. Set: None. Test: None. Clear: None.
  - `PG_highmem` - Pages in high memory cannot be mapped permanently by the kernel, so these pages are flagged with this bit during mem_init(). Set: None. Test: PageHighMem(). Clear: None.
  - `PG_launder` - Useful only for the page replacement policy. When the VM wants to swap out a page, it'll set the bit and call writepage(). When scanning, if it encounters a page with `PG_launder|PG_locked` set it will wait for the I/O to complete. Set: SetPageLaunder(). Test: PageLaunder(). Clear: ClearPageLaunder().
  - `PG_locked` - Set when the page must be locked in memory for disk I/O. When the I/O starts, this bit is set; when it completes, it is cleared. Set: LockPage(). Test: PageLocked(). Clear: UnlockPage().
  - `PG_lru` - If a page is on either the `active_list` or the `inactive_list`, this bit is set. Set: TestSetPageLRU(). Test: PageLRU(). Clear: TestClearPageLRU().
  - `PG_referenced` - If a page is mapped and referenced through the mapping/index hash table, this bit is set. It's used during page replacement for moving the page around the LRU lists. Set: SetPageReferenced(). Test: PageReferenced(). Clear: ClearPageReferenced().
  - `PG_reserved` - Set for pages that can never be swapped out. It is set by the boot memory allocator for pages allocated during system startup. Later, it's used to flag empty pages or ones that don't exist. Set: SetPageReserved(). Test: PageReserved(). Clear: ClearPageReserved().
  - `PG_slab` - Indicates the page is being used by the slab allocator. Set: PageSetSlab(). Test: PageSlab(). Clear: PageClearSlab().
  - `PG_skip` - Defunct. Used to be used by some Sparc architectures to skip over parts of the address space, but is no longer used. Completely removed in 2.6. Set: None. Test: None. Clear: None.
  - `PG_unused` - Does what it says on the tin. Set: None. Test: None. Clear: None.
  - `PG_uptodate` - When a page is read from disk without error, this bit will be set. Set: SetPageUptodate(). Test: Page_Uptodate(). Clear: ClearPageUptodate().
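To show how a few of these macros fit together, here is a hedged sketch of a read path using the bit helpers named above - the function name is made up and this is illustrative, not a real kernel I/O path:

```c
/* Illustrative only: lock a page around disk I/O and record the result
 * using the flag macros described above. */
static void read_page_sketch(struct page *page, int error)
{
        LockPage(page);                  /* sets PG_locked before the I/O starts */

        /* ... submit the read and wait for it to complete ... */

        if (!error)
                SetPageUptodate(page);   /* PG_uptodate: the data is now valid */
        else
                SetPageError(page);      /* PG_error: the I/O failed */

        UnlockPage(page);                /* clears PG_locked and wakes any waiters */
}
```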
- `lru` - For the page replacement policy, pages that may be swapped out will exist on either the `active_list` or the `inactive_list` declared in `page_alloc.c`. This is the `struct list_head` field for these LRU lists (discussed in chapter 10.)
- `pprev_hash` - The complement to `next_hash`, making the hash list doubly-linked.
- `buffers` - If a page has buffers for a block device associated with it, this field is used to keep track of the `struct buffer_head`. An anonymous page mapped by a process may also have an associated `buffer_head` if it's backed by a swap file. This is necessary because the page has to be synced with backing storage in block-sized chunks defined by the underlying file system.
- `virtual` - Normally only pages from `ZONE_NORMAL` are directly mapped by the kernel. To address pages in `ZONE_HIGHMEM`, kmap() (which in turn calls __kmap()) is used to map the page for the kernel (described further in chapter 9.) Only a fixed number of pages may be mapped. When a page is mapped, this field holds its virtual address.
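A minimal sketch of using kmap()/kunmap() to touch a `ZONE_HIGHMEM` page from the kernel - the helper name here is hypothetical:

```c
/* Illustrative helper: zero a page that may live in ZONE_HIGHMEM. While
 * mapped, page->virtual holds the temporary kernel virtual address. */
static void zero_page_sketch(struct page *page)
{
        void *vaddr = kmap(page);       /* establish a temporary kernel mapping */

        memset(vaddr, 0, PAGE_SIZE);    /* use it like any directly-mapped page */
        kunmap(page);                   /* release the mapping again */
}
```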
- As recently as 2.4.18, a `struct page` stored a reference to its zone in `page->zone`. This is wasteful, as with thousands of pages these pointers add up.
- In 2.4.22 the `zone` field is gone and `page->flags` is shifted by `ZONE_SHIFT` to determine the zone the page belongs to.
- In order for this to be used to determine the zone, we start by declaring `zone_table` (`EXPORT_SYMBOL` makes `zone_table` accessible to loadable modules.) This is treated like a multi-dimensional array of nodes and zones:

```c
zone_t *zone_table[MAX_NR_ZONES*MAX_NR_NODES];
EXPORT_SYMBOL(zone_table);
```

- `MAX_NR_ZONES` is the maximum number of zones that can be in a node (i.e. 3.) `MAX_NR_NODES` is the maximum number of nodes that can exist.
- During free_area_init_core() all the pages in a node are initialised. First it sets the value for the table, where `nid` is the node ID, `j` is the zone index and `zone` is the `zone_t` struct:

```c
zone_table[nid * MAX_NR_ZONES + j] = zone;
```

- For each page, the function set_page_zone() is called via:

```c
set_page_zone(page, nid * MAX_NR_ZONES + j);
```
- set_page_zone() is defined as follows, which shows how this is used in conjunction with ZONE_SHIFT to encode a page's zone:
```c
static inline void set_page_zone(struct page *page, unsigned long zone_num)
{
        page->flags &= ~(~0UL << ZONE_SHIFT);
        page->flags |= zone_num << ZONE_SHIFT;
}
```
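The reverse lookup, page_zone(), then recovers the zone from those same bits - roughly as defined in `include/linux/mm.h` for 2.4.22:

```c
/* The top bits of page->flags (above ZONE_SHIFT) index zone_table,
 * giving back the zone_t the page belongs to. */
static inline zone_t *page_zone(struct page *page)
{
        return zone_table[page->flags >> ZONE_SHIFT];
}
```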
- Because memory in the `ZONE_NORMAL` zone is limited in size, the kernel supports the concept of 'high memory'.
- Two thresholds of high memory exist on 32-bit x86 systems, one at 4GiB and a second at 64GiB. 32-bit systems can only address 4GiB of RAM directly, but with PAE (Physical Address Extension) enabled up to 64GiB of physical memory can be addressed, though not all at once of course.
- Each `struct page` describing a page frame uses 44 bytes of memory in `ZONE_NORMAL`. This means 4GiB of RAM requires roughly 44MiB of kernel memory to describe it. At 16GiB, 176MiB is consumed, and once you factor in other structures, even smaller ones like Page Table Entries (PTEs), which require 16MiB in the worst case, it adds up. This makes 16GiB the maximum practical limit on a 32-bit system; at that point you should switch to 64-bit if you want access to more memory.
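- For illustration, with 4KiB pages: 4GiB of RAM is 1,048,576 page frames, and 1,048,576 * 44 bytes = 46,137,344 bytes = 44MiB of `mem_map`; at 16GiB this becomes 4,194,304 * 44 bytes = 176MiB.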