Skip to content

Latest commit

 

History

History
417 lines (327 loc) · 22.2 KB

5.md

File metadata and controls

417 lines (327 loc) · 22.2 KB

Chapter 5: Boot Memory Allocator

  • It's just not practical to statically initialise all the core kernel memory structures at compile-time, because there are too many permutations of hardware configurations out there.

  • To set up even the most basic structures requires memory - even the physical page allocator needs to allocate memory to initialise itself. But how?

  • To get around this issue, a specialised allocator called the 'boot memory allocator' is used.

  • It uses the most basic allocator algorithm - first fit. This means the first available space that can store the requested amount of memory is used.

  • Additionally the allocator uses a bitmap to represent memory rather than linked lists of free blocks. If a bit is 1, the page is allocated, if it is 0, it is unallocated.

  • To satisfy allocations of sizes smaller than a page, the allocator records the Page Frame Number (PFN) - the index of the page, and an offset within that page, so subsequent small allocations can be merged together and stored on the same page.

  • The reason this allocator is not just used as the allocator overall is that, despite it not suffering too badly from fragmentation, memory often has to be linearly searched to satisfy an allocation. Since bitmaps are being examined, the search is expensive and this is made worse by the fact that this algorithm tends to leave many small free blocks at the beginning of physical memory which are scanned for large allocations, which is very wasteful.

  • There are two distinct APIs for the allocator - one for UMA architectures, and another for NUMA architectures. The principle difference is that the NUMA API requires specification of the node affected by the operation.

  • Let's examine the API:

  1. init_bootmem() - Initialises the memory between 0 and the specified PFN page - the beginning of usable memory is at the specified PFN start. init_bootmem_node() is the NUMA equivalent.

  2. reserve_bootmem() - Marks the pages in the specified range reserved. Requests to partially reserve a page will result in the full page being reserved. reserve_bootmem_node() is the NUMA equivalent.

  3. free_bootmem() - Marks the pages in the specified range free. free_bootmem_node() is the NUMA equivalent.

  4. alloc_bootmem() - Allocates specified size bytes from ZONE_NORMAL. The allocation will be aligned to the L1 hardware cache for efficiency. alloc_bootmem_node() is the NUMA equivalent. alloc_bootmem_pages_node() is the NUMA equivalent.

  5. alloc_bootmem_low() - Allocates specified size bytes from ZONE_DMA. As with alloc_bootmem(), the allocation will be L1-aligned. alloc_bootmem_low_pages_node() is the NUMA equivalent.

  6. alloc_bootmem_pages() - Allocates specified size bytes from ZONE_NORMAL, aligned on page size so that full pages will be returned to the caller.

  7. alloc_bootmem_low_pages() - Same as alloc_bootmem_pages() only from ZONE_DMA.

  8. bootmem_bootmap_pages() - Calculates the number of pages required to store a bitmap representing the allocation state of the specified pages number of pages.

  9. free_all_bootmem() - Used at the end of the useful life of the boot allocator - cycles through all pages in the bitmap. For each that is free its flags are cleared and the page is freed to the physical page allocator (see chapter 6) so the runtime allocator can set up its free lists. free_all_bootmem_node() is the NUMA equivalent.

5.1 Representing the Boot Map

  • struct bootmem_data, typedef-d to bootmem_data_t, exists for each NUMA node of memory in the system. It contains information needed for the boot memory allocator to allocate memory for a node - the bitmap representing allocated pages and where the memory is located:
/*
 * node_bootmem_map is a map pointer - the bits represent all physical
 * memory pages (including holes) on the node.
 */
typedef struct bootmem_data {
        unsigned long node_boot_start;
        unsigned long node_low_pfn;
        void *node_bootmem_map;
        unsigned long last_offset;
        unsigned long last_pos;
} bootmem_data_t;
  • Looking at each field:
  1. node_boot_start - The starting physical address of the represented block.
  2. node_low_pfn - The end physical address of the block (LS - likely exclusive bound), in other words the end of the ZONE_NORMAL this node represents.
  3. node_bootmem_map - The virtual address of the bitmap representing allocated or free pages with each bit.
  4. last_offset - The offset within the page of the end of the last allocation. If 0, the page is full.
  5. last_pos - The PFN of the page used with the last allocation. Combined with last_offset a test can be made to see if allocations can be merged into the last page rather than needing to allocate a whole new page.

5.2 Initialising the Boot Memory Allocator

  • Each architecture is required to supply a setup_arch() function which, among other tasks, is responsible for determining the necessary parameters to initialise the boot memory allocator.

  • Each architecture additionally has its own function to get the necessary parameters. On i386, this is achieved via setup_memory() (as discussed in 2.2.2.)

  • setup_memory() calculates the following parameters:

  1. min_low_pfn - The lowest PFN available in the system.

  2. max_low_pfn - The highest PFN that may be addressed by low memory (ZONE_NORMAL.)

  3. highstart_pfn - This is the PFN of the beginning of high memory (ZONE_HIGHMEM.)

  4. highend_pfn - This is the last PFN in high memory.

  5. max_pfn - This is the last PFN available to the system.

5.3 Initialising bootmem_data

  • After the limits of usable physical memory are discovered by setup_memory(), one of two boot memory initialisation functions is selected and provided with the start and end PFN for the node to be initialised.

  • init_bootmem(), which initialises contig_page_data, is used by UMA architectures, whereas init_bootmem_node() is used by NUMA systems to initialise a specified node. Both functions are trivial and rely on init_bootmem_core() to do the real work.

  • init_bootmem_core() does the following:

  1. Inserts the node's pg_data_t into pgdat_list - at the end of this function this node is ready for use and needs to be established.

  2. Records the starting and end address this node in its associated bootmem_data_t structure.

  3. Allocates the bitmap representing page allocations. The size (in bytes, hence the division by 8) of the bitmap required is calculated as mapsize = (end_pfn - start_pfn + 7)/8.

  4. The bitmap is stored at the physical address pointed to by bootmem_data_t->node_boot_start, and the virtual address to this map is placed in bootmem_data_t->node_bootmem_map.

  5. Initialises the entire bitmap to 1, effectively marking all pages allocated, so in the next step it can determine available memory and mark them unallocated hence available.

  6. On i386, register_bootmem_low_pages() is used to read through the e820 map, calling free_bootmem() for each usable page to set the bit 0 (marking it available for allocation), before calling reserve_bootmem() to reserve the pages needed by the actual bitmap itself.

5.4 Allocating Memory

  1. (NUMA-only) pgdat - The node to allocate from. In the UMA case it is assumed to be contig_page_data.

  2. size - The size in bytes of the requested allocation.

  3. align - The number of bytes the request should be aligned to. For small allocations, they are aligned to SMP_CACHE_BYTES which on i386 will align to the L1 hardware cache.

  4. goal - The preferred starting physical address to begin allocating from. The _low functions will start from physical address 0, whereas the others will begin from MAX_DMA_ADDRESS - the maximum address DMA transfers may be made from on the architecture.

  1. Linearly scan memory starting from goal for a block of memory large enough to satisfy the allocation.

  2. Decides whether or not the new allocation can be merged with the previous one. A merge will be performed if the page for the previous allocation (bootmem_data->pos) is adjacent to the page found for this allocation, the previous page has some free space in it (bootmem_data->offset != 0), and the alignment is less than PAGE_SIZE.

  3. Updates pos and offset fields to store the last page used for allocating and how much of the last page was used. If the page was fully used, offset will be set to 0.

5.5 Freeing Memory

  • Freeing is performed via either free_bootmem() or free_bootmem_node() for UMA and NUMA systems respectively.

  • Both functions ultimately call free_bootmem_core(), free_bootmem() passes contig_page_data.bdata and free_bootmem_node() passes the appropriate pgdat->bdata.

  • free_bootmem_core() is relatively simple compared to other allocator functions. For each full page affected by the free, the corresponding bit in the bitmap is set to 0, with double-frees highlighted via BUG().

  • An important restriction of the free functions is that they may only be used for freeing full pages. The code rounds down bitmap indexes to an integer value, meaning that the page containing the remaining portion of memory is left marked as allocated and thus is considered reserved.

  • This ultimately doesn't turn out to be a huge problem as the allocations persist for the lifetime of the system, however it's an important restriction be aware of during boot time.

5.6 retiring the Boot Memory Allocator

  • Late in the bootstrapping process, the function start_kernel() is called, which knows it is safe to remove the boot allocator and its associated data structures.

  • Each architecture is required to provide a mem_init() function that is responsible for:

  1. destroying the boot memory allocator and its associated structures.

  2. Calculating the dimensions of low and high memory and outputting a message to a user describing these.

  3. Providing final initialisations of hardware if necessary.

  1. Clears the PG_reserved flag in its struct page.

  2. Sets the page->count field to 1.

  3. Calls __free_pages() so the buddy allocator (discussed in chapter 6) call build its free liss.

  • Next, it frees all pages used for the bitmap and gives them to the buddy allocator.

  • At this stage, the buddy allocator has control of all the pages in low memory, leaving only the high memory pages left to deal with.

  • After free_all_bootmem() returns, free_pages_init() counts the number of reserved pages for accounting purposes.

  • After this, free_pages_init() calls one_highpage_init() for every page between highstart_pfn and highend_pfn.

  • one_highpage_init() does the same as free_all_bootmem_core() did for the low memory - clears the PG_reserved flag, sets the count to 1 and calls __free_pages() to hand them over to the buddy allocator.

  • At this point, the boot memory allocator is no longer required and the buddy allocator is the main physical page allocator for the system.

  • An interesting point to note is that not only is the data used for bootstrapping removed, so is the code. All initialisation functions that are required only during system startup are marked __init, e.g. unsigned long __init free_all_bootmem(void);

  • All (non-module) uses of __init cause the functions specified this way to be placed together in the .init section of the kernel binary (typically vmlinux is an ELF format binary), and (on i386) free_initmem() walks all the pages from __init_begin to __init_end, potentially freeing a significant amount of RAM.

  • The following flowchart describes how the global mem_map array is allocated and initialised, and how the pages are given to the main allocator:

                  -----------------------------------------------------
                  | Initialise mem_map struct page array with bootmem |
                  |   On retirement, give pages to buddy allocator    |
                  -----------------------------------------------------
                                            |
                                            |
                             /--------------X-----------------------------\
                             |                                            |
                             |                                            v
                             |                               ----------------------------
                             |                               | Retire bootmem allocator |
                             |                               |  with free_pages_init()  |
                             |                               ----------------------------
                             v                                            |
                 --------------------------                               |
                 | Allocate mem_map array |                               v
                 --------------------------                  ----------------------------
                             |                               | free_all_bootmem() which |
                             |                               |     ultimately calls     |
                             v                               | free_all_bootmem_core()  |
                 -------------------------                   ----------------------------
                 | free_area_init_core() |                                |
                 -------------------------                                |
                             |                                            v
                             |                             -------------------------------
              /--------------X--------------\              |  For all pages that are not |
              |                             |              |  allocated by bootmem, call |
              v                             |              | FreePageReserved to mark it |
   ------------------------                 |              | free and set page->count to |
   | alloc_bootmem_node() |                 |              |   1 with set_page_count()   |
   ------------------------                 |              -------------------------------
              |                             |                             |
              |                             |                             |
              v                             v                             v
------------------------------ --------------------------- ------------------------------
| alloc_bootmem_core() uses  | |  Call SetPageReserved() | | Call __free_page() as the  |
|     pg_data_t->bata to     | |   record what zone the  | |  struct page looks like a  |
|    allocate a block of     | |   page belongs to with  | | normally allocated page to |
| memory that is zero-filled | | set_page_zone() and set | |   the main allocator. The  |
|   and stores the mem_map   | |    page->count to 0     | | page will be automatically |
------------------------------ --------------------------- |   placed on the main free  |
                                                           |         page lists         |
                                                           ------------------------------