-
When dealing with large amounts of memory, it is preferable to use physically contiguous pages where possible, both for caching and for memory access latency.
-
Unfortunately, this is not always possible due to external fragmentation problems with the binary buddy allocator. (LS - I thought internal fragmentation was more the problem for the binary buddy system? Also, is this the only cause of non-contiguous pages?)
-
Linux provides a means for non-contiguous physical memory to be used in contiguous virtual memory via vmalloc().
-
When vmalloc() is used, an area is reserved in the virtual address space between VMALLOC_START and VMALLOC_END. The location of VMALLOC_START varies depending on the amount of available physical memory, but the region will be at least VMALLOC_RESERVE (aliased to __VMALLOC_RESERVE) in size, which on i386 is 128MiB. The exact size of the region was discussed in more detail in section 4.1.
-
The page tables in this region are adjusted as needed to point to physical pages which are allocated with the normal physical page allocator - this means that allocation has to be a multiple of the hardware page size.
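
To make the page-size constraint concrete, here is a minimal user-space sketch of rounding a request up to whole pages. The names sk_page_align() and sk_pages_needed() are hypothetical, and the 4KiB page size is an assumption that matches i386. It is an illustration only, not the kernel's own code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed page size for the sketch: 4KiB, as on i386. */
#define SK_PAGE_SHIFT 12
#define SK_PAGE_SIZE  ((size_t)1 << SK_PAGE_SHIFT)
#define SK_PAGE_MASK  (~(SK_PAGE_SIZE - 1))

/* Round a byte count up to the next multiple of the page size. */
static size_t sk_page_align(size_t size)
{
    /* Guard against wrap-around when adding the rounding term. */
    assert(size <= SIZE_MAX - (SK_PAGE_SIZE - 1));
    return (size + SK_PAGE_SIZE - 1) & SK_PAGE_MASK;
}

/* Number of whole pages needed to back a request of `size` bytes. */
static size_t sk_pages_needed(size_t size)
{
    return sk_page_align(size) >> SK_PAGE_SHIFT;
}
```

With these definitions a one-byte request still consumes a full page, which is one reason this style of allocation suits large buffers better than small ones.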
-
Because performing allocations this way requires modifications to be made to the kernel page tables, and because only the virtual address space between VMALLOC_START and VMALLOC_END is available, there is a limit on how much memory can be mapped with vmalloc().
-
Because of the limited nature of vmalloc(), it is used sparingly in the core kernel. In fact, in 2.4.22 it is only used for storing swap map information (see chapter 11) and for loading kernel modules into memory.
- The vmalloc address space is managed with a 'resource map allocator'. struct vm_struct is used for storing the base, size pairs for this allocator and is defined as follows:
struct vm_struct {
    unsigned long flags;
    void *addr;
    unsigned long size;
    struct vm_struct *next;
};
-
In theory a fully-fledged VMA could have been used, but a VMA contains more information that doesn't apply to vmalloc areas, so doing that would be wasteful.
-
Looking at each field:
-
flags - Either set to VM_ALLOC if used with vmalloc() or VM_IOREMAP when ioremap() is used to map high memory into the kernel virtual address space.
-
addr - Starting address of the memory block.
-
size - Size in bytes.
-
next - Pointer to the next vm_struct. These are ordered by address and the list is protected by the vmlist_lock lock.
- Each area is separated by at least one page to protect against overruns:
------------------------------------------------------/\/\
| vmalloc    | Page | vmalloc    | Page | vmalloc    |    |
| Allocation | Gap  | Allocation | Gap  | Allocation |    |
------------------------------------------------------/\/\
^                                                         ^
| VMALLOC_START                               VMALLOC_END |
-
When the kernel wants to allocate a new area, the vm_struct list is searched linearly via get_vm_area().
-
Space for the struct is allocated with kmalloc().
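
The linear search and the one-page gap between areas can be sketched in user space as follows. This is a simplified model written for illustration, not the kernel source: every sk_ name is hypothetical, the region bounds are made-up values, malloc() stands in for kmalloc(), addresses are plain integers, and the vmlist_lock locking is omitted.

```c
#include <assert.h>
#include <stdlib.h>

#define SK_PAGE_SIZE     4096UL
/* Made-up region bounds for the sketch; real values are architecture-specific. */
#define SK_VMALLOC_START 0x10000000UL
#define SK_VMALLOC_END   0x18000000UL

struct sk_vm_struct {
    unsigned long flags;
    unsigned long addr;          /* an integer here; void * in the kernel struct */
    unsigned long size;          /* includes the trailing gap page */
    struct sk_vm_struct *next;
};

static struct sk_vm_struct *sk_vmlist;   /* kept sorted by address */

/* Find the first hole that fits `size` bytes plus one gap page. */
static struct sk_vm_struct *sk_get_vm_area(unsigned long size, unsigned long flags)
{
    struct sk_vm_struct **p, *tmp, *area;
    unsigned long addr = SK_VMALLOC_START;

    if (size == 0 || size > SK_VMALLOC_END - SK_VMALLOC_START - SK_PAGE_SIZE)
        return NULL;
    assert(size % SK_PAGE_SIZE == 0);    /* caller is expected to page-align */
    size += SK_PAGE_SIZE;                /* reserve the gap page after the area */

    area = malloc(sizeof(*area));        /* stand-in for kmalloc() */
    if (!area)
        return NULL;

    for (p = &sk_vmlist; (tmp = *p) != NULL; p = &tmp->next) {
        if (addr + size <= tmp->addr)
            break;                       /* the hole before tmp is large enough */
        if (tmp->addr + tmp->size > addr)
            addr = tmp->addr + tmp->size;    /* continue after tmp */
        if (addr > SK_VMALLOC_END - size) {
            free(area);
            return NULL;                 /* ran out of address space */
        }
    }

    area->flags = flags;
    area->addr = addr;
    area->size = size;
    area->next = *p;                     /* link in, preserving address order */
    *p = area;
    return area;
}
```

Because the size stored in each struct already includes the gap page, the next area found by the search automatically starts one page beyond the usable end of its predecessor.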
-
When the virtual area is used for remapping an area for I/O (known as 'ioremapping'), get_vm_area() will be called directly to map the requested area.
- Let's take a look at the vmalloc allocation API:
-
vmalloc() - Allocate a number of pages in vmalloc space that satisfy the required size.
-
vmalloc_dma() - Allocate a number of pages in vmalloc space from ZONE_DMA.
-
vmalloc_32() - Allocate memory that is suitable for 32-bit addressing. This ensures that physical page frames are in ZONE_NORMAL, which is required by 32-bit devices.
-
Each of these functions calls get_vm_area() to find a region large enough to store the requested amount of memory, searching through a linear linked list of vm_structs and returning a new struct describing the allocated region.
-
Next they allocate the necessary PGD entries with vmalloc_area_pages() (and subsequently __vmalloc_area_pages()), PMD entries with alloc_area_pmd() and PTE entries with alloc_area_pte() before finally allocating the page with alloc_page().
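
The idea of filling in each page table level on demand before backing every entry with a page can be sketched with a toy three-level table. This is an illustrative user-space model under several assumptions: all sk_ names are hypothetical, each table has only 16 entries, entries hold plain pointers rather than frame numbers and flag bits, malloc() stands in for alloc_page(), and the walk goes page by page instead of handling a whole range at each level.

```c
#include <assert.h>
#include <stdlib.h>

#define SK_PAGE_SHIFT 12
#define SK_PTRS       16                     /* entries per table (tiny, for the sketch) */
#define SK_PTE_SHIFT  SK_PAGE_SHIFT
#define SK_PMD_SHIFT  (SK_PAGE_SHIFT + 4)
#define SK_PGD_SHIFT  (SK_PAGE_SHIFT + 8)

struct sk_pte_table { void *page[SK_PTRS]; };
struct sk_pmd_table { struct sk_pte_table *pte[SK_PTRS]; };
struct sk_pgd_table { struct sk_pmd_table *pmd[SK_PTRS]; };

/* Index into one table level for a given address. */
static unsigned sk_index(unsigned long addr, int shift)
{
    return (unsigned)((addr >> shift) & (SK_PTRS - 1));
}

/* Map [addr, addr + size): returns 0 on success, -1 if an allocation fails. */
static int sk_map_range(struct sk_pgd_table *pgd, unsigned long addr, unsigned long size)
{
    unsigned long end = addr + size;

    for (; addr < end; addr += 1UL << SK_PAGE_SHIFT) {
        struct sk_pmd_table **pmdp = &pgd->pmd[sk_index(addr, SK_PGD_SHIFT)];
        struct sk_pte_table **ptep;
        void **slot;

        if (!*pmdp && !(*pmdp = calloc(1, sizeof(**pmdp))))
            return -1;                       /* middle-level table missing and unallocatable */
        ptep = &(*pmdp)->pte[sk_index(addr, SK_PMD_SHIFT)];
        if (!*ptep && !(*ptep = calloc(1, sizeof(**ptep))))
            return -1;                       /* lowest-level table missing and unallocatable */
        slot = &(*ptep)->page[sk_index(addr, SK_PTE_SHIFT)];
        assert(*slot == NULL);               /* the range must not already be mapped */
        *slot = malloc(1UL << SK_PAGE_SHIFT);    /* stand-in for alloc_page() */
        if (!*slot)
            return -1;
    }
    return 0;
}

/* Walk the table and return the backing page, or NULL if unmapped. */
static void *sk_lookup(const struct sk_pgd_table *pgd, unsigned long addr)
{
    const struct sk_pmd_table *pmd = pgd->pmd[sk_index(addr, SK_PGD_SHIFT)];
    const struct sk_pte_table *pte;

    if (!pmd)
        return NULL;
    pte = pmd->pte[sk_index(addr, SK_PMD_SHIFT)];
    if (!pte)
        return NULL;
    return pte->page[sk_index(addr, SK_PTE_SHIFT)];
}
```

The backing pages come from separate allocations, so they need not be adjacent in memory even though the addresses that map to them are consecutive.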
-
The page table updated by vmalloc() is not that of the current process, but rather the reference page table stored at init_mm->pgd. As a result, a process accessing the vmalloc area will cause a page fault exception (its page tables aren't pointing there) and special handling has to be performed. Diagrammatically:
[Figure: vmalloc() and the reference page table. Process A calls vmalloc(), which reserves space in the reference page table (init_mm->pgd) and then allocates pages from the buddy allocator with alloc_page(). The physically non-contiguous pages are inserted into the reference page table, where they appear as virtually contiguous pages above VMALLOC_START. Process B later page faults and enters do_page_fault(). Because the fault is in the vmalloc region, the necessary page table entry is copied from the reference page table into the page tables that manage process B's address space. In both sets of page tables the userspace portion lies below PAGE_OFFSET.]
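
The fault-time copy from the reference page table can be sketched as follows. This is a hedged user-space model, not the kernel's fault handler: the sk_ names are hypothetical, the region bounds are made-up values, only the top-level table is modelled, and a top-level entry is assumed to cover 4MiB as on non-PAE i386.

```c
#include <assert.h>
#include <stddef.h>

#define SK_PGD_ENTRIES   1024
#define SK_PGDIR_SHIFT   22                  /* assumed: one top-level entry covers 4MiB */
/* Made-up region bounds for the sketch. */
#define SK_VMALLOC_START 0xF8800000UL
#define SK_VMALLOC_END   0xFF800000UL

struct sk_pgd { void *entry[SK_PGD_ENTRIES]; };

/*
 * Try to resolve a fault at `addr` by copying the matching top-level
 * entry from the reference table into the faulting process's table.
 * Returns 0 if the entry was copied, -1 if the access is genuinely bad.
 */
static int sk_vmalloc_fault(struct sk_pgd *proc, const struct sk_pgd *ref,
                            unsigned long addr)
{
    size_t idx;

    if (addr < SK_VMALLOC_START || addr >= SK_VMALLOC_END)
        return -1;                           /* not in the vmalloc region */

    idx = (size_t)((addr >> SK_PGDIR_SHIFT) & (SK_PGD_ENTRIES - 1));
    if (ref->entry[idx] == NULL)
        return -1;                           /* nothing mapped in the reference table */

    /* A process entry, if present, must already agree with the reference. */
    assert(proc->entry[idx] == NULL || proc->entry[idx] == ref->entry[idx]);
    proc->entry[idx] = ref->entry[idx];
    return 0;
}
```

Copying lazily on the first fault means an allocation only has to update one table up front, at the cost of a single extra fault in each process that touches the area.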
-
The function vfree() is responsible for freeing a virtual area. It linearly scans the list of vm_structs looking for the appropriate region and then calls vmfree_area_pages() on the region to be freed.
-
vmfree_area_pages() is the exact opposite of vmalloc_area_pages(). It walks the page table and frees up PTEs and associated pages for the region rather than allocating them.
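
The list scan performed when freeing can be sketched in user space as below. As before, this is an assumption-laden illustration rather than kernel code: the sk_ names are hypothetical, sk_unmap_area() is a stand-in for vmfree_area_pages() that only records how many bytes were released, free() stands in for the kernel's deallocation of the struct, and locking is omitted.

```c
#include <assert.h>
#include <stdlib.h>

struct sk_vm_struct {
    unsigned long flags;
    void *addr;
    unsigned long size;
    struct sk_vm_struct *next;
};

static struct sk_vm_struct *sk_vmlist;       /* sorted by address */
static unsigned long sk_bytes_unmapped;      /* bookkeeping for the sketch only */

/* Hypothetical stand-in for vmfree_area_pages(): just records the size. */
static void sk_unmap_area(void *addr, unsigned long size)
{
    (void)addr;
    sk_bytes_unmapped += size;
}

/* Returns 0 if an area starting at `addr` was freed, -1 if none matched. */
static int sk_vfree(void *addr)
{
    struct sk_vm_struct **p, *tmp;

    if (!addr)
        return -1;
    for (p = &sk_vmlist; (tmp = *p) != NULL; p = &tmp->next) {
        if (tmp->addr == addr) {
            assert(tmp->size != 0);          /* a listed area is never empty */
            *p = tmp->next;                  /* unlink from the list */
            sk_unmap_area(tmp->addr, tmp->size);
            free(tmp);
            return 0;
        }
    }
    return -1;                               /* address is not the start of any area */
}
```

Walking the list through a pointer to the previous link lets the matching node be unlinked without tracking a separate predecessor.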