Linux memory allocator
Brief discussion on Linux memory allocator.
This post is a brief discussion of the Linux memory allocator and should help with understanding memory allocation in the Linux OS, from physical memory to user space memory. It is not intended to be complete documentation of the specifics of memory allocation and management; instead it acts as introductory material with pointers to the actual documentation that covers the concepts mentioned here in more depth.
All the information in this post is based on Linux kernel v6.2, the latest version at the time of writing.
Physical memory
Physical memory is detected by the Linux kernel at boot time. It may be addressed from 0x0 (low) up to a maximal (high) address, spanning a contiguous range. In practice, however, it may not be contiguous or start exactly at address 0x0: there can be several contiguous ranges at completely distinct addresses, and the map can contain “holes” that are not accessible to the CPU.
In large-scale machines such as servers with multiple processors, the memory may be arranged into banks that incur a different cost to access depending on their “distance” from the processor. This arrangement is called Non-Uniform Memory Access (NUMA). In a NUMA system, each processor has part of the physical memory “attached” to it. The memory still forms a single address space, so any processor can access any memory location directly using its physical address, but the time to access it depends on its distance from the processor, which results in non-uniform memory access times.
In general-purpose computers such as desktops, the memory can be accessed by all CPUs uniformly, in the same way a single processor accesses its memory. This type of memory architecture for multiple processors is called Uniform Memory Access (UMA). All processors have equal access time to any memory location and access to the memory is balanced; these systems are also called SMP (symmetric multiprocessor) systems.
To handle this diversity and complexity of physical memory across different memory architectures, the Linux kernel uses the following memory models to provide an abstraction over physical memory.
FLATMEM
This is the simplest memory model and is suitable for non-NUMA systems with contiguous, or mostly contiguous, physical memory. In this memory model a global mem_map array maps the entire physical memory, where each element of the array is of type struct page, which describes a page in physical memory. Each element is accessed using the page frame number (PFN) minus ARCH_PFN_OFFSET, the first page frame number in the system. While this model is efficient, it cannot handle large holes in the physical address space.
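As a rough illustration, the FLATMEM conversion between a PFN and its struct page is a plain array index. The sketch below is a user-space mock-up (the struct page contents, array size and ARCH_PFN_OFFSET value are made up) that mirrors the shape of the kernel's __pfn_to_page()/__page_to_pfn() helpers in include/asm-generic/memory_model.h:

```c
#include <stdio.h>

/* Stand-in for the kernel's struct page (see include/linux/mm_types.h);
 * only a flags field is kept here for illustration. */
struct page { unsigned long flags; };

/* FLATMEM: one global array with an entry per physical page frame.
 * The sizes and offsets below are invented for the demonstration. */
#define ARCH_PFN_OFFSET 0x10UL   /* first PFN in the system (example value) */
#define NR_PAGES        64UL     /* pretend physical memory size            */
static struct page mem_map[NR_PAGES];

/* Conversion is a simple array index, as in the kernel's FLATMEM helpers. */
static struct page *pfn_to_page(unsigned long pfn)
{
    return &mem_map[pfn - ARCH_PFN_OFFSET];
}

static unsigned long page_to_pfn(const struct page *page)
{
    return (unsigned long)(page - mem_map) + ARCH_PFN_OFFSET;
}

int main(void)
{
    struct page *p = pfn_to_page(0x20);
    printf("PFN 0x20 -> mem_map[%ld] -> PFN %#lx\n",
           (long)(p - mem_map), page_to_pfn(p));
    return 0;
}
```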
SPARSEMEM
To handle discontiguous physical memory and support NUMA architectures, the DISCONTIGMEM model was developed. It introduced the notion of a memory “node”, which is the basis of NUMA memory management. Each node carries an independent memory management subsystem with its own memory information. The node data is represented by struct pglist_data (with the typedef pg_data_t) and contains the node-local memory map. Assuming that each node has contiguous physical memory, having a mem_map array of struct page elements per node solves the problem of large holes in FLATMEM. But with DISCONTIGMEM it is necessary to determine which node holds a given page in memory in order to convert its PFN into a struct page. In this model, splitting a node creates unnecessary fragmentation, and it also does not support several advanced features such as memory hot-plug and hot-remove.
This limitation was overcome with the newer SPARSEMEM model. It abstracts the use of discontiguous memory maps as a collection of sections of arbitrary size defined by the architecture. Each section is a struct mem_section, logically a pointer to an array of struct page, stored together with logic that aids in section management. The advantage of SPARSEMEM is that it does not require each NUMA node’s physical address space to be contiguous, and it can handle overlapping address space between nodes, which DISCONTIGMEM would simply discard. With SPARSEMEM there are two ways to convert a PFN to the corresponding struct page: the first is “classic sparse” and the second is “sparse vmemmap”. The choice is made at build time and is determined by the value of CONFIG_SPARSEMEM_VMEMMAP. The “classic sparse” encodes the section number of a page in page->flags and uses the high bits of a PFN to access the section that maps that page frame. Inside a section, the PFN is the index into the array of pages. The sparse vmemmap uses a virtually mapped memory map to optimize the pfn_to_page() and page_to_pfn() operations.
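As a rough sketch of the vmemmap case, the conversion again becomes plain pointer arithmetic, this time against a virtually contiguous vmemmap array; the page tables behind that array only back the ranges that correspond to real memory, so the holes cost no RAM. The snippet below is modeled on include/asm-generic/memory_model.h and is illustrative rather than a verbatim copy:

```c
/* Sketch of the sparse vmemmap conversion. vmemmap is a virtually
 * contiguous array of struct page starting at an architecture-defined
 * base address (VMEMMAP_START on x86-64); only the parts of it that
 * describe real memory are actually backed by physical pages. */
#define vmemmap ((struct page *)VMEMMAP_START)

#define __pfn_to_page(pfn)   (vmemmap + (pfn))                 /* index by PFN   */
#define __page_to_pfn(page)  ((unsigned long)((page) - vmemmap)) /* reverse lookup */
```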
Virtual memory
Virtual memory was developed so that application software does not have to manage physical memory directly, as physical memory is limited, complex to manage, and varies across architectures. Using virtual memory, the kernel allows protected and controlled access to physical memory; application processes can access only their virtual address space. On a 32-bit system there are 2^32 unique memory addresses, which gives 4 GB of virtual address space, and on 64-bit systems it is 2^64, a very large number. The actual physical memory can be much smaller than that, and the kernel allocates it fairly among all processes in the system.
This is achieved by the CPU translating virtual addresses into physical addresses using the page tables maintained by the kernel. To make this address translation simple and efficient, virtual and physical memory are divided into fixed-size chunks called pages, or page frames. Each physical memory page can be mapped to one or more virtual pages using the page tables. The page tables are organized hierarchically, because a single flat table covering the whole address space could be larger than the physical memory itself.
The tables at the lowest level of the hierarchy contain physical addresses of actual pages used by processes. The tables at higher levels contain physical addresses of the pages belonging to the lower levels. The pointer to the top level page table resides in a register. When the CPU performs the address translation, it uses this register to access the top level page table. The high bits of the virtual address are used to index an entry in the top level page table. That entry is then used to access the next level in the hierarchy with the next bits of the virtual address as the index to that level page table. The lowest bits in the virtual address define the offset inside the actual page.
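For a concrete feel of how the bits of a virtual address are consumed at each level, the small user-space program below splits an address the way x86-64 4-level paging with 4 KiB pages does (9 index bits per level, 12 offset bits); other architectures and 5-level paging use different splits, and the example address is arbitrary:

```c
#include <stdio.h>
#include <stdint.h>

int main(void)
{
    uint64_t vaddr = 0x00007f3a12345678ULL;   /* arbitrary example address */

    unsigned pgd = (vaddr >> 39) & 0x1ff;     /* top-level (PGD) index     */
    unsigned pud = (vaddr >> 30) & 0x1ff;     /* next level (PUD) index    */
    unsigned pmd = (vaddr >> 21) & 0x1ff;     /* next level (PMD) index    */
    unsigned pte = (vaddr >> 12) & 0x1ff;     /* lowest level (PTE) index  */
    unsigned off =  vaddr        & 0xfff;     /* byte offset inside page   */

    printf("PGD %u, PUD %u, PMD %u, PTE %u, offset %#x\n",
           pgd, pud, pmd, pte, off);
    return 0;
}
```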
When there is no page table entry (PTE) for a given virtual memory page, the processor cannot translate it into a physical page when needed, i.e. for read/write operations. The processor then notifies the kernel of a page fault, and the kernel decides to allocate a physical memory page if the PTE is “valid” and a physical memory page is available. If the PTE is not valid, it means the process has attempted to access a virtual address that it should not have, e.g. writing to a random address in memory; in this case the kernel terminates the process to protect the other processes in the system. If the PTE is valid but no free physical memory pages are available, the kernel invokes the Out Of Memory (OOM) killer, which simply verifies that the system is “truly” out of memory and, if so, selects a process to kill.
In Linux any number of virtual memory pages can be allocated to a process as requested, but physical memory pages are only allocated when needed, i.e. on an attempt to read from or write to a virtual memory page. This way, from the processes’ point of view memory is always “available”, and the kernel manages all the complexities of the physical memory.
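A small demonstration of this demand paging behaviour, assuming a Linux system with the usual overcommit defaults: the program below reserves 1 GiB of virtual memory but only touches the first 16 MiB, so its resident set size (visible in /proc/self/status or top while it waits for input) grows only for the touched pages:

```c
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>

int main(void)
{
    size_t len = 1UL << 30;                       /* 1 GiB of virtual memory */
    char *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED) { perror("mmap"); return 1; }

    /* Touch only the first 16 MiB; only those pages get physical frames. */
    memset(p, 0xA5, 16UL << 20);

    getchar();                     /* pause so RSS can be inspected */
    munmap(p, len);
    return 0;
}
```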
Cache memory
To make access to system memory fast and efficient when multiple processes request data and instructions, cache memory acts as a buffer between the CPU registers and system memory. CPU registers are very limited in size but are as fast as the CPU. System memory is much larger but not as fast, as it is located outside the CPU. Cache memory operates at or near the speed of the CPU, so the CPU is not kept waiting when data is accessed from it. Its purpose is to hold a very fast copy of the contents of selected portions of system memory.
Cache memory sits between the CPU and system memory for both reads and writes of data and instructions. When data is to be read from system memory, the system first checks whether it is already in the cache; if it is, it is used immediately by the CPU. If not, it is fetched from system memory and used, while also being copied into the cache so that it can be reused quickly later.
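One way to observe this caching from user space is the mincore() system call, which reports which pages of a mapping are currently resident in the page cache. The program below is only a sketch: the default path is an arbitrary example, and very small or empty files are not handled specially:

```c
#include <stdio.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/mman.h>
#include <sys/stat.h>

int main(int argc, char **argv)
{
    const char *path = argc > 1 ? argv[1] : "/etc/hostname";  /* example file */
    int fd = open(path, O_RDONLY);
    if (fd < 0) { perror("open"); return 1; }

    struct stat st;
    fstat(fd, &st);
    size_t pagesz = sysconf(_SC_PAGESIZE);
    size_t npages = (st.st_size + pagesz - 1) / pagesz;

    void *map = mmap(NULL, st.st_size, PROT_READ, MAP_SHARED, fd, 0);
    unsigned char vec[npages];
    mincore(map, st.st_size, vec);    /* one byte per page; bit 0 = resident */

    size_t resident = 0;
    for (size_t i = 0; i < npages; i++)
        resident += vec[i] & 1;
    printf("%zu of %zu pages of %s are in the page cache\n",
           resident, npages, path);

    munmap(map, st.st_size);
    close(fd);
    return 0;
}
```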
When the CPU writes data, it is first written to cache memory and marked as “dirty”. The dirty pages in the cache are periodically written back to the backing storage device, and when Linux decides to reuse them for other purposes, it first synchronizes them with the backing storage.
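A minimal sketch of forcing that synchronization from user space, assuming a writable file-backed mapping (the file name here is just an example): msync() with MS_SYNC asks the kernel to flush the dirty pages to the backing file instead of waiting for periodic writeback, much as fsync() does for write()-based I/O:

```c
#include <fcntl.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
    int fd = open("example.dat", O_RDWR | O_CREAT, 0644);  /* example file */
    if (fd < 0) return 1;
    ftruncate(fd, 4096);                     /* one page of backing storage */

    char *p = mmap(NULL, 4096, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (p == MAP_FAILED) return 1;

    memcpy(p, "hello", 5);        /* write goes to the cache, page is dirty */
    msync(p, 4096, MS_SYNC);      /* force writeback to the backing file    */

    munmap(p, 4096);
    close(fd);
    return 0;
}
```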
The Linux kernel manages its own cache of data in memory, the page cache, which stores data in fixed-size chunks of memory called “pages”. In older Linux kernels there was also a buffer cache, which contained the data buffers used by the block device drivers. The buffer cache is now obsolete and only the page cache remains in the Linux kernel.
Cache levels
In modern computers cache memory is available in multiple levels, with the lower levels being closer to the CPU and hence faster and smaller in size. Higher-level caches are larger, though still much smaller than system memory, and faster than system memory but slower than the lower-level caches.
Most systems have 2 levels of cache, and some systems such as high-performance servers have 3 levels. These are namely the L1, L2 and L3 caches. The L1 cache is often located directly on the CPU chip and runs at the same speed as the CPU. The L2 cache is often part of the CPU module, runs nearly at CPU speed, and is usually a bit larger and slower than the L1 cache. The L3 cache is often called the Last-Level Cache (LLC) and is the largest and slowest cache memory unit.
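For reference, glibc exposes the cache sizes it detects through sysconf(); these _SC_LEVEL* constants are a glibc extension and may report 0 on some systems, in which case /sys/devices/system/cpu/cpu*/cache/ is the place to look:

```c
#include <stdio.h>
#include <unistd.h>

int main(void)
{
    /* Sizes reported by glibc for the current CPU's caches. */
    printf("L1d: %ld KiB\n", sysconf(_SC_LEVEL1_DCACHE_SIZE) / 1024);
    printf("L2:  %ld KiB\n", sysconf(_SC_LEVEL2_CACHE_SIZE)  / 1024);
    printf("L3:  %ld KiB\n", sysconf(_SC_LEVEL3_CACHE_SIZE)  / 1024);
    return 0;
}
```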
System calls to allocate memory
In Linux all user space memory is allocated using these system calls; they are called from inside the user space memory management functions, libc’s malloc() and free(). malloc() does more than just allocate memory to user space processes: it keeps track of all the memory it has allocated to the process along with allocation statistics, and it provides a fast and efficient way to allocate and manage that memory. For these reasons it is better to avoid calling these system calls directly and to use malloc() instead. Here we look briefly at what these system calls do.
brk() and sbrk()
These system calls set the program break, which is the location in the process virtual memory marking the end of the data segment that contains global variables and static local variables. Before shared libraries and mmap() there was no separate concept of a heap, so the memory obtained this way was part of the data segment; today it forms the heap. Increasing the program break increases the memory available to the process, effectively allocating memory, and decreasing it reduces the process memory, i.e. deallocates it.
brk() takes an address as its argument and sets the end of the data segment (the program break) to that value.
sbrk() takes an increment value and moves the program break by that amount. Calling sbrk() with an increment of 0 returns the current location of the program break, which at program start is also the beginning of the heap.
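A minimal example of moving the program break directly with sbrk() and brk() (in real programs malloc() should be used instead):

```c
#include <stdio.h>
#include <unistd.h>

int main(void)
{
    void *start = sbrk(0);              /* current program break            */
    sbrk(4096);                         /* grow the data segment by 4 KiB   */
    void *end = sbrk(0);

    printf("break moved from %p to %p\n", start, end);

    brk(start);                         /* shrink it back, freeing the page */
    return 0;
}
```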
mmap()
The mmap() system call creates a mapping in the process virtual address space. It takes a start address, the length of the mapping, protection bits, and flags that control the mapping. The start address given to mmap() is just a hint to the kernel about where to place the mapping; on Linux the kernel will pick a “nearby” page boundary. The kernel then attempts to create the mapping there, and if another mapping already exists at that address, it picks a new address that may or may not depend on the hint. The address of the new mapping is returned.
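The short program below illustrates the hint behaviour; the requested address is arbitrary and the kernel is free to return a different, page-aligned address:

```c
#include <stdio.h>
#include <sys/mman.h>

int main(void)
{
    void *hint = (void *)0x700000000000UL;      /* arbitrary example hint */
    void *p = mmap(hint, 4096, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED) { perror("mmap"); return 1; }

    printf("asked for %p, got %p\n", hint, p);  /* addresses may differ   */
    munmap(p, 4096);
    return 0;
}
```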
The glibc malloc() uses this system call instead of brk() when allocating “large” blocks of memory, as it is more efficient for large allocations.
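As an illustration, glibc’s threshold for switching to mmap() is M_MMAP_THRESHOLD (128 KiB by default) and can be tuned with mallopt(); running the sketch below under strace should show the small allocation served from the heap and the large one backed by mmap(), though the exact behaviour depends on the glibc version:

```c
#include <malloc.h>
#include <stdlib.h>

int main(void)
{
    mallopt(M_MMAP_THRESHOLD, 64 * 1024);   /* lower the threshold to 64 KiB */

    void *small = malloc(16 * 1024);        /* typically comes from the heap */
    void *large = malloc(1024 * 1024);      /* typically backed by mmap()    */

    free(large);                            /* unmapped immediately          */
    free(small);                            /* returned to the heap          */
    return 0;
}
```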