Reading xv6OS Thoroughly to Fully Understand the Kernel - Page Tables (PDT/PT) Edition -

This page has been machine-translated from the original page.

Inspired by An Introduction to OS Code Reading: Learning Kernel Internals with UNIX V6, I’m reading xv6 OS.

Because UNIX V6 itself does not run on x86 CPUs, I decided to read the source of kash1064/xv6-public: xv6 OS, a fork of the xv6 OS repository that makes UNIX V6 run on the x86 architecture.

In the previous article, I looked at the locking-related behavior of the kinit1 function, which is executed first from main.

This time, I will trace how kvmalloc initializes the xv6 kernel’s page tables.

About 32-bit paging
- PDT and PT
- How virtual addresses are resolved to physical addresses
The kvmalloc function
The switchkvm function
Summary
Reference Books

About 32-bit paging

This time I will start from xv6OS’s kvmalloc function.

The kvmalloc function creates the kernel page table.

Before reading the source code, I want to organize the paging mechanism in a 32-bit environment.

Paging is implemented by the MMU (Memory Management Unit).

On x86, the MMU maps memory by using a PD (Page Directory) and PT (Page Table).

PDT and PT

Both the PDT and PT contain PDEs (Page Directory Entries) and PTEs (Page Table Entries), respectively.

Each PDE and PTE is 4 bytes (32 bits), and a PD contains 1024 PDEs while a PT contains 1024 PTEs.

That makes both of them 4 KiB in size, which is one default page on x86.

For the details of each entry, the following page is helpful.

Reference: Paging - OSDev Wiki

How virtual addresses are resolved to physical addresses

When translating a virtual address to a physical address, the virtual address is divided into the following three parts.

Top 10 bits: PDE index
Next 10 bits: PTE index
Bottom 12 bits: page offset

During address translation, the MMU looks up the PD and uses the virtual-address index to identify the PDE.

Next, it uses the PDE information together with the virtual-address index to identify the PTE.

Because the PTE resolves the base address of the physical address, the final physical address can be identified by combining it with the offset inside the virtual address.

The following diagram explained the flow clearly, so I am quoting it here.

Reference image: 6th Episode: Making Data on Memory Invisible (Part 1) | Nikkei Cross Tech (xTECH)

A PD is simply an array of PDEs.

The starting address of the PD is stored in the CR3 register.

The `kvmalloc` function

Now then, continuing from the previous article, let’s follow the processing of the kernel’s main function.

// Bootstrap processor starts running C code here.
// Allocate a real stack and switch to it, first
// doing some setup required for memory allocator to work.
int main(void)
{
  kinit1(end, P2V(4*1024*1024)); // phys page allocator
  kvmalloc();      // kernel page table
  mpinit();        // detect other processors
  lapicinit();     // interrupt controller
  seginit();       // segment descriptors
  picinit();       // disable pic
  ioapicinit();    // another interrupt controller
  consoleinit();   // console hardware
  uartinit();      // serial port
  pinit();         // process table
  tvinit();        // trap vectors
  binit();         // buffer cache
  fileinit();      // file table
  ideinit();       // disk 
  startothers();   // start other processors
  kinit2(P2V(4*1024*1024), P2V(PHYSTOP)); // must come after startothers()
  userinit();      // first user process
  mpmain();        // finish this processor's setup
}

In the previous article, I got as far as the end of kinit1, so this time I will start with kvmalloc.

The kvmalloc function is defined in vm.c as follows.

// Allocate one page table for the machine for the kernel address
// space for scheduler processes.
void kvmalloc(void)
{
  kpgdir = setupkvm();
  switchkvm();
}

The kvmalloc function changes the kernel page table.

At this point, it switches to a page table design that lets the kernel manage process address spaces.

Reference: P.30 xv6OS

In xv6OS, one page table is created per process, and when a process runs, the kernel uses that process’s page table.

There is also a page table called kpgdir that is used when the CPU is not running a process.

About virtual memory

In xv6OS, as well as in 32-bit OSes such as Windows and Linux on the x86 architecture, each process is given a private memory space called virtual memory.

Because each process’s virtual address space is independent, each one can access memory within its own 2 GiB virtual address space.

(When a process references an address, the kernel decides which physical memory address it actually points to.)

Thanks to this mechanism, even when paging is used, a process can always access memory through virtual addresses, so applications can be developed without worrying about changes to physical memory addresses.

A 32-bit OS can manage at most 4 GiB of address space.

However, by creating a virtual address space for each process and using paging, the system can operate as if it had a larger memory space than the actual physical memory.

It also prevents memory conflicts and makes access control possible because each process has its own isolated memory space.

By the way, the reason the virtual address space is 2 GiB even though a 32-bit OS can handle up to 4 GiB seems to be that the upper 2 GiB are reserved for the kernel.

The following diagram is helpful if you want more detail.

Reference image: P.31 xv6OS

The `setupkvm` function

For now, let’s trace what kvmalloc does.

In the line kpgdir = setupkvm();, the return value of setupkvm is stored in kpgdir, which has type pde_t.

Incidentally, mmu.h shows that these page-table entry types are uint.

typedef uint pte_t;

The setupkvm function is defined in vm.c as follows.

// Set up kernel part of a page table.
pde_t* setupkvm(void)
{
  pde_t *pgdir;
  struct kmap *k;

  if((pgdir = (pde_t*)kalloc()) == 0) return 0;
  memset(pgdir, 0, PGSIZE);
  if (P2V(PHYSTOP) > (void*)DEVSPACE) panic("PHYSTOP too high");
  for(k = kmap; k < &kmap[NELEM(kmap)]; k++)
    if(mappages(pgdir, k->virt, k->phys_end - k->phys_start,
                (uint)k->phys_start, k->perm) < 0) 
    {
      freevm(pgdir);
      return 0;
    }
  return pgdir;
}

First, here are the variable declarations.

pde_t *pgdir;
struct kmap *k;

As noted above, pde_t is the same as uint.

The next structure, kmap, is the kernel memory-mapping table defined in vm.c.

// This table defines the kernel's mappings, which are present in
// every process's page table.
static struct kmap {
  void *virt;
  uint phys_start;
  uint phys_end;
  int perm;
} kmap[] = {
  { (void*)KERNBASE, 0,             EXTMEM,    PTE_W}, // I/O space
  { (void*)KERNLINK, V2P(KERNLINK), V2P(data), 0},     // kern text+rodata
  { (void*)data,     V2P(data),     PHYSTOP,   PTE_W}, // kern data+memory
  { (void*)DEVSPACE, DEVSPACE,      0,         PTE_W}, // more devices
};

It defines the start addresses, end addresses, and permissions.

The initialized mappings match the layout in the image I showed earlier.

Reference image: P.31 xv6OS

Allocating pages with `kalloc`

After that, it checks and initializes the memory region to use after creating the page table.

if((pgdir = (pde_t*)kalloc()) == 0) return 0;
memset(pgdir, 0, PGSIZE);
if (P2V(PHYSTOP) > (void*)DEVSPACE) panic("PHYSTOP too high");

The kalloc function is defined in kalloc.c as follows.

// Allocate one 4096-byte page of physical memory.
// Returns a pointer that the kernel can use.
// Returns 0 if the memory cannot be allocated.
char* kalloc(void)
{
  struct run *r;
  if(kmem.use_lock) acquire(&kmem.lock);
  r = kmem.freelist;
  if(r) kmem.freelist = r->next;
  if(kmem.use_lock) release(&kmem.lock);
  return (char*)r;
}

As I confirmed in the previous article, locking is still disabled at this point.

So I will ignore acquire and release.

kmem.freelist holds freed pages in a singly linked list.

In other words, it looks like kalloc simply checks whether there is a free region and allocates one page.

If allocation succeeds, kalloc returns a pointer to the allocated page and stores it in pgdir.

That makes it possible for memset(pgdir, 0, PGSIZE); in the next line to initialize that one page of memory to zero.

I already checked the behavior of memset in the previous article, so I will omit it here.

The last line only checks whether PHYSTOP exceeds DEVSPACE, so I will skip that as well.

In the end, the allocated one-page region pgdir is used as the PDT after this.

Creating PTEs for virtual addresses

By this point, pgdir contains a reference to the one-page memory region that was allocated.

for(k = kmap; k < &kmap[NELEM(kmap)]; k++) 
{
    if(mappages(pgdir, k->virt, k->phys_end - k->phys_start, (uint)k->phys_start, k->perm) < 0) 
    {
        freevm(pgdir);
        return 0;
    }
}
return pgdir;

NELEM is the following macro defined in defs.h, and it returns the number of elements in a fixed-size array.

// number of elements in fixed-size array
#define NELEM(x) (sizeof(x)/sizeof((x)[0]))

In other words, the loop condition k = kmap; k < &kmap[NELEM(kmap)]; k++ simply means “process every element in the kmap array.”

The key part is the following lines.

if(mappages(pgdir, k->virt, k->phys_end - k->phys_start,(uint)k->phys_start, k->perm) < 0) 
{
    freevm(pgdir);
    return 0;
}

The mappages function is defined in vm.c. It creates PTEs for the virtual addresses passed as arguments and points them to the physical addresses passed as arguments.

A PTE stores the physical address of the page together with access-control and attribute information.

Reference image: 6th Episode: Making Data on Memory Invisible (Part 1) | Nikkei Cross Tech (xTECH)

Reference: ０から作るOS開発　ページングその１　ページとPTEとPDE

Let’s first look at the source.

// Create PTEs for virtual addresses starting at va that refer to
// physical addresses starting at pa. va and size might not
// be page-aligned.
static int mappages(pde_t *pgdir, void *va, uint size, uint pa, int perm)
{
  char *a, *last;
  pte_t *pte;

  a = (char*)PGROUNDDOWN((uint)va);
  last = (char*)PGROUNDDOWN(((uint)va) + size - 1);
  for(;;){
    if((pte = walkpgdir(pgdir, a, 1)) == 0)
      return -1;
    if(*pte & PTE_P)
      panic("remap");
    *pte = pa | perm | PTE_P;
    if(a == last)
      break;
    a += PGSIZE;
    pa += PGSIZE;
  }
  return 0;
}

mappages takes the following five arguments.

*pgdir: the allocated page (PDT)
*va: the virtual address to create PTEs for
size: the size of the physical address range to associate
pa: the starting physical address to associate
perm: the permissions to assign

First, based on the virtual address to create PTEs for and the size of the physical-address range to associate, page-aligned addresses are stored in a and last.

I already explained PGROUNDDOWN in a previous article, so I will skip it here.

The next loop runs while adding the page size until a becomes equal to last.

for(;;){
  if((pte = walkpgdir(pgdir, a, 1)) == 0) return -1;
  if(*pte & PTE_P) panic("remap");
  *pte = pa | perm | PTE_P;
  if(a == last) break;
  a += PGSIZE;
  pa += PGSIZE;
}

The function that actually allocates the PTE area is walkpgdir.

walkpgdir is defined in vm.c.

It returns the address of the PTE in the page table pgdir that corresponds to the virtual address passed as an argument.

// Return the address of the PTE in page table pgdir
// that corresponds to virtual address va.  If alloc!=0,
// create any required page table pages.
static pte_t * 
walkpgdir(pde_t *pgdir, const void *va, int alloc)
{
  pde_t *pde;
  pte_t *pgtab;

  pde = &pgdir[PDX(va)];
  if(*pde & PTE_P){
    pgtab = (pte_t*)P2V(PTE_ADDR(*pde));
  } else {
    if(!alloc || (pgtab = (pte_t*)kalloc()) == 0) return 0;
    // Make sure all those PTE_P bits are zero.
    memset(pgtab, 0, PGSIZE);
    // The permissions here are overly generous, but they can
    // be further restricted by the permissions in the page table
    // entries, if necessary.
    *pde = V2P(pgtab) | PTE_P | PTE_W | PTE_U;
  }
  return &pgtab[PTX(va)];
}

First, the line pde = &pgdir[PDX(va)]; resolves the index in the PDT used for the virtual address.

PDX(va) is a macro defined in mmu.h that resolves the page-directory index from a virtual address.

// A virtual address 'la' has a three-part structure as follows:
//
// +--------10------+-------10-------+---------12----------+
// | Page Directory |   Page Table   | Offset within Page  |
// |      Index     |      Index     |                     |
// +----------------+----------------+---------------------+
//  \--- PDX(va) --/ \--- PTX(va) --/

// page directory index
#define PDX(va)         (((uint)(va) >> PDXSHIFT) & 0x3FF)

// page table index
#define PTX(va)         (((uint)(va) >> PTXSHIFT) & 0x3FF)

#define PTXSHIFT        12      // offset of PTX in a linear address
#define PDXSHIFT        22      // offset of PDX in a linear address

Next, the line if(*pde & PTE_P) checks the Present bit of the PDE it found.

The Present bit of a PDE is used to tell whether the page is currently present.

Reference: Paging - OSDev Wiki

Here, it seems to be checking whether a PDE already exists for the virtual address received as an argument.

In the initial state, the PD has been initialized to 0 by memset, so the following code runs.

if(!alloc || (pgtab = (pte_t*)kalloc()) == 0) return 0;

// Make sure all those PTE_P bits are zero.
memset(pgtab, 0, PGSIZE);
// The permissions here are overly generous, but they can
// be further restricted by the permissions in the page table
// entries, if necessary.
*pde = V2P(pgtab) | PTE_P | PTE_W | PTE_U;

If the third argument alloc is set to 1 and there is no PDE corresponding to the virtual address, the necessary PT is created here.

I already covered the behavior of kalloc, so I will omit it.

The one-page region obtained here is referenced as pgtab.

It is also initialized to 0 in every byte by memset.

Then the PDE is given the address of pgtab, converted to a physical address, together with the permissions to set on the PDE.

Ultimately, the return value of walkpgdir is a reference to the PTE slot resolved from the PDE corresponding to the virtual address received as an argument.

return &pgtab[PTX(va)];

Now that walkpgdir has finished, we return to mappages.

for(;;){
  if((pte = walkpgdir(pgdir, a, 1)) == 0) return -1;
  if(*pte & PTE_P) panic("remap");
  *pte = pa | perm | PTE_P;
  if(a == last) break;
  a += PGSIZE;
  pa += PGSIZE;
}

A PTE slot corresponding to this virtual address has now been allocated in pte.

When walkpgdir finishes, every PTE is still zero-initialized, so the line *pte = pa | perm | PTE_P; assigns the physical address and permissions.

With that, mappages finishes creating the PTEs that connect virtual addresses and physical addresses.

The `freevm` function

Finally, we return to setupkvm.

The following loop repeated once for each element in the kmap array.

mappages created the PDEs and PTEs that connect virtual addresses and physical addresses.

for(k = kmap; k < &kmap[NELEM(kmap)]; k++)
{
  if(mappages(pgdir, k->virt, k->phys_end - k->phys_start,
              (uint)k->phys_start, k->perm) < 0) {
    freevm(pgdir);
    return 0;
  }
}
return pgdir;

If page-table creation fails, pgdir, the PDT page used up to this point, is passed to freevm and the memory region is freed.

The freevm function is also defined in vm.c.

Since freevm is not executed at this point, I will omit the details.

// Free a page table and all the physical memory pages
// in the user part.
void freevm(pde_t *pgdir)
{
  uint i;
  if(pgdir == 0) panic("freevm: no pgdir");
  deallocuvm(pgdir, KERNBASE, 0);
  for(i = 0; i < NPDENTRIES; i++){
    if(pgdir[i] & PTE_P){
      char * v = P2V(PTE_ADDR(pgdir[i]));
      kfree(v);
    }
  }
  kfree((char*)pgdir);
}

The `switchkvm` function

At this point, setupkvm is finished, and it returns the kernel’s PDT.

// Allocate one page table for the machine for the kernel address
// space for scheduler processes.
void kvmalloc(void)
{
  kpgdir = setupkvm();
  switchkvm();
}

As I confirmed in the first half of this article, on x86 the starting address of the PDT referenced by the kernel is stored in the CR3 register.

The behavior of switchkvm is simple: it converts the virtual address of the created PDT into a physical address and stores it in the CR3 register.

// Switch h/w page table register to the kernel-only page table,
// for when no process is running.
void switchkvm(void)
{
  lcr3(V2P(kpgdir));   // switch to the kernel page table
}

The lcr3 function is defined in x86.h.

static inline void
lcr3(uint val)
{
  asm volatile("movl %0,%%cr3" : : "r" (val));
}

With that, kvmalloc has finished creating and switching the kernel page table.

Summary

With this, I have read 3 of the 18 functions called from the xv6 kernel’s main function.

There is still a long way to go, but I will keep working through it little by little.

Reference Books

Published Jan 26, 2022

Aspiring Reverse Engineer and CTF Player (Team: 0nePadding). Passionate about WinDbg and Anti-Virus internals. OSCP / CISSP. Working at Microsoft Japan, but all views expressed are my own.かしわば(@kash1064) on Twitter

Reading xv6OS Thoroughly to Fully Understand the Kernel - Page Tables (PDT/PT) Edition -

Table of Contents