Seriously Reading the xv6 OS to Fully Understand the Kernel — Linker and Paging Edition

This page has been machine-translated from the original page.

I have been reading xv6 OS, inspired by the book はじめてのOSコードリーディング ~UNIX V6で学ぶカーネルのしくみ.

I want to get better at reverse engineering and deepen my understanding of kernels and operating systems.

詳解 Linuxカーネル felt a bit heavy, so I was looking for somewhere lighter to start. I came across UNIX V6 — an OS with a total of around 10,000 lines of code, which is just barely comprehensible for a human — and became interested.

However, UNIX V6 itself does not run on x86 CPUs, so I decided to read the source code of kash1064/xv6-public: xv6 OS, a fork of xv6 OS — which is UNIX V6 adapted to run on x86 architecture.

Continuing from the previous article, I will keep reading the xv6 OS source code.

In the previous article, I read through the xv6 OS bootstrap code and traced it up to the point just before the kernel body is loaded.

This time, I will trace what actually happens when the kernel is loaded.

Loading the Kernel
Building the Kernel Program
Kernel Entry Point
Summary
References

Loading the Kernel

Let me first review where the kernel was being loaded during the bootstrap phase.

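The bootloader first read the first 4096 bytes of the kernel image into scratch space at address 0x10000 so that it could inspect the ELF header, as shown below.

After that, it walked the program headers, loaded each segment to the physical address given in ph->paddr, and called the entry point recorded in the ELF header (elf->entry), transferring control to the kernel.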

void bootmain(void)
{
  struct elfhdr *elf;
  struct proghdr *ph, *eph;
  void (*entry)(void);
  uchar* pa;

  elf = (struct elfhdr*)0x10000;  // scratch space

  // Read 1st page off disk
  readseg((uchar*)elf, 4096, 0);

  // Is this an ELF executable?
  if(elf->magic != ELF_MAGIC)
    return;  // let bootasm.S handle error

  // Load each program segment (ignores ph flags).
  ph = (struct proghdr*)((uchar*)elf + elf->phoff);
  eph = ph + elf->phnum;
  for(; ph < eph; ph++){
    pa = (uchar*)ph->paddr;
    readseg(pa, ph->filesz, ph->off);
    if(ph->memsz > ph->filesz)
      stosb(pa + ph->filesz, 0, ph->memsz - ph->filesz);
  }

  // Call the entry point from the ELF header.
  // Does not return!
  entry = (void(*)(void))(elf->entry);
  entry();
}

So this time I want to start by tracking down the entry() function.

Building the Kernel Program

Let me trace through how the kernel program is built.

The final image file xv6.img is generated with the following commands.

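xv6.img is produced by writing bootblock and kernel into a zero-filled image of 10,000 blocks (dd's default block size is 512 bytes, so 5,120,000 bytes in total): bootblock is written at the start of the image, and kernel is written starting at the second block (seek=1).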

xv6.img: bootblock kernel
	dd if=/dev/zero of=xv6.img count=10000
	dd if=bootblock of=xv6.img conv=notrunc
	dd if=kernel of=xv6.img seek=1 conv=notrunc

We already traced bootblock in the previous article, so this time let’s focus on kernel.

kernel: $(OBJS) entry.o entryother initcode kernel.ld
	$(LD) $(LDFLAGS) -T kernel.ld -o kernel entry.o $(OBJS) -b binary initcode entryother
	$(OBJDUMP) -S kernel > kernel.asm
	$(OBJDUMP) -t kernel | sed '1,/SYMBOL TABLE/d; s/ .* / /; /^$$/d' > kernel.sym

The dependencies for kernel are $(OBJS) entry.o entryother initcode kernel.ld.

The list of $(OBJS) is quite long, so I will skip it. It includes kernel modules such as main.o.

The last two lines only output the binary’s disassembly and symbol information; the actual binary is built by the line $(LD) $(LDFLAGS) -T kernel.ld -o kernel entry.o $(OBJS) -b binary initcode entryother.

LD is used in the form $(TOOLPREFIX)ld, just like the GCC described in the previous article.

Since we are not cross-compiling this time, the plain ld command is used.

LD = $(TOOLPREFIX)ld

# FreeBSD ld wants ``elf_i386_fbsd''
LDFLAGS += -m $(shell $(LD) -V | grep elf_i386 2>/dev/null | head -n 1)

LDFLAGS greps the output of ld -V for the first emulation name containing elf_i386 and passes it to the linker with -m (here, -m elf_i386).

ld -V prints the linker version together with the list of supported emulations.

The actual command executed at build time looks like this:

ld -m elf_i386 -T kernel.ld -o kernel \
entry.o bio.o console.o exec.o file.o fs.o ide.o ioapic.o kalloc.o kbd.o lapic.o log.o main.o mp.o picirq.o pipe.o proc.o sleeplock.o spinlock.o string.o swtch.o syscall.o sysfile.o sysproc.o trapasm.o trap.o uart.o vectors.o vm.o  \
-b binary initcode entryother

The -T option tells ld to use kernel.ld as the linker script in place of its built-in default script.

The -b option specifies the format of the input files listed after it; here binary is specified, so the files that follow are linked in as raw data rather than parsed as ELF objects.

The initcode and entryother that follow are flat binaries assembled from assembly files. For each of them ld generates symbols such as _binary_initcode_start and _binary_initcode_size, which the kernel's C code uses to locate the embedded data.

Reference: LD, the GNU linker - Options

Next, let’s look inside the linker script kernel.ld.

Linker Script

First, what is a linker script? It is a file that specifies the memory layout of objects when the linker links object files to produce an executable.

Normally the linker’s built-in default linker script is used, so you do not need to specify one explicitly.

Incidentally, the default linker script built into the linker can be printed by running ld with the --verbose option.

However, for programs such as an OS kernel or embedded firmware, which run without a host OS loader and must be placed at specific addresses, you need to supply a custom linker script.

Reference: Scripts (LD)

Reference: Basic Script Concepts (LD)

Reference: Mastering GNU C | Computex Co., Ltd.

The full linker script used to build the xv6 OS kernel is as follows.

/* Simple linker script for the JOS kernel.
   See the GNU ld 'info' manual ("info ld") to learn the syntax. */

OUTPUT_FORMAT("elf32-i386", "elf32-i386", "elf32-i386")
OUTPUT_ARCH(i386)
ENTRY(_start)

SECTIONS
{
/* Link the kernel at this address: "." means the current address */
        /* Must be equal to KERNLINK */
. = 0x80100000;

.text : AT(0x100000) {
*(.text .stub .text.* .gnu.linkonce.t.*)
}

PROVIDE(etext = .);/* Define the 'etext' symbol to this value */

.rodata : {
*(.rodata .rodata.* .gnu.linkonce.r.*)
}

/* Include debugging information in kernel memory */
.stab : {
PROVIDE(__STAB_BEGIN__ = .);
*(.stab);
PROVIDE(__STAB_END__ = .);
}

.stabstr : {
PROVIDE(__STABSTR_BEGIN__ = .);
*(.stabstr);
PROVIDE(__STABSTR_END__ = .);
}

/* Adjust the address for the data segment to the next page */
. = ALIGN(0x1000);

/* Conventionally, Unix linkers provide pseudo-symbols
 * etext, edata, and end, at the end of the text, data, and bss.
 * For the kernel mapping, we need the address at the beginning
 * of the data section, but that's not one of the conventional
 * symbols, because the convention started before there was a
 * read-only rodata section between text and data. */
PROVIDE(data = .);

/* The data segment */
.data : {
*(.data)
}

PROVIDE(edata = .);

.bss : {
*(.bss)
}

PROVIDE(end = .);

/DISCARD/ : {
*(.eh_frame .note.GNU-stack)
}
}

Linker Script Structure

The minimum required element in a linker script is the SECTIONS element.

A MEMORY element is often defined, but it is not required.

The SECTIONS element defines sections and places them at arbitrary addresses.

Each section can be given both a virtual address (VMA), where it will run, and a load address (LMA), where it is placed when loaded; the two may differ.

Reference: Mastering GNU C | Computex Co., Ltd.

Reference: リンカスクリプトの書き方

The simplest linker script with only the SECTIONS element looks like the following example.

SECTIONS
{
  . = 0x10000;
  .text : { *(.text) }
  . = 0x8000000;
  .data : { *(.data) }
  .bss : { *(.bss) }
}

Reference: Simple Example (LD)

In the xv6 OS linker script, the following sections are defined:

.text : Where the executable code is placed. Typically read and execute permission only.
.rodata : Where read-only data is placed.
.stab : Where an array of fixed-length structures called stabs is placed.
.stabstr : Where variable-length strings referenced from stabs are placed.
.data : Where readable and writable data is placed.
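.bss : Where uninitialized static data (objects that are declared but have no initial value) is placed. The name comes from "block started by symbol"; the section takes up no space in the file and is zero-filled when loaded.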

Reference: STABS - Using Stabs in Their Own Sections

Reference: STABS: Stab Section Basics

Reference: .bss - Wikipedia

Let me now walk through the contents of the linker script in order.

Defining the Entry Point

Looking at the first lines of the linker script, three things are defined.

/* Simple linker script for the JOS kernel.
   See the GNU ld 'info' manual ("info ld") to learn the syntax. */

OUTPUT_FORMAT("elf32-i386", "elf32-i386", "elf32-i386")
OUTPUT_ARCH(i386)
ENTRY(_start)

OUTPUT_FORMAT defines the format of the output binary.

OUTPUT_ARCH specifies the architecture that the output binary targets.

ENTRY specifies the symbol name of the function that will be executed first.

Reference: Entry Point (LD)

The _start specified here is defined in entry.S as follows.

# By convention, the _start symbol specifies the ELF entry point.
# Since we haven't set up virtual memory yet, our entry point is
# the physical address of 'entry'.
.globl _start
_start = V2P_WO(entry)

# Entering xv6 on boot processor, with paging off.
.globl entry
entry:
  # Turn on page size extension for 4Mbyte pages
  movl    %cr4, %eax
  orl     $(CR4_PSE), %eax
  movl    %eax, %cr4
  # Set page directory
  movl    $(V2P_WO(entrypgdir)), %eax
  movl    %eax, %cr3
  # Turn on paging.
  movl    %cr0, %eax
  orl     $(CR0_PG|CR0_WP), %eax
  movl    %eax, %cr0

  # Set up the stack pointer.
  movl $(stack + KSTACKSIZE), %esp

  # Jump to main(), and switch to executing at
  # high addresses. The indirect call is needed because
  # the assembler produces a PC-relative instruction
  # for a direct jump.
  mov $main, %eax
  jmp *%eax

.comm stack, KSTACKSIZE

The details of entry.S will be described later.

Reference: xv6: OSはどうメモリを参照、管理するのか（前編） - yohei.codes

SECTIONS: Defining the text Section

First, let’s look at the part that defines the text section.

/* Link the kernel at this address: "." means the current address */
/* Must be equal to KERNLINK */
. = 0x80100000;

.text : AT(0x100000) {
*(.text .stub .text.* .gnu.linkonce.t.*)
}

PROVIDE(etext = .);/* Define the 'etext' symbol to this value */

The first line, . = 0x80100000;, sets the value of the special symbol "." (a single dot).

This is used as the location counter.

Sections defined afterward start at the address pointed to by the location counter.

When a section is defined, the location counter is incremented by that section’s size.

Reference: Simple Example (LD)

In xv6 OS, the initial value of the location counter is set to 0x80100000.

This means that the instruction addresses in the binary produced by the linker start from 0x80100000.

Section definitions use the following structure:

section [address] [(type)] :
  [AT(lma)]
  [ALIGN(section_align) | ALIGN_WITH_INPUT]
  [SUBALIGN(subsection_align)]
  [constraint]
  {
    output-section-command
    output-section-command
    …
  } [>region] [AT>lma_region] [:phdr :phdr …] [=fillexp] [,]

Reference: Output Section Description (LD)

AT(0x100000) defines the load address of the section as 0x100000.

Reference: Using LD, the GNU linker - Section Options

To be honest, I had no idea at first what *(.text .stub .text.* .gnu.linkonce.t.*) was doing, but it is the line that defines the contents of the output section.

There are several ways to write this, but the basic form is filename(section), where both parts may contain wildcards.

An output section may contain several such lines.

When * is used in place of a filename, as in *(...), every object file given to the linker is matched.

In other words, *(.text .stub .text.* .gnu.linkonce.t.*) tells the linker to place the contents of every input section named .text, .stub, .text.*, or .gnu.linkonce.t.* from each input object file into the .text section of the output executable.

Reference: Using LD, the GNU linker - Section Placement

Files such as entry.o and bio.o given as linker inputs are all compiled as 32-bit ELF objects, so each of them has its own ELF header and its own .text section.

That is why the definition above is needed: to merge them all into a single executable.

Once the .text section definition is complete, we need to define etext to mark the end of the section.

PROVIDE(etext = .);/* Define the 'etext' symbol to this value */

Reference: Man page of END

Here, PROVIDE is used to set etext at the current location.

PROVIDE is a directive that defines a symbol only if it is referenced somewhere but not defined by any of the input object files.

Reference: PROVIDE (LD)

SECTIONS: Defining the rodata Section

Next, the .rodata section is defined.

rodata stands for Read Only Data.

.rodata : {
*(.rodata .rodata.* .gnu.linkonce.r.*)
}

The linker definition uses the same syntax as the .text section, so I will skip the explanation.

SECTIONS: Defining the stab and stabstr Sections

Next, the debug-purpose stab sections are defined.

/* Include debugging information in kernel memory */
.stab : {
PROVIDE(__STAB_BEGIN__ = .);
*(.stab);
PROVIDE(__STAB_END__ = .);
}

.stabstr : {
PROVIDE(__STABSTR_BEGIN__ = .);
*(.stabstr);
PROVIDE(__STABSTR_END__ = .);
}

ld does not create sections whose defined contents turn out to be empty.

With the default build settings, none of the object files has a .stab section, so there was no .stab section in kernel either.

However, I confirmed that adding -gstabs to the gcc compile options defined in the Makefile causes a .stab section to be created, and that section then appears in the linked kernel as well.

SECTIONS: Defining the data Section

The .data section holds readable and writable data.

/* Adjust the address for the data segment to the next page */
. = ALIGN(0x1000);

/* Conventionally, Unix linkers provide pseudo-symbols
* etext, edata, and end, at the end of the text, data, and bss.
* For the kernel mapping, we need the address at the beginning
* of the data section, but that's not one of the conventional
* symbols, because the convention started before there was a
* read-only rodata section between text and data. */
PROVIDE(data = .);

/* The data segment */
.data : {
*(.data)
}

PROVIDE(edata = .);

First, the line . = ALIGN(0x1000); aligns the current location to a 0x1000 (4 KiB page) boundary.

Unlike . = 0x80100000; earlier, this line does not assign a fixed address to the location counter.

Instead, ALIGN(0x1000) returns the current location counter rounded up to the next multiple of 0x1000, and that value is assigned back to the location counter.

Looking at the actual generated kernel binary, we can see that the binary data was continuous up to 0x80107aa9, and then the starting address of the .data section becomes 0x80108000.

$ objdump -D kernel | grep -5 "Disassembly of section .data:"
80107aa6:67 6e                outsb  %ds:(%si),(%dx)
80107aa8:65                   gs
80107aa9:64                   fs
...

Disassembly of section .data:

80108000 <ctlmap>:
...
80108010:11 17                adc    %edx,(%edi)
80108012:05 12 14 19 15       add    $0x15191412,%eax

This is the result of aligning to the 0x1000 boundary from the point where the current location counter had been incremented to 0x80107aaa.

Reference: Using LD, the GNU linker - Arithmetic Functions

The rest of the definitions are the same as those already covered, so I will skip them.

SECTIONS: Defining the bss Section

The .bss section is defined as follows.

This is the same as what we have already seen, so I will skip the explanation.

PROVIDE(edata = .);
.bss : {
*(.bss)
}
PROVIDE(end = .);

SECTIONS: DISCARD

Sections listed under /DISCARD/ are not linked into the generated object.

/DISCARD/ : {
*(.eh_frame .note.GNU-stack)
}

.eh_frame is a section generated by gcc that stores call-frame information used for stack unwinding, for example when obtaining a backtrace.

.note.GNU-stack is an empty marker section that the GNU toolchain uses to record whether an object needs an executable stack.

Kernel Entry Point

Next, let’s look at the _start symbol that was defined as the kernel’s entry point at link time.

The entry.S file where _start is defined contains the following code.

# The xv6 kernel starts executing in this file. This file is linked with
# the kernel C code, so it can refer to kernel symbols such as main().
# The boot block (bootasm.S and bootmain.c) jumps to entry below.
        
# Multiboot header, for multiboot boot loaders like GNU Grub.
# http://www.gnu.org/software/grub/manual/multiboot/multiboot.html
#
# Using GRUB 2, you can boot xv6 from a file stored in a
# Linux file system by copying kernel or kernelmemfs to /boot
# and then adding this menu entry:
#
# menuentry "xv6" {
# insmod ext2
# set root='(hd0,msdos1)'
# set kernel='/boot/kernel'
# echo "Loading ${kernel}..."
# multiboot ${kernel} ${kernel}
# boot
# }

#include "asm.h"
#include "memlayout.h"
#include "mmu.h"
#include "param.h"

# Multiboot header.  Data to direct multiboot loader.
.p2align 2
.text
.globl multiboot_header
multiboot_header:
  #define magic 0x1badb002
  #define flags 0
  .long magic
  .long flags
  .long (-magic-flags)

# By convention, the _start symbol specifies the ELF entry point.
# Since we haven't set up virtual memory yet, our entry point is
# the physical address of 'entry'.
.globl _start
_start = V2P_WO(entry)

# Entering xv6 on boot processor, with paging off.
.globl entry
entry:
  # Turn on page size extension for 4Mbyte pages
  movl    %cr4, %eax
  orl     $(CR4_PSE), %eax
  movl    %eax, %cr4
  # Set page directory
  movl    $(V2P_WO(entrypgdir)), %eax
  movl    %eax, %cr3
  # Turn on paging.
  movl    %cr0, %eax
  orl     $(CR0_PG|CR0_WP), %eax
  movl    %eax, %cr0

  # Set up the stack pointer.
  movl $(stack + KSTACKSIZE), %esp

  # Jump to main(), and switch to executing at
  # high addresses. The indirect call is needed because
  # the assembler produces a PC-relative instruction
  # for a direct jump.
  mov $main, %eax
  jmp *%eax

.comm stack, KSTACKSIZE

Multiboot Header

Reading through entry.S from the top, we find the following code.

First, .p2align 2 on the first line advances the location counter to a 2^2 = 4-byte boundary, so the multiboot header that follows starts at a 4-byte-aligned address.

Reference: P2align (Using as)

Reference: gcc - What does .p2align do in asm code? - Stack Overflow

Immediately under the .text directive, multiboot_header is defined.

Here, the multiboot header is defined to support the multiboot specification.

# Multiboot header.  Data to direct multiboot loader.
.p2align 2
.text
.globl multiboot_header
multiboot_header:
  #define magic 0x1badb002
  #define flags 0
  .long magic
  .long flags
  .long (-magic-flags)

The multiboot specification standardizes how a bootloader loads an x86 operating system kernel.

In the previous article, I examined the xv6 OS bootloader code; if you want to boot the xv6 OS kernel with GRUB, for example, the kernel must comply with this multiboot specification.

GRUB is the standard bootloader on many Linux distributions. (Today this usually means GRUB 2.)

Reference: マルチブート仕様

Reference: GRUBでOSを起動する - OSのようなもの

Reference: Trying to Run a Simple OS Kernel with GRUB - Momoiro Technology

I am planning to actually try booting the xv6 OS with GRUB after I have finished reading through the kernel, so I will move on without tracing the code in detail.

Defining the Physical Address of the Entry Point

The next piece of code is as follows.

# By convention, the _start symbol specifies the ELF entry point.
# Since we haven't set up virtual memory yet, our entry point is
# the physical address of 'entry'.
.globl _start
_start = V2P_WO(entry)

The .globl directive makes a symbol visible to the linker, so that it can be referenced from other object files and from the linker script.

_start is the symbol named as the entry point in the linker script, and this declaration makes it visible outside entry.S.

Reference: .globl - Google Search

Next, let’s look at the line _start = V2P_WO(entry).

V2P_WO is the following macro defined in memlayout.h.

// Memory layout

#define EXTMEM  0x100000            // Start of extended memory
#define PHYSTOP 0xE000000           // Top physical memory
#define DEVSPACE 0xFE000000         // Other devices are at high addresses

// Key addresses for address space layout (see kmap in vm.c for layout)
#define KERNBASE 0x80000000         // First kernel virtual address
#define KERNLINK (KERNBASE+EXTMEM)  // Address where kernel is linked

#define V2P(a) (((uint) (a)) - KERNBASE)
#define P2V(a) ((void *)(((char *) (a)) + KERNBASE))

#define V2P_WO(x) ((x) - KERNBASE)    // same as V2P, but without casts
#define P2V_WO(x) ((x) + KERNBASE)    // same as P2V, but without casts

This is simply a macro that takes an address as an argument and subtracts KERNBASE, which is set to 0x80000000.

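As we saw, the linker script links the kernel’s .text section at the base address 0x80100000 (KERNLINK).

xv6 places the kernel in the upper part of the virtual address space, starting at KERNBASE, and leaves the addresses below it to user processes; once paging is enabled, the CPU translates these high virtual addresses to the low physical addresses where the kernel was actually loaded.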

Reference: xv6: OSはどうメモリを参照、管理するのか（前編） - yohei.codes

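However, the bootloader jumps to _start before paging has been enabled, so the entry point must be a physical address. _start = V2P_WO(entry) is a symbol assignment resolved at assembly and link time (nothing is executed at run time): it subtracts KERNBASE (0x80000000) from the link address of entry, which gives the physical address at which entry was loaded.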

Loading the Kernel Entry Point

Let’s continue tracing the rest of entry.S.

First, .globl entry makes entry a symbol accessible from outside.

What entry does can be summarized simply: it turns on paging so that the kernel can run at the high virtual addresses it was linked at.

When execution reaches the entry label, the paging mechanism has not yet been enabled, so the first thing done here is to enable it.

# Entering xv6 on boot processor, with paging off.
.globl entry
entry:
  # Turn on page size extension for 4Mbyte pages
  movl    %cr4, %eax
  orl     $(CR4_PSE), %eax
  movl    %eax, %cr4
  # Set page directory
  movl    $(V2P_WO(entrypgdir)), %eax
  movl    %eax, %cr3
  # Turn on paging.
  movl    %cr0, %eax
  orl     $(CR0_PG|CR0_WP), %eax
  movl    %eax, %cr0

  # Set up the stack pointer.
  movl $(stack + KSTACKSIZE), %esp

  # Jump to main(), and switch to executing at
  # high addresses. The indirect call is needed because
  # the assembler produces a PC-relative instruction
  # for a direct jump.
  mov $main, %eax
  jmp *%eax

.comm stack, KSTACKSIZE

What Is Paging?

Before tracing the code, let me briefly summarize what paging is.

Paging is a method of managing memory by dividing it into fixed-size chunks called pages.

This lets scattered physical memory regions be presented to a program as one contiguous linear address space. It also allows auxiliary storage such as an SSD to back pages that do not fit in RAM, making it possible to handle more memory than the physical RAM capacity.

In paging, writing a page from main memory to auxiliary storage is called a “page-out” (or “swap-out”), while reading a page back from auxiliary storage into main memory is called a “page-in” (or “swap-in”).

Through the paging mechanism, memory regions that are not currently in use can be saved to auxiliary storage via page-out.

The next time such a region is accessed, the CPU raises an exception called a “page fault” because the address is not present in physical memory, and the OS’s page-fault handler performs the page-in to bring the page back into physical memory.

Reference: x86_64アーキテクチャ - ばびろん’s すたっくメモリアクセス

Reference: x86_64アーキテクチャ - ばびろん’s すたっくメモリアクセス(続き)

Reference: What Is Paging? - e-Words IT Dictionary

To enable the paging mechanism on an x86 CPU, the PG flag of CR0 (Control Register 0) must be set to 1.

Let’s look at the part that actually enables paging.

In the previous article, I set the PE flag of CR0 (Control Register 0) when transitioning to protected mode — the approach here is almost identical.

entry:
  # Turn on page size extension for 4Mbyte pages
  movl    %cr4, %eax
  orl     $(CR4_PSE), %eax
  movl    %eax, %cr4
  # Set page directory
  movl    $(V2P_WO(entrypgdir)), %eax
  movl    %eax, %cr3
  # Turn on paging.
  movl    %cr0, %eax
  orl     $(CR0_PG|CR0_WP), %eax
  movl    %eax, %cr0

The processing after # Turn on paging. at the end is where the PG flag of CR0 (Control Register 0) is set.

The constants used for the flag operations are each defined as follows.

// Control Register flags
#define CR0_PE          0x00000001      // Protection Enable
#define CR0_WP          0x00010000      // Write Protect
#define CR0_PG          0x80000000      // Paging

#define CR4_PSE         0x00000010      // Page size extension

From this we can see that not only the PG flag but also the WP flag is being set.

When the WP flag is set, the CPU can prevent ring-0 supervisor-level procedures from writing to read-only pages.

This makes it easier to implement the copy-on-write mechanism when creating new processes in the OS.

I will write about this in a future article.

At first I wondered why it is set explicitly here, but the WP flag is cleared when an x86 CPU comes out of reset, so the kernel has to turn it on itself.

Reference: Control register - Wikipedia

Reference: assembly - whats the purpose of x86 cr0 WP bit? - Stack Overflow

Next, let’s look at the following code, which comes slightly before the CR0 setup.

# Turn on page size extension for 4Mbyte pages
movl    %cr4, %eax
orl     $(CR4_PSE), %eax
movl    %eax, %cr4

Here, the PSE flag of CR4 (Control Register 4) is being set.

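This flag controls whether 4 MiB pages may be used.

When the PSE flag of CR4 is not set (the default), every page is 4 KiB.

When the PSE flag is set, a page directory entry whose PS bit is set maps a 4 MiB page directly, while entries without the PS bit still point to page tables of 4 KiB pages.

Reference: Control register - Wikipedia

I will cover the detailed background of why two page sizes exist in a separate article if the opportunity arises.

We now know that xv6 OS uses 4 MiB pages at this stage of boot, through entrypgdir. (PGSIZE is 4096, and the kernel switches to 4 KiB pages once main() sets up its full page table.)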

Finally, let’s look at the following:

# Set page directory
movl    $(V2P_WO(entrypgdir)), %eax
movl    %eax, %cr3

Paging is only turned on by the instructions that follow, so it is not yet active at this point.

Therefore, the V2P_WO macro is used to convert the address of entrypgdir to a physical address before writing it to CR3.

CR3 holds the physical address of the page directory; once paging is active, the x86 CPU starts from it to walk the page directory and page tables and convert linear addresses to physical addresses.

entrypgdir is an array of page directory entries (pde_t) defined in main.c.

// main.c
pde_t entrypgdir[];  // For entry.S

// The boot page table used in entry.S and entryother.S.
// Page directories (and page tables) must start on page boundaries,
// hence the __aligned__ attribute.
// PTE_PS in a page directory entry enables 4Mbyte pages.

__attribute__((__aligned__(PGSIZE)))
pde_t entrypgdir[NPDENTRIES] = {
  // Map VA's [0, 4MB) to PA's [0, 4MB)
  [0] = (0) | PTE_P | PTE_W | PTE_PS,
   
  // Map VA's [KERNBASE, KERNBASE+4MB) to PA's [0, 4MB)
  [KERNBASE>>PDXSHIFT] = (0) | PTE_P | PTE_W | PTE_PS,
};

The array size is NPDENTRIES, which is defined in mmu.h as 1024:

// Page directory and page table constants.
#define NPDENTRIES      1024    // # directory entries per page directory
#define NPTENTRIES      1024    // # PTEs per page table
#define PGSIZE          4096    // bytes mapped by a page

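entrypgdir has 1024 entries, but only two of them are initialized; the rest are zero, which marks them as not present.

Honestly, I only have a rough sense of what is happening here, but it simply initializes those two page directory entries.

First, the value (0) | PTE_P | PTE_W | PTE_PS, which is common to both entries, is built from the following:

0 : the physical base address of the page (physical address 0)
PTE_P : the present bit
PTE_W : the writable bit
PTE_PS : the page size bit, marking this entry as a 4 MiB page

The first element, [0] = (0) | PTE_P | PTE_W | PTE_PS,, initializes the 0th page directory entry, mapping virtual addresses [0, 4 MiB) to physical addresses [0, 4 MiB).

The next element initializes page directory entry KERNBASE>>PDXSHIFT = 0x80000000 >> 22 = 512, mapping virtual addresses [KERNBASE, KERNBASE + 4 MiB) to the same physical range.

The low mapping keeps entry running right after paging is turned on, since it is still executing at low addresses; the high mapping is what the kernel uses once execution jumps to main at its link address.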

Reference: xv6: OSはどうメモリを参照、管理するのか（前編） - yohei.codes

Reference: what does this code mean in xv6 entrypgdir? - Stack Overflow

Setting the Stack Pointer

Finally, the stack pointer is set up before transferring to the main function.

# Set up the stack pointer.
movl $(stack + KSTACKSIZE), %esp

KSTACKSIZE is defined as 4096 in param.h.

#define NPROC        64  // maximum number of processes
#define KSTACKSIZE 4096  // size of per-process kernel stack
#define NCPU          8  // maximum number of CPUs
#define NOFILE       16  // open files per process
#define NFILE       100  // open files per system
#define NINODE       50  // maximum number of active i-nodes
#define NDEV         10  // maximum major device number
#define ROOTDEV       1  // device number of file system root disk
#define MAXARG       32  // max exec arguments
#define MAXOPBLOCKS  10  // max # of blocks any FS op writes
#define LOGSIZE      (MAXOPBLOCKS*3)  // max data blocks in on-disk log
#define NBUF         (MAXOPBLOCKS*3)  // size of disk block cache
#define FSSIZE       1000  // size of file system in blocks

Setting the stack pointer is necessary before transferring to C code, but this part confused me at first.

The symbol stack is not defined in main.c. It is declared at the end of entry.S by .comm stack, KSTACKSIZE, which reserves KSTACKSIZE bytes of uninitialized storage as a common symbol.

Because the stack grows downward on x86, %esp is set to stack + KSTACKSIZE, the top of that region.

Reference: c - assembly - mov unitialized variable? - Stack Overflow

Quite tricky…

Transferring to the main Function

The series of processing that began during bootstrap finally comes to an end here, and execution transfers to the main() function in main.c, which is the kernel body.

# Jump to main(), and switch to executing at
# high addresses. The indirect call is needed because
# the assembler produces a PC-relative instruction
# for a direct jump.
mov $main, %eax
jmp *%eax

.comm stack, KSTACKSIZE

This has gotten quite long, so I will continue in the next article.

Summary

In this article, I traced through the kernel program build process, the linker script, and the flow of execution at the entry point.

Next time, we should finally be able to trace the behavior of the kernel body itself.

References

Published Jan 16, 2022

Aspiring Reverse Engineer and CTF Player (Team: 0nePadding). Passionate about WinDbg and Anti-Virus internals. OSCP / CISSP. Working at Microsoft Japan, but all views expressed are my own.

かしわば (@kash1064) on Twitter
