XEN

For those of you who are linking to this page from outside the main OZONE site, it can be seen here. OZONE is an OS I wrote myself that combines my favourite parts of VMS, Linux and Windows NT. I am in the process of porting OZONE to work under XEN.

Warning!
This is simply a description of my porting experience with some hopefully helpful hints.
Don't take any of it as definitive.
If you have any corrections or things to ask about or things to add, please let me know!
Also note that the stuff below pertains to XEN V1.2!

XEN is yet another x86 CPU / PC system emulator program (yaxcpsep). Well, not quite. XEN provides virtual environments of the x86 nature, but they are not PCs. I say that because the virtual machine it gives an OS doesn't have IO ports and won't allow access to ring 0, for example. There are other restrictions.

These restrictions are quite workable, however. Linux has been ported and works nicely. OZONE has a compile option to move most of the kernel into ring 1 anyway, so it shouldn't be too much work to get it to run with XEN.

To perform privileged functions such as altering pagetables, an OS must make what Xen calls 'HYPERVISOR' calls. These are int $0x82 traps that the XEN monitor processes. There are calls for accessing pagetable entries, stuff like that. Doing it this way allows the virtualization to happen without any instruction translation or scanning, so the virtual machines execute mostly at full speed. The downside is that the OS must be ported to run on what amounts to a new architecture, including device drivers for accessing the virtual devices. Yes, a ring 1 program can execute SGDT, but since the OS has been ported, it won't, as the value is meaningless. If an application does an SGDT, the result is just as meaningless as it was on the original OS.
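
For example, here is a minimal sketch of what one of these calls looks like from C, modeled on the macros in xenolinux's include/asm-xeno/hypervisor.h (the call number and register usage match the hello world example later on this page):

      static inline int hypervisor_console_write (const char *buf, int len)
      {
        int ret;

        __asm__ __volatile__ ("int $0x82"         /* trap to the Xen monitor */
                              : "=a" (ret)        /* status comes back in %eax */
                              : "0" (2),          /* __HYPERVISOR_console_write */
                                "b" (buf),        /* arg1 = buffer virtual address */
                                "c" (len)         /* arg2 = buffer length */
                              : "memory");
        return (ret);
      }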


The first thing to do is get XEN running on a PC. The PC I used is an ASUS P2B-L with:

I first tried the thing with 64MB RAM. It puqued violently (AIEE!!)

So anyway, here's what I did:

  1. Install RedHat 9 on the idiot PC. I partitioned /dev/hda like this:
    • hda1 100Meg /boot partition
    • hda2 350Meg swap partition
    • hda3 1.2Gig standard linux root partition
    • hda4 to be used later
    I also told it to ignore eth0 and set eth1 with IP address 192.168.0.150, and with appropriate gateway and nameservers. Be sure to include either inbound ssh or telnet capability in your installation. I omitted all windowing and GUI stuff.
  2. Initialize a filesystem on /dev/hda4, then copy the whole /dev/hda3 partition to it.
  3. Copy the xen.gz and xenolinux.gz files from the CDROM to the /dev/hda1 boot directory
  4. Create an empty file called this_is_really_hda3 in /dev/hda3 root directory
  5. Create an empty file called this_is_really_hda4 in /dev/hda4 root directory
  6. Edit the /boot/grub/grub.conf file to add an entry to boot using hda4 as the root and the xen stuff as the kernel. Since I only have 320MB, I made the dom0 memory only 64Meg. I also had to tell it to use eth1 for ethernet, as my eth0 is broken.
  7. Reboot, selecting the Xen bootline from the Grub screen. It should just work and you can log in.
  8. Copy the xeno-1.2.bk directory tree from the CDROM to /usr/local/src or something and build and install the tools
  9. Add the following lines to the end of the hda3's (ie, Dom0's) etc/rc.d/rc.local file:
    	#
    	# Start XEN virtual ethernet
    	#
    	echo "rc.local: starting XEN virtual ethernet eth0"
    	/usr/bin/xc_dom_control.py vif_ipadd 0 0 192.168.0.151
    	/sbin/ifconfig eth0 192.168.0.151
    	/sbin/route add -net 0.0.0.0 netmask 0.0.0.0 gw 192.168.0.1
    	echo "rc.local: virtual eth0 startup complete"
    
    Substitute in your own IP addresses for the 192.168... numbers.
  10. Reboot or execute the above commands manually so you can access the ethernet.


Starting a Virtual Machine

The above gets the stuff going with the control virtual machine running (what they call Domain 0). Now the idea is to get another virtual machine going. Here's what I did:

  1. Create partition /dev/hdb1 (about 500Meg) and /dev/hdb2 (about 100Meg). The /dev/hdb1 partition is for the VM's root partition. The /dev/hdb2 partition is for the VM's swap partition.
  2. Copy all directories, except /usr, from physical /dev/hda3 to /dev/hdb1. Also do a mkswap /dev/hdb2.
  3. Create a directory called /hda3 on the /dev/hdb1 drive.
  4. Make a softlink on /dev/hdb1 called /usr that points to /hda3/usr.
  5. Edit the /dev/hdb1's /etc/fstab file to mount /dev/hda3 at mountpoint /hda3 with read-only access.
  6. It'd also be a good idea to remove the LABEL=/ stuff and hardcode in the /dev/hdb1 device name.
  7. Also change the swap partition to /dev/hdb2.
  8. Remove hdb1's /etc/rc.d/init.d/kudzu file or it causes grief on startup.
  9. Remove the respawn:/sbin/mingetty entries from hdb1's /etc/inittab file as there are no tty's in the virtual machine
  10. Edit hdb1's /etc/sysconfig/network and /etc/hosts files to contain the VM's hostname
  11. Put these lines at the end of its /etc/rc.d/rc.local file to start the ethernet:
    	echo "rc.local: start ethernet..."
    	/sbin/ifconfig eth0 192.168.0.152
    	/sbin/route add -net 0.0.0.0 netmask 0.0.0.0 gw 192.168.0.1
    	echo "rc.local: ifconfig..."
    	/sbin/ifconfig
    	echo "rc.local: done."
    
  12. Unmount the /dev/hdb1 partition
  13. Create an /etc/xc/vm1 file from the /etc/xc/defaults file. Here are the changes I made:
    • image = "/boot/xenolinux.gz"
    • vfr_ipaddr = ["192.168.0.152"]
    • vbd_list = [ ('phy:hdb1','hdb1','w'), ('phy:hdb2','hdb2','w'), ('phy:hda3','hda3','r') ]
    • cmdline_root = "root=/dev/hdb1 ro"
    • cmdline_extra = "4 VMID=%d" % vmid
It should just work now. Do this to get it started:
  1. xen_nat_enable
  2. xen_read_console &
  3. xc_dom_create.py -D vmid=1 -f /etc/xc/vm1
Some 'normal' errors will show up on startup; don't panic over them. You should be able to ssh or telnet into the VM using its IP address.


Porting an OS -- Part I

We figure once we know how to get a VM to start, we can port an OS. So I started with a hello world program (helloworld.s):

        .text

        .globl  _start
_start:
        cld

        # from include/asm-xeno/hypervisor.h

        movl    $2,%eax                 # __HYPERVISOR_console_write (include/asm-xeno/hypervisor-ifs/hypervisor-if.h)
        movl    $hello_message,%ebx     #  arg1 = buffer virtual address
        movl    $hello_message_len,%ecx #  arg2 = buffer length
        int     $0x82

        # from include/asm-xeno/hypervisor.h

        movl    $8,%eax                 # __HYPERVISOR_sched_op
        movl    $1,%ebx                 # SCHEDOP_exit
        int     $0x82

hang:   jmp     hang                    # shouldn't get here

hello_message:  .ascii  "This is the hello world program\n"
        hello_message_len = . - hello_message
The Xen loader also wants a 12-byte header on the image file. So I wrote a little assembler module (xenoguestheader.s) to handle that:
        .text
        .globl  _start
_start:
        .ascii  "XenoGues"      # read_kernel_header (tools/xc/lib/xc_linux_build.c)
        .long   _start          # - the kernel's load address

The final image has to consist of the 12 bytes of object code from xenoguestheader.s followed by the object code from helloworld.s. Here is my makefile to accomplish that:

helloworld.gz: helloworld.s xenoguestheader.raw
        as -o helloworld.o -a=helloworld.l helloworld.s
        ld -Ttext 0x100000 -o helloworld.elf helloworld.o
        objcopy -O binary -S -g helloworld.elf helloworld.raw
        cat xenoguestheader.raw helloworld.raw | gzip > helloworld.gz

xenoguestheader.raw: xenoguestheader.s
        as -o xenoguestheader.o xenoguestheader.s
        ld -Ttext 0x100000 -o xenoguestheader xenoguestheader.o
        objcopy -O binary -S -g xenoguestheader xenoguestheader.raw
Note that both helloworld and xenoguestheader are linked at 0x100000. I first tried putting the 12-byte header at the beginning of helloworld.s and not having a separate xenoguestheader module. The result was that it printed hello world program instead of This is the hello world program. Notice that there were 12 missing characters? The loader strips the 12-byte header, so everything loaded 12 bytes below where it was linked, and $hello_message ended up pointing 12 bytes into the string. Crikey! Had I simply programmed Hello world, nothing would have printed and I'd still be trying to figure it out!

All I had to do in the /etc/xc/vm2 file was point the image at my hello world (image = "/test/helloworld.gz").

Here is the result of an xc_dom_create.py -D vmid=2 -f /etc/xc/vm2 command:
[root@xenophilia xc]# xc_dom_create.py -D vmid=2 -f /etc/xc/vm2
Parsing config file '/etc/xc/vm2'
VM image           : "/test/helloworld.gz"
VM ramdisk         : ""
VM memory (MB)     : "64"
VM IP address(es)  : "192.168.0.153"
VM block device(s) : ""
VM cmdline         : "ip=192.168.0.153:169.254.1.0:192.168.0.1:255.255.255.0::eth0:off root=/dev/hdb3 ro 4 VMID=2"
VM started in domain 52
[52] This is the hello world program
[root@xenophilia xc]#
So now it is a simple matter of programming to port OZONE.


Porting an OS -- Part II

The next thing to do is to catalog the calls that the hypervisor makes available to guest OSes. We can start by looking in the include/asm-xeno/hypervisor.h file that's in the xenolinux directory and scanning the linux code to see where the calls are used. Another place to look is the hypervisor source code itself; for example, look for do_stack_switch to find the code behind HYPERVISOR_stack_switch.

The environment seems to be similar to what the Alpha's console sets up, except that all physical memory is mapped to virtual addresses.

There are three types of addresses used: virtual (what your code sees through the pagetables), pseudo-physical (per-domain physical page numbers starting at zero; see Part III), and machine (real-world physical addresses).

Also, in the doc below, page number refers to the address shifted down 12 bits, e.g., page 2 refers to the page containing addresses 0x2000..0x2FFF, be it virtual, pseudo-physical or machine.

  Initialization:

    (start_info_t *) : startup information, pointed to by %esi on initial jump to kernel
                       see include/hypervisor-ifs/hypervisor-if.h
                       number of pages, shared info struct pointer, page directory VA, 
                       where loaded modules are, command line

      Here's what I get for my "hello world" boot (after inserting printk's):

        si -> nr_pages    16384     <- 16K pages = 64M bytes
        si -> shared_info 0x294000  <- real-world physical address of shared_info struct
        si -> dom_id      3
        si -> flags       0
        si -> pt_base     0x40FF000 <- virtual address of my page directory page
        si -> mod_start   0
        si -> mod_len     0
        si -> cmd_line    ip=192.168.0.153:169.254.1.0:192.168.0.1:255.255.255.0::eth0:off root=/dev/hdb3 ro 4 VMID=2

      I don't know what Xen guarantees about its position in memory, but to 
      be safe, I copy it to a page that is part of my kernel image.  That 
      way I know I can use any pages after my kernel for whatever I want.

    (shared_info_t *) HYPERVISOR_shared_info : a shared communication struct
                                               its machine address given by start_info->shared_info
                                               map to VA space with HYPERVISOR_update_va_mapping
                                               this page is not mapped as part of your initial VM's physical pages

      -> events, event_mask : bitmask of events to process via hypervisor_callback
      -> various : contains current date/time information (see arch/xeno/kernel/time.c)
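
      Here's a minimal sketch of the mapping call, assuming a page-aligned page
      (shared_info_page) reserved in the kernel image and present+writable pte
      bits of 0x03 (see HYPERVISOR_update_va_mapping below for the arguments):

        extern shared_info_t shared_info_page;   /* page-aligned, reserved in image */

        unsigned long va = (unsigned long) &shared_info_page;
        HYPERVISOR_update_va_mapping (va >> 12,                  /* virtual page number  */
                                      si -> shared_info | 0x03,  /* machine addr + P/W   */
                                      UVMF_INVLPG);              /* flush that TLB entry */
        HYPERVISOR_shared_info = (shared_info_t *) va;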

    HYPERVISOR_set_callbacks (codesegment, (unsigned long)hypervisor_callback, 
                              codesegment, (unsigned long)failsafe_callback)

      hypervisor_callback = callback to process async events
                            bitmask of events available in HYPERVISOR_shared_info -> events (atomic access)
                            return via iret (see arch/xeno/kernel/entry.S, arch/xeno/kernel/hypervisor.c)

      failsafe_callback = a pseudo-pagefault happened while serving an int $0x82 request
                          (see arch/xeno/kernel/entry.S, arch/xeno/kernel/traps.c)

    HYPERVISOR_set_trap_table (trap_table) : set callback for traps, faults (accvio, divbyzero, int instrs, etc)
                                             (see arch/xeno/kernel/traps.c)

    HYPERVISOR_stop (virt_to_machine (suspend_record) >> PAGE_SHIFT);

  Print a message on console:

    HYPERVISOR_console_write (buf, len)

      buf = virtual address of message string
      len = length in bytes of message string

  Let someone else do something (HLT replacement):

    HYPERVISOR_yield ()

  Terminate (last step of guest OS shutdown):

    HYPERVISOR_exit ()

  Access debug register:

    HYPERVISOR_set_debugreg (registernumber, value)
    HYPERVISOR_get_debugreg (registernumber)

  Thread (stack) switching:

    HYPERVISOR_stack_switch (new_stack_segment, new_stack_pointer)
      Set ring 1 stack pointer in TSS
        arg 1 : stack segment
        arg 2 : stack pointer

    HYPERVISOR_fpu_taskswitch
      Sets CR0's TS bit; the usual exception through vector 7 then occurs when the FPU is accessed
        *but* Xen clears the TS bit for you before calling your exception handler

  Process (pagetable) switching:

    (use MMUEXT_NEW_BASEPTR below to load CR3)

  Pagetable calls:

    HYPERVISOR_mmu_update (ureqs, count);

      Perform array of mmu updates

        arg1 : array of updates
          ureqs[i].ptr & 3 = MMU_NORMAL_PT_UPDATE - updates an arbitrary pagetable entry
                                                    V1.2: .ptr = top bits give virtual address of pagetable entry to update
                                                    V1.3: .ptr = top bits give machine address of pagetable entry to update
                                                    .val = pagetable entry value to set it to
                              MMU_MACHPHYS_UPDATE - updates READONLY_MPT_VIRT_START table entry
                                                    .ptr = top bits give machine address
                                                    .val = pseudo-physical page number
                             MMU_EXTENDED_COMMAND - subcommand in low bits of .val:
                                MMUEXT_PIN_L1_TABLE : validate L1 pagetable page
                                                      .ptr top bits = page's machine address
                                MMUEXT_PIN_L2_TABLE : validate L2 pagetable (pagedirectory) page
                                                      .ptr top bits = page's machine address
                                 MMUEXT_UNPIN_TABLE : unpin L1 or L2 pt page
                                                      .ptr top bits = page's machine address
                                 MMUEXT_NEW_BASEPTR : set up new pagetable (loads CR3)
                                                      .ptr top bits = pagedirectory page's machine address
                                   MMUEXT_TLB_FLUSH : flushes all TLB entries (reloads CR3)
                                                      .ptr top bits = 0
                                      MMUEXT_INVLPG : flush one TLB entry
                                                      .ptr top bits = 0
                                                      .val top bits = virtual address to invalidate
        arg2 : number of updates
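
      For example, here is a minimal sketch of pinning a new pagedirectory page and
      switching to it in one call, assuming pd_ma holds the page's machine address
      (from your own phys-to-machine bookkeeping) and the page is already read-only:

        mmu_update_t ureqs[2];
        int rc;

        ureqs[0].ptr = pd_ma | MMU_EXTENDED_COMMAND;  /* command code in low bits of .ptr */
        ureqs[0].val = MMUEXT_PIN_L2_TABLE;           /* validate it once as an L2 page   */
        ureqs[1].ptr = pd_ma | MMU_EXTENDED_COMMAND;
        ureqs[1].val = MMUEXT_NEW_BASEPTR;            /* then load it into CR3            */
        rc = HYPERVISOR_mmu_update (ureqs, 2);        /* rc == 0 on success               */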

    HYPERVISOR_update_va_mapping (virtualpagenumber, entry, flags);

      Write a single pagetable entry (for example, when servicing pagefaults)

        arg1 : virtual page number that PTE maps, ie, the VA>>12 that faulted
        arg2 : contents to write to pagetable entry, using machine address
               no translation is performed on arg2 before writing it to the pagetable
               but it is checked to be sure you are mapping one of your pages
        arg3 : UVMF_INVLPG    : invalidate that page's TLB entry
               UVMF_FLUSH_TLB : reloads CR3, flushing all TLB entries

  Disk IO:

    HYPERVISOR_block_io_op(&op);

  Network IO:

    HYPERVISOR_net_io_op(&netop);
    HYPERVISOR_network_op(&op);

  Memory Layout when control passed to Guest OS:

    HYPERVISOR_VIRT_START = 0xFC000000 // Virtual addresses beyond this are not 
                                          modifiable by guest OSes

    Within that space, there is a 4MB table of longs at READONLY_MPT_VIRT_START that 
    maps a machine page number to a pseudo-physical page number

    All physical memory requested at startup is mapped, virtually contiguous, starting 
    at the kernel's load address.  So if you asked for 64M, and your kernel loads at 
    0xC0000000, Xen will start you with memory mapped at 0xC0000000..0xC3FFFFFF.

    The kernel is loaded at the low end of that memory.  Xen puts a pagedirectory page 
    at the high end and the pagetable pages just below that.

    When I start my hello world program, the pagetables are set up like this:

      pde[  0] = 05018067                     <- there are enough pagedirectory entries to cover the whole 64Meg
      pde[  1] = 094F7067
      pde[  2] = 0D5AE067
      pde[  3] = 0D5AD067
      pde[  4] = 0D5AC067
      pde[  5] = 0D5AB067
      pde[  6] = 0D5AA067
      pde[  7] = 0D5A9067
      pde[  8] = 0D5A8067
      pde[  9] = 0D5A7067
      pde[  A] = 0D5A6067
      pde[  B] = 0D5A5067
      pde[  C] = 0D5A4067
      pde[  D] = 0D5A3067
      pde[  E] = 0D5A2067
      pde[  F] = 0D5A1067
      pde[ 10] = 0D5A0067

                                              <- the first 1Meg is unmapped
      pte[         100] = 015A1023            <- this is where my 'hello world' program is loaded
      pte[         101] = 095B3023
      pte[         102] = 095B4063
      pte[  103..  105] = 095B5023..095B7023
      pte[         106] = 05018025            <- this is the pte I use to look at the pagetable pages
      pte[  107..  3FF] = 095B9023..098B1023  \
      pte[  400..  7FF] = 098B2023..09CB1023   |
      pte[  800..  BFF] = 09CB2023..0A0B1023   |
      pte[  C00..  FFF] = 0A0B2023..0A4B1023   |
      pte[ 1000.. 13FF] = 0A4B2023..0A8B1023   |
      pte[ 1400.. 17FF] = 0A8B2023..0ACB1023   |
      pte[ 1800.. 1BFF] = 0ACB2023..0B0B1023   |
      pte[ 1C00.. 1FFF] = 0B0B2023..0B4B1023   |
      pte[ 2000.. 23FF] = 0B4B2023..0B8B1023   |  Here is the bulk of my usable pages
      pte[ 2400.. 27FF] = 0B8B2023..0BCB1023   |
      pte[ 2800.. 2BFF] = 0BCB2023..0C0B1023   |
      pte[ 2C00.. 2FFF] = 0C0B2023..0C4B1023   |
      pte[ 3000.. 33FF] = 0C4B2023..0C8B1023   |
      pte[ 3400.. 37FF] = 0C8B2023..0CCB1023   |
      pte[ 3800.. 3BFF] = 0CCB2023..0D0B1023   |
      pte[ 3C00.. 3FFF] = 0D0B2023..0D4B1023   |
      pte[ 4000.. 40ED] = 0D4B2023..0D59F023  /
      pte[ 40EE.. 40FC] = 0D5A0021..0D5AE021  <- here are pagetable pages for pde[2..100] = VA 800000..403FFFFF
      pte[        40FD] = 094F7021            <- here is the pagetable page for pde[1] = VA 400000..7FFFFF
      pte[        40FE] = 05018021            <- here is the pagetable page for pde[0] = VA 000000..3FFFFF
      pte[        40FF] = 03F8A021            <- this is the pagedirectory page pointed to by si -> pt_base
                                                    not on a 4Meg boundary and not self-referencing

        ** And that covers the 64Meg I asked for in my config file **

    There are also entries starting at pde[3F0] (VA FC000000) and up mapping Hypervisor internal data.

    I experimented by moving hello world to link and load at VA 0x1000.  Xen set up 
    everything starting at 0x1000 and apparently hides the 'hole.'  Likewise, when I link at 
    base 0x400000, it leaves the first 4M unmapped and puts everything after that.


Porting an OS -- Part III

This shall be my pseudo-FAQ section. After all, it's like pseudo-physical pages, and how frequently is this stuff asked, really?

How are pseudo-physical page numbers assigned? Xen assigns each domain a range of pseudo-physical page numbers starting at zero and running up to the requested memory size. Then Xen allocates real physical pages for them all and maps them into your virtual machine, virtually contiguous, starting at your kernel's load address and going up until it runs out.

How do I translate a pseudo-physical page number to its real-world physical (machine) page number? Look it up in the corresponding initial pagetable entry passed to you on initialization. Since Xen maps every page given to you, the initial pagetables serve as a list of the machine page numbers assigned to your virtual machine.

How do I translate a machine page number to its pseudo-physical page number? Xen provides a read-only table (starting at virtual address 0xFC000000=READONLY_MPT_VIRT_START) that maps each machine page number to the corresponding pseudo-physical page number. It has one entry per real machine physical memory page. Theoretically, you could scan the table and find out translations for other VM's machine pages, but it wouldn't be of any use.
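
A minimal sketch of both translations just described, assuming p2m[] is an array you fill in once at boot by walking the initial pagetable entries in order (the names p2m and MAX_PAGES are mine):

      #define MACHPHYS_TABLE ((unsigned long *) 0xFC000000)  /* READONLY_MPT_VIRT_START */

      unsigned long p2m[MAX_PAGES];  /* machine page number for each pseudo-physical page */

      unsigned long pseudo_to_machine (unsigned long ppn) { return (p2m[ppn]); }
      unsigned long machine_to_pseudo (unsigned long mpn) { return (MACHPHYS_TABLE[mpn]); }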

Should my pagetable entries contain pseudo-physical or real-world machine page numbers? They should contain real-world machine page numbers. Xen performs no translation on the entries you write, it just validates that you own the page being pointed to. This is consistent with you being able to read the pagetable entries directly.

What is this pagetable pinning about? It is used to tell Xen which of your pages are being used for pagetable pages. Apparently you can port an OS without using it. However, each time you tell Xen to load CR3 with a new set of tables, it would have to verify them all to make sure they point only to your pages and no one else's. By pinning them, you are telling Xen that these are pagetable pages and it should verify them this once and assume they will not be changed except via Xen calls. Of course, as with all pages being used for pagetables, you must mark them read-only first!

Do I have to provide a GDT? You can just use the segment registers provided by Xen if you want. Xen provides you with ring 1 and ring 3 code and data segments. They have a base of zero and a limit of FC3FFFFF (the area from FC000000..FC3FFFFF is marked read-only). Thus, if you use the flat memory model, you don't have to have any segment register programming in your OS at all, except a GP fault handler that puts the registers right should some nitwit application mess wit' dem, and maybe some code will test CS or SS for ring 1 vs ring 3.

How are exceptions reported? You tell Xen what exceptions you want to handle by calling HYPERVISOR_set_trap_table as part of your initialization code. You give it a list of (vector,dpl,cs,offset) quadruples describing each vector you want to handle. For my OS port, the cs parameter is always FLAT_RING1_CS. dpl is the privilege level you want to give access to this vector, 0=CPU exceptions only; 1=your ring 1 code; 3=usermode code. If you add 4, it will also disable event delivery (but it doesn't save the prior state). offset is the address of your servicing routine. Exception vectors, especially those with error codes, should be declared with dpl zero to prevent usermode code from doing an int instruction to call your routine, as there would be no error code(s) pushed on the stack.

The servicing routine is entered just like the bare CPU chip would do. If the exception is defined with an error code, it will be pushed on the stack just like the CPU chip does it. There are two exceptions:

In any case, you must pop any error code(s) from the stack before doing an iret.
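
Here is a minimal sketch of setting up such a table. I am assuming trap_info_t's fields follow the (vector,dpl,cs,offset) order given above and that the table is zero-terminated, as in xenolinux's traps.c; the handler names are mine:

      extern void pagefault_handler (void);   /* entered just like on the bare CPU */
      extern void syscall_handler (void);

      static trap_info_t trap_table[] = {
        { 14,   0, FLAT_RING1_CS, (unsigned long) pagefault_handler },  /* dpl 0: CPU only      */
        { 0x80, 3, FLAT_RING1_CS, (unsigned long) syscall_handler },    /* dpl 3: usermode ints */
        { 0, 0, 0, 0 } };

      HYPERVISOR_set_trap_table (trap_table);   /* called from init code */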

What horrors await me in handling events? Event handling is fairly straightforward. It enters your handler with a 'bare' stack ready for an iret, just like real hardware would enter an interrupt handler. You process the events that are tagged in shared_info.events, atomically clearing the bits. When finished, restore shared_info.events_mask, restore registers and iret. One thing that got me was that Xen clears both the individual event enable bits for the events it is delivering and the master enable bit. I suppose this might have been an effort to emulate EFLAGS IF and 8259s needing an EOI, but it made my work a little more difficult. Also, there is no hypervisor call to do an iret and restore the mask at the same time, so you must do some messy stuff when exiting your handler to make sure you don't miss an event and don't eat up your kernel stack.
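
Here is a minimal sketch of the dispatch part (xchg is a stand-in for an atomic exchange, e.g. lock xchgl, and process_event for your per-event handlers):

      shared_info_t *s = HYPERVISOR_shared_info;
      unsigned long pending;
      int e;

      /* atomically fetch-and-clear the pending bits so we can't lose one
         that Xen sets whilst we're working */
      while ((pending = xchg (&s -> events, 0)) != 0) {
        for (e = 0; e < 32; e ++) {
          if (pending & (1UL << e)) process_event (e);
        }
      }
      /* now restore s -> events_mask (remember Xen cleared the delivered
         bits and the master bit), then restore registers and iret */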


Porting an OS -- Part IV

So now the question becomes: how can OZONE best be fitted into this? It seems that Xen maps all requested pages into virtual address space when the virtual machine is booted. So if I statically link the OZONE kernel at, say, 0xC0000000, that would give me a maximum physical memory size of 1024M-64M=960M (minus some amount for OZONE's paged pool and any dynamic images). That seems reasonable for now anyway. VAX/VMS puts its kernel at 0x80000000, so I could do that with OZONE if there is ever a system with over 768M of virtual machine memory to support.

OZONE does not in general require memory to be mapped to VA space, but that seems to be the way Xen works, so be it. The standard x86 OZONE uses a pair (per CPU) of pagetable entries to dynamically access memory by physical address. Doing that under XEN would require a HYPERVISOR_mmu_update call each time, which would slow it way down. So for now, I am going to just leave all physical memory mapped and access it that way.

OZONE's memory manager is based on what it thinks are physical page numbers, using an array indexed from zero through the top physical page number - 1. So there are these practical choices for creating and indexing the array:

Using the second option is probably the best. The real downside of this approach would be device drivers that do DMA and require true physical addresses. As was the case for the Alpha port, though, the drivers do not assume that a physical memory address is the same as a DMA bus address. So if Xen someday allows a VM to access a controller doing DMA, that version of OZONE could provide the appropriate layer for mapping DMA transfers. Right now I have oz_dev_pci_486.c and oz_dev_pyxis_axp.c to do the mapping, so I would just have to add something like oz_dev_pci_xen.c.

Another thing we can probably do is get rid of OZONE's loader. Its primary purpose was so someone could alter the startup parameters. Secondarily, someone could copy some installation files before booting the kernel. Since with Xen you get a fully functioning OS (as Domain 0), these functions can be performed there. So our oz_kernel_xen.s is going to receive control directly from the Xen loader. Also, come to think of it, we don't have a console to read interactive loader commands from anyway!

So far, my memory looks like:

                                                      Virtual Address
        +------------------------------------------------+
        |                                                |
        |   XEN Hypervisor                               |
        |                                                |    <- FC000000  (4M boundary)
        +------------------------------------------------+
        |                                                |
        |   Used for kernel expansion                    |    <- used for dynamic kernel images, paged pool
        |                                                |
        +------------------------------------------------+
        |                                                | \
        |   Remaining free page mapping                  | |
        |                                                | |
        +------------------------------------------------+ |
        |                                                | |
        |   System global pagedirectory and table        | | all of the VM's physical memory is mapped
        |                                                | | ... in here as given by Xen on startup
        +------------------------------------------------+ |
        |                                                | |
        |   Kernel image as loaded by XEN                | |
        |                                                | /  <- C0000000  (4M boundary)
        +------------------------------------------------+
        |                                                |
        |   Per-process stack                            |
        |                                                |
        +------------------------------------------------+
        |                                                |
        |   Per-process code and heap                    |
        |                                                |    <- 00800000  (4M boundary)
        +------------------------------------------------+
        |                                                |
        |   Per-process pagetable                        |
        |                                                |    <- 00400000  (4M boundary)
        +------------------------------------------------+
        |                                                |
        |   "Requested page protection" table            |    (holds page protection bits requested by application)
        |                                                |    <- 003C0000
        +------------------------------------------------+
        |                                                |
        |   Per-process "pdata" array                    |    (holds per-process things like user & kernel malloc listhead)
        |                                                |    <- 003BE000
        +------------------------------------------------+
        |                                                |
        |   No Access                                    |
        |                                                |    <- 00000000
        +------------------------------------------------+
I put my per-process pagetables and other stuff at the low end of virtual memory because... The area just above the pagetables (starting at 0x00800000) up to the user image (0x07FFFFFF) is valid virtual address space. So the general malloc routine will start putting its heap in there, and dynamic images can be loaded there as well. The thread creation routine maps user stacks to the high end of per-process virtual addresses, just below the kernel.

XEN starts you off with the page directory and pagetable pages at the high end of virtual memory. I suppose I could have used them there as is, except I don't know if XEN guarantees it will always put them there, or that they will always be there in order. So I swap them around (just changing the mapping, using the same physical pages). I put the directory immediately after the kernel image, followed by the pagetable pages. Then I follow that with some pages of zeroes to pad out to the highest system pagetable page I want (to map the "Used for kernel expansion" area).

Padding with the zeroes also makes the upper portion of the page directory static, so all my page directories will have the same upper-end contents forever. I don't have to worry about keeping all the per-process directories updated.

I use spinlock levels 0xA0 through 0xBE to correspond to events 0 through 30. So when I set the spinlock level to 0xA0, it blocks event 0. When it is set to 0xA1 it blocks events 0 and 1; at level 0xA2 it blocks 0, 1 and 2, etc. Any level below 0xA0 will not block any event deliveries; any level at or above 0xBE will block all event deliveries. This is analogous to my using levels 0xE0..0xEF for the irq levels in the bare hardware x86 version of OZONE.

So when I set spinlock 0xA0, I clear mask bit <0>, and the mask is 0x?FFFFFFE. Setting spinlock 0xA9 sets the mask to 0x?FFFFC00. The top bit (master enable) is independent of spinlock level, as in OZONE you can have spinlocks either with or without hardware interrupts enabled. So the mask at level 0xA0 might be either 0x7FFFFFFE (master enable off) or 0xFFFFFFFE (master enable on).
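
Here is a minimal sketch of the level-to-mask computation (the function name is mine):

      unsigned long eventmask_for_level (int level, int master_enable)
      {
        unsigned long mask;

        if (level < 0xA0) mask = 0x7FFFFFFF;                             /* blocks no events */
        else if (level >= 0xBE) mask = 0;                                /* blocks them all  */
        else mask = (0xFFFFFFFFUL << (level - 0xA0 + 1)) & 0x7FFFFFFF;   /* blocks 0..level-0xA0 */
        if (master_enable) mask |= 0x80000000;                           /* top bit independent of level */
        return (mask);
      }

Check: eventmask_for_level (0xA0, 0) gives 0x7FFFFFFE and eventmask_for_level (0xA9, 1) gives 0xFFFFFC00, matching the examples above.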


Porting an OS -- Part V

Fun facts I am finding out:

Porting an OS -- Part VI -- Startup outline

My startup sequence looks like:

  1. Set esp to pages within the kernel image so it won't get tromped on when switching pages around. Call HYPERVISOR_stack_switch to tell Xen I have moved it (see the sketch after this list).
  2. Copy start_info struct to a page within the kernel image for the same reason.
  3. Zero out my kernel image BSS section (I don't know if Xen's loader does this or not).
  4. Initialize my event level spinlocks (0xA0..0xBE)
  5. Make sure my read-only kernel image pages are readonly and also readable by user mode. The way I do things, my kernel just looks like a big shareable library to application programs, and therefore must be readable by user mode.
  6. Likewise, make sure the read/write pages are not readable or writeable by user mode.
  7. Move the given pagedirectory and pagetable pages just after the kernel image and leave room for more pagetable pages.
  8. Map the shared_info struct to a virtual address linked in kernel image.
  9. Calculate timekeeping factors
  10. Set up trap table to handle CPU-generated exceptions (like pagefault, div-by-zero, syscall int, etc)
  11. Set up event callback routine to handle asynchronous event deliveries and enable event delivery
  12. Parse command-line parameters in the start_info struct
Then it jumps to my hardware independent startup routine, oz_knl_boot_firstcpu, which sets up pool space, initializes modules, mounts the system disk and spawns the startup shell.
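
A minimal sketch of step 1 (boot_stack_top is an assumed symbol at the top of a stack area linked into the kernel image; FLAT_RING1_DS comes from hypervisor-if.h):

      extern char boot_stack_top[];   /* top of a stack area inside the kernel image */

      /* %esp itself gets loaded in the assembler startup code; this just tells
         Xen where to point the ring 1 stack on future ring transitions */
      HYPERVISOR_stack_switch (FLAT_RING1_DS, (unsigned long) boot_stack_top);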


Disk IO

In V1.2, disk IO is performed by placing requests in a ring buffer, then signalling the hypervisor to process them. The hypervisor then replaces the requests with responses and signals their presence by calling your asynchronous event handler with the BLKDEV bit set in events.

The first thing you must do is reset the ring addresses and map the ring buffer to your virtual address space. There is just one ring buffer that is shared among all the virtual disks. To get its machine address:

      op.cmd = BLOCK_IO_OP_RESET;
      rc = HYPERVISOR_block_io_op (&op);
      if (rc != 0) {
        oz_knl_printk ("oz_dev_xendisk_init: error %d resetting ring\n", rc);
        return;
      }

      op.cmd = BLOCK_IO_OP_RING_ADDRESS;
      HYPERVISOR_block_io_op (&op);

      machine_address = op.u.ring_mfn << 12;
Make whatever calls your OS provides to get an unassigned system pagetable entry suitable for mapping (like you would for accessing memory-mapped IO registers on a real system), then call HYPERVISOR_mmu_update to map it to the machine address of the ring buffer.
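
A minimal sketch, under the V1.2 rule that .ptr carries the virtual address of the pagetable entry to update; ring_va, pte_of() and the 0x63 flag bits (present/write/accessed/dirty) are all assumptions:

      mmu_update_t u;
      int rc;

      u.ptr = (unsigned long) pte_of (ring_va) | MMU_NORMAL_PT_UPDATE;  /* VA of the pte itself     */
      u.val = machine_address | 0x63;                                   /* ring page + P/W/A/D bits */
      rc = HYPERVISOR_mmu_update (&u, 1);
      if (rc == 0) blk_ring = (blk_ring_t *) ring_va;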

Next, you need to find out what virtual disks are set up for you to use in your virtual machine.

      memset (&op, 0, sizeof op);
      op.cmd = BLOCK_IO_OP_VBD_PROBE;
      op.u.probe_params.domain    = 0;
      op.u.probe_params.xdi.max   = MAX_VBDS;
      op.u.probe_params.xdi.disks = vbd_info;
      op.u.probe_params.xdi.count = 0;

      rc = HYPERVISOR_block_io_op (&op);
      if (rc != 0) {
        oz_knl_printk ("oz_dev_xendisk_init: error %d probing number of vbds\n", rc);
        return;
      }

      number_of_defined_disks = op.u.probe_params.xdi.count;
Then loop through the vbd_info array to get the info about each disk. The only two elements I needed were the device number and the capacity (total number of blocks). Call whatever routines in your OS you need to define device table entries for any virtual disks.

There are two ring array indices provided in the ring struct that you mapped: req_prod (incremented by you as you queue requests) and resp_prod (incremented by Xen as it posts responses).

You will need to provide a third (in your own private memory somewhere): resp_cons, the index of the next response to consume. All three indices must be initialized to zero when you start. When incrementing, if they reach BLK_RING_SIZE, reset them to zero immediately (ie, they should always remain in the range 0..BLK_RING_SIZE-1). Conceptually, they are in the order (mod BLK_RING_SIZE):
      resp_cons <= resp_prod <= req_prod

When all three indices are equal, it means the ring is empty. You must not increment req_prod all the way 'round to resp_cons or it will get confused with the empty condition, so you must always leave at least one empty spot in the ring. When inserting or removing items from the ring using req_prod or resp_prod, be sure to place memory barriers between accessing the indices and accessing the contents of the slots, as Xen may be accessing them with another CPU:

      while (there is a request to queue) {
        indx = blk_ring -> req_prod;
        if (((indx + BLK_RING_SIZE - resp_cons) % BLK_RING_SIZE) >= BLK_RING_SIZE - 1) break;
        fill in blk_ring -> ring[indx].req with request
        MB to make sure hypervisor will see a valid blk_ring -> ring[indx].req
        if (++ indx == BLK_RING_SIZE) indx = 0;
        blk_ring -> req_prod = indx;
      }
      while (blk_ring -> resp_prod != resp_cons) {
        MB to make sure blk_ring -> ring[resp_cons].resp is valid
        read response from blk_ring -> ring[resp_cons].resp
        if (++ resp_cons == BLK_RING_SIZE) resp_cons = 0;
      }

When you have placed some requests in the ring and have incremented req_prod, do:

      op.cmd = BLOCK_IO_OP_SIGNAL;
      rc = HYPERVISOR_block_io_op (&op);
      if (rc != 0) oz_knl_printk ("oz_dev_xendisk startreq: error %d signalling\n", rc);
The hypervisor will signal the BLKDEV event as each request completes.

My driver is in oz_dev_xendisk.c. There are probably just a few routines of general interest:


Network IO

Like disk IO, network IO is performed by placing requests in a ring buffer, then signalling the hypervisor to process them. The hypervisor then replaces the requests with responses and signals their presence by calling your asynchronous event handler with the EVENT_NET bit set in events.

Unlike disk IO, however, there is a separate ring for each virtual network interface your domain can access, and the receive and transmit rings are separated. This makes sense from the standpoint that if you have one very active interface and one relatively inactive one, you wouldn't want requests from the inactive interface interfering with requests from the active one and vice versa.

Each interface operates independently, so you must set up a probing loop to check all interfaces. There is a symbol, MAX_DOMAIN_VIFS, that you can use to terminate your probing loop. The loop should do a NETOP_RESET_RINGS call followed by a NETOP_GET_VIF_INFO, and accept the device if both calls are successful (return zero status). You can retrieve the 6-byte virtual ethernet hardware address from the netop.u.get_vif_info element and build your OS-dependent device table from there.

You will need a pagetable entry for each virtual device, to map its ring buffer. The ring buffer's machine page number is returned in netop.u.get_vif_info.ring_mfn after each probe; map an unused pagetable entry to this page using HYPERVISOR_mmu_update or HYPERVISOR_update_va_mapping.
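
Here is a minimal sketch of the probe-and-map loop (the vif and vmac field names are from my reading of hypervisor-if.h, so check yours; map_ring_page and define_network_device are stand-ins for whatever your OS does):

      netop_t netop;
      int i;

      for (i = 0; i < MAX_DOMAIN_VIFS; i ++) {
        netop.cmd = NETOP_RESET_RINGS;                          /* reset this vif's rings    */
        netop.vif = i;
        if (HYPERVISOR_net_io_op (&netop) != 0) continue;       /* no such vif               */
        netop.cmd = NETOP_GET_VIF_INFO;                         /* get its mac and ring page */
        netop.vif = i;
        if (HYPERVISOR_net_io_op (&netop) != 0) continue;
        map_ring_page (netop.u.get_vif_info.ring_mfn);          /* mmu_update, as for disks  */
        define_network_device (i, netop.u.get_vif_info.vmac);   /* 6-byte virtual hw address */
      }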

You will also need one page per receive queued to a virtual device. So if you intend to keep 10 receives per virtual device in the receive ring at all times, you will need to allocate 10 pages per virtual device. Each of these pages must have exactly one read/write pagetable entry pointing to it or Xen will reject the entry. I think this is because Xen takes the page away from you when you queue the request, then may give you back a different page when the request completes. So don't assume you get the same page back. But Xen remaps the new page to your same old pagetable entry, so you don't really know the difference. For OZONE, I simply give it the initial mapping pagetable entry that Xen provided on startup.

The rings work pretty much like the disk queues, except that they are split into separate rings. There is also another index you must contend with, rx_event or tx_event. With these indices, you tell Xen how often to deliver the EVENT_NET event. The event is delivered when tx_resp_prod is incremented and becomes equal to tx_event (likewise for the rx indices). Suffice it to say that, if you have pending requests, you want to be sure that at some point you set xx_event greater than xx_resp_prod, so that Xen will queue an event.

Anyway, the conceptual order of the indices are (mod XX_RING_SIZE):

      xx_resp_cons <= xx_resp_prod <= xx_event <= xx_req_prod
Be sure to include memory barriers where appropriate, as Xen may be munching away on your ring with another CPU whilst you are inserting or removing items. The addition of the xx_event indices complicates things a bit, as you don't want to miss an event delivery.

When you place new requests in either the transmit or receive ring, you must call HYPERVISOR_net_io_op with a function code of NETOP_PUSH_BUFFERS to tell Xen to start working on your new requests.
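
For example (netop_t and the vif field as in the probe loop sketch above):

      netop_t netop;

      netop.cmd = NETOP_PUSH_BUFFERS;
      netop.vif = vif_number;            /* the interface whose ring you just updated */
      HYPERVISOR_net_io_op (&netop);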

My driver is in oz_dev_xenetwork.c.


As of Aug 8, 2004, it boots, starts my cli (shell) and runs the startup script. Now I have to implement the network virtual device driver and I 'should' be able to telnet to it.

Here is a list of my Xen-specific source files so far (updated Aug 16, 2004):


Startup log (from Emmy):
 __  __            _   ____  
 \ \/ /___ _ __   / | |___ \ 
  \  // _ \ '_ \  | |   __) |
  /  \  __/ | | | | |_ / __/ 
 /_/\_\___|_| |_| |_(_)_____|
                             
 http://www.cl.cam.ac.uk/netos/xen
 University of Cambridge Computer Laboratory

 Xen version 1.2 (m@n.n) (gcc version 2.96 20000731 (Red Hat Linux 7.1 2.96-98)) Thu Aug 
5 15:44:10 EDT 2004

Initialised all memory on a 320MB machine
Reading BIOS drive-info tables at 0x9fd90 and 0x9fda0
CPU0: Before vendor init, caps: 0000aa19 00000000 00000000, vendor = 0
CPU caps: 0000aa19 00000000 00000000 00000000
Initialising domains
Initialising schedulers
Initializing CPU#0
Detected 2.383 MHz processor.
Found and enabled local APIC!
CPU0: Before vendor init, caps: 0000aa19 00000000 00000000, vendor = 0
CPU caps: 0000aa19 00000000 00000000 00000000
CPU0 booted
SMP motherboard not detected.
Emmy_X86::lapicwrite: illegal write register 0x030, data 00FB00EF, eip 0808:FC623000
Emmy_X86::lapicwrite: illegal write register 0x020, data 0000000F, eip 0808:FC623000
enabled ExtINT on CPU#0
Emmy_X86::lapicwrite: illegal write register 0x280, data 00000000, eip 0808:FC623000
ESR value before enabling vector: 00000000
Emmy_X86::lapicwrite: illegal write register 0x280, data 00000000, eip 0808:FC623000
ESR value after enabling vector: 00000000
Using local APIC timer interrupts.
Calibrating APIC timer for CPU0...
..... CPU speed is 2.3787 MHz.
..... Bus speed is 1.1781 MHz.
..... bus_scale = 0x00000134
ACT: Initialising Accurate timers
Time init:
.... System Time: 10615906ns
.... cpu_freq:    00000000:00245E78
.... scale:       000001A3:8DFA5203
.... Wall Clock:  1092057892s 0us
Start schedulers
PCI: PCI BIOS revision 2.10 entry at 0xf33b0, last bus=0
PCI: Probing PCI hardware
PCI: device 00:00.0 has unknown header type 7f, ignoring.
Uniform Multi-Platform E-IDE driver Revision: 6.31
ide: Assuming 50MHz system bus speed for PIO modes; override with idebus=xx
hda: xenodisk, ATA DISK drive
hdb: hda3, ATA DISK drive
ide0 at 0x1f0-0x1f7,0x3f6 on irq 14
hda: 6640704 sectors (3400 MB), CHS=6588/16/63 PIO (slow!)
hdb: 417690 sectors (214 MB), CHS=442/15/63 PIO (slow!)
SCSI subsystem driver Revision: 1.00
Red Hat/Adaptec aacraid driver (1.1.2 Aug  3 2004 09:37:26)
Device dummy opened and ready for use.
DOM0: Guest OS virtual load address is c0000000
DOM0: Guest OS virtual stack address is c3fee000
DOM0: oz_hwxen_start: initializing as domain 0
DOM0: (file=memory.c, line=331) Page 0558f000 bad type/count (02000000!=01000000) cnt=2
DOM0: oz_hwxen_start*: initial event 0x102
DOM0: oz_hwxen_start*: initial events_mask 0x0
DOM0: oz_hwxen_start: CPU frequency 2383480 Hz
DOM0: oz_hwxen_event_vbd_upd*:
DOM0: oz_hwxen_start: boot time 2004-08-09@13:24:56.1475021z
DOM0: oz_ldr_set: parameter signature cannot be changed
DOM0: Copyright (C) 2001,2002,2003,2004  Mike Rieker, Beverly, MA USA
DOM0: Version 2004-01-03, OZONE comes with ABSOLUTELY NO WARRANTY
DOM0: EXPECT it to FAIL when someone's HeALTh or PROpeRTy is at RISk
DOM0: 
DOM0: oz_knl_boot_firstcpu:
DOM0:             total number of cpus: 1
DOM0:             page size (in bytes): 4096
DOM0:   total pages of physical memory: 0x4000 (64 Megabytes)
DOM0:      system base virtual address: 0xc0000000
DOM0:        system page table entries: 0x10000 (256 Megabytes)
DOM0:      initial non-paged pool size: 0x400000 (4096 Kilobytes)
DOM0:             first free virt page: 0xC4000
DOM0:             first free phys page: 0xC0
DOM0: 
DOM0: oz_knl_debug 0: initialized (cb 0xc0052940, cp 0x0, dc 0xc0079860)
DOM0: oz_knl_boot_firstcpu: initializing physical memory
DOM0: oz_knl_phymem_init: cache modulo: L1 1 page (4 K), L2 1 page (4 K)
DOM0: oz_knl_phymem_init: 1296 pages required for phys mem state table and non-paged pool
DOM0: oz_hw_pool_init: 0x510 pages, ppage 0x3AF0, vpage 0xC3AF0
DOM0: oz_knl_phymem_init: physical memory state array at vaddr 0xc3af0000, phypage 0x3AF0
DOM0: oz_knl_phymem_init: initial non-paged pool size 4282432 (4182 K), base 0xc3bea7c0
DOM0: oz_knl_phymem_init: there are 0x3A30 free pages left (58 Meg)
DOM0: oz_knl_boot_firstcpu: initializing modules
DOM0: oz_knl_idno_init: max 256 at 0xc3beb0a8
DOM0: oz_knl_boot_firstcpu: creating system process
DOM0: oz_knl_user_create: user OZ_Startup logged on at 2004-08-09@13:24:57.7618505z
DOM0: oz_knl_thread_cpuinit: cpu 0 initialization complete
DOM0: oz_knl_boot_firstcpu: defining logical names
DOM0: oz_knl_boot_firstcpu: starting device drivers
DOM0: oz_dev_timer_init
DOM0: oz_dev_vdfs_init (oz_dfs)
DOM0: oz_dev_knlcons_init
DOM0: oz_dev_xendisk_init
DOM0: oz_dev_xendisk: xenhd_0 totalblocks 6640704
DOM0: oz_dev_xendisk: xenhd_1 totalblocks 417690
DOM0: oz_knl_boot_firstcpu: device driver init complete
DOM0:   console _console1: console via oz_hw_putcon and oz_hw_getcon
DOM0:   xenhd_0 _disk1   : virtual hardisk 0x300
DOM0:   xenhd_1 _disk2   : virtual hardisk 0x340
DOM0:    oz_dfs _fs1     : init and mount template
DOM0:     timer _timer1  : generic timer
DOM0: oz_knl_boot_firstcpu: creating startup process
DOM0: oz_knl_startup: mounting xenhd_1 via oz_dfs
DOM0: oz_dev_dfs: volume oz_dfs mounted at 2004-08-09@13:11:17.8527305z was not dismounted
DOM0: oz_hw_process_initctx*: ppdsa C01C0000, ppdma 01750000
DOM0: (file=memory.c, line=331) Page 01750000 bad type/count (02000000!=01000000) cnt=1
DOM0: oz_hwxen_pte_write*:  pin ma 01751C07 for vpn 00400 (oldpte 00000000)
DOM0: 
DOM0: *** Reading and validating home block
DOM0: 
DOM0: *** Opening sacred files
DOM0: 
DOM0: *** Reading bitmaps
DOM0: 
DOM0: *** Scanning file headers
DOM0: 
DOM0: *** Checking extension header links
DOM0: 
DOM0: *** Checking directories
DOM0: 
DOM0: *** Writing bitmaps
DOM0: oz_hwxen_pte_write*: unpin ma 01751C67 for vpn 00400 (newpte 00000000)
DOM0: oz_hwxen_pinpdpage*: unpin ma 01750000
DOM0: (file=memory.c, line=367) Bad page type/domain (dom=0) (type 33554432 != expected 16777216)
DOM0: oz_hw_kstack_delete*: kernel stack depth 1864 (0x748)
DOM0: oz_knl_startup: volume mounted on device xenhd_1.oz_dfs
DOM0:   OZ_SYSTEM_DIRECTORY (kernel) (ref:1) (table)
DOM0:     OZ_DEFAULT_TBL (kernel) (ref:1) = 'OZ_PROCESS_TABLE' 'OZ_PARENT_TABLE' 'OZ_JOB_TABLE' 'OZ
_USER_TABLE' 'OZ_SYSTEM_TABLE'
DOM0:     OZ_SYSTEM_TABLE (kernel) (ref:1) (table)
DOM0:       OZ_DEFAULT_DIR (kernel) (ref:0) = 'xenhd_1.oz_dfs:/ozone/binaries/' (terminal)
DOM0:       OZ_IMAGE_DIR (kernel) (ref:0) = 'xenhd_1.oz_dfs:/ozone/binaries/' (terminal)
DOM0:       OZ_LOAD_DEV (kernel) (ref:0) = 'xenhd_1' (terminal)
DOM0:       OZ_LOAD_DIR (kernel) (ref:0) = 'xenhd_1.oz_dfs:/ozone/binaries/' (terminal)
DOM0:       OZ_LOAD_FS (kernel) (ref:0) = 'xenhd_1.oz_dfs:' (terminal)
DOM0:       OZ_SYSTEM_PROCESS (nosupersede) (nooutermode) (kernel) (ref:1) = 'C3BECD28:process' (ob
ject)
DOM0: oz_knl_startup: loading kernel image (oz_kernel_xen.elf) symbol table
DOM0: oz_knl_startup: spawning startup process
DOM0: oz_hw_process_initctx*: ppdsa C02C9000, ppdma 01859000
DOM0: (file=memory.c, line=331) Page 01859000 bad type/count (02000000!=01000000) cnt=1
DOM0: oz_hwxen_pte_write*:  pin ma 0185AC07 for vpn 00400 (oldpte 00000000)
DOM0: oz_knl_startup: startup process spawned
DOM0: oz_hwxen_pte_write*:  pin ma 0195DC07 for vpn 006FF (oldpte 00000000)
DOM0: oz_hwxen_pte_write*:  pin ma 01961C07 for vpn 00420 (oldpte 00000000)
DOM0: oz_hwxen_pte_write*:  pin ma 01970C07 for vpn 00402 (oldpte 00000000)
DOM0: params: executing startup procedure ...
DOM0: #
DOM0: # Define oz_cli's external commands
DOM0: #
DOM0: create logical table -kernel OZ_SYSTEM_DIRECTORY%OZ_CLI_TABLES
DOM0: create logical name  -kernel OZ_CLI_TABLES%cli       oz_cli.elf
DOM0: create logical name  -kernel OZ_CLI_TABLES%cat       oz_util_cat.elf
DOM0: create logical name  -kernel OZ_CLI_TABLES%copy      oz_util_copy.elf
DOM0: create logical name  -kernel OZ_CLI_TABLES%crash     oz_util_crash.elf
DOM0: create logical name  -kernel OZ_CLI_TABLES%credir    oz_util_credir.elf
DOM0: create logical name  -kernel OZ_CLI_TABLES%dd        oz_util_dd.elf
DOM0: create logical name  -kernel OZ_CLI_TABLES%debug     oz_util_debug.elf
DOM0: create logical name  -kernel OZ_CLI_TABLES%delete    oz_util_delete.elf
DOM0: create logical name  -kernel OZ_CLI_TABLES%dir       oz_util_dir.elf
DOM0: create logical name  -kernel OZ_CLI_TABLES%dism      oz_util_dismount.elf
DOM0: create logical name  -kernel OZ_CLI_TABLES%dump      oz_util_dump.elf
DOM0: create logical name  -kernel OZ_CLI_TABLES%edt       edt.elf
DOM0: create logical name  -kernel OZ_CLI_TABLES%elfconv   oz_util_elfconv.elf
DOM0: create logical name  -kernel OZ_CLI_TABLES%format    oz_util_diskfmt.elf
DOM0: create logical name  -kernel OZ_CLI_TABLES%gunzip    oz_util_gzip.elf
DOM0: create logical name  -kernel OZ_CLI_TABLES%gzip      oz_util_gzip.elf
DOM0: create logical name  -kernel OZ_CLI_TABLES%init      oz_util_init.elf
DOM0: create logical name  -kernel OZ_CLI_TABLES%ip        oz_util_ip.elf
DOM0: create logical name  -kernel OZ_CLI_TABLES%ldelf     oz_util_ldelf32.elf
DOM0: create logical name  -kernel OZ_CLI_TABLES%make      oz_util_make.elf
DOM0: create logical name  -kernel OZ_CLI_TABLES%mount     oz_util_mount.elf
DOM0: create logical name  -kernel OZ_CLI_TABLES%partition oz_util_partition.elf
DOM0: create logical name  -kernel OZ_CLI_TABLES%purge     oz_util_delete.elf
DOM0: create logical name  -kernel OZ_CLI_TABLES%rename    oz_util_copy.elf
DOM0: create logical name  -kernel OZ_CLI_TABLES%scsi      oz_util_scsi.elf
DOM0: create logical name  -kernel OZ_CLI_TABLES%shutdown  oz_util_shutdown.elf
DOM0: create logical name  -kernel OZ_CLI_TABLES%sort      oz_util_sort.elf
DOM0: create logical name  -kernel OZ_CLI_TABLES%tailf     oz_util_tailf.elf
DOM0: create logical name  -kernel OZ_CLI_TABLES%telnet    oz_util_telnet.elf
DOM0: create logical name  -kernel OZ_CLI_TABLES%top       oz_util_top.elf
DOM0: create logical name  -kernel OZ_CLI_TABLES%type      oz_util_cat.elf
DOM0: #
DOM0: #  Create OZ_ROOT_DIR logical name (parent of kernel image directory)
DOM0: #
DOM0: create symbol -string def_dir 'oz_lnm_string (oz_lnm_lookup ("OZ_DEFAULT_DIR", "user"), 0)'
DOM0: create symbol -string load_dir 'oz_lnm_string (oz_lnm_lookup ("OZ_SYSTEM_TABLE%OZ_LOAD_DIR", 
"kernel"), 0)'
DOM0: set default {load_dir}
DOM0: set default ../
DOM0: create symbol -string root_dir 'oz_lnm_string (oz_lnm_lookup ("OZ_DEFAULT_DIR", "user"), 0)'
DOM0: set default {def_dir}
DOM0: create logical name -kernel OZ_SYSTEM_TABLE%OZ_ROOT_DIR -terminal {root_dir}
DOM0: #
DOM0: # Add current directory to image path
DOM0: #
DOM0: create logical name OZ_SYSTEM_TABLE%OZ_IMAGE_DIR OZ_DEFAULT_DIR: -copy OZ_IMAGE_DIR
DOM0: #
DOM0: # Set up timezone
DOM0: #
DOM0: create logical name -kernel OZ_SYSTEM_TABLE%OZ_TIMEZONE_DIR {root_dir}timezones/
DOM0: set timezone EST5EDT
DOM0: oz_cli: error 28 setting timezone to EST5EDT
DOM0: #
DOM0: # Define image used to log in and the password file
DOM0: #
DOM0: create logical name -kernel OZ_SYSTEM_TABLE%OZ_PASSWORD_FILE -terminal {root_dir}startup/pass
word.dat
DOM0: create logical name -kernel OZ_SYSTEM_TABLE%OZ_LOGON_IMAGE   oz_util_logon.elf
DOM0: create logical name OZ_SYSTEM_TABLE%OZ_UTIL_LOGON_MSG "OZONE backup server system" "Authorize
d access only"
DOM0: #
DOM0: # Declare debugger executable
DOM0: # To activate debugger for a program, use -debug option before the command name
DOM0: #
DOM0: create logical name OZ_SYSTEM_TABLE%OZ_DEBUG_IMAGE oz_util_debug.elf
DOM0: oz_hwxen_pte_write*: unpin ma 0185AC67 for vpn 00400 (newpte 00000000)
DOM0: oz_hwxen_pte_write*: unpin ma 01970C67 for vpn 00402 (newpte 00000000)
DOM0: oz_hwxen_pte_write*: unpin ma 01961C67 for vpn 00420 (newpte 00000000)
DOM0: oz_hwxen_pte_write*: unpin ma 0195DC67 for vpn 006FF (newpte 00000000)
DOM0: oz_hwxen_pinpdpage*: unpin ma 01859000
DOM0: (file=memory.c, line=367) Bad page type/domain (dom=0) (type 33554432 != expected 16777216)
DOM0: oz_hw_kstack_delete*: kernel stack depth 2324 (0x914)

Just for grins, I ran it on my test PC barebones style, and it just worked! Here are the config params I had to set up in my /etc/xc/vm3 file:
      image = "/root/oz_kernel_xen.gz"
      ramdisk = ""
      mem_size = 32
      vbd_list = [ ('phy:hdb3', 'hda', 'w') ]
      vbd_expert = 1
      cmdline_root = "load_device=xenhd_0"
The commands used to start it were:
    xen_nat_enable
    xen_read_console &
    xc_dom_create.py -D vmid=3 -f /etc/xc/vm3
And the output was:
[root@xenophilia xc]# xc_dom_create.py -D vmid=3 -f /etc/xc/vm3
Parsing config file '/etc/xc/vm3'
VM image           : "/root/oz_kernel_xen.gz"
VM ramdisk         : ""
VM memory (MB)     : "32"
VM IP address(es)  : "192.168.0.154"
VM block device(s) : "phy:hdb3,hda,w"
VM cmdline         : " load_device=xenhd_0 "
Warning: one or more hard disk extents
         writeable by one domain are also readable by another.
[7] oz_hwxen_start: initializing as domain 7
[7] oz_hwxen_start*: initial event 0x102
[7] oz_hwxen_start*: initial events_mask 0x0
[7] oz_hwxen_start: CPU frequency 350800800 Hz
VM started in domain 7
[7] oz_hwxen_event_vbd_upd*:
[7] oz_hwxen_start: boot time 2004-08-09@16:42:54.3775346z
[7] oz_ldr_set: parameter signature cannot be changed
[7] Copyright (C) 2001,2002,2003,2004  Mike Rieker, Beverly, MA USA
[7] Version 2004-01-03, OZONE comes with ABSOLUTELY NO WARRANTY
[root@xenophilia xc]# [7] EXPECT it to FAIL when someone's HeALTh or PROpeRTy is at RISk
[7]
[7] oz_knl_boot_firstcpu:
[7]             total number of cpus: 1
[7]             page size (in bytes): 4096
[7]   total pages of physical memory: 0x2000 (32 Megabytes)
[7]      system base virtual address: 0xc0000000
[7]        system page table entries: 0x10000 (256 Megabytes)
[7]      initial non-paged pool size: 0x400000 (4096 Kilobytes)
[7]             first free virt page: 0xC2000
[7]             first free phys page: 0xC0
[7]
[7] oz_knl_debug 0: initialized (cb 0xc0052940, cp 0x0, dc 0xc0079860)
[7] oz_knl_boot_firstcpu: initializing physical memory
[7] oz_knl_phymem_init: cache modulo: L1 1 page (4 K), L2 1 page (4 K)
[7] oz_knl_phymem_init: 1160 pages required for phys mem state table and non-paged pool
[7] oz_hw_pool_init: 0x488 pages, ppage 0x1B78, vpage 0xC1B78
[7] oz_knl_phymem_init: physical memory state array at vaddr 0xc1b78000, phypage 0x1B78
[7] oz_knl_phymem_init: initial non-paged pool size 4273184 (4173 K), base 0xc1becbe0
[7] oz_knl_phymem_init: there are 0x1AB8 free pages left (26 Meg)
[7] oz_knl_boot_firstcpu: initializing modules
[7] oz_knl_idno_init: max 256 at 0xc1bed4c8
[7] oz_knl_boot_firstcpu: creating system process
[7] oz_knl_user_create: user OZ_Startup logged on at 2004-08-09@16:42:54.5777125z
[7] oz_knl_thread_cpuinit: cpu 0 initialization complete
[7] oz_knl_boot_firstcpu: defining logical names
[7] oz_knl_boot_firstcpu: starting device drivers
[7] oz_dev_timer_init
[7] oz_dev_vdfs_init (oz_dfs)
[7] oz_dev_knlcons_init
[7] oz_dev_xendisk_init
[7] oz_dev_xendisk: xenhd_0 totalblocks 417690
[7] oz_knl_boot_firstcpu: device driver init complete
[7]   console _console1: console via oz_hw_putcon and oz_hw_getcon
[7]   xenhd_0 _disk1   : virtual hardisk 0x300
[7]    oz_dfs _fs1     : init and mount template
[7]     timer _timer1  : generic timer
[7] oz_knl_boot_firstcpu: creating startup process
[7] oz_knl_startup: mounting xenhd_0 via oz_dfs
[7] oz_dev_dfs: volume oz_dfs mounted at 2004-08-09@13:25:04.2861632z was not dismounted
[7] oz_hw_process_initctx*: ppdsa C01C0000, ppdma 0F3EF000
[7] oz_hwxen_pte_write*:  pin ma 0F3EEC07 for vpn 00400 (oldpte 00000000)
[7]
[7] *** Reading and validating home block
[7]
[7] *** Opening sacred files
[7]
[7] *** Reading bitmaps
[7]
[7] *** Scanning file headers
[7]
[7] *** Checking extension header links
[7]
[7] *** Checking directories
[7]
[7] *** Writing bitmaps
[7] oz_hwxen_pte_write*: unpin ma 0F3EEC67 for vpn 00400 (newpte 00000000)
[7] oz_hwxen_pinpdpage*: unpin ma 0F3EF000
[7] oz_hw_kstack_delete*: kernel stack depth 1740 (0x6cc)
[7] oz_knl_startup: volume mounted on device xenhd_0.oz_dfs
[7]   OZ_SYSTEM_DIRECTORY (kernel) (ref:1) (table)
[7]     OZ_DEFAULT_TBL (kernel) (ref:1) = 'OZ_PROCESS_TABLE' 'OZ_PARENT_TABLE' 'OZ_JOB_TABLE' 'OZ_USER_TABLE' 'OZ_SYSTEM_TABLE'
[7]     OZ_SYSTEM_TABLE (kernel) (ref:1) (table)
[7]       OZ_DEFAULT_DIR (kernel) (ref:0) = 'xenhd_0.oz_dfs:/ozone/binaries/' (terminal)
[7]       OZ_IMAGE_DIR (kernel) (ref:0) = 'xenhd_0.oz_dfs:/ozone/binaries/' (terminal)
[7]       OZ_LOAD_DEV (kernel) (ref:0) = 'xenhd_0' (terminal)
[7]       OZ_LOAD_DIR (kernel) (ref:0) = 'xenhd_0.oz_dfs:/ozone/binaries/' (terminal)
[7]       OZ_LOAD_FS (kernel) (ref:0) = 'xenhd_0.oz_dfs:' (terminal)
[7]       OZ_SYSTEM_PROCESS (nosupersede) (nooutermode) (kernel) (ref:1) = 'C1BEF148:process' (object)
[7] oz_knl_startup: loading kernel image (oz_kernel_xen.elf) symbol table
[7] oz_knl_startup: spawning startup process
[7] oz_hw_process_initctx*: ppdsa C02C9000, ppdma 0F2E6000
[7] oz_hwxen_pte_write*:  pin ma 0F2E5C07 for vpn 00400 (oldpte 00000000)
[7] oz_knl_startup: startup process spawned
[7] oz_hwxen_pte_write*:  pin ma 0F1E2C07 for vpn 006FF (oldpte 00000000)
[7] oz_hwxen_pte_write*:  pin ma 0F1DEC07 for vpn 00420 (oldpte 00000000)
[7] oz_hwxen_pte_write*:  pin ma 0F1CFC07 for vpn 00402 (oldpte 00000000)
[7] params: executing startup procedure ...
[7] #
[7] # Define oz_cli's external commands
[7] #
[7] create logical table -kernel OZ_SYSTEM_DIRECTORY%OZ_CLI_TABLES
[7] create logical name  -kernel OZ_CLI_TABLES%cli       oz_cli.elf
[7] create logical name  -kernel OZ_CLI_TABLES%cat       oz_util_cat.elf
[7] create logical name  -kernel OZ_CLI_TABLES%copy      oz_util_copy.elf
[7] create logical name  -kernel OZ_CLI_TABLES%crash     oz_util_crash.elf
[7] create logical name  -kernel OZ_CLI_TABLES%credir    oz_util_credir.elf
[7] create logical name  -kernel OZ_CLI_TABLES%dd        oz_util_dd.elf
[7] create logical name  -kernel OZ_CLI_TABLES%debug     oz_util_debug.elf
[7] create logical name  -kernel OZ_CLI_TABLES%delete    oz_util_delete.elf
[7] create logical name  -kernel OZ_CLI_TABLES%dir       oz_util_dir.elf
[7] create logical name  -kernel OZ_CLI_TABLES%dism      oz_util_dismount.elf
[7] create logical name  -kernel OZ_CLI_TABLES%dump      oz_util_dump.elf
[7] create logical name  -kernel OZ_CLI_TABLES%edt       edt.elf
[7] create logical name  -kernel OZ_CLI_TABLES%elfconv   oz_util_elfconv.elf
[7] create logical name  -kernel OZ_CLI_TABLES%format    oz_util_diskfmt.elf
[7] create logical name  -kernel OZ_CLI_TABLES%gunzip    oz_util_gzip.elf
[7] create logical name  -kernel OZ_CLI_TABLES%gzip      oz_util_gzip.elf
[7] create logical name  -kernel OZ_CLI_TABLES%init      oz_util_init.elf
[7] create logical name  -kernel OZ_CLI_TABLES%ip        oz_util_ip.elf
[7] create logical name  -kernel OZ_CLI_TABLES%ldelf     oz_util_ldelf32.elf
[7] create logical name  -kernel OZ_CLI_TABLES%make      oz_util_make.elf
[7] create logical name  -kernel OZ_CLI_TABLES%mount     oz_util_mount.elf
[7] create logical name  -kernel OZ_CLI_TABLES%partition oz_util_partition.elf
[7] create logical name  -kernel OZ_CLI_TABLES%purge     oz_util_delete.elf
[7] create logical name  -kernel OZ_CLI_TABLES%rename    oz_util_copy.elf
[7] create logical name  -kernel OZ_CLI_TABLES%scsi      oz_util_scsi.elf
[7] create logical name  -kernel OZ_CLI_TABLES%shutdown  oz_util_shutdown.elf
[7] create logical name  -kernel OZ_CLI_TABLES%sort      oz_util_sort.elf
[7] create logical name  -kernel OZ_CLI_TABLES%tailf     oz_util_tailf.elf
[7] create logical name  -kernel OZ_CLI_TABLES%telnet    oz_util_telnet.elf
[7] create logical name  -kernel OZ_CLI_TABLES%top       oz_util_top.elf
[7] create logical name  -kernel OZ_CLI_TABLES%type      oz_util_cat.elf
[7] #
[7] #  Create OZ_ROOT_DIR logical name (parent of kernel image directory)
[7] #
[7] create symbol -string def_dir 'oz_lnm_string (oz_lnm_lookup ("OZ_DEFAULT_DIR", "user"), 0)'
[7] create symbol -string load_dir 'oz_lnm_string (oz_lnm_lookup ("OZ_SYSTEM_TABLE%OZ_LOAD_DIR", "kernel"), 0)'
[7] set default {load_dir}
[7] set default ../
[7] create symbol -string root_dir 'oz_lnm_string (oz_lnm_lookup ("OZ_DEFAULT_DIR", "user"), 0)'
[7] set default {def_dir}
[7] create logical name -kernel OZ_SYSTEM_TABLE%OZ_ROOT_DIR -terminal {root_dir}
[7] #
[7] # Add current directory to image path
[7] #
[7] create logical name OZ_SYSTEM_TABLE%OZ_IMAGE_DIR OZ_DEFAULT_DIR: -copy OZ_IMAGE_DIR
[7] #
[7] # Set up timezone
[7] #
[7] create logical name -kernel OZ_SYSTEM_TABLE%OZ_TIMEZONE_DIR {root_dir}timezones/
[7] set timezone EST5EDT
[7] oz_cli: error 28 setting timezone to EST5EDT
[7] #
[7] # Define image used to log in and the password file
[7] #
[7] create logical name -kernel OZ_SYSTEM_TABLE%OZ_PASSWORD_FILE -terminal {root_dir}startup/password.dat
[7] create logical name -kernel OZ_SYSTEM_TABLE%OZ_LOGON_IMAGE   oz_util_logon.elf
[7] create logical name OZ_SYSTEM_TABLE%OZ_UTIL_LOGON_MSG "OZONE backup server system" "Authorized access only"
[7] #
[7] # Declare debugger executable
[7] # To activate debugger for a program, use -debug option before the command name
[7] #
[7] create logical name OZ_SYSTEM_TABLE%OZ_DEBUG_IMAGE oz_util_debug.elf
[7] oz_hwxen_pte_write*: unpin ma 0F2E5C67 for vpn 00400 (newpte 00000000)
[7] oz_hwxen_pte_write*: unpin ma 0F1CFC67 for vpn 00402 (newpte 00000000)
[7] oz_hwxen_pte_write*: unpin ma 0F1DEC67 for vpn 00420 (newpte 00000000)
[7] oz_hwxen_pte_write*: unpin ma 0F1E2C67 for vpn 006FF (newpte 00000000)
[7] oz_hwxen_pinpdpage*: unpin ma 0F2E6000
[7] oz_hw_kstack_delete*: kernel stack depth 2324 (0x914)

[root@xenophilia xc]#


August 12, 2004: I can ping!

[snip of startup stuff]

Here is the driver probing the ethernet devices, followed by a partial device listing; the driver names the device xenet_0:

[63] oz_dev_xenetwork_init
[63] oz_dev_xenetwork_init: found vif 0, address AA-00-00-AB-29-56
[63] oz_dev_xenetwork_init: error -22 resetting vif 1 rings
[63] oz_dev_xenetwork_init: error -22 resetting vif 2 rings
[63] oz_dev_xenetwork_init: error -22 resetting vif 3 rings
[63] oz_dev_xenetwork_init: error -22 resetting vif 4 rings
[63] oz_dev_xenetwork_init: error -22 resetting vif 5 rings
[63] oz_dev_xenetwork_init: error -22 resetting vif 6 rings
[63] oz_dev_xenetwork_init: error -22 resetting vif 7 rings
[63] oz_knl_boot_firstcpu: device driver init complete
[63]     console _console1: console via oz_hw_putcon and oz_hw_getcon
[63]     ramdisk _disk1   : mount [K/M]
[63]     xenhd_0 _disk2   : virtual hardisk 0x300
[63]   etherloop _ether1  : ethernet loopback
[63]     xenet_0 _ether2  : virtual ethernet AA-00-00-AB-29-56
[63]      oz_dfs _fs1     : init and mount template
[63]      oz_dpt _fs2     : mount template

[snip of more startup stuff]

Here is my equivalent of ifconfig, enabling the device and assigning its IP address:

[63] ip hw add xenet_0
[63] oz_dev_ip: device xenet_0, hw addr AA-00-00-AB-29-56, enabled, mtu 1500
[63] ip hw ipam add xenet_0 192.168.0.154 192.168.0.0 255.255.255.0
[63] #

Here is my ping command (as part of the startup script) pinging the router:

[63] ip ping 192.168.0.1
[63] ip: pinging 192.168.0.1, ip packet length 40, icmp length 16
[63] ip: error 19 enabling ctrl-C detection
[63] 192.168.0.1  seq      0  ttl 254  time 0.0027021
[63] 192.168.0.1  seq      1  ttl 254  time 0.0006668
[63] 192.168.0.1  seq      2  ttl 254  time 0.0006860
[63] 192.168.0.1  seq      3  ttl 254  time 0.0006731
[63] 192.168.0.1  seq      4  ttl 254  time 0.0006683
[63] 192.168.0.1  seq      5  ttl 254  time 0.0006555
[63] 192.168.0.1  seq      6  ttl 254  time 0.0006670
[63] 192.168.0.1  seq      7  ttl 254  time 0.0006726
[63] 192.168.0.1  seq      8  ttl 254  time 0.0006492
[63] 192.168.0.1  seq      9  ttl 254  time 0.0006788
[63] 192.168.0.1  seq     10  ttl 254  time 0.0006455
[63] 192.168.0.1  seq     11  ttl 254  time 0.0006634
[63] 192.168.0.1  seq     12  ttl 254  time 0.0006576
[63] 192.168.0.1  seq     13  ttl 254  time 0.0006416

August 13, 2004: And now I can telnet in!


Conclusions

So basically, at this point, I consider it working, though it has not been beaten on heavily. It was about as difficult as I expected. Here are the problems I remember, from most difficult to least:

  1. Process initialization and termination code took a while to get working. Most of the trouble was due to my putting the pagetables at the low end of virtual memory. The 486 and AXP ports both have them at the high end, so in those cases, on termination, all the data pages get freed first, then the pagetable pages. With the Xen port, termination tries to wipe the pagetable pages first, so the Xen cleanup code has to tell the higher-level code which data pages must be freed before a particular pagetable page can be released (see the first sketch after this list). All the higher-level code for doing that was already in place; I just didn't think of it when I wrote the Xen stuff.
  2. Sorting out how the event masks worked. It was surprising that event delivery clears both the master enable and the individual enables. There is no 'iret and restore masks' call to the hypervisor (maybe it would be too slow, but how slow compared to the alternative?). The traps (like pagefault) can clear the event enable, but don't tell you the previous state. The second sketch after this list shows the shape of the handler this forces.
  3. Straightening out the initial pagetable mapping the way I wanted, and the self-referencing pagedirectory pointer mess.
  4. Ethernet events are triggered when xresp_prod equals xx_event, not when it passes (i.e., becomes greater than) it; see the third sketch below.
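
Here is a minimal sketch of the ordering constraint from item 1. It is not OZONE's actual cleanup code; all the helper names (free_phys_page, hyper_pte_clear, hyper_unpin_and_free) are hypothetical, standing in for whatever the real port does via hypervisor calls:

	/*
	 * Sketch only: under Xen, a pinned pagetable page cannot be
	 * released until every PTE in it is clear, so the data pages it
	 * maps have to be freed first.  On the 486 and AXP ports the
	 * pagetables sit above the data pages and this ordering falls
	 * out naturally; with pagetables at the low end it must be done
	 * explicitly, as below.
	 */
	#include <stdint.h>

	#define PTES_PER_PAGE 1024	/* 32-bit x86: 1024 4-byte PTEs per 4K page */
	#define PTE_VALID     0x001	/* x86 'present' bit */

	typedef uint32_t pte_t;

	/* supplied elsewhere in the kernel - stubs as far as this sketch goes */
	extern void free_phys_page (uint32_t phypage);		/* return a page to the free list */
	extern void hyper_pte_clear (pte_t *pte);		/* validated PTE write via Xen */
	extern void hyper_unpin_and_free (pte_t *ptpage);	/* unpin pagetable page, then free it */

	void release_pagetable_page (pte_t *ptpage)
	{
	  int i;

	  for (i = 0; i < PTES_PER_PAGE; i ++) {
	    if (ptpage[i] & PTE_VALID) {
	      free_phys_page (ptpage[i] >> 12);	/* free the data page it maps ... */
	      hyper_pte_clear (&ptpage[i]);	/* ... then clear the mapping */
	    }
	  }
	  hyper_unpin_and_free (ptpage);	/* only now will Xen let the PT page go */
	}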
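
For item 2, here is a sketch of an event handler that restores the masks itself. The events/events_mask names match what my port prints at boot, but the struct layout, the master-enable bit position, and the helpers are all assumptions, not the real Xen 1.2 interface:

	/*
	 * Sketch only: since there is no 'iret and restore masks'
	 * hypercall, and delivery clears both the master enable and the
	 * individual enables, the handler rebuilds the mask from the
	 * kernel's own copy and re-checks for events that slipped in
	 * before the re-enable.
	 */
	#include <stdint.h>

	#define MASTER_ENABLE 0x80000000u	/* assumed position of the master bit */

	typedef struct { volatile uint32_t events;	/* pending event bits    */
	                 volatile uint32_t events_mask;	/* enable bits + master  */
	               } shared_info_t;

	extern shared_info_t *shared_info;	/* mapped in by Xen at boot */
	extern uint32_t enabled_events;		/* kernel's own copy of its enables */
	extern void dispatch_event (int bitno);	/* per-event handler */

	void event_interrupt (void)
	{
	  uint32_t pending;

	  do {
	    while ((pending = shared_info->events) != 0) {
	      int bitno = __builtin_ctz (pending);
	      shared_info->events &= ~(1u << bitno);	/* acknowledge (real code
	                                              	   needs an atomic clear) */
	      dispatch_event (bitno);
	    }
	    shared_info->events_mask = enabled_events | MASTER_ENABLE;	/* re-enable */
	  } while (shared_info->events != 0);	/* close the re-enable race window */
	}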
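
And for item 4, a sketch of why the equals-not-passes trigger matters. The resp_prod/event pairing follows the xresp_prod/xx_event pattern mentioned above, but the struct and helper here are schematic, not the real vif ring layout:

	/*
	 * Sketch only: Xen raises the event when resp_prod EQUALS event,
	 * so the event index must be set one past the last response
	 * consumed; setting it at or below resp_prod means it never fires.
	 */
	#include <stdint.h>

	typedef struct { volatile uint32_t resp_prod;	/* written by Xen    */
	                 volatile uint32_t event;	/* written by the OS */
	               } ring_idx_t;

	extern void process_response (uint32_t idx);	/* consume one response */

	void drain_ring (ring_idx_t *ring, uint32_t *resp_cons)
	{
	  do {
	    while (*resp_cons != ring->resp_prod) {
	      process_response (*resp_cons);
	      (*resp_cons) ++;
	    }
	    ring->event = *resp_cons + 1;	/* fires when the NEXT response lands */
	  } while (*resp_cons != ring->resp_prod);	/* response may have slipped in
	                                           	   before event was updated */
	}
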
I started this project about July 13th; it's now August 13th, so the port took about a month. It took me about two months to port OZONE to the Alpha, so from that perspective, this was easier.

All in all, I say to the Xen developers: nicely done! Even though my guest OS crashed over and over during porting, Xen kept running, and all I needed to do was reboot my VM. It pretty much works as advertised. Also, doing this project proved Emmy's worth as a development tool (as well as turning up a few bugs)!


Changes to OZONE's hardware-independent layer

I made these changes to the base OZONE code during the Xen porting effort:

This one file is common to both the standard 486 hardware layer and the Xen pseudo-hardware layer: