[lowrisc-dev] RFC: Hardware implemented Hypervisor with device abstraction

Reinoud Zandijk reinoud at NetBSD.org
Thu Apr 7 13:21:25 BST 2016


Dear folks,

I'd like to present you all with a proposal for a hardware-implemented
Hypervisor. Its main benefits are that it is very hard to attack from software
and that it has a completely integrated debug facility.

It's not 100% done of course, but the idea is worked out sufficiently to
share it as an RFC for the first round.

Hope to hear from you,
Reinoud Zandijk

-------------- next part --------------
Work document on lowRISC SoC, Hardware Hypervisor structure proposal, v4
Version 1 March 2016
Author Reinoud Zandijk


Pre-amble
=========

When the word `domain' is used here, it refers to either an operating
system or a privileged critical subsystem. When the phrase Hypervisor Minion
is used, it can be considered to be renamed and have the role of the Master
Minion :)


Summary
=======

The proposed structure facilitates a hypervisor that is almost entirely
implemented in hardware. No hypervisor software needs to run on the
application CPUs, while the system still has all the benefits of running a
hypervisor. As a bonus, it also features an excellent debugging interface as a
consequence of the design. The proposed implementation does have some
implications for the operating-system-to-device communication infrastructure.


Changelog
=========


Motivation
==========

Running a hypervisor in software on the Application CPUs has the disadvantage
that the hypervisor is running in a domain of its own on the CPUs, involving
switching contexts on hypercalls and on taking interrupts, polluting the L1
and L2 caches and so on. It also involves keeping a shared hypervisor
administration on the Application CPUs. The resulting lock contention on
accessing the administration can be mitigated somewhat by using a multikernel
design at the cost of taking more IPIs.

It would be nice to try to integrate the various subsystems, like
virtualization (through a hypervisor), the debug infrastructure and security
enforcement, into one unit that is invisible to the running OSes other than
providing a HAL.


Detailed design
===============

The proposed hypervisor is based on the assumption that for each hart its
virtualization and its (invasive) debug infrastructure share the same
features: controlling execution on the CPU, reading and writing CPU general
purpose registers, its CSR registers and various internal registers.

To facilitate this access, each hart has (at least conceptually) a debug bus
for read/write access to its general purpose registers (x1-x31, f0-f31, ...)
and its CPU control registers (CSRs, run/stop/step/watch etc.). The `guts' of
all Application CPU harts are thus accessible on the debug bus, but to the
hypervisor module only.

Each Application hart is also given a domain_id register, readable and
writable only by the Hypervisor Minion. This register value is used as a key
identifying the domain running on the CPU to the various subsystems it
connects to, like the external interrupt controller, for tagging entries in
its L1 cache and for device access checking.


Memory isolation
----------------

For complete isolation and security enforcement, each Application CPU hart's
memory accesses, whether through or bypassing the L2 cache, are to be
protected by a memory segmentation access checker that checks all accesses to
the physical memory segments regardless of the CPU mode.

For paged modes the segmentation access checker described in the memory
carving proposal will suffice, but for the non-paged machine mode each access
will need to be checked! If the domain behaves as it ought to, this will
never trap nor add noticeable overhead. The configuration of this
segmentation access checker is considered part of the state of the CPU and is
likewise accessible only to the Hypervisor Minion on the debug bus.

Note that the segment checker can only be overruled by the device sections as
defined below.


Hypervisor
----------

As stated, all the hypervisor's algorithms run on a separate dedicated Minion
CPU. It is important that this Minion has exclusive access to the debug bus.
Preferably it has a scratchpad memory of its own and a gateway to the L2
cache. It uses this gateway for generic memory access, reserving one memory
segment for its exclusive use to provide tables for various subsystems if
needed.


Booting the SoC
---------------

On booting the SoC, all Application and Minion CPUs stay halted and the
FlashROM contents are copied to the Hypervisor Minion's memory. The
Hypervisor Minion CPU is then started up, allowing it to initialise the
board, the SDRAM controllers and all other subsystems.

Regardless of how the Minions are implemented, their program memory is set up
by the Hypervisor Minion before they are released.

All Application CPUs are started by the hypervisor module when needed, by
seeding a register set and environment for them to run in and then releasing
them.


Communication
-------------

The integrated debug and hardware Hypervisor scheme has issues with using
FIFOs as a way of communicating, due to FIFOs having state. This could be
overcome by creating a send and a receive space in a CSR range. Messages are
sent or acknowledged by setting CSR bits that atomically stream the piece of
memory in a fire-and-forget way.

As an alternative, a virtio-PCI or mmapped virtio scheme could be used, but
it is very much tailored to running a virtual environment in a master-slave
configuration, due to its buffer design, where buffers are declared in slave
memory, and its dependency on the master environment always being able to
read and write the slave memory.

A tailored solution would suit better, effectively providing a safe way to
fix the memory access issues in one go. For this we define channels, each
with a unique channel number. Each channel number maps to a tuple defining
the channel characteristics, i.e. (domain_id, chan_select) -> (chan_rdbase,
chan_rdlen, chan_wrbase, chan_wrlen, dest_domain:dest_intbit).

We also define the following CSR registers:
* chan_select, a read/writeable CSR register that selects the channel to
  use.
* chan_intr, a write-only CSR register that generates the interrupt for the
  destination domain, marking the specified interrupt bit high.
* chan_rdbase and chan_rdlen, a pair of read-only CSR registers that
  specify the range where the currently selected channel has its read-only
  section.
* chan_wrbase and chan_wrlen, a pair of read-only CSR registers that
  specify the range where the currently selected channel has its write-only
  section.

The chan_select CSR register is part of the domain context, just as the
domain_id CSR is. Changing the extents while a domain is swapped out would
cause a big problem unless the Hypervisor Minion auto-fills the selected
chan_select's data on swap-in.

The channel numbers are free-form since they'll be coupled to a specific
domain.

Accessing memory in the [chan_rdbase, chan_rdbase + chan_rdlen) or in the
[chan_wrbase, chan_wrbase + chan_wrlen) ranges should *overrule* the segment
checker, thus allowing inter-segment communication through the devices and
removing the problem of memory sharing. Software ought, however, to take
`normal' inter-hart memory ordering into consideration, since the destination
is not known.

There might be more than one channel selectable at a time, to reduce the
number of misses and to allow fast inter-device copying, since two devices
can then be selected and their memory copied directly.


Implementing communication
--------------------------

On a chan_select write, the read-only CSR registers defined above are set by
the Hypervisor Minion, with the help of say an associative memory as a
quick-lookup. On misses in the associative arrays, the Hypervisor Minion CPU
may be interrupted to provide the data to the arrays.

Note that the channel scheme is the same for the `normal' Minions as for the
Application CPUs. This allows low-level devices to be implemented on a Minion
and high-level devices to be implemented in domains of their own on the
Application CPUs, without the destination being able to see the difference.

Some care needs to be taken to ensure the read and write ranges do not
overlap the scratchpad and MMIO memory ranges of the Minions. To avoid this,
the Hypervisor Minion can claim the overlapping section for its internal
administration.

On writing chan_intr, an interrupt will be signalled for the associated
dest_domain:dest_intbit domain using the interrupt bit. If it is a permanent
domain, of say a Minion, it is flagged and delivered directly according to
its settings. For non-permanent domains, say one running on an Application
CPU, it is flagged and delivered directly when one of the CPUs is running the
domain; otherwise it is noted by the Hypervisor Minion as a possible reason
to swap domains on one of the Application CPUs.


Taking interrupts
-----------------

Each CPU gets an interrupt enable, interrupt request and interrupt ack
register, just as per the standard RISC-V spec. The bit to raise is free to
be chosen at initialisation, as it is part of the channel setup.


Device discovery and resource management
----------------------------------------

Since all hypervisor code runs ONLY on the dedicated Hypervisor Minion, its
administration is all in one place, on only one CPU. It can even be
implemented as a single-threaded application and stored in a FlashROM that is
SoC- or even board-specific. All device and resource management is thus kept
in this one administration.


Each device can be mapped to one (or more) channels. For a simple device
interface, a Minion can reserve two areas of its memory, one read-only for
receiving data and one write-only for sending data, and specify an interrupt
pin it wants to see set when there is a notification. It also picks a number
for the channel; this is free to choose since it is matched to its domain_id.
It then sends this information to the Hypervisor Minion over its channel.
Alternatively, the memory ranges can be requested from the Hypervisor Minion.
This removes the need for a Minion domain to always have a DRAM segment and
allows multiple small device drivers (channel providers) to share one segment.

When a domain discovers the device, a connection request is issued to the
Hypervisor Minion with the interrupt pin it wants to receive notifications on.

For each domain a list of accessible devices/channels is maintained. Most of
the devices will be implemented on the Minions, but some higher-level ones
will be implemented by other domains.



Hardware device access
----------------------

No Application CPU, and thus no OS, can directly access hardware; everything
is done through the device interface described above. Even a PIC is not
needed. The channel interface described above allows the OS to group
interrupts on the interrupt pins it wants them clustered on. In combination
with the enable bits, this allows one CPU to have a software PIC. Inter-CPU
interrupt priority within one OS/domain might need some more support.

Only Minions are able to access hardware directly, through whatever means
they are allowed. An abstracted interface with basic low-level drivers is
exported by them through the device scheme above. Higher-level drivers can be
implemented and provided on the Application CPUs by importing from the
low-level driver and exporting a high-level interface for it.


Software implications
---------------------

Each domain running on Application CPUs gets a set of non-contiguous memory
segments assigned to operate in. Even when the Application CPU switches to
machine mode, these segments are still enforced. The only time it can reach
segments it doesn't own is when communicating with devices it is assigned.

Each domain running on a Minion has its own piece of scratch memory to
operate on and is given access to from zero to, say, two segments of the
DRAM for buffers/channels. It can then create a device by communicating with
the Hypervisor Minion's device interface, exporting pieces of its DRAM as
communication channels.

The protocol used for inter-domain communication needs to be defined such
that it has a read-only and a write-only section. The read-only section
typically holds the status of the device in the first kilobytes, followed by
buffers containing all the data travelling from the device to its consumer.
The write-only section typically holds a `shadow' copy of the status in the
first kilobytes, followed by buffers containing all the data travelling from
consumer to device. I see no reason why the two spaces couldn't map the same
range, but providing two separate areas allows for more access control.

Some care needs to be taken in the protocol to separate requests from state.
Even when the read-only and write-only areas overlap, all device
configuration change requests can be done by writing in the matching place of
the write-only area and signalling the device owner. When the change is then
applied and detected in the read-only area, it can be seen as acknowledged;
this is not unlike how the virtio configuration system works.


Implementation sketch
---------------------
TBD


Drawbacks
=========

The most obvious drawback is that it isn't a `standard' way of implementing
devices and the hypervisor's functionality. But then, if we choose a
`standard' way, we inherit the hacks that were needed to circumvent hardware
design issues. In doing so we don't distinguish ourselves, and evolution
stiffens.


Suggestions
===========

There might be a place for an IOMMU that can do scatter/gather instead of the
currently declared read and write areas, but it would introduce a page
walker.

If there are enough Application CPUs, one could simplify and speed up
interrupt delivery by demanding some or each OS/domain to have at least one
CPU tied to it permanently.


Alternatives
============

A traditional hypervisor without the strict memory separation, using
techniques originally designed for standard SoC architectures.


Unresolved questions
====================

The speed of taking interrupts for devices running on the Application CPUs:
mapped-in domains ought to take only a few cycles, but swapped-out domains
take some time on the Hypervisor CPU plus the time for a context switch.


