[lowrisc-dev] LowRISC SoC structure proposal v5

Reinoud Zandijk reinoud at NetBSD.org
Sun Jul 10 11:59:26 BST 2016


Hi folks,

here is my v5 proposal for review. Please do read and comment on it :)

With regards,
Reinoud

Work document on lowRISC SoC structure, v5
Version 6 June 2016
Author Reinoud Zandijk


Preamble
========

When the word `domain' is used here, it refers to either an operating
system or a privileged critical subsystem.

This document assumes the SoC is split into three sections: general-purpose
beefy Application CPUs, a set of specialized Minion CPUs and one dedicated
`hypervisor' Minion.

The Application CPUs can only access the DRAM; the specialized Minion CPUs can
talk to hardware directly and can optionally also access the DRAM. The
`hypervisor' CPU runs the SoC and board firmware only.


Summary
=======

This document proposes an integrated system of memory protection, device
communication, interrupt handling and integrated debugging facilities. It
facilitates running multiple domains, ranging from something as simple as a
random-number device, through a high-level privileged device abstraction
system, to a full-fledged operating system.

The proposed structure facilitates virtualisation through a hypervisor that is
almost entirely implemented in hardware: neither the Minion CPUs nor the
Application CPUs run any software part of it. As a consequence, it also
inherently features an excellent debugging interface. The proposed
implementation does have some implications for the operating-system-to-device
communication infrastructure. In contrast to contemporary systems, there is no
notion of running non-virtualized code.


Changelog
=========


Motivation
==========

Running a hypervisor in software on the Application CPUs has the disadvantage
that the hypervisor runs in a domain of its own on those CPUs, involving
context switches on hypercalls and on taking interrupts, polluting the L1
and L2 caches, and so on. It also involves keeping a shared hypervisor
administration on the Application CPUs. The resulting lock contention on
accessing that administration can be mitigated somewhat by using a multikernel
design, at the cost of taking more IPIs.

The proposed structure tries to integrate the various subsystems, such as
virtualization, the debug infrastructure and security enforcement, into one
unit that is invisible to the running OSes apart from providing a HAL.


Detailed design
===============

The proposed structure is based on the observation that, for each hart, the
virtualization support and the (invasive) debug infrastructure share the same
logic: controlling execution on the CPU and reading and writing the CPU's
general-purpose registers, its CSRs and various internal registers.

To facilitate this access, each hart has (at least conceptually) a debug bus
for read/write access to its general-purpose registers (x1-x31, f0-f31, ...)
and its CPU control registers (CSRs, run/stop/step/watch etc.). All
Application CPU harts' `guts' are thus accessible on the debug bus, but to the
hypervisor CPU only. Note that switching can be sped up by having the hardware
set a bit for every register that has been changed. This allows only those
registers that actually changed to be saved. Restoring them afterwards always
needs a full write, though.
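
As a minimal sketch of this dirty-bit optimisation, the hypervisor-side save
and restore could look as below. The debug-bus accessors and the dirty-mask
register are hypothetical names, not part of any spec:

    #include <stdint.h>

    #define NUM_XREGS 32

    extern uint64_t dbg_read_reg(int hart, int regno);   /* assumed accessor */
    extern uint64_t dbg_read_dirty_mask(int hart);       /* one bit per xreg */
    extern void     dbg_write_reg(int hart, int regno, uint64_t val);

    /* Save only the registers the hardware flagged as changed. */
    static void save_dirty_xregs(int hart, uint64_t ctx[NUM_XREGS])
    {
        uint64_t dirty = dbg_read_dirty_mask(hart);
        for (int r = 1; r < NUM_XREGS; r++)              /* x0 is hardwired */
            if (dirty & (1ULL << r))
                ctx[r] = dbg_read_reg(hart, r);
    }

    /* Restoring has no such shortcut: every register is written back. */
    static void restore_xregs(int hart, const uint64_t ctx[NUM_XREGS])
    {
        for (int r = 1; r < NUM_XREGS; r++)
            dbg_write_reg(hart, r, ctx[r]);
    }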

Each Application hart is also given a domain_id register, accessible to and
written only by the hypervisor Minion. This register value is used as a key to
identify the domain running on the CPU to the various subsystems it connects
to, as well as for tagging entries in its L1 cache and for device access
checking.


Memory isolation
----------------

For complete isolation and security enforcement, each Application CPU hart's
memory accesses, whether through or bypassing the L2 cache, are to be checked
by a memory segmentation access checker that validates all accesses to the
physical memory segments regardless of the CPU mode.

For paged modes the segmentation access checker described in the memory
carving proposal will suffice, but for the non-paged machine mode each access
will need to be checked! If the domain behaves as it ought to, this check will
never trap nor have noticeable overhead. The configuration of this
segmentation access checker is considered part of the state of the CPU and is
likewise accessible only to the hypervisor Minion on the debug bus.

Note that the segment checker can only be overruled by the device sections as
defined below.
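
As an illustration only, a minimal sketch of what the segmentation access
checker could evaluate per access; the table layout and field names are
assumptions, not part of this proposal:

    #include <stdint.h>
    #include <stdbool.h>

    /* Hypothetical per-domain segment table entry. */
    struct segment {
        uint64_t base;   /* physical base address */
        uint64_t len;    /* segment length in bytes */
        bool     write;  /* writes allowed */
    };

    /* Allow the access iff it falls entirely within a segment owned by the
     * domain, with matching permission.  Runs on every access, even in
     * machine mode. */
    static bool seg_check(const struct segment *tbl, int n,
                          uint64_t addr, uint64_t len, bool is_write)
    {
        for (int i = 0; i < n; i++) {
            const struct segment *s = &tbl[i];
            if (addr >= s->base && addr + len <= s->base + s->len)
                return is_write ? s->write : true;
        }
        return false;  /* no owning segment: trap to the hypervisor Minion */
    }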


hypervisor
----------

As stated, all the hypervisor's algorithms run on a separate, dedicated Minion
CPU. It is important that this Minion has exclusive access to the debug bus.
Preferably it has a scratchpad memory of its own and a gateway to the L2
cache. It uses this for generic memory scribbling, reserving one memory
segment for its exclusive use to provide tables for the various subsystems as
needed.


Booting the SoC
---------------

On booting the SoC, all Application and Minion CPUs stay halted and the
FlashROM contents are copied to the hypervisor Minion's memory. The hypervisor
Minion is then started up, allowing it to initialize the board, the SDRAM
controllers and all other subsystems.

Regardless of how the Minions are implemented, their program memory is set up
by the hypervisor Minion before they are released.

All Application CPUs are started by the hypervisor Minion when needed, by
seeding a register set and an environment for them to run in and then
releasing them.
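
A minimal sketch of this boot flow, as run by the hypervisor Minion once its
FlashROM image is in place; every function name and the Minion count are
hypothetical placeholders:

    #define NUM_MINIONS 4                  /* assumed count, for illustration */

    extern void init_board(void);
    extern void init_sdram_controllers(void);
    extern void init_subsystems(void);
    extern void load_minion_program(int m);
    extern void release_minion(int m);

    void hypervisor_boot(void)
    {
        init_board();
        init_sdram_controllers();
        init_subsystems();

        /* Minions are released only after their program memory is set up. */
        for (int m = 0; m < NUM_MINIONS; m++) {
            load_minion_program(m);
            release_minion(m);
        }

        /* Application CPUs stay halted; they are started on demand by
         * seeding a register set over the debug bus and releasing them. */
    }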


Communication
-------------

The integrated debug and hardware hypervisor scheme has issues with using
FIFOs as a way of communicating, because FIFOs carry state. This could be
overcome by creating a send and a receive space in a CSR range. Messages are
sent or acknowledged by setting CSR bits that atomically stream the piece of
memory in a fire-and-forget way.

As an alternative, a virtio-PCI or mmap'ed virtio scheme could be used, but it
is very much tailored to running a virtual environment in a master-slave
configuration, due to its buffer design where buffers are declared in the
slave's memory and its dependency on the master environment always being able
to read or write the slave's memory.

A tailored solution would suit better, effectively providing a safe way to fix
the memory access issues in one go and running a virtio-like transport on top.
For this we define channels, each with a unique channel number. Each channel
number maps to a septuple defining the channel characteristics, i.e.
(domain_id, chan_select) -> (chan_rdbase, chan_rdlen, chan_wrbase, chan_wrlen,
dest_domain:dest_intbit).
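
For concreteness, the septuple could be represented as below; the struct and
the field widths are illustrative assumptions only:

    #include <stdint.h>

    /* One entry of the channel table kept by the hypervisor Minion,
     * keyed by (domain_id, chan_select). */
    struct channel_desc {
        uint32_t domain_id;     /* key: owning domain */
        uint32_t chan_select;   /* key: channel number within that domain */
        uint64_t chan_rdbase;   /* read-only window: physical base */
        uint64_t chan_rdlen;    /*                   and length */
        uint64_t chan_wrbase;   /* write-only window: physical base */
        uint64_t chan_wrlen;    /*                    and length */
        uint32_t dest_domain;   /* domain to interrupt on chan_intr */
        uint32_t dest_intbit;   /* which interrupt bit to raise there */
    };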

We also define the following CSR registers (see the usage sketch after this
list):
* chan_select, a read/writable CSR register that selects the channel to
  use.
* chan_intr, a write-only CSR register that generates the interrupt for the
  destination domain, marking the specified interrupt bit high.
* chan_rdbase and chan_rdlen, a set of read-only CSR registers that
  specify the range where the currently selected channel has its read-only
  section.
* chan_wrbase and chan_wrlen, a set of read-only CSR registers that
  specify the range where the currently selected channel has its write-only
  section.
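
A minimal usage sketch from a domain's point of view, assuming hypothetical
csr_read()/csr_write() helpers and CSR numbers; none of these names are part
of the proposal:

    #include <stdint.h>
    #include <string.h>

    extern uint64_t csr_read(int csr);              /* hypothetical helpers */
    extern void     csr_write(int csr, uint64_t v);
    enum { CSR_CHAN_SELECT, CSR_CHAN_INTR, CSR_CHAN_WRBASE, CSR_CHAN_WRLEN };

    /* Send a message to the peer of channel `chan': copy it into the
     * write-only window and raise the peer's interrupt bit. */
    static int chan_send(uint32_t chan, const void *msg, uint64_t len)
    {
        csr_write(CSR_CHAN_SELECT, chan);   /* may miss; see implementation */
        if (len > csr_read(CSR_CHAN_WRLEN))
            return -1;
        memcpy((void *)(uintptr_t)csr_read(CSR_CHAN_WRBASE), msg, len);
        csr_write(CSR_CHAN_INTR, 1);        /* fire-and-forget notification */
        return 0;
    }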

The chan_select CSR register is part of the domain context, just as the
domain_id CSR is. Changing the extents while a domain is swapped out would
cause a big problem, unless the hypervisor Minion auto-fills the selected
chan_select's data on swap-in.

The channel numbers are free-form since they'll be coupled to a specific
domain.

Accessing memory in the [chan_rdbase, chan_rdbase + chan_rdlen> or the
[chan_wrbase, chan_wrbase + chan_wrlen> range should *overrule* the segment
checker, thus allowing inter-segment communication through the devices and
removing the problem with memory sharing. Software ought to take `normal'
inter-hart memory ordering into consideration, though, since the destination
is not known.
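
A sketch of how the channel windows could overrule the segment checker
sketched earlier (here with a simplified signature); all names remain
assumptions:

    #include <stdint.h>
    #include <stdbool.h>

    /* Simplified form of the earlier segment-check sketch. */
    extern bool seg_check(uint64_t addr, uint64_t len, bool is_write);

    /* Windows of the currently selected channel, latched on chan_select. */
    struct chan_windows {
        uint64_t rdbase, rdlen;   /* read-only window  */
        uint64_t wrbase, wrlen;   /* write-only window */
    };

    static bool in_range(uint64_t addr, uint64_t len,
                         uint64_t base, uint64_t rlen)
    {
        return addr >= base && addr + len <= base + rlen;
    }

    /* The channel windows take precedence over the segment checker. */
    static bool access_ok(const struct chan_windows *w,
                          uint64_t addr, uint64_t len, bool is_write)
    {
        if (!is_write && in_range(addr, len, w->rdbase, w->rdlen))
            return true;
        if (is_write && in_range(addr, len, w->wrbase, w->wrlen))
            return true;
        return seg_check(addr, len, is_write);
    }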

There might be more than one channel selectable at a time, to reduce the
number of misses and to allow fast inter-device copying, since two devices can
then be selected simultaneously and their memory copied.


Implementing communication
--------------------------

On a chan_select write, the read-only CSR registers defined above are set by
the hypervisor Minion, possibly helped by, say, a SoC-wide (few-entry) LRU
cache as a quick lookup if that turns out to be beneficial. On an LRU miss the
hypervisor Minion CPU may be interrupted to provide the data to the arrays.
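
A sketch of this fill path, with the LRU cache and the hypervisor interrupt as
described; the structure and all names are assumptions:

    #include <stdint.h>
    #include <stddef.h>

    struct channel_desc;    /* the septuple, as sketched earlier */

    extern const struct channel_desc *lru_lookup(uint32_t domain_id,
                                                 uint32_t chan);
    extern const struct channel_desc *hv_minion_fetch(uint32_t domain_id,
                                                      uint32_t chan);
    extern void latch_chan_csrs(const struct channel_desc *d);

    /* Invoked when a hart writes chan_select. */
    void on_chan_select_write(uint32_t domain_id, uint32_t chan)
    {
        const struct channel_desc *d = lru_lookup(domain_id, chan);
        if (d == NULL)
            d = hv_minion_fetch(domain_id, chan);  /* interrupts the Minion */
        latch_chan_csrs(d);   /* sets the read/write window CSRs for this hart */
    }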

Note that the interface scheme is *the same* for the `normal' Minions as for
the Application CPUs. This allows low-level devices to be implemented on a
Minion and high-level devices to be implemented in domains of their own on the
Application CPUs, without the destination having to know the difference.

Some care needs to be taken to ensure the read and write ranges do not
overlap the scratchpad and mmio memory ranges of the Minions. To avoid this,
the hypervisor Minion can claim the overlapping section for its internal
administration.

On writing chan_intr, an interrupt will be signalled to the associated
dest_domain:dest_intbit domain using the given interrupt bit. If it is a
permanent domain, of say a Minion, it is flagged and delivered directly
according to its settings. For non-permanent domains, say running on an
Application CPU, it is flagged and delivered directly when one of the CPUs is
running the domain; otherwise it is noted by the hypervisor Minion as a
possible reason to swap domains on one of the Application CPUs.
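
A sketch of this delivery decision; the predicates and names are assumptions:

    #include <stdint.h>

    /* Hart currently running the domain, or -1 if it is swapped out.
     * Permanent domains (e.g. Minions) always report a hart. */
    extern int  hart_running(uint32_t domain);
    extern void raise_intbit(int hart, uint32_t bit);
    extern void hv_note_pending(uint32_t domain, uint32_t bit);

    /* Invoked when chan_intr is written on a channel whose destination
     * is (domain, bit). */
    void deliver_interrupt(uint32_t domain, uint32_t bit)
    {
        int hart = hart_running(domain);
        if (hart >= 0)
            raise_intbit(hart, bit);        /* delivered directly */
        else
            hv_note_pending(domain, bit);   /* reason to swap the domain in */
    }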


Taking interrupts
-----------------

Each CPU gets an interrupt enable, an interrupt req and an interrupt ack
register, just as per the standard RISC-V spec. The bit to light up on an
interrupt is free to be set up at initialisation, as it is part of the channel
setup.
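
A minimal domain-side handler sketch under these assumptions; the register
accessors are hypothetical:

    #include <stdint.h>

    extern uint64_t read_intr_req(void);     /* hypothetical accessors */
    extern uint64_t read_intr_enable(void);
    extern void     write_intr_ack(uint64_t bits);
    extern void     handle_channel(int bit); /* per-channel service routine */

    /* Service all enabled pending interrupt bits, then acknowledge them. */
    void interrupt_entry(void)
    {
        uint64_t pending = read_intr_req() & read_intr_enable();
        for (int bit = 0; bit < 64; bit++)
            if (pending & (1ULL << bit))
                handle_channel(bit);         /* bit chosen at channel setup */
        write_intr_ack(pending);
    }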


Device discovery and resource management
----------------------------------------

Since all hypervisor code runs *only* on the dedicated hypervisor Minion, its
administration is all in one place, on only one CPU. It can even be
implemented as a single-threaded application and is stored in a FlashROM that
is SoC- or even board-specific. All device and resource management is thus
kept in this one administration.

Each logical device can be mapped to one (or more) communication channels. For
a simple device interface, a Minion can reserve two areas of its memory, one
read-only for receiving data and one write-only for sending data, and specify
an interrupt pin it wants to see set when there is a notification. It also
picks a number for the channel; this is free to choose since it is matched to
its domain_id. It then sends this information to the hypervisor Minion over
its channel, as sketched below. Alternatively, the memory ranges can be
requested from the hypervisor Minion. This removes the need for a Minion
domain to always have a DRAM segment and allows multiple small device drivers
(channel providers) to share one segment.
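
For illustration, the registration message a Minion could send to the
hypervisor Minion; the layout and names are assumptions:

    #include <stdint.h>

    /* Channel registration request, sent to the hypervisor Minion over
     * the Minion's own (pre-existing) channel. */
    struct chan_register_req {
        uint32_t chan;     /* channel number, free to choose */
        uint64_t rdbase;   /* read-only area: base */
        uint64_t rdlen;    /*                 length */
        uint64_t wrbase;   /* write-only area: base */
        uint64_t wrlen;    /*                  length */
        uint32_t intbit;   /* interrupt bit to raise on notification */
    };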

For each domain, a list of accessible devices/channels is maintained. Most of
the devices will be implemented on the Minions, but some higher-level ones
will be implemented by other domains.

When a domain discovers a device, a connection request can be issued to the
hypervisor Minion with the interrupt pin it wants to receive notifications on.


Hardware device access
----------------------

No Application CPU, and thus no OS, can directly access hardware; everything
is done through the device interface described above. There is no hardware
PIC. The channel interface described above allows the OS to group interrupts
on the interrupt pins it wants them clustered on. In combination with the
enable bits, this gives a CPU easy software-PIC functionality. Inter-CPU
interrupt priority within one OS/domain might need some more support.

Only Minions are able to access hardware directly, through whatever way they
are allowed to. An abstracted interface with basic (low?) level drivers is
exported by them through the device scheme above. Higher-level drivers can be
implemented and provided on the Application CPUs, as they can communicate with
the low-level drivers running on the Minions and export a high-level interface
for them.


Software implications
---------------------

Each domain running on the Application CPUs gets a set of non-contiguous
memory segments assigned to operate in. Even when an Application CPU switches
to machine mode, these segments are still enforced. The only time it can reach
segments that it doesn't own is when communicating with the devices it is
assigned.

Each domain running on a Minion has its piece of scratch memory to operate on
and is given access to between zero and, say, two segments of the DRAM for
buffers/channels. It can then create a device by communicating over the
hypervisor Minion's device, exporting pieces of its DRAM as communication
channels.

The protocol used for inter-domain communication needs to be defined so as to
have a read-only and a write-only section. The read-only section typically
holds the status of the device in the first kilobytes, followed by buffers
containing all the data travelling from the device to its consumer. The
write-only section typically holds a `shadow' copy of the status in the first
kilobytes, followed by buffers containing all the data travelling from the
consumer to the device. I see no reason why the two spaces couldn't map the
same range, but providing two separate areas allows for more access control.
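
A sketch of such a channel layout; the sizes and field names are illustrative
assumptions:

    #include <stdint.h>

    #define STATUS_AREA_SIZE 4096   /* "first kilobytes", assumed 4 KiB here */

    /* Device status block; its contents are device-specific. */
    struct dev_status {
        uint8_t raw[STATUS_AREA_SIZE];
    };

    /* Read-only section, as seen by the consumer. */
    struct chan_rd_layout {
        struct dev_status status;   /* current device status */
        uint8_t buffers[];          /* data: device -> consumer */
    };

    /* Write-only section, as seen by the consumer. */
    struct chan_wr_layout {
        struct dev_status shadow;   /* requested (shadow) status */
        uint8_t buffers[];          /* data: consumer -> device */
    };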

Some care needs to be taken in the protocol to separate requests from state.
Even when the read-only and write-only areas overlap, all device configuration
change requests can then be done by writing in the matching place in the
write-only area and signalling the device owner. When the change is applied
and detected in the read-only area, it can be considered acknowledged; this is
not unlike how the virtio configuration system works.
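
A sketch of that request/acknowledge cycle from the consumer side; all names
are assumptions, and the wait loop stands in for the channel interrupt:

    #include <stdint.h>
    #include <string.h>

    struct dev_status { uint8_t raw[4096]; };   /* as sketched above */

    extern struct dev_status *ro_status;  /* status in the read-only area  */
    extern struct dev_status *wr_shadow;  /* shadow in the write-only area */
    extern void chan_notify(void);        /* write chan_intr               */
    extern void chan_wait(void);          /* wait for our interrupt bit    */

    /* Request a configuration change and wait until the device owner has
     * applied it, i.e. until it shows up in the read-only status copy. */
    void config_change(const struct dev_status *wanted)
    {
        memcpy(wr_shadow, wanted, sizeof *wr_shadow);
        chan_notify();
        do
            chan_wait();
        while (memcmp(ro_status, wanted, sizeof *wanted) != 0);
    }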


Implementation sketch
---------------------
TBD


Drawbacks
=========

The most obvious drawback is that it isn't a `standard' way of implementing
devices and the hypervisor's functionality. But then, if we choose a
`standard' way, we inherit the hacks that were needed to circumvent hardware
design choices. In doing so we don't distinguish ourselves and evolution
stagnates.


Suggestions
===========

There might be a place for an ioMMU that can do gather/scatter instead of the
two declared read and write areas, but it would introduce a page walker.

If there are enough Application CPUs, one could simplify and speed up
interrupt delivery by demanding that some or each OS/domain have at least one
CPU tied to it permanently. In a typical situation this could already hold for
4 CPUs, as that allows tied CPUs for a main OS instantiation and an OS
development instantiation while still leaving two Application CPUs free for
general-purpose switching.


Alternatives
============

A traditional hypervisor, without the strict memory separation, that uses
techniques originally designed for standard SoC architectures.


Unresolved questions
====================

The speed of interrupt delivery to devices running on the Application CPUs:
taking an interrupt in a switched-in domain ought to take only a few cycles,
but for swapped-out domains it takes some time on the hypervisor CPU plus the
time for a context switch. It might be negligible considering its relative
rarity.

