Work document on lowRISC SoC structure, v5
Version: 6 June 2016
Author: Reinoud Zandijk

Pre-amble
=========

When the word `domain' is used here, it refers to either an operating system or a privileged critical subsystem. This document assumes a SoC split into three sections: general-purpose beefy Application CPUs, a set of specialized Minion CPUs and one dedicated `hypervisor' Minion. The Application CPUs can only access the DRAM, the specialized Minion CPUs can talk to hardware directly and can optionally also access the DRAM. The `hypervisor' Minion runs the SoC and board firmware only.

Summary
=======

This document proposes an integrated system of memory protection, device communication, interrupt handling and integrated debugging facilities. It facilitates running multiple domains that can range from something as simple as a random-number device, through a high-level privileged device abstraction system, to a full-fledged operating system.

The proposed structure facilitates virtualisation through a hypervisor that is almost entirely implemented in hardware: neither the Minion nor the Application CPUs run any software part of it. As a consequence it also inherently features an excellent debugging interface. The proposed implementation does have some implications for the operating-system-to-device communication infrastructure. In contrast to contemporary systems, there is no notion of running non-virtualised code.

Changelog
=========

Motivation
==========

Running a hypervisor in software on the Application CPUs has the disadvantage that the hypervisor runs in a domain of its own on those CPUs, involving context switches on hypercalls and on taking interrupts, polluting the L1 and L2 caches, and so on. It also involves keeping a shared hypervisor administration on the Application CPUs. The resulting lock contention on accessing that administration can be mitigated somewhat by using a multikernel design, at the cost of taking more IPIs.
The proposed structure tries to integrate the various subsystems, like virtualization, the debug infrastructure and security enforcement, into one unit that is invisible to the running OSes other than by providing a HAL.

Detailed design
===============

The proposed structure is based on the fact that, for each hart, its virtualization support and its (invasive) debug infrastructure share the same logic: controlling execution on the CPU, and reading and writing the CPU's general-purpose registers, its CSRs and various internal registers. To facilitate this access, each hart has (at least conceptually) a debug bus for read/write access to its general-purpose registers (x1-x31, f0-f31, ...) and its CPU control registers (CSRs, run/stop/step/watch etc.). The `guts' of all Application CPU harts are thus accessible on the debug bus, but only to the hypervisor CPU.

Note that switching can be sped up by having the hardware set a bit on every register that has been changed. This allows only those registers that actually changed to be saved. Restoring them always needs a full write, though.

Each Application hart is also given a domain_id register, accessible to and written only by the hypervisor Minion. This register value is used as a key identifying the domain running on the CPU to the various subsystems it connects to, as well as for tagging entries in its L1 cache and for device access checking.

Memory isolation
----------------

For complete isolation and security enforcement, each Application CPU hart's memory accesses, whether through or bypassing the L2 cache, are to be protected by a memory segmentation access checker that checks all accesses to the physical memory segments regardless of the CPU mode. For paged modes the segmentation access checker described in the memory carving proposal will suffice, but for the non-paged machine mode every access will need to be checked! If the domain behaves as it ought to, this will never trap nor incur noticeable overhead.
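The per-access segmentation check described above can be pictured as a lookup in a small per-domain table of (base, length, permissions) entries. The following is a minimal sketch in C; all type and field names are illustrative assumptions, not part of the proposal, and a hardware checker would of course do this combinationally rather than with a loop.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical segment descriptor: one contiguous physical range a
 * domain may access, with read/write permission bits. */
typedef struct {
    uint64_t base;
    uint64_t len;
    bool     read;
    bool     write;
} segment_t;

#define MAX_SEGMENTS 8

/* Per-hart checker state, configured by the hypervisor Minion over
 * the debug bus and keyed by the hart's domain_id. */
typedef struct {
    uint32_t  domain_id;
    segment_t seg[MAX_SEGMENTS];
    int       nseg;
} seg_checker_t;

/* Returns true when the access [addr, addr + size) falls entirely
 * inside a segment the domain owns with the required permission.
 * Applied to every access, regardless of CPU mode. */
bool seg_check(const seg_checker_t *c, uint64_t addr, uint64_t size,
               bool is_write)
{
    for (int i = 0; i < c->nseg; i++) {
        const segment_t *s = &c->seg[i];
        /* overflow-safe containment test */
        if (addr >= s->base && size <= s->len &&
            addr - s->base <= s->len - size)
            return is_write ? s->write : s->read;
    }
    return false; /* outside all owned segments: trap */
}
```

A well-behaved domain only ever touches its own segments, so the deny path is never taken in normal operation.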
The configuration of this segmentation access checker is considered part of the CPU state and is likewise only accessible to the hypervisor Minion on the debug bus. Note that the segment checker can only be overruled by the device sections as defined below.

hypervisor
----------

As stated, all the hypervisor's algorithms run on a separate dedicated Minion CPU. It is important that this Minion has exclusive access to the debug bus. Preferably it has a scratchpad memory of its own and a gateway to the L2 cache. It uses this for generic memory scribbling, reserving one memory segment for its exclusive use to provide tables for the various subsystems if needed.

Booting the SoC
---------------

On booting the SoC, all Application and Minion CPUs stay halted and the FlashROM contents are copied to the hypervisor Minion's memory. The hypervisor Minion is then started, allowing it to initialize the board, the SDRAM controllers and all other subsystems. Regardless of how the Minions are implemented, their program memory is set up by the hypervisor Minion before they are released. The Application CPUs are started by the hypervisor module when needed, by seeding a register set and an environment for them to run in and then releasing them.

Communication
-------------

The integrated debug and hardware hypervisor scheme has issues with using FIFOs as a way of communicating, since FIFOs carry state. This could be overcome by creating a send and a receive space in a CSR range. Messages are sent or acknowledged by setting CSR bits, which atomically streams the piece of memory in a fire-and-forget way.

As an alternative, a virtio-PCI or mmap-virtio scheme could be used, but it is very much tailored to running a virtual environment in a master-slave configuration, due to its buffer design where buffers are declared in the slave's memory, and due to its dependency on the master environment always being able to read or write the slave's memory.
A tailored solution would suit better, effectively providing a safe way to fix the memory access issues in one go and to run a virtio-like transport on top. For this we define channels, each with a unique channel number. Each channel number maps to a septuple defining the channel characteristics, i.e. (domain_id, chan_select) -> (chan_rdbase, chan_rdlen, chan_wrbase, chan_wrlen, dest_domain:dest_intbit).

We also define the following CSR registers:

* chan_select, a read/writable CSR register that selects the channel to use.
* chan_intr, a write-only CSR register that generates the interrupt for the destination domain, marking the specified interrupt bit high.
* chan_rdbase and chan_rdlen, a set of read-only CSR registers that specify the range where the currently selected channel has its read-only section.
* chan_wrbase and chan_wrlen, a set of read-only CSR registers that specify the range where the currently selected channel has its write-only section.

The chan_select CSR register is part of the domain context, just as the domain_id CSR is. Changing the extents while a domain is swapped out would cause a big problem unless the hypervisor Minion auto-fills the selected channel's data on swap-in. The channel numbers are free-form since they are coupled to a specific domain.

Accessing memory in the [chan_rdbase, chan_rdbase + chan_rdlen) or [chan_wrbase, chan_wrbase + chan_wrlen) ranges should *overrule* the segment checker, thus allowing inter-segment communication through the devices and removing the problem with memory sharing, though software ought to take `normal' inter-hart memory ordering into consideration since the destination is not known.

There might be more than one channel available at a time, to reduce the number of misses and to allow fast inter-device copying, since two devices can then be selected simultaneously and their memory copied.
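The channel septuple and the overrule rule above can be sketched as a small C model. The struct fields mirror the names in the text; the `chan_overrule` function is an illustrative assumption of how the windows would override the segment checker: reads are only permitted in the read-only window and writes only in the write-only window.

```c
#include <stdbool.h>
#include <stdint.h>

/* One channel descriptor, following the septuple
 * (domain_id, chan_select) ->
 *   (chan_rdbase, chan_rdlen, chan_wrbase, chan_wrlen,
 *    dest_domain:dest_intbit). */
typedef struct {
    uint32_t domain_id;
    uint32_t chan_select;
    uint64_t rdbase, rdlen;   /* read-only window  */
    uint64_t wrbase, wrlen;   /* write-only window */
    uint32_t dest_domain;
    uint8_t  dest_intbit;     /* bit raised by a chan_intr write */
} channel_t;

/* Channel windows overrule the segment checker: a read is allowed in
 * [rdbase, rdbase + rdlen), a write in [wrbase, wrbase + wrlen). */
bool chan_overrule(const channel_t *ch, uint64_t addr, bool is_write)
{
    if (!is_write && addr >= ch->rdbase && addr < ch->rdbase + ch->rdlen)
        return true;
    if (is_write && addr >= ch->wrbase && addr < ch->wrbase + ch->wrlen)
        return true;
    return false;
}
```

An access that neither the segment checker nor the selected channel's windows permit would still trap, so the channel windows are the only inter-segment paths a domain has.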
Implementing communication
--------------------------

On a write to chan_select, the read-only CSR registers defined above are set by the hypervisor Minion, with the help of, say, a SoC-wide (few-entry) LRU cache for quick lookup if that turns out to be beneficial. On misses in the LRU cache the hypervisor Minion CPU may be interrupted to provide the data.

Note that the interface scheme is *the same* for the `normal' Minions as for the Application CPUs. This allows low-level devices to be implemented on a Minion and high-level devices to be implemented in domains of their own on the Application CPUs, without the destination having to know the difference.

Some care needs to be taken to ensure the read and write ranges do not overlap the scratchpad and MMIO memory ranges of the Minions. To avoid this, the hypervisor Minion can claim the overlapping section for its internal administration.

On writing chan_intr, an interrupt will be signalled for the associated dest_domain:dest_intbit domain using the given interrupt bit. If it is a permanent domain, say a Minion's, the interrupt is flagged and delivered directly according to its settings. For non-permanent domains, say those running on an Application CPU, the interrupt is flagged and delivered directly when one of the CPUs is running the domain; otherwise it is noted by the hypervisor Minion as a possible reason to swap domains on one of the Application CPUs.

Taking interrupts
-----------------

Each CPU gets an interrupt enable, an interrupt req and an interrupt ack register, as per the standard RISC-V specification. The bit to raise on an interrupt is free to set up at initialisation, as it is part of the channel setup.

Device discovery and resource management
----------------------------------------

Since all hypervisor code runs *only* on the dedicated hypervisor Minion, its administration is all in one place, used by only one CPU. It can even be implemented as a single-threaded application and is stored in a FlashROM that is SoC- or even board-specific.
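The few-entry LRU cache suggested above for filling the chan_rd*/chan_wr* CSRs on a chan_select write can be sketched as follows. This is a software model under stated assumptions: the key is the pair (domain_id, chan_select), a miss is what would interrupt the hypervisor Minion, and all names are illustrative.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define LRU_ENTRIES 4  /* "few entry", SoC-wide */

typedef struct {
    uint32_t domain_id, chan_select;        /* lookup key          */
    uint64_t rdbase, rdlen, wrbase, wrlen;  /* cached CSR contents */
    unsigned age;                           /* last-use timestamp  */
    bool     valid;
} lru_entry_t;

typedef struct {
    lru_entry_t e[LRU_ENTRIES];
    unsigned    tick;
} chan_lru_t;

/* On a hit the CSRs can be filled without bothering the hypervisor
 * Minion; returns NULL on a miss. */
lru_entry_t *lru_lookup(chan_lru_t *l, uint32_t dom, uint32_t sel)
{
    for (int i = 0; i < LRU_ENTRIES; i++)
        if (l->e[i].valid && l->e[i].domain_id == dom &&
            l->e[i].chan_select == sel) {
            l->e[i].age = ++l->tick;  /* mark most recently used */
            return &l->e[i];
        }
    return NULL;
}

/* After a miss the hypervisor Minion provides the channel data;
 * evict the least recently used (or any invalid) entry. */
lru_entry_t *lru_insert(chan_lru_t *l, lru_entry_t fresh)
{
    int victim = 0;
    for (int i = 1; i < LRU_ENTRIES; i++) {
        if (!l->e[i].valid) { victim = i; break; }
        if (l->e[i].age < l->e[victim].age) victim = i;
    }
    fresh.valid = true;
    fresh.age = ++l->tick;
    l->e[victim] = fresh;
    return &l->e[victim];
}
```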
All device and resource management is thus kept in this one administration. Each logical device can be mapped to one (or more) communication channels.

For a simple device interface, a Minion can reserve two areas of its memory, one read-only for receiving data and one write-only for sending data, and specify an interrupt pin it wants to see set when there is a notification. It also picks a number for the channel; this is free to choose since it is matched to its domain_id. It then sends this information to the hypervisor Minion over its channel.

Alternatively, the memory ranges can be requested from the hypervisor Minion. This removes the need for a Minion domain to always have a DRAM segment and allows multiple small device drivers (channel providers) to share one segment.

For each domain a list of accessible devices/channels is maintained. Most of the devices will be implemented on the Minions, but some higher-level ones will be implemented by other domains. When a domain discovers a device, a connection request can be issued to the hypervisor Minion with the interrupt pin on which it wants to receive notifications.

Hardware device access
----------------------

No Application CPU, and thus no OS, can directly access hardware; everything is done through the device interface described above.

There is no hardware PIC. The channel interface described above allows the OS to group interrupts on the interrupt pins it wants them clustered on. In combination with enable bits this gives each CPU easy software-PIC functionality. Inter-CPU interrupt priority within one OS/domain might need some more support.

Only Minions are able to access hardware directly, through whatever means they are allowed. They export an abstracted interface with basic low-level drivers through the device scheme above.
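The registration step above, where a device-providing domain tells the hypervisor Minion about its two areas and its interrupt pin, could be carried by a message such as the one below. The document does not fix a wire format, so every field name here is an assumption; the disjointness helper reflects the earlier note that read and write ranges must not overlap other claims.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical registration message a device-providing domain sends
 * to the hypervisor Minion over its channel. */
typedef struct {
    uint32_t chan_number;      /* freely chosen, matched to domain_id  */
    uint64_t rd_base, rd_len;  /* read-only area: device -> consumer   */
    uint64_t wr_base, wr_len;  /* write-only area: consumer -> device  */
    uint8_t  int_pin;          /* interrupt pin to raise on notify     */
} dev_register_msg_t;

/* The hypervisor Minion would validate that the declared ranges do
 * not collide with, e.g., Minion scratchpad or MMIO ranges. */
bool ranges_disjoint(uint64_t a, uint64_t alen, uint64_t b, uint64_t blen)
{
    return a + alen <= b || b + blen <= a;
}
```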
Higher-level drivers can be implemented and provided on the Application CPUs, as they can communicate with the low-level drivers running on the Minions and export a high-level interface for them.

Software implications
---------------------

Each domain running on Application CPUs gets a set of non-contiguous memory segments assigned to operate in. Even when the Application CPU switches to machine mode, these segments are still enforced. The only time it can reach segments it doesn't own is when communicating with the devices it has been assigned.

Each domain running on a Minion has its piece of scratch memory to operate on and is given access to zero up to, say, two segments of the DRAM for buffers/channels. It can then create a device by communicating with the hypervisor Minion, exporting pieces of its DRAM as communication channels.

The protocol used for inter-domain communication needs to be defined such that it has a read-only and a write-only section. The read-only section typically holds the status of the device in its first kilobytes, followed by buffers containing all the data travelling from the device to its consumer. The write-only section typically holds a `shadow' copy of the status in its first kilobytes, followed by buffers containing all the data travelling from the consumer to the device. I see no reason why the two spaces couldn't map the same range, but providing two separate areas allows for more access control.

Some care needs to be taken in the protocol to separate requests from state. Even when the read-only and write-only areas overlap, all device configuration change requests can be made by writing to the matching place in the write-only area and signalling the device owner. When the change is then applied and detected in the read-only area, it can be seen as acknowledged; this is not unlike how the virtio configuration system works.
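The shadow-status handshake described above can be made concrete with a minimal sketch. The consumer writes the desired value into its write-only shadow copy and signals the owner; the owner applies it to the read-only status; the consumer treats equality as acknowledgement. The single `baud` field and all names are illustrative assumptions, not a proposed layout.

```c
#include <stdbool.h>
#include <stdint.h>

/* Illustrative device status: one configuration field. */
typedef struct { uint32_t baud; } dev_status_t;

typedef struct {
    dev_status_t status;  /* read-only to the consumer  */
    dev_status_t shadow;  /* write-only to the consumer */
} dev_pages_t;

/* Consumer side: request a configuration change, then signal the
 * device owner (e.g. via chan_intr). */
void request_change(dev_pages_t *p, uint32_t baud)
{
    p->shadow.baud = baud;
}

/* Device-owner side: apply whatever the consumer requested. */
void apply_changes(dev_pages_t *p)
{
    p->status = p->shadow;
}

/* Consumer side: the change counts as acknowledged once it is
 * visible in the read-only status, much like virtio configuration. */
bool change_acked(const dev_pages_t *p)
{
    return p->status.baud == p->shadow.baud;
}
```

Because the request lives only in the write-only area and the acknowledgement only in the read-only area, the scheme works unchanged even when both areas map the same underlying range.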
Implementation sketch
---------------------

TBD

Drawbacks
=========

The most obvious drawback is that this isn't a `standard' way of implementing devices and the hypervisor's functionality. But then, if we choose a `standard' way, we inherit the hacks that were needed to circumvent earlier hardware design choices. In doing so we don't distinguish ourselves, and evolution stiffens.

Suggestions
===========

There might be a place for an IOMMU that can do scatter/gather instead of the currently declared read and write areas, but it would introduce a page walker.

If there are enough Application CPUs, one could simplify and speed up interrupt delivery by demanding that some or all OSes/domains have at least one CPU tied to them permanently. In a typical situation this could already hold for 4 CPUs, as it allows tied CPUs for a main OS instantiation and an OS development instantiation while still leaving two Application CPUs free for general-purpose switching.

Alternatives
============

A traditional hypervisor without the strict memory separation, using techniques originally designed for standard SoC architectures.

Unresolved questions
====================

The speed of interrupt taking on devices running on the Application CPUs: taking an interrupt on a switched-in domain ought to take only a few cycles, but for swapped-out domains it takes some time on the hypervisor CPU plus the time for a context switch. This might be negligible considering its relative rareness.