Work document on lowRISC SoC Minions

Version 30 December 2016
Author  Reinoud Zandijk


Pre-amble
=========

A minion is defined in this context as a relatively small RISC-V CPU that
runs soft-hardware and/or encryption, effectively acting as a HAL. Multiple
minion CPUs are expected to be used in parallel and are to have low latency
to the hardware parts.


Summary
=======

This proposal is based on the work done on PULP, a parallel ultra-low power
SoC. It adopts PULP's notion that memory access is hierarchical: non-local
memory is not directly accessible but can only be transferred by DMA to
local scratch memory, giving the CPU plenty of time to do other useful
things while waiting for the data to arrive.

It differs from PULP in that it looks more like PULPino: it is envisioned to
have one scratch memory space and one DMA channel per minion CPU, with the
minions completely decoupled from each other.


Changelog
=========


Motivation
==========

Using a normal RISC-V core on the main bus as a minion, say a Rocket in
parallel with other CPUs such as other Rockets or beefier BOOMs, has the
disadvantage that timing becomes unpredictable, because its memory access
time and stalling are not predictable. Connecting a minion this way, as a
regular CPU, can be done but severely limits its usability.


Detailed design
===============

Structure
---------

To provide predictable timing and isolation, each minion CPU has to be given
a private memory bus. On this private memory bus is a fixed piece of
scratchpad SRAM for both code and state storage. Also on the private memory
bus are the hardware interfaces dedicated to this minion, be it through SPI,
GPIO or otherwise.

To communicate with the outside world, each minion has a DMA controller for
moving data between its private memory bus and main memory on the main
memory bus. It can use this to provide devices or services to other
partners.

A minion's register set, its CSR registers and its run-state registers
(run/stop etc.) ought to be mapped on the main bus on reset.
The GP and CSR registers can later be decoupled, if desired, leaving only
the run-state registers mapped.

For protection, a minion unit ought to have its DMA engine checked against a
segment/secondary page table, as per the memory segmentation proposal, on
setup. This will require some extra logic such as a TLB, though. Note that
this TLB is almost never supposed to be cleared, but clearing it ought to be
possible on reconnection of a device or other major setup.

               +-----------------------------------+
               | +--------+      ||      +-------+ |
               | |  SRAM  | <=>  ||  <=> | SHIM  | <==> IO hardware
               | +--------+      ||      +-------+ |
               |                 ||                |
               | +--------+      ||      +-------+ |
    => regs <==> |  CPU   | <=>  ||  <=> |  DMA  | |
               | +--------+              +-------+ |
               |                         |  TLB  | |
    => mem <===========================> |       | |
               |                         +-------+ |
               +-----------------------------------+

Booting
-------

On reset, the minion CPU is stalled until it is released externally by
writing its CPU regs. An extra set of CSR registers could give external
access to the internal SRAM, both for debugging and for setting up the code
section in its private SRAM.

CPU features
------------

The minion CPUs can be as simple as needed for the task. An RV32I with only
machine mode support could suffice; its DMA engine would then be adapted to
read/write the entire 64-bit memory space. As for the other (standard)
extensions, they can be added if deemed handy. I'd advise using at least
RV32G.

Note that a minion CPU doesn't need to know about tags nor about
segmentation. The relevant support for those features can be in the DMA/TLB
part.

One choice of CPU could be a FlexPRET core, for more real-time performance
guarantees.


Drawbacks
=========

The setup will need to be programmed for explicitly, but then its hardware
is custom and/or tailored to a specific task as well. Its limited SRAM could
limit the amount of work/preprocessing it is capable of, so it should be
used wisely, as it resembles more an embedded system.
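
To give an impression of what this explicit programming could look like, here
is a minimal sketch in C of a minion streaming a main-memory buffer through a
small scratchpad in two halves, working on one half while the DMA refills the
other. Everything named here is an assumption for illustration only: the
proposal does not define a DMA programming interface, so `dma_start_read`,
`dma_wait`, `HALF_WORDS` and `minion_sum` are hypothetical, and the DMA is
simulated with a synchronous `memcpy` so the sketch can run on a host.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define HALF_WORDS 64u  /* assumed size of one scratchpad half, in 32-bit words */

/* Stand-in for the private scratchpad SRAM, split in two halves. */
static uint32_t scratch[2][HALF_WORDS];

/* Hypothetical DMA start. Simulated here as a synchronous copy; a real engine
 * would be started through control registers and run next to the CPU. */
static void dma_start_read(uint32_t *dst, const uint32_t *src, size_t words)
{
    memcpy(dst, src, words * sizeof *dst);
}

/* Hypothetical DMA completion wait. Nothing to do in this simulation; real
 * hardware would poll a "done" flag here. */
static void dma_wait(void)
{
}

/* Sum a main-memory buffer that may be larger than the scratchpad. */
uint32_t minion_sum(const uint32_t *main_mem, size_t words)
{
    assert(main_mem != NULL || words == 0);

    uint32_t sum = 0;
    size_t done = 0;                 /* words fully processed so far */
    unsigned cur = 0;                /* half the CPU works on next */
    size_t cur_len = words < HALF_WORDS ? words : HALF_WORDS;

    if (cur_len != 0)
        dma_start_read(scratch[cur], main_mem, cur_len);  /* first chunk */

    while (cur_len != 0) {
        dma_wait();                  /* the current half is valid from here */

        size_t next_off = done + cur_len;
        size_t left = words - next_off;
        size_t next_len = left < HALF_WORDS ? left : HALF_WORDS;
        if (next_len != 0)           /* refill the other half in the meantime */
            dma_start_read(scratch[cur ^ 1u], main_mem + next_off, next_len);

        for (size_t i = 0; i < cur_len; i++)  /* touch local SRAM only */
            sum += scratch[cur][i];

        done = next_off;
        cur ^= 1u;
        cur_len = next_len;
    }
    return sum;
}
```

The CPU never reads main memory directly in this sketch; it only sees what the
DMA placed in the scratchpad, which is the restriction the limited SRAM
imposes on the amount of data that can be worked on at once.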


Suggestions and alternatives
============================

I'd suggest using at least an RV32G as CPU, say a 32-bit Rocket or smaller,
maybe even without L1 caches, as the chip surface area could be better used
for the fast SRAM.

As an alternative, the entire SRAM area could be used to form one big L1
cache, with the DMA engine automatically fetching on misses. I'd vote
against this, though, since DMA on the private memory bus from/to a piece of
hardware would be troublesome...


Unresolved questions
====================

Would it suffice to have a status bit showing external interrupts, since the
minion CPUs are most likely running a tight service loop where all I/O
processing is done? Interrupting tight timing code could be fatal to that
piece of code, and hard deadlines would be missed. A statically scheduled
multitasking state machine loop with deadline enforcement could very well
suffice.
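
As a rough illustration of the polling idea, the following C sketch runs a
fixed table of slots in order, checks a pending bit only inside a dedicated
slot, and counts slots that overran their cycle budget. All names are
hypothetical (`STATUS_EXT_PENDING`, `slot_t`, `run_round`, the `*_step`
workloads and their budgets), and the cycle counter and status register are
simulated with plain variables so it can run on a host; on a minion they would
be a cycle CSR and a memory-mapped status register.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define STATUS_EXT_PENDING 0x1u  /* hypothetical bit: external request latched */

/* Simulated hardware state (assumption, see lead-in). */
uint32_t sim_cycles;             /* free-running cycle counter */
uint32_t sim_status;             /* status register with the pending bit */
unsigned serviced;               /* number of external requests handled */

/* Simulate the passing of time. */
static void burn(uint32_t cycles) { sim_cycles += cycles; }

/* Example slot bodies with made-up costs. */
static void spi_step(void)  { burn(40); }
static void gpio_step(void) { burn(120); }  /* exceeds its budget below */
static void ext_step(void)
{
    /* The external request is polled in its own slot, never taken as an
     * interrupt in the middle of timing-critical code. */
    if (sim_status & STATUS_EXT_PENDING) {
        sim_status &= ~STATUS_EXT_PENDING;
        serviced++;
        burn(30);
    }
}

typedef struct {
    void (*step)(void);          /* slot body, expected to return in time */
    uint32_t budget;             /* cycle budget of the slot */
} slot_t;

static const slot_t schedule[] = {
    { spi_step, 50 }, { gpio_step, 100 }, { ext_step, 50 },
};

/* Run one round of the static schedule; return the number of overruns. */
unsigned run_round(const slot_t *slots, size_t n)
{
    assert(slots != NULL && n > 0);

    unsigned overruns = 0;
    for (size_t i = 0; i < n; i++) {
        uint32_t start = sim_cycles;
        slots[i].step();
        uint32_t used = sim_cycles - start;
        if (used > slots[i].budget)
            overruns++;                      /* deadline missed */
        else
            burn(slots[i].budget - used);    /* idle until the slot ends */
    }
    return overruns;
}
```

Note that this only detects a missed deadline after the fact. Actually cutting
off an overrunning slot needs some form of preemption, which this sketch does
not model.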