Work document on lowRISC SoC, memory carving up / isolation V2
Version 21 Dec 2016
Author  Reinoud Zandijk


Pre-amble
=========

When the word `domain' is used here, it refers to either a hypervisor, an
operating system or a privileged critical subsystem.


Summary
=======

To allow multiple domains to run in parallel, the available physical memory
needs to be assigned to the various domains. The proposed hardware assisted
partitioning scheme, called the TLB-fill blocker, is situated inside the
page table lookup state machine in the CPU and carves up the physical
memory space into equal-sized blocks in the order of 1 MiByte or 16 MiByte,
depending on design choices. A CPU is granted access to physical memory, or
not, by a lookup in (conceptually) a bitmap, independent of any virtual to
physical page mapping.

The proposed hardware extension maintains the invariant that if an entry
resides in the TLB, its mapping and page tables have already been checked
and approved for access. Revocation of access to a block, a rare event in
this scheme, is done by updating the permission bitmap and flushing the
TLB. Only on a TLB miss is the check needed, and all intermediate PTE
addresses as well as the resulting physical address are checked. If access
is denied, the result is a trap/fault; otherwise it is as if the mechanism
is not there.

To ease memory sharing, this protection is further enhanced by the addition
of a secondary page lookup table that is only consulted if the normal PTE
lookup is successful but the end result is rejected by the segment check.
Note that this is NOT equal to a second layer of page tables.


Change log
==========

Added support for segment check override by secondary (hypervisor) page
tables.


Motivation
==========

Supporting multiple domains, each with their own idea of virtual memory,
comes at a cost. There are various ways to implement support for this. The
basic strategies are:

* Delegation.
  The Xen way, and conceptually also the L4Re way, is for a domain running
  on them to ask the hypervisor to map space around. Multiple requests can
  be packed into one, but a potentially costly context switch between the
  hypervisor and the operating system is still needed on each request.

* Use of a second layer of page tables.
  This method is generally used for emulating operating systems that are
  not hypervisor aware, using special processor hardware support. Memory
  is mapped from guest virtual to guest physical and then from guest
  physical to host physical. The cost of taking a page fault can increase
  dramatically, resulting in not 4 or 5 but as many as 24 fetches on
  contemporary x86 systems; the double mapping can thus cost very dearly
  in runtime performance.

* Use of page table modification monitoring.
  A domain's modified page tables are checked before they are activated.
  This can also be done using shadow page tables that fault into the
  hypervisor, with the hypervisor patching its master version up with the
  domain's version. Depending on how it is implemented, the cost of
  switching page tables can still be quite high due to the checking and
  the extra faults taken, but at least it doesn't require additional
  hardware.

The proposed hardware assisted partitioning scheme is an effort to reduce
the overhead of having multiple domains to nearly zero runtime cost using
minimal hardware support, while still providing full memory isolation.
When implemented, there might be an additional fetch on a page fault, but
these fetches should be very rare. Performance of a domain should be
practically identical to running native. When accessing shared memory from
another domain, a secondary lookup is needed, which might also induce some
table walking. TLB hits thus require 0 extra fetches, a TLB miss within a
domain's own memory needs 3 or 4 fetches, and accessing shared memory
might add up to 6 or 7 fetches.
The results are entered in the TLB even for flat translations, so
subsequent accesses are free for as long as the entry holds.


Detailed design
===============

The proposed hardware assisted partitioning scheme, the TLB-fill blocker,
carves up the physical memory space into fixed-size blocks. All the
partitioned physical memory forms a set with the blocks as members. Each
domain is assigned one set descriptor in the form of, conceptually, a
bitmap. The set descriptors are maintained by a hypervisor and are
normally disjoint from each other, though not necessarily so. A domain can
have multiple blocks assigned to it, but not necessarily contiguous(!).
The optimal block size is debatable and depends on the projected amount of
physical memory, be it 1 MiByte blocks or 16 MiByte blocks.

To control access to the fixed-size blocks, a bitmap is associated with
the domain in which each bit holds the access permission of the
corresponding block. The bits are checked by splitting up the physical
memory address to be checked as follows (16 MiByte blocks taken in this
example):

  6         X      3 2    2 2                      0
  3         X      0 9    4 3                      0
 +---------+--------+------+------------------------+
 |         | index  | bit  |   block sub address    |
 +---------+--------+------+------------------------+

The `index' field of the physical address is an index into an array of
64 bit values that together form the bitmap. The bit to check is encoded
in the `bit' field. No layering of data structures within the bitmap is
needed; the bitmap table can be as large as needed. If desired, the
address shift could be made configurable, allowing the memory block size
to be changed depending on how much physical memory is present and on
preferences.

For small memory systems, or for systems with a fixed number of blocks,
the entire bitmap can be kept in a series of 64 bit CSR registers. The
advantage of this is that the entire bitmap is in register space, easing
the implementation.
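As an illustration, the field extraction described above can be sketched
in plain C. The shift of 24 corresponds to 16 MiByte blocks; the names
below are illustrative only and not part of the proposal:

    #include <assert.h>
    #include <stdint.h>

    #define BLOCK_SHIFT 24                /* 16 MiByte blocks: 2^24 bytes */

    /* Split a physical address into bitmap word index and bit number. */
    static void split_address(uint64_t paddr, uint64_t *index, unsigned *bit)
    {
        uint64_t block = paddr >> BLOCK_SHIFT; /* drop the block sub address */
        *bit   = block & 63;                   /* bit within the 64 bit word */
        *index = block >> 6;                   /* word number in the bitmap  */
    }

    int main(void)
    {
        uint64_t index;
        unsigned bit;

        /* address 0x41234567 lies in block 0x41 = 65: word 1, bit 1 */
        split_address(0x41234567ULL, &index, &bit);
        assert(index == 1 && bit == 1);

        /* one 64 bit word thus covers 64 * 16 MiByte = 1 GiByte */
        split_address(0x3FFFFFFFULL, &index, &bit); /* last byte of 1st GiB */
        assert(index == 0 && bit == 63);
        return 0;
    }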
On larger systems, the number of registers would get too big when carving
up the large physical memory space into small enough partitions, and the
CSR register method would no longer be practical. To fix this, the bitmap
has to be held in normal memory, pointed to by a single CSR register. The
bitmap words then have to be cached by a small, dedicated, direct-mapped
bitmap-cache with, say, a few dozen entries, depending on how large the
physical memory is expected to be for the SoC.

The proposed hardware extension lives inside the TLB miss PTE walker and
scrutinizes (at a minimum) the resulting entry to be put into the TLB, but
preferably also all the intermediate addresses. The checks can be done in
parallel with the memory fetches, so there won't be noticeable delays. In
the worst case, a TLB miss can induce up to 3 fetches for the PTE; in the
pathological situation where all the indirect entries fall in different,
uncached bitmap-cache entries, this can induce a further 3 fetches for the
TLB-fill blocker. This situation will not happen in practice, since page
tables tend to be close enough together, and with 16 MiByte blocks one
64 bit cache entry caters for a full GiByte! The expected worst case is
thus normally at most 1 extra fetch, if the resulting PTE address is not
in the bitmap-cache. For 16 MiByte blocks, one 8 KiByte page for the
bitmap holds 64 Ki bits and can thus cater for up to 1 TiByte of physical
memory.

Software implications
---------------------

To allow running non TLB-fill blocker aware operating systems, the blocker
should allow all accesses until it is configured properly. Depending on
the actual situation it is implemented for, configuring the TLB-fill
blocker can be restricted to machine mode, or even to updates from outside
the CPU; the proposed registers thus don't have to be accessible from
within the CPU to function. The rest of the text assumes they are in the
CSR space.
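The coverage figures above follow directly from the chosen block size;
a small arithmetic check (illustrative only):

    #include <assert.h>
    #include <stdint.h>

    int main(void)
    {
        const uint64_t block_size = 16ULL << 20;      /* 16 MiByte blocks   */
        const uint64_t word_bits  = 64;               /* one bitmap word    */
        const uint64_t page_bits  = 8ULL * 1024 * 8;  /* 8 KiByte page, in bits */

        /* one 64 bit bitmap word covers 64 blocks = 1 GiByte */
        assert(word_bits * block_size == 1ULL << 30);

        /* one 8 KiByte bitmap page covers 64 Ki blocks = 1 TiByte */
        assert(page_bits * block_size == 1ULL << 40);
        return 0;
    }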
The csr_sptbr, csr_hptbr, csr_shift_cfg and csr_fillblock_bitmap_base
registers (see below) can be initialised in whatever order, but care might
be taken to disable interrupts during the sequence to make the change
atomic.

Implementation sketch
---------------------

There is need for one machine mode CSR register, csr_fillblock_bitmap_base,
to point to the physical base address of the memory block containing the
bitmap. Optionally there can be a second machine mode CSR register,
csr_shift_cfg, to configure the block size by specifying the number of
bits by which an address to be checked is shifted right before the `index'
and `bit' fields are extracted:

  6         X      3 2    2 2                      0
  3         X      0 9    4 3                      0
 +---------+--------+------+------------------------+
 |         | index  | bit  |   block sub address    |
 +---------+--------+------+------------------------+

        ===> shift right by fixed or configured amount

  6                                  X      0 0    0
  3                                  X      6 5    0
 +----------------------------------+--------+------+
 |                                  | index  | bit  |
 +----------------------------------+--------+------+

There is also need for a new machine exception, PTE_INVALID, which can be
considered fatal to the domain if desired, since it means its page tables
are corrupt. On writing either of the two CSR registers, the bitmap cache
ought to be invalidated.

The secondary page table lookup doesn't need the segment check, since its
entries are almost by definition outside the domain's normally allocated
segments, being administrated outside the domain.

Sequential implementation in C pseudo code:

    bool check_allowed(uint64_t address)
    {
        uint64_t tmp, bits, index;
        uint8_t bit;
        bool hit;

        /* not configured? then the hardware is `not present' */
        if ((csr_shift_cfg == 0) || (csr_fillblock_bitmap_base == 0))
            return true;

        tmp   = address >> csr_shift_cfg;
        bit   = tmp & 63;
        index = tmp >> 6;

        hit = bitmap_cache_lookup(index, &bits);
        if (!hit) {
            bits = memory_fetch(csr_fillblock_bitmap_base + (index << 3));
            bitmap_cache_insert(index, bits);
        }
        return (bits & (1ULL << bit)) ? true : false;
    }

    uint64_t translate(uint64_t address, mode, bool flat)
    {
        offset = address % PGSIZE;
        pte = tlb_lookup(address);
        if (pte) {
            if (access_ok(pte, mode))
                return ADDR(pte);
        }

        /* TLB miss */
        if (flat) {
            pte = ADDR2PTE(address, mode);
        } else {
            /* TLB miss, first lookup in sptbr */
            ptdir = csr_sptbr;
            for (level = 0; level < 3; level++) {
                /* page walking with intermediates check */
                entry = ptdir + index(address, level);
                if (!check_allowed(entry))
                    raise_exception(PTE_INVALID);   /* no return */
                ptdir = memory_fetch(entry);
            }
            pte = ptdir;
            if (!access_ok(pte, mode))
                raise_exception(PAGEFAULT);
        }

        /* got the destination address in pte form */
        /* destination check */
        if (!check_allowed(ADDR(pte))) {
            /* extension: allow secondary table to overrule */
            ptdir = csr_hptbr;
            for (level = 0; level < 3; level++) {
                /* page walking with intermediates check */
                entry = ptdir + index(address, level);
                ptdir = memory_fetch(entry);
            }
            pte = ptdir;
            if (!access_ok(pte, mode))
                raise_exception(PTE_INVALID);
        }

        /* add this possibly translated(!) pte to the TLB */
        insert_tlb(pte);
        return ADDR(pte);
    }

Note that check_allowed() can be executed in parallel with the
memory_fetch() of the PTE walker. The last check_allowed() can be done in
parallel with insert_tlb(). In both cases they have to be synchronised, so
that execution waits until both actions are done.

Drawbacks
=========

The extension does imply a small, preferably fully associative, cache.
Though it will only be used sporadically and will be mostly dormant, it of
course takes up silicon and some power.

The proposed hardware segments the physical memory but doesn't glue the
segments logically together. For an OS running on a hypervisor that uses
the extension, the physical RAM it is given will be presented as a set of
RAM banks, each one or a multiple of the given block size. The OS needs
built-in support for multiple RAM banks, but luckily most already have it.
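For experimentation, the check_allowed() pseudo code from the
implementation sketch can be fleshed out into plain, runnable C. The
direct-mapped cache, the in-memory bitmap array and all sizes below are
illustrative stand-ins for the proposed hardware, not a definitive
implementation:

    #include <assert.h>
    #include <stdbool.h>
    #include <stdint.h>
    #include <string.h>

    #define CACHE_ENTRIES 16               /* toy direct-mapped bitmap-cache */

    static uint64_t csr_shift_cfg;         /* block size as a shift amount   */
    static uint64_t bitmap[64];            /* stands in for the bitmap in RAM */

    static struct {
        bool     valid;
        uint64_t index;
        uint64_t bits;
    } cache[CACHE_ENTRIES];

    /* software model of the TLB-fill blocker check */
    static bool check_allowed(uint64_t address)
    {
        uint64_t tmp, bits, index;
        unsigned bit, slot;

        if (csr_shift_cfg == 0)            /* not configured: allow all */
            return true;

        tmp   = address >> csr_shift_cfg;
        bit   = tmp & 63;
        index = tmp >> 6;

        slot = index % CACHE_ENTRIES;
        if (cache[slot].valid && cache[slot].index == index) {
            bits = cache[slot].bits;       /* bitmap-cache hit         */
        } else {
            bits = bitmap[index];          /* `fetch' from main memory */
            cache[slot].valid = true;
            cache[slot].index = index;
            cache[slot].bits  = bits;
        }
        return (bits & (1ULL << bit)) != 0;
    }

    int main(void)
    {
        memset(cache, 0, sizeof cache);
        memset(bitmap, 0, sizeof bitmap);

        csr_shift_cfg = 24;                /* 16 MiByte blocks   */
        bitmap[0] = 1ULL << 2;             /* grant only block 2 */

        assert(check_allowed(0x2000000ULL));   /* block 2: allowed */
        assert(!check_allowed(0x3000000ULL));  /* block 3: blocked */

        csr_shift_cfg = 0;                 /* unconfigured: everything allowed */
        assert(check_allowed(0x3000000ULL));
        return 0;
    }

A model along these lines could also drive the simulation suggested under
`Unresolved questions' to estimate bitmap-cache hit rates.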
Alternatives
============

Multiple bits per segment could be kept in the bitmap, say two bits for
R+W, so pieces can appear read-only; this feature then has to check, and
maybe even modify, the resulting PTEs before allowing them to be inserted
into the TLB. It would double the space needed, and the bitmap cache would
thus have to be made twice as big in this example to keep the same hit
rate. Sharing a block read-only with another domain in this way can be
revoked by clearing the permission and flushing the relevant TLBs.

A more generic external device could be constructed that checks all
addresses going out of the CPU. This would separate the checking from the
TLB, but would require significantly more bitmap cache checks, and the
checks would trigger when the L1/L2 cache starts reading/writing rather
than on the actual lookup itself. Not a preferred situation.

Unresolved questions
====================

The optimal size and organisation of the bitmap cache are not known. A
simulation of a big memory hugging benchmark might give some more answers
on the exact overhead and the optimal cache size.

The naming of the extension, as well as of its registers, might be
improved upon.
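The two-bit R+W variant could extract its permission field along these
lines; the encoding (bit 0 = read, bit 1 = write) and all names are
assumptions for illustration only:

    #include <assert.h>
    #include <stdint.h>

    #define BLOCK_SHIFT 24       /* 16 MiByte blocks, as in earlier examples */
    #define PERM_R 1u            /* assumed encoding: bit 0 = read  */
    #define PERM_W 2u            /* assumed encoding: bit 1 = write */

    /* With 2 bits per block, one 64 bit word only covers 32 blocks. */
    static unsigned block_perms(const uint64_t *bitmap, uint64_t paddr)
    {
        uint64_t block = paddr >> BLOCK_SHIFT;
        uint64_t index = block >> 5;           /* 32 entries per word */
        unsigned shift = (unsigned)(block & 31) * 2;

        return (bitmap[index] >> shift) & 3;
    }

    int main(void)
    {
        uint64_t bitmap[2] = { 0, 0 };

        bitmap[0] |= (uint64_t)(PERM_R | PERM_W) << (2 * 2); /* block 2: R+W */
        bitmap[0] |= (uint64_t)PERM_R << (3 * 2);            /* block 3: R only */

        assert(block_perms(bitmap, 0x2000000ULL) == (PERM_R | PERM_W));
        assert(block_perms(bitmap, 0x3000000ULL) == PERM_R);
        assert(block_perms(bitmap, 0x4000000ULL) == 0);      /* block 4: none */
        return 0;
    }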