Work document on lowRISC SoC, PUMP-lite, a PUMP like tag processing engine Version 12 May 2015 Author Reinoud Zandijk Pre-amble ========= Summary ======= The proposed hardware is a derivative of the generic programmable unit for metadata processing (PUMP) tag processing engine, hence the name PUMP-lite for now. A PUMP engine has no fixed ruleset but uses runtime supplied rules held in an associative rule cache. The proposed PUMP-lite differs from the origional PUMP in that it is not demanding tags to be pointer sized nor taking the address of the instruction into consideration. This will impose limits on strict CFI enforcing though still providing basic CFI. The main reason for proposing PUMP-lite is to keep the benefits of having a PUMP mechanism while significantly reducing rule cache hardware cost and tag memory cost. Changelog ========= Motivation ========== The origional PUMP proposal has a quite hefty hardware cost with a two levels of respectively 1024 and 4096 entry caches with a 328b match and 128b out not to mention its 100% memory overhead by requiring the number of tag bits to be the size of a pointer. The modified PUMP proposal drasticly reduces the number of rules needed to cache by canonicalising the bitvector to lookup based on its instruction opcode. This canonicalising consists of masking the bitvector enabling or disabling fields before the lookup. In an effort to tackle these overheads, the proposed PUMP-lite drops the requirement to have pointer sized tags to an arbitrary amount and drops the program counter as bitvector component. This will both reduce the memory overhead as well as the number of bits in the match of the caches. The lookup bitvector will shrink from 8+5*64 = 328b to 8+5*4 = 28b when using 4 bit tags and output 4 (+flag bits) in this case. The canonicalizing is also applicable to RISCV since there are instructions having one, two or three or four registers. Register slots that are not needed for this instruction should be zero'd. Examining the RV32IMAFD together with RV64IMAFD this consists of: #regs #instr #possibilies 0 4 4 1 14 14 * T (rd) 2 69 69 * T*T (rd + rs1 || rs1 + rs2) 3 72 72 * T*T*T (rd + rs1 + rs2) 4 8 8 * T*T*T*T (rd + rs1 + rs2 + rs3) ---- ------------ 167 4 + 14*T + 69*T^2 + 72*T^3 + 8*T^4 For 4 tag bits there are 13.4 millon different combinations possible when just one tag is present for the code. If multiple instruction tags are valid the number of valid combinations are simply multiplied to a maximum of 214 million combinations. With a bigger number of tag bits the amount really explodes into the silly range as expected. The effect of dropping pointer sized tag requirement on security policies results in: NXD+NWC (code injection) ------------------------ No effect since the maximum number of tags is two, DATA or CODE. Memory safety (spacial/temporal memory safety violation on heap) ---------------------------------------------------------------- We can't track pointer colours nor memory payload colours in the full way like PUMP allows us. We could as a minimum use the tags to indicate used, free, freed or zero'd memory status next to the code tag. CFI (control flow hijacking) ---------------------------- We can't have an unique ID for each indirect control flow source and target but we can tag instructions indicating either valid subroutine entry point or valid return point. On the data we can tag return addresses and tag valid code pointers on a heap or stack. This won't allow random jumps into code other than its entry points and won't allow returning from a branch to a place that isn't marked as such. What we can't do is check if the return is indeed to the place it came from. Taint tracking -------------- We can taint data and code coming from some external sources and treat them as less confidential. If there are effectively two code tags, one could be marked safe-for-external-code too so external downloaded code can only call the destined routines and nothing else. Combined -------- All the rules together we can manage with a minimum of 1+4+2+2+1 or 10 values so a 4 bit tag is the minimum if we want to cater for the minimum protection indicated above. Detailed design =============== As in other tagged processors each register and cache is augmented with a tag and this information is passed around. The PUMP-lite system is positioned between the memory and the writeback stage. Each write operation is turned into a read-modify-write operation. The existing tag is read during the memory stage like a read, the rules are checked in the PUMP-lite stage and the write is performed during the commit/writeback stage. PUMP-lite system is conceptionally a function f(instr, Tinstr, Td, Ts1, Ts2, Ts3, Tm) -> (Tres, ...) that returns a destination tag and if needed various flags to indicate if the instruction was allowed or not etc. Note that Ts3 and Tm never come together so they can be fused as one parameter. There can either be the destination register or the memory locations tag. Rules for storing instructions will (for example) allow checking for read-only areas, code or fences being overwritten. Rules for loading instructions will (for example) allow for checking accesses to previous freed-space or make memory tagged as code unreadable. There is need for a CSR register called `csr_pumpcntr` that is used for enabling/disabling the PUMP, for cache clearing etc. There is need for an interrupt cause, called a PUMP_MISS that goes directly into machine mode. There is need for a CSR register called `csr_pump_taglookup_vector' to keep the calculated tag that wasn't found. A series of CSR registers to give instruction, instruction class and to split out the relevant tags directly to remove the need for (costly) splitting out the taglookup_vector in software. Furthermore there is need for at least one CSR register called `csr_pump_insert' for inserting the result entry in the PUMP cache associated with the csr_pump_taglookup_vector. A series of CSR scratch registers would be very usefull to prevent the need of pushing and popping all the registers. The pump miss handler with its rulesets could either be manually written in assembly or maybe even better, contain generated assembly created from a specification. Implementation sketch --------------------- /* generated at chip build time */ const uint8_t flag_mem[NUM_INSTR] = {...}; const uint8_t regslots_mem[NUM_INSTR] = {...}; const uint8_t slots = 5; const uint8_t nbits = 4; const uint8_t tagbits = (1<> (1 * slotbits)) & slotbits; slot_s2 = (regslots >> (2 * slotbits)) & slotbits; slot_s3 = (regslots >> (3 * slotbits)) & slotbits; slot_smem = (regslots >> (4 * slotbits)) & slotbits; use_reg_sd = flags & FLAGS_USE_REG_D; use_reg_s1 = flags & FLAGS_USE_REG_S1; use_reg_s2 = flags & FLAGS_USE_REG_S2; use_reg_s3 = flags & FLAGS_USE_REG_S3; use_reg_smem = flags & FLAGS_USE_REG_MEM; /* * Only use registers in vector that are used. * Can use MUXes for the shift! since its just at max 4 positions. */ taglookup_vector = Tinstr | ((Tsd & use_reg_sd) << (nbits * shift_sd)) | ((Ts1 & use_reg_s1) << (nbits * shift_s1)) | ((Ts2 & use_reg_s2) << (nbits * shift_s2)) | ((Ts3 & use_reg_s3) << (nbits * shift_s3)); } void pump_lite_pump_stage(instruction, taglookup_vector, &Tres) { uint64_t memaddr; uint8_t line, flags; if ((procmode == machine) || !(csr_pumpctrl & PUMP_ENABLED)) { Tres = 0; return; } hit = cache_lookup(taglookup_vector, &line); if (!hit) { /* see description */ csr_pump_taglookup_vector = taglookup_vector; csr_pump_tag_instruction = instruction; csr_pump_tag_Tinstr = taglookup_vector & nbits; csr_pump_tag_s1 = ... ... raise_interrupt(PUMP_MISS); /* no return */ } } Drawbacks ========= The obvious drawback of PUMP-lite over a standard PUMP implementation are the reduced capabilities. The drawback over a non-programmable fixed metadata implemenation, that is implementing a fixed chosen ruleset, is the amount of extra hardware needed. Alternatives ============ Some checks could be implemented by explicit checking tags using special instructions but it won't be able to implement all features. Unresolved questions ==================== The amount of rule cache entries and their performance implications are open for research. Should the rules be defined Hypervisor wide, OS wide or even process wide? How many scratch registers should be provided and how compact and fast can the generated handler become?