[lowrisc-dev] Porting tagged memory support to current version of RISC-V Rocket Chip

Zhe Cheng Lee zhechenglee9 at gmail.com
Tue Oct 13 21:55:20 BST 2015


Thanks for the response. I don't have access to my development environment
right now, so just to be sure: when you said to use pk on the FPGA, you
mean something like this (after booting the Zynq board and mounting the SD
card)?

./fesvr-zynq pk -c /sdcard/path/to/bzip2_base.riscv ${input} > /path/to/output

Speckle generated a run script for a portable directory containing the SPEC
benchmark binaries and input data compiled with RISC-V Linux GCC. So for
FPGA runs (not in booted Linux), instead of copying the directory into the
Linux root image, we would just copy it to a separate location in the SD
card and execute the run script. We would also need to modify the run
script so that instead of calling spike, it calls fesvr-zynq. Is this
correct?

-Zhe Cheng
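
A minimal sketch of the run-script change described above. It assumes the Speckle-generated script starts each benchmark with a line whose first word is "spike"; the actual script format, and the input file name in the usage note, are assumptions rather than details taken from Speckle.

```python
import re

def use_fesvr(line, fesvr="./fesvr-zynq"):
    """Swap a leading 'spike' command word for the FPGA front-end server.

    Only the first word of the line is changed; pk, its flags, the
    benchmark arguments and any shell redirection stay as they are.
    """
    return re.sub(r"^(\s*)spike(?=\s)", lambda m: m.group(1) + fesvr, line)

def convert_script(text, fesvr="./fesvr-zynq"):
    """Apply use_fesvr to every line of a run script."""
    return "\n".join(use_fesvr(line, fesvr) for line in text.split("\n"))
```

For example, use_fesvr("spike pk -c bzip2_base.riscv input.source 280 > out.log") leaves everything after the first word unchanged and only swaps the launcher.
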

On Tue, Oct 13, 2015 at 4:54 AM, Wei Song <ws327 at cam.ac.uk> wrote:

> Hello Zhe Cheng,
> I am afraid it is a bit complicated.
> Most SPEC benchmark cases (at least for the subset of the integer suite) are
> single-threaded programs.
> Single-threaded programs can run in bare-metal mode (using pk) instead of
> inside Linux, which is better for collecting miss rates and running time.
> So I ran them using pk on the FPGA rather than booting Linux.
> You will need to modify the scripts from Speckle to compile for FPGA runs
> (if I remember it right, use the linux gcc rather than newlib gcc).
> pk has an option to report running time (actually it is the cycle count)
> before exit.
> For miss rate, I added a number of hardware performance counters in
> L1/L2 to collect cache requests and misses.
> Some of the code can be found in
> https://github.com/lowrisc/lowrisc-chip.git branch "perform".
> However, it is the version for the old code base and without L2.
> I have not had time to update it to the latest code base yet.
> I have also modified fesvr to read/report the performance counters before
> exit.
> If you need to run benchmarks in Linux, you do not need to install SPEC (I
> believe) but it is still difficult for the following reasons:
> 1. When running benchmarks, the Linux kernel has booted, L1/L2 caches are
> occupied (not empty).
> 2. Kernel may affect the running time.
> 3. There is no easy way of reporting the value of performance counters as
> they are implemented as CSRs. Reading CSRs in user mode program needs
> special syscalls added to the Linux kernel.
> 4. The timing (cycle count) function is also on the supervisor side.
> But if the only thing that interests you is the running time, running in Linux
> should be OK.
> Be careful with the time you get from Linux on the FPGA, though; it is not
> real time.
> If I remember correctly, there is a static configuration in the kernel that
> defines the clock frequency, regardless of the real FPGA clock frequency.
> Good luck!
> Wei
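
A minimal sketch of how the raw numbers mentioned above could be turned into the usual metrics. It assumes the cycle count, instruction count and miss count have already been read out; the function names and the frequencies in the usage note are hypothetical.

```python
def mpki(misses, instructions):
    """Misses per thousand (kilo) instructions."""
    return 1000.0 * misses / instructions

def wall_time_seconds(cycles, clock_hz):
    """Convert a cycle count to seconds using the real FPGA clock."""
    return cycles / clock_hz

def time_error_factor(assumed_hz, real_hz):
    """How far off a reported time is when software assumes the wrong clock.

    A time derived from cycles/assumed_hz must be multiplied by this
    factor to obtain the real elapsed time.
    """
    return assumed_hz / real_hz
```

With hypothetical numbers, a kernel configured for 1 GHz running on a 50 MHz FPGA clock would under-report elapsed time by a factor of time_error_factor(1e9, 50e6), which is why the cycle count is the safer quantity to compare.
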
> On 13/10/2015 04:31, Zhe Cheng Lee wrote:
> Hello Wei,
> We want to measure the execution time in the FPGA using SPEC benchmarks.
> We have compiled the SPEC benchmark binaries with Speckle. After moving the
> SPEC binaries (the .riscv files e.g. bzip2_base.riscv, correct?) into the
> Linux root image and booting Linux at the rocket-chip FPGA, how exactly do
> we run the binaries in the booted Linux? Do we need to install SPEC in
> the FPGA? If so, how?
> Also, in the paper, there are measurements such as MPKI. Is this a
> measurement given by running SPEC alone, or is it a measurement you
> modelled?
> Thanks,
> -Zhe Cheng
> On Sat, Oct 10, 2015 at 12:11 PM, Wei Song <ws327 at cam.ac.uk> wrote:
>> Hello Monjur,
>> I have run SPEC 2006 integer cases on a Zedboard using the script from
>> Speckle, although not all cases.
>> You can have a look at the results in
>> <http://wsong83.github.io/publication/comparch/riscv2015.pdf>
>> These are the results collected from FPGA runs.
>> Best regards,
>> Wei
>> On 09/10/15 22:19, Monjur Alam wrote:
>> Hi Wei,
>> I got your point. The answer to your question is no, it does not fill the
>> cache with fake tags after reset. And you are right, misses always happen at
>> the beginning just after reset. Thanks for pointing this out.
>> One more question, please: have you ever run SPEC CPU2006 on top of
>> rocket-chip on an FPGA? I have created a Stack Overflow question
>> <http://stackoverflow.com/questions/33004581/running-spec06-with-riscv-architecture>.
>> Speckle provides a wrapper to run it on spike, but spike has no
>> connection with rocket-chip. I think running CPU2006 on top of
>> rocket-chip on an FPGA will demonstrate the performance overhead of the real
>> architecture.
>> Your opinion please.
>> Regards,
>> Monjur
>> On Tue, Oct 6, 2015 at 4:36 AM, Wei Song <ws327 at cam.ac.uk> wrote:
>>> Hello Monjur,
>>> The reasoning for tag cache is to reduce the traffic to DRAM.
>>> In lowRISC, tags and data are stored separately in different DRAM
>>> partitions.
>>> So a miss in L1 will cause at least two DRAM reads (one for data and one
>>> for tag).
>>> The total DRAM traffic is increased by 100%.
>>> A tag cache is supposed to reduce the amount of tag traffic but does not
>>> help on data traffic.
>>> If the tag cache always hits in your case, I also wonder why
>>> there is this 22% overhead.
>>> However, a big tag cache does not guarantee a hit.
>>> Is your tag cache a kind of dummy, by which I mean one that provides fake
>>> tags without needing to fill empty cache lines, even after reset?
>>> Otherwise, the tag cache is empty at the beginning and there will be
>>> compulsory misses after reset.
>>> Best regards,
>>> Wei
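
A minimal sketch of the traffic argument above, as a back-of-the-envelope model. It assumes every L1 miss needs one data read from DRAM, plus one tag read whenever the tag cache misses; write-backs and the fact that one tag line covers many data lines are ignored, so this is a deliberate simplification.

```python
def dram_reads_per_l1_miss(tag_cache_hit_rate):
    """One data read always, plus a tag read when the tag cache misses."""
    if not 0.0 <= tag_cache_hit_rate <= 1.0:
        raise ValueError("hit rate must be between 0 and 1")
    return 1.0 + (1.0 - tag_cache_hit_rate)

def extra_dram_traffic(tag_cache_hit_rate):
    """Fractional increase in DRAM reads relative to an untagged system."""
    return dram_reads_per_l1_miss(tag_cache_hit_rate) - 1.0
```

With no tag cache (hit rate 0) the model gives the 100% increase mentioned above, and with a tag cache that always hits it gives no extra reads, while the data read itself is never removed.
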
>>> On 05/10/2015 23:03, Monjur Alam wrote:
>>> Hi Wei,
>>> Thank you very much for your help throughout and for your valuable
>>> suggestions.
>>> So far, we have implemented tag support for RISC-V in L1 (we will add L2
>>> later on). The architecture is more or less the same as lowRISC:
>>> 0. Unlike lowRISC, we perform the basic operations (load, store) for data
>>> and tag in parallel.
>>> 1. Extended the data cache by 1 bit per double word
>>> 2. Added a tag cache that resides between L1 and DRAM
>>> 3. Designed a tagger module as a bridge between the tag cache and DDR3
>>> However, we have seen performance degrade by around 22%, measured
>>> with the existing benchmarks. We are planning to map the design onto a
>>> zc706 FPGA and run the SPEC benchmarks on our architecture.
>>> 1. As the tag cache (32 MB) ensures tag hits, why is there such a
>>> performance degradation (22%)?
>>> 2. Does the tag cache conceptually help on a data miss (as opposed to a
>>> tag miss)? A data miss still fetches from DRAM, so the operation completes
>>> only when the data arrives, even if the tag comes from the faster tag cache.
>>> 3. Do we really need a tag cache? We could fetch tags from DRAM just like data.
>>> Your suggestion please.
>>> Regards,
>>> Monjur
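
A minimal sketch of the arithmetic behind the "32 MB tag cache" with 1 tag bit per double word. It assumes "32 MB" means 32 MiB and that the whole capacity stores tag bits (any address tags or metadata inside the tag cache are ignored); the function name is hypothetical.

```python
def data_bytes_covered(tag_cache_bytes, tag_bits_per_word=1, word_bytes=8):
    """Amount of data memory whose tags fit in a tag store of the given size."""
    n_tags = tag_cache_bytes * 8 // tag_bits_per_word
    return n_tags * word_bytes

def tag_storage_overhead(tag_bits_per_word=1, word_bytes=8):
    """Tag bits as a fraction of data bits."""
    return tag_bits_per_word / (word_bytes * 8)
```

Under these assumptions a 32 MiB tag store covers 2 GiB of data, and the tag bits add 1/64 (about 1.6%) to the storage, which is consistent with expecting tag hits once the cache has been filled after reset.
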
>>> On Tue, Sep 22, 2015 at 4:27 AM, Wei Song <ws327 at cam.ac.uk> wrote:
>>>> Hello Zhe Cheng,
>>>> Actually extending tags in L2 is very simple.
>>>> L2 is agnostic to the content of cache lines. All you need to do is
>>>> extend the size of the data array.
>>>> TileLink is the communication fabric used internally in Rocket.
>>>> Both the broadcasting hub and L2 use the same TileLink/MemIO converter,
>>>> so you do not need to write a new converter.
>>>> At start, HTIF writes the program to L2. When L2 needs to write back, the
>>>> cache line is written to memory using the TileLink/MemIO converter.
>>>> It seems you already have the broadcast one working.
>>>> Best regards,
>>>> Wei
>>>> On 22/09/2015 00:54, Zhe Cheng Lee wrote:
>>>> Hello Wei,
>>>> Thank you for your response. I was previously using a broadcast
>>>> coherence hub instead of an L2, but now I have moved to using an L2 after
>>>> verifying that tag bits can be stored to and loaded from the L1 caches fine
>>>> in my modifications to the rocket chip. In this case, will the data be
>>>> written from HTIF to L2 through a different converter? Is there a
>>>> TileLink-to-L2 data converter?
>>>> Best regards.
>>>> -Zhe Cheng
>>>> On Sat, Sep 19, 2015 at 9:15 AM, Wei Song <ws327 at cam.ac.uk> wrote:
>>>>> Hello Zhe Cheng,
>>>>> I just noticed another issue which may or may not cause the error.
>>>>> Since you do not want to use the tag cache, I assume you are using the
>>>>> original MemIOUncachedTileLinkIOConverter to convert TileLink messages to
>>>>> MemIO messages.
>>>>> I also assume you are using the broadcast coherence hub instead of
>>>>> an L2.
>>>>> In this case, the data written from HTIF are always written to memory
>>>>> through this MemIO/TileLink converter.
>>>>> You need to remove tags for messages from TileLink to MemIO and add
>>>>> tags for messages from MemIO to TileLink.
>>>>> The tag cache does the conversion, so I did not change the code of this
>>>>> MemIO/TileLink converter.
>>>>> But some revision is needed in your case, something like what has been
>>>>> done for the HTIF and icache.
>>>>> The assembly seems to be from the dump file, and it looks correct to me.
>>>>> The difference between the trace file and the dump file would reveal more
>>>>> insight.
>>>>> If you think the value loaded to gp is wrong, having a look at the
>>>>> test case and figuring out exactly what goes wrong would help you debug.
>>>>> I think it is the test case test_3 in riscv-tests/isa/rv64ui/ld.S.
>>>>> Best regards,
>>>>> Wei
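
A minimal sketch of the trace comparison suggested above. It assumes the program counters have already been extracted from the two simulation traces into lists; parsing the trace files is left out because their exact format is not given here.

```python
def first_divergence(ref_pcs, test_pcs):
    """Return the index of the first differing PC, or None if the traces agree.

    If one trace is a strict prefix of the other, the index just past the
    shorter trace is returned.
    """
    for i, (ref, test) in enumerate(zip(ref_pcs, test_pcs)):
        if ref != test:
            return i
    if len(ref_pcs) != len(test_pcs):
        return min(len(ref_pcs), len(test_pcs))
    return None
```

The instruction just before the returned index is usually the one to inspect in the dump file, since it is the last point at which both runs still agreed.
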
>>>>> On 18/09/15 23:59, Zhe Cheng Lee wrote:
>>>>> Hi Wei,
>>>>> Thank you very much for your response. It is indeed complicated to get
>>>>> this to really work. I found your response helpful, though. I didn't
>>>>> consider HTIF before when modifying the current rocket chip. I can see why
>>>>> HTIF is important then.
>>>>> By control path, do you mean the control signals associated with the
>>>>> new instructions and the logic to handling them? If so, then yes, I have
>>>>> changed it.
>>>>> I added the tag utilities (I changed the data types in these tag
>>>>> functions from Bits to UInt) and modified the corresponding lines in
>>>>> htif.scala according to the changes in this commit
>>>>> <https://github.com/lowRISC/uncore/commit/cebfde6d42b7465cab79518fad91e323a1a5af41#diff-228d7a2c10baa84f6595aeec2d50174b>
>>>>> to support tagged memory, but the simulations still have not passed.
>>>>> As a side note, I added the changes in icache.scala to remove the tags
>>>>> at the line to be presented to the instruction cache as well, but when I
>>>>> compared, say, rv64ui-p-ld test .out simulated from the latest rocket-chip
>>>>> with the .out file from my changes to it, I noticed that the two PCs differ
>>>>> after several instructions when the program actually starts. When I revert
>>>>> back the changes in icache.scala (as in, removeTag doesn't get called), the
>>>>> two PCs start deviating later on instead of within the first few after the
>>>>> program starts. Do the L1 instruction caches not interact with HTIF?
>>>>> Without removing the tags in the instruction cache, the PCs begin to
>>>>> deviate after the branch instruction in:
>>>>>  27c:   0080b183            ld  gp,8(ra)
>>>>>  280:   ff010eb7            lui t4,0xff010
>>>>>  284:   f01e8e9b            addiw   t4,t4,-255
>>>>>  288:   010e9e93            slli    t4,t4,0x10
>>>>>  28c:   f01e8e93            addi    t4,t4,-255 # ffffffffff00ff01 <_end+0xffffffffff00eee1>
>>>>>  290:   010e9e93            slli    t4,t4,0x10
>>>>>  294:   f00e8e93            addi    t4,t4,-256
>>>>>  298:   00300e13            li  t3,3
>>>>>  29c:   37d19c63            bne gp,t4,614 <fail>
>>>>> I am guessing the correct data isn't loaded to gp? How do I check this
>>>>> in the output file? I thought gp is the alias for register 31, but I don't
>>>>> see r31 around gp at that point.
>>>>> Thanks.
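
A minimal sketch that works through the listing above by hand. It decodes the register fields of the ld instruction word and emulates the lui/addiw/slli/addi sequence to find the value t4 holds at the bne; only the instruction words and operands shown in the listing are used, and the helper names are hypothetical.

```python
MASK64 = (1 << 64) - 1

def sext(value, bits):
    """Interpret the low `bits` bits of value as a two's-complement number."""
    value &= (1 << bits) - 1
    return value - (1 << bits) if value >> (bits - 1) else value

def i_type_fields(insn):
    """Return (rd, rs1, imm) of a RISC-V I-type instruction word."""
    rd = (insn >> 7) & 0x1F
    rs1 = (insn >> 15) & 0x1F
    imm = sext(insn >> 20, 12)
    return rd, rs1, imm

def expected_t4():
    """Emulate the constant-building sequence at 0x280..0x294."""
    t4 = sext(0xFF010 << 12, 32)   # lui   t4,0xff010
    t4 = sext(t4 - 255, 32)        # addiw t4,t4,-255 (32-bit result, sign-extended)
    t4 = sext(t4 << 16, 64)        # slli  t4,t4,0x10
    t4 = sext(t4 - 255, 64)        # addi  t4,t4,-255
    t4 = sext(t4 << 16, 64)        # slli  t4,t4,0x10
    t4 = sext(t4 - 256, 64)        # addi  t4,t4,-256
    return t4 & MASK64
```

Decoding 0x0080b183 gives destination register 3, base register 1 and offset 8, so the register to look for in the trace is number 3. The branch to fail is not taken only if the load returns 0xff00ff00ff00ff00.
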
>>>>> On Fri, Sep 18, 2015 at 4:36 AM, Wei Song <ws327 at cam.ac.uk> wrote:
>>>>>> Hello Zhe Cheng,
>>>>>> I think you are probably right about what is needed to support tags
>>>>>> on the latest rocket repo.
>>>>>> However, it is always complicated to make it really work.
>>>>>> One thing I noticed is that you probably need to apply the changes to
>>>>>> htif.scala as well, if you have not done so.
>>>>>> The tags are stored in a cache line in a layout like
>>>>>> [tag][word][tag][word]....
>>>>>> The insertTag() and removeTag() functions in HTIF make sure tag and data
>>>>>> end up in the right interleaved positions inside a cache line.
>>>>>> The host interface (HTIF) is very important, as the test programs (elf/hex)
>>>>>> are written to memory/L2 through it.
>>>>>> I think the host interface may have written a totally misaligned program
>>>>>> to memory due to the missing insertTag() call.
>>>>>> You also need to revise the control path of the rocket core, which I
>>>>>> think you have done.
>>>>>> As a general debugging tip, you can compare the traces from simulation
>>>>>> with the dump files of the test programs.
>>>>>> Making sure the rocket processor is running the correct instructions
>>>>>> would be my first check.
>>>>>> BTW, I am working on bringing up a standalone lowRISC with tag
>>>>>> support based on the latest Rocket chip.
>>>>>> However, it is a slow process and I will need at least a couple of
>>>>>> months for it.
>>>>>> You will be able to run on a clean design if you can wait that long.
>>>>>> Or, if you would like to help, see the "update" branch of
>>>>>> lowrisc-chip.git.
>>>>>> I am working on peripherals now. Tag support has not been added yet, so I
>>>>>> could use some help bringing tag support back to the new code.
>>>>>> Hope this is helpful,
>>>>>> Wei
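
A minimal sketch of the [tag][word][tag][word] layout described above, modelled in software. The tag width, the placement of the tag in the bits above each 64-bit word, and word 0 sitting in the least significant bits are assumptions made for illustration; the functions are loosely modelled on the insertTag()/removeTag() idea and are not the lowRISC code.

```python
WORD_BITS = 64

def insert_tags(words, tags, tag_bits=4):
    """Pack words and tags into one integer, one (tag, word) slot per word."""
    slot_bits = WORD_BITS + tag_bits
    line = 0
    for i, (word, tag) in enumerate(zip(words, tags)):
        assert 0 <= word < (1 << WORD_BITS) and 0 <= tag < (1 << tag_bits)
        line |= ((tag << WORD_BITS) | word) << (i * slot_bits)
    return line

def remove_tags(line, n_words, tag_bits=4):
    """Split a tagged line back into its words and tags."""
    slot_bits = WORD_BITS + tag_bits
    words, tags = [], []
    for i in range(n_words):
        slot = (line >> (i * slot_bits)) & ((1 << slot_bits) - 1)
        words.append(slot & ((1 << WORD_BITS) - 1))
        tags.append(slot >> WORD_BITS)
    return words, tags
```

Reading a line that was written without tag slots shows the misalignment mentioned above: every word after the first is shifted by the accumulated tag widths, so the core fetches garbage even though the data in memory is intact.
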
>>>>>> On 18/09/2015 00:32, Zhe Cheng Lee wrote:
>>>>>> > Hi, all,
>>>>>> >
>>>>>> > Has anyone successfully ported the lowRISC changes that support tagged
>>>>>> > memory to a more recent version of the rocket chip repository (e.g.
>>>>>> > developed lowRISC from a more recent version of the rocket chip
>>>>>> > repository)?
>>>>>> >
>>>>>> > I want to develop a design module that relies on those tagged memory bits
>>>>>> > and is to be integrated with the most recent version of the rocket chip. At
>>>>>> > this stage of my development process, I just want at least the L1 caches to
>>>>>> > support tagged memory. In other words, I'm not concerned about including
>>>>>> > the tag cache or supporting tagged memory in main memory right now. I'm
>>>>>> > having trouble successfully pushing the tags into the L1 caches. I have
>>>>>> > already added the load/store tag instruction decoding and encoding (I'm
>>>>>> > aware that the order of the control signals in the decode table has
>>>>>> > changed a bit since the rocket-chip version lowRISC is based on), the
>>>>>> > new memory access type constant MT_T, and the necessary config parameters.
>>>>>> >
>>>>>> > At first, I thought I just needed to apply the highlighted modifications in
>>>>>> > lowRISC's nbdcache.scala from
>>>>>> > <https://github.com/lowRISC/rocket/commit/51f65e2dce1bc60ef37c6da956bd8f9c8972961b#diff-de7e6f4be95f6d3b7e13d6c32e5c9783>
>>>>>> > and in its tilelink.scala from
>>>>>> > <https://github.com/lowRISC/uncore/commit/cebfde6d42b7465cab79518fad91e323a1a5af41#diff-228d7a2c10baa84f6595aeec2d50174b>
>>>>>> > to the corresponding places in rocket-chip's nbdcache.scala, cache.scala,
>>>>>> > and tilelink.scala. Even without the tag utilities and tag cache, this
>>>>>> > should be fine just for testing existing instructions, since those tag bits
>>>>>> > would just be ignored in those cases, correct? But with that, the
>>>>>> > simulations do not pass the prebuilt tests and benchmarks that don't test
>>>>>> > the load/store tag instructions.
>>>>>> >
>>>>>> > Can anyone help with this?
>>>>>> >
>>>>>> > Thanks.
