[lowrisc-dev] Porting tagged memory support to current version of RISC-V Rocket Chip

Wei Song ws327 at cam.ac.uk
Tue Oct 13 09:54:34 BST 2015

Hello Zhe Cheng,

I am afraid it is a bit complicated.
Most SPEC benchmark cases (at least those in the integer subset) are
single-threaded programs.
Single-threaded programs can run in bare-metal mode (using pk) instead of
inside Linux, which is better for collecting miss rates and running times.
So I ran them using pk on the FPGA rather than booting Linux.

You will need to modify the scripts from Speckle to compile for FPGA
runs (if I remember correctly, use the Linux gcc rather than the newlib gcc).

pk has an option to report running time (actually it is the cycle count)
before exit.

For the miss rate, I added a number of hardware performance counters to
L1/L2 to count cache requests and misses.
Some of the code can be found in the "perform" branch of
https://github.com/lowrisc/lowrisc-chip.git.
However, that version targets the old code base and does not include L2.
I have not had time to port it to the latest code base yet.

I have also modified fesvr to read/report the performance counters
before exit.
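
As a rough illustration of how the raw numbers turn into the usual
metrics, here is a minimal Python sketch. It is not the actual fesvr or pk
code: the field names are hypothetical, and it assumes a retired-instruction
count is available alongside the cycle count and the cache request/miss
counters.

```python
from dataclasses import dataclass


@dataclass
class CounterSample:
    """Raw values read at program exit (all field names are hypothetical)."""
    cycles: int        # cycle count, e.g. as reported by pk before exit
    instructions: int  # retired instructions (assumed to be available)
    requests: int      # requests seen by one cache level
    misses: int        # misses at that cache level


def miss_rate(s: CounterSample) -> float:
    """Fraction of cache requests that missed."""
    return s.misses / s.requests if s.requests else 0.0


def mpki(s: CounterSample) -> float:
    """Misses per thousand retired instructions."""
    return 1000.0 * s.misses / s.instructions


def run_time_seconds(s: CounterSample, clock_hz: float) -> float:
    """Convert the cycle count to seconds using the real FPGA clock."""
    return s.cycles / clock_hz


if __name__ == "__main__":
    # Made-up numbers, only to show the arithmetic.
    sample = CounterSample(cycles=50_000_000, instructions=2_000_000,
                           requests=500_000, misses=25_000)
    print(miss_rate(sample), mpki(sample), run_time_seconds(sample, 25e6))
```

The clock frequency passed to run_time_seconds() should be the real FPGA
clock, not whatever the software assumes.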

If you need to run the benchmarks in Linux, you do not need to install SPEC
(I believe), but it is still difficult for the following reasons:
1. By the time a benchmark runs, the Linux kernel has already booted, so
the L1/L2 caches are occupied (not empty).
2. The kernel may affect the running time.
3. There is no easy way to report the values of the performance counters,
as they are implemented as CSRs. Reading CSRs from a user-mode program
needs special syscalls added to the Linux kernel.
4. The timing (cycle count) function is also on the supervisor side.
But if the only thing that interests you is the running time, running in
Linux should be OK.
Oh, be careful with the time you get from Linux on the FPGA: it is not real time.
If I remember correctly, there is a static configuration in the kernel that
defines the clock frequency, regardless of the real FPGA clock frequency.
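
If the kernel derives time by dividing a cycle count by its configured
frequency (an assumption on my side, and the frequencies below are made-up
examples), the reported time can be rescaled to wall-clock time along these
lines:

```python
def real_seconds(reported_seconds: float,
                 kernel_assumed_hz: float,
                 actual_fpga_hz: float) -> float:
    """Rescale a time reported by the FPGA Linux to wall-clock time.

    Assumes reported = cycles / kernel_assumed_hz, so that
    real = cycles / actual_fpga_hz
         = reported * kernel_assumed_hz / actual_fpga_hz.
    """
    if actual_fpga_hz <= 0 or kernel_assumed_hz <= 0:
        raise ValueError("clock frequencies must be positive")
    return reported_seconds * kernel_assumed_hz / actual_fpga_hz


if __name__ == "__main__":
    # Hypothetical values: the kernel assumes 1 GHz, the FPGA runs at 50 MHz.
    print(real_seconds(3.0, kernel_assumed_hz=1e9, actual_fpga_hz=50e6))
```

For comparing two designs running at the same FPGA clock the scaling
cancels out, so the raw reported times can still be compared directly.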

Good luck!

On 13/10/2015 04:31, Zhe Cheng Lee wrote:
> Hello Wei,
> We want to measure the execution time in the FPGA using SPEC
> benchmarks. We have compiled the SPEC benchmark binaries with Speckle.
> After moving the SPEC binaries (the .riscv files e.g.
> bzip2_base.riscv, correct?) into the Linux root image and booting
> Linux at the rocket-chip FPGA, how exactly do we run the binaries in
> the booted Linux? Do we need to install SPEC in the FPGA? If so, how?
> Also, in the paper, there are measurements such as MPKI. Is this a
> measurement given by running SPEC alone, or is it a measurement you
> modelled?
> Thanks,
> -Zhe Cheng
> On Sat, Oct 10, 2015 at 12:11 PM, Wei Song <ws327 at cam.ac.uk
> <mailto:ws327 at cam.ac.uk>> wrote:
>     Hello Monjur,
>     I had run SPEC 2006 Integer cases on a Zedboard using the script
>     from Speckle, although not all cases.
>     You can have a look at the results in
>     http://wsong83.github.io/publication/comparch/riscv2015.pdf
>     These are the results collected from FPGA runs.
>     Best regards,
>     Wei
>     On 09/10/15 22:19, Monjur Alam wrote:
>>     Hi Wei,
>>     I got your point. The answer to your question is no, it does not fill
>>     the cache with fake tags after reset. And you are right, misses
>>     always happen at the beginning just after reset. Thanks for
>>     pointing this out.
>>     One more suggestion please: have you ever run SPEC CPU2006 on top
>>     of rocket-chip on FPGA? I have created a Stack Overflow question
>>     (http://stackoverflow.com/questions/33004581/running-spec06-with-riscv-architecture).
>>     The Speckle provides a wrapper for that to run spike. But, spike
>>     has no connection with rocket-chip. I think, running CPU2006  on
>>     top of rocket-chip on FPGA will demonstrate the performance
>>     overhead of real architecture.
>>     Your opinion please.
>>     Regards,
>>     Monjur 
>>     On Tue, Oct 6, 2015 at 4:36 AM, Wei Song <ws327 at cam.ac.uk
>>     <mailto:ws327 at cam.ac.uk>> wrote:
>>         Hello Monjur,
>>         The reasoning for tag cache is to reduce the traffic to DRAM.
>>         In lowRISC, tags and data are stored separately in different
>>         DRAM partitions.
>>         So a miss in L1 will cause at least two DRAM reads (one for
>>         data and one for tag).
>>         The total DRAM traffic is increased by 100%.
>>         A tag cache is supposed to reduce the amount of tag traffic
>>         but does not help on data traffic.
>>         If in your case the tag cache is always hit, I am also
>>         wondering why there is this 22% overhead.
>>         However, a big tag cache does not guarantee hit.
>>         Is your tag cache a kind of dummy, by which I mean the tag cache
>>         provides fake tags without the need to fill empty cache lines
>>         even after reset?
>>         Otherwise, the tag cache is empty at the beginning and there
>>         will be compulsory misses after reset.
>>         Best regards,
>>         Wei
>>         On 05/10/2015 23:03, Monjur Alam wrote:
>>>         Hi Wei,
>>>         Thank you very much for your help throughout and for your
>>>         valuable suggestions.
>>>         So far, we have implemented tag support for RISC-V in L1
>>>         (we will add L2 later on). The architecture is more or less
>>>         the same as lowRISC:
>>>         0. Unlike lowRISC, we perform basic operations (load, store)
>>>         on data and tags in parallel.
>>>         1. Extended the data cache by 1 bit per double word
>>>         2. Added a tag cache that resides between L1 and DRAM
>>>         3. Designed a tagger module to bridge between tagCache
>>>         and DDR3
>>>         But we have seen that performance is degraded by around
>>>         22%; we tested this with the existing benchmarks. We are
>>>         planning to map the design onto a zc706 FPGA and to run the SPEC
>>>         benchmarks on our architecture.
>>>         1. As the tag cache (32 MB) ensures tag hits, why is there such a
>>>         performance degradation (22%)?
>>>         2. Does the tag cache conceptually help with data misses (not tag
>>>         misses)? A data miss fetches from DRAM, so completion of the
>>>         operation depends on the data fetch, not only on the tag, even
>>>         if the tag is fetched from the faster tag cache.
>>>         3. Do we really need a tag cache? We could fetch tags from DRAM
>>>         like data.
>>>         Your suggestion please.
>>>         Regards,
>>>         Monjur
>>>         On Tue, Sep 22, 2015 at 4:27 AM, Wei Song <ws327 at cam.ac.uk
>>>         <mailto:ws327 at cam.ac.uk>> wrote:
>>>             Hello Zhe Cheng,
>>>             Actually, extending tags in L2 is very simple.
>>>             L2 is ignorant of the content of cache lines. What you
>>>             need to do is to extend the size of the data array.
>>>             TileLink is the communication fabric used internally in
>>>             Rocket.
>>>             Both the broadcasting hub and L2 use the same
>>>             TileLink/MemIO converter, so you do not need
>>>             a new converter.
>>>             At start, HTIF writes the program to L2. When L2 needs to
>>>             write back, a cache line is written to memory
>>>             using the TileLink/MemIO converter.
>>>             It seems you have got the broadcast one working already.
>>>             Best regards,
>>>             Wei
>>>             On 22/09/2015 00:54, Zhe Cheng Lee wrote:
>>>>             Hello Wei,
>>>>             Thank you for your response. I was previously using a
>>>>             broadcast coherence hub instead of a L2, but now I have
>>>>             moved to using an L2 after verifying that tag bits can
>>>>             be stored to and loaded from the L1 caches fine in my
>>>>             modifications to the rocket chip. In this case, will
>>>>             the data be written from HTIF to L2 through a different
>>>>             converter? Is there a TileLink-to-L2 data converter?
>>>>             Best regards.
>>>>             -Zhe Cheng
>>>>             On Sat, Sep 19, 2015 at 9:15 AM, Wei Song
>>>>             <ws327 at cam.ac.uk <mailto:ws327 at cam.ac.uk>> wrote:
>>>>                 Hello Zhe Cheng,
>>>>                 I just noticed another issue which may or may not
>>>>                 cause the error.
>>>>                 Since you do not want to use the tag cache, I
>>>>                 assume you are using the original
>>>>                 MemIOUncachedTileLinkIOConverter to convert TileLink
>>>>                 messages to MemIO messages.
>>>>                 I also assume you are using the broadcast coherence
>>>>                 hub instead of an L2.
>>>>                 In this case, the data written from HTIF are always
>>>>                 written to memory through this MemIO/TileLink
>>>>                 converter.
>>>>                 You need to remove tags from messages going from TileLink
>>>>                 to MemIO and add tags to messages going from MemIO to
>>>>                 TileLink.
>>>>                 The tag cache does this conversion, so I did not change
>>>>                 the code of the MemIO/TileLink converter.
>>>>                 But some revision is needed in your case, similar to
>>>>                 what has been done for the HTIF and icache.
>>>>                 The assembly seems to be from the dump file, and it looks
>>>>                 correct to me.
>>>>                 The difference between the trace file and the dump file
>>>>                 would reveal more.
>>>>                 If you think the value loaded to gp is wrong, having
>>>>                 a look at the test case and trying to figure out
>>>>                 what exactly goes wrong would help you debug.
>>>>                 I think it is the test case test_3 in
>>>>                 riscv-tests/isa/rv64ui/ld.S.
>>>>                 Best regards,
>>>>                 Wei
>>>>                 On 18/09/15 23:59, Zhe Cheng Lee wrote:
>>>>>                 Hi Wei,
>>>>>                 Thank you very much for your response. It is
>>>>>                 indeed complicated to get this to really work. I
>>>>>                 found your response helpful, though. I didn't
>>>>>                 consider HTIF before when modifying the current
>>>>>                 rocket chip. I can see why HTIF is important then.
>>>>>                 By control path, do you mean the control signals
>>>>>                 associated with the new instructions and the logic
>>>>>                 to handle them? If so, then yes, I have changed it.
>>>>>                 I added the tag utilities (I changed the data
>>>>>                 types in these tag functions from Bits to UInt) and
>>>>>                 modified the corresponding lines in htif.scala
>>>>>                 according to the changes in this commit
>>>>>                 <https://github.com/lowRISC/uncore/commit/cebfde6d42b7465cab79518fad91e323a1a5af41#diff-228d7a2c10baa84f6595aeec2d50174b>
>>>>>                 to support tag memory, but the simulations still
>>>>>                 have not passed.
>>>>>                 As a side note, I added the changes in
>>>>>                 icache.scala to remove the tags at the line to be
>>>>>                 presented to the instruction cache as well, but
>>>>>                 when I compared, say, rv64ui-p-ld test .out
>>>>>                 simulated from the latest rocket-chip with the
>>>>>                 .out file from my changes to it, I noticed that
>>>>>                 the two PCs differ after several instructions when
>>>>>                 the program actually starts. When I revert back
>>>>>                 the changes in icache.scala (as in, removeTag
>>>>>                 doesn't get called), the two PCs start deviating
>>>>>                 later on instead of within the first few after the
>>>>>                 program starts. Do the L1 instruction caches not
>>>>>                 interact with HTIF?
>>>>>                 Without removing the tags in the instruction
>>>>>                 cache, the PCs begin to deviate after the branch
>>>>>                 instruction in:
>>>>>                  27c:   0080b183            ld  gp,8(ra)
>>>>>                  280:   ff010eb7            lui t4,0xff010
>>>>>                  284:   f01e8e9b            addiw   t4,t4,-255
>>>>>                  288:   010e9e93            slli    t4,t4,0x10
>>>>>                  28c:   f01e8e93            addi    t4,t4,-255 # ffffffffff00ff01 <_end+0xffffffffff00eee1>
>>>>>                  290:   010e9e93            slli    t4,t4,0x10
>>>>>                  294:   f00e8e93            addi    t4,t4,-256
>>>>>                  298:   00300e13            li  t3,3
>>>>>                  29c:   37d19c63            bne gp,t4,614 <fail>
>>>>>                 I am guessing the correct data isn't loaded to gp?
>>>>>                 How do I check this in the output file? I thought
>>>>>                 gp is the alias for register 31, but I don't see
>>>>>                 r31 around gp at that point.
>>>>>                 Thanks.
>>>>>                 On Fri, Sep 18, 2015 at 4:36 AM, Wei Song
>>>>>                 <ws327 at cam.ac.uk <mailto:ws327 at cam.ac.uk>> wrote:
>>>>>                     Hello Zhe Cheng,
>>>>>                     I think you are probably right on what is
>>>>>                     needed for supporting tags on
>>>>>                     the latest rocket repo.
>>>>>                     However, it is always complicated to make it
>>>>>                     really work.
>>>>>                     One thing I noticed is that you probably need
>>>>>                     to apply the changes to
>>>>>                     htif.scala as well if you have not done so.
>>>>>                     The tags are stored in a cache line in a way like
>>>>>                     [tag][word][tag][word]....
>>>>>                     The insertTag() and removeTag() in HTIF will
>>>>>                     make sure tag/data end up
>>>>>                     in the right interleaved position inside a
>>>>>                     cache line.
>>>>>                     Host interface (HTIF) is very important as the
>>>>>                     test programs (elf/hex)
>>>>>                     are written to memory/L2 through it.
>>>>>                     I think the host interface may have written
>>>>>                     totally unaligned program to
>>>>>                     memory due to the lack of insertTag() function.
>>>>>                     Also, you need to revise the control path of
>>>>>                     the rocket core, which I
>>>>>                     think you have already done.
>>>>>                     For general debugging tips, you can compare
>>>>>                     the traces from simulation
>>>>>                     with the dump files of the test programs.
>>>>>                     Making sure the rocket processor is running
>>>>>                     the correct instructions
>>>>>                     would be my first check.
>>>>>                     BTW, I am working on bringing up a
>>>>>                     standalone lowRISC with tag
>>>>>                     support based on the latest Rocket chip.
>>>>>                     However, it is a slow process and I will need
>>>>>                     at least a couple of
>>>>>                     months on it.
>>>>>                     You will be able to run on a clean design if
>>>>>                     you can wait that long.
>>>>>                     Or if you would like to help, see the "update"
>>>>>                     branch of lowrisc-chip.git.
>>>>>                     I am working on peripherals now. Tag support
>>>>>                     is not added yet, so I can
>>>>>                     use some help to bring back tag support to the
>>>>>                     new code.
>>>>>                     Hope this is helpful,
>>>>>                     Wei
>>>>>                     On 18/09/2015 00:32, Zhe Cheng Lee wrote:
>>>>>                     > Hi, all,
>>>>>                     >
>>>>>                     > Has anyone successfully ported lowRISC changes
>>>>>                     to support tagged memory to a
>>>>>                     > more updated version of the rocket chip
>>>>>                     repository (e.g. develop lowRISC
>>>>>                     > from a more updated version of the rocket
>>>>>                     chip repository)?
>>>>>                     >
>>>>>                     > I want to develop a design module that relies
>>>>>                     on those tagged memory bits and
>>>>>                     > is to be integrated with the most recent
>>>>>                     version of the rocket chip. At
>>>>>                     > this stage of my development process, I just
>>>>>                     want at least the L1 caches to
>>>>>                     > support tagged memory. In other words, I'm
>>>>>                     not concerned about including
>>>>>                     > the tag cache or supporting tagged memory in
>>>>>                     main memory right now. I'm
>>>>>                     > having trouble successfully pushing the tags
>>>>>                     into the L1 caches. I have
>>>>>                     > already added the load/store tag instruction
>>>>>                     decoding and encoding (I'm
>>>>>                     > aware that the order of the control signals
>>>>>                     in the decode table has been
>>>>>                     > changed a bit since the rocket-chip version
>>>>>                     lowRISC is based off of), the
>>>>>                     > new memory access type constant MT_T, and
>>>>>                     the necessary config parameters.
>>>>>                     >
>>>>>                     > At first, I thought I just need to include
>>>>>                     the highlighted modifications in
>>>>>                     > lowRISC's nbdcache.scala from
>>>>>                     >
>>>>>                     https://github.com/lowRISC/rocket/commit/51f65e2dce1bc60ef37c6da956bd8f9c8972961b#diff-de7e6f4be95f6d3b7e13d6c32e5c9783
>>>>>                     > and in its tilelink.scala from
>>>>>                     >
>>>>>                     https://github.com/lowRISC/uncore/commit/cebfde6d42b7465cab79518fad91e323a1a5af41#diff-228d7a2c10baa84f6595aeec2d50174b
>>>>>                     > to the corresponding places in rocket-chip's
>>>>>                     nbdcache.scala, cache.scala,
>>>>>                     > and tilelink.scala. Even without the tag
>>>>>                     utilities and tag cache, this
>>>>>                     > should be fine just for testing existing
>>>>>                     instructions, since those tag bits
>>>>>                     > would just be ignored in those cases,
>>>>>                     correct? But with that, the
>>>>>                     > simulations do not pass the prebuilt tests
>>>>>                     and benchmarks that don't test
>>>>>                     > the load/store tag instructions.
>>>>>                     >
>>>>>                     > Can anyone help with this?
>>>>>                     >
>>>>>                     > Thanks.

