Wei,
Thanks for the response. I don't have access to my development environment
right now, so just to be sure: when you said to use pk on the FPGA, did you
mean something like this (after booting the Zynq board and mounting the SD
card)?
./fesvr-zynq pk -c /sdcard/path/to/bzip2_base.riscv ${input} > /path/to/output
Speckle generated a run script for a portable directory containing the SPEC
benchmark binaries and input data compiled with the RISC-V Linux GCC. So for
FPGA runs (not in booted Linux), instead of copying the directory into the
Linux root image, we would just copy it to a separate location on the SD
card and execute the run script. We would also need to modify the run
script so that it calls fesvr-zynq instead of spike. Is this correct?
Thanks,
-Zhe Cheng
On Tue, Oct 13, 2015 at 4:54 AM, Wei Song <ws327(a)cam.ac.uk> wrote:
Hello Zhe Cheng,
I am afraid it is a bit complicated.
Most SPEC benchmark cases (at least in the integer suite) are
single-threaded programs.
Single-threaded programs can run in bare-metal mode (using pk) instead of
inside Linux, which is better for collecting miss rates and running times.
So I did it using pk on the FPGA rather than booting Linux.
You will need to modify the scripts from Speckle to compile for FPGA runs
(if I remember it right, use the Linux GCC rather than the newlib GCC).
pk has an option to report the running time (actually the cycle count)
before exiting.
For miss rates, I added a number of hardware performance counters in
L1/L2 to count cache requests and misses.
Some of the code can be found in the "perform" branch of
https://github.com/lowrisc/lowrisc-chip.git.
However, it is a version for the old code base and without L2.
I have not yet found time to update it to the latest code base.
I have also modified fesvr to read and report the performance counters
before exit.
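As a rough illustration of what such a counter looks like (a minimal
sketch in Chisel 2-era style with made-up names, not the actual code from
the "perform" branch):

    import Chisel._

    // Free-running event counter: each pulse on io.inc (e.g. one per
    // cache request, or one per miss) bumps a 64-bit count which the
    // CSR file can expose as a read-only performance counter.
    class EventCounter extends Module {
      val io = new Bundle {
        val inc   = Bool(INPUT)
        val value = UInt(OUTPUT, 64)
      }
      val count = Reg(init = UInt(0, 64))
      when (io.inc) { count := count + UInt(1) }
      io.value := count
    }

One counter per event (L1/L2 requests and misses) is enough to derive the
miss rates offline.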
If you need to run benchmarks in Linux, you do not need to install SPEC (I
believe), but it is still difficult for the following reasons:
1. When running benchmarks, the Linux kernel has already booted, so the
L1/L2 caches are occupied (not empty).
2. The kernel may affect the running time.
3. There is no easy way of reporting the values of the performance counters,
as they are implemented as CSRs. Reading CSRs from a user-mode program
requires special syscalls added to the Linux kernel.
4. The timing (cycle count) function is also on the supervisor side.
But if the only thing that interests you is the running time, running in
Linux should be OK.
Oh, be careful with the time you get from Linux on the FPGA, which is not
real time.
If I remember it right, there is a static configuration in the kernel that
defines the clock frequency, with no regard to the real FPGA clock frequency.
Good luck!
Wei
On 13/10/2015 04:31, Zhe Cheng Lee wrote:
Hello Wei,
We want to measure the execution time on the FPGA using the SPEC benchmarks.
We have compiled the SPEC benchmark binaries with Speckle. After moving the
SPEC binaries (the .riscv files, e.g. bzip2_base.riscv, correct?) into the
Linux root image and booting Linux on the rocket-chip FPGA, how exactly do
we run the binaries in the booted Linux? Do we need to install SPEC on
the FPGA? If so, how?
Also, in the paper, there are measurements such as MPKI. Is this a
measurement obtained by running SPEC alone, or is it a measurement you
modelled?
Thanks,
-Zhe Cheng
On Sat, Oct 10, 2015 at 12:11 PM, Wei Song <ws327(a)cam.ac.uk> wrote:
> Hello Monjur,
>
> I had run the SPEC 2006 integer cases on a Zedboard using the script from
> Speckle, although not all of them.
> You can have a look at the results in
> http://wsong83.github.io/publication/comparch/riscv2015.pdf
> These are the results collected from FPGA runs.
>
> Best regards,
> Wei
>
>
> On 09/10/15 22:19, Monjur Alam wrote:
>
> Hi Wei,
>
> I got your point. The answer to your question is no, it does not fill the
> cache with fake tags after reset. And you are right, misses always happen
> at the beginning, just after reset. Thanks for pointing this out.
>
> One more suggestion, please: have you ever run SPEC CPU2006 on top of
> rocket-chip on an FPGA? I have created a Stack Overflow question (
> http://stackoverflow.com/questions/33004581/running-spec06-with-riscv-arc...).
> Speckle provides a wrapper for that to run with spike. But spike has no
> connection with rocket-chip. I think running CPU2006 on top of
> rocket-chip on an FPGA will demonstrate the performance overhead of the
> real architecture.
>
> Your opinion please.
>
> Regards,
> Monjur
>
> On Tue, Oct 6, 2015 at 4:36 AM, Wei Song <ws327(a)cam.ac.uk> wrote:
>
>> Hello Monjur,
>>
>> The reason for the tag cache is to reduce the traffic to DRAM.
>> In lowRISC, tags and data are stored separately in different DRAM
>> partitions.
>> So a miss in L1 will cause at least two DRAM reads (one for the data and
>> one for the tag).
>> The total DRAM traffic is increased by 100%.
>> A tag cache is supposed to reduce the amount of tag traffic but does not
>> help with data traffic.
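>>
>> As a rough model (a sketch of the arithmetic only, not a measured
>> result): if every L1 miss needs one data read plus one tag read, and
>> the tag cache hits with probability h, then
>>
>>     DRAM reads per L1 miss = 1 (data) + (1 - h) (tag)
>>
>> so a perfect tag cache (h = 1) removes the extra tag traffic entirely,
>> while no tag cache (h = 0) doubles the traffic.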
>>
>> If the tag cache always hits in your case, I am also wondering why
>> there is this 22% overhead.
>> However, a big tag cache does not guarantee hits.
>> Is your tag cache a kind of dummy, by which I mean one that provides fake
>> tags without needing to fill empty cache lines even after reset?
>> Otherwise, the tag cache is empty at the beginning and there will be
>> compulsory misses after reset.
>>
>> Best regards,
>> Wei
>>
>>
>> On 05/10/2015 23:03, Monjur Alam wrote:
>>
>> Hi Wei,
>>
>> Thank you very much for your help throughout and for providing valuable
>> suggestions.
>>
>> So far, we have implemented RISC-V tag support for L1 (we will add L2
>> later on). The architecture is (more or less the same as lowRISC):
>>
>> 0. Unlike lowRISC, we perform the basic operations (load, store) on data
>> and tag in parallel.
>> 1. Extended the data cache by 1 bit per double word.
>> 2. Added a tag cache that resides between L1 and DRAM.
>> 3. Designed a tagger module that bridges the tagCache and DDR3.
>>
>> But we have seen that performance degrades by around 22%; we have tested
>> this with the existing benchmarks. We are planning to map the design onto
>> a zc706 FPGA and run the SPEC benchmarks on our architecture.
>>
>> 1. As the tag cache (32 MB) assures tag hits, why is there such
>> performance degradation (22%)?
>> 2. Does the tag cache conceptually help with data misses (not tag
>> misses)? A data miss fetches from DRAM, so completion of the operation
>> depends on the data fetch, not only the tag, even if the tag comes from
>> the faster tag cache.
>> 3. Do we really need a tag cache? We could fetch tags from DRAM like data.
>>
>> Your suggestion please.
>>
>> Regards,
>> Monjur
>>
>>
>> On Tue, Sep 22, 2015 at 4:27 AM, Wei Song <ws327(a)cam.ac.uk> wrote:
>>
>>> Hello Zhe Cheng,
>>>
>>> Actually, extending tags into L2 is very simple.
>>> L2 is ignorant of the content of cache lines. What you need to do is
>>> extend the size of the data array.
>>> TileLink is the communication fabric used internally in Rocket.
>>> Both the broadcasting hub and L2 use the same TileLink/MemIO converter,
>>> so you do not need to write a new converter.
>>> At the start, HTIF writes the program to L2. When L2 needs to write back,
>>> the cache line is written to memory using the TileLink/MemIO converter.
>>> It seems you have already got the broadcast one working.
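>>>
>>> To illustrate the data array change (a minimal sketch with made-up
>>> parameter names, assuming a Chisel 2-style sequential-read Mem; not
>>> the actual rocket-chip code):
>>>
>>>     // one tag per 64-bit word, stored alongside the data in each row
>>>     val taggedRowBits = rowBits + (rowBits / 64) * tagBits
>>>     val dataArray = Mem(Bits(width = taggedRowBits),
>>>                         nSets * nWays, seqRead = true)
>>>
>>> L2 then reads and writes the widened row as a whole and never
>>> interprets the tag bits itself.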
>>>
>>> Best regards,
>>> Wei
>>>
>>>
>>> On 22/09/2015 00:54, Zhe Cheng Lee wrote:
>>>
>>> Hello Wei,
>>>
>>> Thank you for your response. I was previously using a broadcast
>>> coherence hub instead of an L2, but now I have moved to using an L2 after
>>> verifying that tag bits can be stored to and loaded from the L1 caches
>>> fine in my modifications to the rocket chip. In this case, will the data
>>> be written from HTIF to L2 through a different converter? Is there a
>>> TileLink-to-L2 data converter?
>>>
>>> Best regards.
>>> -Zhe Cheng
>>>
>>> On Sat, Sep 19, 2015 at 9:15 AM, Wei Song <ws327(a)cam.ac.uk> wrote:
>>>
>>>> Hello Zhe Cheng,
>>>>
>>>> I just noticed another issue which may or may not cause the error.
>>>> Since you do not want to use the tag cache, I assume you are using the
>>>> original MemIOUncachedTileLinkIOConverter to convert TileLink messages
>>>> to MemIO messages.
>>>> I also assume you are using the broadcast coherence hub instead of
>>>> an L2.
>>>> In this case, the data written from HTIF are always written to memory
>>>> through this MemIO/TileLink converter.
>>>> You need to remove tags from messages going from TileLink to MemIO and
>>>> add tags to messages going from MemIO to TileLink.
>>>>
>>>> The tag cache does this conversion, so I did not change the code of the
>>>> MemIO/TileLink converter.
>>>> But some revision is needed in your case, something like what has been
>>>> done for the HTIF and the icache.
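>>>>
>>>> Conceptually something like this (an illustrative sketch only, with
>>>> made-up signal names, assuming the insertTag()/removeTag() helpers
>>>> from the uncore changes; not the actual converter code):
>>>>
>>>>     // TileLink -> MemIO: strip the interleaved tag bits
>>>>     io.mem.req_data.bits.data := removeTag(io.tl.acquire.bits.data)
>>>>     // MemIO -> TileLink: re-insert tags (zero if memory keeps none)
>>>>     io.tl.grant.bits.data := insertTag(io.mem.resp.bits.data, UInt(0))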
>>>>
>>>> The assembly seems to be from the dump file, and it looks correct to my
>>>> eyes. The difference between the trace file and the dump file would
>>>> reveal more insight.
>>>> If you think the value loaded into gp is wrong, maybe have a look at the
>>>> test case and try to figure out what exactly is wrong; that would help
>>>> you debug.
>>>> I think it is the test case test_3 in riscv-tests/isa/rv64ui/ld.S.
>>>>
>>>> Best regards,
>>>> Wei
>>>>
>>>>
>>>> On 18/09/15 23:59, Zhe Cheng Lee wrote:
>>>>
>>>> Hi Wei,
>>>>
>>>> Thank you very much for your response. It is indeed complicated to get
>>>> this to really work. I found your response helpful, though. I didn't
>>>> consider HTIF before when modifying the current rocket chip. I can see
>>>> why HTIF is important then.
>>>>
>>>> By control path, do you mean the control signals associated with the
>>>> new instructions and the logic to handle them? If so, then yes, I have
>>>> changed it.
>>>>
>>>> I added the tag utilities (changing the data types in these tag
>>>> functions from Bits to UInt) and modified the corresponding lines in
>>>> htif.scala according to the changes in this commit:
>>>> https://github.com/lowRISC/uncore/commit/cebfde6d42b7465cab79518fad91e323...
>>>> This was to support tag memory, but the simulations still have not
>>>> passed.
>>>>
>>>> As a side note, I added the changes in icache.scala to remove the tags
>>>> at the line presented to the instruction cache as well. But when I
>>>> compared, say, the rv64ui-p-ld test .out simulated from the latest
>>>> rocket-chip with the .out file from my changed version, I noticed that
>>>> the two PCs differ after several instructions once the program actually
>>>> starts. When I revert the changes in icache.scala (as in, removeTag
>>>> doesn't get called), the two PCs start deviating later on instead of
>>>> within the first few instructions after the program starts. Does the L1
>>>> instruction cache not interact with HTIF?
>>>>
>>>> Without removing the tags in the instruction cache, the PCs begin to
>>>> deviate after the branch instruction in:
>>>>
>>>> 27c: 0080b183 ld gp,8(ra)
>>>> 280: ff010eb7 lui t4,0xff010
>>>> 284: f01e8e9b addiw t4,t4,-255
>>>> 288: 010e9e93 slli t4,t4,0x10
>>>> 28c: f01e8e93 addi t4,t4,-255 # ffffffffff00ff01 <_end+0xffffffffff00eee1>
>>>> 290: 010e9e93 slli t4,t4,0x10
>>>> 294: f00e8e93 addi t4,t4,-256
>>>> 298: 00300e13 li t3,3
>>>> 29c: 37d19c63 bne gp,t4,614 <fail>
>>>>
>>>> I am guessing the correct data isn't loaded into gp? How do I check
>>>> this in the output file? I thought gp is the alias for register 31, but
>>>> I don't see r31 around gp at that point.
>>>>
>>>> Thanks.
>>>>
>>>>
>>>> On Fri, Sep 18, 2015 at 4:36 AM, Wei Song <ws327(a)cam.ac.uk> wrote:
>>>>
>>>>> Hello Zhe Cheng,
>>>>>
>>>>> I think you are probably right about what is needed to support tags
>>>>> on the latest rocket repo.
>>>>> However, it is always complicated to make it really work.
>>>>>
>>>>> One thing I noticed is that you probably need to apply the changes to
>>>>> htif.scala as well, if you have not done so.
>>>>> The tags are stored in a cache line in a way like
>>>>> [tag][word][tag][word]....
>>>>>
>>>>> The insertTag() and removeTag() in HTIF will make sure tag and data
>>>>> end up in the right interleaved positions inside a cache line.
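>>>>>
>>>>> A minimal sketch of that interleaving (my own illustration, assuming
>>>>> 64-bit words with a 4-bit tag per word; not the actual HTIF code):
>>>>>
>>>>>     // word i of a line is packed as [tag_i (4b)][word_i (64b)]
>>>>>     def insertTag(data: UInt, tags: UInt, words: Int): UInt =
>>>>>       Cat((words - 1 to 0 by -1).map(i =>
>>>>>         Cat(tags(i*4 + 3, i*4), data(i*64 + 63, i*64))))
>>>>>     def removeTag(line: UInt, words: Int): UInt =
>>>>>       Cat((words - 1 to 0 by -1).map(i => line(i*68 + 63, i*68)))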
>>>>>
>>>>> The host interface (HTIF) is very important, as the test programs
>>>>> (elf/hex) are written to memory/L2 through it.
>>>>> I think the host interface may have written a totally misaligned
>>>>> program to memory due to the lack of the insertTag() function.
>>>>>
>>>>> You also need to revise the control path of the rocket core, which I
>>>>> think you have done.
>>>>>
>>>>> For general debugging tips, you can compare the traces from simulation
>>>>> with the dump files of the test programs.
>>>>> Making sure the rocket processor is running the correct instructions
>>>>> would be my first check.
>>>>>
>>>>> BTW, I am working on bringing up a stand-alone lowRISC with tag
>>>>> support based on the latest Rocket chip.
>>>>> However, it is a slow process and I will need at least a couple of
>>>>> months for it.
>>>>> You will be able to run on a clean design if you can wait that long.
>>>>> Or, if you would like to help, see the "update" branch of
>>>>> lowrisc-chip.git.
>>>>> I am working on peripherals now. Tag support is not added yet, so I
>>>>> could use some help bringing tag support back to the new code.
>>>>>
>>>>> Hope this is helpful,
>>>>> Wei
>>>>>
>>>>>
>>>>> On 18/09/2015 00:32, Zhe Cheng Lee wrote:
>>>>> > Hi, all,
>>>>> >
>>>>> > Has anyone successfully ported the lowRISC changes that support
>>>>> > tagged memory to a more recent version of the rocket-chip repository
>>>>> > (e.g. developed lowRISC from a more recent version of the rocket-chip
>>>>> > repository)?
>>>>> >
>>>>> > I want to develop a design module that relies on those tagged memory
>>>>> > bits and is to be integrated with the most recent version of the
>>>>> > rocket chip. At this stage of my development process, I just want at
>>>>> > least the L1 caches to support tagged memory. In other words, I'm not
>>>>> > concerned about including the tag cache or supporting tagged memory
>>>>> > in main memory right now. I'm having trouble successfully pushing the
>>>>> > tags into the L1 caches. I have already added the load/store tag
>>>>> > instruction decoding and encoding (I'm aware that the order of the
>>>>> > control signals in the decode table has changed a bit since the
>>>>> > rocket-chip version lowRISC is based off of), the new memory access
>>>>> > type constant MT_T, and the necessary config parameters.
>>>>> >
>>>>> > At first, I thought I just needed to include the highlighted
>>>>> > modifications in lowRISC's nbdcache.scala from
>>>>> > https://github.com/lowRISC/rocket/commit/51f65e2dce1bc60ef37c6da956bd8f9c...
>>>>> > and in its tilelink.scala from
>>>>> > https://github.com/lowRISC/uncore/commit/cebfde6d42b7465cab79518fad91e323...
>>>>> > in the corresponding places in rocket-chip's nbdcache.scala,
>>>>> > cache.scala, and tilelink.scala. Even without the tag utilities and
>>>>> > tag cache, this should be fine just for testing existing
>>>>> > instructions, since those tag bits would just be ignored in those
>>>>> > cases, correct? But with that, the simulations do not pass the
>>>>> > prebuilt tests and benchmarks that don't test the load/store tag
>>>>> > instructions.
>>>>> >
>>>>> > Can anyone help with this?
>>>>> >
>>>>> > Thanks.