[lowrisc-dev] Open GPU for the first CPU
john.leidel at gmail.com
Mon Feb 9 08:19:34 GMT 2015
ALadar-V, we're working on a massively multithreaded RISC-V extension that makes use of HMC. The target audience is much different from that of a GPU, though.
I would suggest that you outline a set of base requirements for your GPU ISA and see whether it maps to any combination of RISC-V ISA extensions. From there you can begin to gauge the cost of implementing it in silicon.
--John D. Leidel
> On Feb 8, 2015, at 7:44 PM, ALadar-V <lowrisc-dev at aggregator.eu> wrote:
> I have been thinking along the same (GPU) lines for a few weeks now. One
> question is whether the RISC-V ISA is the right ISA for a GPU (I hope so)
> and whether a community of developers could be established to take
> care of developing a GPU. Something along the lines of the Rocket cores -
> personally, I was thinking about a gradual rewrite of the Mesa 3D library into
> hardware, accelerating parts of it until it's a full-fledged GPU-V.
> Looking at opencores.org under the Video controller section, I can only find
> a Wishbone graphics controller and VGA cores, nothing about 3D.
> Wouldn't it be great if there were a scalable GPU-V generator along the
> lines of Rocket? A generator that could produce a GPU from input
> parameters like the number of blocks, ALUs, etc. Again, a spec would be
> needed for the GPU so that OpenCL and OpenGL would work on any generated
> GPU.
> With regards to moving data in a massive-register system: unless there
> is a spec for the "GPU ISA", we can't start to tinker with it, even in
> a software simulator, because there's no benchmark code available. Same
> as with benchmarking the Rocket cores.
> What I was personally thinking of starting with (using the Mesa lib)
> would be a rasterizer accelerator glued to a VGA GPU, performing
> triangle rendering with antialiasing at high speed and offloading the CPU
> work in the Mesa library. Maybe tile-based. Step by step... With HBM or HMC
> memory, the bandwidth to off-chip memory can be there. And I can just
> imagine an SoC with multiple high-bandwidth interfaces having both a GPU
> and a CPU part. Something like e.g. the Raspberry Pi (now version 2 :) on
>> On Tue, Jan 13, 2015 at 11:11:55PM +0100, Reinoud Zandijk wrote:
>> Hi Jerome,
>> On Tue, Jan 13, 2015 at 04:17:07PM -0500, Jerome Glisse wrote:
>>>> What about https://github.com/VerticalResearchGroup/miaow ?
>>> The 3-clause BSD license is horrible; I wonder when people will stop
>>> using it. That said, last time I checked they had only implemented a
>>> very basic part. It was far from being useful or meaningful.
>> I won't start a BSD license vs GPL discussion here :) Apart from that, it
>> BSD licenses without the advertisement clause are fine. The BSD license with the advertisement clause is a pain to deal with; anyone who has ever had to work on software distribution can testify to that (especially if lawyers were involved). Most of the time the solution ends up being not shipping the software and replacing it with something else.
>> The GPL did not make that mistake, and almost all major BSD-licensed projects do use the 2-clause license (i.e. the one without the advertisement clause). I am certainly not a BSD/GPL flamer, but I will definitely cry out loud to anybody who uses or considers using the 3-clause license, or any license with an advertisement clause.
>> On 8.2.2015 20:01, lowrisc-dev-request at lists.lowrisc.org wrote:
>> GPUs are all about bandwidth; aligning compute units one after the other is a pointless exercise. It is all about feeding the compute units with data to crunch. The "secret" of GPUs is to have an order of magnitude more threads in flight than there are compute units (10 times more on a high-end GPU is a good approximation). The idea is that you will always have threads that are ready to perform an operation on the floating-point or integer ALUs.
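
The oversubscription argument above can be checked with a toy latency-hiding model. Everything here (the 1-cycle issue, the fixed 9-cycle stall, the round-robin scheduler) is an illustrative assumption, not a description of any real GPU:

```python
# Toy model: each thread issues one ALU instruction, then stalls for a
# fixed memory latency. A simple scheduler picks any ready thread each
# cycle. Utilization of the single ALU grows with the number of threads
# oversubscribed onto it, saturating around the latency ratio (~10x).

def alu_utilization(threads_per_alu: int, mem_latency: int,
                    cycles: int = 10_000) -> float:
    """Fraction of cycles the ALU is busy, given threads that stall
    mem_latency cycles after every issued instruction."""
    ready_at = [0] * threads_per_alu   # cycle at which each thread is ready
    busy = 0
    for cycle in range(cycles):
        for t in range(threads_per_alu):
            if ready_at[t] <= cycle:
                busy += 1
                ready_at[t] = cycle + 1 + mem_latency  # issue, then stall
                break                                  # one issue per cycle
    return busy / cycles

if __name__ == "__main__":
    for n in (1, 2, 4, 10):
        print(n, alu_utilization(n, mem_latency=9))
```

With a 9-cycle stall after each 1-cycle issue, one thread keeps the ALU roughly 10% busy, and ten threads keep it fully busy, matching the "order of magnitude more threads than units" rule of thumb.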
>> Having all the execution units run the same program is a mistake. On a GPU you often have several programs in flight (in the case of graphics, some work on vertices, others on pixels, ...). Also, the stack size you need to keep around to account for active/inactive threads is log2(#units_running_the_same_program). So far both AMD and NVidia seem to have converged on 64 compute units (each unit here being a simple
>> ALU capable of performing a single float or integer operation per cycle). With 64 threads you only need a 6-entry stack of qword masks, i.e. 48 bytes.
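
The 48-byte figure above can be made concrete with a small sketch of a divergence-mask stack. The class and method names are illustrative, not from any real ISA, and the log2 depth bound is the post's own estimate:

```python
# Sketch of a SIMT divergence stack for 64 lock-step lanes: on a
# divergent branch the current 64-bit execution mask is pushed and
# narrowed to the taken lanes; reconvergence pops it back. Per the
# post's estimate of log2(64) = 6 nesting levels, the storage needed
# is 6 masks * 8 bytes = 48 bytes per thread group.

import math

LANES = 64

class DivergenceStack:
    def __init__(self):
        self.stack = []                  # saved masks, one per nesting level
        self.active = (1 << LANES) - 1   # all 64 lanes active initially

    def branch(self, taken_mask: int) -> None:
        """Enter a divergent branch: save the mask, run taken lanes first."""
        self.stack.append(self.active)
        self.active &= taken_mask

    def reconverge(self) -> None:
        """Leave the divergent region: restore the saved mask."""
        self.active = self.stack.pop()

max_depth = int(math.log2(LANES))        # 6 levels
print(max_depth * (LANES // 8))          # 48 bytes of mask storage
```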
>> Intel did try to do just as you said, and it turned out to be one of their biggest flops (the Larrabee disaster).
>> So if anyone wants to design a GPU, the main thing is to first figure out how to get the biggest bandwidth you can at every level (memory fetches from main memory, register file accesses). The compute units themselves and the instruction scheduler are not the most complex parts; they are in fact the easy parts, as all the tricks you can use to perform arithmetic operations, and things like instruction caching and decoding, are well known and well documented. But designing a register file
>> capable of delivering 1024 bits or more per cycle is hard, and a texturing unit capable of filtering a texel per cycle while batching main-memory accesses to maximize bandwidth and minimize cache misses is the hard part.
>> That is where most of the secret sauce is, and you will not find much in the literature about those aspects.
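
One reason wide register files are hard is hinted at in the bandwidth figure above: a common approach is to build them from multiple narrow banks, which only reach peak bandwidth when concurrent accesses land in distinct banks. The bank count, width, and interleaving below are assumptions for the sketch, not numbers from any shipping GPU:

```python
# Illustrative model of a banked register file: 8 single-ported banks of
# 128 bits give a 1024-bit/cycle peak, but only when the registers read
# in one cycle map to different banks. Conflicting accesses serialize.

BANKS = 8
BANK_WIDTH_BITS = 128   # 8 * 128 = 1024 bits/cycle peak bandwidth

def cycles_for_reads(regs):
    """Each single-ported bank serves one read per cycle, so the cycle
    count equals the worst-case number of operands hitting one bank."""
    per_bank = [0] * BANKS
    for r in regs:
        per_bank[r % BANKS] += 1   # simple interleaved register-to-bank map
    return max(per_bank)

print(cycles_for_reads([0, 1, 2, 3]))    # distinct banks -> 1 cycle
print(cycles_for_reads([0, 8, 16, 24]))  # all map to bank 0 -> 4 cycles
```

The scheduling tricks needed to avoid the second case (operand collectors, register allocation constraints) are exactly the kind of under-documented "secret sauce" the post refers to.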
>>> With regards,