Why it matters: Groq is roughly the hundredth startup to take a shot at building an AI accelerator card, but it's the second to reach market and the first with a product to cross the 1 quadrillion operations per second threshold. That's quadruple the performance of Nvidia's most powerful card.
The Groq Tensor Streaming Processor (TSP) draws 300W per core, so fortunately, it only has one. Better still, Groq has turned that from a limitation into the TSP's greatest strength.
You should probably throw everything you know about GPUs and AI processors out the window, because the TSP is just plain weird. It's an enormous piece of silicon containing almost nothing but Vector and Matrix processing units and cache: no controllers or backend whatsoever. The compiler has direct control.
The TSP is divided into 20 superlanes. Each superlane is built from, in order from left to right: a Matrix Unit (320 MACs), a Switch Unit, a Memory Unit (5.5 MB), a Vector Unit (16 ALUs), a Memory Unit (5.5 MB), a Switch Unit, and a Matrix Unit (320 MACs). You'll notice the components are mirrored around the Vector Unit; this divides the superlane into two hemispheres that can operate almost independently.
The instruction stream (there is only one) is fed into each component of superlane 0, with 6 instructions for the Matrix Units, 14 for the Switch Units, 44 for the Memory Units, and 16 for the Vector Unit. Every clock cycle, the units perform their operations and move each piece of data to wherever it goes next within the superlane. Each component can send and receive 512 bytes to and from its next-door neighbors.
Once the superlane's operations are complete, it passes everything down to the next superlane and receives whatever the superlane above it (or the instruction controller) has. Instructions always travel vertically down between the superlanes, while data only moves horizontally within a superlane.
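To make that flow concrete, here's a minimal Python toy model of the scheme. This is not Groq's ISA or toolchain: the unit names, per-unit instruction counts, and the 20-superlane layout come from the description above, while the classes, function names, and dummy "bundle" strings are invented purely for illustration.

```python
# Toy model of the TSP dataflow described above. Unit names and counts
# come from the article; everything else is invented for illustration.

NUM_SUPERLANES = 20

class Superlane:
    """One superlane: units mirrored around the Vector Unit."""

    # Left-to-right layout: (unit name, instructions per bundle).
    LAYOUT = [
        ("Matrix Unit",  6),   # 320 MACs
        ("Switch Unit", 14),
        ("Memory Unit", 44),   # 5.5 MB
        ("Vector Unit", 16),   # 16 ALUs
        ("Memory Unit", 44),   # 5.5 MB
        ("Switch Unit", 14),
        ("Matrix Unit",  6),   # 320 MACs
    ]

    def __init__(self, index):
        self.index = index
        self.bundle = None  # instruction bundle currently held

    def execute(self):
        # Data moves horizontally: every cycle each unit can pass up to
        # 512 bytes to its next-door neighbors within this superlane.
        for _unit, _n_instructions in self.LAYOUT:
            pass  # placeholder for the unit's actual operation

def run(lanes, program, cycles):
    """Instructions enter at superlane 0 and march down one lane per cycle."""
    for _ in range(cycles):
        # Vertical flow: shift every bundle down one superlane,
        # bottom lane first so nothing gets overwritten.
        for i in range(len(lanes) - 1, 0, -1):
            lanes[i].bundle = lanes[i - 1].bundle
        lanes[0].bundle = next(program, None)  # top lane gets a fresh bundle
        for lane in lanes:
            if lane.bundle is not None:
                lane.execute()

lanes = [Superlane(i) for i in range(NUM_SUPERLANES)]
run(lanes, iter(["bundle-0", "bundle-1", "bundle-2"]), cycles=25)
```

The key property the sketch captures is that there is no runtime scheduling at all: because instructions march down the superlanes in lockstep, the compiler knows exactly where every piece of data is on every cycle.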
| | Groq TSP | Nvidia Tesla V100 | Nvidia Tesla T4 |
|---|---|---|---|
| Maximum Frequency | 1250 MHz | 1530 MHz | 1590 MHz |
| FP16 TFLOPS | 205 | 125 | 65 |
| INT8 TOPS | 1000 | 250 | 130 |
| On-chip Cache (L1) | 220 MB | 10 MB | 2.6 MB |
| Board Memory | N/A | 32 GB HBM2 | 16 GB GDDR6 |
| Board Power (TDP) | 300 W | 300 W | 70 W |
| Die Area | 725 mm² | 815 mm² | 545 mm² |
All of that makes for a processor that's extremely good at neural network training and inference, and incapable of anything else. To put some benchmarks to it: on ResNet-50 it can perform 20,400 inferences per second (I/S) at any batch size, with an inference latency of 0.05 ms.
Nvidia's Tesla V100 can perform 7,907 I/S at a batch size of 128, or 1,156 I/S at a batch size of one (batch sizes usually aren't this low, but it demonstrates the TSP's versatility). Its latency is 16 ms at batch 128 and 0.87 ms at batch one. Clearly, the TSP outperforms Nvidia's most comparable card in this workload.
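As a quick sanity check on those figures, here is the arithmetic spelled out; it uses only the numbers quoted above:

```python
# Back-of-the-envelope comparison of the ResNet-50 figures quoted above.
groq_ips, groq_latency_ms = 20_400, 0.05
v100_ips_b128, v100_latency_b128_ms = 7_907, 16.0
v100_ips_b1 = 1_156

print(f"Throughput advantage at batch 128: {groq_ips / v100_ips_b128:.1f}x")  # ~2.6x
print(f"Throughput advantage at batch 1:   {groq_ips / v100_ips_b1:.1f}x")    # ~17.6x
print(f"Latency advantage vs batch 128:    {v100_latency_b128_ms / groq_latency_ms:.0f}x")  # 320x
```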
One of the TSP's strengths is how much L1 cache it has, but that cache is also all it has. If a neural network grows beyond that capacity, or if it's working with very large inputs, performance will suffer badly. Nvidia's cards have gigabytes of onboard memory to handle that scenario.
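A rough sizing sketch shows why. The parameter counts below are commonly cited approximate figures for ResNet-50 and BERT-large (they are not from this article), and the calculation covers weights only, ignoring activations and intermediate buffers:

```python
# Rough check: do a model's weights fit in the TSP's 220 MB of on-chip memory?
# Parameter counts are approximate, commonly cited figures (assumptions, not
# from the article); activations and buffers are ignored.
ON_CHIP_MB = 220

models = {
    "ResNet-50":  25.6e6,   # ~25.6M parameters
    "BERT-large": 340e6,    # ~340M parameters
}

for name, params in models.items():
    for dtype, bytes_per_param in (("INT8", 1), ("FP16", 2)):
        size_mb = params * bytes_per_param / 1e6
        verdict = "fits" if size_mb <= ON_CHIP_MB else "does NOT fit"
        print(f"{name} @ {dtype}: {size_mb:.0f} MB -> {verdict}")
```

Under these assumptions, ResNet-50 fits comfortably at either precision, while a model the size of BERT-large would blow past the on-chip capacity even at INT8.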
That sums up the TSP rather well. In specific workloads it's more than twice as powerful as the Tesla V100, but if your workload varies, or if heaven forbid you want to compute at more than half precision, you can't. The TSP definitely has a future in areas like self-driving cars, where the volume of input is predictable and the neural network can be guaranteed to fit. There, its impressive latency, 320x better than Nvidia's, means the vehicle can respond faster.
The TSP is currently available to select customers as an accelerator in the Nimbix Cloud.