Groq: Language Processing Unit
A company called Groq (not to be confused with Grok, the LLM created by the company formerly known as Twitter) claims to have made an LPU -- a "language processing unit": a piece of hardware, analogous to a GPU, but specialized for large language models. Not specialized for just neural networks -- chips like that already exist, have existed for years, and are called TPUs (tensor processing units) -- but specialized for large language models specifically.
"An LPU Inference Engine, with LPU standing for Language Processing Unit, is a new type of processing system invented by Groq to handle computationally intensive applications with a sequential component to them such as LLMs."
"An LPU inference engine has the following characteristics: Exceptional sequential performance, Single core architecture, Synchronous networking that is maintained even for large scale deployments, Ability to auto-compile >50B LLMs, Instant memory access, and High accuracy that is maintained even at lower precision levels."
"How performant? Today, we are running Llama-2 70B at over 300 tokens per second per user."
What they've done here is restructure the chips. Traditionally, chips -- CPUs, GPUs, and now TPUs -- have the concept of "cores": processing units that retrieve data from memory, perform some computation, and store the results back. A CPU has relatively few of them, and they are very complex and general-purpose. A GPU or TPU has many more cores, and they are simpler and more special-purpose.
Here, though, they do away with the concept of "cores" entirely and embrace the concept of "streams" instead. The chips are structured so that the memory retrieval circuitry sits along the sides, and once data is retrieved, it flows in a "stream" from the left side of the chip to the right side. Actually, the flow is bi-directional -- there is data flowing from the right side to the left as well -- but for the sake of keeping the explanation simple, it's fine to just picture data flowing from left to right. When the data reaches the right side, the results are moved off the chip to memory.
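To make the contrast concrete, here's a toy sketch (purely illustrative -- none of these names or details come from Groq): in the core model every step is a round trip to memory, while in the stream model data is read once at one edge, handed from one functional unit to the next, and written back only at the far edge.

```python
import numpy as np

# Toy "core" model: every operation is a load-compute-store round trip.
def core_model(memory, steps):
    for inputs, op, output in steps:
        operands = [memory[k] for k in inputs]  # fetch operands from memory
        memory[output] = op(*operands)          # compute, then store back
    return memory

# Toy "stream" model: data enters at one edge, flows through a fixed sequence
# of functional units, and only the final result is written back to memory.
def stream_model(x, stages):
    for stage in stages:
        x = stage(x)   # handed directly to the next unit, no memory round trip
    return x           # stored only once, at the far edge

W = np.random.randn(8, 8)
x = np.random.randn(4, 8)

mem = core_model({"x": x, "W": W},
                 [(["x", "W"], lambda a, b: a @ b, "h"),
                  (["h"], lambda h: np.maximum(h, 0.0), "y")])
y = stream_model(x, [lambda v: v @ W, lambda v: np.maximum(v, 0.0)])
assert np.allclose(mem["y"], y)  # same math, very different memory traffic
```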
But doesn't that mean all the computations have been baked into the hardware between the memory retrieval on the left side and the memory storage on the right side? It would be, except that along the bottom there is an "instruction control and dispatch" circuit that extends across the entire width of the chip. It feeds instructions into every column, telling it what computation to perform. The options are "matrix" operations (matrix multiply-and-accumulate), "vector" operations (your activation functions such as ReLU, TanH, and so on), and "switch" operations (things like matrix transposition), plus additional instructions that relate to technical details of the chips and the streams. But the matrix, vector, and switch operations are the heart of the system.
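Here's a similarly rough sketch of the dispatch idea, with opcode names that mirror the prose above rather than Groq's actual instruction set: the column hardware is fixed, but what each column does on a given pass is chosen by the instruction stream fed in along the bottom.

```python
import numpy as np

# Hypothetical dispatch table; the opcode names come from the description
# above, not from any real ISA.
DISPATCH = {
    "MATRIX": lambda x, w: x @ w,               # multiply-and-accumulate
    "VECTOR": lambda x, _: np.maximum(x, 0.0),  # activation (ReLU here)
    "SWITCH": lambda x, _: x.T,                 # data movement, e.g. transpose
}

def run_columns(x, instructions):
    """Stream x across the columns, each executing its dispatched opcode."""
    for opcode, operand in instructions:
        x = DISPATCH[opcode](x, operand)
    return x

W = np.random.randn(8, 8)
out = run_columns(np.random.randn(4, 8),
                  [("MATRIX", W), ("VECTOR", None), ("SWITCH", None)])
print(out.shape)  # (8, 4): matmul, then ReLU, then transpose
```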
Not only did they rethink the architecture of the chips, but also the PCs and datacenters they are part of. The chips reside on specially constructed PCIe cards. The machines they reside in are networked in a Dragonfly topology, designed to eliminate routers and anything else that could introduce non-deterministic behavior. The idea is to make the entire network deterministic, so that all of the chips on all of the PCIe boards can run in a synchronized manner. This actually surprised me, as it seems to go in the opposite direction of just about all the chip design I've heard of -- usually you want chips to be able to handle whatever data shows up whenever it shows up, and to have as few synchronous execution requirements as possible. There's a system clock for the whole chip that controls the speed of the transistors, but beyond that I was under the impression that chips are built to minimize exact timing requirements. Here they went in the opposite direction and maximized it.
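As a rough illustration of what that determinism buys you (the cycle counts here are invented, not Groq's): if every link has a fixed, known latency, a compiler can lay out ahead of time exactly which cycle each chip sends and receives data, with no dynamic routing or handshaking at run time.

```python
# Toy sketch: with fixed, known link latencies, arrival times can be computed
# entirely at compile time -- nothing is negotiated while the system runs.
links = {("chip0", "chip1"): 4, ("chip1", "chip2"): 4}  # fixed hop latencies

def schedule(sends):
    """Turn (src, dst, send_cycle) tuples into a static arrival schedule."""
    return [(src, dst, t, t + links[(src, dst)]) for src, dst, t in sends]

plan = schedule([("chip0", "chip1", 0), ("chip1", "chip2", 10)])
for src, dst, t_send, t_arrive in plan:
    print(f"{src} -> {dst}: sent at cycle {t_send}, arrives at cycle {t_arrive}")
```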