Groq, and the Hardware of AI — Intuitively and Exhaustively Explained
An analysis of the major pieces of computer hardware used to run AI, along with a new heavy hitter.
This article discusses Groq, a new approach to computer hardware that’s revolutionizing the way AI is applied to real world problems.
Before we talk about Groq, we’ll break down what AI fundamentally is, and explore some of the key components of computer hardware used to run AI models: namely, CPUs, GPUs, and TPUs. We’ll explore these critical pieces of hardware by starting in 1976 with the Z80 CPU, then we’ll build up our understanding to modern systems by exploring some of the critical evolutions in computer hardware.
Armed with an understanding of some of the fundamental concepts and tradeoffs in computer hardware, we’ll use that understanding to explore what Groq is, how it’s revolutionizing the way AI computation is done, and why that matters.
Naturally there’s a lot to cover between early CPUs and a cutting edge billion dollar AI startup. Thus, this is a pretty long article. Buckle up, it’ll be worth it.
Who is this useful for? Anyone interested in artificial intelligence, and the realities of what it takes to run AI models.
How advanced is this post? This post contains cutting edge ideas from a cutting edge AI startup, and explains them assuming no prior knowledge. It’s relevant to readers of all levels.
Pre-requisites: None, but there is a curated list of resources at the end of the article for related reading.
Disclaimer 1: this article isn’t about Elon Musk’s chat model “Grok”. Groq and Grok are completely unrelated, besides the fact that their names are based on the same book.
Disclaimer 2: During the time of writing I am not affiliated with Groq in any way. All opinions are my own and are unsponsored. Also, thank you to the Groq team for clarifying technical details and pointing me in the right direction. Specifically, thanks to Andrew Ling, VP of software engineering at Groq. I requested a few meetings with Andrew, and he was gracious enough to help me untangle some of the more subtle nuances of Groq’s hardware. Without those conversations this article wouldn’t have been possible.
Defining AI
Many people think of AI as a black box. You put stuff in, the AI does some obscure math, and you get stuff out.
This intuition is popular because it’s sometimes difficult for a human to understand a model’s thought process.
Even though it can be difficult to understand the rationale behind a model’s decision as a whole, under the hood AI models employ very simple mathematical operations to come to their conclusions.
In other words, the reason AI models are complicated isn’t because they do complicated things, it’s because they do a ton of simple things, all at once.
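To make that concrete, here’s a minimal sketch of a single artificial “neuron”, the building block most modern AI models are stacked out of. All the numbers are made up purely for illustration; the point is that every step is just a multiply or an add.

```python
# A single artificial neuron, built from nothing but multiplies and adds.
# All of the numbers below are invented for illustration.

inputs = [0.5, -1.2, 3.0]    # hypothetical inputs to the neuron
weights = [0.8, 0.1, -0.4]   # hypothetical learned weights
bias = 0.2                   # hypothetical learned bias

output = bias
for x, w in zip(inputs, weights):
    output += x * w          # one multiply and one add per input

print(output)  # a full AI model is just a huge number of steps like these
```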
There are a lot of hardware options that can be used to run AI. Let’s start by describing the most fundamental one.
The CPU
The main component of most modern computers is the CPU, or “Central Processing Unit”. CPUs are the beating heart of virtually every modern computer.
At its most fundamental, the CPU is based on the “Von Neumann architecture”.
The Von Neumann architecture is pretty abstract; there’s a lot of leeway in terms of putting it into practice. Pretty much all the hardware we’ll be discussing in this article can be thought of as particular flavors of a Von Neumann device, including the CPU.
A popular early computer, the ZX Spectrum, employed the Z80 CPU to get stuff done. Conceptually, modern CPUs aren’t very different from the Z80, so we can use the Z80 as a simplified example to begin understanding how CPUs work.
Even the diagram for this humble CPU is fairly complex, but we can pick it apart to get an idea of some of the core components, which largely persist into modern CPUs.
The Z80 featured a control circuit, which converted low level instructions into actual actions within the chip and kept track of bookkeeping details, like which instruction the CPU was supposed to execute next.
The Z80 featured an “arithmetic logic unit” (or ALU for short) which was capable of doing a variety of basic arithmetic and logic operations. This is the thing that did a lot of the actual computing within the Z80 CPU. The Z80 would get a few pieces of data into the input of the ALU, then the ALU would add them, subtract them, compare them, or do some other basic operation based on the current instruction being run by the CPU.
Virtually any complex mathematical function can be divided into simple steps. The ALU is designed to be able to do the most fundamental basic math, meaning a CPU is capable of very complex math by using the ALU to do many simple operations.
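As an illustration of that idea, here’s a sketch of how a chip with only an adder and a bit shifter can still multiply, using the classic shift-and-add trick (the Z80 had no multiply instruction, so software had to do something along these lines).

```python
# Multiplying two numbers using only adds and bit shifts — the kind of
# simple steps an ALU actually performs. A sketch for illustration.

def multiply(a: int, b: int) -> int:
    """Multiply two non-negative integers without a multiply instruction."""
    result = 0
    while b > 0:
        if b & 1:          # if the lowest bit of b is set...
            result += a    # ...add the current (shifted) value of a
        a <<= 1            # shift a left: a = a * 2
        b >>= 1            # shift b right: b = b // 2
    return result

print(multiply(13, 11))  # 143, computed entirely from adds and shifts
```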
The Z80 also contained a bunch of registers. Registers are tiny, super fast pieces of memory that exist within the CPU to store certain key pieces of information like which instruction the CPU is currently running, numerical data, addresses to data outside the CPU, etc.
When one thinks of a computer it’s easy to focus on circuits doing math, but in reality a lot of design work needs to go into where data gets stored. The question of how data gets stored and moved around is a central topic in this article, and plays a big part as to why modern computing relies on so many different specialized hardware components.
The CPU needs to talk with other components in the computer, which is the job of the buses. The Z80 CPU had three buses:
The Address Bus communicated data locations the Z80 was interested in
The Control Bus communicated what the CPU wanted to do
The Data Bus communicated actual data coming to and from the CPU
So, for instance, if the Z80 wanted to read some data from RAM and put that information onto a local register for calculation, it would use the address bus to communicate what data it was interested in, then it would use the control bus to communicate that it wanted to read data, then it would receive that data over the data bus.
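Here’s a toy sketch of that read transaction in Python. The classes and method names are invented; real buses are wires and timing signals, not objects, but the division of labor is the same.

```python
# A toy model of the CPU <-> RAM handshake described above.
# Everything here is invented for illustration.

class RAM:
    def __init__(self, contents):
        self.contents = contents          # address -> value

    def read(self, address):
        return self.contents[address]

class CPU:
    def __init__(self, ram):
        self.ram = ram
        self.register_a = None            # a local register

    def load(self, address):
        # address bus: which location we care about
        # control bus: "this is a read"
        # data bus:    the value that comes back
        self.register_a = self.ram.read(address)

ram = RAM({0x0040: 42})
cpu = CPU(ram)
cpu.load(0x0040)
print(cpu.register_a)  # 42, now sitting in a register ready for the ALU
```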
The whole point of this song and dance is to allow the CPU to perform the “Fetch, Decode, Execute” cycle. The CPU “fetches” some instruction, then it “decodes” that instruction into actual actions for specific components in the CPU to undertake, then the CPU “executes” those actions. The CPU then fetches a new instruction, restarting the cycle.
This cycle works in concert with a program. Humans often think of programs as written in a programming language like Java or Python, but after the text of a program is translated by a compiler into machine code, and that machine code is transmitted to the CPU, a program ends up looking very different. Essentially, the compiler turns the program written by a human into a list of instructions which the CPU can perform based on its predefined control logic.
Once the code has been compiled, the CPU simply fetches an instruction, decodes it into actions within the CPU, then executes those actions. The CPU keeps track of where it is with a program counter, which usually increments each time an instruction is called, but it might also jump around the program based on some logic, like an if statement.
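If you want to see the fetch, decode, execute cycle in action, here’s a toy interpreter for an invented instruction set. It’s nothing like real machine code, but the loop at the bottom is the same shape as what a CPU does billions of times per second.

```python
# A toy fetch-decode-execute loop. The instruction set, registers, and
# program are all invented for illustration.

program = [
    ("LOAD", "A", 5),      # put 5 into register A
    ("LOAD", "B", 7),      # put 7 into register B
    ("ADD",  "A", "B"),    # A = A + B
    ("PRINT", "A", None),  # print register A
    ("HALT", None, None),  # stop
]

registers = {"A": 0, "B": 0}
program_counter = 0

while True:
    instruction = program[program_counter]   # fetch
    program_counter += 1                     # usually just increments
    op, arg1, arg2 = instruction             # decode
    if op == "LOAD":                         # execute
        registers[arg1] = arg2
    elif op == "ADD":
        registers[arg1] = registers[arg1] + registers[arg2]
    elif op == "PRINT":
        print(registers[arg1])               # prints 12
    elif op == "HALT":
        break
```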
And that’s basically it. It turns out, even a simple CPU is capable of doing pretty much any calculation imaginable just by following a series of simple instructions. The trick, really, is getting the CPU to do those instructions quickly.
Design Constraints of the CPU
The Z80 was a fairly simplistic CPU. For one thing, it was a single “core”. The actual specifics of a core can get a little complicated, but a core is essentially a thing that does work on a CPU. Imagine instead of one Z80, we had a few Z80s packed together on a single chip, all doing their own thing. That’s essentially what a modern multi-core CPU is.
It is possible for the cores in a CPU to talk with each other, but to a large extent they usually are responsible for different things. Each of these “different things” is called a “process”. A process, in formal computer speech, is a program and memory which exists atomically. You can have multiple processes on different cores, and they generally won’t talk to one another.
Chrome actually uses a separate process for each tab. That’s why, when one tab crashes, other tabs are not affected. Each of the tabs is a completely separate process. That’s also, to a large extent, why Chrome consumes so much memory on a computer; all of these tabs are operating in near isolation from one another, which means they each need to keep track of a bunch of information.
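Here’s a small sketch of that isolation using Python’s multiprocessing module. The child process gets its own copy of memory, so the counter it changes isn’t the parent’s counter, much like one tab can’t reach into another’s data.

```python
# Two processes, two separate copies of memory. A minimal sketch.

import multiprocessing

counter = 0  # lives in the parent process's memory

def child_task():
    global counter
    counter += 100                      # changes the child's copy only
    print("child sees:", counter)       # 100

if __name__ == "__main__":
    p = multiprocessing.Process(target=child_task)
    p.start()
    p.join()
    print("parent sees:", counter)      # still 0 — the processes are isolated
```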
Sometimes it’s useful for a computer to be able to break up multiple calculations within a single program and run them in parallel. That’s why there are usually multiple “threads” within each core of a CPU. Threads can share (and cooperatively work together) on the same address space in memory.
Threads are useful if you have a bunch of calculations which don’t depend on one another within a program. Instead of doing things back to back, you can do calculations in parallel across multiple threads.
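A quick sketch of that idea, using Python’s thread pool to fan independent calculations out across threads. (CPython’s GIL limits how much pure-Python math actually runs in parallel, so treat this as a picture of the structure rather than a benchmark.)

```python
# Independent calculations handed out to a pool of threads.

from concurrent.futures import ThreadPoolExecutor

def independent_work(n: int) -> int:
    # a stand-in for a calculation that doesn't depend on the others
    return sum(i * i for i in range(n))

inputs = [100_000, 200_000, 300_000, 400_000]

with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(independent_work, inputs))

print(results)  # each item was computed on its own thread
```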
So, a CPU can have multiple cores to run separate processes, and each of those cores has threads to allow for some level of parallel execution. Also, CPUs contain computational units (ALUs) that can do pretty much any calculation imaginable. So… we’re done, right? We just need to make bigger and more powerful CPUs and we can do anything.
Not quite.
As I previously mentioned, CPUs are the beating heart of a computer. A CPU has to be able to do any of the arbitrary calculations necessary to run any program, and it has to be able to do those calculations quickly to keep your computer's response time near instantaneous.
To synchronize the rapid executions in a CPU, your computer has a quartz clock. A quartz clock ticks at steady intervals, allowing your computer to keep operations in an orderly lock step.
These clocks are crazy fast. On the Raspberry Pi pictured above the clock oscillates at 19.2MHz, but modern CPUs can reach into the gigahertz range. For those not familiar with these units, 1 gigahertz is one billion oscillations per second, and with each of those oscillations the CPU is expected to fetch and execute an instruction for every core. Here’s a picture of a billion things, for reference:
So, CPUs do things fast. They run based on a clock that’s oscillating billions of times per second. In fact, that’s so fast that the speed of light (the speed limit of the universe) starts coming into play.
The speed of electricity through silicon is around 60 million meters per second. A typical CPU clock might oscillate 3 billion times per second. That would mean that electricity can only travel 20mm (less than an inch) within a CPU for every clock tick.
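That’s the whole calculation, spelled out:

```python
# The back-of-the-envelope math above, using the same numbers.

signal_speed = 60_000_000       # meters per second, through silicon
clock_rate = 3_000_000_000      # ticks per second (3 GHz)

distance_per_tick = signal_speed / clock_rate
print(distance_per_tick * 1000, "mm per clock tick")   # 20.0 mm
```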
If CPUs get much bigger than they currently are then designers will run into serious challenges in terms of keeping operations across the chip synchronized. Imagine one part of a CPU trying to do addition, but one of the values didn’t fully arrive yet because the source of that information is over an inch away. This issue is certainly solvable… with more components, which take up space within the CPU, further exacerbating the issue.
There are other issues that I won’t go into; the cost of manufacturing square chips on round silicon wafers, intricacies in terms of cooling, etc. Basically, “just making CPUs bigger” has a lot of serious issues.
Another serious design consideration arises when we consider what CPUs are chiefly responsible for.
I mentioned CPUs have to be fast. More specifically, they need to have incredibly low latency. Latency, generally, is the amount of time something takes from start to finish. When you’re running a program sequentially, the latency of each execution has a big impact on performance.
So, the CPU attempts to minimize latency as much as possible. The cores in a CPU need to be blazingly fast, the data needs to be right there, ready to go. There’s not a lot of wiggle room.
CPUs optimize for latency in a lot of ways. They have a special type of memory called cache, which is designed to store important data close to the CPU’s cores so that it can be retrieved quickly.
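Real caches are hardware, not software, but the idea is easy to sketch: keep recently used data somewhere cheap to reach, and only go out to slow memory when you have to. Everything below is invented for illustration.

```python
# A conceptual sketch of caching. The "main memory" here is just a slow
# function; a real CPU cache is dedicated hardware sitting next to the cores.

import time

def read_from_main_memory(address: int) -> int:
    time.sleep(0.001)            # pretend main memory is slow
    return address * 2           # pretend this is the stored value

cache = {}                       # small, fast storage "close to the CPU"

def read(address: int) -> int:
    if address in cache:         # cache hit: fast path
        return cache[address]
    value = read_from_main_memory(address)   # cache miss: slow path
    cache[address] = value
    return value

# Reading the same few addresses over and over mostly hits the cache,
# so only 4 of these 400 reads pay the slow-memory price.
for _ in range(100):
    for address in (0, 1, 2, 3):
        read(address)
```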
CPUs employ a lot of other fancy technologies that, frankly, I don’t fully understand. Intel spends approximately $16 billion annually on research and development. A decent chunk of that goes into squeezing as much performance out of this rock as possible. For CPUs, that means more cores, more threads, and lower latency.
Making faster and faster CPUs was the focal point of computation until the 90s. Then, suddenly, a new type of consumer emerged with a new type of performance requirement.
Enter Gamers
With video games, CPUs don’t stand a chance.
CPUs are great for running sequential programs, and even have some ability to parallelize computation, but to put video on a screen one needs to render millions of pixels many times a second.
On top of that, the values of those pixels are based on 3D models, each of which might be made of thousands of polygons. A lot of independent calculations need to be done to turn video games into actual video, a very different use case from the rapid, sequential tasks the CPU was designed to handle.
There were a variety of chips designed to help the CPU in handling this new load. The most famous was the GPU.
The Origins of the GPU
The first mainstream GPU was created by Nvidia: the GeForce 256.
The idea behind this particular device was to offload expensive graphical processing onto purpose built hardware. The GeForce 256 was a very rigid and specific machine, designed to handle the “graphics pipeline”. The CPU would give the 256 a bunch of information about 3D models, materials, etc. and the GPU would do all the processing necessary to generate an image onto the screen.
The 256 did this with specialized chips designed to do very specific operations; hardware for moving models in 3D space, calculating information about lighting, calculating if this was on top of that, etc. We don’t need to get too into the weeds of computer graphics, and can cut straight to the punchline: this specialized hardware improved the framerate of some games by up to 50%, which is a pretty monumental performance increase.
Naturally the first ever GPU wasn’t perfect. In being so rigidly designed it didn’t have a lot of flexibility to meet different needs for different applications. Seven years after the GeForce 256, Nvidia released the first modern GPU, which improved on the original GPU in many ways.
Modern GPUs
The Nvidia GeForce 8 was the first modern GPU, and really set the stage for what GPUs are today.
Instead of employing components designed to do specific graphical operations, the GeForce 8 series was a lot more like a CPU. It had general purpose cores which could do arbitrary calculations.
However, instead of focusing on low latency calculations for sequential programs, the GeForce 8 focused on a high throughput for parallel computation. In other words, CPUs do things back to back really quickly. GPUs are designed to do things a bit more slowly, but in parallel. This is a distinction between the CPU and the GPU which persists to this day.
The GPU achieved parallel computation by employing the “Single Instruction, Multiple Data” (SIMD) approach to computation, allowing multiple cores to be controlled simultaneously. Also, by not caring too much about the latency of any particular calculation, the GPU can get away with a bit more setup time, absorbing the overhead of setting up numerous computations at once. This makes a GPU really bad for running a sequential program, but very good for doing numerous calculations in parallel (like those needed to render graphics). Also, because the GPU was designed to do specific calculations for graphics (instead of anything under the sun like a CPU), the GPU can get away with smaller cores and less complex control logic. The end result is a lot of cores designed to do as many parallel calculations as possible.
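NumPy running on a CPU isn’t a GPU, but it gives a decent feel for the SIMD idea: one operation, applied across a whole pile of data at once, instead of a loop that touches one value at a time.

```python
# One instruction-like call operating on many values at once (SIMD flavor).

import numpy as np

a = np.random.rand(1_000_000)
b = np.random.rand(1_000_000)

# One element at a time — the sequential, CPU-loop way of thinking:
slow = [x * y for x, y in zip(a, b)]

# All elements at once — the single-instruction, multiple-data way:
fast = a * b

print(np.allclose(slow, fast))  # True: same result, very different structure
```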
To give you an idea of how different the capabilities are, the Intel Xeon 8280 CPU has 896 available threads across all its cores. The Nvidia A100 GPU has 221,184 threads available across all its cores. That’s roughly 247x the number of threads to do parallel computation, all made possible because GPUs don’t care (as much as CPUs) about latency.
GPUs and AI
As you might be able to imagine, graphics aren’t the only thing that can benefit from parallel computation. As GPUs took off, so did their use cases. Quickly GPUs became a fundamental building block in a variety of disciplines, including AI.
Back at the beginning of the article I provided a simplified demonstration of AI.
Running this simple AI model with a CPU might look something like this:
Add up all the values in the first vector. That would be 24 calculations.
Multiply the result by 3, 2, and -0.1. That would be three more calculations.
Add those results together, two more calculations.
Multiply that result by 0.03 and 0.003, two more calculations.
That’s 31 sequential calculations. Instead, you could parallelize it.
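Here’s a sketch of both approaches, using the multipliers from the list above (3, 2, and -0.1, then 0.03 and 0.003). The 25 input values are made up, since the original vector comes from a figure earlier in the article.

```python
import numpy as np

values = np.arange(25, dtype=float)   # 25 hypothetical inputs

# Sequential version — roughly the 31 steps listed above:
total = values[0]
for v in values[1:]:                                  # 24 additions
    total += v
a, b, c = total * 3, total * 2, total * -0.1          # 3 multiplications
combined = a + b + c                                  # 2 additions
out1, out2 = combined * 0.03, combined * 0.003        # 2 multiplications
print(out1, out2)

# Parallel-friendly version — the additions and the independent
# multiplications are exactly the kind of work a GPU spreads across cores:
total_p = values.sum()
combined_p = (total_p * np.array([3.0, 2.0, -0.1])).sum()
print(combined_p * np.array([0.03, 0.003]))
```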