Andrew Feldman, Co-Founder & CEO of Cerebras Systems – Interview Series

Andrew is co-founder and CEO of Cerebras Systems. He is an entrepreneur dedicated to pushing boundaries in the compute space. Prior to Cerebras, he co-founded and was CEO of SeaMicro, a pioneer of energy-efficient, high-bandwidth microservers. SeaMicro was acquired by AMD in 2012 for $357M. Before SeaMicro, Andrew was the Vice President of Product Management, Marketing and BD at Force10 Networks, which was later sold to Dell Computing for $800M. Prior to Force10 Networks, Andrew was the Vice President of Marketing and Corporate Development at RiverStone Networks from the company's inception through its IPO in 2001. Andrew holds a BA and an MBA from Stanford University.

Cerebras Systems is building a new class of computer system, designed from first principles for the singular goal of accelerating AI and changing the future of AI work.

Could you share the genesis story behind Cerebras Systems?

My co-founders and I all worked together at a previous startup that my CTO Gary and I started back in 2007, called SeaMicro (which was sold to AMD in 2012 for $334 million). My co-founders are some of the leading computer architects and engineers in the industry – Gary Lauterbach, Sean Lie, JP Fricker and Michael James. When we got the band back together in 2015, we wrote two things on a whiteboard – that we wanted to work together, and that we wanted to build something that would transform the industry and be in the Computer History Museum, which is the computing equivalent of the Hall of Fame. We were honored when the Computer History Museum recognized our achievements and added the WSE-2 processor to its collection last year, citing how it has transformed the artificial intelligence landscape.

Cerebras Systems is a team of pioneering computer architects, computer scientists, deep learning researchers, and engineers of all types who love doing fearless engineering. Our mission when we came together was to build a new class of computer to accelerate deep learning, which has emerged as one of the most important workloads of our time.

We realized that deep learning has unique, massive, and growing computational requirements, and that it is not well matched by legacy machines like graphics processing units (GPUs), which were fundamentally designed for other work. As a result, AI today is constrained not by applications or ideas, but by the availability of compute. Testing a single new hypothesis – training a new model – can take days, weeks, or even months and cost hundreds of thousands of dollars in compute time. That is a major roadblock to innovation.

So the genesis of Cerebras was to build a new type of computer optimized exclusively for deep learning, starting from a clean sheet of paper. To meet the enormous computational demands of deep learning, we designed and manufactured the largest chip ever built – the Wafer-Scale Engine (WSE). In creating the world's first wafer-scale processor, we overcame challenges across design, fabrication and packaging, all of which had been considered impossible for the entire 70-year history of computers. Every element of the WSE is designed to enable deep learning research at unprecedented speed and scale, powering the industry's fastest AI supercomputer, the Cerebras CS-2.

With every component optimized for AI work, the CS-2 delivers more compute performance in less space and with less power than any other system. It does this while radically reducing programming complexity, wall-clock compute time, and time to solution. Depending on the workload, from AI to HPC, the CS-2 delivers hundreds or thousands of times more performance than legacy alternatives. The CS-2 provides the deep learning compute resources of hundreds of GPUs, with the ease of programming, management and deployment of a single system.

Over the past few months Cerebras seems to be all over the news. What can you tell us about the new Andromeda AI supercomputer?

We announced Andromeda in November of last year, and it is one of the largest and most powerful AI supercomputers ever built. Delivering more than 1 exaflop of AI compute and 120 petaflops of dense compute, Andromeda has 13.5 million cores across 16 CS-2 systems, and is the only AI supercomputer ever to demonstrate near-perfect linear scaling on large language model workloads. It is also dead simple to use.

By way of reminder, the largest supercomputer on Earth – Frontier – has 8.7 million cores. In raw core count, Andromeda is about one and a half times as large. It does different work, obviously, but this gives an idea of the scope: nearly 100 terabits of internal bandwidth, nearly 20,000 AMD EPYC cores feeding it, and – unlike the giant supercomputers, which take years to stand up – we stood Andromeda up in three days, and immediately thereafter it was delivering near-perfect linear scaling of AI.

Argonne National Labs was our first customer to use Andromeda, and they applied it to a problem that was breaking their 2,000-GPU cluster called Polaris. The problem was running very large GPT-3XL generative models while putting the entire COVID genome in the sequence window, so that you could analyze each gene in the context of the entire genome of COVID. Andromeda ran a unique genetic workload with long sequence lengths (MSL of 10K) across 1, 2, 4, 8 and 16 nodes, with near-perfect linear scaling. Linear scaling is among the most sought-after characteristics of a big cluster. Andromeda delivered 15.87X throughput across 16 CS-2 systems, compared to a single CS-2, and a reduction in training time to match.
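To put that figure in perspective, a quick back-of-the-envelope check (my own arithmetic, not Cerebras tooling) shows how close 15.87X on 16 systems is to perfectly linear scaling:

```python
# Scaling efficiency implied by the Andromeda result quoted above.
systems = 16
measured_speedup = 15.87            # throughput on 16 CS-2s vs. one CS-2

efficiency = measured_speedup / systems   # 1.0 would be perfectly linear
print(f"scaling efficiency: {efficiency:.1%}")   # -> 99.2%
```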

Could you tell us about the partnership with Jasper that was unveiled in late November, and what it means for both companies?

Jasper is a really interesting company. They are a leader in generative AI content for marketing, and their products are used by more than 100,000 customers around the world to write copy for marketing, ads, books, and more. It is clearly a very exciting and fast-growing space right now. Last year, we announced a partnership with them to accelerate adoption and improve the accuracy of generative AI across enterprise and consumer applications. Jasper is using our Andromeda supercomputer to train its profoundly computationally intensive models in a fraction of the time. This will extend the reach of generative AI models to the masses.

With the power of the Cerebras Andromeda supercomputer, Jasper can dramatically advance AI work, including training GPT networks to fit AI outputs to all levels of end-user complexity and granularity. This improves the contextual accuracy of generative models and will enable Jasper to personalize content across multiple classes of customers quickly and easily.

Our partnership enables Jasper to invent the future of generative AI by doing things that are impractical or simply impossible with traditional infrastructure, and to accelerate the potential of generative AI, bringing its benefits to their rapidly growing customer base around the globe.

In a recent press release, the National Energy Technology Laboratory and Pittsburgh Supercomputing Center announced that they had pioneered the first-ever computational fluid dynamics simulation on the Cerebras Wafer-Scale Engine. Could you describe what a wafer-scale engine is, specifically, and how it works?

Our Wafer-Scale Engine (WSE) is the revolutionary AI processor for our deep learning computer system, the CS-2. Unlike legacy, general-purpose processors, the WSE was built from the ground up to accelerate deep learning: it has 850,000 AI-optimized cores for sparse tensor operations, massive high-bandwidth on-chip memory, and interconnect orders of magnitude faster than a traditional cluster could possibly achieve. Altogether, it gives you the deep learning compute resources of a cluster of legacy machines in a single system that is as easy to program as a single node, radically reducing programming complexity, wall-clock compute time, and time to solution.

Our second-generation WSE-2, which powers our CS-2 system, can solve problems extremely fast. Fast enough to allow real-time, high-fidelity models of engineered systems of interest. It is a rare example of successful "strong scaling", which is the use of parallelism to reduce solve time for a fixed-size problem.
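For reference, strong scaling has a standard textbook formulation (nothing Cerebras-specific): speedup at fixed problem size, and the parallel efficiency that goes with it.

```latex
% Strong scaling for a fixed-size problem on N parallel units:
%   T(1) = time to solution on one unit, T(N) = time on N units.
S(N) = \frac{T(1)}{T(N)}, \qquad E(N) = \frac{S(N)}{N}
% Perfect strong scaling means S(N) = N, i.e. efficiency E(N) = 1.
```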

And that is what the National Energy Technology Laboratory and Pittsburgh Supercomputing Center are using it for. We just announced some really exciting results: a computational fluid dynamics (CFD) simulation, made up of about 200 million cells, running at near real-time rates. This video shows the high-resolution simulation of Rayleigh-Bénard convection, which occurs when a fluid layer is heated from the bottom and cooled from the top. These thermally driven fluid flows are all around us – from windy days, to lake-effect snowstorms, to magma currents in the earth's core and plasma motion in the sun. As the narrator says, it is not just the visual beauty of the simulation that matters: it is the speed at which we are able to calculate it. For the first time, using our Wafer-Scale Engine, NETL is able to manipulate a grid of nearly 200 million cells in near real-time.

What type of data is being simulated?

The workload tested was thermally driven fluid flow, also known as natural convection, which is an application of computational fluid dynamics (CFD). Fluid flows occur naturally all around us – from windy days, to lake-effect snowstorms, to tectonic plate movement. This simulation, made up of about 200 million cells, focuses on a phenomenon known as "Rayleigh-Bénard" convection, which occurs when a fluid is heated from the bottom and cooled from the top. In nature, this phenomenon can lead to severe weather events like downbursts, microbursts, and derechos. It is also responsible for magma movement in the earth's core and plasma movement in the sun.

Back in November 2022, NETL introduced a new field equation modeling API, powered by the CS-2 system, that was as much as 470 times faster than what was possible on NETL's Joule supercomputer. This means it can deliver speeds beyond what clusters of any number of either CPUs or GPUs can achieve. Using a simple Python API that enables wafer-scale processing for much of computational science, the WFA delivers gains in performance and usability that cannot be obtained on conventional computers and supercomputers – in fact, it outperformed OpenFOAM on NETL's Joule 2.0 supercomputer by over two orders of magnitude in time to solution.
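To make "field equation modeling" concrete, here is a minimal 2D heat-diffusion stencil in plain NumPy. It is only a sketch of the kind of per-cell update such a solver performs on every cell of the grid each step; it is not the WFA API itself, which the interview describes only as a simple Python interface.

```python
import numpy as np

# One explicit finite-difference step of the 2D heat equation, the
# simplest relative of the thermally driven flows described above.
def diffusion_step(T, alpha=0.1):
    """Update temperature field T on a uniform grid (fixed boundaries)."""
    Tn = T.copy()
    Tn[1:-1, 1:-1] = T[1:-1, 1:-1] + alpha * (
        T[2:, 1:-1] + T[:-2, 1:-1] + T[1:-1, 2:] + T[1:-1, :-2]
        - 4.0 * T[1:-1, 1:-1]
    )
    return Tn

# Toy domain: hot bottom wall, cold top wall, as in Rayleigh-Bénard setups.
T = np.zeros((200, 200))
T[-1, :] = 1.0                 # heated from below
for _ in range(100):
    T = diffusion_step(T)
```

On a CPU or GPU cluster, updates like this are bottlenecked by moving cell data between devices each step; keeping the whole grid on one wafer is what makes the near real-time rates above possible.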

Thanks to the simplicity of the WFA API, the results were achieved in just a few weeks, and they continue the close collaboration between NETL, PSC and Cerebras Systems.

By transforming the speed of CFD (which has always been a slow, offline task) on our WSE, we can open up a whole raft of new, real-time use cases for this and many other core HPC applications. Our goal is that by enabling more compute power, our customers can perform more experiments and invent better science. NETL lab director Brian Anderson has told us that this will dramatically accelerate and improve the design process for some really big projects NETL is working on around mitigating climate change and enabling a secure energy future – projects like carbon sequestration and blue hydrogen production.

Cerebras is consistently outperforming the competition when it comes to releasing supercomputers. What are some of the challenges behind building cutting-edge supercomputers?

Ironically, one of the hardest challenges of big AI is not the AI. It is the distributed compute.

To train today's state-of-the-art neural networks, researchers often use hundreds to thousands of graphics processing units (GPUs). And it is not easy. Scaling large language model training across a cluster of GPUs requires distributing a workload across many small devices, dealing with device memory sizes and memory bandwidth constraints, and carefully managing communication and synchronization overheads.

We have taken a completely different approach to designing our supercomputers through the development of the Cerebras Wafer-Scale Cluster and the Cerebras weight streaming execution mode. With these technologies, Cerebras addresses a new way to scale based on three key points (sketched in code after this list):

The replacement of CPU and GPU processing by wafer-scale accelerators such as the Cerebras CS-2 system. This change reduces the number of compute units needed to achieve an acceptable compute speed.

To meet the challenge of model size, we employ a system architecture that disaggregates compute from model storage. A compute service based on a cluster of CS-2 systems (providing ample compute bandwidth) is tightly coupled to a memory service (with large memory capacity) that provides subsets of the model to the compute cluster on demand. As usual, a data service serves up batches of training data to the compute service as needed.

An innovative model for the scheduling and coordination of training work across the CS-2 cluster, employing data parallelism, layer-at-a-time training with sparse weights streamed in on demand, and retention of activations in the compute service.
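The sketch below shows what this execution model implies for one training step. The names memory_service, compute_service, and data_service are hypothetical stand-ins for the three services described above, not Cerebras API names.

```python
# Hypothetical sketch of layer-at-a-time weight streaming, under the
# disaggregated architecture described in the three points above.
def train_step(layers, memory_service, compute_service, data_service):
    batch = data_service.next_batch()            # data service feeds training data
    activations = compute_service.load(batch.inputs)
    for layer in layers:                         # layer-at-a-time forward pass
        weights = memory_service.fetch(layer)    # weights streamed in on demand
        activations = compute_service.forward(layer, weights, activations)
        # activations stay resident in the compute service (the CS-2 cluster)
    grad = compute_service.loss_grad(activations, batch.labels)
    for layer in reversed(layers):               # backward pass, same streaming idea
        weights = memory_service.fetch(layer)
        grad, weight_grads = compute_service.backward(layer, weights, grad)
        memory_service.apply_update(layer, weight_grads)  # update model storage
```

The design choice this illustrates is that the wafer never has to hold the whole model: only one layer's weights are on the compute service at a time, so model size is limited by the memory service rather than by on-chip capacity.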

There have been fears of the end of Moore's Law for close to a decade. How many more years can the industry squeeze out, and what types of innovations are needed for this?

I think the question we are all grappling with is whether Moore's Law – as written by Moore – is dead. It is not taking two years to get more transistors. It is now taking four or five years. And those transistors are not coming at the same price – they are coming in at vastly higher prices. So the question becomes: are we still getting the same benefits of moving from seven to five to three nanometers? The benefits are smaller and they cost more, and so the solutions become more complicated than just the chip.

Jack Dongarra, a leading computer architect, gave a talk recently and said, "We've gotten much better at making FLOPs than at making I/O." That is really true. Our ability to move data off-chip lags our ability to increase the performance on a chip by a great deal. At Cerebras, we were happy when he said that, because it validates our decision to make a bigger chip and move less stuff off-chip. It also gives some guidance on future ways to make systems with chips perform better. There is work to be done, not just in wringing out more FLOPs but also in methods to move them and to move the data from chip to chip – even from very big chip to very big chip.

Is there anything else you would like to share about Cerebras Systems?

For better or worse, people often put Cerebras in this category of "the really big chip guys." We have been able to provide compelling solutions for very, very large neural networks, thereby eliminating the need to do painful distributed computing. I believe that is enormously interesting and at the heart of why our customers love us. The interesting space for 2023 will be how to do big compute to a higher level of accuracy, using fewer FLOPs.

Our work on sparsity provides an extremely interesting approach. We don't do work that doesn't move us toward the goal line, and multiplying by zero is a bad idea. We will be releasing a really interesting paper on sparsity soon, and I think there is going to be more effort looking at how we get to these efficient points, and how we do so for less power. And not just less power in training; how do we reduce the cost and power used in inference? I think sparsity helps on both fronts.
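"Multiplying by zero is a bad idea" can be made concrete with a toy example: a dense matrix-vector product spends FLOPs on every weight, while a sparse kernel touches only the non-zeros. This is a conceptual sketch, not a Cerebras kernel; the WSE implements fine-grained sparsity in hardware.

```python
import numpy as np

# Toy weight sparsity: skip every multiply where the weight is zero.
def sparse_matvec(W, x):
    rows, cols = np.nonzero(W)      # coordinates of the non-zero weights
    y = np.zeros(W.shape[0])
    for r, c in zip(rows, cols):
        y[r] += W[r, c] * x[c]      # FLOPs spent only on useful work
    return y

W = np.random.randn(512, 512)
W[np.random.rand(512, 512) < 0.9] = 0.0   # 90% sparse: 90% of multiplies skipped
x = np.random.randn(512)
assert np.allclose(sparse_matvec(W, x), W @ x)   # same answer, fewer multiplies
```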

Thank you for these in-depth answers; readers who wish to learn more should visit Cerebras Systems.
