17 C
New York

AMD’s MI350 Is A Big Iterative Advance

Published:

At their Advancing AI conference, AMD announced the MI350/355 GPUs along with a few specs. SemiAccurate has a technical piece coming but for now we will just cover the high level specs.

AMD MI350 specs

More power leads to more performance

The most obvious question is what is the difference between the MI350 and MI355? That answer is easy, MI350 is air cooled, MI355 is direct liquid cooled (DLC). Because of this the MI350 is capped at 1000W and the MI355 will sip a mere 1400W. According to the raw AMD numbers, the performance difference is about 10% on a raw flops basis but the real world delta is more towards 20% at the system level. Depending on how AMD prices things, the MI355 seems to be the no-brainer choice.

AMD MI350 construction

MI350 construction is similar but different

MI35x itself is a ground up new architecture for the device but the platform itself remains pretty close to the existing MI300 variants. The same can’t be said at the silicon level where there was one big architectural change, the base/IO die. MI300 had four IODs, one per compute die. This worked out fine and brought a lot of flexibility to the family. You could mix and match CPU and GPU tiles on whim and AMD did just that. Unfortunately the market didn’t care, or at least care enough to warrant future heterogeneous CPU/GPU hybrids.

Without the need for this level of flexibility, and with packaging advances, AMD moved to a two IOD architecture. Each IOD has two CCDs/XCDs on top which still allows them to mix and match CPU and GPU but only two at a time. That said we don’t expect to see a CPU bearing MI35x device this generation. This change also leads to some more interesting memory choices but that is a topic for another article.

AMD MI350 CU performance

AMD’s per-CU performance comparison

At a low level, the per-CU performance is broadly the same on MI35x as it was on MI300, at least for traditional data types. On the newer and much more broadly used AI datatypes such as BF16 and lower INT widths, the MI35x doubles the performance of it’s predecessor. Do note this is per CU, not per device. Each Accelerator Complex Die (XCD) has 256 active CUs per die which is down from 304 in the MI300 generation but net performance goes way up.

AMD MI350 training performance

If you want to train Llama 3 yourself…..

AMD is claiming ~3x performance gains over MI300 in Llama 3 training but benchmarks on this scale can be a little murky. This isn’t a criticism, it is just not easy compare things on this scale independently for obvious reasons. In any case the advances that MI35x brings to the table are real and come from a mix of low level hardware advances, device advances, memory, IO, and software changes. In short the end user visible performance gains should be more than 2x and exceed that with some software work.

Overall the MI35x family isn’t a sea change over the MI300 generation but it it brings some serious advances to the table. Hardware isn’t the most important factor for AI purchases to many customers, software often is. AMD hasn’t ignored this front either with the release of ROCM7 and various other tools and frameworks. The overall result is a claimed leap in training performance and a large perf/dollar advantage over Nvidia. It will be interesting to see how it all plays out in the market.S|A

The following two tabs change content below.

Charlie Demerjian is the founder of Stone Arch Networking Services and SemiAccurate.com. SemiAccurate.com is a technology news site; addressing hardware design, software selection, customization, securing and maintenance, with over one million views per month. He is a technologist and analyst specializing in semiconductors, system and network architecture. As head writer of SemiAccurate.com, he regularly advises writers, analysts, and industry executives on technical matters and long lead industry trends. Charlie is also available through Guidepoint and Mosaic. FullyAccurate

Source link

Related articles

Recent articles