This sounds like good engineering, but surely there’s not a big gap with their competitors. They are spending tens of millions on hardware and energy, and this is something a handful of (very good) programmers should be able to pull off.
Unless I’m missing something, it’s the sort of thing that’s done all the time on console games.
Part of this was an optimization that was necessary due to their resource restrictions. Chinese firms can only purchase H800 GPUs instead of H100s or H200s. These have much slower inter-GPU communication (less than half the bandwidth!) as a result of export bans by the US government, so this optimization was done to alleviate some of that bottleneck. It’s unclear to me whether this type of optimization would make as big of a difference for a lab using H100s/H200s; my guess is that it probably matters less.
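For a rough sense of what “working around a slow interconnect” looks like in practice, here’s a minimal CUDA sketch of compute/communication overlap: a peer-to-peer copy is issued on one stream while a kernel runs on another, so the interconnect and the SMs are busy at the same time. This is not DeepSeek’s actual code (they reportedly went much lower-level than this); all names here are illustrative, and it assumes a machine with two GPUs that support peer access.

```cpp
#include <cuda_runtime.h>
#include <cstdio>

// Stand-in for a real compute step (e.g. a GEMM tile); illustrative only.
__global__ void compute(float *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] = data[i] * 2.0f + 1.0f;
}

int main() {
    const int n = 1 << 24;                  // ~16M floats per buffer
    const size_t bytes = n * sizeof(float);

    // Assumed setup: GPU 0 computes, GPU 1 receives. Error checks omitted.
    cudaSetDevice(0);
    cudaDeviceEnablePeerAccess(1, 0);       // let GPU 0 write into GPU 1

    float *work, *sendbuf, *recvbuf_dev1;
    cudaMalloc(&work, bytes);
    cudaMalloc(&sendbuf, bytes);
    cudaSetDevice(1);
    cudaMalloc(&recvbuf_dev1, bytes);
    cudaSetDevice(0);

    // Separate streams so the transfer and the kernel can run concurrently.
    cudaStream_t compute_stream, comm_stream;
    cudaStreamCreate(&compute_stream);
    cudaStreamCreate(&comm_stream);

    // Issue the inter-GPU copy and the kernel back to back; the copy
    // occupies the (slow) link while the SMs keep crunching.
    cudaMemcpyPeerAsync(recvbuf_dev1, 1, sendbuf, 0, bytes, comm_stream);
    compute<<<(n + 255) / 256, 256, 0, compute_stream>>>(work, n);

    cudaStreamSynchronize(comm_stream);
    cudaStreamSynchronize(compute_stream);
    printf("overlap done\n");
    return 0;
}
```

The narrower the interconnect, the more such overlap (and careful scheduling of what gets sent when) buys you, which is presumably why it mattered more on H800s.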
I think it’s more like what was done all the time for console games. These days that doesn’t happen as much anymore, as far as I know. But I think this shows that CUDA is not a good enough abstraction for modern GPUs, or that the compilers are not as good as expected. There should be no way to get that much extra performance out of hand-written/hand-optimized code these days.
Eh, even for many console games it’s not optimised that much.
Check out Kaze Emanuar’s (& co.) rewrite of the N64’s Super Mario 64 engine. He’s now building an entirely new game on top of that engine, and it looks considerably better than SM64 did and runs at twice the FPS on original hardware.
But you’re probably right that today it happens even less than before.
That disregards the massive advancements in technology, hindsight, tooling and theory they can make use of now. There is a world of difference there even with the same hardware. So not comparable imo; it wasn’t for a lack of effort on Nintendo’s part.
A substantial part of the optimisation was simply not compiling as a debug target. There were plenty of oversights by Nintendo devs (not to discredit all they’ve accomplished here). And most of the tooling for this Kaze developed himself (because who else develops for the N64?).
It’s mostly the result of a couple of really clever and passionate people actually taking it apart at a very low level. Nintendo could absolutely have done most of these optimisations themselves; they don’t really rely on many newly discovered techniques or anything. Still, Nintendo had deadlines, of course, which Kaze & Co. don’t.