You make a lot of good points in here but I think you are slightly off on a couple key points.
These are ARM not x64 so they use SVE2 which can technically scale to 2048 rather than 512 of AVX. Did they scale it to that, I’m unsure, existing Grace products are 4x128 so possibly not.
Second this isn’t meant to be a performant device, it is meant to be a capable device. You can’t easily just make a computer that can handle the compute complexity that this device is able to take on for local AI iteration. You wouldn’t deploy with this as the backend, it’s a dev box.
Third the CXL and CHI specs have coverage for memory scoped out of the bounds of the host cache width. That memory might not be accessible to the CPU but there are a few ways they could optimize that. The fact that they have an all in a box custom solution means they can hack in some workarounds to execute the complex workloads.
I’d want to see how this performs versus an i9 + 5090 workstation but even that is going to already go beyond the price point for this device. Currently a 4090 is able to handle ~20b params which is an order of magnitude smaller than what this can handle.
You make a lot of good points in here but I think you are slightly off on a couple key points.
These are ARM not x64 so they use SVE2 which can technically scale to 2048 rather than 512 of AVX. Did they scale it to that, I’m unsure, existing Grace products are 4x128 so possibly not.
Second this isn’t meant to be a performant device, it is meant to be a capable device. You can’t easily just make a computer that can handle the compute complexity that this device is able to take on for local AI iteration. You wouldn’t deploy with this as the backend, it’s a dev box.
Third the CXL and CHI specs have coverage for memory scoped out of the bounds of the host cache width. That memory might not be accessible to the CPU but there are a few ways they could optimize that. The fact that they have an all in a box custom solution means they can hack in some workarounds to execute the complex workloads.
I’d want to see how this performs versus an i9 + 5090 workstation but even that is going to already go beyond the price point for this device. Currently a 4090 is able to handle ~20b params which is an order of magnitude smaller than what this can handle.