The network is trained using the same principles as Megatron-LM; inference alone will require 4 A100s.
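Back-of-envelope on why it takes that many GPUs just to hold the weights (the ~100B parameter count and fp16 serving are my assumptions for illustration, not figures from the release):

```python
# Rough memory math for serving a large model in fp16.
# The 100e9 parameter count is assumed for illustration only.

params = 100e9                  # assumed parameter count
bytes_per_param = 2             # fp16 weights
weight_gb = params * bytes_per_param / 1e9

a100_gb = 80                    # A100 80GB variant
gpus = 4
total_gpu_gb = a100_gb * gpus

print(f"weights: {weight_gb:.0f} GB, available: {total_gpu_gb} GB")
# ~200 GB of weights alone, before KV cache and activations,
# so one or two A100s won't fit it.
```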
No information on how they targeted the size/compute, and no loss curve, so they probably undertrained it: no one can resist the temptation to claim a bigger parameter count than they actually have the compute budget for, and it won't outperform the models you'd expect it to. (And I don't mean Chinchilla, I mean OPT and GPT-NeoX-20B.)
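For a sense of what "enough compute" means, here's the usual rule-of-thumb math (Chinchilla's ~20 tokens per parameter and the 6ND FLOPs approximation); the parameter count is assumed, nothing here comes from the release:

```python
# Rule-of-thumb scaling math: compute-optimal token count ~ 20x params
# (Chinchilla), training FLOPs ~ 6 * N * D. Illustrative only.

def chinchilla_budget(n_params: float) -> tuple[float, float]:
    """Return (compute-optimal tokens, training FLOPs) for n_params."""
    tokens = 20 * n_params          # ~20 tokens per parameter
    flops = 6 * n_params * tokens   # standard 6ND approximation
    return tokens, flops

n = 100e9                           # assumed parameter count
tokens, flops = chinchilla_budget(n)
print(f"{tokens / 1e12:.1f}T tokens, {flops:.2e} FLOPs")
# A 100B-class model "wants" roughly 2T tokens; train it on far less and
# a smaller model trained on the same budget will likely match or beat it.
```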