...
For example, the largest iteration of the LLaMA large language model released by Meta AI has 65 billion parameters, and yet it is hard-coded to run on a server with eight GPUs. Not everyone has such a server! What if a potential user has a server with four GPUs? Or what if one has six servers available, with four GPUs each? What if no GPUs are available at all, but one has access to a compute cluster with eight CPU machines? In any of these cases, a potential user would have to write a lot of code to get LLaMA to work, and this development task is likely beyond all but the most sophisticated users. The problems are myriad; for example, the 65B-parameter LLaMA model itself comes "pre-decomposed" to run on eight GPUs. That is, the various tensors composing the model have all been broken up eight ways, and a programmer wanting to run the model on different hardware must first "re-compose" the tensors and then break them up again for the new hardware.
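To make this concrete, the following sketch (in PyTorch) illustrates the re-compose-then-re-shard chore for a single weight matrix. Dummy tensors stand in for the actual checkpoint shards, and the split along dimension 0 is an assumption made for illustration; a real port would have to repeat this step, with the correct split axis, for every sharded tensor in the model.

```python
import torch

# Sketch of the "re-compose, then re-shard" step described above.
# We simulate one weight matrix that was saved as eight per-GPU shards.
NUM_OLD_SHARDS = 8   # how the released checkpoint was decomposed
NUM_NEW_SHARDS = 4   # what the target hardware provides

# Stand-ins for the eight shard files of a single weight matrix
# (in practice each would be read from a checkpoint file on disk).
old_shards = [torch.randn(1024, 4096) for _ in range(NUM_OLD_SHARDS)]

# Step 1: "re-compose" the full tensor by concatenating along the
# (assumed) split axis.
full_weight = torch.cat(old_shards, dim=0)   # shape (8192, 4096)

# Step 2: break the full tensor up again for the new hardware.
new_shards = torch.chunk(full_weight, NUM_NEW_SHARDS, dim=0)

assert all(s.shape == (2048, 4096) for s in new_shards)
```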
...