Unfortunately, existing systems for large-scale, distributed machine learning are inflexible and difficult to use. This was not much of a problem when models had only tens of millions of parameters, but it is increasingly problematic now that models with tens of billions of parameters are common.
For example, the largest iteration of the LLaMA large language model released by MetaAI has 65 billion parameters, and yet it is hard-coded to work on a server having eight GPUs. Not everyone has such a server! What if a potential user has a server with four GPUs? Or what if one has six servers available, with four GPUs each? What if no GPUs are available at all, but one has access to a compute cluster with eight CPU machines? In any of these cases, a potential user would have to write a lot of code to get LLaMA to work, and this development task is likely beyond all but the most sophisticated users. The problems are myriad; for example, the 65B-parameter LLaMA model itself comes "pre-decomposed" to run on eight GPUs. That is, the various tensors composing the model have all been broken up eight ways, and a programmer wanting to run the model on different hardware must "re-compose" the tensors and then break them up again for the new hardware, as in the sketch below.
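To make that chore concrete, here is a minimal sketch of the re-compose-and-re-split step, assuming PyTorch-style checkpoint files and a weight that was split along its first dimension. The file names, the parameter name, and the shard counts are illustrative placeholders for this example, not a recipe for any particular model release.

    # Minimal sketch: merge eight per-GPU checkpoint shards, then re-split for
    # four GPUs. File names, the parameter name, and the sharded dimension are
    # illustrative assumptions; real checkpoints shard different tensors along
    # different axes.
    import torch

    NUM_OLD_SHARDS = 8   # how the released checkpoint was decomposed
    NUM_NEW_SHARDS = 4   # what the target hardware actually has

    # Load the per-GPU checkpoint files (hypothetical names).
    shards = [torch.load(f"shard_{i}.pth", map_location="cpu")
              for i in range(NUM_OLD_SHARDS)]

    # "Re-compose" one tensor that was split along its first dimension ...
    key = "layers.0.attention.wq.weight"   # hypothetical parameter name
    full = torch.cat([s[key] for s in shards], dim=0)

    # ... and break it up again for the new hardware.
    for i, piece in enumerate(torch.chunk(full, NUM_NEW_SHARDS, dim=0)):
        torch.save({key: piece}, f"new_shard_{i}.pth")

Multiply this by every tensor in the model, each with its own sharding rule, and the amount of bookkeeping a user must get exactly right becomes clear.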
The point is not that there is a problem with the LLaMA model itself; MetaAI has done a tremendous service in releasing this model for widespread use. Rather, the point is that LLaMA is implemented on top of existing machine learning system technology (PyTorch), and that has implications for the flexibility of the model.
Our goal is to design new systems for machine learning, from the ground up, that allow for maximum flexibility. All a programmer needs to do is specify the model; breaking up the resulting computation, whether learning or inference, to run on different devices or different machines in a distributed cluster is then automatic. Many of our ideas for machine learning system design are rooted in techniques that distributed and parallel database designers have used for decades. An ML system should automatically figure out the best way to execute a distributed computation, just as a database system automatically figures out the best way to run a distributed SQL query. The SQL does not change, no matter what the underlying hardware.
Much of our work is based on the idea of the tensor relational algebra, which says that machine learning systems should decompose large tensors into sets of sub-tensors, and then operate on them using standard relational operations such as joins, projections, and aggregations. The sketch below illustrates the idea.
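As a concrete illustration, the following self-contained sketch stores each matrix as a relation of (rowID, colID, block) triples and computes a matrix multiply as a join on the shared index followed by a grouped aggregation. The block size and the helper names (to_relation, rel_matmul) are illustrative choices made for this example, not the API of our system.

    # A small, self-contained sketch of the tensor relational algebra idea:
    # store each matrix as a relation of (rowID, colID, block) triples and
    # compute a matrix multiply as a join followed by a grouped aggregation.
    import numpy as np

    BLOCK = 2  # illustrative block size

    def to_relation(M):
        """Decompose a matrix into a relation {(rowID, colID): sub-tensor}."""
        n, m = M.shape
        return {(i // BLOCK, j // BLOCK): M[i:i + BLOCK, j:j + BLOCK]
                for i in range(0, n, BLOCK) for j in range(0, m, BLOCK)}

    def rel_matmul(A_rel, B_rel):
        """C = A @ B as a join (A.colID == B.rowID) plus an aggregation."""
        C_rel = {}
        for (i, k), a_blk in A_rel.items():      # join on the shared index k
            for (kk, j), b_blk in B_rel.items():
                if k == kk:                      # group by (i, j) and sum
                    C_rel[(i, j)] = C_rel.get((i, j), 0) + a_blk @ b_blk
        return C_rel

    A, B = np.random.rand(4, 6), np.random.rand(6, 4)
    C_rel = rel_matmul(to_relation(A), to_relation(B))

    # Reassemble the output blocks and check against an ordinary matmul.
    nr, nc = A.shape[0] // BLOCK, B.shape[1] // BLOCK
    C = np.block([[C_rel[(i, j)] for j in range(nc)] for i in range(nr)])
    assert np.allclose(C, A @ B)

On a single machine this is just a blocked matrix multiply; the point is that once a computation is phrased as joins and aggregations over sub-tensors, a relational engine is free to partition the triples across machines and choose the distributed execution plan, exactly as it would for SQL.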
For a bit more depth, please take a look at this slide deck detailing our ideas in distributed machine learning system design:
You can watch a video presentation of the slide deck here:
If you would like to see some of the papers we've published on the topic, see the publications section below.
If you would like to meet our team members, see the people section below.
To date, our work has been generously supported by the US National Science Foundation. See our sponsors section below.
Selected Publications
Yuxin Tang, Zhimin Ding, Dimitrije Jankov, Binhang Yuan, Daniel Bourgeois, Chris Jermaine: Auto-Differentiation of Relational Computations for Very Large Scale Machine Learning. To appear, ICML Conference, 2023.
Binhang Yuan, Cameron R. Wolfe, Chen Dun, Yuxin Tang, Anastasios Kyrillidis, Chris Jermaine: Distributed Learning of Fully Connected Neural Networks using Independent Subnet Training. Proc. VLDB Endow. 15(8): 1581-1590 (2022)
Dimitrije Jankov, Binhang Yuan, Shangyu Luo, Chris Jermaine: Distributed Numerical and Machine Learning Computations via Two-Phase Execution of Aggregated Join Trees. Proc. VLDB Endow. 14(7): 1228-1240 (2021)
Binhang Yuan, Dimitrije Jankov, Jia Zou, Yuxin Tang, Daniel Bourgeois, Chris Jermaine: Tensor Relational Algebra for Distributed Machine Learning System Design. Proc. VLDB Endow. 14(8): 1338-1350 (2021)
Shangyu Luo, Dimitrije Jankov, Binhang Yuan, Chris Jermaine: Automatic Optimization of Matrix Implementations for Distributed Machine Learning and Linear Algebra. SIGMOD Conference 2021: 1222-1234
Shangyu Luo, Zekai J. Gao, Michael N. Gubanov, Luis Leopoldo Perez, Dimitrije Jankov, Christopher M. Jermaine: Scalable linear algebra on a relational database system. Commun. ACM 63(8): 93-101 (2020)
Dimitrije Jankov, Shangyu Luo, Binhang Yuan, Zhuhua Cai, Jia Zou, Chris Jermaine, Zekai J. Gao: Declarative Recursive Computation on an RDBMS. Proc. VLDB Endow. 12(7): 822-835 (2019)
People
Sleem Mahmoud Abdelghafar (PhD student)
Daniel Bourgeois (PhD student)
Zhimin Ding (PhD student)
Dimitrije Jankov (Alumni)
Chris Jermaine (Professor and principal investigator)
Jihui Li (MCS student)
Shangyu Luo (Alumni)
Yuxin Tang (PhD student)
Sarah Yao (Rice undergrad/MS student)
Xin Yao (PhD student)
Binhang Yuan (Alumni)
Sponsors
This research is generously supported by the US National Science Foundation under grant numbers 2008240, 1910803, and 2131294.


