Title: Making super large-scale machine learning possible
Speaker: Dr. Tie-Yan Liu (刘铁岩), Microsoft Research Asia
Abstract: The capability of learning super big models is becoming crucial in this big-data era. For example, one may need to learn an LDA model with millions of topics, or a word embedding model with billions of parameters. However, training such big models turns out to be very challenging: with state-of-the-art machine learning technologies, one has to use a huge number (e.g., thousands) of machines for this purpose, which is clearly beyond the reach of common machine learning practitioners. In this research, we ask whether it is possible to train super big machine learning models using just a modest computer cluster. To achieve this goal, we focus on two kinds of innovation.

First, we make important modifications to the training procedures of existing machine learning algorithms to make them much more cost-effective. For instance, we propose a new, highly efficient O(1) Metropolis-Hastings sampling algorithm for LDA, whose running cost is (surprisingly) agnostic of model size and which empirically converges nearly an order of magnitude faster than current state-of-the-art LDA samplers. As another example, we adopt a new, distribution-based training process for word embedding, which transforms the huge training data into a modest-sized histogram and therefore significantly reduces the memory and disk capacity required.

Second, we develop a new parameter-server-based distributed machine learning framework that specifically targets the efficient training of super big models. By using separate data structures for high- and low-frequency parameters, the framework allows extremely large models to fit in memory while maintaining high access speed. By using a so-called model scheduling scheme, the framework allows each worker machine to pull sub-models from the parameter server as needed, resulting in frugal use of limited memory capacity and network bandwidth. By automatically pipelining model training and network communication, the framework achieves very high training speed under a wide range of computational-resource and network conditions. Our experimental results show that with a modest cluster of just 24 machines, we can train an LDA model with 1 million topics and a 20-million-word vocabulary, or a word embedding model with 1000 dimensions and a 20-million-word vocabulary, on a Web document collection with 200 billion tokens, a scale not yet reported even with thousands of machines.
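To make the flavor of an O(1) Metropolis-Hastings sampler concrete, below is a minimal Python sketch of the generic alias-table-plus-MH pattern: a topic is proposed in O(1) from a precomputed (possibly stale) alias table and then corrected by an acceptance test against the exact conditional. This is only an illustration of the general technique under assumed weight arrays; the sampler described in the talk factorizes the LDA conditional further, and all function names here are hypothetical.

```python
import random

def build_alias_table(weights):
    """Build an alias table (Vose's method) over unnormalized weights,
    so that one index can be drawn in O(1) time per sample."""
    k = len(weights)
    total = float(sum(weights))
    prob = [w * k / total for w in weights]
    alias = list(range(k))
    small = [i for i, p in enumerate(prob) if p < 1.0]
    large = [i for i, p in enumerate(prob) if p >= 1.0]
    while small and large:
        s, g = small.pop(), large.pop()
        alias[s] = g
        prob[g] -= 1.0 - prob[s]
        (small if prob[g] < 1.0 else large).append(g)
    return prob, alias

def alias_draw(prob, alias):
    """O(1) draw from a prebuilt alias table."""
    i = random.randrange(len(prob))
    return i if random.random() < prob[i] else alias[i]

def mh_step(current, prob, alias, proposal_weights, target_weights):
    """One Metropolis-Hastings step with an independence proposal:
    propose from the (possibly stale) alias table built over
    proposal_weights, then accept or reject against the exact
    target_weights so the chain still samples the correct distribution."""
    proposal = alias_draw(prob, alias)
    ratio = (target_weights[proposal] * proposal_weights[current]) / \
            (target_weights[current] * proposal_weights[proposal])
    return proposal if random.random() < min(1.0, ratio) else current
```

Because the alias table is rebuilt only occasionally while the MH correction keeps the chain exact, the per-sample cost stays constant regardless of the number of topics.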
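The idea of separate data structures for high- and low-frequency parameters can be pictured with a toy two-tier store: rows for frequent words sit in a dense matrix for fast indexed access, while the long tail lives in a hash map so memory is spent only where counts actually exist. This is a hypothetical sketch, not the framework's actual implementation; the class and method names are invented.

```python
import numpy as np

class HybridTopicCountStore:
    """Toy two-tier word-topic count store: high-frequency words get rows in a
    dense matrix, low-frequency words are kept in a sparse hash map."""

    def __init__(self, hot_words, num_topics):
        # hot_words: ids of the most frequent words, chosen from corpus statistics
        self.hot_row = {w: i for i, w in enumerate(hot_words)}
        self.dense = np.zeros((len(hot_words), num_topics), dtype=np.int64)
        self.sparse = {}  # word id -> {topic id: count}

    def add(self, word, topic, delta):
        row = self.hot_row.get(word)
        if row is not None:
            self.dense[row, topic] += delta
        else:
            counts = self.sparse.setdefault(word, {})
            counts[topic] = counts.get(topic, 0) + delta

    def get(self, word, topic):
        row = self.hot_row.get(word)
        if row is not None:
            return int(self.dense[row, topic])
        return self.sparse.get(word, {}).get(topic, 0)
```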
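Model scheduling and the pipelining of training with network communication can likewise be sketched as a worker that trains on its current data block while a background thread already pulls the sub-model needed by the next block. The callables pull_submodel, train_block, and push_updates are placeholders, not the framework's real API.

```python
import threading
import queue

def train_with_model_scheduling(data_blocks, pull_submodel, train_block, push_updates):
    """Process data blocks while overlapping network transfer with computation:
    a background thread pulls the sub-model (only the parameters the next block
    touches) while the current block is being trained."""
    if not data_blocks:
        return
    prefetched = queue.Queue(maxsize=1)

    def prefetch(block):
        prefetched.put(pull_submodel(block))

    # Pull the first block's sub-model in the background, then start the loop.
    threading.Thread(target=prefetch, args=(data_blocks[0],)).start()
    for i, block in enumerate(data_blocks):
        submodel = prefetched.get()                   # parameters for this block
        if i + 1 < len(data_blocks):                  # kick off the next pull now
            threading.Thread(target=prefetch, args=(data_blocks[i + 1],)).start()
        local_updates = train_block(block, submodel)  # compute while the pull is in flight
        push_updates(local_updates)
```

With this pattern a worker only ever holds the sub-models of one or two data blocks at a time, which is what keeps its memory and bandwidth footprint modest.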