2017, "Sparse Communication for Distributed Gradient Descent"

Author: Aji

Abstract

1. Compression rate: 99% (the 99% smallest gradient updates by absolute value are dropped)

2. Combination: can be combined with gradient quantization

3. Tasks: MNIST (most configurations work well) and neural machine translation (results vary by configuration)

4. MNIST: 49% speedup; NMT: 22% speedup. Note: neither loses accuracy.

1.Introduction

Surveys a number of related papers. I have downloaded all of them and should read through them.

3.Distributed SGD

4.Sparse Gradient Exchange

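2. Simply setting gradients below the threshold to 0 breaks convergence, so the dropped values have to be accumulated and added to the gradient of the next minibatch.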

4. The threshold is chosen from a 0.1% sample of the gradient.

5. Both a local threshold and a global threshold are used (combined with layer normalization).
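To make the items above concrete for myself, here is a minimal sketch of gradient dropping with residual accumulation and a sampled threshold. It is plain Python on flat lists, not the authors' implementation; the function names, the `sample_frac` default, and the strict `>` comparison are my own assumptions.

```python
import random


def sampled_threshold(values, drop_ratio=0.99, sample_frac=0.001, rng=None):
    """Estimate the magnitude below which about `drop_ratio` of the entries
    fall, looking only at a small random sample of the entries."""
    rng = rng or random.Random(0)
    n_sample = max(1, int(len(values) * sample_frac))
    sample = sorted(abs(v) for v in rng.sample(values, n_sample))
    # Position of the drop_ratio quantile inside the sorted sample.
    idx = min(len(sample) - 1, int(drop_ratio * len(sample)))
    return sample[idx]


def sparsify_with_residual(grad, residual, drop_ratio=0.99,
                           sample_frac=0.001, rng=None):
    """Return (sparse_update, new_residual).

    sparse_update is a list of (index, value) pairs to send; every entry
    that is not sent is kept in new_residual for the next minibatch."""
    # Add what was left over from earlier steps before thresholding.
    acc = [g + r for g, r in zip(grad, residual)]
    threshold = sampled_threshold(acc, drop_ratio, sample_frac, rng)
    sparse_update = []
    new_residual = [0.0] * len(acc)
    for i, v in enumerate(acc):
        if abs(v) > threshold:
            sparse_update.append((i, v))  # large entry: send it now
        else:
            new_residual[i] = v           # small entry: keep it for later
    return sparse_update, new_residual
```

Across training steps the returned `new_residual` is passed back in as `residual`, so small gradients are delayed rather than lost and get sent once they accumulate past the threshold.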

5.Experiment

5.1 Choosing the compression rate (could this be adjusted dynamically?)

2. A 99.9% compression rate hurts performance badly, while 99% has almost no effect on it.

5.2 Local threshold vs. global threshold

1. Only locally thresholded pushing is implemented; locally thresholded pulling is not. (Based on the results and due to the complicated interaction with sharding, we did not implement locally thresholded pulling, so only locally thresholded pushing is shown.)

2. Layer normalization only helps on NMT; it has almost no effect on MNIST.
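A toy illustration of how the two thresholding schemes can differ when layers have gradients of very different scale. The layer names and values are made up, and an exact (unsampled) quantile is used only to keep the example deterministic; this is a sketch of the idea, not the paper's code.

```python
def quantile_threshold(values, drop_ratio):
    """Magnitude at the drop_ratio quantile of the absolute values."""
    mags = sorted(abs(v) for v in values)
    idx = min(len(mags) - 1, int(drop_ratio * len(mags)))
    return mags[idx]


def kept_counts(layers, drop_ratio, mode):
    """Count the entries each layer would send under a 'local' (one
    threshold per layer) or 'global' (one shared threshold) scheme."""
    if mode == "global":
        all_values = [v for vals in layers.values() for v in vals]
        shared = quantile_threshold(all_values, drop_ratio)
        thresholds = {name: shared for name in layers}
    elif mode == "local":
        thresholds = {name: quantile_threshold(vals, drop_ratio)
                      for name, vals in layers.items()}
    else:
        raise ValueError(f"unknown mode: {mode}")
    return {name: sum(1 for v in vals if abs(v) >= thresholds[name])
            for name, vals in layers.items()}


# Two hypothetical layers whose gradient magnitudes differ by 100x.
layers = {
    "embedding": [0.001 * i for i in range(1, 101)],
    "output": [0.1 * i for i in range(1, 101)],
}
print(kept_counts(layers, 0.9, "global"))
print(kept_counts(layers, 0.9, "local"))
```

With a shared threshold the small-scale layer sends nothing in this toy case, while per-layer thresholds keep the same fraction from each layer. Normalizing layer scales is presumably what lets a global threshold work, which would fit the NMT observation above.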

5.3 Convergence speed

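They shrink the minibatch size to artificially inflate the share of time spent on communication, and then claim that their gradient compression speeds up convergence. This experiment feels somewhat contrived.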

5.4 1-bit quantization

1. Lists several gradient quantization methods.

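2. Three granularities of quantization: min threshold, column-wise average threshold, and global average. (Judging from the experiment figures, the global average threshold seems to work best, though I suspect this depends on the specific scenario.)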

3. 1-bit quantization may hurt convergence speed; 2-bit is generally sufficient because it can tell small gradients apart from large ones.
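A small sketch of why one extra bit can matter. The 1-bit function keeps only the sign and reconstructs every value at a single shared magnitude (here the global average); the 2-bit function is a hypothetical scheme of my own that adds a small/large bucket. Neither is taken from the paper's code, and the column-wise variant is left out for brevity.

```python
def global_average_scale(values):
    """One shared magnitude: the mean absolute value of all entries."""
    return sum(abs(v) for v in values) / len(values)


def one_bit_quantize(values, scale):
    """Keep only the sign of each value; reconstruct it as +/- scale."""
    return [scale if v >= 0 else -scale for v in values]


def two_bit_quantize(values):
    """Hypothetical 2-bit scheme: sign plus a small/large magnitude bucket,
    each bucket reconstructed at its own mean magnitude."""
    mean_mag = global_average_scale(values)
    small = [abs(v) for v in values if abs(v) < mean_mag]
    large = [abs(v) for v in values if abs(v) >= mean_mag]
    small_level = sum(small) / len(small) if small else mean_mag
    large_level = sum(large) / len(large) if large else mean_mag
    out = []
    for v in values:
        level = large_level if abs(v) >= mean_mag else small_level
        out.append(level if v >= 0 else -level)
    return out


# Three small surviving gradients and one large one (made-up values).
values = [1.0, -1.0, 1.0, -5.0]
one_bit = one_bit_quantize(values, global_average_scale(values))
two_bit = two_bit_quantize(values)
print(one_bit)
print(two_bit)
```

In this toy case the 1-bit version flattens the large gradient to the same magnitude as the small ones, while the 2-bit version keeps them apart.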