Title

低複雜度卷積神經網路訓練與其低功耗運算單元電路設計

Parallel Title

Low-complexity Convolution Neural Network Training and Low Power Circuit Design of its Processing Element

DOI

10.6342/NTU201704424

Author

林柏成

Keywords

deep neural network ; convolution neural network ; quantization ; ImageNet

Journal Name

Thesis, Graduate Institute of Electronics Engineering, National Taiwan University

Volume/Issue / Publication Date

2017

Degree

Master's

Advisor

闕志達

Language

Traditional Chinese

Chinese Abstract

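In recent years, advances in computing technology have brought deep neural networks and artificial intelligence research back into wide attention. Neural networks come in several types, including the multilayer perceptron (MLP), the convolutional neural network (CNN), and the recurrent neural network (RNN). Among them, CNNs are widely applied to image processing tasks such as image recognition and object detection, as well as to natural language processing and even playing Go. Recent CNNs can be more than a hundred layers deep and can solve difficult tasks, but their computational complexity is also far higher than that of a traditional MLP. A CNN is trained by repeatedly propagating images forward through the network and propagating error values backward through it, adjusting the network weights to search for the lowest point on the loss surface until it is found, which yields the best model. With that model, obtaining an inference result only requires passing the input data through the network. The training phase therefore consumes a large amount of computation.

This thesis applies the floating-point signed digit (FloatSD) algorithm to network training and inference to reduce computational complexity. In addition, we quantize the neuron outputs of each layer and the back-propagated error values during training and inference to save further computation. We show that training a deep CNN does not require 32-bit floating-point numbers to reach comparable results. We use Caffe, the framework developed by the Berkeley Artificial Intelligence Research (BAIR) center, as our platform, and implement FloatSD and the quantization of the other parameters by modifying Caffe's source code. We run experiments on three benchmark image recognition datasets: MNIST, CIFAR-10, and ImageNet (ILSVRC). The results show that on small-scale image recognition tasks such as MNIST and CIFAR-10, FloatSD training even outperforms floating-point training. Even on large-scale image recognition such as ImageNet, the network can be trained from scratch directly with FloatSD, without weights pre-trained in floating point; in an experiment on a network whose top-5 accuracy exceeds 90%, the result differs from the floating-point version by only 0.8%.

Beyond software simulation, this thesis also designs a processing element hardware circuit for FloatSD, which will serve as the processing element of a general-purpose neural network chip currently under design. With the FloatSD algorithm, clock gating, and the zero-sorting technique, the circuit occupies 16.6% of the area of the 32-bit floating-point version and consumes 0.72% to 10.8% of its power.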

English Abstract

In recent years, deep neural networks and AI research have attracted much attention. There are several types of neural networks, including the multilayer perceptron (MLP), the convolutional neural network (CNN), and the recurrent neural network (RNN). Among these architectures, the CNN has been widely used in tasks including, but not limited to, image classification, object detection, natural language processing, and even the game of Go. Recently, it has been shown that CNNs can be built with hundreds of layers to solve tough tasks; however, they then require much more computation than a traditional MLP. A CNN is trained by iteratively passing training data forward through the network, passing the output error backward through the network, and adjusting the network weights, traversing the loss surface toward the global minimum to obtain the best model. With the trained model, one can pass data through the network once to get the inference result. We introduce the floating-point signed digit (FloatSD) algorithm into the training and inference phases of CNNs to reduce the computational effort. In addition, we quantize the neuron outputs of each layer and the backward delta errors for further computational savings. We show that 32-bit floating point is not necessary in the training phase to get similar results. We implement FloatSD and the quantization algorithms by modifying the source code of Caffe, the well-known deep learning framework developed by the Berkeley Artificial Intelligence Research center. Three well-known image classification datasets, MNIST, CIFAR-10, and ImageNet, are used throughout our experiments. Results show that FloatSD training gives better results than floating-point training on MNIST and CIFAR-10. Even on ImageNet, we are able to train from scratch with the proposed algorithm: on a network with over 90% top-5 accuracy, the degradation in top-5 accuracy is only 0.8%.
In addition to software simulation, we also design the processing element circuit for FloatSD, which will be the computational module of our ongoing general-purpose neural network chip. Using the FloatSD algorithm, clock gating, and a zero-sorting technique, our circuit takes 16.6% of the area and 0.72% to 10.7% of the power of its 32-bit floating-point counterpart.

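Subject Classification

College of Electrical Engineering and Computer Science > Graduate Institute of Electronics Engineering
Engineering > Electrical Engineering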
References
  1. [1] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “ImageNet Classification with Deep Convolutional Neural Networks,” in Advances in Neural Information Processing Systems, 2012, pp. 1097–1105.
  2. [2] G. Hinton, L. Deng, D. Yu, G. E. Dahl, A. Mohamed, N. Jaitly, A. Senior, V. Vanhoucke, P. Nguyen, T. N. Sainath, et al., “Deep neural networks for acoustic modeling in speech recognition,” IEEE Signal Processing Magazine, vol. 29, no. 6, pp. 82-97, 2012.
  3. [4] D. Silver et al., “Mastering the game of Go with deep neural networks and tree search,” Nature, vol. 529, no. 7587, pp. 484–489, Jan. 2016.
  4. [7] N. P. Jouppi et al., “In-Datacenter Performance Analysis of a Tensor Processing Unit,” ArXiv:1704.04760v1 [cs.AR], 2017.
  5. [9] Y. Chen, T. Luo, S. Liu, S. Zhang, L. He, J. Wang, L. Li, T. Chen, Z. Xu, N. Sun, O. Temam, "DaDianNao: A machine-learning supercomputer," in Proc. of 2014 47th Annual IEEE/ACM International Symposium on MICRO, Dec 2014, pp. 609-622.
  6. [10] Y. Chen, T. Krishna, J. Emer, and V. Sze, “Eyeriss: An Energy-Efficient Reconfigurable Accelerator for Deep Convolutional Neural Networks,” in Proc. of IEEE Int. Solid-State Circuits Conf., Feb. 2016, pp. 1–43.
  7. [11] S. Han, X. Liu, H. Mao, J. Pu, A. Pedram, M. Horowitz, W. Dally, “EIE: Efficient Inference Engine on Compressed Deep Neural Network,” Arxiv: 1602.01528, 2016.
  8. [12] K. Bong, S. Choi, C. Kim, S. Kang, Y. Kim, and H.-J. Yoo, “A 0.62 mW ultra-low-power convolutional-neural-network face-recognition processor and a CIS integrated with always-on Haar-like face detector,” in Proc. of IEEE Int. Solid-State Circuits Conf. (ISSCC) Dig. Tech. Papers, Feb. 2017, pp. 248–250.
  9. [20] D. D. Lin, S. S. Talathi, "Overcoming Challenges in Fixed Point Training of Deep Convolutional Networks," ArXiv:1607.02241 [cs.LG], 2016.
  10. [29] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. "Gradient-based learning applied to document recognition," in Proc. of the IEEE, vol. 86, no.11, pp. 2278-2324, November 1998.
  11. [30] K.-H. Chen, C.-N. Chen, and T.-D. Chiueh, “Grouped signed power-of-two algorithms for low-complexity adaptive equalization,” IEEE Trans. Circuits Syst. II, Exp. Briefs, vol. 52, no. 12, pp. 816–820, Dec. 2005.
  12. [36] K. Simonyan and A. Zisserman, “Very deep convolutional networks for large-scale image recognition,” CoRR, vol. abs/1409.1556, pp. 1–14, Sep. 2014.
  13. ArXiv:1711.02213 [cs.LG], 2017.
  14. [42] X. Han, D. Zhou, S. Wang, S. Kimura, “CNN-MERP: An FPGA-Based Memory-Efficient Reconfigurable Processor for Forward and Backward Propagation of Convolutional Neural Networks,” ArXiv: 1703.07348 [cs.LG], 2017.
  15. [3] M. Bojarski et al. (2016). “End to end learning for self-driving cars.” [Online]. Available: https://arxiv.org/abs/1604.07316
  16. [5] https://whatsthebigdata.com/2017/01/12/deep-learning-at-google/
  17. [6] https://seekingalpha.com/article/3983127-googles-tensor-processing-unit-ai-market-shifting
  18. [8] T. Chen, Z. Du, N. Sun, J. Wang, C. Wu, Y. Chen, and O. Temam, “DianNao: a small-footprint high-throughput accelerator for ubiquitous machine-learning,” ACM SIGARCH Comput. Archit. News, vol. 42, no. 1, Apr. 2014, pp. 269–284.
  19. [13] S. Zhang, Z. Du, L. Zhang, H. Lan, S. Liu, L. Li, Q. Guo, T. Chen, and Y. Chen. “Cambricon-X: An Accelerator for Sparse Neural Networks,” in Proc. of 49th Annual IEEE/ACM International Symposium on MICRO, Oct. 2016.
  20. [14] J. Albericio, P. Judd, T. Hetherington, T. Aamodt, N. E. Jerger, and A. Moshovos. “Cnvlutin: Ineffectual-Neuron-Free Deep Convolutional Neural Network Computing,” in Proc. of the International Symposium on Computer Architecture (ISCA), June 2016, pp. 1-13.
  21. [15] A. Parashar, M. Rhu, A. Mukkara, A. Puglielli, R. Venkatesan, B. Khailany, J. Emer, S. W. Keckler, and W. J. Dally. 2017. “SCNN: An Accelerator for Compressed-sparse Convolutional Neural Networks,” in Proc. of the 44th Annual International Symposium on Computer Architecture (ISCA '17). ACM, New York, NY, USA, pp. 27-40.
  22. [16] D. Lin, S. Talathi, V. S. Annapureddy, “Fixed Point Quantization of Deep Convolutional Networks,” ArXiv:1511.06393 [cs.LG], 2016.
  23. [17] E. Denton, W. Zaremba, J. Bruna, Y. LeCun, R. Fergus, “Exploiting linear structure within convolutional networks for efficient evaluation,” ArXiv:1404.0736 [cs.CV], 2014.
  24. [18] S. Han, J. Pool, J. Tran, and W. J. Dally. “Learning Both Weights and Connections for Efficient Neural Networks,” in Proc. of the International Conference on Neural Information Processing Systems (NIPS), December 2015, pp. 1135-1143.
  25. [19] S. Han, H. Mao, W. J. Dally, “Deep Compression: Compressing Deep Neural Networks with Pruning, Trained Quantization and Huffman Coding,” ArXiv:1510.00149v5 [cs.CV], 2015.
  26. [21] M. Courbariaux, Y. Bengio, J. David, “BinaryConnect: Training Deep Neural Networks with binary weights during propagations.” ArXiv:1511.00363v3 [cs.LG], 2015.
  27. [22] I. Hubara, M. Courbariaux, D. Soudry, R. El-Yaniv, Y. Bengio, “Quantized neural networks: training neural networks with low precision weights and activation,” ArXiv:1609.07061v1 [cs.NE], 2016.
  28. [23] F. Li, B. Zhang, B. Liu, “Ternary Weight Networks.” ArXiv:1605.04711v2 [cs.CV], 2016.
  29. [24] Sigmoid. https://sebastianraschka.com/faq/docs/logisticregr-neuralnet.html
  30. [25] V. Nair and G. E. Hinton. “Rectified linear units improve restricted boltzmann machines,” in Proc. of 27th International Conference on Machine Learning, 2010.
  31. [26] Batch gradient descent. https://www.coursera.org/learn/machine-learning/lecture/9zJUs/mini-batch-gradient-descent
  32. [27] Stochastic gradient descent. http://ufldl.stanford.edu/tutorial/supervised/OptimizationStochasticGradientDescent/
  33. [28] Y. LeCun, Y. Bengio, and G. Hinton, “Deep learning,” Nature, vol. 521, pp. 436–444, May 2015.
  34. [31] Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick, S. Guadarrama, and T. Darrell, “Caffe: Convolutional architecture for fast feature embedding,” in Proc. of the ACM International Conference on Multimedia, 2014, pp. 675–678.
  35. [32] NVIDIA cuDNN. https://developer.nvidia.com/cudnn, 2016.
  36. [33] MNIST. http://yann.lecun.com/exdb/mnist/
  37. [34] CIFAR-10. https://www.cs.toronto.edu/~kriz/cifar.html
  38. [35] ImageNet. http://image-net.org, 2016.
  39. [37] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in Proc. of IEEE Conf. on Comput. Vis. Pattern Recognit. (CVPR), 2016.
  40. [38] S. Ioffe, C. Szegedy, “Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift,” ArXiv:1502.03167v3 [cs.LG], 2015.
  41. [39] U. Köster, T. Webb, X. Wang, M. Nassar, A. Bansal, W. Constable, O. Elibol, S. Hall, L. Hornof, A. Khosrowshahi, C. Kloss, R. Pai, N. Rao, “Flexpoint: An Adaptive Numerical Format for Efficient Training of Deep Neural Networks,” ArXiv:1711.02213 [cs.LG], 2017.
  42. [40] A. Neelakantan, L. Vilnis, Q. V. Le, I. Sutskever, L. Kaiser, K. Kurach, J. Martens, “Adding Gradient Noise Improves Learning for Very Deep Networks,” ArXiv:1511.06807 [stat.ML], 2015.
  43. [41] W. Zhao, H. Fu, W. Luk, T. Yu, S. Wang, B. Feng, Y. Ma, G. Yang, “F-CNN: An FPGA-based Framework for Training Convolutional Neural Networks,” in Proc. of IEEE Conf. on Application-specific Systems, Architectures and Processors, London, UK, July 2016.
  44. [43] Z. Yuan, Y. Liu, J. Yue, J. Li, H. Yang, “CORAL: Coarse-grained Reconfigurable Architecture for ConvoLutional Neural Networks,” in Proc. of IEEE Conf. on International Symposium on Low Power Electronics and Design, Taipei, Taiwan, July 2017.
  45. [44] https://towardsdatascience.com/neural-network-architectures-156e5bad51ba