Title

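Eliminating Timing Discrepancy to Achieve Energy-Efficient On-Chip Memory in Low-Voltage Processors (時間落差消除以實現低電壓處理器中高能源效率之晶片內建記憶體)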

Parallel Title

Reducing Timing Discrepancy for Energy-Efficient On-Chip Memory Architectures with Low Voltage Processors

Author

王柏皓

Keywords

Low-voltage processor ; Fault-tolerant cache ; Timing discrepancy reduction

Journal Name

Dissertations of the Institute of Computer Science and Engineering, National Chiao Tung University

Volume and Issue / Publication Date

2017

Degree

Doctoral

Advisor

陳添福

Language

English

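Chinese Abstract (translated)

Dynamic voltage and frequency scaling (DVFS) is an effective way to reduce energy consumption in modern processor systems. However, as the voltage drops, the latency of on-chip memory usually grows faster than that of the processor, so the latency gap between on-chip memory and the processor widens as the voltage decreases. This creates a timing discrepancy between on-chip memory and the processor cores and degrades system performance. The timing discrepancy arises mainly because a small number of cells in the static random-access memory (SRAM) are severely affected by process variation and suffer access-time faults, that is, faults caused by insufficient access time. Fortunately, these faults can be remedied by providing sufficient access time. Most previous fault-tolerant designs sacrifice memory capacity or add access latency, so they are not suitable for memories with strict access-time requirements, such as L1 caches or local memories. As lower operating voltages and smaller process technologies are introduced into processor systems, the number of slow memory cells that cause access-time faults also increases. Tolerating a large number of access-time faults without a large performance penalty therefore becomes a key problem for processor systems.

To address this problem, this dissertation analyzes the characteristics of the SRAM commonly used in modern low-voltage processor systems. Based on these observations, it proposes three access-time-fault tolerance techniques for on-chip memory, each with a different purpose.

The first design is the 8T-SRAM-based Zero Counting Error Detection Code (ZC-EDC), which is applicable to different memory architectures such as caches, local memories, and translation lookaside buffers (TLBs). To be applicable across memory architectures, the fault-tolerant design must not lose any memory capacity. Because access-time faults occur only when an 8T SRAM reads a '0', ZC-EDC uses a lightweight error detection code ('0' counting) to detect access-time faults dynamically, and then adjusts the access time to tolerate them. In addition, to further improve the average access time of the L1 cache, we analyze the locality effect in caches and propose a timing-aware LRU policy that stores frequently used data in faster memory blocks.

The second design is the Cross-Matching cache (CM-cache), which focuses on increasing the access-time-fault tolerance of L1 caches built from 8T SRAM. Based on the characteristics of 8T SRAM, CM-cache first proposes an SRAM with dynamic timing calibration (DTC-SRAM) that detects the access time required by each cache block while the processor is running. On top of DTC-SRAM, we propose different cache management strategies for different levels of fault tolerance, including bit-level access-time-fault masking. The design can detect the influence of the stored values at runtime and adjust the required access time.

The third design is the Ally cache for L1 caches. By storing the same data in multiple cache blocks and activating the corresponding wordlines, cache blocks can be "allied". Allying blocks effectively increases the cache access speed, achieves bit-level access-time-fault tolerance, and provides reliable low-voltage operation. Unlike the designs above, the Ally cache does not rely on the characteristics of 8T SRAM, so it can be applied to 6T, 8T, and similar SRAMs. However, this approach incurs a large capacity loss and extra access energy. We therefore propose a data-ally management strategy for L1 caches to reduce the unnecessary energy overhead of the Ally cache.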

English Abstract

Dynamic Voltage and Frequency Scaling (DVFS) is an effective method for saving energy in modern processor systems. However, on-chip memory usually suffers worse latency degradation than processor cores do in low-operating-voltage modes. This causes a timing discrepancy between on-chip memory and cores that degrades system performance. The timing discrepancy is caused primarily by slow memory cells that are subject to severe process variation and therefore produce access-time faults. Fortunately, these faults can be remedied by providing sufficient access time. Previous fault-tolerant designs usually sacrifice capacity or increase access latency to tolerate access-time faults, so they are not suitable for latency-sensitive memories such as level 1 (L1) caches. Moreover, the number of slow cells grows with aggressive voltage reduction and technology node advancement. Tolerating numerous access-time faults without a large latency overhead, and thereby reducing the timing discrepancy, is therefore becoming a critical issue.

To address this issue, this dissertation analyzes the characteristics of the static random-access memory (SRAM) commonly used in modern processor systems. Based on these observations, it proposes three access-time-fault tolerance techniques for on-chip memories, each with a different purpose.

The first design is the Zero Counting Error Detection Code (ZC-EDC), which targets different memory architectures built on 8T SRAM, such as caches, local memories, and translation lookaside buffers (TLBs). To serve all of these architectures, the fault-tolerance design must incur no capacity loss. Because access-time faults occur only when reading '0' bits on 8T SRAM, ZC-EDC uses a lightweight error detection code ('0' counting) to detect access-time faults dynamically and then adapts the access time to tolerate them. In addition, to further improve the average memory access time of L1 caches, we analyze the locality effect and propose a timing-aware LRU policy that dynamically places hot data in the fast blocks.

The second design is the Cross-Matching cache (CM-cache), which focuses on giving L1 caches built on 8T SRAM a high tolerance of access-time faults. This design first proposes a Dynamic Timing Calibration 8T SRAM (DTC-SRAM) that dynamically calibrates the read latency of each cache line. We then propose three cache strategies for different usages, one of which is a bit-level access-time-fault mask. These designs can detect the influence of the stored value at runtime and adaptively adjust the access time of the L1 cache with 8T SRAM.

The third design is the Ally cache, which can "ally" cache lines by storing the same data in them and triggering the multiple corresponding wordlines, achieving cell-level access-time-fault tolerance and reliable low-voltage operation. Unlike the designs above, the Ally cache does not rely on the characteristics of 8T SRAM, so it can be applied to 6T, 8T, and similar SRAMs. However, allying data brings a large capacity loss and access overhead. We therefore propose a cache management strategy to reduce the unnecessary overhead of the Ally cache.
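
The zero-counting idea described for ZC-EDC can be illustrated with a small behavioral sketch. This is not the dissertation's implementation; it is a toy model under stated assumptions: a 32-bit word, a check field (the number of '0' bits) that is assumed to be read reliably, a hypothetical `slow_bits` mask marking slow cells, and a fault model in which a slow cell holding '0' is misread as '1' on a one-cycle read and is read correctly once a second cycle is allowed. All names (`StoredWord`, `raw_read`, `read_word`) are invented for illustration.

```python
from dataclasses import dataclass

WORD_BITS = 32                      # assumed word width
WORD_MASK = (1 << WORD_BITS) - 1


def count_zeros(word: int) -> int:
    """Number of '0' bits among the low WORD_BITS bits of `word`."""
    return WORD_BITS - bin(word & WORD_MASK).count("1")


@dataclass
class StoredWord:
    data: int        # value that was written
    zero_count: int  # check field written alongside the data (assumed reliable)
    slow_bits: int   # hypothetical mask of slow cells in this word


def write_word(data: int, slow_bits: int = 0) -> StoredWord:
    """Store the data together with its '0' count."""
    data &= WORD_MASK
    return StoredWord(data, count_zeros(data), slow_bits & WORD_MASK)


def raw_read(cell: StoredWord, cycles: int) -> int:
    """Toy fault model: with one cycle, a slow cell holding '0' reads as '1'.

    Cells holding '1' are unaffected, so faults are value dependent.
    """
    if cycles >= 2:
        return cell.data
    return cell.data | cell.slow_bits


def read_word(cell: StoredWord) -> tuple[int, int]:
    """Try a fast read; if the '0' count disagrees, retry with more time.

    Returns (data, cycles_used).
    """
    value = raw_read(cell, cycles=1)
    if count_zeros(value) == cell.zero_count:
        return value, 1
    return raw_read(cell, cycles=2), 2


if __name__ == "__main__":
    # Slow cell at bit 2 holds '0': the fast read fails the check and is retried.
    print(read_word(write_word(0b1010, slow_bits=0b0100)))
    # Slow cell at bit 1 holds '1': no fault occurs, so the fast read is accepted.
    print(read_word(write_word(0b1010, slow_bits=0b0010)))
```

Because a fault in this model can only turn a '0' into a '1', any fault lowers the observed '0' count, so a simple count comparison is enough to flag it. The sketch also shows why the penalty depends on the stored value: a slow cell that happens to hold '1' never triggers the slower read.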

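Subject Classification: Basic and Applied Sciences > Information Science
College of Computer Science > Institute of Computer Science and Engineering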