Note on IEEE SCV-SF EDS Annual Symposium

2016-11-11 Fri

High-Performance Hardware for Machine Learning

NVIDIA Chief Scientist, Stanford University, Dr. Dally

  • Hardware and Data enable DNNs
    • the need for speed: 10x to 16x increase in GFLOPs: Google, MS AlexNet, Baidu Deep Speech recognition
  • which networks? DNNs, CNNs, RNNs
    • key operation: matrix-vector multiplication
  • DNNs are trivially parallelized
    • data parallel: run multiple inputs in parallel; a parameter server propagates updates across the model workers; the data-parallel batch cannot be too small or too big (nature of DNNs)
    • model parallel convolution: by output region (x,y)
  • GPU
    • Pascal GP100: 10 TeraFLOPS FP32, 20 TeraFLOPS FP16 (for deep learning); training cannot keep all activation data on-chip (it is needed for backpropagation), so large DRAM bandwidth is required
    • DGX-1 deep learning supercomputer: 170 TFLOPS, 8x P100
    • Parker: for deep learning deployment
    • XAVIER: AI supercomputer SoC
  • Reduced precision
    • what precision do we need? FP32/FP16/INT32/INT16?
      • cost of precision is huge in hardware; reading DRAM is also energy-intensive
      • DRAM 640 pJ/word -> SRAM 50 pJ/word -> local SRAM 5 pJ/word
    • Mixed precision
      • pruning
        • train connectivity -> prune connections <-> train weights (Learning both Weights and Connections for Efficient Neural Networks, NIPS 2015)
        • pruning 93% of network weights without loss of accuracy
      • trained quantization, even 1-bit weights (multiplication is no longer needed)
        • weight sharing (Deep Compression: Compressing Deep Neural Networks with Pruning, Trained Quantization and Huffman Coding, arXiv 2015)
        • bits per weight: 4 bits are enough
      • pruning + trained quantization: 30x to 50x compression
        • targeting mobile applications: memory size and bandwidth needs are reduced
    • Efficient Inference Engine (EIE): ASIC cores to accelerate DNNs
  • Summary
    • Fixed function hardware will dominate inference
    • GPU will dominate training
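
The pruning and trained-quantization bullets above can be sketched in a few lines of plain Python. This is only a minimal illustration, not the method from the cited papers: it prunes by magnitude once (the retraining loop between pruning and training weights is left out), shares weights with a simple 1-D k-means, and estimates storage with an assumed 4-bit position index per surviving weight. The function names, the `index_bits` parameter, and the random test weights are all hypothetical.

```python
import random


def prune_by_magnitude(weights, keep_fraction):
    """Zero every weight except the largest-magnitude keep_fraction of them."""
    n_keep = max(1, round(len(weights) * keep_fraction))
    # The magnitude of the n_keep-th largest weight acts as the threshold.
    threshold = sorted((abs(w) for w in weights), reverse=True)[n_keep - 1]
    return [w if abs(w) >= threshold else 0.0 for w in weights]


def share_weights(weights, bits, iterations=10):
    """Snap surviving weights to 2**bits shared values via 1-D k-means."""
    nonzero = [w for w in weights if w != 0.0]
    k = 2 ** bits
    lo, hi = min(nonzero), max(nonzero)
    # Linear initialization of the codebook over the weight range.
    codebook = [lo + (hi - lo) * i / (k - 1) for i in range(k)]
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for w in nonzero:
            nearest = min(range(k), key=lambda i: abs(w - codebook[i]))
            clusters[nearest].append(w)
        # Move each centroid to its cluster mean; empty clusters stay put.
        codebook = [sum(c) / len(c) if c else codebook[i]
                    for i, c in enumerate(clusters)]
    shared = [min(codebook, key=lambda c: abs(w - c)) if w != 0.0 else 0.0
              for w in weights]
    return shared, codebook


def compression_ratio(n_weights, n_kept, bits, index_bits=4, float_bits=32):
    """Dense FP32 storage divided by (codes + indices + codebook) storage."""
    dense = n_weights * float_bits
    # index_bits is an assumed per-weight cost for storing sparse positions.
    compressed = n_kept * (bits + index_bits) + (2 ** bits) * float_bits
    return dense / compressed


random.seed(0)
weights = [random.gauss(0.0, 1.0) for _ in range(1000)]   # made-up weights
pruned = prune_by_magnitude(weights, keep_fraction=0.07)  # prune 93%
shared, codebook = share_weights(pruned, bits=4)          # 4-bit codes
kept = sum(1 for w in pruned if w != 0.0)
print(kept, len(codebook), compression_ratio(len(weights), kept, bits=4))
```

The ratio printed here depends on the assumed index cost and on the fixed-size codebook, which weighs heavily on a toy layer of 1000 weights, so it should not be read as a reproduction of the 30x to 50x figure from the talk.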

Design-technology co-optimization for 5nm node and beyond

Synopsys Fellow, Dr. Moroz

  • Claasen’s law: usefulness = log(technology)
  • Koomey’s law
  • EUV is not ready, so 7nm does not follow Moore’s law scaling from 10nm
  • logic area scaling factors besides transistors
    • fin depopulation -> cell height reduction
      • cell height in metal tracks: 14nm node is 9, 7nm node is 6
    • SDB (single diffusion break) -> DDB (double …) -> IG (isolating gates)
      • isolation width: low -> high -> low
    • Intel: reduce cost per function instead of area per function
  • DTCO: pre-Si power performance area evaluation
    • process explorer
    • 3nm: rotated FINs
  • Nanowire design
    • the gate is divided into layers of nanowires; capacitance is reduced dramatically, which does not help speed but helps power a lot
  • Silicon -> III-V?
    • III-V becomes worse than silicon after 7nm
    • photons are too big -> electrons; electrons in III-V are larger than in silicon
    • variation is important, and fin depopulation adds pressure to variability scaling
      • geometry, RDF (random dopant fluctuations)
  • Nanowires at KAIST
    • repairing slow transistors
  • 10nm and 7nm are ready; scaling to 3nm is possible; 2nm is under research; 1nm may need a planar structure
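
The cell-height bullet above (9 tracks at the 14nm node, 6 tracks at the 7nm node) can be turned into a small arithmetic sketch that separates the track-count contribution from the metal-pitch contribution. The track counts come from the notes; the two metal pitches are hypothetical placeholders chosen only to make the arithmetic concrete, and modelling cell height as tracks times pitch is a simplifying assumption.

```python
def cell_height_nm(tracks, metal_pitch_nm):
    """Standard-cell height modelled as track count times metal pitch."""
    return tracks * metal_pitch_nm


# Track counts are from the notes; both pitches are hypothetical placeholders.
TRACKS_14NM, TRACKS_7NM = 9, 6
PITCH_14NM, PITCH_7NM = 64.0, 40.0

height_14 = cell_height_nm(TRACKS_14NM, PITCH_14NM)
height_7 = cell_height_nm(TRACKS_7NM, PITCH_7NM)

track_factor = TRACKS_7NM / TRACKS_14NM  # shrink from fewer tracks alone
pitch_factor = PITCH_7NM / PITCH_14NM    # shrink from tighter pitch alone
total_factor = height_7 / height_14      # product of the two factors

print(f"track factor {track_factor:.2f}, pitch factor {pitch_factor:.2f}, "
      f"total {total_factor:.2f}")
```

Whatever the real pitches are, going from 9 to 6 tracks contributes a fixed 6/9 factor to cell height, which is the sense in which logic area can scale through something other than the transistor itself.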

Chip design and process co-optimizations, design for manufacturing/reliability in advanced technology nodes

NVIDIA director, Dr. John Hu

  • not all circuits or chips/systems fail the same way on the same process node
  • performance improvements: material, process & architecture
  • margins on a complex SoC are getting lower and lower: pig farm story
  • old experience no longer holds
    • large gate length -> better variation: no longer true
  • circuit device and process co-optimization
  • full chip level DPI – plasma induced gate damage
  • process advance
    • physical design rules are becoming more and more complex; designers rely on EDA tools
    • new technology node: yield ramp-up is slow
    • more and more yield/reliability issues are margin related
    • generic IC layout -> regular IC layout for better margin
  • choose optimum device in 16/10nm process
    • choose only the few L & Vt combinations that give lower variability
    • single via size
  • layout-dependent effects: most of them cannot be simulated
    • modeling vs. DFM: characterization of std cells needs to consider their surrounding environment (how?)
    • 1um gap between IP will cause deadly effects
  • layout optimization for variability and yield/reliability
  • self heating in FinFET
    • IO device: 20 to 40 degrees higher
    • how to optimize self heating?
  • interconnect scaling challenges
    • RC delay doesn’t scale like transistors
    • power distribution: 30% of the routing resources are dedicated to power/ground
  • chip package interaction
    • bump stress
  • test chip and test patterns for CDI
    • help find out process margin issues and guide DFM
    • detect soft electrical failure signatures
    • accelerate layout learnings
    • easy isolation for EFA and PFA
    • test chip/patterns should be easy to fail
  • GAA: gate-all-around, a true 3D transistor
  • fault tolerant design
    • the world is not perfect, and it doesn’t need to be perfect

2.5D/3D Package Integration

Xilinx Fellow, Dr. Suresh Ramalingam

  • cost is one key motivation for 3D ICs
  • Xilinx 28nm 3D IC: Virtex-7 2000T: 5 chips
  • 3D IC anatomy:
    • passive silicon interposer (65nm)
    • micro-bumps
    • TSV (through-silicon via)
    • C4 bump
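
One common way to reason about the cost motivation above is a die-yield estimate: small dies yield better than one large die of the same total area. The sketch below uses a simple Poisson defect model, which is an assumption and not something stated in the talk. The total area, defect density, and slice count are hypothetical, and interposer, micro-bump, assembly, and test costs are ignored; slices are assumed to be tested before assembly.

```python
import math


def poisson_yield(area_cm2, defect_density_per_cm2):
    """Fraction of defect-free dies under a simple Poisson defect model."""
    return math.exp(-area_cm2 * defect_density_per_cm2)


def silicon_per_good_unit(total_area_cm2, n_slices, defect_density_per_cm2):
    """Wafer area spent per working product built from n_slices equal dies."""
    slice_area = total_area_cm2 / n_slices
    # Bad slices are discarded before assembly, so each good slice costs
    # slice_area / yield of wafer area on average.
    area_per_good_slice = slice_area / poisson_yield(
        slice_area, defect_density_per_cm2)
    return n_slices * area_per_good_slice


TOTAL_AREA = 6.0       # cm^2, hypothetical
DEFECT_DENSITY = 0.25  # defects per cm^2, hypothetical

monolithic = silicon_per_good_unit(TOTAL_AREA, 1, DEFECT_DENSITY)
sliced = silicon_per_good_unit(TOTAL_AREA, 4, DEFECT_DENSITY)
print(f"monolithic: {monolithic:.1f} cm^2, four slices: {sliced:.1f} cm^2")
```

Under this model the advantage of slicing grows exponentially with die area times defect density, so it is largest for very big dies on a young process.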