Latch-based RAM semi-custom macro design change
This project was for a specific customer. We already had an HPM version of the design from a previous project, although there were still some improvements we wanted to make. For the LP version of the latch-based RAM, we wanted a new floorplan with a smaller width and a larger height. This would make floorplanning easier for the SoC team and improve routability. With the old floorplan, internal routing was the limiting factor, especially in the horizontal direction, for three reasons: (1) we had more “read word lines” than a normal SRAM; (2) we had only 4 metal layers to use; (3) the height of the standard cells was fixed. Increasing the height would significantly relieve this routing congestion.
However, changing the floorplan meant changing almost everything, including the structure of the bitcells and decoders, the routing resources in both the horizontal and vertical directions, the clock tree, and the scan chains. We designed the schematics with ECS and the layout with Laker and Pyxis.
Besides the schematic and layout design for the LP version, we also needed to run STA and power reliability analysis on both versions. Until then we did not have a mature flow for either Nanotime (STA) or XA-RA (power reliability), so we had to set up and test both tools.
My responsibilities in this project included project management, floorplan design, schematic and layout design, and voltage drop analysis.
Power structure plan and voltage drop early analysis
This project didn’t take much of my bandwidth, but it ran on and off for several months because our target products were delayed.
It is not easy to predict the power consumption and geometric coordinates of blocks at this early stage. I took data from the previous project and communicated extensively with both front-end and back-end engineers about the area and floorplan of our new design. Thanks to the experience from the previous project, this stage took much less time than expected.
Although we had used Redhawk a lot for voltage drop analysis before, Excel2IR was a new tool for me. It builds on Redhawk and makes setting up a rough design model faster and easier.
Since creating power grids in Redhawk is neither easy nor script-friendly, I still used ICC for that job. I wrote many procs of my own to help with scripting, and the final design was built entirely from Tcl scripts.
I also wrote a makefile to integrate ICC, Excel2IR and Redhawk, which made iteration faster.
Ultra-high-frequency RFID design and low power optimization
Our team designed a UHF RFID product in a TSMC 90nm process. My main responsibility was to reduce power consumption so that we could compete with world-leading products. The whole RFID chip consumed about 10uW of peak power.
After running power estimation, I determined that the clock tree consumed the most energy, because of the input capacitance of the flip-flops. I designed a new flip-flop with ultra-low input capacitance, which has only 4 clocked transistors and uses a semi-dynamic technique. In post-layout simulation, peak power was reduced by 10% to 15%, bringing it close to the leading product on the market. This solution can also be transferred easily to other chips and projects.
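The reasoning above follows the standard dynamic power relation P = C * V^2 * f: the clock toggles every cycle, so the clock-pin capacitance of every flip-flop is charged and discharged continuously. A minimal sketch of that estimate is shown here; the function name and every number in it are hypothetical placeholders, not data from this chip.

```python
def clock_pin_power(n_flops, c_clk_pin_f, vdd_v, f_clk_hz):
    """Dynamic power (W) spent charging flip-flop clock pins.

    Uses P = C * V^2 * f with an activity factor of 1, because the clock
    makes one full charge/discharge cycle per period.
    """
    return n_flops * c_clk_pin_f * vdd_v ** 2 * f_clk_hz


# All values below are hypothetical placeholders, not measurements.
baseline = clock_pin_power(n_flops=2000, c_clk_pin_f=2e-15, vdd_v=1.0, f_clk_hz=1e6)
low_cap = clock_pin_power(n_flops=2000, c_clk_pin_f=1e-15, vdd_v=1.0, f_clk_hz=1e6)
print(f"baseline: {baseline * 1e6:.3f} uW, low-cap DFF: {low_cap * 1e6:.3f} uW")
```

The point of the sketch is only the proportionality: clock-pin power scales linearly with the clock input capacitance per flip-flop, which is why a low-capacitance DFF cuts clock tree power directly.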
To bring this new DFF into the design flow, and because no automation tool was available, I wrote a set of Python scripts to (1) generate the stimulus and netlist for HSPICE simulation; (2) gather data from the HSPICE results, then calculate and create the Liberty timing file.
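As an illustration of step (2), the sketch below formats a grid of measured delays into a Liberty lookup-table group. It is not the original script: the function name, the template name `delay_2x2`, and all numbers are hypothetical, and parsing of the HSPICE output is left out.

```python
def liberty_table(group, template, slews, loads, values):
    """Format one Liberty lookup-table group (for example cell_rise).

    slews fills index_1 (input transition), loads fills index_2 (output
    load), and values[i][j] is the measurement for slews[i] and loads[j].
    """
    if len(values) != len(slews) or any(len(r) != len(loads) for r in values):
        raise ValueError("values must be len(slews) x len(loads)")

    def fmt(nums):
        return ", ".join(f"{x:.4f}" for x in nums)

    # One quoted row per input slew, joined with Liberty line continuations.
    rows = [f'"{fmt(r)}"' for r in values]
    body = ", \\\n            ".join(rows)
    return (
        f"{group} ({template}) {{\n"
        f'    index_1 ("{fmt(slews)}");\n'
        f'    index_2 ("{fmt(loads)}");\n'
        f"    values ({body});\n"
        f"}}"
    )


# Hypothetical placeholder numbers, not simulation results.
table = liberty_table(
    "cell_rise", "delay_2x2",
    slews=[0.01, 0.10], loads=[0.001, 0.010],
    values=[[0.10, 0.20], [0.15, 0.25]],
)
print(table)
```

A full flow would emit one such group per timing arc (rise and fall delay, transition, setup and hold) inside the cell and pin groups of the library.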
I also designed a BIST block to test the function and delay of the DFF. The DFF was verified through tape-out.
CPU-to-CPU optical interconnect evaluation
This project started from my professor’s belief that on-chip optical interconnection would change chip design dramatically in the near future. He got this idea from a presentation by another assistant professor, and I was assigned to do some research in this area.
After a lot of communication with the assistant professor who had the original idea, and after reading many papers, I concluded that on-chip optical interconnection was far beyond the technology of that time, and that it would take at least 5 years before the industry adopted it. (Seven years later, on-chip optical interconnection is still an idea in the lab.) However, ultra-long-distance optical interconnection was very mature, and even board-to-board optical interconnection was widely used.
Optical interconnection faced one basic difficulty: electrical signals have to be converted to optical signals and then back again. This adds a lot of cost, so optical is not cost-efficient for short-distance interconnection. Another difficulty for on-chip interconnection was that CMOS technology is not a good candidate for implementing lasers (electrical to optical) and photodiodes (optical to electrical).
In this project, I didn’t have many resources. Most of the time I worked alone, and my budget was small. So my strategy was to build a chip-to-chip optical interconnection prototype system aimed at low-latency, high-throughput CPU-to-CPU communication.
The paper from this project, “An Efficient Error Control Scheme for Chip-to-Chip Optical Interconnects”, focuses on the error control scheme in the protocol. Long-distance interconnection normally uses a CRC check with a re-send strategy. But optical interconnect differs in two major ways regarding errors: (1) the error rate is lower than in electrical interconnect; (2) errors normally hit random single bits, rather than runs of consecutive bits as in electrical interconnect. Therefore, after some experiments and calculations, I proposed using ECC instead of CRC as the error control scheme in optical interconnects, which is much more efficient in area, power and latency.
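To show why ECC suits random single-bit errors, here is a minimal sketch of a single-error-correcting code. The text above does not say which ECC the paper uses, so the textbook Hamming(7,4) code serves purely as a stand-in: the receiver repairs one flipped bit on its own, with no re-send round trip.

```python
def hamming74_encode(d):
    """Encode 4 data bits [d1, d2, d3, d4] into a 7-bit codeword.

    Bit positions 1..7 hold p1, p2, d1, p3, d2, d3, d4.
    """
    d1, d2, d3, d4 = d
    p1 = d1 ^ d2 ^ d4  # parity over positions 1, 3, 5, 7
    p2 = d1 ^ d3 ^ d4  # parity over positions 2, 3, 6, 7
    p3 = d2 ^ d3 ^ d4  # parity over positions 4, 5, 6, 7
    return [p1, p2, d1, p3, d2, d3, d4]


def hamming74_decode(c):
    """Correct up to one flipped bit; return (data_bits, error_position).

    error_position is 0 when no error was detected, otherwise 1..7.
    """
    c = list(c)
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]
    s3 = c[3] ^ c[4] ^ c[5] ^ c[6]
    pos = s1 + 2 * s2 + 4 * s3  # syndrome equals the flipped position
    if pos:
        c[pos - 1] ^= 1
    return [c[2], c[4], c[5], c[6]], pos


word = [1, 0, 1, 1]
sent = hamming74_encode(word)
received = list(sent)
received[4] ^= 1  # flip one bit in transit (position 5)
recovered, position = hamming74_decode(received)
print(recovered == word, position)
```

With a CRC, the same single-bit error would only be detected and the whole frame sent again, which costs latency and buffering. A burst of several flipped bits would defeat this simple code, which is why the single-bit error pattern of the optical link matters for the choice.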
High-speed full-custom register file design
The purpose of this project was to design a high-speed register file with multiple synchronous read and write ports, to meet the ultra-high frequency and bandwidth requirements of a 1GHz 4-issue 64-bit general-purpose RISC CPU. We needed to implement an SRAM with 8 independent read ports and 4 independent write ports, with a read latency of less than 500ps. Our solution was to use 2 SRAMs with identical data content, each with 4 independent read ports and 4 independent write ports. At the RTL level, we wrote both of them with the same address and data. It was also guaranteed that a read address and a write address would never be the same within one cycle.
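The dual-copy scheme can be sketched as a small behavioral model. This is an illustration only, not the RTL: the class name, the depth, and the mapping of read ports 0-3 to the first copy and 4-7 to the second are assumptions made for the example.

```python
class DualCopyRegFile:
    """Behavioral model: two 4R/4W arrays presented as one 8R/4W file."""

    READ_PORTS_PER_COPY = 4
    WRITE_PORTS = 4

    def __init__(self, depth):
        self.copies = [[0] * depth for _ in range(2)]

    def cycle(self, writes, read_addrs):
        """Apply one clock cycle and return the read data.

        writes: up to 4 (addr, data) pairs, mirrored into both copies.
        read_addrs: up to 8 addresses; ports 0-3 use copy 0, 4-7 copy 1.
        """
        if len(writes) > self.WRITE_PORTS:
            raise ValueError("at most 4 writes per cycle")
        if len(read_addrs) > 2 * self.READ_PORTS_PER_COPY:
            raise ValueError("at most 8 reads per cycle")
        # The design guarantees this never happens; the model checks it.
        if {addr for addr, _ in writes} & set(read_addrs):
            raise ValueError("read and write hit the same address")
        data = [
            self.copies[port // self.READ_PORTS_PER_COPY][addr]
            for port, addr in enumerate(read_addrs)
        ]
        for addr, value in writes:
            for copy in self.copies:  # keep both copies identical
                copy[addr] = value
        return data


rf = DualCopyRegFile(depth=32)  # depth is a hypothetical value
rf.cycle(writes=[(1, 11), (2, 22)], read_addrs=[])
out = rf.cycle(writes=[(3, 33)], read_addrs=[1, 2, 1, 2, 1, 2, 1, 2])
print(out)
```

Because every write goes to both copies, any read port sees the same data whichever copy serves it. The cost is doubled storage area in exchange for halving the read ports each bitcell has to support.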
The hardest part of the design was read latency. A normal SRAM with differential sense amplifiers needs 2 read bit lines for each read port, which takes too much area with 4 read ports and 4 write ports. Larger area means longer word lines and bit lines, which in turn means larger read latency. In addition, a differential sense amplifier is not fast enough to meet a 500ps read latency. We had 2 choices: a pre-charged style sense amplifier, or an analog sense amplifier based on a current mirror. Both are related to patents, so details are not disclosed here. The former is simpler and better suited to small SRAMs, while the latter is faster but consumes much more power.
My role in this project changed from chip to chip, since we designed several versions for several chips using different CMOS processes. Hopping among different fabs and technology nodes was a management decision; they were trying very hard to reach the 1GHz frequency target. I have to say these changes added a lot of unnecessary work for the custom design team, although they did help me: I could focus on one single project, learn from senior engineers, and eventually take on much more responsibility.
In the first chip, we used the 130nm IBM CMOS process. My role in this version was to write Tcl scripts that extracted data from Nanosim simulation results and created the Liberty file describing the SRAM’s timing and power characteristics.
In the second chip, we used the 130nm Chartered CMOS process. In this version I took on more responsibility, including: (1) improving and maintaining the SPICE-level simulation environment used to verify function and extract timing and power numbers; (2) running pre-layout and post-layout simulations to verify function and timing; (3) drawing the layout of the read and write address decoders. The most important jobs were function and timing verification, but the most time-consuming job was drawing the layout entirely by hand. (It would actually have been better to implement these digital decoders with synthesis, or at least with a custom routing tool, although the job did improve my understanding of layout.)
In the third chip, we used the 90nm TSMC CMOS process. Because we moved from 130nm to 90nm, we had to shrink the design and redo all the verification work. One of the senior engineers left before this version started, so I took over his position and most of the responsibility, including: (1) shrinking the design and verifying it in pre-layout simulation; (2) planning and balancing the clock; (3) improving read and write latency by sizing transistors; (4) floorplanning; (5) drawing the layout of the sense amplifier; (6) verifying function and timing with post-layout simulation; (7) extracting timing and power numbers and creating the Liberty library database; (8) helping to meet the project schedule.