

### The power of numbers

- Last year 950M cell phones were sold, as opposed to 100M PCs
- India & China are each selling > 7M new cell-phone connections per month
  - In developing countries, the cell phone is the only computer most people have
  - In the developed world, the cell phone is the only computer people carry all the time

A shift in research is underway from PCs to cell phones, not very different from the shift from mainframes and minis to PCs in the early eighties.



Cheap & powerful handheld devices, and the powerful infrastructure needed to support services on these devices.

#### **Current Cellphone Architecture**



Real power saving implies specialized hardware

- H.264 video decoder implementations in software vs. hardware
  - the power/energy savings could be 100- to 1000-fold

but our mindset is that hardware design is:

- Difficult, risky
- Increases time-to-market
- Inflexible, brittle, error-prone
- Difficult to deal with changing standards, ...

# **SoC & Multicore Convergence**: *more application specific blocks*



# Server Microprocessors

- Also highly regular multicores with lots of specialized processing capabilities for
  - compression/decompression
  - encryption/decryption
  - intrusion detection and other security related solutions
  - Dealing with spam

Self diagnosing errors and masking them

One way to provide these functionalities is via on-chip FPGAs



### Architectural Renaissance

- Unprecedented opportunity to rethink parallel architectures
- Unprecedented need to design low-power functional blocks


# Bluespec: A new way of expressing behavior







#### **Bluespec enables**

- Extreme IP (intellectual property) reuse
  - Multiple instantiations of a block for different performance and application requirements
  - Packaging of IP so that the blocks can be assembled easily to build a large system (black box model)
- Architectural exploration





# 802.11a Transmitter Design: *Preliminary results*

| Design block  | Lines of code (BSV) | Relative area |
|---------------|---------------------|---------------|
| Controller    | 49                  | 0%            |
| Scrambler     | 40                  | 0%            |
| Conv. Encoder | 113                 | 0%            |
| Interleaver   | 76                  | 1%            |
| Mapper        | 112                 | 11%           |
| IFFT          | 95                  | 85%           |
| Cyc. Extender | 23                  | 3%            |

Complex arithmetic libraries constitute another 200 lines of code.

[MEMOCODE 2006]

#### FFT – fold to save area
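The arithmetic behind folding can be sketched as follows. This is a minimal illustration, not the Bluespec design itself: it assumes a 64-point radix-4 IFFT (3 stages of 16 butterfly-4 operations each) and assumes that a folded design simply time-multiplexes a smaller pool of physical butterflies over all 48 operations. The function names are hypothetical.

```python
def num_stages(n_points: int, radix: int) -> int:
    """Number of butterfly stages in a radix-`radix` FFT of `n_points` points."""
    stages, size = 0, 1
    while size < n_points:
        size *= radix
        stages += 1
    if size != n_points:
        raise ValueError("n_points must be a power of radix")
    return stages


def total_butterfly_ops(n_points: int, radix: int) -> int:
    """Butterfly operations needed per transform: stages x butterflies per stage."""
    return num_stages(n_points, radix) * (n_points // radix)


def cycles_per_symbol(n_points: int, radix: int, physical_butterflies: int) -> int:
    """Cycles to finish one transform if each physical butterfly does one op per cycle."""
    ops = total_butterfly_ops(n_points, radix)
    return -(-ops // physical_butterflies)  # ceiling division


if __name__ == "__main__":
    for k in (8, 4, 2, 1):
        print(f"{k} Bfly-4s -> {cycles_per_symbol(64, 4, k)} cycles per symbol")
```

Under these assumptions, fewer physical butterflies means less area but proportionally more cycles per symbol, which is the trade-off that folding exposes. For 8, 4, 2, and 1 butterflies the sketch gives 6, 12, 24, and 48 cycles; designs with 16 or more butterflies are presumably limited by something other than this count.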



#### 802.11a Transmitter Synthesis results (Only the IFFT block is changing)

| IFFT design              | Area (mm<sup>2</sup>) | Throughput latency (CLKs/sym) | Min. freq. required |
|--------------------------|-----------------------|-------------------------------|---------------------|
| Pipelined                | 5.25                  | 4                             | 1.0 MHz             |
| Combinational            | 4.91                  | 4                             | 1.0 MHz             |
| Folded (16 Bfly-4s)      | 3.97                  | 4                             | 1.0 MHz             |
| Super-Folded (8 Bfly-4s) | 3.69                  | 6                             | 1.5 MHz             |
| SF (4 Bfly-4s)           | 2.45                  | 12                            | 3.0 MHz             |
| SF (2 Bfly-4s)           | 1.84                  | 24                            | 6.0 MHz             |
| SF (1 Bfly-4)            | 1.52                  | 48                            | 12 MHz              |

Slide annotations: the same source code; all these designs were done in less than 24 hours!

TSMC 0.18 micron; numbers reported are before place and route.
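The minimum-frequency column appears to follow directly from the clocks-per-symbol column. The sketch below assumes the 802.11a OFDM symbol period of 4 µs and assumes that a design must finish one symbol within that period; the helper name is hypothetical.

```python
SYMBOL_PERIOD_US = 4.0  # 802.11a OFDM symbol duration in microseconds (assumed constraint)


def min_frequency_mhz(clocks_per_symbol: int,
                      symbol_period_us: float = SYMBOL_PERIOD_US) -> float:
    """Lowest clock rate (MHz) that completes `clocks_per_symbol` cycles in one symbol period."""
    return clocks_per_symbol / symbol_period_us


if __name__ == "__main__":
    # Clocks per symbol taken from the synthesis results table.
    designs = {
        "Pipelined": 4,
        "Combinational": 4,
        "Folded (16 Bfly-4s)": 4,
        "Super-Folded (8 Bfly-4s)": 6,
        "SF (4 Bfly-4s)": 12,
        "SF (2 Bfly-4s)": 24,
        "SF (1 Bfly-4)": 48,
    }
    for name, clks in designs.items():
        print(f"{name}: {min_frequency_mhz(clks)} MHz")
```

Read this way, the smallest design trades a 3.5x area reduction for a 12x higher clock requirement, which is still a very modest frequency for this process.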



- Video decoder: H.264
- AirBlue: a new platform to experiment with cross-layer wireless protocols
- IBM PowerPC prototype and cycle-accurate performance models
- Hardware/software co-generation





- Initial Design: Base profile
  - Eight man-months
  - 8K lines of Bluespec
    - in contrast to 80K lines of C standard
  - Decoded 720p@32FPS

Major architectural explorations over 3 months to meet different performance or cost criteria

- High performance designs (4.2 mm sq in 180nm)
  - 720p@75FPS, 1080p@65FPS
- Low cost designs
  - QCIF@15FPS (2.2mm sq), 720p@30FPS (2.4mm sq)

Current focus is on high performance FPGA implementations

### AirBlue: A platform for cross-layer wireless protocol development

- Cross-layer protocols are the hottest area of research in wireless
  - Jointly optimizing PHY, MAC, network layers
- Realistic experimentations are difficult
  - PHY (baseband) layer requires a lot of computation: traditionally in hardware
  - MAC typically done in firmware
  - Higher layers in software

Collaboration with Professor Hari Balakrishnan



### AirBlue






- Several cross-layer experiments have already been conducted on full-speed 802.11a/g implementation
  - SoftPHY: Exposes signal quality to higher layers
    - Enables new protocols: MIXIT, PPR, better rate adaptation

Efficient allocation of than 100 lines of code

Variable demands, heterogeneous SNRs



#### Phase II: IBM/MIT Collaboration March 2009 -

- Goal: Produce a cycle-accurate and parameterized model of multithreaded, multicore PowerPC to run on FPGAs
  - Architecture models in software can be flexible and have high fidelity but tend to be slow
  - Can we gain 1000X speedup by running the models on FPGAs?
- Use cheaper and widely available FPGA boards
  - Xilinx 110 as opposed to 330
- 2010

Target open source distribution by summer

Lots of technical challenges

Currently trying to boot Linux

# Could we have done these projects in C, C++, SystemC?



#### Hardware synthesis from C does not work very well: Reed Solomon Results

WiMAX requirement is to support a throughput of 134 Mbps

|                             | Bluespec | C-synthesis | Xilinx IP |
|-----------------------------|----------|-------------|-----------|
| Equivalent gate count       | 267,741  | 596,730     | 297,409   |
| Frequency (MHz)             | 108.5    | 91.2        | 145.3     |
| Steady state (cycles/block) | 276      | 2073        | 660       |
| Data rate (Mbps)            | 701.3    | 89.7        | 392.8     |

For the same area!

Higher is better for frequency and data rate; lower is better for gate count and cycles per block.

Abhinav Agarwal, Alfred Ng

# Hardware innovation is far from over

- Ubiquitous mobile devices and demand for new services are ushering in a new era of computing
- Large FPGAs are offering an unprecedented opportunity to experiment
- High-level synthesis tools like Bluespec are making architecture exploration and SoC development much easier
  - High quality synthesis
  - Modules with formal interfaces (not just wires)
  - Parameterized modules (higher-order functions)
  - Strong type system
  - Ability to interact with modules written in C, Verilog, ...

Thanks!



