TUTORIAL

Compilation system for throughput-driven multi-core processors


Schedule
Sunday 5th, morning
Organizers/Speakers
- Michael Chen, MTL, Intel Corp.
- Erik Johnson, CTL, Intel Corp.
- Roy Ju, MTL, Intel Corp.
Abstract

Recent trends point to the increasing proliferation of single-chip multi-core processor systems. This trend has accelerated with the increasing difficulty of designing large, complex cores, the diminishing instruction-level parallelism gained from additional transistors, and the growing importance of application-specific domains with large amounts of data parallelism.

The advent of single-chip, multi-core systems has once again renewed interest in automatic compiler parallelization. In comparison to previous work in this area, though, there are significant differences between current and future multi-cores and the multi-processors of the past, which suggest that compilation strategies should be re-evaluated. Previous multi-processor systems were always connected via off-chip interconnects; multi-cores are connected using on-chip interconnects, which can significantly change communication bandwidth and latency. Given the importance of communication in program parallelization, such changes may lead to different compilation approaches.

The other motivation for additional research in automatic parallelization is the change in workloads. Previous multi-processors were mostly geared toward accelerating scientific programs with regular loops and array accesses, and for the most part parallelizing compilers have been very successful in that domain. Future multi-cores will find themselves used in a variety of specialized applications, from signal processing to rendering. Many of these applications have characteristics that make them significantly harder to parallelize than scientific codes. One important such domain is the parallelization of network applications for network processors.

Network packet processing has been one of the earliest adopters of multi-core processing because packet processing requires only lightweight processors and is inherently parallel on a per-packet basis. Compiler parallelization of network applications must address the previously mentioned challenges as well as new aspects specific to packet processing. As it stands, significant manual tuning is still required to optimize network applications: for network processors like the Intel IXP, programmers must manually partition applications across multiple processing cores and deal explicitly with multiple levels of the memory hierarchy.
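
To make the per-packet parallelism concrete, here is a minimal sketch in ordinary C with pthreads (not IXP microengine code; the queue, table, and worker names are hypothetical). Each worker runs the same handler on a different packet, with no cross-packet dependences; on a real network processor, the programmer would additionally have to decide by hand which memory level (scratchpad, SRAM, or DRAM) holds structures like the route table.

    #include <pthread.h>
    #include <stdint.h>
    #include <stdio.h>

    #define NUM_WORKERS 4          /* stands in for the processing cores */
    #define NUM_PACKETS 64

    typedef struct { uint32_t dst_addr; uint32_t payload; } packet_t;

    static packet_t queue[NUM_PACKETS];      /* shared input queue */
    static int next_pkt = 0;                 /* next packet to claim */
    static pthread_mutex_t qlock = PTHREAD_MUTEX_INITIALIZER;

    /* On an IXP-class processor, placing this table in scratchpad, SRAM,
     * or DRAM is a manual decision with large performance consequences. */
    static uint32_t route_table[256];

    static void *worker(void *arg)
    {
        (void)arg;
        for (;;) {
            pthread_mutex_lock(&qlock);
            int i = (next_pkt < NUM_PACKETS) ? next_pkt++ : -1;
            pthread_mutex_unlock(&qlock);
            if (i < 0)
                return NULL;
            /* per-packet work: a lookup and a header rewrite,
             * independent of every other packet */
            queue[i].dst_addr = route_table[queue[i].dst_addr & 0xff];
        }
    }

    int main(void)
    {
        pthread_t tid[NUM_WORKERS];
        for (int i = 0; i < 256; i++)
            route_table[i] = (uint32_t)(255 - i);
        for (int i = 0; i < NUM_PACKETS; i++)
            queue[i] = (packet_t){ .dst_addr = (uint32_t)i, .payload = 0 };
        for (int i = 0; i < NUM_WORKERS; i++)
            pthread_create(&tid[i], NULL, worker, NULL);
        for (int i = 0; i < NUM_WORKERS; i++)
            pthread_join(tid[i], NULL);
        printf("processed %d packets on %d workers\n", NUM_PACKETS, NUM_WORKERS);
        return 0;
    }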

This tutorial will describe the Shangri-la programming environment, which simplifies the development of portable, high-performance packet-processing applications on network processors. Shangri-la is derived from the Open Research Compiler, the leading open-source Itanium compiler infrastructure. We have extended the base framework with: a domain-specific programming language for specifying packet-processing applications; domain-specific and profile-guided techniques to map packet-processing applications onto complex packet-processing engines; and a run-time abstraction layer for dynamic resource reconfiguration. In this tutorial, we focus on three key aspects of our programming environment: program partitioning with a throughput-driven cost model; memory access and latency optimizations; and dynamic reconfiguration for power reduction and workload adaptation. We will also cover our design choices and the influences on those choices.
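
For flavor, the fragment below shows the shape of an application such a language describes: small packet-handling stages that a compiler is then free to group and map onto cores. This is plain C with invented stage names, not actual Shangri-la language syntax.

    #include <stdint.h>
    #include <stdio.h>

    /* Illustrative stages with invented names; not Shangri-la syntax. */
    typedef struct { uint32_t dst; uint8_t ttl; } packet_t;

    /* one stage: drop expired packets, otherwise decrement TTL */
    static int stage_check_ttl(packet_t *p)
    {
        if (p->ttl == 0)
            return 0;                  /* drop */
        p->ttl--;
        return 1;                      /* pass downstream */
    }

    /* next stage: toy route lookup over four output ports */
    static uint32_t stage_lookup(const packet_t *p)
    {
        return p->dst & 0x3;
    }

    int main(void)
    {
        packet_t pkts[] = { { 7, 64 }, { 12, 1 }, { 9, 0 } };
        for (int i = 0; i < 3; i++) {
            if (!stage_check_ttl(&pkts[i])) {
                puts("drop");
                continue;
            }
            printf("forward to port %u\n", stage_lookup(&pkts[i]));
        }
        return 0;
    }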

Outline
I. Problem overview

- Multi-core processing systems
- Multiprocessor programming challenges
- The Shangri-la system as a solution

II. Background

- Network application characteristics
- IXP network processor architecture
- The Open Research Compiler framework

III. Shangri-la project overview

- Introduction to system components
  - Programming languages
  - Functional profiler
  - Pi compiler
  - Aggregate compiler
  - Run-time system

IV. Compilation

- Compiler phases and phase-ordering design considerations
- Integrated profiler and abstract machine
  - Goals for collected statistics
- Automatic partitioning of network applications
  - Using the throughput-driven cost model (see the sketch after this list)
  - Targeting multi-core, multi-thread architectures
- Explicit memory mapping of application data on architectures with multiple memory levels but no caches
- Memory optimizations in the compiler
  - Software-controlled caches
  - Optimized synchronization for shared data
  - Memory promotion
  - Aggressive memory access elimination
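
As a rough illustration of the throughput-driven idea (an assumed form for exposition, not the actual Shangri-la cost model), the sketch below estimates pipeline throughput from per-stage cycles-per-packet figures: with one stage per core, the slowest stage is the bottleneck, so a partitioner should balance estimated stage costs.

    #include <stdio.h>

    #define NUM_STAGES 4

    /* profile-derived estimates of cycles per packet for each stage
     * (the numbers here are made up for the example) */
    static const double stage_cycles[NUM_STAGES] = { 120.0, 310.0, 95.0, 180.0 };

    /* With one stage per core, the bottleneck stage sets the rate. */
    static double pipeline_throughput(double clock_hz)
    {
        double worst = 0.0;
        for (int s = 0; s < NUM_STAGES; s++)
            if (stage_cycles[s] > worst)
                worst = stage_cycles[s];
        return clock_hz / worst;
    }

    int main(void)
    {
        /* e.g., 1.4 GHz processing engines */
        printf("estimated throughput: %.0f packets/sec\n",
               pipeline_throughput(1.4e9));
        return 0;
    }

A partitioner guided by such a model merges or splits program aggregates so that no single core's estimated cycles-per-packet dominates the pipeline.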

V. Runtime system

- Resource abstraction layer (RAL)
  - Tradeoffs between performance and adaptability
- Runtime mapping reconfiguration (a policy sketch follows this list)
  - For power conservation
  - For load optimization
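
The sketch below illustrates the kind of load-driven policy a runtime layer could apply (the headroom factor and rates are hypothetical, not the actual RAL policy): when offered load falls, the mapping is folded onto fewer cores so the rest can be powered down, and it spreads out again when load rises.

    #include <stdio.h>

    #define MAX_CORES 8

    /* choose how many cores to keep active for a given offered load,
     * with ~25% headroom (an arbitrary choice for illustration) so
     * transient bursts are absorbed */
    static int choose_active_cores(double offered_pps, double per_core_pps)
    {
        int need = (int)(offered_pps * 1.25 / per_core_pps) + 1;
        if (need < 1)
            need = 1;
        if (need > MAX_CORES)
            need = MAX_CORES;
        return need;
    }

    int main(void)
    {
        const double per_core = 1.0e6;  /* assume 1M packets/sec per core */
        const double loads[] = { 0.4e6, 2.5e6, 6.8e6 };
        for (int i = 0; i < 3; i++)
            printf("load %.1fM pps -> %d active cores\n",
                   loads[i] / 1e6, choose_active_cores(loads[i], per_core));
        return 0;
    }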


Bibliography
Open Research Compiler, http://ipf-orc.sourceforge.net/

Intel Network Processors, http://www.intel.com/design/network/products/npfamily/index.htm