Recent trends point to the increasing proliferation
of single-chip multi-core processor systems. This trend has
accelerated with the growing difficulty of designing large,
complex cores, the diminishing instruction-level parallelism achieved
with additional transistors, and the growing importance of application-
specific domains with large amounts of data parallelism.
The advent of single-chip, multi-core systems has
once again renewed interest in automatic compiler parallelization. In
comparison to previous work in this area, though, there are
significant differences between current and future multi-cores and the
multi-processors of the past which suggest that compilation strategies
should be re-evaluated. Previous multi-processor systems were always
connected via off-chip interconnects. Multi-cores are connected using
on-chip interconnects, which can significantly change communication
bandwidth and latency. Given the importance of communication to
program parallelization, such changes may lead to different
compilation approaches.
A second motivation for additional research in
automatic parallelization is the change in workloads. Previous
multiprocessors were mostly geared toward accelerating scientific
programs with regular loops and array accesses. For the most part,
parallelizing compilers have been very successful in this domain.
Future multi-cores will be used in a variety of
specialized domains, from signal processing to rendering.
Many of these applications will have characteristics
that make them significantly harder to parallelize than
scientific applications. One such important domain is the
parallelization of network applications for network processors.
Network packet processing has been one of the
earliest adopters of multi-core processing because packet
processing requires only lightweight cores and is
inherently parallel on a per-packet basis. Compiler parallelization of
network applications must consider the previously mentioned challenges
and deal with new aspects specific to packet processing. As it stands,
significant manual tuning is still required for optimizing network
applications. For network processors like the Intel IXP, programmers
must still manually partition network applications to multiple
processing cores and deal explicitly with multiple levels of memory
hierarchy.
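The per-packet parallelism described above can be shown with a minimal sketch (generic Python, not the Shangri-la toolchain or the Intel IXP programming model; the packet fields and the toy checksum are hypothetical): each packet is processed independently, so a pool of workers can handle packets concurrently with no inter-packet coordination or locking.

```python
from multiprocessing.dummy import Pool  # thread pool; enough for a sketch

def process_packet(packet):
    # Hypothetical per-packet work: decrement TTL and compute a toy
    # checksum. Each call touches only its own packet, so packets can be
    # processed in parallel with no shared state.
    ttl = packet["ttl"] - 1
    payload = packet["payload"]
    checksum = sum(payload) % 256  # toy checksum, not a real IP checksum
    return {"ttl": ttl, "payload": payload, "checksum": checksum}

def process_stream(packets, workers=4):
    # Map packets across a worker pool; result order matches input order.
    with Pool(workers) as pool:
        return pool.map(process_packet, packets)

packets = [{"ttl": 64, "payload": bytes([i, i + 1])} for i in range(8)]
results = process_stream(packets)
```

On a real network processor this mapping is done by hand: the programmer decides which cores run which packet-processing stages, which is exactly the manual partitioning burden noted above.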
This tutorial will describe the Shangri-la
programming environment that simplifies the development of portable,
high-performance packet-processing applications on network processors.
Shangri-la is derived from the Open Research Compiler, the leading
open-source Itanium compiler infrastructure. We have extended the base
framework with: a domain-specific programming language for specifying
packet-processing applications; domain-specific and profile-guided
techniques to map packet-processing applications onto complex packet-
processing engines; and a run-time abstraction layer for dynamic
resource reconfiguration. In this tutorial, we focus on three key
aspects of our programming environment: program partitioning with a
throughput-driven cost model; memory access and latency optimizations;
and dynamic reconfiguration for reducing power and workload adaptation.
Design choices as well as the influences on those choices will be
covered.
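As one heavily simplified illustration of what a throughput-driven cost model decides (a generic sketch, not Shangri-la's actual partitioning algorithm; the stage names and cycle costs are hypothetical), consider greedily assigning pipeline stages to cores so that the busiest core, which bounds throughput, is as lightly loaded as possible:

```python
import heapq

# Hypothetical per-stage processing costs (cycles per packet).
stages = {"rx": 40, "classify": 120, "lookup": 200, "modify": 80, "tx": 60}

def partition(stage_costs, num_cores):
    """Greedy longest-processing-time assignment: place the costliest
    remaining stage on the currently lightest core. Throughput is bounded
    by the busiest core, so keeping the maximum per-core load small
    keeps throughput high."""
    # Min-heap of (load, core_id, assigned_stages); core_id breaks ties.
    cores = [(0, i, []) for i in range(num_cores)]
    heapq.heapify(cores)
    for name, cost in sorted(stage_costs.items(), key=lambda kv: -kv[1]):
        load, cid, assigned = heapq.heappop(cores)
        assigned.append(name)
        heapq.heappush(cores, (load + cost, cid, assigned))
    return sorted(cores)

mapping = partition(stages, num_cores=2)
bottleneck = max(load for load, _, _ in mapping)  # cycles on busiest core
```

A real cost model must also account for the inter-core communication and memory-latency effects discussed earlier, which is what makes the actual mapping problem hard.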