Intel Pentium 4 Processor
I)
INTRODUCTION:
A high-performance Pentium® 4 processor-based system provides an extremely
powerful computing experience. Whether you are using a broadband connection to
the Web, playing cutting-edge online games, watching or creating videos, or
running other performance-intensive applications, this chip intensifies the 3D
action of your favorite games and enables clear and smooth audio and video
streaming.
Pentium® 4 processors provide the performance to power the connected home,
which means linking all your digital devices in order to extend their
capabilities, from your digital camera and MP3 player to your entire home
entertainment system. Increased performance and headroom allow you to take
advantage of the emerging Internet and computer technologies enabling the
connected home. Pentium® 4 processors also deliver high performance when
networking multiple PCs, or when attaching your PC to home consumer electronic
systems and new peripherals.
II)
AN OVERVIEW OF PENTIUM 4:
The Pentium 4 carries a whopping 42 million transistors, 14 million more than
the currently available Pentium III Coppermine processors. This massive
increase in transistor count correlates directly with die size, so naturally,
the Pentium 4 is significantly larger than its predecessor. So, why did Intel
decide to make the Pentium 4 larger? Since the .13-micron process is not quite
ready yet (and won't be until next year), the P4 will be etched using the same
.18-micron, aluminum trace process as the Coppermine. It does not take a
mathematician to realize that 42 million transistors will not fit in a smaller
space than a current product with 28 million. The question then becomes: what
purpose do the extra transistors serve?

Figure 1. Intel Pentium 4 Block Diagram.
Intel NetBurst Micro-Architecture:
For the first time since the Pentium Pro, Intel has revamped its
micro-architecture, adding features that it says will allow it to deliver
leading performance for the next several years.
The first important point to consider is that processor performance is not
determined solely by frequency (raw MHz). Rather, it is a function of frequency
multiplied by IPC, or instructions per clock cycle. In order to overcome the
frequency limitations of the P6 architecture implemented in Pentium II and III
systems, Intel developed an architecture that slightly reduces the number of
instructions per clock but gains significantly higher frequency capabilities.
Several new features make up the new architecture, so in order to better
understand the workings of the Pentium 4 we have broken them down individually.
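The frequency-times-IPC relationship can be illustrated with a small Python sketch. The clock speeds and IPC values below are purely illustrative assumptions, not measured figures for any real processor, and `relative_performance` is a hypothetical helper name.

```python
def relative_performance(freq_ghz: float, ipc: float) -> float:
    """Billions of instructions completed per second: frequency (GHz) x IPC."""
    return freq_ghz * ipc


# Assumed numbers for two hypothetical designs:
# a shorter pipeline with higher IPC but a lower clock ceiling,
# and a deeper pipeline that gives up some IPC to reach a higher clock.
short_pipeline = relative_performance(freq_ghz=1.0, ipc=1.0)
deep_pipeline = relative_performance(freq_ghz=1.5, ipc=0.8)

print(f"short pipeline: {short_pipeline:.2f} billion instructions/s")
print(f"deep pipeline:  {deep_pipeline:.2f} billion instructions/s")
print(f"deep/short ratio: {deep_pipeline / short_pipeline:.2f}")
```

Under these assumed values the deeper design still comes out ahead, because the gain in frequency outweighs the loss in IPC. With a larger IPC loss the result would reverse.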
III)
HYPER PIPELINED TECHNOLOGY:
Quite simply, the deeper pipeline provides increased scalability, which has
allowed Intel to debut the Pentium 4 at speeds of 1.5 and 1.4GHz using the same
etching process as the Pentium III.
Not all things are peachy in the land of the 20-stage pipeline, however. With
the depth of the branch prediction pipe doubled, the penalty associated with
mis-predictions is greatly increased: rather than flushing 10 speculatively
executed instructions, the Pentium 4 has to flush 20 and start execution over
again in the correct program branch. The recovery time on the 20-stage pipe is
much longer than on the 10-stage pipe, resulting in a lower average number of
instructions successfully executed per clock cycle. To compensate for the lower
IPC, Intel has implemented a couple of features that greatly reduce the
inherent mis-predict penalty: the Execution Trace Cache and the Dynamic
Execution Engine.
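A rough way to see why a deeper pipeline lowers IPC is a simple cycles-per-instruction model, sketched below in Python. The branch frequency and mis-prediction rate are assumed values chosen only for illustration, the flush penalty is taken to equal the pipeline depth (10 or 20 cycles), and `effective_ipc` is a hypothetical helper, not an Intel formula.

```python
def effective_ipc(base_ipc: float,
                  branch_fraction: float,
                  mispredict_rate: float,
                  flush_penalty_cycles: int) -> float:
    """Average IPC once branch mis-prediction flushes are accounted for."""
    base_cpi = 1.0 / base_ipc
    # Extra cycles per instruction lost to pipeline flushes.
    penalty_cpi = branch_fraction * mispredict_rate * flush_penalty_cycles
    return 1.0 / (base_cpi + penalty_cpi)


# Assumptions: 20% of instructions are branches, 5% of those are mis-predicted.
ipc_10_stage = effective_ipc(1.0, 0.20, 0.05, flush_penalty_cycles=10)
ipc_20_stage = effective_ipc(1.0, 0.20, 0.05, flush_penalty_cycles=20)

print(f"10-stage pipeline IPC: {ipc_10_stage:.3f}")
print(f"20-stage pipeline IPC: {ipc_20_stage:.3f}")
```

In this model, lowering either the mis-prediction rate or the cost of each flush pulls the 20-stage figure back toward the 10-stage one.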

Figure 2.
·
Execution Trace Cache
Level 1 cache is normally split between
the instruction and data caches, both of which are 16KB on the Pentium III.
This go 'round, Intel has decreased the data cache to 8KB and has
re-implemented the instruction cache to store micro-ops in the path of the
program execution so that results of program branches are integrated into the
same cache line. Decode latency is avoided on a cache hit because the execution
engine can retrieve decoded operations from the cache directly, rather than
fetching and decoding commonly used instructions over and over again. In addition,
instructions that are not used do not get stored in the cache, making the
Execution Trace Cache more efficient than previous implementations.
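The benefit of caching decoded operations can be sketched with a toy model in Python. This is only an illustration of the idea of reusing decoded work. It is not how the hardware is organized, and the names `ToyTraceCache` and `toy_decode` as well as the textual "instructions" are invented for the example.

```python
class ToyTraceCache:
    """Caches decoded micro-ops by instruction address (toy model)."""

    def __init__(self, decoder):
        self._decoder = decoder
        self._decoded = {}          # address -> tuple of micro-ops
        self.decode_calls = 0       # how often the slow decoder actually ran

    def fetch(self, address, instruction):
        # Only instructions that are actually executed ever get stored.
        if address not in self._decoded:
            self.decode_calls += 1
            self._decoded[address] = self._decoder(instruction)
        return self._decoded[address]


def toy_decode(instruction):
    """Hypothetical decoder: split a textual instruction into 'micro-ops'."""
    return tuple(instruction.split())


cache = ToyTraceCache(toy_decode)
loop_body = [(0x10, "load a"), (0x14, "add a b"), (0x18, "store a")]

# Execute the same three-instruction loop 1000 times.
for _ in range(1000):
    for address, instruction in loop_body:
        cache.fetch(address, instruction)

print("instructions executed:", 1000 * len(loop_body))
print("decoder invocations:  ", cache.decode_calls)
```

The loop executes 3000 instructions, but the decoder runs only once per distinct instruction.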
·
Advanced Dynamic Execution
The second key to minimizing the branch
mis-predict penalty lies with Intel's Dynamic Execution Engine, which keeps the
Arithmetic Logic Units busy with instructions to execute. As opposed to the
Pentium III, which only provided 42 instructions from which the execution units
could choose, the Pentium 4 offers 126, increasing the probability that the data
needed after a cache miss will be available immediately rather than having to
wait to fetch it from memory. As processor frequency ramps upwards, this
becomes increasingly important since system memory speed does not scale with
the processor.
In addition to providing a greater window
of instructions for the execution engine to choose from, enhanced branch
prediction has also been provided to further reduce the number of
mis-predictions. Intel estimates this number to be about 33% lower than with
the P6's branch prediction because of an enhanced prediction algorithm
and a 4KB branch target buffer that stores detail on the history of past
branches.
·
Rapid Execution Engine
If you have yet to pick up on a recurring theme for the Pentium 4, here's a
clue: execution. In order to further compensate for the lower IPC of the
NetBurst Architecture, Intel has clocked the Arithmetic Logic Units at twice
the frequency of the processor core. So, on a 1.5GHz Pentium 4, the ALUs are
screaming at 3GHz with latency that is half the duration of the core clock.
Intel estimates that as processor speeds increase, the integer performance of
the Pentium 4 will improve, since the speed of the ALUs (which most
significantly impact integer performance) escalates twice as fast.
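The double-clocked ALU arithmetic can be written out as a short Python sketch. The helper names are hypothetical, and the only inputs are the core clock speeds and the 2x ratio stated above.

```python
def alu_clock_ghz(core_clock_ghz: float) -> float:
    """ALUs run at twice the core frequency."""
    return 2.0 * core_clock_ghz


def alu_cycle_time_ns(core_clock_ghz: float) -> float:
    """Duration of one ALU clock in nanoseconds (half a core clock)."""
    return 1.0 / alu_clock_ghz(core_clock_ghz)


for core in (1.4, 1.5, 2.0):
    print(f"core {core:.1f} GHz -> ALU {alu_clock_ghz(core):.1f} GHz, "
          f"ALU cycle {alu_cycle_time_ns(core):.3f} ns "
          f"(core cycle {1.0 / core:.3f} ns)")
```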
·
400MHz Front Side Bus
One of the most dramatic additions to the NetBurst architecture is a
quad-pumped 100MHz system bus, delivering the equivalent of 3.2GB/s of
bandwidth. The idea behind the accelerated 64-bit bus is to match the bandwidth
of the dual RDRAM channels, which also provide 3.2GB/s of theoretical
bandwidth. Of course, the signaling scheme put in place by Intel could not be
100% efficient, so there is also a buffer to help facilitate sustained 400MHz
data transfers. With such a high-speed bus in place, the Pentium 4 is able to
push more than three times as much data as the Pentium III (which is limited to
1.06GB/s on a 133MHz bus).
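The bus bandwidth figures follow from bus width, base clock and transfers per clock, as this Python sketch shows. It uses 1 GB/s = 1000 MB/s and assumes a 64-bit, single-pumped bus for the Pentium III. `bus_bandwidth_gb_s` is a hypothetical helper name.

```python
def bus_bandwidth_gb_s(width_bits: int,
                       base_clock_mhz: float,
                       transfers_per_clock: int) -> float:
    """Peak theoretical bus bandwidth in GB/s."""
    bytes_per_transfer = width_bits / 8
    return bytes_per_transfer * base_clock_mhz * transfers_per_clock / 1000


# Pentium 4: 64-bit bus, 100 MHz base clock, quad-pumped.
p4_bus = bus_bandwidth_gb_s(64, 100, 4)
# Pentium III: 64-bit bus at 133 MHz, one transfer per clock (assumed).
p3_bus = bus_bandwidth_gb_s(64, 133, 1)

print(f"Pentium 4 bus:   {p4_bus:.2f} GB/s")
print(f"Pentium III bus: {p3_bus:.2f} GB/s")
print(f"ratio:           {p4_bus / p3_bus:.2f}")
```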
·
Advanced Transfer Cache
Like the Pentium III before it, the
Pentium 4 boasts 256KB of on-die cache on a 256-bit bus. Unlike the Coppermine,
however, the Pentium 4's L2 cache transfers data on each core clock rather than
every other cycle. The data transfer rate from the L2 to the CPU's core can be
calculated as bus width x transfers per clock x core frequency:
256-bit (32 bytes) x 1 transfer per clock x 1.5GHz = 48GB/s for the Pentium 4 1.5GHz
256-bit (32 bytes) x 0.5 transfers per clock x 1GHz = 16GB/s for the Pentium III 1GHz
Again, as processor frequencies increase,
so does the memory bandwidth of the L2. For example, once Intel hits 2GHz, the
L2 will be able to provide 64GB/s of bandwidth - another example of Intel
striving to keep the execution units busy rather than sitting idle.
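The same calculation can be expressed as a small Python function so that other clock speeds are easy to check. `l2_bandwidth_gb_s` is a hypothetical helper, and the inputs are the figures quoted above.

```python
def l2_bandwidth_gb_s(bus_width_bits: int,
                      transfers_per_clock: float,
                      core_clock_ghz: float) -> float:
    """Peak L2-to-core bandwidth in GB/s."""
    bytes_per_transfer = bus_width_bits / 8
    return bytes_per_transfer * transfers_per_clock * core_clock_ghz


p4_1500 = l2_bandwidth_gb_s(256, 1.0, 1.5)   # transfer on every core clock
p3_1000 = l2_bandwidth_gb_s(256, 0.5, 1.0)   # transfer on every other clock
p4_2000 = l2_bandwidth_gb_s(256, 1.0, 2.0)   # the 2GHz case

print(f"Pentium 4 1.5GHz: {p4_1500:.0f} GB/s")
print(f"Pentium III 1GHz: {p3_1000:.0f} GB/s")
print(f"Pentium 4 2GHz:   {p4_2000:.0f} GB/s")
```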

Figure 3.
IV)
CLOCK SPEED AND BANDWIDTH:
In the case of the Pentium 4, a new architecture was the only route to
increasing the clock speed, as the aging P6 core had long since exceeded its
design limits.
However, having a processor running at 1+ GHz is useless if it is sitting idle
and waiting for data to process. Therefore Intel has to make sure that the rest
of the system is capable of feeding enough data to keep it running efficiently.
One of the biggest bottlenecks is the memory subsystem responsible for data
storage and retrieval. A processor capable of consuming 2 GB/s of data will be
severely bottlenecked by a memory bandwidth of only 800 MB/s. Most code is
executed from main memory, and approximately 80% of a processor's cycles are
devoted to manipulating this data. With current processor and memory
architectures, a 1+ GHz processor demands a memory bus actually capable of that
bandwidth. Significant performance benefits await adequate chipsets.
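The size of the bottleneck in the 2 GB/s versus 800 MB/s example can be estimated with a deliberately crude Python sketch. It assumes that all of the processor's data demand must be served by main memory and ignores caches entirely. `sustained_fraction` is a hypothetical helper.

```python
def sustained_fraction(cpu_demand_gb_s: float, memory_supply_gb_s: float) -> float:
    """Fraction of the processor's peak data rate that memory can sustain."""
    return min(1.0, memory_supply_gb_s / cpu_demand_gb_s)


fraction = sustained_fraction(cpu_demand_gb_s=2.0, memory_supply_gb_s=0.8)
print(f"processor is fed at {fraction:.0%} of its peak rate")
print(f"idle share waiting for data: {1.0 - fraction:.0%}")
```

Real systems do better than this worst case, because caches serve part of the demand.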
V)
PIPELINE AND PERFORMANCE:
A 1+ GHz CPU runs into its own set of problems. In particular, the time
available to execute an instruction within a single clock cycle becomes too
short to be feasible. The CPU needs time to execute the instruction or, in the
case of a pipelined CPU, to execute multiple instructions.
In essence a CPU is nothing
more than an extremely fast calculator, capable of only simple arithmetic and
simple logical decisions. For example, take the value of ‘A’, and add it to the
value of ‘B’, or determine if ‘A’ is greater than ‘B’. The processor must first
know where the values are stored, and what specifically to do with the values
(e.g., add, multiply). Further, once the instructions and data have been
located, interpreted, and executed, the result must be stored in memory for
later use. To process an instruction, the processor must:
Fetching: locate and retrieve the data from memory
Decoding: interpret or translate the instruction from the software
Executing: perform the given instruction on the given data
Store: place the result back into a memory location
Of course, the above is an extremely simplified version of the process. Suffice
it to say that each time an instruction is to be performed, the processor must
fetch the data, decode the instruction, execute the instruction, and store the
result. In a simple non-pipelined design, all of this has to be performed in
one clock cycle; the time required is known as the execution latency.
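The fetch-decode-execute-store loop described above can be sketched as a toy interpreter in Python. The instruction format, the operation names and the dictionary used as "memory" are all invented for illustration and do not correspond to any real instruction set.

```python
import operator

# Hypothetical operations the toy CPU understands.
OPERATIONS = {"add": operator.add, "mul": operator.mul, "gt": operator.gt}


def run(program, memory):
    """Process each instruction with the four steps named in the text."""
    for instruction in program:                       # fetch
        op_name, src_a, src_b, dest = instruction     # decode
        result = OPERATIONS[op_name](memory[src_a], memory[src_b])  # execute
        memory[dest] = result                         # store
    return memory


# Add A to B, then determine whether A is greater than B.
memory = run(
    program=[("add", "A", "B", "SUM"), ("gt", "A", "B", "A_GREATER")],
    memory={"A": 7, "B": 5},
)
print(memory)
```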
To increase the performance of a CPU, so that it executes these instructions
faster and reduces the instruction latency, the obvious answer is to increase
clock speed and thus complete the ‘fetch-decode-execute-store’ loop faster.
That’s quite viable, and is frequently used, but can only go so far. Once Intel
can’t make the CPU execute any faster, why not give it less to do per cycle?
Instead of fetching, decoding, executing, and storing in one cycle, suppose
Intel breaks the work into four steps: fetch, decode, execute and store are
each done in a single, shorter clock cycle. This is a 4-stage pipeline that
effectively quadruples clock speed. However, if instructions are still
processed one at a time, the pipelined CPU will not be any faster than the
original one, as it takes the same time to finish each instruction: its IPC
(instructions per clock cycle) drops to 0.25 while its clock runs four times
faster, so both designs perform identically.
In reality, the different
stages of the fetch-decode-execute-store loop do not need to be executed
sequentially; for example, why wait to fetch the next instruction until the
first fetch-decode-execute-store loop is finished? Simply start fetching the
next instruction right away. As a result, only the first instruction requires
four clock cycles; subsequent instructions are finished once per clock cycle
after that. That is, after 100 clock cycles our 4-stage pipelined
CPU will actually complete 97 instructions: 4 cycles for the first instruction,
then one instruction per clock for the subsequent 96 clocks, and not 25, as
happened earlier. This, in fact, gives the 4-stage CPU an IPC rating of about
0.97 instructions per clock cycle, much better than 0.25, but still less than
the 1.0 IPC of the non-pipelined CPU. Although the IPC rating is about 3% lower
than that of the non-pipelined CPU, the clock speed is four times faster, so
the 4-stage CPU is actually a much faster design (4 x 0.97 = 3.88 times). This has been one
of the most important motivations for Intel's design of the Pentium 4
micro-architecture, as the P6 architecture could not be made to run much faster
than a GHz without extensive rework of its fundamentals. One of the most
prominent features of the Pentium 4 architecture is therefore its deep 20-stage
pipeline, implemented to reduce the execution latency and increase the
scalability of architecture clock speed.
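The instruction counts in the 100-cycle example can be reproduced with a minimal pipeline simulation in Python. It is an idealized sketch that assumes one new instruction enters the pipeline every cycle and that there are no stalls or flushes. `simulate_pipeline` is a hypothetical helper.

```python
def simulate_pipeline(cycles: int, stages: int = 4) -> int:
    """Count instructions completed by an ideal pipeline after `cycles` clocks."""
    pipeline = [None] * stages      # slot i holds the instruction in stage i
    next_instruction = 0
    completed = 0
    for _ in range(cycles):
        # Every instruction advances one stage; a new one enters stage 0.
        pipeline = [next_instruction] + pipeline[:-1]
        next_instruction += 1
        # Whatever occupies the last stage finishes in this cycle.
        if pipeline[-1] is not None:
            completed += 1
    return completed


cycles, stages = 100, 4
overlapped = simulate_pipeline(cycles, stages)
one_at_a_time = cycles // stages    # each instruction occupies all 4 cycles

print("overlapped pipeline:", overlapped, "instructions")
print("one at a time:      ", one_at_a_time, "instructions")
print(f"IPC (overlapped): {overlapped / cycles:.2f}")
print(f"speedup over one at a time: {overlapped / one_at_a_time:.2f}x")
```

The startup cost of filling the pipeline is paid only once, so the longer the run, the closer the IPC gets to 1.0.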

Figure 4. Pentium 4
·
Quadrupled: The System Bus in the Pentium 4
Important for speed are not only the features described above, but also the
Level 2 cache and the system bus. The latter is the connection between
processor and main memory and is usually clocked slower than the processor. In
principle, the faster the bus clock, the higher the overall throughput of the
computer. The first-generation Power Macs had a bus clock of only 50 MHz; the
beige G3 Macs reached 66.7 MHz and the PowerBooks up to 83.4 MHz. The
blue-and-white G3s and all G4s reached 100 MHz. PCs, however, have faster bus
systems. The fastest Pentium III systems currently run at 133 MHz. The Pentium
4 catapults the bus clock upward by a factor of about three. With the help of
the Intel i850 chipset, which requires Rambus RDRAM as memory, effective system
bus clocks of 400 MHz should be possible. Compared with the current G4s, this
is a ratio of 4 to 1 in favor of the Pentium PC.
·
Cache in Full Processor Speed
The system bus clock also determines the clock of the Level 2 cache. But
onboard, inline, and backside caches have also become important speed factors.
The goal of these developments was to offer a fast data supply, because the
system bus and the other system components could no longer keep up with
continually increasing processor speeds. So that fast processors are not slowed
down by these components, computer manufacturers tried to create a cache that
is faster than the system bus. While the first PowerPC and Pentium computers
still had to get by without a Level 2 cache, the Pentium III computers use 256
KB located in the processor core. The advantage: compared to the backside cache
of a Power Mac, which is usually clocked at half the processor rate, the
Pentium III processors can use the full clock rate of the processor. The
Pentium 4 offers the same concept: 256 KB in the processor core with a clock
ratio of 1:1 (CPU and cache). Compared with the Pentium III, the bandwidth has
tripled, which represents a substantial speed boost. But according to Intel,
the so-called trace cache brings the real performance advantage. In an approach
comparable to Transmeta's technology, the code is stored already translated in
the L1 instruction cache instead of being decoded just in time. This procedure
saves additional waiting time.
VI)
CONCLUSIONS:
It’s very clear that Intel has put a lot of thought into the Pentium 4’s
overall design. Intel’s main objective for the Pentium 4 was to greatly enhance
multimedia performance, because Intel believes that multimedia is where the
greatest demand on CPU performance lies. Intel is definitely onto a winner with
the new Pentium 4 processor, and there are many reasons for this. One major
reason is the large number of new features the Pentium 4 has to offer. Things
like the NetBurst Architecture, the Quad-Pumped FSB, Hyper Pipelined Technology
and the SSE2 instructions are what is going to make the Pentium 4 a real
killer. With Intel’s future plans for the Pentium 4, it’s only going to get
better. Low latency and high bandwidth are going to be the key to the Pentium
4’s high performance cache. The high hit rate L1 cache and the extremely high
bandwidth L2 cache will make the Pentium 4 a solid starting ground for any
future NetBurst micro-architecture based designs. The 144 new SSE2 instructions
that the Pentium 4 features will create a major gain in performance when they
are fully integrated into all new software titles.
The Pentium 4’s high clock frequencies are going to make it very attractive to
end users who are seeking the very best and latest technology the PC market has
to offer.