DIGITAL TECH PRESENTATION
Pham Khanh
Created on July 12, 2022
Multicore Computer
PRESENTATION by Group 8
Multicore Processor
Hardware Performance
Phạm Minh Khánh
Increase in Parallelism and Complexity
[Diagram: a pipeline with Fetch, Decode, Execute, and Store stages]
Pipelining
Pipelining is a technique of decomposing a sequential process into sub-processes, each executed in a dedicated stage that operates concurrently with the other stages
[Diagram: instructions flowing through the Fetch, Decode, and Execute stages over time; a processor with the same complexity can be implemented as a pipeline processor]
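As a rough aside (a standard textbook estimate, not stated on the slides): if n instructions flow through a k-stage pipeline, the first instruction takes k cycles and each subsequent one completes in the next cycle, so

Speedup = nk / (k + n - 1)

which approaches k, the number of stages, as n grows large.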
Superscalar
Multiple pipelines are constructed by replicating execution resources. This allows multiple instructions to be executed in parallel pipelines at the same time, as long as hazards are avoided
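A minimal illustration (ours, not from the slides) of what a hazard looks like at the source level; whether any of this actually issues in parallel is decided by the hardware:

int superscalar_demo(int a, int b, int c, int d) {
    int x = a + b;   /* independent of the next statement...                */
    int y = c + d;   /* ...so a superscalar core may issue both at once     */
    int u = x + y;   /* read-after-write hazard: u needs x and y first,     */
    return u + a;    /* so these cannot execute until the results are ready */
}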
Simultaneous multithreading (SMT)
SMT is a technique in which a CPU splits each of its physical cores into virtual cores, known as hardware threads. This is done to increase performance by allowing each core to run two instruction streams at once.
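A minimal sketch (ours, not from the slides) of how those virtual cores appear to software; sysconf(_SC_NPROCESSORS_ONLN) is a widely supported POSIX extension that counts logical processors:

#include <stdio.h>
#include <unistd.h>

/* On an SMT-enabled CPU the OS sees one logical processor per hardware
 * thread, so a quad-core chip with 2-way SMT typically reports 8 here. */
int main(void) {
    long logical = sysconf(_SC_NPROCESSORS_ONLN);
    printf("Logical processors (hardware threads): %ld\n", logical);
    return 0;
}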
Multicore processor
[Diagram: a CPU containing Core 1 through Core n, each core (superscalar or SMT) with its own dedicated L1-I and L1-D caches, all sharing an L2 cache]
Power Consumption
[Chart: power density in Watts/cm2 for logic and memory across process generations (0.25, 0.18, 0.13, 0.1 µm), on a log scale from 10 to 100]
How to use all those logic transistors?
Pollack’s rule
Performance increase is roughly proportional to the square root of the increase in complexity
In other words: if the transistor logic is doubled (x2), then performance increases by about 40%
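The 40% figure follows directly from the square root (our arithmetic): performance ∝ √complexity, so doubling the transistor logic gives √2 ≈ 1.41, i.e. roughly a 40% performance increase.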
Software Performance Issues
Phạm Minh Khánh
Software on Multicore
Amdahl's Law
Used to calculate how much a computation can be sped up by running part of the program in parallel
[Diagram: a program divided into a part which cannot be parallelized and a part which can be parallelized]
Let's say:
T = total time of serial execution
B = total time of the non-parallelizable part
T - B = total time of the parallelizable part
N = the number of threads or CPUs
Note: normalize T = 1
More threads or CPUs = faster execution time
Speedup = Original Execution Time / Execution Time after Enhancement

With T normalized to 1, the execution time after enhancement is B + (1 - B)/N, so:

Speedup = 1 / ( B + (1 - B)/N )
Example: the execution time of half of the program can be accelerated by a factor of 2. What is the overall program speedup?

With B = 0.5 and N = 2:

Speedup = 1 / ( 0.5 + (1 - 0.5)/2 ) = 1 / 0.75 ≈ 1.33
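A minimal sketch (ours, not from the slides) that evaluates the same formula:

#include <stdio.h>

/* Amdahl's law with T normalized to 1:
 * speedup = 1 / (B + (1 - B) / N) */
double amdahl_speedup(double b, double n) {
    return 1.0 / (b + (1.0 - b) / n);
}

int main(void) {
    /* The worked example from the slides: B = 0.5, N = 2 */
    printf("speedup = %.2f\n", amdahl_speedup(0.5, 2.0));  /* prints 1.33 */
    return 0;
}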
As the number of processors increases, the amount of time required for the parallel portion of each program decreases, but the serial portion of each program stays the same
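Taking the limit makes the ceiling explicit (a standard consequence of the formula, not stated on the slides): as N → ∞, Speedup → 1/B. With B = 0.5, the overall speedup can never exceed 2 no matter how many processors are added.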
Examples of applications that benefit directly from multicore:
Multithreaded native applications
Multiprocess applications
Java applications
Multi-instance applications
From Valve’s perspective, threading granularity options are defined as follows (see the sketch after this list):
Coarse-grained threading
Fine-grained threading
Hybrid threading
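A minimal sketch of coarse-grained threading, assuming hypothetical subsystem loops render_loop and simulation_loop (our names, for illustration): each major module gets its own thread.

#include <pthread.h>
#include <stdio.h>

/* Hypothetical subsystem loops; in a real engine these would be the
 * rendering and simulation modules. */
void *render_loop(void *arg)     { puts("render running");     return NULL; }
void *simulation_loop(void *arg) { puts("simulation running"); return NULL; }

int main(void) {
    pthread_t render, sim;
    /* Coarse-grained threading: one thread per major subsystem. */
    pthread_create(&render, NULL, render_loop, NULL);
    pthread_create(&sim, NULL, simulation_loop, NULL);
    pthread_join(render, NULL);
    pthread_join(sim, NULL);
    return 0;
}

Fine-grained threading would instead split similar tasks (for example, loop iterations) across many threads, and the hybrid approach mixes the two selectively.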
Hardware Performance
Nguyễn Tiến Hưng
Four general organizations for multicore systems:
(a) Dedicated L1 cache
(b) Dedicated L2 cache
(c) Shared L2 cache
(d) Shared L3 cache
The use of a shared L2 cache on the chip has several advantages over exclusive reliance on dedicated caches:
Constructive interference can reduce overall miss rates
Data shared by multiple cores is not replicated at the shared cache level
Threads that have less locality can employ more cache
Interprocessor communication is easy to implement, via shared memory locations (see the sketch after this list)
Confining the cache coherency problem to the L1 cache level may provide some additional performance advantage
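A minimal sketch (ours, not from the slides) of communication through a shared memory location, using C11 atomics; with a shared L2, both cores can hit the same line in cache:

#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>

/* The shared memory location both threads communicate through. */
static atomic_int flag = 0;

void *producer(void *arg) {
    atomic_store(&flag, 42);            /* core A writes the shared location   */
    return NULL;
}

void *consumer(void *arg) {
    while (atomic_load(&flag) == 0)     /* core B spins until it sees the write */
        ;
    printf("received %d\n", atomic_load(&flag));
    return NULL;
}

int main(void) {
    pthread_t p, c;
    pthread_create(&c, NULL, consumer, NULL);
    pthread_create(&p, NULL, producer, NULL);
    pthread_join(p, NULL);
    pthread_join(c, NULL);
    return 0;
}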
INTEL x86 MULTICORE ORGANIZATION
Phạm Huy Hoàng
Intel Core Duo
First introduced in 2006
Implements two x86 superscalar processors with a shared L2 cache; each core has its own dedicated L1 cache: a 32-kB instruction cache and a 32-kB data cache
The 2-MB L2 cache logic allows for dynamic allocation of cache space based on current core needs, so that one core can be assigned up to 100% of the L2 cache:
+ MESI (Modified, Exclusive, Shared, Invalid) support for L1 caches
+ Extended to support multiple Core Duo chips in a symmetric multiprocessor (SMP)
Each core has an independent thermal control unit. It is designed to manage chip heat dissipation to maximize processor performance
The Advanced Programmable Interrupt Controller (APIC) performs a number of functions:
Provides interprocessor interrupts, which allow any processor to interrupt any other processor or set of processors
Accepts I/O interrupts and routes these to the appropriate core
Each APIC includes a timer, which can be set by the OS to generate an interrupt to the local core
The power management logic is responsible for reducing power consumption when possible
+ In essence, the power management logic monitors thermal conditions and CPU activity and adjusts voltage levels and power consumption appropriately
The bus interface connects to the external bus, known as the Front Side Bus, which connects to main memory, I/O controllers, and other processor chips
The Intel Core i7-990X
Introduced in November of 2008
Implements six x86 simultaneous multithreading (SMT) processors, each with a dedicated L2 cache and with a shared L3 cache
The Core i7-990X chip supports two forms of external communication to other chips: the DDR3 memory controller and the QuickPath Interconnect
+ The DDR3 memory controller brings the memory controller for the DDR main memory onto the chip. The interface supports three channels that are 8 bytes wide for a total bus width of 192 bits, for an aggregate data rate of up to 32 GB/s
+ The QuickPath Interconnect enables high-speed communications among connected processor chips. The QPI link operates at 6.4 GT/s (gigatransfers per second).
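The stated bandwidth checks out (our arithmetic, assuming DDR3-1333 at 1333 MT/s): 3 channels × 8 bytes = 24 bytes per transfer, i.e. 192 bits of bus width, and 24 bytes × 1333 MT/s ≈ 32 GB/s. Similarly for QPI, assuming 16 bits of payload per transfer, 6.4 GT/s × 2 bytes = 12.8 GB/s in each direction per link.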
Cache Latency Comparison

CPU           Clock Frequency   L1 Cache   L2 Cache    L3 Cache
Core 2 Quad   2.66 GHz          3 cycles   15 cycles   -
Core i7       2.66 GHz          4 cycles   11 cycles   39 cycles
ARM11 MPCORE
Nguyễn Tiến Hưng
The ARM11 MPCore is a multicore product based on the ARM11 processor family
The ARM11 MPCore can be configured with up to four processors, each with its own L1 instruction and data caches, per chip
Distributed interrupt controller (DIC): Handles interrupt detection and interrupt prioritization. The DIC distributes interrupts to individual processors
Timer: Each CPU has its own private timer that can generate interrupts
Watchdog: Issues warning alerts in the event of software failures
CPU interface: Handles interrupt acknowledgment, interrupt masking, and interrupt completion acknowledgment
CPU: A single ARM11 processor. Individual CPUs are referred to as MP11 CPUs
Vector floating-point (VFP) unit: A coprocessor that implements floating-point operations in hardware
L1 cache: Each CPU has its own dedicated L1 data cache and L1 instruction cache
Snoop control unit (SCU): Responsible for maintaining coherency among L1 data caches
Interrupt Handling
The Distributed Interrupt Controller (DIC) collates interrupts from a large number of sources. It provides:
• Distribution of the interrupts to the target MP11 CPUs
• Masking of interrupts
• Tracking the status of interrupts
• Prioritization of the interrupts
• Generation of interrupts by software
The DIC is designed to satisfy two functional requirements:
• Provide a means of routing an interrupt request to a single CPU or multiple CPUs, as required
• Provide a means of interprocessor communication so that a thread on one CPU can cause activity by a thread on another CPU
The DIC can route an interrupt to one or more CPUs in the following three ways:
• An interrupt can be directed to a specific processor only
• An interrupt can be directed to a defined group of processors. The MPCore views the first processor to accept the interrupt, typically the least loaded, as being best positioned to handle the interrupt
• An interrupt can be directed to all processors
The DIC is configurable to support between 0 and 255 hardware interrupt inputs