DIGITAL TECH PRESENTATION
Pham Khanh
Created on July 12, 2022
Multicore Computer
PRESENTATION by Group 8
Multicore Processor
Hardware Performance
Phạm Minh Khánh
Increase in Parallelism and Complexity
[Diagram: a pipeline with Fetch, Decode, Execute, and Store stages]
Pipelining
Pipelining is a technique of decomposing a sequential process into sub-processes, each executed in a dedicated stage that operates concurrently with the other stages
[Diagram: instructions flowing through the Fetch, Decode, and Execute stages over time; a processor with the same complexity can be implemented as a pipeline processor]
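As a rough aside (a standard textbook estimate, not stated on the slides): if n instructions flow through a k-stage pipeline, the first instruction takes k cycles and each subsequent one completes in the next cycle, so

Speedup = nk / (k + n - 1)

which approaches k, the number of stages, as n grows large.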
Superscalar
Multiple pipelines are constructed by replicating execution resources. This allows multiple instructions to be executed in parallel pipelines at the same time, as long as hazards are avoided
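A minimal illustration (ours, not from the slides) of what a hazard looks like at the source level; whether any of this actually issues in parallel is decided by the hardware:

int superscalar_demo(int a, int b, int c, int d) {
    int x = a + b;   /* independent of the next statement...                */
    int y = c + d;   /* ...so a superscalar core may issue both at once     */
    int u = x + y;   /* read-after-write hazard: u needs x and y first,     */
    return u + a;    /* so these cannot execute until the results are ready */
}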
Simultaneous multithreading (SMT)
SMT is a technique in which a CPU splits each of its physical cores into virtual cores, known as hardware threads. This is done to increase performance by allowing each core to run two instruction streams at once.
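A minimal sketch (ours, not from the slides) of how those virtual cores appear to software; sysconf(_SC_NPROCESSORS_ONLN) is a widely supported POSIX extension that counts logical processors:

#include <stdio.h>
#include <unistd.h>

/* On an SMT-enabled CPU the OS sees one logical processor per hardware
 * thread, so a quad-core chip with 2-way SMT typically reports 8 here. */
int main(void) {
    long logical = sysconf(_SC_NPROCESSORS_ONLN);
    printf("Logical processors (hardware threads): %ld\n", logical);
    return 0;
}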
Multicore processor
[Diagram: a CPU containing Core 1 through Core n, each core (superscalar or SMT) with its own dedicated L1-I and L1-D caches, all sharing an L2 cache]
Power Consumption
[Chart: power density in Watts/cm2 for logic and memory across process generations (0.25, 0.18, 0.13, 0.1 µm), on a log scale from 10 to 100]
How to use all those logic transistors?
Pollack’s rule
Performance increase is roughly proportional to the square root of the increase in complexity
In other words: if the transistor logic is doubled (x2), then performance increases by about 40%
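The 40% figure follows directly from the square root (our arithmetic): performance ∝ √complexity, so doubling the transistor logic gives √2 ≈ 1.41, i.e. roughly a 40% performance increase.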
Software Performance Issues
Phạm Minh Khánh
Software on Multicore
Amdahl's Law
Used to calculate how much a computation can be sped up by running part of the program in parallel
[Diagram: a program divided into a part which cannot be parallelized and a part which can be parallelized]
Let's say:
T = total time of serial execution
B = total time of the non-parallelizable part
T - B = total time of the parallelizable part
N = the number of threads or CPUs
Note: normalize T = 1
More threads or CPUs = faster execution time
Speedup = Original Execution Time / Execution Time after Enhancement

With T normalized to 1, the execution time after enhancement is B + (1 - B)/N, so:

Speedup = 1 / ( B + (1 - B)/N )
Example: the execution time of half of the program can be accelerated by a factor of 2. What is the overall program speedup?

With B = 0.5 and N = 2:

Speedup = 1 / ( 0.5 + (1 - 0.5)/2 ) = 1 / 0.75 ≈ 1.33
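A minimal sketch (ours, not from the slides) that evaluates the same formula:

#include <stdio.h>

/* Amdahl's law with T normalized to 1:
 * speedup = 1 / (B + (1 - B) / N) */
double amdahl_speedup(double b, double n) {
    return 1.0 / (b + (1.0 - b) / n);
}

int main(void) {
    /* The worked example from the slides: B = 0.5, N = 2 */
    printf("speedup = %.2f\n", amdahl_speedup(0.5, 2.0));  /* prints 1.33 */
    return 0;
}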
As the number of processors increases, the amount of time required for the parallel portion of each program decreases, but the serial portion of each program stays the same
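Taking the limit makes the ceiling explicit (a standard consequence of the formula, not stated on the slides): as N → ∞, Speedup → 1/B. With B = 0.5, the overall speedup can never exceed 2 no matter how many processors are added.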
Examples of applications that benefit directly from multicore:
Multithreaded native applications
Multiprocess applications
Java applications
Multi-instance applications
From Valve’s perspective, threading granularity options are defined as follows (see the sketch after this list):
Coarse-grained threading
Fine-grained threading
Hybrid threading
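A minimal sketch of coarse-grained threading, assuming hypothetical subsystem loops render_loop and simulation_loop (our names, for illustration): each major module gets its own thread.

#include <pthread.h>
#include <stdio.h>

/* Hypothetical subsystem loops; in a real engine these would be the
 * rendering and simulation modules. */
void *render_loop(void *arg)     { puts("render running");     return NULL; }
void *simulation_loop(void *arg) { puts("simulation running"); return NULL; }

int main(void) {
    pthread_t render, sim;
    /* Coarse-grained threading: one thread per major subsystem. */
    pthread_create(&render, NULL, render_loop, NULL);
    pthread_create(&sim, NULL, simulation_loop, NULL);
    pthread_join(render, NULL);
    pthread_join(sim, NULL);
    return 0;
}

Fine-grained threading would instead split similar tasks (for example, loop iterations) across many threads, and the hybrid approach mixes the two selectively.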
Hardware Performance
Nguyễn Tiến Hưng
Four general organizations for multicore systems:
(a) Dedicated L1 cache
(b) Dedicated L2 cache
(c) Shared L2 cache
(d) Shared L3 cache
The use of a shared L2 cache on the chip has several advantages over exclusive reliance on dedicated caches:
Constructive interference can reduce overall miss rates
Data shared by multiple cores is not replicated at the shared cache level
Threads that have less locality can employ more cache
Interprocessor communication is easy to implement, via shared memory locations (see the sketch after this list)
Confining the cache coherency problem to the L1 cache level may provide some additional performance advantage
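A minimal sketch (ours, not from the slides) of communication through a shared memory location, using C11 atomics; with a shared L2, both cores can hit the same line in cache:

#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>

/* The shared memory location both threads communicate through. */
static atomic_int flag = 0;

void *producer(void *arg) {
    atomic_store(&flag, 42);            /* core A writes the shared location   */
    return NULL;
}

void *consumer(void *arg) {
    while (atomic_load(&flag) == 0)     /* core B spins until it sees the write */
        ;
    printf("received %d\n", atomic_load(&flag));
    return NULL;
}

int main(void) {
    pthread_t p, c;
    pthread_create(&c, NULL, consumer, NULL);
    pthread_create(&p, NULL, producer, NULL);
    pthread_join(p, NULL);
    pthread_join(c, NULL);
    return 0;
}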
INTEL x86 MULTICORE ORGANIZATION
Phạm Huy Hoàng
Intel Core Duo
First introduced in 2006
Implements two x86 superscalar processors with a shared L2 cache; each core has its own dedicated L1 cache: a 32-kB instruction cache and a 32-kB data cache
The 2-MB L2 cache logic allows for dynamic allocation of cache space based on current core needs, so that one core can be assigned up to 100% of the L2 cache:
+ MESI (Modified, Exclusive, Shared, Invalid) support for L1 caches
+ Extended to support multiple Core Duo chips in a symmetric multiprocessor (SMP)
Each core has an independent thermal control unit. It is designed to manage chip heat dissipation to maximize processor performance
The Advanced Programmable Interrupt Controller (APIC) performs a number of functions:
Provides interprocessor interrupts, which allow any processor to interrupt any other processor or set of processors
Accepts I/O interrupts and routes these to the appropriate core
Each APIC includes a timer, which can be set by the OS to generate an interrupt to the local core
The power management logic is responsible for reducing power consumption when possible
+ In essence, the power management logic monitors thermal conditions and CPU activity and adjusts voltage levels and power consumption appropriately
The bus interface connects to the external bus, known as the Front Side Bus, which connects to main memory, I/O controllers, and other processor chips
The Intel Core i7-990X
Introduced in November of 2008
Implements six x86 simultaneous multithreading (SMT) processors, each with a dedicated L2 cache and with a shared L3 cache
The Core i7-990X chip supports two forms of external communication to other chips: the DDR3 memory controller and the QuickPath Interconnect
+ The DDR3 memory controller brings the memory controller for the DDR main memory onto the chip. The interface supports three channels that are 8 bytes wide for a total bus width of 192 bits, for an aggregate data rate of up to 32 GB/s
+ The QuickPath Interconnect enables high-speed communications among connected processor chips. The QPI link operates at 6.4 GT/s (gigatransfers per second).
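The stated bandwidth checks out (our arithmetic, assuming DDR3-1333 at 1333 MT/s): 3 channels × 8 bytes = 24 bytes per transfer, i.e. 192 bits of bus width, and 24 bytes × 1333 MT/s ≈ 32 GB/s. Similarly for QPI, assuming 16 bits of payload per transfer, 6.4 GT/s × 2 bytes = 12.8 GB/s in each direction per link.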
Cache Latency Comparison

CPU           Clock Frequency   L1 Cache   L2 Cache    L3 Cache
Core 2 Quad   2.66 GHz          3 cycles   15 cycles   -
Core i7       2.66 GHz          4 cycles   11 cycles   39 cycles
ARM11 MPCORE
Nguyễn Tiến Hưng
The ARM11 MPCore is a multicore product based on the ARM11 processor family
The ARM11 MPCore can be configured with up to four processors, each with its own L1 instruction and data caches, per chip
Distributed interrupt controller (DIC): Handles interrupt detection and interrupt prioritization. The DIC distributes interrupts to individual processors
Timer: Each CPU has its own private timer that can generate interrupts
Watchdog: Issues warning alerts in the event of software failures
CPU interface: Handles interrupt acknowledgment, interrupt masking, and interrupt completion acknowledgment
CPU: A single ARM11 processor. Individual CPUs are referred to as MP11 CPUs
Vector floating-point (VFP) unit: A coprocessor that implements floating-point operations in hardware
L1 cache: Each CPU has its own dedicated L1 data cache and L1 instruction cache
Snoop control unit (SCU): Responsible for maintaining coherency among L1 data caches
Interrupt Handling
The Distributed Interrupt Controller (DIC) collates interrupts from a large number of sources. It provides:
• Distribution of the interrupts to the target MP11 CPUs
• Masking of interrupts
• Tracking the status of interrupts
• Prioritization of the interrupts
• Generation of interrupts by software
The DIC is designed to satisfy two functional requirements:
• Provide a means of routing an interrupt request to a single CPU or multiple CPUs, as required
• Provide a means of interprocessor communication so that a thread on one CPU can cause activity by a thread on another CPU
The DIC can route an interrupt to one or more CPUs in the following three ways:
• An interrupt can be directed to a specific processor only
• An interrupt can be directed to a defined group of processors. The MPCore views the first processor to accept the interrupt, typically the least loaded, as being best positioned to handle the interrupt
• An interrupt can be directed to all processors
The DIC is configurable to support between 0 and 255 hardware interrupt inputs