Lecture: Parallel computing & Distributed systems - Chapter 2: Parallel computer architectures

Computer Engineering – CSE – HCMUT
Parallel computing & Distributed systems
Chapter 2: Parallel Computer Architectures
Adapted from Prof. Thoai Nam/HCMUT
Parallel and Distributed Computing (c) Cuong Pham-Quoc/HCMUT

Outline
• Flynn’s Taxonomy
• Classification of Parallel Computers Based on Architectures

Flynn’s Taxonomy
• Based on notions of instruction and data streams
– SISD (a Single Instruction stream, a Single Data stream)
– SIMD (a Single Instruction stream, Multiple Data streams)
– MISD (Multiple Instruction streams, a Single Data stream)
– MIMD (Multiple Instruction streams, Multiple Data streams)
• Popularity
– MIMD > SIMD > MISD

SISD
• Conventional sequential machines
[Figure: SISD architecture. IS: Instruction Stream, DS: Data Stream, CU: Control Unit, PU: Processing Unit, MU: Memory Unit]

SIMD
• Vector computers, processor arrays
• Special-purpose computations
[Figure: SIMD architecture with distributed memory – one CU broadcasts the instruction stream to PE1..PEn, each with a local memory LM1..LMn; the program and data sets are loaded from the host. PE: Processing Element, LM: Local Memory]

SIMD array / SIMD for convolution / SIMD for matrix multiplication
[Figure-only slides]

MISD
• Systolic arrays
• Special-purpose computations
[Figure: MISD architecture (the systolic array) – a single data stream flows through PU1..PUn, each driven by its own control unit CU1..CUn from a common program/data memory]

MIMD
• General-purpose parallel computers
[Figure: MIMD architecture with shared memory – control units CU1..CUn each issue an instruction stream to their PU; all PUs access a shared memory]
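The SISD/SIMD contrast above can be sketched as a toy step-count model in Python (illustrative only; `sisd_add`, `simd_add`, and `n_pe` are invented names, and real SIMD hardware executes in lockstep rather than via a list comprehension):

```python
from math import ceil

def sisd_add(a, b):
    # SISD: one instruction stream, one data stream -- each
    # element-wise addition is a separate sequential step.
    result, steps = [], 0
    for x, y in zip(a, b):
        result.append(x + y)
        steps += 1                        # one instruction issue per element
    return result, steps

def simd_add(a, b, n_pe):
    # SIMD (toy model): the control unit broadcasts one ADD instruction;
    # up to n_pe processing elements apply it to their local pairs in
    # lockstep, so n elements need ceil(n / n_pe) broadcast steps.
    result = [x + y for x, y in zip(a, b)]
    steps = ceil(len(a) / n_pe)
    return result, steps

a, b = [1, 2, 3, 4], [10, 20, 30, 40]
assert sisd_add(a, b) == ([11, 22, 33, 44], 4)       # 4 sequential steps
assert simd_add(a, b, n_pe=4) == ([11, 22, 33, 44], 1)  # 1 lockstep step
```

With fewer PEs than data elements, the SIMD machine iterates: 6 elements on 4 PEs take 2 broadcast steps.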
Classification based on Architecture
• Pipelined Computers
• Dataflow Architectures
• Data Parallel Systems
• Multiprocessors
• Multicomputers

Pipeline computers
• Instructions are divided into a number of steps (segments, stages)
• At the same time, several instructions can be loaded into the machine and executed in different steps

Pipeline computers
• IF: instruction fetch
• ID: instruction decode and register fetch
• EX: execution and effective address calculation
• MEM: memory access
• WB: write back
[Figure: instructions i to i+4 overlapped across cycles 1–9; each occupies IF, ID, EX, MEM, WB in successive cycles]

Hazards
• Situations that prevent starting the next instruction in the next cycle
• Structural hazards
– A required resource is busy
• Data hazards
– Need to wait for a previous instruction to complete its data read/write
• Control hazards
– Deciding on a control action depends on a previous instruction

Pipeline computer
• Five stages, one step per stage
– IF: Instruction fetch from memory
– ID: Instruction decode & register read
– EX: Execute operation or calculate address
– MEM: Access memory operand
– WB: Write result back to register

Structural hazard
• Conflict for use of a resource
• In a MIPS pipeline with a single memory
– Load/store requires data access
– Instruction fetch would have to stall for that cycle
• Would cause a pipeline “bubble”
• Hence, pipelined datapaths require separate instruction/data memories
– Or separate instruction/data caches

Data hazards
• An instruction depends on completion of data access by a previous instruction
    add $s0, $t0, $t1
    sub $t2, $s0, $t3

Forwarding (aka Bypassing)
• Use a result as soon as it is computed
– Don’t wait for it to be stored in a register
– Requires extra connections in the datapath

Code scheduling
• Reorder code to avoid using a load result in the next instruction
• C code: A = B + E; C = B + F;
• Unscheduled (each add stalls on the preceding load’s result):
    lw  $t1, 0($t0)
    lw  $t2, 4($t0)
    add $t3, $t1, $t2
    sw  $t3, 12($t0)
    lw  $t4, 8($t0)
    add $t5, $t1, $t4
    sw  $t5, 16($t0)
• Scheduled (the third load is hoisted so no add uses a load result in the very next instruction):
    lw  $t1, 0($t0)
    lw  $t2, 4($t0)
    lw  $t4, 8($t0)
    add $t3, $t1, $t2
    sw  $t3, 12($t0)
    add $t5, $t1, $t4
    sw  $t5, 16($t0)

Example of branch prediction
[Figure: pipeline timing when the prediction is correct vs. incorrect]

Realistic branch prediction
• Static branch prediction
– Based on typical branch behavior
– Example: loop and if-statement branches
• Predict backward branches taken
• Predict forward branches not taken
• Dynamic branch prediction
– Hardware measures actual branch behavior (e.g., records the recent history of each branch)
– Assume future behavior will continue the trend
• When wrong, stall while re-fetching, and update history

Control hazards
• A branch determines the flow of control
– Fetching the next instruction depends on the branch outcome
– The pipeline can’t always fetch the correct instruction (it is still working on the ID stage of the branch)
• Solution
– Compare registers and compute the target early in the pipeline
– Add hardware to do it in the ID stage

Branch prediction
• Longer pipelines can’t readily determine the branch outcome early
– The stall penalty becomes unacceptable
• Predict the outcome of the branch
– Only stall if the prediction is wrong
• In the MIPS pipeline
– Can predict branches not taken
– Fetch the instruction after the branch, with no delay

Dataflow architecture
• Data-driven model
– A program is represented as a directed acyclic graph in which a node represents an instruction and an edge represents the data dependency between the connected nodes
– Firing rule: a node can be scheduled for execution if and only if its input data become valid for consumption
• Dataflow languages
– Id, SISAL, Silage, LISP, ...
– Single assignment, applicative (functional) languages
– Explicit parallelism

Dataflow graph – DFG
• z = (a + b) * c
[Figure: the dataflow representation of the arithmetic expression – a and b feed a + node, whose result and c feed a * node producing z]

Dataflow computers
• Execution of instructions is driven by data availability
– What is the difference between this and normal (control-flow) computers?
• Advantages
– Very high potential for parallelism
– High throughput
– Free from side effects
• Disadvantages
– Time lost waiting for unneeded arguments
– High control overhead
– Difficulty in manipulating data structures

Dataflow computers: generic architecture
[Figure-only slide]

Dataflow representation
    input d, e, f
    c0 = 0
    for i from 1 to 4 do
    begin
        ai := di / ei
        bi := ai * fi
        ci := bi + ci-1
    end
    output a, b, c
[Figure: the corresponding dataflow graph – four independent chains di / ei → * fi → + ci-1, with the additions c1..c4 forming a sequential chain from c0]

Execution: control-flow machine
• Assume all the external inputs are available before entering the do loop
– + : 1 cycle, * : 2 cycles, / : 3 cycles
• How long does it take to execute this program on a dataflow computer with 4 processors?
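The question can be checked with a small earliest-finish-time sketch in Python (a toy model; `sequential_cycles` and `dataflow_cycles` are invented names, using the per-operation latencies from the slide):

```python
# Latencies from the slide: + takes 1 cycle, * takes 2, / takes 3.
LAT = {'+': 1, '*': 2, '/': 3}

def sequential_cycles(n=4):
    # Control-flow execution: the loop body (/, *, +) runs to
    # completion for each i before the next iteration starts.
    return n * (LAT['/'] + LAT['*'] + LAT['+'])

def dataflow_cycles(n=4):
    # Data-driven execution: an operation fires as soon as its inputs
    # are ready.  With n processors, all divisions a_i finish together,
    # then all multiplications b_i; only the additions
    # c_i = b_i + c_{i-1} form a sequential chain.
    finish_a = LAT['/']                # every a_i done at cycle 3
    finish_b = finish_a + LAT['*']     # every b_i done at cycle 5
    finish_c = 0                       # c_0 is an external input
    for _ in range(n):
        finish_c = max(finish_b, finish_c) + LAT['+']
    return finish_c

assert sequential_cycles() == 24
assert dataflow_cycles() == 9          # speedup 24/9, about 2.67x
```

The chain of additions is the critical path: after cycle 5 the machine can only retire one c_i per cycle, which is why four processors finish in 9 cycles rather than 24/4 = 6.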
[Figure: sequential execution on a uniprocessor takes 24 cycles – a1, b1, c1, a2, b2, c2, ..., a4, b4, c4 in order]

Execution: dataflow machine
• Data-driven execution on a 4-processor dataflow computer takes 9 cycles
– Speedup S = 24/9 ≈ 2.67×
• Can we further reduce the execution time of this program?
[Figure: a1..a4 and b1..b4 execute in parallel; the chain c1 → c2 → c3 → c4 remains sequential]

Data parallelism
• Programming model
– Operations performed in parallel on each element of a data structure
– Logically a single thread of control, performing sequential or parallel steps
– Conceptually, a processor is associated with each data element

Data parallelism
• SIMD architectural model
– Array of many simple, cheap processors, each with little memory
• Processors don’t sequence through instructions
– Attached to a control processor that issues instructions
– Specialized and general communication, cheap global synchronization

Vector processor
• Instruction set includes operations on vectors as well as scalars
• 2 types of vector computers
– Processor arrays
– Pipelined vector processors

Processor array
• A sequential computer connected with a set of identical processing elements simultaneously performing the same operation on different data
[Figure: a front-end computer (CPU, program and data memory, I/O processor) drives a processor array of processing elements, each with its own data memory, linked by an interconnection network]

Pipeline vector processor
• Streams vectors from memory to the CPU
• Uses pipelined arithmetic units to manipulate data
– e.g., Cray-1, Cyber-205

Multiprocessor
• Consists of many fully programmable processors, each capable of executing its own program
• Shared address space architecture
• Classified into 2 types
– Uniform Memory Access (UMA) multiprocessors
– Non-Uniform Memory Access (NUMA) multiprocessors

UMA multiprocessor
• Uses a central switching mechanism to reach a centralized shared memory
• All processors have equal access time to global memory
• Tightly coupled system
• Problem: cache consistency
[Figure: processors P1..Pn with caches C1..Cn reach the memory banks and I/O through a switching mechanism]

UMA multiprocessor
• Crossbar switching mechanism
[Figure: memory modules, processor-cache pairs, and I/O joined by a crossbar]

UMA multiprocessor
• Shared bus mechanism
[Figure: memory modules, processor-cache pairs, and I/O joined by a shared bus]

NUMA multiprocessor
• Distributed shared memory formed by the local memories of all processors
• Memory access time depends on whether the location is local to the processor
• Caching shared (particularly nonlocal) data?
[Figure: nodes of processor, cache, and memory connected by a network]

Current types of multiprocessors
• PVP (Parallel Vector Processor)
– A small number of proprietary vector processors connected by a high-bandwidth crossbar switch
• SMP (Symmetric Multiprocessor)
– A small number of COTS microprocessors connected by a high-speed bus or crossbar switch
• DSM (Distributed Shared Memory)
– Similar to SMP, but the memory is physically distributed among nodes
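The UMA/NUMA distinction can be illustrated with a toy latency model in Python (the latency numbers, function names, and address-to-node mapping are invented for illustration, not taken from the slides):

```python
# Hypothetical access latencies in cycles -- illustrative only.
UMA_LATENCY = 100
NUMA_LOCAL, NUMA_REMOTE = 10, 300

def uma_access(processor, address, n_procs=4):
    # UMA: the central switching mechanism gives every processor
    # equal access time to global memory, whatever the address.
    return UMA_LATENCY

def numa_access(processor, address, n_procs=4, mem_per_node=1024):
    # NUMA: the shared address space is distributed across nodes;
    # an address in this processor's local memory is cheap, an
    # address homed on another node costs a network round trip.
    home_node = (address // mem_per_node) % n_procs
    return NUMA_LOCAL if home_node == processor else NUMA_REMOTE

# Processor 0 touching its own node's memory vs. node 2's memory:
assert uma_access(0, 0) == uma_access(0, 2048) == 100   # uniform
assert numa_access(0, 0) == 10        # local access
assert numa_access(0, 2048) == 300    # remote access (home node 2)
```

The large local/remote gap is exactly why caching nonlocal data (and keeping those caches consistent via a directory) matters on NUMA machines.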
Parallel vector processor
[Figure: vector processors (VP) connected to shared memory modules (SM) by a crossbar switch]

Symmetric multiprocessor
[Figure: microprocessor-and-cache nodes (P/C) connected to shared memory modules (SM) by a bus or crossbar switch]

Distributed shared memory
[Figure: nodes of microprocessor & cache (P/C), local memory (LM), cache directory (DIR), and network interface circuitry (NIC) on a memory bus (MB), joined by a custom-designed network]

Multicomputers
• Consists of many processors, each with its own memory
• No shared memory
• Processors interact via message passing → loosely coupled system
[Figure: microprocessor-and-cache nodes (P/C) with private memory (M) joined by a message-passing interconnection network]

Current types of multicomputers
• MPP (Massively Parallel Processing)
– Total number of processors > 1000
• Cluster
– Each node in the system has fewer than 16 processors
• Constellation
– Each node in the system has more than 16 processors

Massively parallel processing
[Figure: P/C nodes with local memory (LM) and NIC on memory buses (MB), joined by a custom-designed network]

Clusters
[Figure: a commodity network (Ethernet, ATM, Myrinet, InfiniBand (VIA)) connecting nodes of P/C, memory (M), local disk (LD), I/O bus (IOB), bridge, and NIC]

Constellations
[Figure: each node contains 16 or more microprocessors (P/C) sharing memory (SM) through a hub, with an I/O controller (IOC), local disk (LD), and NIC; nodes are joined by a custom or commodity network]
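Since multicomputer nodes share no memory, all cooperation happens through explicit messages. A minimal sketch in Python, with two threads as "nodes" and a queue standing in for the message-passing interconnection network (the names and two-node layout are invented for illustration; a real system would use something like MPI over the networks listed above):

```python
import threading
import queue

def node(local_data, inbox, peer_outbox, is_root, result):
    # Each node computes only on its own private data...
    partial = sum(local_data)
    if is_root:
        # ...and the root combines its partial sum with the one it
        # receives as a message from its peer (inbox.get() blocks
        # until the message arrives).
        result.append(partial + inbox.get())
    else:
        # Non-root node sends its partial sum over the "network".
        peer_outbox.put(partial)

to_root = queue.Queue()   # stand-in for the interconnection network
result = []
data = list(range(100))

t0 = threading.Thread(target=node, args=(data[:50], to_root, None, True, result))
t1 = threading.Thread(target=node, args=(data[50:], None, to_root, False, None))
t0.start(); t1.start()
t0.join(); t1.join()

assert result == [4950]   # sum(range(100)) reassembled via messages
```

The blocking receive is the point: with no shared address space, the only way data moves between nodes is by send/receive pairs, which is what makes multicomputers loosely coupled.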
