Lecture slides: Parallel Computing & Distributed Systems - Chapter 4: Parallel matrix processing

Introduction

• OpenMP is one of the most common parallel programming models in use today
• OpenMP is an API for writing multithreaded applications
  – A set of compiler directives (#pragma) and library routines for parallel application programmers
  – Available for C/C++ and Fortran

Parallel and Distributed Computing (c) Cuong Pham-Quoc/HCMUT

Data sharing: firstprivate clause
• Variables are initialized from the shared variable
• Each thread gets its own copy of incr with an initial value of 0

    incr = 0;
    #pragma omp parallel for firstprivate(incr)
    for (i = 0; i <= MAX; i++) {
      if ((i % 2) == 0) incr++;
      A[i] = incr;
    }

Data sharing: lastprivate clause
• Variables update the shared variable using the value from the last iteration

    void sq2(int n, double *lastterm) {
      double x;
      int i;
      #pragma omp parallel for lastprivate(x)
      for (i = 0; i < n; i++) {
        x = a[i]*a[i] + b[i]*b[i];
        b[i] = sqrt(x);
      }
      *lastterm = x;
    }

• “x” has the value it held for the “last sequential” iteration (i.e., for i = n-1)

Data sharing: test
• Consider this example of PRIVATE and FIRSTPRIVATE:

    variables: A = 1, B = 1, C = 1
    #pragma omp parallel private(B) firstprivate(C)

• Are A, B, C local to each thread or shared inside the parallel region?
• What are their initial values inside, and their values after, the parallel region?

Data sharing: default clause
• Note that the default storage attribute is default(shared) (so there is no need to use it)
  – Exception: #pragma omp task
• To change the default: default(private)
  – Each variable in the construct is made private as if specified in a private clause
  – Mostly saves typing
• default(none): no default for variables in static extent.
  Must list the storage attribute for each variable in the static extent

Default clause: example
• The following two fragments are equivalent:

    itotal = 1000
    #pragma omp parallel private(np, each)
    {
      np = omp_get_num_threads()
      each = itotal/np
    }

    itotal = 1000
    #pragma omp parallel default(private) shared(itotal)
    {
      np = omp_get_num_threads()
      each = itotal/np
    }

Exercise 5: Mandelbrot set area
• Mandelbrot set: the set of complex numbers c for which the function f_c(z) = z^2 + c does not diverge when iterated from z = 0

Exercise 5: Mandelbrot set area
• The supplied programs (mandel.c & mandel_serial.c) compute the area of a Mandelbrot set
  – mandel.c: parallel version
  – mandel_serial.c: serial version, the correct one
• The parallel program has been parallelized with OpenMP, but we were lazy and didn’t do it right
• Find and fix the errors (hint: the problem is with the data environment):
  – A wrong result is produced (different from the result generated by the serial version)

Reference

STREAMING SIMD EXTENSION (SSE)

SIMD Architecture
• SIMD = Single Instruction, Multiple Data
• A data-parallel architecture
• Applies the same instruction to many data items
  – Saves control logic
  – A related architecture is the vector architecture
  – SIMD and vector architectures offer high performance for vector/matrix-based operations

Vector operations
• Vector addition Z = X + Y

    for (i = 0; i < n; i++) z[i] = x[i] + y[i];

• Vector scaling Y = a*X

    for (i = 0; i < n; i++) y[i] = a*x[i];

• Dot product

    for (i = 0; i < n; i++) r += x[i]*y[i];

SISD vs. SIMD operations
• C = A + B

    for (i = 0; i < n; i++)
      c[i] = a[i] + b[i];

x86 architecture SIMD support
• Both current AMD and Intel x86 processors have ISA and microarchitecture support for SIMD operations
• ISA SIMD support
  – MMX, 3DNow!, SSE, SSE2, SSE3, SSE4, AVX
    • See the flags field of “cat /proc/cpuinfo” on a Linux-based machine
    • “sysctl -a | grep cpu.feat” on macOS
  – SSE (Streaming SIMD Extensions): a SIMD instruction set extension to the x86 architecture
    • Instructions for operating on multiple data items simultaneously (vector operations)
• Microarchitecture support
  – Many functional units
  – 8 128-bit vector registers: XMM0, XMM1, ..., XMM7

SSE programming
• Vector registers support three data types:
  – Integer (16 bytes, 8 shorts, 4 ints, 2 long long ints, 1 dqword)
  – Single-precision floating point (4 floats)
  – Double-precision floating point (2 doubles)

SSE programming in C/C++
• Map to intrinsics
  – An intrinsic is a function known by the compiler that directly maps to a sequence of one or more assembly language instructions. Intrinsic functions are inherently more efficient than called functions because no calling linkage is required.
• Intrinsics provide a C/C++ interface to processor-specific enhancements
• Supported by major compilers such as gcc

SSE intrinsics
• Header files to access the SSE intrinsics:
  – #include <mmintrin.h>  // MMX
  – #include <xmmintrin.h> // SSE
  – #include <emmintrin.h> // SSE2
  – #include <pmmintrin.h> // SSE3
  – #include <tmmintrin.h> // SSSE3
  – #include <smmintrin.h> // SSE4.1
• MMX/SSE/SSE2 are mostly supported
• SSE4 is not well supported
• When compiling, use -msse, -mmmx, -msse2 (machine-dependent code)
  – Some are enabled by default in gcc

SSE intrinsics
• Data types (each maps to an XMM register):
  – __m128: floats
  – __m128d: doubles
  – __m128i: integers
• Data movement and initialization:
  – _mm_load_ps, _mm_loadu_ps, _mm_load_pd, _mm_loadu_pd, etc.
  – _mm_store_ps, _mm_setzero_ps
• Arithmetic intrinsics:
  – _mm_add_ss, _mm_add_ps, _mm_add_pd, _mm_mul_pd
• More details: check MSDN from Microsoft
  https://msdn.microsoft.com/en-us/library/kcwz153a(v=vs.90).aspx

Example
• Check out ex1.c at https://www.dropbox.com/s/jf6lnc7df1wclru/ex1.c?dl=0
• Check out sapxy.c at https://www.dropbox.com/s/r1g1roydtspa4xv/sapxy.c?dl=0
• Please consult MSDN from Microsoft about the instructions used in those programs

SSE intrinsics
• Data alignment issues
  – Some intrinsics may require memory to be aligned to 16 bytes
  – They may not work when memory is not aligned
  – See sapxy1.c at https://www.dropbox.com/s/ik7xiyy8q1gu0w5/sapxy1.c?dl=0
• Writing a more generic SSE routine
  – Check memory alignment
  – The slow path may not get any performance benefit from SSE
  – See sapxy2.c at https://www.dropbox.com/s/tt4xznt5impan0v/sapxy2.c?dl=0

Summary
• Contemporary CPUs have SIMD support for vector operations
  – SSE is its programming interface
• SSE can be accessed from high-level languages through intrinsic functions
• SSE programming needs to be very careful about memory alignment
  – Both for correctness and for performance.
One more example
• Check out the division program at https://www.dropbox.com/s/k1ny1gv1pkogikj/division.zip?dl=0

References
• Intel® 64 and IA-32 Architectures Software Developer's Manuals (Combined volumes 1-4)
  – https://software.intel.com/sites/default/files/managed/39/c5/325462-sdm-vol-1-2abcd-3abcd.pdf
• MSDN library, Microsoft
  – https://msdn.microsoft.com/en-us/library/26td21ds(v=vs.90).aspx
