notes_yelick.txt
Jan 30, 9:00 a.m.
Programming Models for Parallel Computing
Katherine Yelick

http://titanium.cs.berkeley.edu/
http://upc.lbl.gov/

Not long ago, the future of parallel computing looked uncertain; there were several panels titled "Is parallel processing dead?"
Moore's Law is alive and well and is expected to continue for the next decade.
However, the clock-scaling bonanza has ended (no more speedups from frequency alone).
We must go to multicore processors.
-More cores at lower clock rates burn less power and run at lower temperatures.
-Instruction-Level Parallelism (ILP) benefits are declining.
(We've had parallelism on single chips, hidden from programmers.)
The chips had the hardware (over-engineered), but the software did not make use of it, so designers backed off.
-Yield problems
IBM Cell processors have about a 10% yield with all cores working: a blade system with all 8 SPEs working costs about $20k, while a PS3, whose Cell has only 7 SPEs enabled, costs $600.

Power Density Limits Serial Performance.
-Nowadays, chips give off roughly the same heat per unit area as a rocket nozzle.

The Revolution is Happening Now.
-Chip density is continuing to follow Moore's Law, but clock speed is not, and there is little to no ILP left to be found.

Why parallelism? (2007)
Multicore is all over the place; it is no longer just theory.
Will all programmers become performance programmers? Parallelism can "hide costs" from the user, e.g., by doing preprocessing work in parallel.
The cost of new features will have to be well hidden, since clock speeds aren't increasing the way they used to.
Big Open Questions:
1) What are the killer applications for multicore machines?
2) How should the chips be designed: multicore, manycore, heterogeneous?
3) How will they be programmed?
Intel has announced that it is looking at an 80-core processor.

We seem to be headed toward a petaflop machine in 2008 (data from top500.org).
There's a 6-8 year lag between the #1 fastest computer and the #500 fastest, which is what many people actually program on. Will petaflop machines be common by 2015?
Memory Hierarchy:
With explicit parallelism, performance becomes a software problem.
Off-chip latencies improve only slowly, by roughly 7% per year.

Predictions:
Parallelism will explode. Core counts will double at roughly the rate of Moore's Law. All Top500 machines will be petaflop machines by 2015.
Responsibility for performance will be placed on the software.
A parallel programming model will emerge to handle multicore programming and parallelism in general.


PGAS Languages

Parallel software is still an unsolved problem.
Most parallel programs are written using either message passing with an SPMD model (scientific applications; scales easily) or shared memory with threads in OpenMP, POSIX threads, or Java (non-scientific applications; easier to program, and lends itself easily to user interfaces).

Partitioned Global Address Space (PGAS) Languages:
They maintain control over locality, for performance, programmability, and flexibility.

PGAS: data is partitioned (designated) as global or local, and any thread/process may directly read/write data allocated by another.
The 3 current languages (UPC, Co-Array Fortran, Titanium) use an SPMD (Single Program, Multiple Data) execution model; 3 more are emerging: X10, Fortress, and Chapel.
Remote references have higher latency, so they should be used judiciously.
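
For illustration (my sketch, not from the talk): a small UPC example of the model, with a blocked shared array, cheap local accesses via upc_forall, and a single bulk one-sided get of a neighbor's block instead of many fine-grained remote reads. The array name, block size B, and neighbor choice are made up for the example.

#include <upc.h>
#include <stdio.h>

#define B 256                      /* elements per thread (illustrative) */

shared [B] double a[B*THREADS];    /* blocked: a[t*B .. t*B+B-1] has affinity to thread t */

int main(void) {
    double buf[B];                 /* private buffer */
    int i;

    /* Each thread initializes the elements it owns (cheap, local accesses). */
    upc_forall (i = 0; i < B*THREADS; i++; &a[i])
        a[i] = MYTHREAD + 0.001 * i;
    upc_barrier;

    /* Remote references are expensive, so fetch a neighbor's whole block
       in one bulk one-sided get instead of element by element. */
    int neighbor = (MYTHREAD + 1) % THREADS;
    upc_memget(buf, &a[neighbor * B], B * sizeof(double));

    if (MYTHREAD == 0)
        printf("first element of thread %d's block: %g\n", neighbor, buf[0]);
    return 0;
}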

PGAS commonalities:
Both private and shared data
Support for distributed data structures
One-sided shared-memory communication
Synchronization (global barriers, locks, memory fences)
Collective communication, I/O libraries, etc.
The 3 current languages are built to stay close to the language on which they are based. UPC is based on C, so it is low-level; Titanium is based on Java, so it is higher-level.
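
A minimal UPC sketch (mine, not from the talk) of the shared data and synchronization items listed above: a lock-protected update of a shared counter plus global barriers. The variable names are invented for the example.

#include <upc.h>
#include <stdio.h>

shared int counter;                      /* one shared instance, with affinity to thread 0 */

int main(void) {
    /* Collective allocation: every thread gets a pointer to the same lock. */
    upc_lock_t *lock = upc_all_lock_alloc();

    if (MYTHREAD == 0)
        counter = 0;
    upc_barrier;                         /* global barrier before the updates begin */

    upc_lock(lock);                      /* lock guards the read-modify-write */
    counter += 1;                        /* direct access to shared data, even if remote */
    upc_unlock(lock);

    upc_barrier;                         /* all updates are visible after the barrier */

    if (MYTHREAD == 0)
        printf("counter = %d, THREADS = %d\n", counter, THREADS);
    return 0;
}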

Private vs Shared Variables in UPC:
C variables and objects are allocated in the private memory space of each thread.
Shared scalar variables are allocated only once, in thread 0's space.
Shared arrays are spread across the threads (blocked or cyclic layouts; the default is cyclic).
Heap objects may be in either private or shared space. (See the sketch at the end of this section.)
Titanium has a higher-level array abstraction.
These models all assume a static thread count; the HPCS languages are looking at dynamic thread creation.
The UPC compiler we're working on is an extension of the existing Open64 compiler framework.
There's no JVM in Titanium; a lot of the web-oriented functionality built into Java was not appropriate for high-performance parallel computing.
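
A rough UPC sketch (mine, not from the talk) of the variable kinds described above: private variables, a shared scalar with affinity to thread 0, cyclic and blocked shared arrays, and private vs shared heap allocation. All names and sizes are illustrative.

#include <upc.h>
#include <stdlib.h>

int mine;                             /* ordinary C variable: one private copy per thread          */
shared int ours;                      /* shared scalar: allocated once, in thread 0's partition     */
shared double cyc[4*THREADS];         /* default (cyclic) layout: consecutive elements round-robin  */
shared [4] double blk[4*THREADS];     /* blocked layout: 4 consecutive elements per thread          */

int main(void) {
    /* Private heap object: ordinary malloc, visible only to this thread. */
    double *priv = malloc(16 * sizeof(double));

    /* Shared heap object with affinity to the calling thread; other threads
       can reach it through a pointer-to-shared. */
    shared [] double *shrd = (shared [] double *) upc_alloc(16 * sizeof(double));

    mine = MYTHREAD;                  /* each thread writes its own private copy   */
    if (MYTHREAD == 0)
        ours = THREADS;               /* only one instance of the shared scalar    */

    cyc[MYTHREAD] = 1.0;              /* element MYTHREAD is local to this thread  */
    blk[4*MYTHREAD] = 2.0;            /* first element of this thread's block      */

    upc_barrier;
    free(priv);
    upc_free(shrd);
    return 0;
}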

Are PGAS languages good for multicore machines?
-They work very well on shared memory.
-Current UPC and Titanium implementations use threads.
-OpenMP is substantial competition for PGAS on shared memory.
-It is unclear whether multicore processors will continue to have physical shared memory or move toward something like the Cell processor, with explicit local storage.


PGAS Languages on Clusters: One-sided vs Two-sided Communication.
One-sided:
-A put/get message can be handled directly by a network interface with RDMA support.
-Some networks will see a one-sided message and write it directly to memory without disturbing the CPU.
-InfiniBand and similar networks have support for one-sided messages; Ethernet currently does not.
Two-sided:
-The message must be matched with a receive to identify the memory address where the data goes.
-Match tables must be downloaded from the host to the interface.
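
To make the contrast concrete, a small UPC sketch (mine, not from the talk): each thread deposits a buffer directly into a neighbor's shared partition with a one-sided upc_memput. There is no matching receive, which is what lets an RDMA-capable NIC deliver the data without involving the target CPU. The array name and sizes are illustrative.

#include <upc.h>
#include <string.h>

#define N 512

/* Blocked so that slot[t*N .. t*N+N-1] lives in thread t's partition. */
shared [N] char slot[N*THREADS];

int main(void) {
    char msg[N];
    memset(msg, MYTHREAD, sizeof msg);

    /* One-sided put: the initiator names the destination address, so the data
       can be written straight into the target's memory. In a two-sided model
       the target would have to post a matching receive to say where it goes. */
    int target = (MYTHREAD + 1) % THREADS;
    upc_memput(&slot[target * N], msg, N);

    upc_barrier;    /* ensure all puts are complete and visible */

    /* slot[0..N-1] belongs to thread 0 and was written by thread THREADS-1. */
    if (MYTHREAD == 0 && slot[0] != (char)(THREADS - 1))
        return 1;
    return 0;
}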

One-sided vs Two-sided in Practice:
The half-power point (the message size at which half of peak bandwidth is reached) is about an order of magnitude lower (better) for GASNet than for MPI.

GASNet has better latency across a network.
GASNet bandwidth is at least as high as (comparable to) MPI's for large messages.
These numbers compare against MPI-1, not MPI-2, which does have a one-sided message implementation; that implementation is not as efficient on many machines, and most people use MPI-1, which is why MPI-1 was used for the comparison.
GASNet excels at mid-range message sizes, which is important for overlap, i.e., asynchronous algorithms.

Making PGAS Real: Applications and Portability
AMR: Adaptive Mesh Refinement code.
Titanium AMR is implemented entirely in Titanium and allows finer-grained communication, leading to a 10x reduction in lines of code.

It beats the Chombo code, the AMR package from LBNL.
The array abstraction in Titanium is not only used in AMR; it can be used across a wide range of applications.

In serial, Titanium is comparable to (within a few percent of) C++/Fortran.

Dense and Sparse Matrix Factorization:
As you break apart and factor a matrix, the dependent factors change on the fly; especially with sparse matrices, the dependencies change dramatically.
The UPC factorization uses a highly multithreaded style to mask latency and dependence delays.
Three levels of threads: static UPC threads, multithreaded BLAS, and user-level threads with explicit yield.
No dynamic load balancing, but lots of remote invocation.
The layout is fixed and tuned for block size.
There are many hard problems here: block-size tuning for both locality and granularity, task prioritization, etc.

The UPC code is significantly faster than ScaLAPACK, thanks to the multithreading that hides latency and dependence delays.

Most PGAS applications are numeric rather than symbolic computing.
These applications all require:
Complex, irregular shared data structures
The ability to communicate and share data asynchronously
-Many current implementations are built on one-sided communication
Fast, low-overhead communication/sharing
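
As a loose illustration of an irregular shared data structure in UPC (my sketch, not from the talk): each thread allocates a chunk in its own shared space and publishes a pointer to it in a shared directory, so any thread can reach any chunk with one-sided accesses. The names, chunk size, and layout are assumptions for the example.

#include <upc.h>
#include <stdio.h>

#define CHUNK 64

/* Shared directory of pointers-to-shared: entry t points at a chunk that
   lives entirely in thread t's partition (indefinite block size []). */
shared [] int * shared directory[THREADS];

int main(void) {
    int i;

    /* Each thread allocates its own chunk with affinity to itself and fills it. */
    directory[MYTHREAD] = (shared [] int *) upc_alloc(CHUNK * sizeof(int));
    for (i = 0; i < CHUNK; i++)
        directory[MYTHREAD][i] = MYTHREAD * 1000 + i;
    upc_barrier;

    /* Any thread can now follow the directory and read another thread's data
       with one-sided accesses, no matching receive required. */
    if (MYTHREAD == 0 && THREADS > 1)
        printf("element 3 of last thread's chunk: %d\n", directory[THREADS-1][3]);

    upc_barrier;
    upc_free(directory[MYTHREAD]);
    return 0;
}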

Titanium and UPC are quite portable.
-Beneath the implementation (which compiles to C) is a common communication layer (GASNet), also used by gcc/upc.
-Both run on most PCs, SMPs, clusters, and supercomputers.
-There are several compilers for Titanium and UPC.

PGAS languages can easily move between shared- and distributed-memory machines.
Many people are currently working on dynamic parallel environments (non-static thread counts).
They provide control over locality and SPMD parallelism.

Languages with exceptions are still a major headache; analysis is ongoing.
GASNet is the common communication framework beneath the PGAS languages, and it can be used as a framework for a parallel implementation of your own work.
Both UPC and Titanium are working to better support distinguishing communication within a node from communication between nodes.