Jan 30:  9:00 a.m.
Programming Models for Parallel Computing
Katherine Yelick

http://titanium.cs.berkeley.edu/
http://upc.lbl.gov/

Not long ago, it looked as if the future of parallel computing was uncertain.  Several panels were titled "Is parallel processing dead?"
But no: Moore's Law is alive and well, and is expected to continue for the next decade.
However, the clock-scaling bonanza has ended (no more speed increases there).
We must go to multicore processors.
	-More cores with lower clock rates burn less power and run at lower temperatures.
	-Instruction-Level Parallelism (ILP) benefits are declining.
	(We've had parallelism on single chips, hidden from programmers.)
	The chips had the hardware (overengineered), but the software did not make use of it, so designers backed off.
	-Yield problems
	IBM Cell processors have about a 10% yield.  A blade system with all 8 SPEs working is $20k; a PS3 with 7 is $600.

Power Density Limits Serial Performance.
	-Nowadays, chips dissipate about as much heat per unit area as a rocket nozzle.

The Revolution is Happening Now.
	-Chip density is continuing to follow Moore's Law, but clock speed is not.  Little to no ILP is left to be found.

Why parallelism? (2007)
	Multicore is all over the place.  It's not just theory any longer.
	Will all programmers become performance programmers?  Parallelism can "hide costs" by preprocessing / doing work in parallel.
	New features will have to be well hidden, as clock speeds aren't increasing like they used to.
Big Open Questions:
	1) What are the killer applications for multicore machines?
	2) How should the chips be designed: multicore, manycore, heterogeneous?
	3) How will they be programmed?
Intel has announced that it is looking at an 80-core processor.

We seem to be headed towards a petaflop machine in 2008 (data from top500.org).
There's a 6-8 year lag between the #1 fastest computer and the #500 fastest, which is what many people are programming on.  Will petaflop machines be common by 2015?
Memory Hierarchy:
With explicit parallelism, performance becomes a software problem.
Off-chip latencies improve only slowly, by roughly 7% per year.

Predictions:
	Parallelism will explode.  Core counts will double at roughly the rate of Moore's Law.  All Top500 machines will be petaflop machines by 2015.
	The performance burden will be placed on the software.
	A parallel programming model will emerge to handle multicore programming and parallelism in general.


PGAS Languages

Parallel software is still an unsolved problem.
Most parallel programs are written using either message passing with an SPMD model (scientific applications; scales easily) or shared memory with threads in OpenMP, Threads, or Java (non-scientific applications; easier to program, and also lends itself very easily to user interfaces).
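To make the contrast concrete, here is a minimal hypothetical sketch (mine, not from the talk) of the same reduction written in both styles.  The two programs are independent, and the array size and data are made up; build the first with an MPI C compiler (e.g. mpicc) and the second with OpenMP enabled (e.g. -fopenmp).

/* Style 1: SPMD message passing (MPI).  Every rank runs the same program
   on its own slice of the data and combines results explicitly. */
#include <mpi.h>
#include <stdio.h>
#define N 1000000

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank, nprocs;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    double local = 0.0, total = 0.0;
    for (long i = rank; i < N; i += nprocs)        /* my share of the work */
        local += 1.0 / (double)(i + 1);

    /* explicit communication to combine the partial sums */
    MPI_Reduce(&local, &total, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
    if (rank == 0) printf("sum = %f\n", total);
    MPI_Finalize();
    return 0;
}

/* Style 2: shared memory with threads (OpenMP).  One address space; the
   runtime splits the loop across threads and no messages are needed. */
#include <omp.h>
#include <stdio.h>
#define N 1000000

int main(void) {
    double sum = 0.0;
    #pragma omp parallel for reduction(+:sum)
    for (long i = 0; i < N; i++)
        sum += 1.0 / (double)(i + 1);
    printf("sum = %f\n", sum);
    return 0;
}

The SPMD version makes data distribution and communication explicit, which is what lets it scale across nodes; the OpenMP version leans on a single shared address space, which is why it is considered easier to program.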

Partitioned Global Address Space (PGAS) Languages:
	Maintain control over locality, for performance, programmability, and flexibility.

PGAS: data is designated (partitioned) as shared (global) or private (local).  Any thread/process may directly read/write shared data allocated by another.
The 3 current languages (UPC, Co-Array Fortran, Titanium) use an SPMD (Single Program, Multiple Data) execution model.  3 more are emerging: X10, Fortress, Chapel.
	Remote references have higher latency, so they should be used judiciously.

PGAS commonalities:
	Have both private and shared data
	Support for distributed data structures
	One-sided shared-memory communication
	Synchronization (global barriers, locks, memory fences)
	Collective communication, I/O libraries, etc.
The 3 current languages are built to stay close to the language on which they are based.  UPC is based on C, so it's low-level; Titanium is based on Java, so it's higher-level.

Private vs. Shared Variables in UPC (a short sketch follows after this list):
	Ordinary C variables and objects are allocated in the private memory space.
	Shared scalars are allocated only once, in thread 0's space.
	Shared arrays are spread across the threads (the distribution can be blocked or cyclic).
	Heap objects may be in either private or shared space.
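A minimal UPC sketch of these allocation rules (illustrative only, not from the talk; the shared qualifier, layout specifiers, MYTHREAD/THREADS, upc_barrier, and upc_all_alloc are standard UPC, but the particular variables and sizes are invented):

#include <upc.h>      /* UPC: shared qualifier, MYTHREAD, THREADS, upc_* library */
#include <stdio.h>

int mine;                             /* ordinary C variable: one private copy per thread        */
shared int ours;                      /* shared scalar: a single copy, with affinity to thread 0 */
shared double cyc[10*THREADS];        /* shared array, default cyclic layout across threads      */
shared [10] double blk[10*THREADS];   /* shared array laid out in blocks of 10 elements          */

int main(void) {
    mine = MYTHREAD;                  /* each thread writes only its own private copy            */
    if (MYTHREAD == 0) ours = 42;     /* any thread could write this; the data lives on thread 0 */
    upc_barrier;                      /* global synchronization                                   */

    /* any thread may read or write shared data, even with remote affinity */
    cyc[(MYTHREAD + 1) % THREADS] = (double)ours;
    blk[MYTHREAD * 10] = (double)MYTHREAD;

    /* heap objects can live in either space: upc_all_alloc (collective) gives a
       shared block spread across threads, while plain malloc gives private memory */
    shared double *d = (shared double *)upc_all_alloc(THREADS, 10 * sizeof(double));
    if (MYTHREAD == 0) d[0] = 3.14;

    upc_barrier;
    if (MYTHREAD == 0) printf("ours = %d, d[0] = %f\n", ours, d[0]);
    return 0;
}

The shared scalar's affinity to thread 0 and the cyclic default layout of shared arrays correspond directly to the rules in the list above.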
Titanium has a higher-level array abstraction.
These models all assume a static thread count.  The HPCS languages are looking at dynamic thread counts.
The UPC compiler we're working on is an extension of the existing Open64 compiler framework.
There's no JVM in Titanium.  There's a lot of web-oriented functionality built into Java that was not appropriate for high-performance parallel programming.

Are PGAS Languages good for multicore machines?
	-They work very well on shared memory.
	-Current UPC and Titanium implementations use threads.
	-OpenMP is substantial competition for PGAS on shared memory.
	-It is unclear whether multicore processors will continue to have physical shared memory or move more towards something like the Cell processor, with explicit local storage.


PGAS Languages on Clusters: One-sided vs. Two-sided Communication (a sketch contrasting the two follows after these lists).
One-sided:
	-A put/get message can be handled directly by a network interface with RDMA support.
	-Some networks will now see a one-sided message and write it directly to memory without disturbing the CPU.
	-InfiniBand and similar networks have support for one-sided messages; Ethernet currently does not.
Two-sided:
	-A send needs to be matched with a receive to identify the memory address at which to put the data.
	-Match tables need to be downloaded from the host to the network interface.
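A rough hypothetical sketch of the difference (mine, not from the talk).  The two-sided exchange uses ordinary MPI send/receive; the one-sided put uses MPI-2 RMA only because that API is standard and self-contained here; in the PGAS implementations the one-sided role is played by GASNet puts/gets instead.  Run with at least two ranks.

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    double buf = 0.0;                 /* destination storage on every rank */

    /* Two-sided: the message must be matched by a receive on the target;
       the receive is what supplies the destination address. */
    if (rank == 0) {
        double x = 3.14;
        MPI_Send(&x, 1, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        MPI_Recv(&buf, 1, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    }

    /* One-sided: the target exposes a window of memory once; after that a
       put carries the destination address with it and needs no matching
       receive, so an RDMA-capable NIC can deposit the data directly. */
    MPI_Win win;
    MPI_Win_create(&buf, sizeof(double), sizeof(double),
                   MPI_INFO_NULL, MPI_COMM_WORLD, &win);
    MPI_Win_fence(0, win);
    if (rank == 0) {
        double y = 2.71;
        MPI_Put(&y, 1, MPI_DOUBLE, 1, 0, 1, MPI_DOUBLE, win);
    }
    MPI_Win_fence(0, win);
    MPI_Win_free(&win);

    if (rank == 1) printf("after the put, buf = %f\n", buf);
    MPI_Finalize();
    return 0;
}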

One-Sided vs. Two-Sided in Practice:
	The half-power point (the message size at which half of peak bandwidth is reached) is about an order of magnitude lower (better) for GASNet than for MPI.

GASNet has better latency across a network.
GASNet bandwidth is at least comparable to MPI's for large messages.
These numbers were obtained by comparing against MPI-1, not MPI-2, which does have a one-sided implementation.  That implementation is not as efficient on many machines, and most people use MPI-1, which is why MPI-1 was used for comparison purposes.
GASNet excels at mid-range message sizes, which is important for overlap, i.e., asynchronous algorithms (see the sketch below).
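A small hypothetical sketch (mine, not from the talk) of the overlap pattern that such messages enable: start the communication early, compute while the data is in flight, and synchronize only when the remote data is actually needed.  It is written with MPI non-blocking point-to-point calls for familiarity; a GASNet-based runtime would use non-blocking one-sided puts/gets in the same way.

#include <mpi.h>
#include <stdio.h>
#define N 4096                        /* a mid-range message, worth overlapping */

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank, nprocs;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    double sendbuf[N], recvbuf[N];
    for (int i = 0; i < N; i++) sendbuf[i] = rank + i;

    int right = (rank + 1) % nprocs, left = (rank + nprocs - 1) % nprocs;
    MPI_Request reqs[2];

    /* start the communication early ... */
    MPI_Irecv(recvbuf, N, MPI_DOUBLE, left, 0, MPI_COMM_WORLD, &reqs[0]);
    MPI_Isend(sendbuf, N, MPI_DOUBLE, right, 0, MPI_COMM_WORLD, &reqs[1]);

    /* ... do independent work while the data is in flight ... */
    double local = 0.0;
    for (int i = 0; i < N; i++) local += sendbuf[i] * 0.5;

    /* ... and wait only when the remote data is actually needed */
    MPI_Waitall(2, reqs, MPI_STATUSES_IGNORE);
    local += recvbuf[0];

    if (rank == 0) printf("done, local = %f\n", local);
    MPI_Finalize();
    return 0;
}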

Making PGAS Real: Applications and Portability
AMR: Adaptive Mesh Refinement code.
Titanium AMR is implemented entirely in Titanium and allows for finer-grained communication.  This leads to a 10x reduction in lines of code.

It beats Chombo, the AMR package from LBNL.
The array abstraction in Titanium is not only used in AMR; it can be used across a wide range of applications.

In serial code, Titanium is comparable to (within a few percent of) C++/Fortran.

Dense and Sparse Matrix Factorization:
	As you break apart and factor a matrix, the dependent factors change on the fly.  Especially with sparse matrices, dependencies change dramatically.
	The UPC factorization uses a highly multithreaded style to mask latency and dependence delays.
	Three levels of threads: static UPC threads, multithreaded BLAS, and user-level threads with explicit yield.
	No dynamic load balancing, but lots of remote invocation.
	The layout is fixed and tuned for block size.
	Many hard problems in here: block-size tuning for both locality and granularity, task prioritization, etc.

UPC is significantly faster than ScaLAPACK because of the multithreading used to hide latency and dependence delays.

Most PGAS applications so far are numeric rather than symbolic.
These applications all require:
	Complex, irregular shared data structures
	The ability to communicate and share data asynchronously
		-Many current implementations are built on one-sided communication
	Fast, low-overhead communication/sharing

Titanium and UPC are quite portable.
	-Beneath the implementation (which compiles to C) is a common communication layer (GASNet), also used by gcc/upc.
	-Both run on most PCs, SMPs, clusters, and supercomputers.
	-There are several compilers for Titanium and UPC.

PGAS languages can easily move between shared-memory and distributed-memory machines.
Many people are currently working on dynamic parallel environments (non-static thread counts).
They provide control over locality and an SPMD model.

Languages with exceptions are still a major headache.  Analysis is ongoing.
GASNet is the common framework beneath the PGAS languages.  It can be used as a common framework for parallel implementations of your own work.
Both UPC and Titanium are working on better support for distinguishing communication within a node from communication between nodes.
