Processor-Oblivious Parallel Processing with Provable Performances
Jean-Louis Roch

Overview:
-Introduction
-Machine model and work-stealing
-Scheme 1: Adaptive parallel algorithms
-Scheme 2: Nano-loop
-Scheme 3: Amortizing the overhead of parallelism
-Putting things together

Interactive Parallel Computation:
Any application is "parallel":
-composition of several programs / library procedures (possibly concurrent)
-each procedure written independently, and itself possibly parallel

Parallel Interactive Application:
-Human in the loop
-Parallel machines (clusters) to enable large interactive applications
-Two main performance criteria:
  -Frequency (refresh rate)
    -Visualization: 30-60 Hz
    -Haptics: 1000 Hz
  -Latency (makespan for one iteration)
    -Object handling: 75 ms

New Parallel Supports (from small to large):
-Multi-core architectures
  -Dual-core processors
  -Dual-core graphics processors
  -Heterogeneous multi-cores
  -MPSoCs
-Commodity SMPs
  -8-way PCs equipped with multi-core processors (AMD HyperTransport) + 2 GPUs
-Clusters
  -72% of the top 500 machines
  -Trends: more processing units, faster networks (PCI-Express)
  -Heterogeneous (CPUs, GPUs, FPGAs)
-Grids
  -Heterogeneous networks
  -Heterogeneous administration policies
  -Resource volatility
-Virtual reality / visualization clusters
  -Virtual reality, scientific visualization and computational steering
  -PC clusters + graphics cards + multiple I/O devices (cameras, 3D trackers, etc.)

Parallelism induces overhead: e.g., parallel prefix on a fixed architecture.
Prefix problem: input a_0, ..., a_n; output: the prefix products P_k = a_0 * ... * a_k. The sequential algorithm performs only n operations; the parallel algorithm performs 2n operations, but in 2*log(n) time.
On p processors, the optimal time is T_p = 2n/(p+1), but it performs 2np/(p+1) operations.

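A minimal C++ sketch of the two operation counts (mine, not from the talk; addition stands in for the associative product, and the tree version assumes the input size is a power of two):

    #include <cstddef>
    #include <vector>

    // Sequential prefix: exactly n-1 additions for n inputs, but depth n-1.
    void prefix_seq(std::vector<long>& a) {
        for (std::size_t i = 1; i < a.size(); ++i)
            a[i] += a[i - 1];
    }

    // Tree-based prefix: about 2n additions in total, but only ~2*log2(n)
    // levels; within each level, all iterations of the inner loop touch
    // disjoint elements, hence can run in parallel.
    void prefix_tree(std::vector<long>& a) {
        std::size_t n = a.size();
        for (std::size_t d = 1; d < n; d *= 2)        // up-sweep: n-1 ops
            for (std::size_t i = 2 * d - 1; i < n; i += 2 * d)
                a[i] += a[i - d];
        for (std::size_t d = n / 2; d >= 2; d /= 2)   // down-sweep: ~n ops
            for (std::size_t i = d - 1; i + d / 2 < n; i += d)
                a[i + d / 2] += a[i];
    }

The extra ~n operations of the down-sweep are the price paid for reducing the depth from n-1 to ~2*log2(n).
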
Lower bound for the prefix: view the multiplication circuit as a binary tree.

The problem: to design a single algorithm that efficiently computes prefix(a) on an arbitrary dynamic architecture.

Dynamic architecture: non-fixed number of resources, variable speeds; e.g. grids, but not only: also an SMP server in multi-user mode.

Processor-oblivious algorithms: that's what we want.

Machine Model and Work Stealing
-Heterogeneous machine model and work-depth framework
-Distributed work stealing

Heterogeneous processors, work and depth:
Processor speeds are assumed to change arbitrarily and adversarially. Model [Bender, Rabin 02]: PI_i(t) = instantaneous speed of processor i at time t (in unit operations per second). Assumption: PI_max(t) < C*PI_min(t).
Definition, for a computation of duration T:
-Total speed: PI_tot = (1/T) * sum(i=1..p) sum(t=0..T) PI_i(t) (time-averaged, so that PI_tot is itself a speed)
-Average speed per processor: PI_ave = PI_tot/p
-Work: W = total number of operations performed
-Depth: D = number of operations on a critical path (~ parallel "time" on infinite resources)
For any greedy (maximum-utilization) schedule: makespan <= W/(p*PI_ave) + (1 - 1/p)*D/PI_ave

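A worked instance of this bound (the numbers are illustrative assumptions, not from the talk), in LaTeX notation:

    \[ \text{makespan} \;\le\; \frac{W}{p\,\Pi_{\mathrm{ave}}}
       \;+\; \Bigl(1 - \frac{1}{p}\Bigr)\frac{D}{\Pi_{\mathrm{ave}}} \]
    % With W = 10^9 ops, D = 10^4 ops, p = 32, \Pi_{\mathrm{ave}} = 10^9 ops/s:
    %   work term:  10^9 / (32 \cdot 10^9)            \approx 0.031 s
    %   depth term: (31/32) \cdot 10^4 / 10^9         \approx 10^{-5} s

As long as D << W/p, the depth term is negligible: a small depth is what makes greedy schedules (and work stealing, below) near-optimal.
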
The Work-Stealing Algorithm:
A distributed and randomized algorithm that computes a greedy schedule:
-Each processor manages a local stack (depth-first execution)
-When idle, a processor steals the topmost task of a remote, non-idle victim processor (randomly chosen)
-Theorem: with good probability,
  -#steals < p*D
  -execution time <= W/(p*PI_ave) + O(D/PI_ave)
-Interest: if W is independent of p and D is small, work stealing achieves a near-optimal schedule.

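To make the policy concrete, here is a toy single-threaded, round-based simulation (my sketch; real schedulers such as Cilk or Kaapi use concurrent deques, not a global round loop). Each processor pops work from the bottom of its own deque, depth-first; an idle processor steals the topmost (oldest, hence largest) task of a random victim:

    #include <cstdio>
    #include <deque>
    #include <random>
    #include <vector>

    // A "task" is a half-open range [lo, hi); executing it either performs
    // one unit of work (a leaf) or splits the range in two.
    struct Task { int lo, hi; };

    int main() {
        const int P = 4, N = 1 << 10;
        std::vector<std::deque<Task>> dq(P);    // one deque per processor
        dq[0].push_back({0, N});                // all work starts on processor 0
        std::mt19937 rng(42);
        std::uniform_int_distribution<int> pick(0, P - 1);
        long steals = 0, done = 0;

        while (done < N) {
            for (int i = 0; i < P; ++i) {
                if (dq[i].empty()) {            // idle: attempt one random steal
                    int v = pick(rng);
                    if (v != i && !dq[v].empty()) {
                        dq[i].push_back(dq[v].front());   // steal the TOPMOST
                        dq[v].pop_front();                // (biggest) task
                        ++steals;
                    }
                    continue;
                }
                Task t = dq[i].back();          // local work: depth-first,
                dq[i].pop_back();               // from the bottom of the deque
                if (t.hi - t.lo == 1) {
                    ++done;                     // leaf: one unit of work
                } else {
                    int mid = (t.lo + t.hi) / 2;
                    dq[i].push_back({t.lo, mid});
                    dq[i].push_back({mid, t.hi});
                }
            }
        }
        std::printf("%d units done, %ld steals on %d processors\n", N, steals, P);
    }

Because tasks split ranges in half, the depth here is D = O(log N), and in this toy the observed number of steals stays far below the N units of work, consistent with #steals = O(p*D).
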
Work-Stealing Implementation: scheduling vs. control of the policy (realization)
-An efficient scheduling policy (close to optimal) is difficult in general at coarse grain, but easy if D is small; the execution time is then as above (fine grain).
-Control of the policy is expensive in general at fine grain, but its overhead is small if the number of tasks is small (coarse grain).

If D is small, a work-stealing algorithm performs a small number of steals
=> Work-first principle: "scheduling overheads should be borne by the critical path of the computation" [Frigo 98]
Implementation: since all tasks but a few are executed in the local stack, the overhead of task creation should be as close as possible to that of a sequential function call.
At any time, on any non-idle processor, the parallel program degenerates efficiently into a local sequential execution.

Work-stealing implementations following the work-first principle: Cilk
-Cilk-5 (http://supertech.csail.mit.edu/cilk): C extension
  -spawn f(a); sync (series-parallel programs)
  -Requires a shared-memory machine
  -Depth-first execution, with synchronization (on sync) with the end of a task:
    -spawned tasks are pushed into a double-ended queue
    -"Two-clone" compilation strategy [Frigo-Leiserson-Randall 98]:
      -on a successful steal, the thief executes the continuation of the topmost ready task
      -when the continuation hasn't been stolen, "sync" = nop; otherwise, synchronization with its thief
-Won the 2006 award "Best Combination of Elegance and Performance"

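For readers who haven't seen the spawn/sync style, here is a rough C++ analogue on the classic fib example (my sketch: std::async stands in for spawn and future::get for sync; Cilk-5's two-clone scheme makes a real spawn cost about as much as a function call, which std::async does not):

    #include <cstdio>
    #include <future>

    long fib(long n) {
        if (n < 2) return n;
        if (n < 16)                              // manual cutoff, to keep the
            return fib(n - 1) + fib(n - 2);      // thread count of this toy bounded
        auto x = std::async(std::launch::async, fib, n - 1);  // "spawn"
        long y = fib(n - 2);                     // parent continues its own work
        return x.get() + y;                      // "sync": wait for the child
    }

    int main() { std::printf("fib(24) = %ld\n", fib(24)); }

Note the manual cutoff: fixing the grain by hand is exactly the tuning that the processor-oblivious schemes below are designed to eliminate.
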
Work-stealing implementations following the work-first principle: KAAPI
-Kaapi/Athapascan (http://kaapi.gforge.inria): C++ library
  -Fork<f>()(a, ...), with the access mode of each parameter (value, read, write, r/w, cw) specified in the prototype of f (macro dataflow programs)
  -Supports distributed- and shared-memory machines; heterogeneous processors
  -Depth-first (reference order) execution, with synchronization on data access
  -Double-ended queue (mutual exclusion with compare-and-swap)
  -more

N-Queens: Takaken's sequential C code parallelized in C++/Kaapi
-T. Gautier and S. Guelton won the 2006 "Prix special du Jury" for the best performance at the NQueens contest.
-Some facts [on Grid'5000, a grid of processors of heterogeneous speeds]:
  -NQueens(21) in 78 s on about 1000 processors
  -NQueens(22) in 502.9 s on 1458 processors
  -NQueens(23) in 4435 s on 1422 processors [~2.4 * 10^13 solutions]
  -0.625% idle time per processor
  -<20 s to deploy up to 1000 processes on 1000 machines [TakTuk, Huard]
  -15% more

Work-first principle and adaptability:
-Work-first principle: implicit dynamic choice between two executions:
  -a sequential "depth-first" execution of the parallel algorithm (local, default)
  -a parallel "breadth-first" one

Extended work-stealing: how do we get W_1 and W_infinity small?
Concurrently sequential and parallel, based on work-stealing and the work-first principle:
instead of optimizing the sequential execution of the best parallel algorithm, let's optimize the parallel execution of the best sequential algorithm.
Always execute a sequential algorithm, to reduce the overhead of parallelism; the parallel algorithm is used only when a processor becomes idle (i.e., work-stealing), to extract parallelism from the work remaining in the sequential computation.
Assumption: two concurrent algorithms that are complementary:
-one sequential (always performed; the priority)
-the other parallel

Extended work-stealing and granularity
-Scheme of the sequential process: the nanoloop

    while (not completed(Wrem)) and (next_operation hasn't been stolen) {
        atomic { extract_next k operations; Wrem -= k; }
        process the k extracted operations;
    }

-Processor-oblivious algorithm:
  -whatever p is, it performs O(p*D) preemption operations (continuation faults) -> D should be as small as possible, to maximize both speed-up and locality

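A C++ sketch of this nanoloop (assumptions mine: the remaining work Wrem is an interval of independent operations, k is fixed, and a plain mutex implements the atomic section; Kaapi's real deque uses compare-and-swap, as noted above):

    #include <algorithm>
    #include <cstddef>
    #include <mutex>

    // Wrem is modeled as the interval [begin, end) of independent
    // operations. Both the owner's extract_next(k) and a thief's
    // extract_par (stealing the back half) run under one lock, mirroring
    // the talk's "atomic { extract_next k operations; Wrem -= k; }".
    struct Work {
        std::mutex m;
        std::size_t begin = 0, end = 0;
    };

    template <class Body>
    void owner_nanoloop(Work& w, std::size_t k, Body process_one) {
        for (;;) {
            std::size_t b, stop;
            {
                std::lock_guard<std::mutex> g(w.m);  // atomic extraction
                if (w.begin >= w.end) break;         // completed, or all stolen
                b = w.begin;
                stop = std::min(b + k, w.end);
                w.begin = stop;                      // Wrem -= k
            }
            for (std::size_t i = b; i < stop; ++i)
                process_one(i);                      // process the k operations
        }
    }

    // extract_par, run by an idle thief: grab the back half of the interval.
    bool steal_half(Work& w, std::size_t& lo, std::size_t& hi) {
        std::lock_guard<std::mutex> g(w.m);
        if (w.end - w.begin < 2) return false;       // nothing worth stealing
        lo = w.begin + (w.end - w.begin) / 2;
        hi = w.end;
        w.end = lo;                                  // the owner sees the cut
        return true;
    }

A thief that succeeds in steal_half runs the same nanoloop on its own interval [lo, hi), recursively exposing parallelism only where idle processors actually show up.
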
Interactive application with time constraint:
-Anytime algorithm:
  -can be stopped at any time (with a result)

Amortizing the arithmetic overhead of parallelism
Adaptive scheme: extract_seq/nanoloop // extract_par
-ensures an optimal number of operations on 1 processor, but gives no guarantee on the work performed on p processors

E.g. (C++ STL): find_if(first, last, predicate)
locates the first element in [first, last) satisfying the predicate.

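For reference, the plain sequential call being discussed (standard C++, nothing adaptive yet):

    #include <algorithm>
    #include <cstdio>
    #include <vector>

    int main() {
        std::vector<int> v{3, 8, 1, 42, 7, 42};
        // Locate the first element in [begin, end) satisfying the predicate.
        auto it = std::find_if(v.begin(), v.end(),
                               [](int x) { return x > 10; });
        if (it != v.end())
            std::printf("first match: %d at index %td\n", *it, it - v.begin());
    }
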
This may be a drawback:
-unneeded processor usage
-undesirable for library code that may be used in a complex application with many components
-(or not fair to other users)
-it increases the time of the application: any parallelism that increases execution time should be avoided

This is similar to the nanoloop for the sequential process, which balances the (atomic) local work against the depth of the remaining work. Here, we amortize the work induced by the extract_par operation, ensuring that this work remains small enough:
-either with respect to the useful work already performed
-or with respect to the useful work yet to be performed (if known)
-or both

E.g., find_if(first, last, predicate):
-only the work already performed is known (on-line)
-so prevent assigning more than alpha(W_done) operations to the work-stealers
-choices for alpha(n):
  -n/2: similar to Floyd's iteration (approximation ratio = 2)
  -n/log(n): to ensure optimal usage of the work-stealers

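A sketch of the alpha(n) = n/2 scheme (assumptions mine: a single macro-loop on the owner's side, with each chunk scanned by a plain std::find_if where the full algorithm would hand it to work-stealers):

    #include <algorithm>
    #include <cstddef>

    // Chunk-by-chunk find_if: each macro-step hands out at most
    // alpha(done) = done/2 further elements, so a step containing the
    // first match at position m never extends far past it: roughly 2*m
    // elements are touched in total, the talk's approximation ratio of 2.
    template <class It, class Pred>
    It bounded_find_if(It first, It last, Pred pred) {
        std::ptrdiff_t done = 2;                   // bootstrap chunk size
        It cur = first;
        while (cur != last) {
            std::ptrdiff_t chunk =
                std::min(std::max<std::ptrdiff_t>(done / 2, 1), last - cur);
            It stop = cur + chunk;                 // alpha(done) candidates
            It hit = std::find_if(cur, stop, pred);// parallelizable step
            if (hit != stop) return hit;           // found: stop early
            done += chunk;                         // work now performed
            cur = stop;
        }
        return last;                               // no element matches
    }

With alpha(n) = n/log(n), the chunks grow slightly slower, the variant listed above for ensuring optimal usage of the work-stealers.
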
Putting things together: Processor-oblivious prefix computation

The critical path is borne by the parallel algorithm.
Analysis:
    execution time <= 2n/((p+1)*PI_ave) + O(log(n)/PI_ave)

Conclusion:
-Fine-grain parallelism enables efficient execution on a small number of processors
-The efficiency of classical work stealing relies on the work-first principle
-Processor-oblivious algorithms build on work stealing and the work-first principle:
  -based on anytime extraction of parallelism from any sequential algorithm (may execute different amounts of operations)
  -oblivious: near-optimal whatever the execution context is
-Generic scheme for stream computations:
  -parallelism introduces a copy overhead from local buffers to the output (e.g., gzip compression, MPEG-4/H.264)