Processor-Oblivious Parallel Processing with Provable Performances
Jean-Louis Roch

Overview:
-Introduction
-Machine model and work-stealing
-Scheme 1: Adaptive parallel algorithms
-Scheme 2: Nano-loop
-Scheme 3: Amortizing the overhead of parallelism
-Putting things together

Interactive Parallel Computation:
Any application is "parallel":
-composition of several programs / library procedures (possibly concurrent)
-each procedure written independently, and itself possibly parallel

Parallel Interactive Application:
-Human in the loop
-Parallel machines (clusters) to enable large interactive applications
-Two main performance criteria:
  -Frequency (refresh rate)
    -Visualization: 30-60 Hz
    -Haptics: 1000 Hz
  -Latency (makespan for one iteration)
    -Object handling: 75 ms

New Parallel Supports (from small to large):
-Multi-core architectures
  -Dual-core processors
  -Dual-core graphics processors
  -Heterogeneous multi-cores
  -MPSoCs
-Commodity SMPs
  -8-way PCs equipped with multi-core processors (AMD HyperTransport) + 2 GPUs
-Clusters
  -72% of the top 500 machines
  -Trends: more processing units, faster networks (PCI-Express)
  -Heterogeneous (CPUs, GPUs, FPGAs)
-Grids
  -Heterogeneous networks
  -Heterogeneous administration policies
  -Resource volatility
-Virtual reality / visualization clusters
  -Virtual reality, scientific visualization and computational steering
  -PC clusters + graphics cards + multiple I/O devices (cameras, 3D trackers, etc.)

Parallelism induces overhead: e.g., parallel prefix on a fixed architecture.
Prefix problem: input a_0, ..., a_n; output: the prefix products P_k = a_0 * ... * a_k. The sequential algorithm performs only n operations; the parallel algorithm performs 2n operations, but in 2*log(n) time.
On p processors, the optimal time is T_p = 2n/(p+1), but it performs 2np/(p+1) operations.

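A minimal C++ sketch of the two operation counts (mine, not from the talk; addition stands in for the associative product, and the tree version assumes the input size is a power of two):

    #include <cstddef>
    #include <vector>

    // Sequential prefix: exactly n-1 additions for n inputs, but depth n-1.
    void prefix_seq(std::vector<long>& a) {
        for (std::size_t i = 1; i < a.size(); ++i)
            a[i] += a[i - 1];
    }

    // Tree-based prefix: about 2n additions in total, but only ~2*log2(n)
    // levels; within each level, all iterations of the inner loop touch
    // disjoint elements, hence can run in parallel.
    void prefix_tree(std::vector<long>& a) {
        std::size_t n = a.size();
        for (std::size_t d = 1; d < n; d *= 2)        // up-sweep: n-1 ops
            for (std::size_t i = 2 * d - 1; i < n; i += 2 * d)
                a[i] += a[i - d];
        for (std::size_t d = n / 2; d >= 2; d /= 2)   // down-sweep: ~n ops
            for (std::size_t i = d - 1; i + d / 2 < n; i += d)
                a[i + d / 2] += a[i];
    }

The extra ~n operations of the down-sweep are the price paid for reducing the depth from n-1 to ~2*log2(n).
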
Lower bound for the prefix: view the multiplication circuit as a binary tree.

The problem: to design a single algorithm that efficiently computes prefix(a) on an arbitrary dynamic architecture.

Dynamic architecture: non-fixed number of resources, variable speeds; e.g. grids, but not only: also an SMP server in multi-user mode.

Processor-oblivious algorithms: that's what we want.

Machine Model and Work Stealing
-Heterogeneous machine model and work-depth framework
-Distributed work stealing

Heterogeneous processors, work and depth:
Processor speeds are assumed to change arbitrarily and adversarially. Model [Bender, Rabin 02]: PI_i(t) = instantaneous speed of processor i at time t (in unit operations per second). Assumption: PI_max(t) < C*PI_min(t).
Definition, for a computation of duration T:
-Total speed: PI_tot = (1/T) * sum(i=1..p) sum(t=0..T) PI_i(t) (time-averaged, so that PI_tot is itself a speed)
-Average speed per processor: PI_ave = PI_tot/p
-Work: W = total number of operations performed
-Depth: D = number of operations on a critical path (~ parallel "time" on infinite resources)
For any greedy (maximum-utilization) schedule: makespan <= W/(p*PI_ave) + (1 - 1/p)*D/PI_ave

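A worked instance of this bound (the numbers are illustrative assumptions, not from the talk), in LaTeX notation:

    \[ \text{makespan} \;\le\; \frac{W}{p\,\Pi_{\mathrm{ave}}}
       \;+\; \Bigl(1 - \frac{1}{p}\Bigr)\frac{D}{\Pi_{\mathrm{ave}}} \]
    % With W = 10^9 ops, D = 10^4 ops, p = 32, \Pi_{\mathrm{ave}} = 10^9 ops/s:
    %   work term:  10^9 / (32 \cdot 10^9)            \approx 0.031 s
    %   depth term: (31/32) \cdot 10^4 / 10^9         \approx 10^{-5} s

As long as D << W/p, the depth term is negligible: a small depth is what makes greedy schedules (and work stealing, below) near-optimal.
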
The Work-Stealing Algorithm:
A distributed and randomized algorithm that computes a greedy schedule:
-Each processor manages a local stack (depth-first execution)
-When idle, a processor steals the topmost task of a remote, non-idle victim processor (randomly chosen)
-Theorem: with good probability,
  -#steals < p*D
  -execution time <= W/(p*PI_ave) + O(D/PI_ave)
-Interest: if W is independent of p and D is small, work stealing achieves a near-optimal schedule.

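To make the policy concrete, here is a toy single-threaded, round-based simulation (my sketch; real schedulers such as Cilk or Kaapi use concurrent deques, not a global round loop). Each processor pops work from the bottom of its own deque, depth-first; an idle processor steals the topmost (oldest, hence largest) task of a random victim:

    #include <cstdio>
    #include <deque>
    #include <random>
    #include <vector>

    // A "task" is a half-open range [lo, hi); executing it either performs
    // one unit of work (a leaf) or splits the range in two.
    struct Task { int lo, hi; };

    int main() {
        const int P = 4, N = 1 << 10;
        std::vector<std::deque<Task>> dq(P);    // one deque per processor
        dq[0].push_back({0, N});                // all work starts on processor 0
        std::mt19937 rng(42);
        std::uniform_int_distribution<int> pick(0, P - 1);
        long steals = 0, done = 0;

        while (done < N) {
            for (int i = 0; i < P; ++i) {
                if (dq[i].empty()) {            // idle: attempt one random steal
                    int v = pick(rng);
                    if (v != i && !dq[v].empty()) {
                        dq[i].push_back(dq[v].front());   // steal the TOPMOST
                        dq[v].pop_front();                // (biggest) task
                        ++steals;
                    }
                    continue;
                }
                Task t = dq[i].back();          // local work: depth-first,
                dq[i].pop_back();               // from the bottom of the deque
                if (t.hi - t.lo == 1) {
                    ++done;                     // leaf: one unit of work
                } else {
                    int mid = (t.lo + t.hi) / 2;
                    dq[i].push_back({t.lo, mid});
                    dq[i].push_back({mid, t.hi});
                }
            }
        }
        std::printf("%d units done, %ld steals on %d processors\n", N, steals, P);
    }

Because tasks split ranges in half, the depth here is D = O(log N), and in this toy the observed number of steals stays far below the N units of work, consistent with #steals = O(p*D).
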
Work-Stealing Implementation: scheduling vs. control of the policy (realization)
-An efficient scheduling policy (close to optimal) is difficult in general at coarse grain, but easy if D is small; the execution time is then as above (fine grain).
-Control of the policy is expensive in general at fine grain, but its overhead is small if the number of tasks is small (coarse grain).

If D is small, a work-stealing algorithm performs a small number of steals
=> Work-first principle: "scheduling overheads should be borne by the critical path of the computation" [Frigo 98]
Implementation: since all tasks but a few are executed in the local stack, the overhead of task creation should be as close as possible to that of a sequential function call.
At any time, on any non-idle processor, the parallel program degenerates efficiently into a local sequential execution.

Work-stealing implementations following the work-first principle: Cilk
-Cilk-5 (http://supertech.csail.mit.edu/cilk): C extension
  -spawn f(a); sync (series-parallel programs)
  -Requires a shared-memory machine
  -Depth-first execution, with synchronization (on sync) with the end of a task:
    -spawned tasks are pushed into a double-ended queue
    -"Two-clone" compilation strategy [Frigo-Leiserson-Randall 98]:
      -on a successful steal, the thief executes the continuation of the topmost ready task
      -when the continuation hasn't been stolen, "sync" = nop; otherwise, synchronization with its thief
-Won the 2006 award "Best Combination of Elegance and Performance"

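For readers who haven't seen the spawn/sync style, here is a rough C++ analogue on the classic fib example (my sketch: std::async stands in for spawn and future::get for sync; Cilk-5's two-clone scheme makes a real spawn cost about as much as a function call, which std::async does not):

    #include <cstdio>
    #include <future>

    long fib(long n) {
        if (n < 2) return n;
        if (n < 16)                              // manual cutoff, to keep the
            return fib(n - 1) + fib(n - 2);      // thread count of this toy bounded
        auto x = std::async(std::launch::async, fib, n - 1);  // "spawn"
        long y = fib(n - 2);                     // parent continues its own work
        return x.get() + y;                      // "sync": wait for the child
    }

    int main() { std::printf("fib(24) = %ld\n", fib(24)); }

Note the manual cutoff: fixing the grain by hand is exactly the tuning that the processor-oblivious schemes below are designed to eliminate.
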
Work-stealing implementations following the work-first principle: KAAPI
-Kaapi/Athapascan (http://kaapi.gforge.inria): C++ library
  -Fork<f>()(a, ...), with the access mode of each parameter (value, read, write, r/w, cw) specified in the prototype of f (macro dataflow programs)
  -Supports distributed- and shared-memory machines; heterogeneous processors
  -Depth-first (reference order) execution, with synchronization on data access
  -Double-ended queue (mutual exclusion with compare-and-swap)
  -more

N-Queens: Takaken's sequential C code parallelized in C++/Kaapi
-T. Gautier and S. Guelton won the 2006 "Prix special du Jury" for the best performance at the NQueens contest.
-Some facts [on Grid'5000, a grid of processors of heterogeneous speeds]:
  -NQueens(21) in 78 s on about 1000 processors
  -NQueens(22) in 502.9 s on 1458 processors
  -NQueens(23) in 4435 s on 1422 processors [~2.4 * 10^13 solutions]
  -0.625% idle time per processor
  -<20 s to deploy up to 1000 processes on 1000 machines [TakTuk, Huard]
  -15% more

Work-first principle and adaptability:
-Work-first principle: implicit dynamic choice between two executions:
  -a sequential "depth-first" execution of the parallel algorithm (local, default)
  -a parallel "breadth-first" one

Extended work-stealing: how do we get W_1 and W_infinity small?
Concurrently sequential and parallel, based on work-stealing and the work-first principle:
instead of optimizing the sequential execution of the best parallel algorithm, let's optimize the parallel execution of the best sequential algorithm.
Always execute a sequential algorithm, to reduce the overhead of parallelism; the parallel algorithm is used only when a processor becomes idle (i.e., work-stealing), to extract parallelism from the work remaining in the sequential computation.
Assumption: two concurrent algorithms that are complementary:
-one sequential (always performed; the priority)
-the other parallel

Extended work-stealing and granularity
-Scheme of the sequential process: the nanoloop

    while (not completed(Wrem)) and (next_operation hasn't been stolen) {
        atomic { extract_next k operations; Wrem -= k; }
        process the k extracted operations;
    }

-Processor-oblivious algorithm:
  -whatever p is, it performs O(p*D) preemption operations (continuation faults) -> D should be as small as possible, to maximize both speed-up and locality

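A C++ sketch of this nanoloop (assumptions mine: the remaining work Wrem is an interval of independent operations, k is fixed, and a plain mutex implements the atomic section; Kaapi's real deque uses compare-and-swap, as noted above):

    #include <algorithm>
    #include <cstddef>
    #include <mutex>

    // Wrem is modeled as the interval [begin, end) of independent
    // operations. Both the owner's extract_next(k) and a thief's
    // extract_par (stealing the back half) run under one lock, mirroring
    // the talk's "atomic { extract_next k operations; Wrem -= k; }".
    struct Work {
        std::mutex m;
        std::size_t begin = 0, end = 0;
    };

    template <class Body>
    void owner_nanoloop(Work& w, std::size_t k, Body process_one) {
        for (;;) {
            std::size_t b, stop;
            {
                std::lock_guard<std::mutex> g(w.m);  // atomic extraction
                if (w.begin >= w.end) break;         // completed, or all stolen
                b = w.begin;
                stop = std::min(b + k, w.end);
                w.begin = stop;                      // Wrem -= k
            }
            for (std::size_t i = b; i < stop; ++i)
                process_one(i);                      // process the k operations
        }
    }

    // extract_par, run by an idle thief: grab the back half of the interval.
    bool steal_half(Work& w, std::size_t& lo, std::size_t& hi) {
        std::lock_guard<std::mutex> g(w.m);
        if (w.end - w.begin < 2) return false;       // nothing worth stealing
        lo = w.begin + (w.end - w.begin) / 2;
        hi = w.end;
        w.end = lo;                                  // the owner sees the cut
        return true;
    }

A thief that succeeds in steal_half runs the same nanoloop on its own interval [lo, hi), recursively exposing parallelism only where idle processors actually show up.
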
Interactive application with time constraint:
-Anytime algorithm:
  -can be stopped at any time (with a result)

Amortizing the arithmetic overhead of parallelism
Adaptive scheme: extract_seq/nanoloop // extract_par
-ensures an optimal number of operations on 1 processor, but gives no guarantee on the work performed on p processors

E.g. (C++ STL): find_if(first, last, predicate)
locates the first element in [first, last) satisfying the predicate.

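For reference, the plain sequential call being discussed (standard C++, nothing adaptive yet):

    #include <algorithm>
    #include <cstdio>
    #include <vector>

    int main() {
        std::vector<int> v{3, 8, 1, 42, 7, 42};
        // Locate the first element in [begin, end) satisfying the predicate.
        auto it = std::find_if(v.begin(), v.end(),
                               [](int x) { return x > 10; });
        if (it != v.end())
            std::printf("first match: %d at index %td\n", *it, it - v.begin());
    }
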
This may be a drawback:
-unneeded processor usage
-undesirable for library code that may be used in a complex application with many components
-(or not fair to other users)
-it increases the time of the application: any parallelism that increases execution time should be avoided

This is similar to the nanoloop for the sequential process, which balances the (atomic) local work against the depth of the remaining work. Here, we amortize the work induced by the extract_par operation, ensuring that this work remains small enough:
-either with respect to the useful work already performed
-or with respect to the useful work yet to be performed (if known)
-or both

E.g., find_if(first, last, predicate):
-only the work already performed is known (on-line)
-so prevent assigning more than alpha(W_done) operations to the work-stealers
-choices for alpha(n):
  -n/2: similar to Floyd's iteration (approximation ratio = 2)
  -n/log(n): to ensure optimal usage of the work-stealers

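A sketch of the alpha(n) = n/2 scheme (assumptions mine: a single macro-loop on the owner's side, with each chunk scanned by a plain std::find_if where the full algorithm would hand it to work-stealers):

    #include <algorithm>
    #include <cstddef>

    // Chunk-by-chunk find_if: each macro-step hands out at most
    // alpha(done) = done/2 further elements, so a step containing the
    // first match at position m never extends far past it: roughly 2*m
    // elements are touched in total, the talk's approximation ratio of 2.
    template <class It, class Pred>
    It bounded_find_if(It first, It last, Pred pred) {
        std::ptrdiff_t done = 2;                   // bootstrap chunk size
        It cur = first;
        while (cur != last) {
            std::ptrdiff_t chunk =
                std::min(std::max<std::ptrdiff_t>(done / 2, 1), last - cur);
            It stop = cur + chunk;                 // alpha(done) candidates
            It hit = std::find_if(cur, stop, pred);// parallelizable step
            if (hit != stop) return hit;           // found: stop early
            done += chunk;                         // work now performed
            cur = stop;
        }
        return last;                               // no element matches
    }

With alpha(n) = n/log(n), the chunks grow slightly slower, the variant listed above for ensuring optimal usage of the work-stealers.
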
Putting things together: Processor-oblivious prefix computation

The critical path is borne by the parallel algorithm.
Analysis:
    execution time <= 2n/((p+1)*PI_ave) + O(log(n)/PI_ave)

Conclusion:
-Fine-grain parallelism enables efficient execution on a small number of processors
-The efficiency of classical work stealing relies on the work-first principle
-Processor-oblivious algorithms build on work stealing and the work-first principle:
  -based on anytime extraction of parallelism from any sequential algorithm (may execute different amounts of operations)
  -oblivious: near-optimal whatever the execution context is
-Generic scheme for stream computations:
  -parallelism introduces a copy overhead from local buffers to the output (e.g., gzip compression, MPEG-4/H.264)