Attachment: notes_pernet.txt
Talk: Parallel Perspectives for the LinBox Library
Clément Pernet

Exact Linear Algebra:
Building block in exact computation. Topology: Smith form. Graph theory: characteristic polynomial. Representation theory: null space. Cryptography: sparse system resolution.
The matrices involved can get very, very large.

Software Libraries for Exact Computations:
finite fields: NTL, GMP, Lidia, Pari, ...
polynomials: NTL, ...
integers: GMP, ...

Global Solutions:
Maple
Magma
Sage

Linear algebra? It falls somewhere in between: in the global solutions it is not always as efficient as it could be. This is where LinBox tries to intervene, linking the global solutions with the specialized libraries.

LinBox is generic middleware.

Maple, GAP, Sage ---> LinBox ---> finite fields (NTL, Givaro, ...), BLAS (ATLAS, GOTO, ...), GMP

Joint NSF-NSERC-CNRS project:
-U Delaware, NC State
-U Waterloo, U Calgary
-Laboratoires, ... (missed)

Solutions: rank, det, minpoly, charpoly, system solving, positive definiteness; over the domains: finite fields, ZZ, QQ; for matrices: sparse, structured, dense.

A design for genericity:
Field/Ring interface:
-Shared interface with Givaro
-Wraps the NTL, Lidia, Givaro implementations using archetypes or envelopes
-Proper implementations suited for dense computations (mainly relying on FLOP arithmetic)
Matrix interface

Structure of the Library:
Solutions (det, rank) - specify the method, domain -> Algorithms ... (missed)

Several levels of use:
Web servers: http://www.linalg.org/
Executables: $ charpoly MyMatrix 65521
Call to a solution:
  NTL::ZZ_p F(65521);
  Toeplitz<NTL::ZZ_p> A(F);
  Polynomial<NTL::ZZ_p> P;
  charpoly(P, A);
Calls to specific algorithms

Dense computations:
Building block: matrix multiplication over a word-size finite field.
Principle:
-Delayed modular reduction
-Floating-point arithmetic (fused mac, SSE2, ...)
-BLAS cache management
-Sub-cubic algorithm (Winograd)
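The delayed-reduction idea can be sketched as follows (a toy illustration, not LinBox's actual FFLAS code; `fgemm_delayed` is a made-up name): field elements are stored in doubles, each dot product is accumulated entirely in floating point, and the reduction mod p happens once at the end. This is exact as long as n*(p-1)^2 stays below 2^53, the double mantissa limit.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Sketch: multiply n x n matrices over Z/pZ using double arithmetic with a
// single delayed reduction at the end. Valid while n*(p-1)^2 < 2^53, so
// every dot product fits exactly in a double before the final '% p'.
std::vector<double> fgemm_delayed(const std::vector<double>& A,
                                  const std::vector<double>& B,
                                  int n, int64_t p) {
    std::vector<double> C(n * n, 0.0);
    for (int i = 0; i < n; ++i)
        for (int k = 0; k < n; ++k) {
            double aik = A[i * n + k];
            for (int j = 0; j < n; ++j)
                C[i * n + j] += aik * B[k * n + j];  // no reduction here
        }
    for (double& c : C)                              // delayed reduction
        c = static_cast<double>(static_cast<int64_t>(c) % p);
    return C;
}
```

The inner loops are exactly a BLAS-style dgemm, which is why the real implementation can delegate them to ATLAS or GOTO and inherit their cache management.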

Design of other dense routines:
-Reduction to matrix multiplication
-Bounds for delayed modular operations
-Block algorithm with multiple cascade

Char Poly:
Fact: there is an O(n^omega) Las Vegas probabilistic algorithm for computing the char poly over a field.
This new algorithm is also practical: it virtually always beats LU-Krylov for n > 100.

BlackBox Computations:
Goal: computation with very large sparse or structured matrices.
-No explicit representation of the matrix
-Only operation: application to a vector
-Efficient algorithms
-Efficient preconditioners: Toeplitz, Hankel, Butterfly, ...
...

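The "only operation is application to a vector" constraint can be made concrete with a minimal sketch (the names `Blackbox` and `SparseBlackbox` are illustrative; LinBox's real blackbox archetype is richer):

```cpp
#include <cassert>
#include <utility>
#include <vector>

// Hypothetical minimal black-box interface: a matrix is known only
// through the operation y = A*x.
struct Blackbox {
    virtual std::vector<long> apply(const std::vector<long>& x) const = 0;
    virtual ~Blackbox() = default;
};

// Sparse matrix in coordinate form; the dense n x n array is never built.
struct SparseBlackbox : Blackbox {
    struct Entry { int i, j; long v; };
    std::vector<Entry> entries;
    int n;
    SparseBlackbox(int n, std::vector<Entry> e)
        : entries(std::move(e)), n(n) {}
    std::vector<long> apply(const std::vector<long>& x) const override {
        std::vector<long> y(n, 0);
        for (const Entry& e : entries)
            y[e.i] += e.v * x[e.j];   // cost is O(nnz), not O(n^2)
        return y;
    }
};
```

An algorithm written against `Blackbox` works unchanged for sparse, Toeplitz, or butterfly-preconditioned matrices, which is the point of the abstraction.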
Block Projection Algorithms:
-Wiedemann algorithm: scalar projections of A^i for i=1..2d
-Block Wiedemann: kxk dense projections of A^i for i=1..2d/k
-Balance between blackbox and dense applications

Data Containers/Iterators:
Distinction between the computation and the access to the data.
Example: the iterates (u^T A^i v)_{i=1..k} used for system resolution can be
-precomputed and stored
-computed on the fly
-computed in parallel
Solution: solvers are defined using generic iterators, independently of the method used to compute the data.

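The container/iterator split can be sketched like this (names are illustrative, not LinBox's): the solver only sees a forward iterator over the scalars u^T A^i v; whether they are precomputed, streamed, or produced in parallel is the container's concern.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Sketch of a lazy container of the iterates u^T A^i v. The matrix is
// dense here only for brevity; a blackbox apply would work identically.
struct ProjectionSequence {
    std::vector<std::vector<long>> A;  // the matrix
    std::vector<long> u, v;            // projection vectors
    struct iterator {
        const ProjectionSequence* s;
        std::vector<long> w;           // current A^i v
        long operator*() const {       // u^T (A^i v), computed on demand
            long r = 0;
            for (std::size_t k = 0; k < w.size(); ++k) r += s->u[k] * w[k];
            return r;
        }
        iterator& operator++() {       // advance: w <- A w
            std::vector<long> nw(w.size(), 0);
            for (std::size_t i = 0; i < w.size(); ++i)
                for (std::size_t j = 0; j < w.size(); ++j)
                    nw[i] += s->A[i][j] * w[j];
            w = std::move(nw);
            return *this;
        }
    };
    iterator begin() const { return {this, v}; }
};
```

A Wiedemann-style solver written against this iterator never needs to change when the container is swapped for a precomputed or parallel one.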
Existing containers/iterators:
Scalar projections: (u^T A^i v)_{i=1..k} --> Wiedemann's algorithm
Block projections: (A V_i)_{i=1..k} --> block Wiedemann
Modular homomorphic imaging: (Algorithm(A mod p_i))_{i=1..k} --> Chinese Remainder Algorithm
No modification of the high-level algorithms is needed for parallelization.

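The recombination step behind modular homomorphic imaging is ordinary Chinese remaindering; a minimal sketch (the brute-force search stands in for the extended Euclidean algorithm a real implementation would use):

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Combine residues r_j = result mod p_j (p_j pairwise coprime) into the
// unique value in [0, p_1*...*p_k). Each r_j can be computed independently
// and in parallel; only this recombination is sequential.
int64_t crt(const std::vector<int64_t>& r, const std::vector<int64_t>& p) {
    int64_t x = r[0], m = p[0];
    for (std::size_t j = 1; j < r.size(); ++j) {
        // Find t with x + m*t == r[j] (mod p[j]). A real implementation
        // would compute m^{-1} mod p[j] via extended Euclid instead.
        int64_t target = ((r[j] % p[j]) + p[j]) % p[j];
        int64_t t = 0;
        while ((x + m * t) % p[j] != target) ++t;
        x += m * t;
        m *= p[j];
    }
    return x;
}
```

Running `Algorithm(A mod p_j)` on k machines and folding the residues through `crt` is exactly the large-grain parallelism listed in the conclusion.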
Parallel tools:
Until now, few parallelizations:
-Attempts with MPI and POSIX threads
-Higher-level systems: Athapascan-1, KAAPI
-Full design compatibility
-(missed)

Example: rank computations [Dumas & Urbanska]:
-Parallel block Wiedemann algorithm: [u_1,...,u_k]^T (GG^T)^i u_j, i=1..k
-Only rank(G)/k iterations
-Combined with a sigma-basis algorithm
Matrix GL7d17: n=1,548,650, m=955,128, rank=626,910
Time estimates: T_iter = 0.46875 min; T_seq = 621.8 days; T_par(50) = 12.4 days; T_par(50,ET) = 8.16 days

TURBO triangular elimination:
[Roch and Dumas 02]: recursive block algorithm for triangularization
-divides both rows and columns
-better memory management
-enables the use of recursive data structures
-5 recursive calls (U,V,C,D,Z), 2 of which are parallel (C,D)

Principle of Work Stealing:
[Arora, Blumofe, Plaxton 01], [Acar, Blelloch, Blumofe 02]
-2 algorithms to complete a task f: f_seq and f_par
-When a processor becomes idle, ExtractPar steals work from f_seq

Application to multiple triangular system solving:
TRSM: compute [[U_1, U_2], [0, U_3]]^(-1) [[B_1], [B_2]]:
  X_2 = TRSM(U_3, B_2);  B_1 = B_1 - U_2 X_2;  X_1 = TRSM(U_1, B_1)
f_seq: TRSM(U,B) --> T_1 = n^3, T_infinity = O(n)
f_par: compute V = U^(-1), then TRMM(V,B) --> T_1 = 4/3 n^3, T_infinity = O(log n)

When the sequential TRSM and the parallel inverse join: compute X_1 = U_1^(-1) B_1 in parallel (TRMM)
(Figure: top-down inverse going down, bottom-up TRSM coming up)

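The f_seq recursion above can be sketched as follows (a plain sequential version over doubles; a work-stealing runtime would run the independent column updates in parallel):

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

using Mat = std::vector<std::vector<double>>;  // row-major dense matrix

// Recursive TRSM: solve U*X = B in place (X overwrites B) for upper
// triangular U, restricted to rows [lo, hi). Splitting
// U = [[U1, U2], [0, U3]] gives the recursion from the notes:
//   X2 = TRSM(U3, B2);  B1 -= U2*X2;  X1 = TRSM(U1, B1).
void trsm(const Mat& U, Mat& B, int lo, int hi) {
    if (hi - lo == 1) {                      // base case: 1x1 pivot
        for (double& b : B[lo]) b /= U[lo][lo];
        return;
    }
    int mid = (lo + hi) / 2;
    trsm(U, B, mid, hi);                     // X2 = TRSM(U3, B2)
    for (int i = lo; i < mid; ++i)           // B1 -= U2 * X2
        for (int k = mid; k < hi; ++k)
            for (std::size_t j = 0; j < B[i].size(); ++j)
                B[i][j] -= U[i][k] * B[k][j];
    trsm(U, B, lo, mid);                     // X1 = TRSM(U1, B1)
}
```

The update loop over the columns of B1 is the part an idle processor can steal, since each column is independent.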
Multi-adic lifting: solving Ax = b over ZZ
Standard p-adic lifting [Dixon 82]:
  compute A^(-1) mod p
  r = b
  for i = 0..n do
    x_i = A^(-1) r mod p
    r = (r - A x_i)/p
  end for
  z = x_0 + p x_1 + ... + p^n x_n
  x = RatReconst(z)
end

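A numeric sketch of the standard lifting loop for a 2x2 system, in the easy case where the solution is a small non-negative integer vector so the final rational reconstruction can be skipped (all names are illustrative, not LinBox's):

```cpp
#include <cassert>
#include <vector>

typedef std::vector<long> Vec;

// x = Ainv * r mod p, for a 2x2 matrix; handles negative r entries.
static Vec applyMod(const long Ainv[2][2], const Vec& r, long p) {
    return { ((Ainv[0][0] * r[0] + Ainv[0][1] * r[1]) % p + p) % p,
             ((Ainv[1][0] * r[0] + Ainv[1][1] * r[1]) % p + p) % p };
}

// Dixon lifting for A x = b over ZZ, 2x2 case, assuming gcd(det A, p) = 1
// and a solution with entries in [0, p^steps).
Vec dixonSolve(const long A[2][2], Vec b, long p, int steps) {
    // A^{-1} mod p via the adjugate; det inverse found by brute force
    // (a real implementation uses the extended Euclidean algorithm).
    long det = (A[0][0] * A[1][1] - A[0][1] * A[1][0]) % p;
    if (det < 0) det += p;
    long detInv = 1;
    while (det * detInv % p != 1) ++detInv;
    long Ainv[2][2] = {
        {  A[1][1] % p * detInv % p, (p - A[0][1] % p) * detInv % p },
        { (p - A[1][0] % p) * detInv % p,  A[0][0] % p * detInv % p }};
    Vec r = b, z = {0, 0};
    long pk = 1;
    for (int i = 0; i < steps; ++i) {      // one p-adic digit per pass
        Vec x = applyMod(Ainv, r, p);      // x_i = A^{-1} r mod p
        z[0] += pk * x[0];                 // z += p^i x_i
        z[1] += pk * x[1];
        r = { (r[0] - (A[0][0] * x[0] + A[0][1] * x[1])) / p,  // exact
              (r[1] - (A[1][0] * x[0] + A[1][1] * x[1])) / p };
        pk *= p;
    }
    return z;  // equals the solution when its entries fit in [0, p^steps)
}
```

Each division by p is exact because A x_i = r (mod p) by construction; that invariant is what makes the lifting work.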
Multi-adic lifting:
for all j = 1..k do
  compute A^(-1) mod p_j
  r = b
  for i = 0..n/k do
    x_i = A^(-1) r mod p_j
    r = (r - A x_i)/p_j
  end for
  z_j = x_0 + p_j x_1 + ... + p_j^{n/k} x_{n/k}
end for
Z = ChineseRemainderAlg((z_j, p_j^{n/k})_{j=1..k})
X = RatReconst(Z)
end
The complexity of this algorithm is worse, but it can be made faster in practice (??).
-Used for sequential computation [Chen and Storjohann 05], to balance efficiency between BLAS levels 2 and 3 (?)

Conclusion:
Large-grain parallelism:
-Chinese remaindering
-Multi-adic lifting
-Block Wiedemann
Fine-grain adaptive parallelism:
-Work stealing
Perspectives:
-Development of simple parallel containers
-Parallel distribution of LinBox, based on KAAPI

LinBox does not use "greasing" techniques over finite fields.
Multiple organizations worry about standards, especially concerning matrix multiplication over small prime fields. There will be more talk about this later in the week.
The problem is that there are many different arithmetics to choose from.