Attachment: notes_pernet.txt
Talk: Parallel Perspectives for the LinBox Library
Clément Pernet

Exact Linear Algebra:
Building block in exact computation. Topology: Smith form. Graph theory: characteristic polynomial. Representation theory: null space. Cryptography: sparse system resolution.
The matrices involved can get very, very large.

Software Libraries for Exact Computations:
finite fields: NTL, GMP, Lidia, Pari, ...
polynomials: NTL, ...
integers: GMP, ...

Global Solutions:
Maple
Magma
Sage

Linear algebra? It falls somewhere in between: in the global solutions it is not always as efficient as it could be. This is where LinBox tries to intervene, linking the global solutions with the specialized libraries.

LinBox is generic middleware.

Maple, GAP, Sage ---> LinBox ---> finite fields (NTL, Givaro, ...), BLAS (ATLAS, GOTO, ...), GMP

Joint NSF-NSERC-CNRS project:
-U Delaware, NC State
-U Waterloo, U Calgary
-Laboratoires, ... (missed)

Solutions: rank, det, minpoly, charpoly, system solving, positive definiteness; over the domains: finite fields, ZZ, QQ; for matrices: sparse, structured, dense.

A design for genericity:
Field/Ring interface:
-Shared interface with Givaro
-Wraps the NTL, Lidia, Givaro implementations using archetypes or envelopes
-Proper implementations suited for dense computations (mainly relying on FLOP arithmetic)
Matrix interface

Structure of the Library:
Solutions (det, rank) - specify the method, domain -> Algorithms ... (missed)

Several levels of use:
Web servers: http://www.linalg.org/
Executables: $ charpoly MyMatrix 65521
Call to a solution:
  NTL::ZZ_p F(65521);
  Toeplitz<NTL::ZZ_p> A(F);
  Polynomial<NTL::ZZ_p> P;
  charpoly(P, A);
Calls to specific algorithms

Dense computations:
Building block: matrix multiplication over a word-size finite field.
Principle:
-Delayed modular reduction
-Floating-point arithmetic (fused mac, SSE2, ...)
-BLAS cache management
-Sub-cubic algorithm (Winograd)
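The delayed-reduction idea can be sketched as follows (a toy illustration, not LinBox's actual FFLAS code; `fgemm_delayed` is a made-up name): field elements are stored in doubles, each dot product is accumulated entirely in floating point, and the reduction mod p happens once at the end. This is exact as long as n*(p-1)^2 stays below 2^53, the double mantissa limit.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Sketch: multiply n x n matrices over Z/pZ using double arithmetic with a
// single delayed reduction at the end. Valid while n*(p-1)^2 < 2^53, so
// every dot product fits exactly in a double before the final '% p'.
std::vector<double> fgemm_delayed(const std::vector<double>& A,
                                  const std::vector<double>& B,
                                  int n, int64_t p) {
    std::vector<double> C(n * n, 0.0);
    for (int i = 0; i < n; ++i)
        for (int k = 0; k < n; ++k) {
            double aik = A[i * n + k];
            for (int j = 0; j < n; ++j)
                C[i * n + j] += aik * B[k * n + j];  // no reduction here
        }
    for (double& c : C)                              // delayed reduction
        c = static_cast<double>(static_cast<int64_t>(c) % p);
    return C;
}
```

The inner loops are exactly a BLAS-style dgemm, which is why the real implementation can delegate them to ATLAS or GOTO and inherit their cache management.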

Design of other dense routines:
-Reduction to matrix multiplication
-Bounds for delayed modular operations
-Block algorithm with multiple cascade

Char Poly:
Fact: there is an O(n^omega) Las Vegas probabilistic algorithm for computing the char poly over a field.
This new algorithm is also practical: it virtually always beats LU-Krylov for n > 100.

BlackBox Computations:
Goal: computation with very large sparse or structured matrices.
-No explicit representation of the matrix
-Only operation: application to a vector
-Efficient algorithms
-Efficient preconditioners: Toeplitz, Hankel, Butterfly, ...
...

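The "only operation is application to a vector" constraint can be made concrete with a minimal sketch (the names `Blackbox` and `SparseBlackbox` are illustrative; LinBox's real blackbox archetype is richer):

```cpp
#include <cassert>
#include <utility>
#include <vector>

// Hypothetical minimal black-box interface: a matrix is known only
// through the operation y = A*x.
struct Blackbox {
    virtual std::vector<long> apply(const std::vector<long>& x) const = 0;
    virtual ~Blackbox() = default;
};

// Sparse matrix in coordinate form; the dense n x n array is never built.
struct SparseBlackbox : Blackbox {
    struct Entry { int i, j; long v; };
    std::vector<Entry> entries;
    int n;
    SparseBlackbox(int n, std::vector<Entry> e)
        : entries(std::move(e)), n(n) {}
    std::vector<long> apply(const std::vector<long>& x) const override {
        std::vector<long> y(n, 0);
        for (const Entry& e : entries)
            y[e.i] += e.v * x[e.j];   // cost is O(nnz), not O(n^2)
        return y;
    }
};
```

An algorithm written against `Blackbox` works unchanged for sparse, Toeplitz, or butterfly-preconditioned matrices, which is the point of the abstraction.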
Block Projection Algorithms:
-Wiedemann algorithm: scalar projections of A^i for i=1..2d
-Block Wiedemann: kxk dense projections of A^i for i=1..2d/k
-Balance between blackbox and dense applications

Data Containers/Iterators:
Distinction between the computation and the access to the data.
Example: the iterates (u^T A^i v)_{i=1..k} used for system resolution can be
-precomputed and stored
-computed on the fly
-computed in parallel
Solution: solvers are defined using generic iterators, independently of the method used to compute the data.

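The container/iterator split can be sketched like this (names are illustrative, not LinBox's): the solver only sees a forward iterator over the scalars u^T A^i v; whether they are precomputed, streamed, or produced in parallel is the container's concern.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Sketch of a lazy container of the iterates u^T A^i v. The matrix is
// dense here only for brevity; a blackbox apply would work identically.
struct ProjectionSequence {
    std::vector<std::vector<long>> A;  // the matrix
    std::vector<long> u, v;            // projection vectors
    struct iterator {
        const ProjectionSequence* s;
        std::vector<long> w;           // current A^i v
        long operator*() const {       // u^T (A^i v), computed on demand
            long r = 0;
            for (std::size_t k = 0; k < w.size(); ++k) r += s->u[k] * w[k];
            return r;
        }
        iterator& operator++() {       // advance: w <- A w
            std::vector<long> nw(w.size(), 0);
            for (std::size_t i = 0; i < w.size(); ++i)
                for (std::size_t j = 0; j < w.size(); ++j)
                    nw[i] += s->A[i][j] * w[j];
            w = std::move(nw);
            return *this;
        }
    };
    iterator begin() const { return {this, v}; }
};
```

A Wiedemann-style solver written against this iterator never needs to change when the container is swapped for a precomputed or parallel one.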
Existing containers/iterators:
Scalar projections: (u^T A^i v)_{i=1..k} --> Wiedemann's algorithm
Block projections: (A V_i)_{i=1..k} --> block Wiedemann
Modular homomorphic imaging: (Algorithm(A mod p_i))_{i=1..k} --> Chinese Remainder Algorithm
No modification of the high-level algorithms is needed for parallelization.

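The recombination step behind modular homomorphic imaging is ordinary Chinese remaindering; a minimal sketch (the brute-force search stands in for the extended Euclidean algorithm a real implementation would use):

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Combine residues r_j = result mod p_j (p_j pairwise coprime) into the
// unique value in [0, p_1*...*p_k). Each r_j can be computed independently
// and in parallel; only this recombination is sequential.
int64_t crt(const std::vector<int64_t>& r, const std::vector<int64_t>& p) {
    int64_t x = r[0], m = p[0];
    for (std::size_t j = 1; j < r.size(); ++j) {
        // Find t with x + m*t == r[j] (mod p[j]). A real implementation
        // would compute m^{-1} mod p[j] via extended Euclid instead.
        int64_t target = ((r[j] % p[j]) + p[j]) % p[j];
        int64_t t = 0;
        while ((x + m * t) % p[j] != target) ++t;
        x += m * t;
        m *= p[j];
    }
    return x;
}
```

Running `Algorithm(A mod p_j)` on k machines and folding the residues through `crt` is exactly the large-grain parallelism listed in the conclusion.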
Parallel tools:
Until now, few parallelizations:
-Attempts with MPI and POSIX threads
-Higher-level systems: Athapascan-1, KAAPI
-Full design compatibility
-(missed)

Example: rank computations [Dumas & Urbanska]:
-Parallel block Wiedemann algorithm: [u_1,...,u_k]^T (GG^T)^i u_j, i=1..k
-Only rank(G)/k iterations
-Combined with a sigma-basis algorithm
Matrix GL7d17: n=1,548,650, m=955,128, rank=626,910
Time estimates: T_iter = 0.46875 min; T_seq = 621.8 days; T_par(50) = 12.4 days; T_par(50,ET) = 8.16 days

TURBO triangular elimination:
[Roch and Dumas 02]: recursive block algorithm for triangularization
-divides both rows and columns
-better memory management
-enables the use of recursive data structures
-5 recursive calls (U,V,C,D,Z), 2 of which are parallel (C,D)

Principle of Work Stealing:
[Arora, Blumofe, Plaxton 01], [Acar, Blelloch, Blumofe 02]
-2 algorithms to complete a task f: f_seq and f_par
-When a processor becomes idle, ExtractPar steals work from f_seq

Application to multiple triangular system solving:
TRSM: compute [[U_1, U_2], [0, U_3]]^(-1) [[B_1], [B_2]]:
  X_2 = TRSM(U_3, B_2);  B_1 = B_1 - U_2 X_2;  X_1 = TRSM(U_1, B_1)
f_seq: TRSM(U,B) --> T_1 = n^3, T_infinity = O(n)
f_par: compute V = U^(-1), then TRMM(V,B) --> T_1 = 4/3 n^3, T_infinity = O(log n)

When the sequential TRSM and the parallel inverse join: compute X_1 = U_1^(-1) B_1 in parallel (TRMM)
(Figure: top-down inverse going down, bottom-up TRSM coming up)

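The f_seq recursion above can be sketched as follows (a plain sequential version over doubles; a work-stealing runtime would run the independent column updates in parallel):

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

using Mat = std::vector<std::vector<double>>;  // row-major dense matrix

// Recursive TRSM: solve U*X = B in place (X overwrites B) for upper
// triangular U, restricted to rows [lo, hi). Splitting
// U = [[U1, U2], [0, U3]] gives the recursion from the notes:
//   X2 = TRSM(U3, B2);  B1 -= U2*X2;  X1 = TRSM(U1, B1).
void trsm(const Mat& U, Mat& B, int lo, int hi) {
    if (hi - lo == 1) {                      // base case: 1x1 pivot
        for (double& b : B[lo]) b /= U[lo][lo];
        return;
    }
    int mid = (lo + hi) / 2;
    trsm(U, B, mid, hi);                     // X2 = TRSM(U3, B2)
    for (int i = lo; i < mid; ++i)           // B1 -= U2 * X2
        for (int k = mid; k < hi; ++k)
            for (std::size_t j = 0; j < B[i].size(); ++j)
                B[i][j] -= U[i][k] * B[k][j];
    trsm(U, B, lo, mid);                     // X1 = TRSM(U1, B1)
}
```

The update loop over the columns of B1 is the part an idle processor can steal, since each column is independent.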
Multi-adic lifting: solving Ax = b over ZZ
Standard p-adic lifting [Dixon 82]:
  compute A^(-1) mod p
  r = b
  for i = 0..n do
    x_i = A^(-1) r mod p
    r = (r - A x_i)/p
  end for
  z = x_0 + p x_1 + ... + p^n x_n
  x = RatReconst(z)
end

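A numeric sketch of the standard lifting loop for a 2x2 system, in the easy case where the solution is a small non-negative integer vector so the final rational reconstruction can be skipped (all names are illustrative, not LinBox's):

```cpp
#include <cassert>
#include <vector>

typedef std::vector<long> Vec;

// x = Ainv * r mod p, for a 2x2 matrix; handles negative r entries.
static Vec applyMod(const long Ainv[2][2], const Vec& r, long p) {
    return { ((Ainv[0][0] * r[0] + Ainv[0][1] * r[1]) % p + p) % p,
             ((Ainv[1][0] * r[0] + Ainv[1][1] * r[1]) % p + p) % p };
}

// Dixon lifting for A x = b over ZZ, 2x2 case, assuming gcd(det A, p) = 1
// and a solution with entries in [0, p^steps).
Vec dixonSolve(const long A[2][2], Vec b, long p, int steps) {
    // A^{-1} mod p via the adjugate; det inverse found by brute force
    // (a real implementation uses the extended Euclidean algorithm).
    long det = (A[0][0] * A[1][1] - A[0][1] * A[1][0]) % p;
    if (det < 0) det += p;
    long detInv = 1;
    while (det * detInv % p != 1) ++detInv;
    long Ainv[2][2] = {
        {  A[1][1] % p * detInv % p, (p - A[0][1] % p) * detInv % p },
        { (p - A[1][0] % p) * detInv % p,  A[0][0] % p * detInv % p }};
    Vec r = b, z = {0, 0};
    long pk = 1;
    for (int i = 0; i < steps; ++i) {      // one p-adic digit per pass
        Vec x = applyMod(Ainv, r, p);      // x_i = A^{-1} r mod p
        z[0] += pk * x[0];                 // z += p^i x_i
        z[1] += pk * x[1];
        r = { (r[0] - (A[0][0] * x[0] + A[0][1] * x[1])) / p,  // exact
              (r[1] - (A[1][0] * x[0] + A[1][1] * x[1])) / p };
        pk *= p;
    }
    return z;  // equals the solution when its entries fit in [0, p^steps)
}
```

Each division by p is exact because A x_i = r (mod p) by construction; that invariant is what makes the lifting work.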
Multi-adic lifting:
for all j = 1..k do
  compute A^(-1) mod p_j
  r = b
  for i = 0..n/k do
    x_i = A^(-1) r mod p_j
    r = (r - A x_i)/p_j
  end for
  z_j = x_0 + p_j x_1 + ... + p_j^{n/k} x_{n/k}
end for
Z = ChineseRemainderAlg((z_j, p_j^{n/k})_{j=1..k})
X = RatReconst(Z)
end
The complexity of this algorithm is worse, but it can be made faster in practice (??).
-Used for sequential computation [Chen and Storjohann 05], to balance efficiency between BLAS levels 2 and 3 (?)

Conclusion:
Large-grain parallelism:
-Chinese remaindering
-Multi-adic lifting
-Block Wiedemann
Fine-grain adaptive parallelism:
-Work stealing
Perspectives:
-Development of simple parallel containers
-Parallel distribution of LinBox, based on KAAPI

LinBox does not use "greasing" techniques over finite fields.
Multiple organizations worry about standards, especially concerning matrix multiplication over small prime fields. There will be more talk about this later in the week.
The problem is that there are many different arithmetics to choose from.