Talk: Parallel Perspectives for the LinBox Library (Clement Pernet)

Exact Linear Algebra: a building block in exact computation.
-Topology: Smith form
-Graph theory: characteristic polynomial
-Representation theory: null space
-Cryptography: sparse system resolution
-etc.
The matrices involved can get very, very large.

Software libraries for exact computations:
-finite fields: NTL, GMP, Lidia, Pari, ...
-polynomials: NTL, ...
-integers: GMP, ...

Global solutions: Maple, Magma, Sage

Linear algebra? It falls somewhere in between: in the global solutions it is not always as efficient as it could be. This is where LinBox tries to intervene, linking the global solutions with the libraries.

LinBox is generic middleware:
Maple, GAP, Sage ---> LinBox ---> finite fields (NTL, Givaro, ...), BLAS (ATLAS, GOTO, ...), GMP

Joint NSF-NSERC-CNRS project:
-U Delaware, NC State
-U Waterloo, U Calgary
-Laboratoires, ... (missed)

Solutions: rank, det, minpoly, charpoly, system solve, positive definiteness
-over domains: finite fields, ZZ, QQ
-for matrices: sparse, structured, dense

A design for genericity:
Field/Ring interface:
-Shared interface with Givaro
-Wraps NTL, Lidia, Givaro implementations using archetypes or envelopes
-Proper implementations suited for dense computations (mainly rely on FLOP arithmetic)
Matrix interface

Structure of the library: Solutions (det, rank) - specify the method, domain -> Algorithms -> (missed)

Several levels of use:
-Web servers: http://www.linalg.org/
-Executables: $ charpoly MyMatrix 65521
-Call to a solution: NTL::ZZp F(65521); Toeplitz A(F); Polynomial P; charpoly(P, A);
-Calls to specific algorithms

Dense computations:
Building block: matrix multiplication over a word-size finite field.
Principle:
-Delayed modular reduction
-Floating-point arithmetic (fused mac, SSE2, ...)
-BLAS cache management
-Sub-cubic algorithm (Winograd)

Design of other dense routines:
-Reduction to matrix multiplication
-Bounds for delayed modular operations
-Block algorithms with multiple cascades

Char poly: Fact: there is an O(n^omega) Las Vegas probabilistic algorithm for the computation of the char poly over a field. This new algorithm is also practical: it virtually always beats LU-Krylov for n > 100.

Blackbox computations:
Goal: computation with very large sparse or structured matrices.
-No explicit representation of the matrix
-Only operation: application to a vector
-Efficient algorithms
-Efficient preconditioners: Toeplitz, Hankel, Butterfly, ...

Block projection algorithms:
-Wiedemann algorithm: scalar projections of A^i for i=1..2d
-Block Wiedemann: kxk dense projections of A^i for i=1..2d/k
-Balance between blackbox and dense applications

Data containers/iterators: distinction between computation and access to the data.
Example: the iterates (u^T A^i v)_{i=1..k} used for system resolution can be
-precomputed and stored
-computed on the fly
-computed in parallel
Solution: the solver is defined using generic iterators, independently from the method used to compute the data.
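A minimal sketch of this container/iterator idea in plain standard C++ (hypothetical names, not LinBox's actual classes or API): the consumer of the sequence is templated on the range type, so the same code would accept a stored, lazily computed, or parallel container of the terms u^T A^i v.

    #include <cstddef>
    #include <cstdint>
    #include <iostream>
    #include <vector>

    using Elt = std::uint64_t;           // elements of Z/pZ for a word-size prime p
    constexpr Elt P = 65521;

    // Dense matrix-vector product mod P (stand-in for a blackbox "apply").
    std::vector<Elt> apply(const std::vector<std::vector<Elt>>& A,
                           const std::vector<Elt>& v) {
        std::vector<Elt> w(A.size(), 0);
        for (std::size_t i = 0; i < A.size(); ++i)
            for (std::size_t j = 0; j < v.size(); ++j)
                w[i] = (w[i] + A[i][j] * v[j]) % P;
        return w;
    }

    // Dot product mod P.
    Elt dot(const std::vector<Elt>& u, const std::vector<Elt>& v) {
        Elt s = 0;
        for (std::size_t i = 0; i < u.size(); ++i) s = (s + u[i] * v[i]) % P;
        return s;
    }

    // Container #1: precompute and store the sequence u^T A^i v, i = 0..k-1.
    // A lazy or parallel container producing the same terms would be a drop-in
    // replacement as far as the consumer below is concerned.
    std::vector<Elt> stored_sequence(const std::vector<std::vector<Elt>>& A,
                                     std::vector<Elt> v, const std::vector<Elt>& u,
                                     int k) {
        std::vector<Elt> seq;
        for (int i = 0; i < k; ++i) { seq.push_back(dot(u, v)); v = apply(A, v); }
        return seq;
    }

    // The solver side: written once against any range of sequence terms.
    // Here it only prints them; a real consumer would run Berlekamp-Massey.
    template <typename Range>
    void consume(const Range& seq) {
        for (Elt s : seq) std::cout << s << ' ';
        std::cout << '\n';
    }

    int main() {
        std::vector<std::vector<Elt>> A = {{0, 1}, {2, 3}};
        std::vector<Elt> u = {1, 0}, v = {0, 1};
        consume(stored_sequence(A, v, u, 4));
    }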
Existing containers/iterators:
-Scalar projections: (v^T A^i u)_{i=1..k} --> Wiedemann's algorithm
-Block projections: (A V_i)_{i=1..k} --> block Wiedemann
-Modular homomorphic imaging: (Algorithm(A mod p_i))_{i=1..k} --> Chinese remainder algorithm
No modification of the high-level algorithms is needed for parallelization.

Parallel tools: until now, for parallelizations:
-Attempts with MPI and POSIX threads
-Higher-level systems: Athapascan-1, KAAPI
-Full design compatibility
-(missed)

Example: rank computations [Dumas & Urbanska]
-Parallel block Wiedemann algorithm: [u_1,...,u_k]^T (G G^T) u_i, i=1..k
-Only rank(G)/k iterations
-Combined with the sigma-basis algorithm
Matrix: GL7d17, n=1,548,650, m=955,128, rank=626,910
Time estimation:
-T_iter: 0.46875 min
-T_seq: 621.8 days
-T_par(50): 12.4 days
-T_par(50, ET): 8.16 days

TURBO triangular elimination [Roch and Dumas 02]: recursive block algorithm for triangularization
-Divides both rows and columns
-Better memory management
-Enables the use of recursive data structures
-5 recursive calls (U,V,C,D,Z), including 2 that are parallel (C,D)

Principle of work stealing [Arora, Blumofe, Plaxton 01], [Acar, Blelloch, Blumofe 02]:
-2 algorithms to complete a task f: f_seq and f_par
-When a processor becomes idle, ExtractPar steals the remaining work of f_seq

Application to multiple triangular system solving (TRSM): compute X = U^(-1) B with U = [[U_1, U_2], [0, U_3]] upper triangular and B = [[B_1], [B_2]]:
X_2 = TRSM(U_3, B_2); B_1 = B_1 - U_2 X_2; X_1 = TRSM(U_1, B_1)
-f_seq: TRSM(U, B) --> T_1 = n^3, T_infinity = O(n)
-f_par: compute V = U^(-1), then TRMM(V, B) --> T_1 = 4/3 n^3, T_infinity = O(log(n))
When the sequential TRSM and the parallel inverse join: compute X_1 = U_1^(-1) B_1 in parallel (TRMM).
(Diagram: top-down inverse going down, bottom-up TRSM coming up.)

Multi-adic lifting: solving Ax = b over ZZ.

Standard p-adic lifting [Dixon 82]:
  compute A^(-1) mod p
  r = b
  for i = 0..n do
    x_i = A^(-1) r mod p
    r = (r - A x_i)/p
  end for
  z = x_0 + p x_1 + ... + p^n x_n
  x = RatReconst(z)

Multi-adic lifting:
  for all j = 1..k do
    compute A^(-1) mod p_j
    r = b
    for i = 0..n/k do
      x_i = A^(-1) r mod p_j
      r = (r - A x_i)/p_j
    end for
    z_j = x_0 + p_j x_1 + ... + p_j^{n/k} x_{n/k}
  end for
  Z = ChineseRemainderAlg((z_j, p_j^{n/k})_{j=1..k})
  X = RatReconst(Z)

The complexity of this algorithm is worse, but it can be made faster in practice (?).
-Used for sequential computation [Chen and Storjohann 05], to balance efficiency between BLAS levels 2 and 3 (?)

Conclusion:
Large-grain parallelism:
-Chinese remaindering
-Multi-adic lifting
-Block Wiedemann
Fine-grain adaptive parallelism:
-Work stealing
Perspectives:
-Development of simple parallel containers
-Parallel distribution of LinBox, based on KAAPI

LinBox does not use "greasing" techniques over finite fields. Multiple organizations worry about standards, especially concerning matrix multiplication over small prime fields; there will be more talk about this later in the week. The problem is that there are many different arithmetics to choose from.
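As a closing illustration of the Chinese remaindering step that the modular-imaging container, the parallel rank example, and the multi-adic lifting above all rely on, here is a small self-contained C++ sketch (hypothetical names, not LinBox's ChineseRemainderAlg): each residue could come from an independent, possibly parallel, modular computation, and incremental CRT glues them back together.

    #include <cstddef>
    #include <iostream>
    #include <vector>

    // Modular inverse via the extended Euclidean algorithm; assumes gcd(a, m) == 1.
    long long inv_mod(long long a, long long m) {
        long long old_r = a % m, r = m, old_s = 1, s = 0;
        while (r != 0) {
            long long q = old_r / r;
            long long tmp = old_r - q * r; old_r = r; r = tmp;
            tmp = old_s - q * s; old_s = s; s = tmp;
        }
        return ((old_s % m) + m) % m;
    }

    // Incremental Chinese remaindering: combine residues r[i] modulo pairwise
    // coprime moduli m[i] into the unique Z modulo prod(m[i]).  The product of
    // the moduli must fit in a long long for this toy version.
    long long crt(const std::vector<long long>& r, const std::vector<long long>& m) {
        long long Z = r[0], M = m[0];
        for (std::size_t i = 1; i < r.size(); ++i) {
            long long t = ((r[i] - Z) % m[i] + m[i]) % m[i];
            t = (t * inv_mod(M % m[i], m[i])) % m[i];
            Z += M * t;
            M *= m[i];
        }
        return Z;
    }

    int main() {
        // Residues of 123456 modulo three primes, as if computed by three
        // independent (parallel) modular runs; crt() reconstructs the integer.
        std::vector<long long> residues = {34, 62, 85};
        std::vector<long long> moduli   = {101, 103, 107};
        std::cout << crt(residues, moduli) << '\n';   // prints 123456
    }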