Dev Days 1: Exact Linear Algebra

  • Michael Abshoff -> LinBox debianization
  • MartinAlbrecht -> M4RI GF(2)
  • Nick Alexander -> cyclotomic linalg
  • Gregory Bard -> M4RI
  • Rob Beezer
  • Tom Boothby
  • Craig Citro -> cyclotomic linalg, p-adic linalg
  • Burcin Erocal -> nullspace of polynomial matrices
  • Robert Miller -> sparse GF(2)
  • David Roe (p-adic linear algebra?) -> cyclotomic linear algebra
  • Arne Storjohann -> HNF, sparse GF(2) linalg
  • William Stein -> cyclotomic linear algebra, HNF (LLL)
  • Ralf-Philipp Weinmann -> sparse elimination, nullspace

Dense GF(2)

  • implement LQUP decomposition [ClementPernet, MartinAlbrecht]

    • implement LQUP routine [Clement]

    • implement _mzd_addmul_weird [Clement]

    • implement LQUP basecase routine based on M4RI mzd_lqup_m4ri [MartinAlbrecht] (see the bookkeeping sketch after this list)

      • don't zero out in _mzd_gauss_submatrix (!)
      • don't zero out below
      • remember row swaps => P
      • remember Q[r] = c (translate to LAPACK style later :-) )

    • implement TRSM routine [Clement]

    • implement in-place triangular matrix inversion mzd_trtri [Clement]

    • implement triangular-triangular matrix multiplication mzd_trtrm [Clement]

    • implement column swaps [MartinAlbrecht]

    • implement column rotations [MartinAlbrecht]

    • implement memory efficient mzd_addmul_strassen [Martin]

      • See the paper by Clément et al. on memory-efficient Strassen-Winograd
  • implement Arne's asymptotically fast elimination algorithm [MartinAlbrecht]

    • It was agreed that this is not what we want for M4RI; a toy implementation is available at http://www.informatik.uni-bremen.de/cgi-bin/cgiwrap/malb/blosxom.pl/2008/06/13#gauss-jordan-storjohann

  • implement multi-core multiplication with optimal speed-up (see the OpenMP sketch after this list)
    • OpenMP seems to be nice and easy
    • 2 cores are probably the main target, but think about 4 cores too
  • improve efficiency of M4RM (a Gray code table sketch follows after this list)
    • try 7 instead of 8 Gray code tables to leave room for the actual matrix in L1 [MartinAlbrecht]

      • It seems to be slower to use 7 tables rather than 8, so the L1 cache does not seem to be the main reason for the performance
    • try say 16 tables instead of 8 on the Core2Duo [MartinAlbrecht]

    • try Bill Hart's idea for L1 cache efficiency on the Core2Duo [MartinAlbrecht]

      • The idea: construct 8 Gray code tables, but once a table is in L1, use it a lot for reduction before moving on to the next one
    • try to fit three matrices rather than two into L2, or understand why it works so well for two [MartinAlbrecht]

      • it works since, once we have written the data, we can go on computing and let the cache handle the rest
      • a cache miss on a read, on the other hand, really does prevent us from computing
    • detect L1/L2 cache sizes at runtime and choose optimal parameters for them [MartinAlbrecht]

    • implement memory management such that the sys time on Opterons decreases
    • implement Bill's [http://groups.google.com/group/sage-devel/msg/6279228095b3d9f7 half table idea] and benchmark it
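
A minimal, self-contained sketch of the elimination bookkeeping described in the LQUP basecase notes above (remember the row swaps as P, record Q[r] = c, eliminate only below the pivot). This is not M4RI's mzd_lqup_m4ri: the function name, the one-word-per-row layout (so at most 64 columns) and all parameters are made up for illustration, and a real LQUP basecase would keep the eliminated entries below each pivot as the L factor rather than discarding them.

{{{
#include <stdint.h>
#include <stddef.h>

/* Gaussian elimination over GF(2) on rows packed into single 64-bit
 * words (ncols <= 64).  Returns the rank r; P[i] is the row swapped
 * into position i, and Q[i] = c is the pivot column of the i-th pivot,
 * as in the notes above. */
size_t gf2_elim_basecase(uint64_t *rows, size_t nrows, size_t ncols,
                         size_t *P, size_t *Q) {
  size_t r = 0;                              /* rank found so far */
  for (size_t c = 0; c < ncols && r < nrows; c++) {
    size_t piv = r;                          /* search for a pivot in column c */
    while (piv < nrows && !((rows[piv] >> c) & 1))
      piv++;
    if (piv == nrows)
      continue;                              /* no pivot in this column */
    uint64_t tmp = rows[r]; rows[r] = rows[piv]; rows[piv] = tmp;
    P[r] = piv;                              /* remember the row swap => P */
    Q[r] = c;                                /* remember Q[r] = c */
    for (size_t i = r + 1; i < nrows; i++)   /* eliminate below the pivot only */
      if ((rows[i] >> c) & 1)
        rows[i] ^= rows[r];
    r++;
  }
  return r;
}
}}}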
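
For the multi-core multiplication item, one possible shape of the OpenMP approach: let each thread own a contiguous block of rows of the product, so two (or four) cores split the work with no synchronisation beyond the implicit barrier at the end of the loop. This is only a sketch under the same toy one-word-per-row layout as above; gf2_mul_omp and its parameters are hypothetical names, not part of M4RI.

{{{
#include <stdint.h>

/* C = A * B over GF(2).  A is m x k with bit j of A[i] holding A[i][j]
 * (so k <= 64), B is k x (<= 64) with one word per row, C is m x (<= 64).
 * The parallel for splits the rows of C statically across the cores. */
void gf2_mul_omp(uint64_t *C, const uint64_t *A, const uint64_t *B,
                 long m, long k) {
  #pragma omp parallel for schedule(static)
  for (long i = 0; i < m; i++) {
    uint64_t acc = 0;
    for (long j = 0; j < k; j++)
      if ((A[i] >> j) & 1)      /* over GF(2), "multiply-add" is a row XOR */
        acc ^= B[j];
    C[i] = acc;
  }
}
}}}

Compiled with -fopenmp (GCC) this runs in parallel; without OpenMP the pragma is ignored and the same code runs serially.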
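
The Gray code table items above (7 vs. 8 tables, 16 tables, the half-table idea) all concern the core M4RM trick: precompute all 2^k GF(2)-combinations of a block of k rows of B, so each row of A consumes that block with a single table lookup instead of up to k row additions. Below is a stripped-down sketch of one such table, again on the toy one-word-per-row layout and with hypothetical names; M4RI's real kernel manages several tables at once (hence the tuning questions above) and walks them in Gray code order, but either construction costs one XOR per table entry.

{{{
#include <stdint.h>
#include <stddef.h>

/* Accumulate into C the product of the k-bit column block of A starting
 * at bit `off` with the k x (<= 64) block Bblock, via one 2^k-entry
 * table (k <= 8 here, matching the "8 tables" granularity above). */
void m4rm_block(uint64_t *C, const uint64_t *A, const uint64_t *Bblock,
                size_t m, unsigned k, unsigned off) {
  uint64_t T[256] = {0};                     /* T[x] = XOR of the Bblock rows selected by x */
  for (unsigned b = 0; b < k; b++) {         /* build the table, one XOR per entry */
    uint32_t bit = 1u << b;
    for (uint32_t x = 0; x < bit; x++)
      T[bit | x] = T[x] ^ Bblock[b];
  }
  uint64_t mask = (1ULL << k) - 1;
  for (size_t i = 0; i < m; i++)
    C[i] ^= T[(A[i] >> off) & mask];         /* one lookup replaces up to k row additions */
}
}}}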

Sparse GF(2) (and other small finite fields)

  • [Arne, Ralph, Clément, Rob Miller]
  • Sparse Reduced Echelon form (RPW)
    • Sparse Elimination:
      • improve LinBox gauss-domain

      • eclib sparse elimination
      • ....
  • ....

Hermite Normal Form

  • [Arne, Clément, William]
  • new algorithm, based on system solving (Arne Storjohann) -> an implementation already exists

    • integrate it in IML
    • benchmark it against Sage
  • generalization: block vector system solving
    • implementation
    • benchmark
  • LLL trick for the existing implementation in Sage (William & Clément)

Polynomial Matrix computations

  • [Burcin]
  • Nullspace computation (Burcin)

Cyclotomic linear algebra

  • [Rob Bradshaw, Craig, William]

P-adic linear algebra

  • [David Roe, Craig, David Cohet]

LinBox debianization and 1.1.6 spkgization

  • [mabshoff, Clément]
  • get rid of gmp++
  • generate and test linbox-1.1.6rc0.spkg
