Differences between revisions 1 and 7 (spanning 6 versions)

Dev Days 1: Exact Linear Algebra

GF(2)

implement LQUP decomposition [Clement, Martin]
- implement LQUP routine [Clement]
- implement TRSM routine [Clement]
- implement efficient column swaps/rotations [Martin]
  - SSE2 might help a lot here
- implement memory-efficient mzd_addmul_strassen [Martin]
  - See the paper by Clement et al. on memory-efficient Strassen-Winograd
implement Arne's asymptotically fast elimination algorithm [Martin]
implement multi-core multiplication with optimal speed-up
- OpenMP seems to be nice and easy
- 2 cores probably main target, but think about 4 cores too
improve efficiency of M4RM
- try 7 instead of 8 Gray code tables to leave room for the actual matrix in L1
- try to fit three matrices rather than two into L2 or understand why it works so well for two
- detect L1/L2 cache sizes at runtime and choose optimal parameters for them
- implement Bill's half table idea and benchmark it
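
The Gray code tables mentioned under M4RM can be illustrated with a minimal sketch. It is not the library's implementation: it assumes matrices of at most 64 columns with one packed `uint64_t` per row, and the names `gray_table`, `m4rm_mul_small` and `naive_mul_small` are hypothetical helpers introduced only for this example. The point it shows is that visiting the 2^k table entries in Gray code order makes each entry cost a single row XOR.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Fill table[x] with the XOR of rows[j] over all set bits j of x, 0 <= x < 2^k.
   Entries are visited in Gray code order, so each new entry is the previous
   one XORed with exactly one row. (Hypothetical helper, not a library call.) */
static void gray_table(const uint64_t *rows, int k, uint64_t *table)
{
    assert(k >= 1 && k <= 16);
    uint32_t prev = 0;
    table[0] = 0;
    for (uint32_t i = 1; i < (1u << k); i++) {
        uint32_t g = i ^ (i >> 1);      /* i-th Gray code word */
        uint32_t flipped = g ^ prev;    /* exactly one bit differs */
        int j = 0;
        while (!((flipped >> j) & 1u))
            j++;                        /* index of the flipped bit */
        table[g] = table[prev] ^ rows[j];
        prev = g;
    }
}

/* C = A * B over GF(2), assuming A has m rows and n <= 64 columns
   (bit j of a[i] is A[i][j]) and B has n rows of 64 packed bits each. */
static void m4rm_mul_small(const uint64_t *a, size_t m,
                           const uint64_t *b, int n, int k, uint64_t *c)
{
    assert(n >= 0 && n <= 64 && k >= 1 && k <= 8);
    uint64_t table[1 << 8];
    for (size_t i = 0; i < m; i++)
        c[i] = 0;
    for (int start = 0; start < n; start += k) {
        int kk = (n - start < k) ? n - start : k;   /* last block may be short */
        gray_table(b + start, kk, table);
        uint64_t mask = ((uint64_t)1 << kk) - 1;
        for (size_t i = 0; i < m; i++)
            c[i] ^= table[(a[i] >> start) & mask];  /* one lookup per k bits */
    }
}

/* Row-by-row reference product, used only to cross-check the sketch. */
static void naive_mul_small(const uint64_t *a, size_t m,
                            const uint64_t *b, int n, uint64_t *c)
{
    for (size_t i = 0; i < m; i++) {
        c[i] = 0;
        for (int j = 0; j < n; j++)
            if ((a[i] >> j) & 1u)
                c[i] ^= b[j];
    }
}
```

The sketch rebuilds one table per block of k rows; keeping several tables live at once (the 7 versus 8 question above) trades fewer passes over C against more table memory competing for L1.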

- Revision 1 as of 2008-05-20 04:27:25
  Size: 296
  Editor: was
  Comment:
+ Revision 7 as of 2008-06-13 04:11:12
  Size: 1301
  Editor: MartinAlbrecht
  Comment:
 Line 6:
-  * Grebory Bard
+  * Gregory Bard
 Line 11:
+  * Robert Miller (especially sparse GF(2))
-Line 15:
+Line 16:
+== GF(2) ==
 * implement LQUP decomposition [Clement, Martin]
   * implement LQUP routine [Clement]
   * implement TRSM routine [Clement]
   * implement efficient column swaps/rotations [Martin]
     * SSE2 might help a lot here
   * implement memory-efficient mzd_addmul_strassen [Martin]
     * See the paper by Clement et al. on memory-efficient Strassen-Winograd
 * implement Arne's asymptotically fast elimination algorithm [Martin]
 * implement multi-core multiplication with optimal speed-up
   * OpenMP seems to be nice and easy
   * 2 cores probably main target, but think about 4 cores too
 * improve efficiency of M4RM
   * try 7 instead of 8 Gray code tables to leave room for the actual matrix in L1
   * try to fit three matrices rather than two into L2 or understand why it works so well for two
   * detect L1/L2 cache sizes at runtime and choose optimal parameters for them
   * implement Bill's half table idea and benchmark it
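
For the item on detecting L1/L2 cache sizes at runtime, one possible starting point is sketched below. It assumes a POSIX system; the `_SC_LEVEL1_DCACHE_SIZE` and `_SC_LEVEL2_CACHE_SIZE` names are a glibc extension to `sysconf`, so the code falls back to a caller-supplied default wherever they are missing or report nothing. The helpers `cache_size_or` and `pick_table_bits`, and the rule of giving the tables half of the cache, are assumptions made for illustration, not measured optima.

```c
#include <assert.h>
#include <stddef.h>
#include <unistd.h>

/* Return the data cache size in bytes for level 1 or 2, or `fallback`
   if the platform does not expose it. (Hypothetical helper.) */
static size_t cache_size_or(int level, size_t fallback)
{
    long v = -1;
#if defined(_SC_LEVEL1_DCACHE_SIZE) && defined(_SC_LEVEL2_CACHE_SIZE)
    /* glibc extension; may legitimately return 0 or -1 on some systems */
    v = sysconf(level == 1 ? _SC_LEVEL1_DCACHE_SIZE : _SC_LEVEL2_CACHE_SIZE);
#else
    (void)level;
#endif
    return v > 0 ? (size_t)v : fallback;
}

/* Largest k in [1, 16] such that `ntables` tables of 2^k rows, each row
   `row_bytes` wide, fit into half of `cache_bytes`. The "half" budget is an
   assumed heuristic that leaves the rest for the matrix itself.
   Returns 1 if even the smallest table does not fit. */
static int pick_table_bits(size_t cache_bytes, size_t row_bytes, size_t ntables)
{
    assert(row_bytes > 0 && ntables > 0);
    size_t budget = cache_bytes / 2;
    int k = 1;
    while (k < 16 && ((size_t)1 << (k + 1)) * row_bytes * ntables <= budget)
        k++;
    return k;
}
```

A caller might combine the two as `pick_table_bits(cache_size_or(1, 32768), row_bytes, 8)`; any parameter chosen this way would still need to be benchmarked against the fixed values.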