MPIR - Parallel Algorithms and CUDA
Present: Carl Witty, Bill Hart, Michael Abshoff, Glenn Tarbox. Virtually present: Jeff Gilchrist, Gonzalo Tornaria.
You can chat in a Linux text console by installing "irssi", running "irssi -c irc.freenode.net" and then typing "/join #sage-devel".
Parallel algorithms:
- Multimodular algorithms
- Scalar algorithms
Peter Montgomery's remainder algorithm: to compute a mod b, precompute b1 = B mod b, b2 = B^2 mod b, b3 = B^3 mod b, ..., then write a = a0 + a1*B + a2*B^2 + ..., compute a0 + a1*b1 + a2*b2 + ... and do a final reduction mod b. The multiplications can be done in parallel (see the first sketch after this list).
- Addition and subtraction can be parallelised using nails - a non-unique representation of numbers (see the carry-save sketch after this list)
- The classical multiplication algorithm is embarrassingly parallel - but that doesn't help if an n log n algorithm already wins in that size range
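A minimal Python sketch of Montgomery's remainder idea above, assuming base B = 2^64; the function name and test values are invented for illustration, and a real implementation would precompute the powers once per modulus:

{{{#!python
# Sketch of Peter Montgomery's parallel remainder algorithm.
B = 2 ** 64

def parallel_mod(a, b):
    # Split a into base-B digits a0, a1, a2, ...
    digits = []
    while a:
        digits.append(a % B)
        a //= B
    digits = digits or [0]
    # Precompute b_i = B^i mod b (in practice, once per modulus b).
    powers = [pow(B, i, b) for i in range(len(digits))]
    # Each product a_i * b_i is independent, so these multiplications
    # can all run in parallel.
    partial = sum(d * p for d, p in zip(digits, powers))
    return partial % b  # final reduction mod b

a, b = 123456789 ** 12, 10 ** 9 + 7
assert parallel_mod(a, b) == a % b
}}}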
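And a sketch of the nails point: if each limb keeps a few spare high bits (nails), repeated additions can proceed limb-wise with no carry propagation between limbs, at the cost of a non-unique representation. The 56-bit data width and helper names here are invented for illustration and are not MPIR's actual layout:

{{{#!python
# Sketch of nail-based parallel addition (illustrative parameters).
# Each 64-bit limb stores only 56 data bits; the top 8 nail bits absorb
# carries, so up to 2^8 limb-wise additions need no carry propagation.
DATA_BITS = 56
LIMB_MASK = (1 << DATA_BITS) - 1

def to_limbs(n, count):
    return [(n >> (DATA_BITS * i)) & LIMB_MASK for i in range(count)]

def from_limbs(limbs):
    # Normalisation: carries propagate here, and only here.
    return sum(l << (DATA_BITS * i) for i, l in enumerate(limbs))

def add_limbwise(x, y):
    # No carries between limbs - every limb addition is independent.
    return [a + b for a, b in zip(x, y)]

x, y = to_limbs(2 ** 200 - 1, 5), to_limbs(3 ** 120, 5)
assert from_limbs(add_limbwise(x, y)) == (2 ** 200 - 1) + 3 ** 120
}}}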
Glenn Tarbox (owner of cuda1, an AMD K10 with an NVIDIA CUDA card - expert on large-scale parallelisation)
- What are the top-level integration issues, e.g. for libraries using MPIR?
Michael Abshoff (Sage release manager)
- Link into Sage via Cython and link in CUDA
CUDA documentation:
CUDA issues:
- Memory bandwidth limits which algorithms benefit: a matrix has n^2 entries to move in and out against O(n^2.7) operations for matrix multiplication, whereas an integer has n limbs to move in and out against only O(n log n log log n) operations to multiply (see the comparison below)
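A rough back-of-the-envelope comparison of the two ratios, using the exponent 2.7 from the note above and ignoring all constants; the point is only that operations per word moved grows like n^0.7 for matrix multiplication but only like log n log log n for integer multiplication:

{{{#!python
import math

# Ops-per-word ratios implied by the note above (constants ignored):
# matrix multiplication: O(n**2.7) work over n**2 entries moved,
# integer multiplication: O(n log n log log n) work over n limbs moved.
for n in (10 ** 3, 10 ** 6, 10 ** 9):
    matmul_ops_per_entry = n ** 2.7 / n ** 2
    intmul_ops_per_limb = math.log(n) * math.log(math.log(n))
    print(f"n = {n:>10}: matmul ~ {matmul_ops_per_entry:9.0f} ops/entry, "
          f"intmul ~ {intmul_ops_per_limb:5.1f} ops/limb")
}}}

So the GPU's bandwidth cost is amortised ever better for matrix multiplication as n grows, but hardly at all for integer multiplication.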
Other Options:
- AMD's math library (AML) provides a BLAS interface and uses the GPU - but that's for linear algebra
- PTX: NVIDIA's GPU assembly language, for hand-coding inner loops
Gonzalo Tornaria (theta functions expert)
- Is there a way to encode integer multiplication in linear algebra? (A. Perhaps as vectors - multimodular - but not as matrices; see the sketch after this list)
- Kernel launches: threads are launched, with the issues driven by the hierarchy of memory - registers -> per-block memory -> main graphics memory -> system memory
- All the threads on all the processors can be launched in a couple of cycles
- How would a GPU compare to a carefully programmed FPGA?
- E.g. a Stratix IV can have around 1000 18x18 multipliers, but maybe that's not that many, and this is probably very expensive hardware
http://www.altera.com/products/devices/stratix-fpgas/stratix-iv/stxiv-index.jsp
- Carl Witty does FPGA programming - says it is probably very expensive
- According to the spec the Stratix can do parallel high-bandwidth communication
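A minimal sketch of the multimodular answer to Gonzalo's question above: reduce both operands modulo several small primes, multiply the residue vectors pointwise (every slot is independent, hence GPU-friendly), and recover the product by CRT. The prime list and function name are illustrative:

{{{#!python
# Multimodular integer multiplication sketch (illustrative primes).
# The residue-wise products form a pointwise vector multiply - the part
# that maps naturally onto GPU-style parallelism - followed by a CRT
# reconstruction. Needs Python 3.8+ for math.prod and pow(x, -1, p).
from math import prod

PRIMES = [2 ** 31 - 1, 2 ** 31 - 19, 2 ** 31 - 61, 2 ** 31 - 69, 2 ** 31 - 85]

def multimodular_mul(a, b):
    M = prod(PRIMES)
    assert a * b < M, "need enough primes to cover the product"
    # Pointwise residue multiply: every slot is independent.
    residues = [(a % p) * (b % p) % p for p in PRIMES]
    # Chinese Remainder Theorem reconstruction.
    result = 0
    for p, r in zip(PRIMES, residues):
        Mi = M // p
        result += r * Mi * pow(Mi, -1, p)
    return result % M

a, b = 12345678901234567, 98765432109876543
assert multimodular_mul(a, b) == a * b
}}}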
Jeff Gilchrist
- What about ATI hardware - why not support OpenCL?
- Carl Witty says -
Bill Hart
- The Cell port will happen, as it is funded by an EPSRC grant - it will be proof-of-principle code used to apply for a port to Cell2Xi
Glenn Tarbox
- Flame - for overall integration of libraries