MPIR - Parallel Algorithms and CUDA

Present: Carl Witty, Bill Hart, Michael Abshoff, Glenn Tarbox. Virtually present: Jeff Gilchrist, Gonzalo Tornaria.

You can chat in a Linux text console by installing "irssi", running "irssi -c irc.freenode.net", and then typing "/join #sage-devel".

Parallel algorithms:

  • Multimodular algorithms
  • Scalar algorithms
  • Peter Montgomery's remainder algorithm for a mod b: precompute b1 = B mod b, b2 = B^2 mod b, b3 = B^3 mod b, then write a = a0 + a1*B + a2*B^2 + ..., compute a0 + a1*b1 + a2*b2 + ... and do a final reduction mod b. The multiplications can be done in parallel (see the remainder sketch after this list).

  • Addition and subtraction can be parallelised using nails, i.e. a non-unique (redundant) representation of numbers (see the nail-addition sketch after this list)
  • The classical multiplication algorithm is embarrassingly parallel - but it is a poor choice in the range where an n log n algorithm applies
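
A minimal C sketch of the remainder trick above (not MPIR code; the function name, the single-limb modulus, and the b < 2^32 restriction are assumptions made for brevity): precompute B^i mod b, multiply each limb by the matching power, sum the products, and reduce once at the end. The per-limb multiplications are independent of each other, which is where the parallelism would come from.

{{{
#include <stdint.h>
#include <stddef.h>

/* a is given as n limbs in base B = 2^64, least significant limb first.
 * Assumes b < 2^32 so the 128-bit accumulator cannot overflow
 * (uses the GCC/Clang __uint128_t extension). */
uint64_t remainder_by_precomputed_powers(const uint64_t *a, size_t n, uint32_t b)
{
    uint64_t B_mod_b = (uint64_t)((((__uint128_t)1) << 64) % b);  /* b1 = B mod b */
    uint64_t pow = 1;                  /* B^0 mod b */
    __uint128_t acc = 0;

    for (size_t i = 0; i < n; i++) {
        /* term a_i * (B^i mod b); each term is independent, so these
         * multiplications could be farmed out in parallel */
        acc += (__uint128_t)(a[i] % b) * pow;
        pow = (uint64_t)(((__uint128_t)pow * B_mod_b) % b);
    }
    return (uint64_t)(acc % b);        /* single final reduction mod b */
}
}}}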
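
And a sketch of nail-style addition under illustrative assumptions (the 8-bit nail width and the function names are invented here, not how MPIR actually implements nails): each 64-bit word keeps 8 spare "nail" bits, so limbs can be added independently, and hence in parallel, without carry propagation; carries are pushed through later in a separate normalisation pass.

{{{
#include <stdint.h>
#include <stddef.h>

#define NAIL_BITS 8
#define LIMB_BITS (64 - NAIL_BITS)
#define LIMB_MASK ((UINT64_C(1) << LIMB_BITS) - 1)

/* Limb-wise addition: no carry crosses a limb boundary, so every iteration
 * is independent.  Safe for up to about 2^NAIL_BITS additions of canonical
 * operands before normalisation is required. */
void nail_add(uint64_t *r, const uint64_t *x, const uint64_t *y, size_t n)
{
    for (size_t i = 0; i < n; i++)
        r[i] = x[i] + y[i];
}

/* Push the accumulated carries upwards, restoring the canonical
 * (unique, fully reduced) representation. */
void nail_normalise(uint64_t *r, size_t n)
{
    uint64_t carry = 0;
    for (size_t i = 0; i < n; i++) {
        uint64_t t = r[i] + carry;
        r[i]  = t & LIMB_MASK;
        carry = t >> LIMB_BITS;
    }
    /* any leftover carry would extend the number by one limb */
}
}}}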

Glenn Tarbox (owner of cuda1, an AMD K10 with an NVIDIA CUDA card - expert on large-scale parallelisation)

  • What are the top-level integration issues, e.g. for libraries using MPIR?

Michael Abshoff (Sage release manager)

  • Link into Sage via Cython and link in CUDA

CUDA documentation:

CUDA issues:

  • Memory bandwidth limits which algorithms pay off: an n x n matrix has n^2 entries to move in and out against O(n^2.7) operations for matrix multiplication, whereas an n-limb integer has only n limbs to move in and out against O(n log n log log n) operations for multiplication (see the comparison below)
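
Putting rough numbers on the bullet above - operations available per word of data moved, constants ignored:

  matrix multiplication:  O(n^2.7) / O(n^2) = O(n^0.7) operations per entry moved
  integer multiplication: O(n log n log log n) / O(n) = O(log n log log n) operations per limb moved

So the arithmetic intensity of matrix multiplication grows polynomially with the problem size, while for FFT-based integer multiplication it grows only logarithmically, which is why memory bandwidth becomes the binding constraint much sooner.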

Other Options:

  • The AMD math library AML provides a BLAS interface that uses the GPU - but that is for linear algebra
  • PTX, NVIDIA's GPU assembler, for inner-loop code

Gonzalo Tornaria (theta functions expert)

  • Is there a way to encode integer multiplication in linear algebra? (Answer: perhaps as vectors - multimodular - but not as matrices)
  • Kernel
  • Launch threads - issues arise from the memory hierarchy: registers -> per-block memory -> main graphics memory -> system memory (see the CUDA sketch after this list)

  • Can launch all the threads on all processors in a couple of cycles
  • How would a GPU compare to a carefully programmed FPGA?
  • E.g. a Stratix IV can have around 1000 18x18 multipliers, but maybe that is not very many, and it is probably very expensive hardware
  • http://www.altera.com/products/devices/stratix-fpgas/stratix-iv/stxiv-index.jsp

  • Carl Witty does FPGA programming - says it is probably very expensive
  • According to the spec the Stratix can have parallel high-bandwidth communication
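
A minimal CUDA sketch of the launch-and-memory-hierarchy point above (the kernel and variable names, the block size, and the trivial per-limb operation are illustrative only, not proposed MPIR code): data is copied from system memory to the card's global memory, each block stages its slice in fast per-block shared memory, and the threads then work out of registers.

{{{
#include <cuda_runtime.h>
#include <stdio.h>
#include <stdlib.h>

__global__ void scale_limbs(const unsigned long long *in,
                            unsigned long long *out, int n)
{
    __shared__ unsigned long long tile[256];   /* per-block (shared) memory */
    int i = blockIdx.x * blockDim.x + threadIdx.x;

    tile[threadIdx.x] = (i < n) ? in[i] : 0;   /* global memory -> shared */
    __syncthreads();
    if (i < n)
        out[i] = tile[threadIdx.x] * 3ULL;     /* the work happens in registers */
}

int main(void)
{
    const int n = 1 << 20;
    size_t bytes = n * sizeof(unsigned long long);
    unsigned long long *h = (unsigned long long *)malloc(bytes);
    for (int i = 0; i < n; i++) h[i] = i;

    unsigned long long *d_in, *d_out;
    cudaMalloc((void **)&d_in, bytes);         /* main graphics (global) memory */
    cudaMalloc((void **)&d_out, bytes);
    cudaMemcpy(d_in, h, bytes, cudaMemcpyHostToDevice);   /* system -> GPU */

    /* one launch starts n threads, grouped into blocks of 256 */
    scale_limbs<<<(n + 255) / 256, 256>>>(d_in, d_out, n);
    cudaMemcpy(h, d_out, bytes, cudaMemcpyDeviceToHost);

    printf("out[5] = %llu\n", h[5]);
    cudaFree(d_in); cudaFree(d_out); free(h);
    return 0;
}
}}}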

Jeff Gilchrist

  • What about ATI hardware - why not support OpenCL?
  • Carl Witty says -

Bill Hart

  • A Cell port will happen as it is funded by an EPSRC grant - it will be proof-of-principle code to apply for a port to Cell2Xi

Glenn Tarbox

  • Flame - for overall integration of libraries
