MPIR - Parallel Algorithms and CUDA
Present: Carl Witty, Bill Hart, Michael Abshoff, Glenn Tarbox. Virtually present: Jeff Gilchrist, Gonzalo Tornaria.
You can chat in a Linux text console by installing "irssi", running "irssi -c irc.freenode.net" and then typing "/join #sage-devel".
Parallel algorithms:
- Multimodular algorithms
- Scalar algorithms
Peter Montgomery's remainder algorithm: to compute a mod b, precompute b1 = B mod b, b2 = B^2 mod b, b3 = B^3 mod b, ..., then write a = a0 + a1*B + a2*B^2 + ..., compute a0 + a1*b1 + a2*b2 + ... and do a final reduction mod b. The multiplications can be done in parallel (see the first sketch after this list).
- Addition and subtraction can be parallelised using nails - a non-unique representation of numbers (see the carry-save sketch after this list)
- The classical multiplication algorithm is embarrassingly parallel - but that doesn't help if an n log n algorithm already wins in that size range
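A minimal Python sketch of Montgomery's remainder idea above, assuming base B = 2^64; the function name and test values are invented for illustration, and a real implementation would precompute the powers once per modulus:

{{{#!python
# Sketch of Peter Montgomery's parallel remainder algorithm.
B = 2 ** 64

def parallel_mod(a, b):
    # Split a into base-B digits a0, a1, a2, ...
    digits = []
    while a:
        digits.append(a % B)
        a //= B
    digits = digits or [0]
    # Precompute b_i = B^i mod b (in practice, once per modulus b).
    powers = [pow(B, i, b) for i in range(len(digits))]
    # Each product a_i * b_i is independent, so these multiplications
    # can all run in parallel.
    partial = sum(d * p for d, p in zip(digits, powers))
    return partial % b  # final reduction mod b

a, b = 123456789 ** 12, 10 ** 9 + 7
assert parallel_mod(a, b) == a % b
}}}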
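And a sketch of the nails point: if each limb keeps a few spare high bits (nails), repeated additions can proceed limb-wise with no carry propagation between limbs, at the cost of a non-unique representation. The 56-bit data width and helper names here are invented for illustration and are not MPIR's actual layout:

{{{#!python
# Sketch of nail-based parallel addition (illustrative parameters).
# Each 64-bit limb stores only 56 data bits; the top 8 nail bits absorb
# carries, so up to 2^8 limb-wise additions need no carry propagation.
DATA_BITS = 56
LIMB_MASK = (1 << DATA_BITS) - 1

def to_limbs(n, count):
    return [(n >> (DATA_BITS * i)) & LIMB_MASK for i in range(count)]

def from_limbs(limbs):
    # Normalisation: carries propagate here, and only here.
    return sum(l << (DATA_BITS * i) for i, l in enumerate(limbs))

def add_limbwise(x, y):
    # No carries between limbs - every limb addition is independent.
    return [a + b for a, b in zip(x, y)]

x, y = to_limbs(2 ** 200 - 1, 5), to_limbs(3 ** 120, 5)
assert from_limbs(add_limbwise(x, y)) == (2 ** 200 - 1) + 3 ** 120
}}}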
Glenn Tarbox (owner of cuda1, an AMD K10 with an NVIDIA CUDA card - expert on large-scale parallelisation)
- What are the top-level integration issues, e.g. for libraries using MPIR?
Michael Abshoff (Sage release manager)
- Link into Sage via Cython and link in CUDA
CUDA documentation:
CUDA issues:
- Memory bandwidth limits which algorithms benefit: a matrix has n^2 entries to move in and out against O(n^2.7) operations for matrix multiplication, whereas an integer has n limbs to move in and out against only O(n log n log log n) operations to multiply (see the comparison below)
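A rough back-of-the-envelope comparison of the two ratios, using the exponent 2.7 from the note above and ignoring all constants; the point is only that operations per word moved grows like n^0.7 for matrix multiplication but only like log n log log n for integer multiplication:

{{{#!python
import math

# Ops-per-word ratios implied by the note above (constants ignored):
# matrix multiplication: O(n**2.7) work over n**2 entries moved,
# integer multiplication: O(n log n log log n) work over n limbs moved.
for n in (10 ** 3, 10 ** 6, 10 ** 9):
    matmul_ops_per_entry = n ** 2.7 / n ** 2
    intmul_ops_per_limb = math.log(n) * math.log(math.log(n))
    print(f"n = {n:>10}: matmul ~ {matmul_ops_per_entry:9.0f} ops/entry, "
          f"intmul ~ {intmul_ops_per_limb:5.1f} ops/limb")
}}}

So the GPU's bandwidth cost is amortised ever better for matrix multiplication as n grows, but hardly at all for integer multiplication.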
Other Options:
- AMD's math library (AML) provides a BLAS interface and uses the GPU - but that's for linear algebra
- PTX: NVIDIA's GPU assembly language, for hand-coding inner loops
Gonzalo Tornaria (theta functions expert)
- Is there a way to encode integer multiplication in linear algebra? (A. Perhaps as vectors - multimodular - but not as matrices; see the sketch after this list)
- Kernel launches: threads are launched, with the issues driven by the hierarchy of memory - registers -> per-block memory -> main graphics memory -> system memory
- All the threads on all the processors can be launched in a couple of cycles
- How would a GPU compare to a carefully programmed FPGA?
- E.g. a Stratix IV can have around 1000 18x18 multipliers, but maybe that's not that many, and this is probably very expensive hardware
http://www.altera.com/products/devices/stratix-fpgas/stratix-iv/stxiv-index.jsp
- Carl Witty does FPGA programming - says it is probably very expensive
- According to the spec the Stratix can do parallel high-bandwidth communication
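A minimal sketch of the multimodular answer to Gonzalo's question above: reduce both operands modulo several small primes, multiply the residue vectors pointwise (every slot is independent, hence GPU-friendly), and recover the product by CRT. The prime list and function name are illustrative:

{{{#!python
# Multimodular integer multiplication sketch (illustrative primes).
# The residue-wise products form a pointwise vector multiply - the part
# that maps naturally onto GPU-style parallelism - followed by a CRT
# reconstruction. Needs Python 3.8+ for math.prod and pow(x, -1, p).
from math import prod

PRIMES = [2 ** 31 - 1, 2 ** 31 - 19, 2 ** 31 - 61, 2 ** 31 - 69, 2 ** 31 - 85]

def multimodular_mul(a, b):
    M = prod(PRIMES)
    assert a * b < M, "need enough primes to cover the product"
    # Pointwise residue multiply: every slot is independent.
    residues = [(a % p) * (b % p) % p for p in PRIMES]
    # Chinese Remainder Theorem reconstruction.
    result = 0
    for p, r in zip(PRIMES, residues):
        Mi = M // p
        result += r * Mi * pow(Mi, -1, p)
    return result % M

a, b = 12345678901234567, 98765432109876543
assert multimodular_mul(a, b) == a * b
}}}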
Jeff Gilchrist
- What about ATI hardware - why not support OpenCL?
- Carl Witty says -
Bill Hart
- The Cell port will happen, as it is funded by an EPSRC grant - it will be proof-of-principle code used to apply for a port to Cell2Xi
Glenn Tarbox
- Flame - for overall integration of libraries