Drake Sage Group

This page documents activities of the Drake University Sage group. We regularly meet on Fridays, 11:30-12:20, in Howard 111 (at Drake University, Des Moines, Iowa, USA).

Our initial work is on a single-cell compute server: essentially a web page that can execute an arbitrary block of Sage code.

For more information, please contact Jason Grout at jason#[email protected] (replace the # with a .)

Getting Started

  • install Sage or Python, IPython, and Mercurial

  • install mongodb

  • install the PyMongo and Flask Python modules:

    # from within Python, using setuptools' easy_install programmatically
    from setuptools.command import easy_install
    easy_install.main(["flask"])
    easy_install.main(["pymongo"])
  • configure Mercurial: put this in your ~/.hgrc file:

    [ui]
    username = YOUR NAME <YOUR EMAIL>
    
    [extensions]
    record=
    convert=
    hgext.mq=
    hgext.extdiff=
    hgk=
    transplant=
    fetch=
  • Create a Google Code account

  • clone the simple-db-compute repository (either just clone it locally, or clone it on Google Code and then pull from your clone)
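
As a quick sanity check after the steps above, a short script can report whether the required modules import cleanly. This helper is illustrative, not part of simple-db-compute:

```python
# Sanity check for the install steps above: report which of the
# required modules can be imported. Illustrative helper only.
import importlib

def check_modules(names):
    """Return a dict mapping each module name to True if it imports."""
    results = {}
    for name in names:
        try:
            importlib.import_module(name)
            results[name] = True
        except ImportError:
            results[name] = False
    return results

if __name__ == "__main__":
    for name, ok in check_modules(["flask", "pymongo"]).items():
        print(name, "OK" if ok else "MISSING")
```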

Regular Meetings

03 Feb 2011

Meet in Howard Hall 308 at 2pm (room reserved from 1:30-3, so come early if you want).

  • Introductions
  • What Sage is
  • Overview of the simple-db-compute project and its architecture
  • Resources
  • Installfest: get the simple compute server up and running on as many people's computers as possible
  • First goal of project
    • familiarize yourself with the simple-db-compute source code
    • add a "compute id" that is returned to the user. The answers page then queries for just that computation's result.
    • add necessary files to get this running on Windows (for example, a .bat file to start mongodb)
    • Look at making the device more parallel/scalable. See multiprocessing (which includes functionality for pools of worker processes), or maybe use the parallel code from Sage. The new experiments in forking Sage to start it up also seem relevant.

  • Next meeting time

11 Feb 2011

  • Status reports
  • Get people set up who weren't able to get everything installed and running last time
  • Further discuss the design

18 Feb 2011

  • Status reports
  • Further discuss the design

25 Feb 2011

  • Status reports
  • Further discuss the design
  • Follow up with people going to Sage Days

04 Mar 2011

  • Status reports
    • Code sprint last week
  • Discuss design of the FileStore class

Group Code Sprints

Tue, 01 Mar 2011 (2pm, Howard 235)

  • Done: Implement a simple streams functionality that lets a person do something like:

print "hi"
new_stream('text')
print "bye"

and the "hi" and "bye" appear in separate <pre> HTML blocks.
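
The mechanism behind this could be sketched roughly as follows (in Python 3 syntax). The marker string and JSON header format are assumptions for illustration; the real implementation may differ:

```python
# Rough sketch of how new_stream might work: print a marker line into
# stdout so the device can split the captured output into separate
# streams. The marker format here is an assumption for illustration.
import json

STREAM_MARKER = "__NEW_STREAM__"  # hypothetical sentinel recognized by the device

def new_stream(stream_type):
    """Signal the start of a new output stream of the given type."""
    print(STREAM_MARKER + json.dumps({"type": stream_type}))

def split_streams(raw_output):
    """Device side: split captured stdout into (type, content) pairs."""
    streams = [("text", "")]
    for line in raw_output.splitlines(keepends=True):
        if line.startswith(STREAM_MARKER):
            header = json.loads(line[len(STREAM_MARKER):])
            streams.append((header["type"], ""))
        else:
            kind, content = streams[-1]
            streams[-1] = (kind, content + line)
    return streams
```

For the example above, the captured output would split into two text streams containing "hi" and "bye", which the front end can then render as separate <pre> blocks.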

Mon, 07 Mar 2011 (3:30pm, Howard 235)

  • Get the FileStore class working

    • Make the device recognize any files created and automatically make new streams for each file. This involves:
      • Changing the worker processes to execute the block of code in a temporary directory, then have the device look in that directory for files that were created
      • Making the device store the created files in a filestore, as well as generate a stream object for the list of files. If a file is already referenced by another stream object (a plot, say), maybe it shouldn't be added to this stream object, or maybe there should be two stream objects: one for files referenced in another stream, and one for files that weren't.
    • Make Flask create a URL resource for each file (perhaps following the format http://localhost:5000/files/<cell_id>/<filename>), which fetches the file from the filestore and sends it to the browser when needed. Again, see the design walkthrough on the Notebook design page.

    • Make the SQLAlchemy filestore (defaulting to an SQLite files.db database that stores the files as BLOBs)
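
The filestore idea could be sketched like this. The task above calls for SQLAlchemy; this sketch uses the stdlib sqlite3 module for brevity, and the class and method names are assumptions, not the project's actual interface:

```python
# Minimal sketch of a filestore that keeps files as BLOBs in SQLite,
# keyed by (cell_id, filename). The real task calls for SQLAlchemy,
# which would follow the same shape. Names here are illustrative.
import sqlite3

class SQLiteFileStore:
    def __init__(self, path="files.db"):
        self.conn = sqlite3.connect(path)
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS files ("
            " cell_id TEXT, filename TEXT, data BLOB,"
            " PRIMARY KEY (cell_id, filename))"
        )

    def save_file(self, cell_id, filename, data):
        """Store (or overwrite) a file's bytes for a given cell."""
        self.conn.execute(
            "INSERT OR REPLACE INTO files VALUES (?, ?, ?)",
            (cell_id, filename, sqlite3.Binary(data)),
        )
        self.conn.commit()

    def get_file(self, cell_id, filename):
        """Return the stored bytes, or None if the file is unknown."""
        row = self.conn.execute(
            "SELECT data FROM files WHERE cell_id=? AND filename=?",
            (cell_id, filename),
        ).fetchone()
        return None if row is None else bytes(row[0])
```

A Flask file route would then just call something like get_file(cell_id, filename) and send the bytes back to the browser.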

Projects

Here are some project ideas for simple-db-compute, along with some hopefully helpful pointers to resources.

  • Look at running flask and uwsgi in async mode to do long-polling (i.e., comet) instead of the current normal polling. See http://projects.unbit.it/uwsgi/wiki/AsyncSupport

  • Port the SQLite backend to use SQLAlchemy so that it can also talk to MySQL, PostgreSQL, etc.
  • Figure out what needs to happen to get this all working on Windows and write up a "Getting started developing simple-db-compute on Windows" page. For example, make a start_mongo.bat file or something.
  • Make the output be dynamically updated, rather than only updated once at the end of the computation. To do this, the device needs to continuously update the database with the output, and the AJAX part needs to download parts of the output. Maybe make the AJAX query tell Flask how much output has already been sent, so Flask sends back only the output that hasn't been sent yet. See the design in Notebook design.

  • Try using callbacks in the device process rather than 0-time polling. Are there race conditions with callbacks?
  • File Stores
    • Make the device recognize any files created and automatically make new streams for each file. This involves:
      • Changing the worker processes to execute the block of code in a temporary directory, then have the device look in that directory for files that were created
      • Making the device store the created files in a filestore, as well as generate a stream object for the list of files. If a file is already referenced by another stream object (a plot, say), maybe it shouldn't be added to this stream object, or maybe there should be two stream objects: one for files referenced in another stream, and one for files that weren't.
    • Make Flask create a URL resource for each file (perhaps following the format http://localhost:5000/files/<cell_id>/<filename>), which fetches the file from the filestore and sends it to the browser when needed. Again, see the design walkthrough on the Notebook design page.

    • Make the SQLAlchemy filestore (defaulting to an SQLite files.db database that stores the files as BLOBs)
    • Make a filesystem filestore, which stores the files on the filesystem so that nginx can serve them directly without going through Flask. This will likely be much faster than going through Flask, a FileStore object, and a database.

  • Look at using Celery for managing device workers.

  • In progress (Ira working on Tsung, Jason working on FunkLoad): Write a test script that hammers the site hard to test how scalable it is. Maybe the Python multiprocessing module could be used to make a pool of workers, each submitting computations to the website at a configurable rate. Record the round-trip time for a request and the time for a computation's result to appear. See if the site scales to hundreds of requests per minute. Tsung or one of the tools here may be a better fit than writing our own. With the current code (revision a1b94e690b8beb1c41ee27ec1b8e3cb97afa5170), a stress-testing program would need to do the following:

    • Make an HTTP GET request to URL/eval?commands=3%2B2%0A (where commands is the URL-escaped text to evaluate; here the text is "3+2\n"). The response will be a JSON dictionary that looks like this:  { "computation_id": "...." }  (where .... is a computation id)

    • Wait some small amount of time and repeatedly make HTTP GET requests to URL/output_poll?computation_id=.... (where the computation id was obtained in the previous step). This call will either return a JSON output result,  { "output": "5\n" } , or an empty JSON dictionary,  {} . If it returns the empty dictionary, wait a small amount of time (maybe 1 second) and query again. Repeat until an output comes back.

    • Record response times to each query, as well as response times from the time a computation id is assigned to the time you get a result.
    • Then fire up lots of instances of the program, or, even better, use something like the Python multiprocessing module to fire up many at once.
    • Interpret the response times to tell us interesting things about how these pieces scale.
  • Start reading/experimenting with apache/nginx/lighttpd/gunicorn (using gevent) to serve up the simple-db-compute.
  • Web frontend should handle these types of streams:
    • [X] Text: {'type': 'text', 'content': '...'}: put the content in a <pre> element

    • [X] Image: {'type': 'image', 'file': 'filename.png'}: make an <img src='filename.png' /> element

    • [ ] HTML: {'type': 'html', 'content': '...'}: put the content (as HTML) in a <div> element
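
The eval/output_poll protocol from the stress-testing project above could be sketched as a single worker like this. BASE_URL, the helper names, and the timing values are assumptions; only the standard library is used:

```python
# Sketch of one stress-test worker following the eval/output_poll
# protocol described above. BASE_URL and helper names are assumptions.
import json
import time
import urllib.parse
import urllib.request

BASE_URL = "http://localhost:5000"  # assumed server location

def eval_url(base, code):
    """URL for submitting code; "3+2\n" becomes .../eval?commands=3%2B2%0A."""
    return base + "/eval?" + urllib.parse.urlencode({"commands": code})

def poll_url(base, computation_id):
    """URL for polling the result of a previously submitted computation."""
    return base + "/output_poll?" + urllib.parse.urlencode(
        {"computation_id": computation_id})

def run_one(code, delay=1.0, retries=30):
    """Submit one computation, poll until output appears.

    Returns (output, seconds_to_get_id, seconds_to_get_result)."""
    start = time.time()
    with urllib.request.urlopen(eval_url(BASE_URL, code)) as resp:
        comp_id = json.loads(resp.read())["computation_id"]
    submitted = time.time()
    for _ in range(retries):
        with urllib.request.urlopen(poll_url(BASE_URL, comp_id)) as resp:
            answer = json.loads(resp.read())
        if "output" in answer:  # empty dict {} means "not done yet"
            return answer["output"], submitted - start, time.time() - start
        time.sleep(delay)
    raise TimeoutError("no result for computation " + comp_id)
```

Many copies of run_one could then be driven from a multiprocessing.Pool, with the two recorded times aggregated to see how the site scales.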

Projects that are done

These projects are done, though they may still be open to improvement.

  • DONE (Jason, Ira, Ryan): Output streams: write a Python library to make output "streams" that can represent different kinds of objects. For example, one stream could be stdout (text), while another could be HTML code. The workers call functions to make a new stream of a specific type; the function inserts a marker into stdout indicating that a new stream is starting. The device recognizes that marker and inserts the stream information into the database. The web front end also recognizes the streams and has special code to handle each type (for example, text streams are put inside a <pre>, while HTML streams are just added to the document, maybe inside a <div>). For a longer explanation, see the Notebook design page.

  • DONE: Make flask assign a random computation id to each incoming request, return that computation id to the browser as a return value in the ajax call. Then make the browser keep asking the server for the results of that particular computation id. When results exist, send the result back to the browser. Otherwise, send back some message that the results are not computed yet.
  • DONE: Make the web interface use AJAX to send a computation and display a result. Helpful resources: jQuery AJAX; google for numerous jQuery AJAX tutorials. You will probably want to create a javascript file in the static/ directory and add it to templates/base.html (follow the examples already there adding jQuery, for example). I would suggest using something like JSON to send and receive messages with the server. Maybe use long polling (see https://github.com/RobertFischer/JQuery-PeriodicalUpdater/ for example).

  • DONE (Ira): Make the AJAX and computation ID stuff work in sqlite (it works in mongodb right now).
  • DONE: Configure apache, nginx, or lighttpd to serve up simple-db-compute. Test its scalability compared to the default really simple Python HTTP server. It looks like nginx with uwsgi might be an interesting option to explore.
  • DONE: On the backend, write a device that keeps a pool of workers (maybe using the python multiprocessing library) and keeps those workers busy with computations from the database. Ideally, the polling of the database should not be blocked by worker computations. Instead, on each poll, workers that are finished should have output put into the database, and new computations should be pulled out of the database. It seems like a good idea to avoid trying to put each output in the database as it happens. Rather, batch up the database updates to happen once in a polling interval.
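
The computation-id flow from the items above can be sketched with an in-memory dict standing in for the database. The real code uses Flask routes and MongoDB/SQLite; the function names here are hypothetical stand-ins:

```python
# Sketch of the eval/output_poll flow described above, with in-memory
# dicts standing in for the database. Names are illustrative only.
import uuid

pending = {}   # computation_id -> submitted code
results = {}   # computation_id -> output string

def handle_eval(commands):
    """Assign a random computation id, queue the code, return the id."""
    comp_id = uuid.uuid4().hex
    pending[comp_id] = commands
    return {"computation_id": comp_id}

def handle_output_poll(comp_id):
    """Return {'output': ...} once computed, else an empty dict."""
    if comp_id in results:
        return {"output": results[comp_id]}
    return {}

def device_step():
    """Device side: compute results for all pending computations."""
    while pending:
        comp_id, code = pending.popitem()
        # eval() is a stand-in for handing the code to a Sage worker.
        results[comp_id] = str(eval(code)) + "\n"
```

The browser's AJAX loop maps onto handle_eval followed by repeated handle_output_poll calls, matching the polling behavior described in the DONE items above.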

DrakeSageGroup (last edited 2011-04-28 12:00:59 by jason)