Differences between revisions 2 and 9 (spanning 7 versions)
Revision 2 as of 2009-10-12 11:13:28
Size: 4081
Editor: Minh Nguyen
Comment:
Revision 9 as of 2010-09-15 14:12:16
Size: 7568
Editor: Minh Nguyen
Comment: add section "Backup your data" to Sage cluster guidelines
Deletions (from revision 2) are prefixed with `-`. Additions (in revision 9) are prefixed with `+`.
Line 1: Line 1:
+ ## page was renamed from SageClusterGuidelines
Line 3: Line 4:
- This document sets out guidelines for using the Sage cluster. The Sage cluster consists of four similar computers:
+ This document sets out guidelines for using the Sage cluster. These guidelines exist so that users are aware of the available computing resources on the Sage cluster. The Sage cluster comprises a set of five similar computers and one Solaris machine:
Line 7: Line 8:
-  * `geom.math` --- mainly for geometry research
+  * `geom.math` --- mainly for algebraic geometry research
Line 11: Line 12:
+  * `rh.math` --- a machine from Seattle University
Line 12: Line 15:

+  * `t2.math` --- a SPARC Solaris machine for making software of all kinds work on Solaris
Line 22: Line 27:
-  * Does your job relate to geometry computation? If so, then `geom.math` is the machine to use, since that is its intended purpose.
+  * Does your job relate to algebraic geometry computation? If so, then `geom.math` is the machine to use, since that is its intended purpose.
Line 26: Line 31:
- The machines `mod.math` and `geom.math` can be used for running very long jobs. Running long jobs on any of those machines would minimize disruption to your long jobs because release managers don't usually compile, run and doctest Sage on any of those machines, unless absolutely necessary. Running long jobs on `sage.math` would result in disruption to long running jobs because many people actually use `sage.math` to compile, run and doctest Sage. Doctesting Sage is usually performed in parallel, which can take away computing time from other running jobs.
+ The machines `mod.math` and `geom.math` can be used for running very long jobs. Running long jobs on any of those machines would minimize disruption to your long jobs because release managers don't usually compile, run and doctest Sage on any of those machines, unless absolutely necessary. Many people actually use `sage.math` to compile, run and doctest Sage. Because building and doctesting Sage is very parallelizable, it is useful (and common practice) to use a significant portion of the machine's resources for a relatively short amount of time, as opposed to a small amount of resources for a large amount of time.
Line 29: Line 34:

+ If you have a long-running job on any compute node, please make sure that the job has a nice value of 19. You can use the command `top` to determine the process ID, or PID, of a running job. Once you have a job's PID, use the command `renice` to set its nice value as high as possible, usually 19. For example, the following changes the nice values of three running jobs to 19:

+ {{{
+ renice +19 k m n
+ }}}

+ where `k`, `m` and `n` are the PIDs of the three running jobs.
Line 35: Line 48:
- From any of the machines on the Sage cluster, you can ssh to any of the other three machines. Whenever ssh'ing to another server, you should use the syntax
+ From any of the machines on the Sage cluster, you can ssh to any of the other five machines. Whenever ssh'ing to another server, you could use the syntax
Line 43: Line 56:
-  * With the option "`-C`", all data transferred between your local machine and the remote server are compressed. The option to compress data comes in really handy if you have a limited Internet quota.
+  * With the option "`-C`", all data transferred between your local machine and the remote server are compressed. The option to compress data comes in really handy if you have a limited Internet quota. This is not an issue when ssh'ing between the machines in the cluster.
Line 45: Line 58:
-  * The option "`-x`" disables X11 forwarding. If you don't want to transfer X11 graphical data between machines, you should explicitly disable X11 forwarding. With a text-based SSH session, X11 forwarding involves transferring more data than necessary.
+  * The option "`-x`" disables X11 forwarding. If you don't want to transfer X11 graphical data between machines, you should explicitly disable X11 forwarding. With a text-based SSH session, X11 forwarding involves transferring more data than necessary. Again, this is not an issue when ssh'ing between the machines in the cluster.
Line 48: Line 61:


+ == Minimizing disruption and downtime ==

+ The partition `/home` is shared via NFS with the following machines:

+  * `boxen.math`

+  * `geom.math`

+  * `mod.math`

+  * `rh.math`

+  * `sage.math`

+  * `t2.math`

+ which means that a running job shouldn't do a lot of disk I/O under `/home`. Instead, you should use a fast-ish local partition such as `/scratch` on `sage.math` when logged in to `sage.math`. However, `/scratch` is mounted as a local disk on `sage.math` and shared with other compute nodes via NFS. A significant implication is that you should avoid doing a lot of disk I/O under `/scratch` when logged in to machines other than `sage.math`.

+ At least in version 4.2.1, Sage uses a temporary directory under your `/home` directory, so it's also a good idea to set the environment variable `DOT_SAGE` to a local partition or directory such as `/tmp`. You could do that as follows:

+ {{{
+ export DOT_SAGE=/tmp/<your-username>/
+ }}}

+ where you could replace `<your-username>` with the username you use to log in to compute nodes on the Sage cluster.

+ The directory `/dev/shm` is a fast-ish local ramdisk you could use for jobs that require disk I/O. On `geom.math`, the local directory `/space` serves a similar purpose to `/scratch` on `sage.math`. You could create your own working directory whose name reflects your username, and then proceed to use that directory as your scratch directory. For example, you could compile Sage under that working directory and have a running job do disk I/O under it. However, note that after a reboot of a compute node, any data under `/dev/shm` would be lost.

+ == Backup your data ==

+ You need to have a backup plan for data you store on the Sage cluster. For example, you could keep data under your home directory under `/home/` and back it up under `/scratch/`. You might consider creating a backup directory under `/disk/scratch/`, where the name of that backup directory reflects your username. Then you could back up your data under, say, `/disk/scratch/<username>`.

+ You should also consider backing up your data on machines outside the Sage cluster. For example, you could create accounts on any of the following hosting sites:

+  * [[http://bitbucket.org | bitbucket]]
+  * [[http://github.com | GitHub]]
+  * [[http://gitorious.org | Gitorious]]
+  * [[http://code.google.com | Google Code]]

+ and back up your data and code there.

Guidelines for Using the Sage Cluster

This document sets out guidelines for using the Sage cluster. These guidelines exist so that users are aware of the available computing resources on the Sage cluster. The Sage cluster comprises a set of five similar computers and one Solaris machine:

  • boxen.math --- mainly for virtual machines and web services

  • geom.math --- mainly for algebraic geometry research

  • mod.math --- mainly for number theory research

  • rh.math --- a machine from Seattle University

  • sage.math --- mainly for Sage development

  • t2.math --- a SPARC Solaris machine for making software of all kinds work on Solaris

Prioritizing usage

The machine sage.math is primarily for Sage development. Ideally, you should use that machine to develop code, upgrade or update packages, port packages and code, review or work on tickets, etc. If you have a long job to run on the Sage cluster, first consider whether your job is related to any of these goals.

Some questions relating to using any of the above machines include:

  • If the job would take days, weeks, or longer, does it relate to number-theoretic computation? If so, then mod.math is the machine to use, since its stated purpose is number theory research, which includes number-theoretic computation.

  • Does your job relate to algebraic geometry computation? If so, then geom.math is the machine to use, since that is its intended purpose.

Most of the time, you shouldn't run long jobs on boxen.math because that machine is for web services. We want to minimize the downtime of the public notebook server, the Sage wiki server, the trac bug server, the Sage main website, and websites of other projects hosted on boxen.math. Please first consider using geom.math or mod.math before running long jobs on sage.math.

The machines mod.math and geom.math can be used for running very long jobs. Running long jobs on any of those machines would minimize disruption to your long jobs because release managers don't usually compile, run and doctest Sage on any of those machines, unless absolutely necessary. Many people actually use sage.math to compile, run and doctest Sage. Because building and doctesting Sage is very parallelizable, it is useful (and common practice) to use a significant portion of the machine's resources for a relatively short amount of time, as opposed to a small amount of resources for a large amount of time.

Running a long job on the machine sage.math --- where the job can take days, weeks, or months --- can significantly affect the development, compilation, and doctesting of the Sage library. When you work on a ticket, whether that be developing code or reviewing other people's code, you can use sage.math to parallel doctest the Sage library with that new code using 6 to 10 threads. This should significantly reduce the development and doctesting time from about 3 to 6 hours with one thread, to about 30 minutes with 16 threads.
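For example, assuming a from-source Sage installation, a parallel doctest run is sketched below; the -tp flag and the library path can vary between Sage versions, so treat this as illustrative rather than exact:

{{{
# Doctest the Sage library in parallel with 10 threads; run from
# the root of a Sage source installation (path varies by version).
./sage -tp 10 devel/sage/sage/
}}}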

If you have a long-running job on any compute node, please make sure that the job has a nice value of 19. You can use the command top to determine the process ID, or PID, of a running job. Once you have a job's PID, use the command renice to set its nice value as high as possible, usually 19. For example, the following changes the nice values of three running jobs to 19:

renice +19 k m n

where k, m and n are the PIDs of the three running jobs.
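As a complement to renice, you can start a long job already niced, so it never competes with interactive users in the first place. A minimal sketch, where ./long_job.sh is a stand-in for your own script:

{{{
# Launch a background job at the friendliest nice value (19);
# nohup lets it keep running after you log out.
nohup nice -n 19 ./long_job.sh &
}}}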

The sooner tickets and code get merged into Sage, the sooner users can use the new code and be grateful to developers, patch authors and reviewers for providing useful software. So before running any long jobs on sage.math, please consider whether the job can be run on any of the other machines instead.

From any of the machines on the Sage cluster, you can ssh to any of the other five machines. Whenever ssh'ing to another server, you could use the syntax

ssh -C -x -a <remote-machine>

Here's an explanation of these options:

  • With the option "-C", all data transferred between your local machine and the remote server are compressed. The option to compress data comes in really handy if you have a limited Internet quota. This is not an issue when ssh'ing between the machines in the cluster.

  • The option "-x" disables X11 forwarding. If you don't want to transfer X11 graphical data between machines, you should explicitly disable X11 forwarding. With a text-based SSH session, X11 forwarding involves transferring more data than necessary. Again, this is not an issue when ssh'ing between the machines in the cluster.

  • The option "-a" disables forwarding of the authentication agent. If you ssh from one server to another, forwarding the agent is a security risk: an attacker who has compromised the second server can work back to the first and gain access to both servers simply because you forwarded the authentication agent.
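If you connect to these machines often, you could set the equivalents of these options once in your SSH client configuration instead of typing them every time. A sketch, assuming the host names above resolve from your machine:

{{{
# ~/.ssh/config --- per-host defaults equivalent to "ssh -C -x -a"
Host sage.math mod.math geom.math boxen.math rh.math t2.math
    Compression yes
    ForwardX11 no
    ForwardAgent no
}}}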

Minimizing disruption and downtime

The partition /home is shared via NFS with the following machines:

  • boxen.math

  • geom.math

  • mod.math

  • rh.math

  • sage.math

  • t2.math

which means that a running job shouldn't do a lot of disk I/O under /home. Instead, you should use a fast-ish local partition such as /scratch on sage.math when logged in to sage.math. However, /scratch is mounted as a local disk on sage.math and shared with other compute nodes via NFS. A significant implication is that you should avoid doing a lot of disk I/O under /scratch when logged in to machines other than sage.math.
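If you are unsure whether a directory is local or NFS-mounted on the machine you are logged in to, you can check the filesystem type behind it. A quick check with GNU df (so on the Linux machines; the flags differ on t2.math):

{{{
# "nfs" in the Type column means I/O goes over the network;
# types such as ext3 or tmpfs mean the disk is local.
df -T /home /scratch /tmp
}}}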

At least in version 4.2.1, Sage uses a temporary directory under your /home directory, so it's also a good idea to set the environment variable DOT_SAGE to a local partition or directory such as /tmp. You could do that as follows:

export DOT_SAGE=/tmp/<your-username>/

where you could replace <your-username> with the username you use to log in to compute nodes on the Sage cluster.
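Note that the export alone does not create the directory. Assuming a typical setup, you would create it first; you could also put both lines in your shell startup file so they take effect on every login:

{{{
# Create a per-user directory on the local /tmp partition and
# point Sage's DOT_SAGE at it (substitute your own username).
mkdir -p /tmp/<your-username>/
export DOT_SAGE=/tmp/<your-username>/
}}}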

The directory /dev/shm is a fast-ish local ramdisk you could use for jobs that require disk I/O. On geom.math, the local directory /space serves a similar purpose to /scratch on sage.math. You could create your own working directory whose name reflects your username, and then proceed to use that directory as your scratch directory. For example, you could compile Sage under that working directory and have a running job do disk I/O under it. However, note that after a reboot of a compute node, any data under /dev/shm would be lost.
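For example, a minimal sketch of setting up such a per-user scratch directory on the ramdisk:

{{{
# Create a personal working directory on the ramdisk and work there.
# Remember: everything under /dev/shm is lost when the node reboots.
mkdir -p /dev/shm/<your-username>
cd /dev/shm/<your-username>
}}}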

Backup your data

You need to have a backup plan for data you store on the Sage cluster. For example, you could keep data under your home directory under /home/ and back it up under /scratch/. You might consider creating a backup directory under /disk/scratch/, where the name of that backup directory reflects your username. Then you could back up your data under, say, /disk/scratch/<username>.
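One way to do such a backup is with rsync; a sketch, where the project directory names are illustrative:

{{{
# Mirror a project directory from your NFS-mounted home directory
# to a per-user backup directory on the scratch disk; -a preserves
# permissions and timestamps, --delete keeps the mirror exact.
rsync -a --delete ~/project/ /disk/scratch/<username>/project/
}}}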

You should also consider backing up your data on machines outside the Sage cluster. For example, you could create accounts on any of the following hosting sites:

  • bitbucket --- http://bitbucket.org

  • GitHub --- http://github.com

  • Gitorious --- http://gitorious.org

  • Google Code --- http://code.google.com

and back up your data and code there.
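A typical workflow for the git-based sites is sketched below; the remote repository URL is a placeholder for whichever site and project you set up:

{{{
# Put a working directory under version control and push it to a
# remote hosting site as an off-cluster backup.
cd ~/project
git init
git add .
git commit -m "initial backup"
git remote add backup <remote-repository-url>
git push backup master
}}}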