The Luna Cluster
For quick reference, download the LSF Cheat Sheet. For further assistance, email Nicholas Socci: soccin@mskcc.org.
Directories
/home/$USERNAME – 100GB quota, backed up (mirrored)
- Use home for scripts/programs/workfiles
- Not for intermediate files or results
/ifs/work/$LABNAME/$USERNAME – 2TB quota (more can be requested)
- Fast disk
- Use for scratch/intermediate files
/ifs/res/$LABNAME/$USERNAME
- Medium-performance disk; use for intermediate-term storage of results
/opt/common
- binaries organized by OS/PROGRAM/VERSION
- E.g.: PERL
- /opt/common/CentOS6/perl/
perl-5.16.3
perl-5.20.1
perl-5.20.2
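For example, to run a specific version directly (this sketch assumes each version directory has the usual bin/ subdirectory; check the actual layout):
/opt/common/CentOS6/perl/perl-5.20.2/bin/perl --version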
LSF Commands
Before you can run LSF commands you need to source the following file:
/common/lsf/conf/profile.lsf
Add this to your .profile, .bashrc, or similar.
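For example, the line to append to your ~/.bashrc:
# make the LSF commands (bsub, bjobs, ...) available in every shell
source /common/lsf/conf/profile.lsf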
Simple command:
bsub sleep 30
To request N CPUs use:
-n N
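For example, to request 4 CPUs (the program name is hypothetical):
bsub -n 4 ./my_program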
For memory:
-R "rusage[mem=GB]"
-R "rusage[mem=10]"
Requests 10GB
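For example, to run a job with 10GB of memory (the program name is hypothetical):
bsub -R "rusage[mem=10]" ./my_program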
If your job will run longer than 1 hour, please specify the expected run time with:
-We HOURS:MINUTES
e.g., for a 2-hour job:
-We 2:00
or
-We 120
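For example, a job expected to run about two hours (the script name is hypothetical):
bsub -We 2:00 ./long_analysis.sh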
To send output to a file, use:
-o out.txt
or
-o DIR/
to write to a directory.
- bsub -I – run the job interactively
- bsub -J NAME – name the job
- bsub -w "NAME" – wait on the named job
- bsub -w $jobid – wait on the job with the given job ID
- bsub -o filename – redirect stdout to the file filename
- bsub -e filename – redirect stderr to the file filename
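Putting several of these options together (job, script, and file names are hypothetical):
bsub -n 2 -R "rusage[mem=8]" -We 4:00 -J align_sample1 -o align_sample1.out -e align_sample1.err "./align.sh sample1"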
The current defaults for jobs are:
-R "rusage[mem=1]"
-R "span[hosts=1]"
-R "rusage[iounits=1]"
The quotes are necessary.
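These are only defaults; passing your own -R should override them. For example, to request 32GB instead of the 1GB default (the program name is hypothetical):
bsub -R "rusage[mem=32]" ./big_mem_job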
Notes
- Do not submit jobs from /ifs/data, do not read input data from /ifs/data, and do not write any results there. It is only mounted on the head nodes.
- Automatic emailing is turned off; if you want email notification you must request it in your bsub command.
- -We is the expected runtime. Anything less than 60 (minutes) is considered a short job, which can run on all the nodes; long jobs (anything not short) can take up to 20% of the s nodes and half of the t nodes.
- Stdout will go to the -o file. If you want to redirect output yourself, you must put quotes around the execution command in the bsub command. Example:
bsub -We 1 -J jobName -o output_file.txt "ls -al 1> redirect_file.txt"
- Hold option in bsub:
-w "post_done($PREV_JOBNAME)"
Use post_done for holding, not "done" (done sometimes fires too early).
If holding for multiple jobs with similar names, -w "post_done($prev_Job*)" will work. This will ONLY let the job run if the $PREV_JOBNAME job completed with exit status 0 AND finished its post-done processing (not sure what that is).
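A minimal sketch of a two-step pipeline using named jobs and post_done (job and script names are hypothetical):
bsub -J step1_sample1 -o step1.out "./step1.sh sample1"
bsub -J step2_sample1 -w "post_done(step1_sample1)" -o step2.out "./step2.sh sample1"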
More Notes
- Kill all my jobs:
bkill -b -u $USERNAME 0
- bkill -b is much better to use if you are killing multiple jobs; otherwise it takes a lot longer and has the potential to crash the system.
- bjobs shows your running jobs; bjobs -a also includes jobs finished within the last 3 days
- bjobs -w shows full job names
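For example, to list all of your recent jobs (including finished ones) with their full names:
bjobs -a -w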
To search further back through older jobs, use:
bhist -n N -a -J "JOBNAME"
-n is how many event files to go through (jobs are rotated to older event files roughly every day),
-a means include old jobs,
-J is the job name (can contain wildcards).
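For example, to look back through roughly the last four days of event files for jobs matching a (hypothetical) name pattern:
bhist -n 4 -a -J "align_*"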
bresources -g -l shows what guaranteed resources are being used
brequeue -e jobID requeues a job that died (exited), so it runs again.
bmod -wn jobID removes the wait dependencies from a job so it will run.
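For example, with a hypothetical job ID of 123456:
brequeue -e 123456
bmod -wn 123456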
The Luna Cluster is made up of:
- luna.cbio.mskcc.org (login host): HP DL380 Gen8, one Xeon E5-2650 v2 @ 2.60GHz, 64GB RAM
- 62 compute nodes, 1024 cores total (2048 threads total)
- u01-u36: 36 HP ProLiant DL160 Gen9, dual 8-core Xeon E5-2640 v3 @ 2.60GHz, 256GB RAM per node
- s01-s24: 24 HP ProLiant DL160 Gen8, dual 8-core Xeon E5-2660 0 @ 2.20GHz, 384GB RAM per node
- t01-t02: 2 HP ProLiant DL580 Gen8, quad 8-core Xeon E7-4820 v2 @ 2.00GHz, 1.5TB RAM per node
- Compute nodes have 800GB of local scratch at /scratch/$USER
- SolISI (Isilon array): 1.5–2 PB (NL and X)
Luna is the head node for submitting jobs to the cluster.
Some nodes have internet access. We will describe how to access these nodes in the future.