There are a couple of options for accessing the HPC cluster. The first method, and the required method for first-time access, is to SSH into submit.hpc.cosine.oregonstate.edu using your COSINE/Science username and password. When you SSH into the cluster for the first time, a process is initiated to create and set up your home folder. The second option, available after you have successfully logged into the HPC cluster with SSH at least once, is to use the web interface:
https://ondemand.science.oregonstate.edu
If connecting from off campus, please refer to https://oregonstate.teamdynamix.com/TDClient/1935/Portal/KB/?CategoryID=6889
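For example, a first-time login from on campus looks like this (replace {username} with your COSINE/Science username):
ssh {username}@submit.hpc.cosine.oregonstate.edu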
Join the Cosine HPC mailing list for notifications about software updates and maintenance. Visit http://lists.science.oregonstate.edu/mailman/listinfo/cosine-hpc
The command-line smbclient tool can be used to access Windows shares easily. Using smbclient, a remote Windows share can be listed and navigated, and files can be uploaded to or deleted from it. The smbclient command also provides an interactive shell very similar to FTP or SFTP.
For example, connecting to your Z: drive:
smbclient //pfs.cosine.oregonstate.edu/{ONID} -U {ONID}
To list all shares:
smbclient -L {file_server_name}
For interactive shell:
smbclient //{server_name}/{share} -U {username/ONID}
The remote share can be connected like FTP/SFTP and a new shell is provided via the smbclient. This shell can be used to navigate, list, upload, download, etc. files.
To list files/folders on the share:
smb: \> ls
To list files/folders on the local system:
smb: \> !ls
To change directory on the share:
smb: \> cd {folder_name}
To change directory on the local system:
smb: \> lcd {folder_name}
To download a file from the share:
smb: \> get {file_name}
To upload a file to the share:
smb: \> put {file_name}
Files and folders can be uploaded with the mput command. To upload a folder and its contents, recursive mode must first be enabled with the recurse command; the upload can then be started with mput.
smb: \> recurse
smb: \> mput {folder_name}
Files and folders can also be downloaded with the mget command. If there are multiple files and folders to download, recursive mode should be enabled with the recurse command first.
smb: \> recurse
smb: \> mget {folder_name}
To end your session, type "quit" at the interactive shell prompt:
smb: \> quit
Yes, you can.
Example
mkdir ~/my_zdrive
sshfs shell.cosine.oregonstate.edu: ~/my_zdrive
NOTE: Be sure to unmount your Z: drive when you are finished, using the following command:
fusermount -u ~/my_zdrive
The cluster uses the environment modules system to provide an easy way to switch between software revisions. These modules configure environment variables such as PATH for each piece of software.
To get a list of available modules to load, execute:
module avail
To get a list of what modules are currently loaded, execute:
module list
To load the Matlab 2014b module, execute:
module load matlab/R2014b
To display modules set to load during login:
module initlist
To set a module to automatically load during login:
module initadd matlab/R2014b
To remove a module from loading during login:
module initrm matlab/R2014b
Jobs should be submitted using a special sh script which tells the scheduler how to handle the job.
An example with common options can be seen below:
submit.sh
#!/bin/sh
# Give the job a name
#$ -N example_job
# set the shell
#$ -S /bin/sh
# set working directory on all hosts to
# directory where the job was started
#$ -cwd
# send all process STDOUT (fd 1) to this file
#$ -o job_output.txt
# send all process STDERR (fd 2) to this file
#$ -e job_output.err
# email information
#$ -m e
# Just change the email address. You will be emailed when the job has finished.
#$ -M [email protected]
# generic parallel environment with 2 cores requested
#$ -pe orte 2
# Load a module, if needed
module load sprng/5
# Commands
./my_program
Ensure your script or program is executable by running the following command:
chmod +x my_program
To submit your job to the HPC cluster scheduler, type the following command:
qsub submit.sh
The qstat command is used to check the status of jobs on the cluster. By itself it shows a brief overview:
qstat
To show the status of all nodes and queued processes, execute:
qstat -u '*'
The state codes that are displayed in the last column of qstat are as follows:
Category | State | SGE Letter Code
---|---|---
Pending | pending | qw
 | pending, user hold | qw
 | pending, system hold | hqw
 | pending, user and system hold | hqw
 | pending, user hold, re-queue | hRwq
 | pending, system hold, re-queue | hRwq
 | pending, user and system hold, re-queue | hRwq
Running | running | r
 | transferring | t
 | running, re-submit | Rr
 | transferring, re-submit | Rt
Suspended | job suspended | s, ts
 | queue suspended | S, tS
 | queue suspended by alarm | T, tT
 | all suspended with re-submit | Rs, Rts, RS, RtS, RT, RtT
Error | all pending states with error | Eqw, Ehqw, EhRqw
Deleted | all running and suspended states with deletion | dr, dt, dRr, dRt, ds, dS, dT, dRs, dRS, dRT
If the job is currently running:
qstat -j <jobId>
After the job has finished:
qacct -j <jobId>
The command to run is:
qdel <job id of process>
This will remove a job from the queue. If the job is in a dr state, the -f flag must be used to force the job to stop. The job ID is supplied as an argument to qdel.
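For example, to force removal of a job stuck in the dr state:
qdel -f <job id of process>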
Nodes in the all.q have mixed memory sizes. To ensure that a job lands on a node with sufficient memory, the mem_free resource can be used.
For example, to execute on nodes with at least 60GB of RAM available:
qsub -l mem_free=60G submit.sh
First you will need to create a submission file. Place the following code into the submission file (i.e. submit.sh):
#!/bin/sh
# Give the job a name
#$ -N JOB_NAME
# set the shell
#$ -S /bin/sh
# set working directory on all hosts to directory where the job was started
#$ -cwd
# send all ERROR messages to this file
#$ -e errors.txt
# Change the email address to YOUR email, and you will be emailed when the job has finished.
#$ -m e
#$ -M [email protected]
# Ask for 1 core, as R can only use 1 core for processing
#$ -pe orte 1
# Load the R Module
module load R
# Commands to run job
Rscript inputFile.r > outputFile.out
Then you can submit the job to the cluster with:
qsub submit.sh
Some example submit files for R can be found in the following folder /cm/shared/examples/R.
You can download and install any R Libraries which you might need to run on the cluster into your home directory and simply use them from there. These instructions give you the steps to accomplish this.
1. Load R module
module load R
2. Launch R
R
3. Type in the command to install the desired R package
install.packages("package_name")
If this is the first time running the install.packages() command, you will be asked if you want to create a personal library. Answer 'y'. Follow the prompts to pick a mirror, etc.
R will download and install the library into the newly created personal library (in your home directory). To use the new library, load it with the library command like any other installed library:
library(package_name)
Create a submission file (i.e. submit.sh) and place the following code inside:
#!/bin/sh
# Give the job a name
#$ -N JOB_NAME
# set the shell
#$ -S /bin/sh
# set working directory on all hosts to directory where the job was started
#$ -cwd
# send all ERROR messages to this file
#$ -e errors.txt
# Change the email address to YOUR email, and you will be emailed when the job has finished.
#$ -m e
#$ -M [email protected]
# Use 4 cores for processing
#$ -pe orte 4
# Load the Gaussian Module
module load gaussian/g16
# Commands to run job
g16 < inputFile.com > outputFile.out
Then submit the job to the cluster:
qsub submit.sh
Some Gaussian examples can be found in the folder /cm/shared/examples/g09.
Thanks to a campus agreement with MathWorks, Matlab Distributed Computing Server is available and installed on the Cosine cluster. In order to use Matlab, the module must be loaded:
module avail matlab
module load matlab/R2019b
Interactive Matlab
Interactive Matlab sessions can be run in text-only mode. To run Matlab in text-only mode, run:
matlab -nodisplay
If you want to run a text-only Matlab session, follow these steps:
1. Login to the cluster using ssh.
2. Start a session on a node in the cluster using:
qlogin -pe orte <number of cores needed>
3. From within the qlogin session, load the matlab module and then run:
module load matlab
matlab -nodisplay
Array jobs should be used when the job does not require any synchronization between tasks. The script will be launched multiple times, with a varying index. The index is accessible via the environment variable SGE_TASK_ID.
Typical uses of array jobs include running the same program over many independent input files or parameter values.
An example can be found on the cluster in:
/cm/shared/examples/matlab/array
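As a sketch (with hypothetical file names), an array-job submit script might request ten tasks with the #$ -t option and use SGE_TASK_ID to pick an input file:
#!/bin/sh
# hypothetical array-job submit script (array_submit.sh)
#$ -N example_array_job
#$ -S /bin/sh
#$ -cwd
# run tasks 1 through 10; each task gets its own SGE_TASK_ID
#$ -t 1-10
# each task processes a different input file, e.g. input_1.dat ... input_10.dat
./my_program input_${SGE_TASK_ID}.dat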
Rather than submitting SGE jobs that execute Matlab scripts on the cluster nodes, distributed jobs launch tasks on cluster nodes from within Matlab. Distributed jobs require the cluster to be configured within Matlab, and submission scripts which define how tasks should be launched on cluster nodes. The submission of the jobs is performed via Matlab GUI or command line interface.
In order to run distributed jobs, you first need to configure a cluster profile within Matlab. To configure it, start the Matlab GUI and then go to Parallel -> Manage Cluster Profiles.
A new window will pop up. In the new window, click on Add -> Custom -> Generic.
A new profile will be created. Rename it to something sensible (you will refer to it in your code). Let's call it Cosine.
Next, make sure you have provided the following information in the Properties tab (leaving all of the other options at their defaults):
Main Properties |
---|---
Description of this cluster: | Cosine HPC
Folder where cluster stores job data: | use default (unless you want to specify an alternative location)
Number of workers available to the cluster: | 32
Root folder of MATLAB installation for workers: | use default
Cluster uses MathWorks hosted licensing: | false
Submit Functions |
Function called when submitting independent jobs: | @independentSubmitFcn
Function called when submitting communicating jobs: | @communicatingSubmitFcn
Cluster Environment |
Cluster nodes' operating system: | Unix
Job storage location is accessible from client and cluster nodes: | yes
Workers |
Range of number of workers to run the job: | [1 32]
Jobs and task functions |
Function to query cluster about the job state: | @getJobStateFcn
Function to manage cluster when you call delete on a job: | @deleteJobFcn
Note that once the profile has been loaded, you can override the settings from the submission script.
Once the profile has been set up, click OK. Next, select the newly created profile and validate the configuration.
You can import a profile using either the Cluster Profile Manager or the Matlab parallel.importProfile(filename) command.
parallel.importProfile('/cm/shared/examples/matlab/distributed/Cosine.settings');
To import settings from within the Cluster Profile Manager, use its Import option.
Programmatically
Rather than using a previously defined cluster profile, the cluster details can be configured ad-hoc in a .m script file:
cluster = parallel.cluster.Generic();
cluster.NumWorkers = 32;
cluster.JobStorageLocation = '/homes/cosine/helpdesk/matlab/';
cluster.IndependentSubmitFcn = @independentSubmitFcn;
cluster.CommunicatingSubmitFcn = @communicatingSubmitFcn;
cluster.OperatingSystem = 'unix';
cluster.HasSharedFilesystem = true;
cluster.GetJobStateFcn = @getJobStateFcn;
cluster.DeleteJobFcn = @deleteJobFcn;
cluster.RequiresMathWorksHostedLicensing = false;
To save the cluster definition as a profile for later re-use, use:
cluster.saveAsProfile('Cosine')
To load a previously saved cluster definition, use:
cluster = parcluster('Cosine')
Additional qsub parameters can be passed to the submit functions by specifying them as a cell array:
{@communicatingSubmitFcn, 'list_of_additional_qsub_parameters'}
e.g. to specify that 5GB of memory should be requested, and that emails should be sent to [email protected] at the beginning and end of the job, the submit function should be specified as:
cluster = parcluster('Cosine');
cluster.CommunicatingSubmitFcn = {@communicatingSubmitFcn, '-l h_vmem=5G -m be -M [email protected]'};
pp = parpool(cluster);
parfor i=1:10
    [status, hn] = system('hostname');
    disp(hn);
end
...
delete(pp)
Make sure that the options you pass to the qsub command are syntactically correct, otherwise the job will fail (see the qsub man page for the list of available options).
An independent job is defined as follows (from http://www.mathworks.co.uk/help/distcomp/program-independent-jobs.html):
An Independent job is one whose tasks do not directly communicate with each other, that is, the tasks are independent of each other. The tasks do not need to run simultaneously, and a worker might run several tasks of the same job in succession. Typically, all tasks perform the same or similar functions on different data sets in an embarrassingly parallel configuration.
Independent jobs are created using the Matlab createJob() function.
An independent job example:
/cm/shared/examples/matlab/distributed/independent
Note: Matlab will submit the job to SGE without the need to write a submission script:
matlab -nodisplay < independent.m
A communicating job is defined as follows (from http://www.mathworks.co.uk/help/distcomp/introduction.html):
Communicating jobs are those in which the workers can communicate with each other during the evaluation of their tasks. A communicating job consists of only a single task that runs simultaneously on several workers, usually with different data. More specifically, the task is duplicated on each worker, so each worker can perform the task on a different set of data, or on a particular segment of a large data set. The workers can communicate with each other as each executes its task. The function that the task runs can take advantage of a worker's awareness of how many workers are running the job, which worker this is among those running the job, and the features that allow workers to communicate with each other.
Communicating jobs are required for:
- parfor loops, which allow multiple loop iterations to be executed in parallel;
- spmd blocks, which run a single program on multiple data - i.e. the same program runs on all workers, with behaviour determined by the varying data on each worker.
Communicating jobs are created using the Matlab createCommunicatingJob() function and can have a Type of either pool or spmd:
- a pool job runs the supplied task on one worker and uses the remaining workers as a pool to execute parfor loops, spmd blocks, etc.; the total number of workers available for parallel code is therefore one less than the total number of workers;
- an spmd job runs the supplied task on all the workers, with no task fundamentally in control - effectively, an spmd job acts as if the entire task is within an spmd block.
Communication between spmd workers (whether in an spmd job or an spmd block) occurs using the lab* functions (see Matlab help). Control of spmd workers is usually exerted by message passing and testing data values (e.g. using the worker with labindex of 1 to control the other workers).
A communicating job example:
/cm/shared/examples/matlab/distributed/communication
Note: Matlab will submit the job to SGE without the need to write a submission script:
matlab -nodisplay < communication.m
The examples in both the communicating and independent jobs sections submit the job then wait (block) until the job is complete, subsequently extracting the results and deleting the job, i.e.
cluster = parcluster('Cosine');
ijob = createJob(cluster);
....
submit(ijob);
wait(ijob, 'finished'); %Wait for the job to finish
results = getAllOutputArguments(ijob); %retrieve results
...
destroy(ijob); %destroy the job
In some situations this might not be desired - e.g. where the client is not allowed to run for long periods on the submit host. In such cases a non-blocking submit script should be used instead. The only difference from the communicating and independent scripts defined earlier is that a non-blocking job doesn't have the wait and destroy calls.
A non-blocking independent submit script:
/cm/shared/examples/matlab/distributed/independent/independent_noblock.m
A non-blocking communicating submit script:
/cm/shared/examples/matlab/distributed/communication/communication_noblock.m
Once the job has been completed, the results can be fetched programmatically:
cluster = parcluster('Cosine');
job = cluster.findJob('ID', 1);
job_output = fetchOutputs(job);
The ID used in cluster.findJob('ID', ...) above is the internal Matlab job ID as displayed at the end of the example non-blocking submit scripts, not the SGE job ID.
Once you have finished with the job, you can delete it using:
destroy(job);
The node cosine004 has two NVIDIA Tesla K40m GPU Computing Accelerators. Each card provides 2880 cores and 12GB of RAM. Each card has been set to exclusive mode, meaning only one process can access the GPU at a time.
The device names of these cards are /dev/nvidia0 and /dev/nvidia1.
A dedicated queue, gpu.q, has been created for these resources.
For interactive use, use qlogin and specify the queue:
qlogin -q gpu.q
For batch use, use qsub in the standard fashion, but specify the queue:
qsub -q gpu.q submit.sh
CUDA 7.5 tools are installed but must be loaded with the modules system. Typically you will load the toolkit and the GDK:
module load cuda75/toolkit/7.5.18
module load cuda75/gdk/352.79
NOTE: when compiling software with nvcc, there is a module conflict with gcc/5.1.0; remove this module to use the system gcc 4.8.5:
module unload gcc/5.1.0
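After unloading gcc/5.1.0, a typical nvcc compile step might look like this (my_program.cu is a hypothetical source file):
nvcc -o my_program my_program.cu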
A simple CUDA example can be found in the directory: /cm/shared/examples/cuda
The Cosine cluster is protected by a firewall which blocks access from off campus. Users need to establish a VPN connection or first ssh into a campus server before connecting to the submit node.
We recommend connecting using the VPN. However, if you cannot use the OSU VPN but are able to remote into a campus PC, you can SSH from the campus PC to access the cluster.
If you prefer to work locally when submitting jobs and accessing the cluster, you can also set up SSH multi-hopping.
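As a sketch, a single multi-hop login using OpenSSH's ProxyJump option might look like this ({campus_host} is a placeholder for any campus machine you can SSH into, and {username} is your username on each system):
ssh -J {username}@{campus_host} {username}@submit.hpc.cosine.oregonstate.edu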
If the following error appears when attempting to run qmon:
[root@cluster-submit ~]# qmon
Warning: Cannot convert string "-adobe-helvetica-medium-r-*--14-*-*-*-p-*-*-*" to type FontStruct
Warning: Cannot convert string "-adobe-helvetica-bold-r-*--14-*-*-*-p-*-*-*" to type FontStruct
Warning: Cannot convert string "-adobe-helvetica-medium-r-*--20-*-*-*-p-*-*-*" to type FontStruct
Warning: Cannot convert string "-adobe-helvetica-medium-r-*--12-*-*-*-p-*-*-*" to type FontStruct
Warning: Cannot convert string "-adobe-helvetica-medium-r-*--24-*-*-*-p-*-*-*" to type FontStruct
Warning: Cannot convert string "-adobe-courier-medium-r-*--14-*-*-*-m-*-*-*" to type FontStruct
Warning: Cannot convert string "-adobe-courier-bold-r-*--14-*-*-*-m-*-*-*" to type FontStruct
Warning: Cannot convert string "-adobe-courier-medium-r-*--12-*-*-*-m-*-*-*" to type FontStruct
Warning: Cannot convert string "-adobe-helvetica-medium-r-*--10-*-*-*-p-*-*-*" to type FontStruct
X Error of failed request: BadName (named color or font does not exist)
Major opcode of failed request: 45 (X_OpenFont)
Serial number of failed request: 525
Current serial number in output stream: 536
then you will need to install additional font packages on your machine.
For machines running Ubuntu:
apt-get install xfstt
service xfstt start
apt-get install xfonts-75dpi
xset +fp /usr/share/fonts/X11/75dpi
xset fp rehash
Hardware / Queue Assignment
NODE | CPU | CORES | RAM (GB) | Network | all.q | di.q | finch.q | hendrix-gpu.q | sun-gpu.q | gpu.q | lazzati.q | schneider.q | test.q |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
cosine001 | E5-2620 v3 @ 2.40GHz | 24 | 64 | 1 Gbps | 12 | | | | 1 | | | | 24 |
cosine002 | E5-2680 v3 @ 2.50GHz | 48 | 256 | 1 Gbps | 48 | | | | | | | | |
cosine003 | E5-2697 v2 @ 2.70GHz | 48 | 256 | 1 Gbps | 48 | | | | | | | | |
cosine004 | E5-2620 v4 @ 2.10GHz | 32 | 64 | 1 Gbps | 28 | | | 2 | | 2 | | | |
cosine005 | E5-2695 v4 @ 2.10GHz | 72 | 256 | 1 Gbps | 72 | | | | | | | | |
cosine006 | E5-2620 v4 @ 2.10GHz | 16 | 64 | 1 Gbps | 16 | | | | | | | | |
cosine007 | E5-2620 v4 @ 2.10GHz | 16 | 64 | 1 Gbps | 16 | | | | | | | | |
cosine008 | E5-2620 v4 @ 2.10GHz | 16 | 64 | 1 Gbps | 16 | | | | | | | | |
cosine009 | Silver 4216 CPU @ 2.10GHz | 64 | 96 | 1 Gbps | 64 | | | | | | | | |
cosine010 | Silver 4216 CPU @ 2.10GHz | 64 | 96 | 1 Gbps | 64 | | | | | | | | |
di001 | E5-2680 v3 @ 2.50GHz | 48 | 256 | 1 Gbps | | 48 | | | | | | | |
di002 | E5-2680 v3 @ 2.50GHz | 48 | 256 | 1 Gbps | | 48 | | | | | | | |
di003 | E5-2680 v3 @ 2.50GHz | 48 | 256 | 1 Gbps | | 48 | | | | | | | |
finch001 | E5-2630 v2 @ 2.60GHz | 24 | 128 | 1 Gbps | 12 | | 24 | | | | | | |
lazzati001 | E5-2695 v3 @ 2.30GHz | 48 | 128 | FDR | | | | | | | 56 | | |
lazzati002 | E5-2695 v3 @ 2.30GHz | 48 | 128 | FDR | | | | | | | 56 | | |
lazzati003 | E5-2695 v3 @ 2.30GHz | 48 | 128 | FDR | | | | | | | 56 | | |
lazzati004 | E5-2695 v3 @ 2.30GHz | 48 | 128 | FDR | | | | | | | 56 | | |
lazzati005 | Gold 5218 CPU @ 2.30GHz | 64 | 192 | FDR | | | | | | | 64 | | |
lazzati006 | Gold 5218 CPU @ 2.30GHz | 64 | 192 | FDR | | | | | | | 64 | | |
lazzati007 | Gold 5218 CPU @ 2.30GHz | 64 | 192 | FDR | | | | | | | 64 | | |
lazzati008 | Gold 5218 CPU @ 2.30GHz | 64 | 192 | FDR | | | | | | | 64 | | |
schneider001 | E5-2620 @ 2.00 GHz | 24 | 128 | 1 Gbps | 24 | | | | | | | 24 | |
schneider002 | E5-2630 v2 @ 2.60 GHz | 24 | 128 | 1 Gbps | 24 | | | | | | | 24 | |
TOTALS: | | 1064 | 3712 | | 416 | 144 | 24 | 2 | 1 | 2 | 480 | 48 | 24 |
Where cores are allocated to more than one queue, the investor queues take precedence during scheduling.
GPU Resources
cosine004: gpu.q - 2 NVIDIA Tesla K40m GPUs, each with 2880 cores and 12GB of RAM.
cosine004: hendrix-gpu.q - 2 NVIDIA Tesla K40c GPUs, each with 2880 cores and 12GB of RAM.
cosine001: sun-gpu.q - 1 NVIDIA Tesla K40m GPU with 2880 cores and 12GB of RAM.
If this error occurs, you need to load a newer version of gcc that has an updated libstdc++.so.6 library. In your submit script, add the following lines to switch from gcc 5.1.0 to 9.2.0:
module unload gcc/5.1.0
module load gcc/9.2.0
You can use the rclone command to copy data up to your OneDrive account. Do the following to set up rclone to access your account. NOTE: You will need to install rclone on your own local system in order to complete these steps.
Install rclone instructions: https://rclone.org/install/
To set up rclone, type the following commands at the command-line prompt after logging into the HPC cluster.
1. rclone config
No remotes found - make a new one
n) New remote
s) Set configuration password
q) Quit config
n/s/q> (Choose 'n' for new remote)
name> (Give the remote connection a name, like "onedrive" for example)
Next select Microsoft OneDrive from the list (the menu number may differ between rclone versions; it is 26 in this example).
Storage> 26
Client_id> (leave blank)
Client_secret> (leave blank)
region> 1 (Choose Microsoft Cloud Global)
Edit advanced config? (y/n) (Choose 'n' No)
Use auto config? (Choose No here, 'n')
This is where you need rclone installed on your own local system with a web browser. rclone will display a command that you need to run on your own local system:
rclone authorize "onedrive"
This should bring up your web browser to login to your OneDrive account using your [email protected] account and password. If it succeeds, it will display a HUGE token string on your local system. Copy and paste that into the prompt on the cluster command line:
Then paste the result below:
result>
Next choose option 1, for OneDrive Personal or Business.
Your choice> 1
Found 1 drives, please select the one you want to use:
0: OneDrive (business) id=b!s3dZbhXKhEmHiDj05mcVJBFcFFQRHa1YoycARRI6Y-l8asIyU
Chose drive to use:>0
Found drive 'root' of type 'business', URL: https://oregonstateuniversity-my.sharepoint.com/personal/{ONID}_oregonstate_edu/Documents
Is that okay?
y) Yes (default)
n) No
y/n> Y
y) Yes this is OK (default)
e) Edit this remote
d) Delete this remote
y/e/d> y
Current remotes:
Name Type
==== ====
onedrive onedrive
e) Edit existing remote
n) New remote
d) Delete remote
r) Rename remote
c) Copy remote
s) Set configuration password
q) Quit config
e/n/d/r/c/s/q> q
That's it, you are now set up to use rclone to copy files to your OneDrive account.
List files on OneDrive:
rclone ls onedrive:/
Copy a file up to OneDrive:
rclone copy file_name onedrive:/
Sync a folder up to OneDrive:
rclone sync /path/to/local/dir onedrive: --progress
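Copying in the other direction works the same way; for example, to pull a folder back down from OneDrive (the folder and destination path are placeholders):
rclone copy onedrive:/folder_name /path/to/local/dir --progress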