Digital Research & Infrastructure


There are a couple of options for accessing the HPC cluster. The first method, and the required method for first-time access, is to SSH into submit.hpc.cosine.oregonstate.edu using your cosine/science username and password. When you SSH into the cluster for the first time, a process is initiated to create and set up your home folder. The second option, available after you have successfully logged into the HPC cluster with SSH, is to use the web interface:

https://ondemand.science.oregonstate.edu
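
For the SSH method, a first connection from a campus network looks like this ({ONID} is a placeholder for your username):

ssh {ONID}@submit.hpc.cosine.oregonstate.edu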

If connecting from off campus, please refer to https://oregonstate.teamdynamix.com/TDClient/1935/Portal/KB/?CategoryID=6889

Join the Cosine HPC mailing list for notifications about software updates and maintenance. Visit http://lists.science.oregonstate.edu/mailman/listinfo/cosine-hpc

The command line tool smbclient can be used to access Windows shares easily. Using smbclient, a remote Windows share can be listed and navigated, and files can be uploaded or deleted. The smbclient command also provides an interactive shell very similar to FTP or SFTP.

For example, connecting to your Z: drive:

smbclient //pfs.cosine.oregonstate.edu/{ONID} -U {ONID}

To list all shares:

smbclient -L {file_server_name}

For interactive shell:

smbclient //{server_name}/{share} -U {username/ONID}

Connecting to a remote share opens a new shell via smbclient, much like FTP/SFTP. This shell can be used to navigate the share and to list, upload, and download files.

To list files/folders on the share:

smb: \> ls

To list files/folders on the local system:

smb: \> !ls

To change directory on the share:

smb: \> cd {folder_name}

To change directory on the local system:

smb: \> lcd {folder_name}

To download a file from the share:

smb: \> get {file_name}

To upload a file to the share:

smb: \> put {file_name}

Files and folders can be uploaded with the mput command. To upload a folder and its contents, recursive mode must first be enabled with the recurse command; the upload is then started with mput.

smb: \> recurse

smb: \> mput {folder_name}

Files and folders can also be downloaded, with the mget command. If there are multiple files and folders to download, enable recursive mode first with the recurse command.

smb: \> recurse

smb: \> mget {folder_name}

To end your session, type "quit" at the interactive shell prompt:

smb: \> quit
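
smbclient can also run a single command non-interactively with the -c option, which is useful in scripts. A minimal sketch, assuming you want to upload a file named results.txt to your Z: drive:

smbclient //pfs.cosine.oregonstate.edu/{ONID} -U {ONID} -c 'put results.txt'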

Yes, you can mount your Z: drive on the cluster using sshfs:

  1. Create a directory in your home folder to serve as the mount point.
  2. Run sshfs shell.cosine.oregonstate.edu:{folder_name} {mount_point} to mount the share.
  3. For performance reasons, please copy files from the Z: drive to your cluster home folder before submitting your job to the cluster.

Example

mkdir ~/my_zdrive

sshfs shell.cosine.oregonstate.edu: ~/my_zdrive
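
Once the share is mounted, copying input data to your cluster home folder (as recommended in step 3) is an ordinary cp; the file name here is only illustrative:

cp ~/my_zdrive/input_data.csv ~/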

NOTE: Be sure to unmount your Z: drive after you are finished, using the following command:

fusermount -u ~/my_zdrive

The cluster uses the environment modules system to provide an easy way to switch between software revisions. These modules configure environment variables such as PATH for each piece of software.

To get a list of available modules to load, execute:

module avail

To get a list of what modules are currently loaded, execute:

module list

To load the Matlab 2014b module, execute:

module load matlab/R2014b

To display modules set to load during login:

module initlist

To set a module to automatically load during login:

module initadd matlab/R2014b

To remove a module from loading during login:

module initrm matlab/R2014b

Jobs should be submitted using a special shell script which tells the scheduler how to handle the job.

An example with common options can be seen below: 

 submit.sh

#!/bin/sh

# Give the job a name
#$ -N example_job

# set the shell
#$ -S /bin/sh

# set working directory on all hosts to
# directory where the job was started
#$ -cwd

# send all process STDOUT (fd 1) to this file
#$ -o job_output.txt

# send all process STDERR (fd 2) to this file
#$ -e job_output.err

# email information
#$ -m e
 
# Just change the email address. You will be emailed when the job has finished.
#$ -M [email protected]

# generic parallel environment with 2 cores requested
#$ -pe orte 2

# Load a module, if needed
module load sprng/5

# Commands
./my_program

Ensure your script or program is executable by running the following command:

chmod +x my_program

To submit your job to the HPC cluster scheduler, type the following command:

qsub submit.sh

The qstat command is used to check the status of jobs on the cluster. By itself it will show a brief overview:

qstat

To show the status of all nodes and queued processes, execute

qstat -u '*'

The state codes that are displayed in the last column of qstat are as follows:

Category   State                                            SGE Letter Code
Pending    pending                                          qw
           pending, user hold                               qw
           pending, system hold                             hqw
           pending, user and system hold                    hqw
           pending, user hold, re-queue                     hRwq
           pending, system hold, re-queue                   hRwq
           pending, user and system hold, re-queue          hRwq
Running    running                                          r
           transferring                                     t
           running, re-submit                               Rr
           transferring, re-submit                          Rt
Suspended  job suspended                                    s, ts
           queue suspended                                  S, tS
           queue suspended by alarm                         T, tT
           all suspended with re-submit                     Rs, Rts, RS, RtS, RT, RtT
Error      all pending states with error                    Eqw, Ehqw, EhRqw
Deleted    all running and suspended states with deletion   dr, dt, dRr, dRt, ds, dS, dT, dRs, dRS, dRT

 

If the job is currently running:

qstat -j <jobId>

After the job has finished:

qacct -j <jobId>

The command to run is:

qdel <job id of process>

This will remove a job from the queue. The job ID is supplied as an argument to qdel. If the job is in a dr state, the -f flag must be used to force the job to stop.
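
For example, to force removal of a job stuck in the dr state:

qdel -f <job id of process>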

Nodes in the all.q have mixed memory sizes. To ensure that a job lands on a node with sufficient memory, the mem_free resource can be used.

For example, to execute on nodes with at least 60GB of RAM available:

qsub -l mem_free=60G submit.sh
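
Alternatively, the same resource request can be embedded in the submit script as a scheduler directive, so it travels with the job:

#$ -l mem_free=60G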

 

First you will need to create a submission file. Place the following code into the submission file (i.e. submit.sh):

#!/bin/sh
# Give the job a name
#$ -N JOB_NAME
# set the shell
#$ -S /bin/sh
# set working directory on all hosts to directory where the job was started
#$ -cwd
# send all ERROR messages to this file
#$ -e errors.txt
# Change the email address to YOUR email, and you will be emailed when the job has finished.
#$ -m e
#$ -M [email protected]
# Ask for 1 core, as R can only use 1 core for processing
#$ -pe orte 1
# Load the R Module
module load R
# Commands to run job
Rscript inputFile.r > outputFile.out

Then you can submit the job to the cluster with:

qsub submit.sh

Some example submit files for R can be found in the following folder /cm/shared/examples/R.

You can download and install any R libraries which you might need on the cluster into your home directory and simply use them from there. These instructions give you the steps to accomplish this.

 1. Load R module

module load R

2. Launch R

R

3. Type in the command to install the desired R package

install.packages("package_name")

If this is the first time running the install.packages() command, you will be asked if you want to create a personal library. Answer 'y'. Follow the prompts to pick a mirror, etc.

R will download and install the package into the newly created personal library (in your home directory). To use the new package, load it with the library command like any other installed package:

library(package_name)
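
If you need to script an installation (for example from a submit script), the same install can be done non-interactively. A hedged sketch, assuming the cloud CRAN mirror is acceptable:

Rscript -e 'install.packages("package_name", repos="https://cloud.r-project.org")'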

Create a submission file (i.e. submit.sh) and place the following code inside:

#!/bin/sh
# Give the job a name
#$ -N JOB_NAME
# set the shell
#$ -S /bin/sh
# set working directory on all hosts to directory where the job was started
#$ -cwd
# send all ERROR messages to this file
#$ -e errors.txt
# Change the email address to YOUR email, and you will be emailed when the job has finished.
#$ -m e
#$ -M [email protected]
# Use 4 cores for processing
#$ -pe orte 4
# Load the Gaussian Module
module load gaussian/g16
# Commands to run job
g16 < inputFile.com > outputFile.out

Then submit the job to the cluster:

qsub submit.sh

Some Gaussian examples can be found in the folder /cm/shared/examples/g09.

Thanks to a campus agreement with MathWorks, Matlab Distributed Computing Server is available and installed on the Cosine cluster. In order to use Matlab, the module must be loaded:

module avail matlab
module load matlab/R2019b

Interactive Matlab

Interactive Matlab sessions can be run in text-only mode. To run Matlab in text-only mode, run:

matlab -nodisplay

To run a text-only Matlab session on a compute node, do the following steps:

1. Login to the cluster using ssh.

2. Start a session on a node in the cluster using:

qlogin -pe orte <number of cores needed>

3. From within the qlogin session load the matlab module and then run:

module load matlab

matlab -nodisplay

 

Array jobs should be used when the job does not require any synchronization between tasks. The script will be launched multiple times, with a varying index. The index is accessible via the environment variable SGE_TASK_ID.

Typical uses of array jobs would include:

  • processing a set of input files with each job processing a different file;
  • processing a single large file using multiple jobs each of which processes a section of the file;
  • examining the performance of a model using multiple sets of model parameters.

An example can be found on the cluster in:

/cm/shared/examples/matlab/array
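
As a rough sketch of what an array submit script can look like (my_program and the input file naming are illustrative, not taken from that example):

#!/bin/sh
# Give the job a name
#$ -N array_example
# set the shell
#$ -S /bin/sh
# run from the directory where the job was started
#$ -cwd
# run 10 tasks; SGE_TASK_ID takes the values 1 through 10
#$ -t 1-10
# each task processes its own input file
./my_program input_${SGE_TASK_ID}.dat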

Rather than submitting SGE jobs that execute Matlab scripts on the cluster nodes, distributed jobs launch tasks on cluster nodes from within Matlab. Distributed jobs require the cluster to be configured within Matlab, and submission scripts which define how tasks should be launched on cluster nodes. The submission of the jobs is performed via Matlab GUI or command line interface.

In order to run distributed jobs, you should:

  1. Configure Matlab to use the cluster, either using a cluster profile or programmatically
  2. Create an independent and/or communicating job submission script
  3. Submit (run) your job

MATLAB CLUSTER PROFILES

Using GUI configuration utility

In order to configure it, start the Matlab GUI and then go to Parallel -> Manage Cluster Profiles.

A new window will pop up. In the new window, click Add -> Custom -> Generic.

A new profile will be created. Rename it to something sensible (you will be referring to it in your code). Let's call it Cosine.

Next, make sure you have provided the following info in the Properties tab (leaving all of the other options at their defaults):

Main Properties
  • Description of this cluster: Cosine HPC
  • Folder where cluster stores job data: use default (unless you want to specify an alternative location)
  • Number of workers available to the cluster: 32
  • Root folder of MATLAB installation for workers: use default
  • Cluster uses MathWorks hosted licensing: false

Submit Functions
  • Function called when submitting independent jobs: @independentSubmitFcn
  • Function called when submitting communicating jobs: @communicatingSubmitFcn

Cluster Environment
  • Cluster nodes' operating system: Unix
  • Job storage location is accessible from client and cluster nodes: yes

Workers
  • Range of number of workers to run the job: [1 32]

Jobs and task functions
  • Function to query cluster about the job state: @getJobStateFcn
  • Function to manage cluster when you call delete on a job: @deleteJobFcn

Note that once the profile has been loaded, you can override its settings from the submission script.

Once the profile has been set up, click OK. Next, select the newly created profile and validate the configuration.

You can import a profile using either the Cluster Profile Manager or the Matlab parallel.importProfile(filename) command.

parallel.importProfile('/cm/shared/examples/matlab/distributed/Cosine.settings');

To import settings from the Cluster Profile Manager, use:

  • Parallel -> Manage Cluster Profiles
  • Add -> Import
  • and select the appropriate settings file.

Programmatically

Rather than using a previously defined cluster profile, the cluster details can be configured ad-hoc in a .m script file:

cluster = parallel.cluster.Generic();
cluster.NumWorkers = 32;
cluster.JobStorageLocation = '/homes/cosine/helpdesk/matlab/';
cluster.IndependentSubmitFcn = @independentSubmitFcn;
cluster.CommunicatingSubmitFcn = @communicatingSubmitFcn;
cluster.OperatingSystem = 'unix';
cluster.HasSharedFilesystem = true;
cluster.GetJobStateFcn = @getJobStateFcn;
cluster.DeleteJobFcn = @deleteJobFcn;
cluster.RequiresMathWorksHostedLicensing = false;

To save the cluster definition as a profile for later re-use, use:

cluster.saveAsProfile('Cosine')

To load a previously saved cluster definition, use:

cluster = parcluster('Cosine')

Passing Additional Parameters to SGE

If you want to pass additional arguments to SGE, specify the submit function as {@communicatingSubmitFcn, 'list_of_additional_qsub_parameters'}.

e.g. to specify that 5GB of memory should be requested, and that emails should be sent to [email protected] at the beginning and end of the job, the submit function should be specified as:

cluster = parcluster('Cosine');
cluster.CommunicatingSubmitFcn = {@communicatingSubmitFcn, '-l h_vmem=5G -m be -M [email protected]'};
pp = parpool(cluster);
parfor i=1:10
        % capture the command output (second return value), not the exit status
        [~, hn] = system('hostname');
        disp(hn);
end
...
delete(pp)

Make sure that the options you pass to the qsub command are syntactically correct, otherwise the job will fail (see the qsub man page for the list of available options).

An independent job is defined as follows (from http://www.mathworks.co.uk/help/distcomp/program-independent-jobs.html):

An Independent job is one whose tasks do not directly communicate with each other, that is, the tasks are independent of each other. The tasks do not need to run simultaneously, and a worker might run several tasks of the same job in succession. Typically, all tasks perform the same or similar functions on different data sets in an embarrassingly parallel configuration.

Independent jobs are created using the Matlab createJob() function.

An independent job example:

/cm/shared/examples/matlab/distributed/independent

Note: Matlab will submit the job to SGE without the need to write a submission script:

matlab -nodisplay < independent.m 

A communicating job is defined as follows (from http://www.mathworks.co.uk/help/distcomp/introduction.html):

Communicating jobs are those in which the workers can communicate with each other during the evaluation of their tasks. A communicating job consists of only a single task that runs simultaneously on several workers, usually with different data. More specifically, the task is duplicated on each worker, so each worker can perform the task on a different set of data, or on a particular segment of a large data set. The workers can communicate with each other as each executes its task. The function that the task runs can take advantage of a worker's awareness of how many workers are running the job, which worker this is among those running the job, and the features that allow workers to communicate with each other.

Communicating jobs are required for:

  • parfor loops which allow multiple loop iterations to be executed in parallel;
  • spmd blocks which run a single program on multiple data - i.e. the same program runs on all workers with behaviour determined by the varying data on each worker.

Communicating jobs are created using the Matlab createCommunicatingJob() function and can have a Type of either pool or spmd:

  • pool job runs the supplied task on one worker and uses the remaining workers as a pool to execute parfor loops, spmd blocks etc., the total number of workers available for parallel code is therefore one less than the total number of workers;
  • spmd job runs the supplied task on all the workers, with no task fundamentally in control - effectively, an spmd job acts as if the entire task is within an spmd block.

Communication between spmd workers (whether in an spmd job or spmd block) occurs using the lab* functions (see Matlab help). Control of spmd workers is usually exerted by message passing and testing data values (e.g. using the worker with labindex of 1 to control other workers).

A communicating job example:

/cm/shared/examples/matlab/distributed/communication

Note: Matlab will submit the job to SGE without the need to write a submission script:

matlab -nodisplay < communication.m

The examples in both the communicating and independent jobs sections submit the job then wait (block) until the job is complete, subsequently extracting the results and deleting the job, i.e.

cluster = parcluster('Cosine');
ijob = createJob(cluster);
....
submit(ijob);
wait(ijob, 'finished');  %Wait for the job to finish
results = getAllOutputArguments(ijob); %retrieve results
...
destroy(ijob); %destroy the job

In some situations this might not be desired - e.g. where the client is not allowed to run for long times on the submit host. In such cases a non-blocking submit script should be used instead. The only difference from the communicating and independent scripts defined earlier is that a non-blocking job doesn't have the wait and destroy calls.

A non-blocking independent submit script:

/cm/shared/examples/matlab/distributed/independent/independent_noblock.m 

A non-blocking communicating submit script:

/cm/shared/examples/matlab/distributed/communication/communication_noblock.m 

Once the job has been completed, the results can be fetched programmatically:

cluster = parcluster('Cosine');
job = cluster.findJob('ID',1);
job_output = fetchOutputs(job);

 The ID used in cluster.findJob('ID', ...) above is the internal Matlab job ID as displayed at the end of the example non-blocking submit scripts, not the SGE job ID.

Once you have finished with the job, you can delete it using:

destroy(job);

The node cosine004 has two NVIDIA Tesla K40m GPU Computing Accelerators. Each card provides 2880 cores and 12GB of RAM. Each card has been set to exclusive mode, meaning only one process can access the GPU at a time.

The device names of these cards are /dev/nvidia0 and /dev/nvidia1. 

A dedicated queue, gpu.q has been created for these resources. 

For interactive use, use qlogin and specify the queue:

 qlogin -q gpu.q

For batch use, use qsub in the standard fashion, but specify the queue:

qsub -q gpu.q submit.sh

CUDA 7.5 tools are installed, but must be loaded with the modules system. Typically you will include the toolkit and the gdk:

module load cuda75/toolkit/7.5.18
module load cuda75/gdk/352.79

NOTE: when compiling software with nvcc, there is a module conflict with gcc/5.1.0; remove this module to use the system gcc 4.8.5:

module unload gcc/5.1.0

A simple CUDA example can be found in the directory: /cm/shared/examples/cuda
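
As a batch sketch tying these pieces together (my_cuda_app stands in for your compiled binary):

#!/bin/sh
# Give the job a name
#$ -N gpu_example
# set the shell
#$ -S /bin/sh
#$ -cwd
# run in the dedicated GPU queue
#$ -q gpu.q
# load the CUDA tools
module load cuda75/toolkit/7.5.18
./my_cuda_app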

The Cosine cluster is protected by a firewall which blocks access from off campus. Users need to establish a VPN connection or first ssh into a campus server before connecting to the submit node.

We recommend connecting using the VPN. However, if you cannot use the OSU VPN but are able to remote into a campus PC, you can SSH from the campus PC to access the cluster.
If you prefer to work locally when submitting jobs and accessing the cluster, you can also set up SSH multi-hopping.
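
As a hedged sketch of multi-hopping with OpenSSH's ProxyJump option ({campus_host} is a placeholder for any campus machine you can SSH into):

ssh -J {ONID}@{campus_host} {ONID}@submit.hpc.cosine.oregonstate.edu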

If you run into an issue where:

[root@cluster-submit ~]# qmon
Warning: Cannot convert string "-adobe-helvetica-medium-r-*--14-*-*-*-p-*-*-*" to type FontStruct
Warning: Cannot convert string "-adobe-helvetica-bold-r-*--14-*-*-*-p-*-*-*" to type FontStruct
Warning: Cannot convert string "-adobe-helvetica-medium-r-*--20-*-*-*-p-*-*-*" to type FontStruct
Warning: Cannot convert string "-adobe-helvetica-medium-r-*--12-*-*-*-p-*-*-*" to type FontStruct
Warning: Cannot convert string "-adobe-helvetica-medium-r-*--24-*-*-*-p-*-*-*" to type FontStruct
Warning: Cannot convert string "-adobe-courier-medium-r-*--14-*-*-*-m-*-*-*" to type FontStruct
Warning: Cannot convert string "-adobe-courier-bold-r-*--14-*-*-*-m-*-*-*" to type FontStruct
Warning: Cannot convert string "-adobe-courier-medium-r-*--12-*-*-*-m-*-*-*" to type FontStruct
Warning: Cannot convert string "-adobe-helvetica-medium-r-*--10-*-*-*-p-*-*-*" to type FontStruct
X Error of failed request: BadName (named color or font does not exist)
Major opcode of failed request: 45 (X_OpenFont)
Serial number of failed request: 525
Current serial number in output stream: 536

appears when attempting to run qmon, then you will need to install font packages on your machine.

For machines running Ubuntu:

apt-get install xfstt
service xfstt start
apt-get install xfonts-75dpi
xset +fp /usr/share/fonts/X11/75dpi
xset fp rehash
HARDWARE QUEUE ASSIGNMENT
NODE          CPU                          CORES  RAM (GB)  NETWORK  QUEUE SLOTS
cosine001     E5-2620 v3 @ 2.40GHz         24     64        1 Gbps   all.q 12, sun-gpu.q 1, test.q 24
cosine002     E5-2680 v3 @ 2.50GHz         48     256       1 Gbps   all.q 48
cosine003     E5-2697 v2 @ 2.70GHz         48     256       1 Gbps   all.q 48
cosine004     E5-2620 v4 @ 2.10GHz         32     64        1 Gbps   all.q 28, hendrix-gpu.q 2, gpu.q 2
cosine005     E5-2695 v4 @ 2.10GHz         72     256       1 Gbps   all.q 72
cosine006     E5-2620 v4 @ 2.10GHz         16     64        1 Gbps   all.q 16
cosine007     E5-2620 v4 @ 2.10GHz         16     64        1 Gbps   all.q 16
cosine008     E5-2620 v4 @ 2.10GHz         16     64        1 Gbps   all.q 16
cosine009     Silver 4216 CPU @ 2.10GHz    64     96        1 Gbps   all.q 64
cosine010     Silver 4216 CPU @ 2.10GHz    64     96        1 Gbps   all.q 64
di001         E5-2680 v3 @ 2.50GHz         48     256       1 Gbps   di.q 48
di002         E5-2680 v3 @ 2.50GHz         48     256       1 Gbps   di.q 48
di003         E5-2680 v3 @ 2.50GHz         48     256       1 Gbps   di.q 48
finch001      E5-2630 v2 @ 2.60GHz         24     128       1 Gbps   all.q 12, finch.q 24
lazzati001    E5-2695 v3 @ 2.30GHz         48     128       FDR      lazzati.q 56
lazzati002    E5-2695 v3 @ 2.30GHz         48     128       FDR      lazzati.q 56
lazzati003    E5-2695 v3 @ 2.30GHz         48     128       FDR      lazzati.q 56
lazzati004    E5-2695 v3 @ 2.30GHz         48     128       FDR      lazzati.q 56
lazzati005    Gold 5218 CPU @ 2.30GHz      64     192       FDR      lazzati.q 64
lazzati006    Gold 5218 CPU @ 2.30GHz      64     192       FDR      lazzati.q 64
lazzati007    Gold 5218 CPU @ 2.30GHz      64     192       FDR      lazzati.q 64
lazzati008    Gold 5218 CPU @ 2.30GHz      64     192       FDR      lazzati.q 64
schneider001  E5-2620 @ 2.00GHz            24     128       1 Gbps   all.q 24, schneider.q 24
schneider002  E5-2630 v2 @ 2.60GHz         24     128       1 Gbps   all.q 24, schneider.q 24
TOTALS                                     1064   3712               all.q 416, di.q 144, finch.q 24, hendrix-gpu.q 2, sun-gpu.q 1, gpu.q 2, lazzati.q 480, schneider.q 48, test.q 24


Where cores are allocated to more than one queue the investor queues take precedence during scheduling.

GPU Resources

cosine004 (gpu.q): 2 NVIDIA Tesla K40m GPUs, each with 2880 cores and 12GB of RAM.

cosine004 (hendrix-gpu.q): 2 NVIDIA Tesla K40c GPUs, each with 2880 cores and 12GB of RAM.

cosine001 (sun-gpu.q): 1 NVIDIA Tesla K40m GPU with 2880 cores and 12GB of RAM.

If this error occurs, you need to load a newer version of gcc that has an updated libstdc++.so.6 library. In your submit script, add the following lines to switch from gcc 5.1.0 to 9.2.0:

module unload gcc/5.1.0
module load gcc/9.2.0

You can use the rclone command to copy data up to your OneDrive account. Do the following to set up rclone to access your account. NOTE: You will need to install rclone on your own local system in order to complete these steps.

Install rclone instructions: https://rclone.org/install/

To set up rclone, type the following commands at the command line prompt after logging into the HPC cluster.

1. rclone config

No remotes found - make a new one
n) New remote
s) Set configuration password
q) Quit config
n/s/q> (Choose 'n' for new remote)

name>  (Give it the remote connection a name, like "onedrive" for example)

Next select Microsoft OneDrive from the list (the menu number below may differ between rclone versions).

Storage> 26

Client_id> (leave blank)

Client_secret> (leave blank)

region> 1 (Choose Microsoft Cloud Global)

Edit advanced config? (y/n) (Choose 'n' No)

Use auto config? (Choose No here, 'n')

This is where you need rclone installed on your own local system with a web browser. The cluster prompt will display a command that you will need to run on your local system:

rclone authorize "onedrive"

This should bring up your web browser so you can log in to your OneDrive account using your [email protected] account and password. If it succeeds, it will display a very long token string on your local system. Copy and paste that into the prompt on the cluster command line:

Then paste the result below:
result>

Next choose option 1, for OneDrive Personal or Business.

Your choice> 1

Found 1 drives, please select the one you want to use:
0: OneDrive (business) id=b!s3dZbhXKhEmHiDj05mcVJBFcFFQRHa1YoycARRI6Y-l8asIyU
Chose drive to use:>0

Found drive 'root' of type 'business', URL: https://oregonstateuniversity-my.sharepoint.com/personal/{ONID}_oregonstate_edu/Documents
Is that okay?
y) Yes (default)
n) No
y/n> Y

y) Yes this is OK (default)
e) Edit this remote
d) Delete this remote
y/e/d> y

Current remotes:

Name                 Type
====                 ====
onedrive             onedrive

e) Edit existing remote
n) New remote
d) Delete remote
r) Rename remote
c) Copy remote
s) Set configuration password
q) Quit config
e/n/d/r/c/s/q> q

That's it; you are now set to use rclone to copy files to your OneDrive account.

List files on OneDrive:

rclone ls onedrive:/

Copy a file up to OneDrive:

rclone copy file_name onedrive:/

Sync a local folder up to OneDrive:

rclone sync /path/to/local/dir onedrive: --progress
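
Copying works in the other direction as well; for example, to pull a folder from OneDrive down to the cluster (the folder names are illustrative):

rclone copy onedrive:/results ~/results --progress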