Significant Projects¶
Genomics in the Cloud (AWS/EC2) - 2020-01 - 2020-09¶
Goal: Proof of concept to test the feasibility and cost of running various HPC workloads in the cloud.
Collaborative project with the Amazon team (NYC) and Weill Cornell Medicine (NYC).
Technology Stack: AWS EC2, AWS S3, Docker, CWL, Anaconda/Bioconda, Terraform and CloudFormation
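The core of the proof of concept was scripted instance launches running containerised CWL workflows. Below is a minimal sketch of that pattern, assuming boto3; the AMI, instance type, container image and bucket names are placeholders, not the project's real resources.

```python
"""Sketch of the PoC pattern: launch a spot EC2 instance that runs a
containerised CWL workflow, syncs results to S3, then shuts itself down.
All identifiers below are placeholders, not the project's real resources."""
import boto3

USER_DATA = """#!/bin/bash
# Run a CWL pipeline inside Docker, push results to S3, then stop billing.
docker run --rm -v /data:/data quay.io/example/cwl-runner:latest cwltool variant-calling.cwl job.yml
aws s3 sync /data/results s3://example-genomics-poc/results/
shutdown -h now
"""

ec2 = boto3.client("ec2", region_name="us-east-1")
resp = ec2.run_instances(
    ImageId="ami-0123456789abcdef0",               # placeholder AMI with Docker + AWS CLI
    InstanceType="r5.4xlarge",                     # memory-heavy type suits genomics tools
    MinCount=1,
    MaxCount=1,
    InstanceMarketOptions={"MarketType": "spot"},  # spot pricing was part of the cost test
    UserData=USER_DATA,
)
print("Launched", resp["Instances"][0]["InstanceId"])
```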
Modernise HPC Infrastructure - 2014-01 - 2017-08¶
Goals:
- Minimise toolset and standardise OS using a single universal base image.
- Simplify node management and make node roles more flexible.
Summary:
This was a large and complex project. I started it on my own, but was thankfully joined by Sam a year in. Sam brought with him the idea of running the entire Slurm cluster inside Docker, which simplified the code base, let us deploy Slurm clusters as elegantly as any other web app, and allowed us to manage our HPC workloads with our standard Docker toolset.
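To give a feel for the approach (not our actual deployment code), here is a minimal sketch of a Slurm cluster run as Docker containers: fixed hostnames on a shared network so the static node list in slurm.conf stays valid. The image and network names are made up, and the image is assumed to already ship slurm.conf and the munge key.

```python
"""Minimal sketch of 'Slurm cluster in Docker': one controller and a few
workers on a shared network with fixed hostnames. Everything here is
illustrative; the image is assumed to contain slurm.conf and the munge key."""
import subprocess

NETWORK = "slurm-net"
IMAGE = "example/slurm-base:latest"             # hypothetical universal base image
WORKERS = [f"node{i:02d}" for i in range(1, 5)]

# The network may already exist, so don't fail on re-runs.
subprocess.run(["docker", "network", "create", NETWORK], check=False)

# Controller runs slurmctld; a fixed --hostname keeps the static config valid.
subprocess.run(["docker", "run", "-d", "--name", "slurmctl", "--hostname", "slurmctl",
                "--network", NETWORK, IMAGE, "slurmctld", "-D"], check=True)

# Workers run slurmd, each pinned to the node name Slurm expects.
for node in WORKERS:
    subprocess.run(["docker", "run", "-d", "--name", node, "--hostname", node,
                    "--network", NETWORK, IMAGE, "slurmd", "-D", "-N", node],
                   check=True)
```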
Challenges:
- With nearly every series of Mellanox ConnectX InfiniBand adapter ever produced in play (ConnectX-2, ConnectX-3, ConnectX-3 Pro, ConnectX-4 and ConnectX-5), I had to write code to compile custom kernels for six different hardware types (see the sketch after this list).
- Slurm did not play well with the dynamic naming so common with containers. Getting Slurm to work inside Docker was no joke, mainly because so much of its configuration had to be hardcoded at the time. It was a mammoth scripting challenge, which we both ultimately enjoyed.
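A rough sketch of the hardware-detection side of those per-adapter builds: parse lspci for the Mellanox device and map it to a build profile. The profile names and the mapping are illustrative only, not the exact table we used.

```python
"""Sketch: pick a driver/kernel build profile from the ConnectX series
reported by lspci. Profile names and the mapping are illustrative only."""
import subprocess

# Order matters: 'ConnectX-3 Pro' must be checked before plain 'ConnectX-3'.
SERIES_TO_PROFILE = {
    "ConnectX-3 Pro": "ofed-cx3pro",
    "ConnectX-2": "ofed-legacy-cx2",
    "ConnectX-3": "ofed-cx3",
    "ConnectX-4": "ofed-cx4",
    "ConnectX-5": "ofed-cx5",
}

def detect_profile() -> str:
    lspci = subprocess.run(["lspci"], capture_output=True, text=True, check=True).stdout
    for line in lspci.splitlines():
        if "Mellanox" not in line:
            continue
        for series, profile in SERIES_TO_PROFILE.items():
            if series in line:
                return profile
    raise RuntimeError("no supported Mellanox ConnectX adapter found")

if __name__ == "__main__":
    print(detect_profile())   # feeds the kernel/driver build selection
```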
DIaaS4R - Dynamic Infrastructure-as-a-Service for Research - 2011-09 - 2013-01¶
Goal: Reduce the cost and time needed for research teams to acquire IT resources for grant-funded research.
- Build an internal S/IaaS cloud for self-service deployment of scientific applications.
- oVirt KVM + IBM GPFS filesystem on DDN GRIDScaler storage
- Deployment and Orchestration using images, Cobbler and Puppet.
- Code, Bug and Task management using Redmine and Git
- Pre-defined templates for common project apps, such as Wiki, Git, R/Perl/Ruby and Python (Bioconda), file share, LaTeX publishing, forum, Q&A and project management.
Summary:
This system drastically reduced the time and cost associated with IT resource procurement and deployment. Users could now deploy their servers and apps on the fly, almost instantly, a significant benefit for financially thin, time-constrained projects funded with research grant money.
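For flavour, here is a sketch of what the self-service step amounts to: cloning a pre-defined application template into a per-project VM. It is written against today's oVirt Python SDK (ovirtsdk4) rather than the tooling we ran at the time, and all names are placeholders.

```python
"""Sketch of self-service deployment: clone a pre-defined application template
into a new project VM via the oVirt API. Engine URL, credentials, cluster and
template names are placeholders."""
import ovirtsdk4 as sdk
import ovirtsdk4.types as types

connection = sdk.Connection(
    url="https://ovirt.example.org/ovirt-engine/api",  # placeholder engine URL
    username="admin@internal",
    password="changeme",
    insecure=True,                                     # lab setting; use a CA bundle in production
)

vms_service = connection.system_service().vms_service()
vms_service.add(
    types.Vm(
        name="grant-x-bioconda",                              # per-project VM name
        cluster=types.Cluster(name="research"),               # placeholder cluster
        template=types.Template(name="tpl-python-bioconda"),  # one of the pre-defined app templates
    )
)
connection.close()
```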
HPC Farm for Biomedical Research in Genetics - 2012-01 - 2012-09¶
Goal: Build an HPC CPU/GPU compute cluster and configure it for parallel computing.
- 20 Servers, 5TB RAM, 448 (896 Logical) Intel Sandy-Bridge Cores
- 92 CUDA® parallel processing cores (Nvidia Tesla 2090)
- 5 Massive Memory Servers (512GB Each) for Gene Analysis
- 1.2 PB of Storage in Single GPFS Parallel FS (Scalable to 20PB)
- High Throughput SMB/WebDAV/NFS/SSHFS Fileshare
- Mellanox 40 Gbps QDR InfiniBand storage and MPI interconnect
Toolset:
- Bright Computing Cluster Manager (bare-metal provisioning and monitoring)
- Authentication, rights and file sharing: MS AD and SSSD (RFC 2307 schema), Samba 4, NFS-Ganesha and nginx, clustered and balanced with CTDB
- HPC batch: Slurm, OpenMPI/MPICH2, Globus GridFTP
- App compilation and toolchaining: Anaconda/Bioconda, environment-modules, EasyBuild, Intel C Compiler Suite
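Day-to-day use of that batch layer looked roughly like the sketch below: a short sbatch script loading a toolchain module and launching an MPI job. Partition, module and binary names are placeholders.

```python
"""Sketch of a typical submission on this cluster: generate a small sbatch
script for an MPI job and hand it to Slurm. All names are placeholders."""
import subprocess
import tempfile

JOB_SCRIPT = """#!/bin/bash
#SBATCH --job-name=mpi-example
#SBATCH --partition=general          # placeholder partition
#SBATCH --nodes=4
#SBATCH --ntasks-per-node=16
module load openmpi                  # toolchain via environment-modules
srun ./hello_mpi                     # MPI binary built with the cluster toolchain
"""

with tempfile.NamedTemporaryFile("w", suffix=".sbatch", delete=False) as handle:
    handle.write(JOB_SCRIPT)
    script_path = handle.name

subprocess.run(["sbatch", script_path], check=True)
```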
Challenges:
- Refactoring our code to be compatible with Slurm and the toolchains.
- Communicating and getting consensus for some very long outage windows.
- Practically zero prior experience with the GPFS file system, SGE/UGE, Slurm and MPI.
- With such a steep learning curve, I ended up providing one-on-one training for almost 30 people, even though I had only just learnt it all myself.