It’s been one month since I gained access to an HPC cluster, and I’ve finally cracked the complex setup for using my local PC as a broker to communicate with the HPC node via MQTT (as a subscriber or publisher). The process was far from straightforward (painful, by the way 😭), but I’m thrilled to have succeeded! I’ve documented the entire setup in “A Guide to VS Code, MQTT, and SSH on HPC Clusters” to help others navigate the challenge.
However, working with an HPC cluster has revealed some unexpected hurdles. Unlike a typical cloud virtual machine, the HPC environment feels sluggish for certain tasks. For instance, compiling a CUDA example on the 32-core CPU allocation I requested was surprisingly slow: even the `cmake ..` step alone took up to 10 minutes. This is a stark contrast to the performance I’d expect from a local machine or a standard cloud VM.
My hypothesis is that the bottleneck lies in the network disk’s I/O throughput, which seems significantly lower than expected. I haven’t measured this rigorously, so it remains a hypothesis. Despite exploring various configurations, I haven’t yet found a way to match the compilation speed of a local or cloud-based setup.
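One way to put a number on that hypothesis is a rough I/O probe. The sketch below is not something I ran for this post; it is a minimal example that assumes you can write to both the network-mounted directory and a node-local one (for example `/tmp`). The function names `small_file_benchmark` and `sequential_write_mb_per_s` are made up for illustration. It times many small file operations (closer to what `cmake` and a compiler do) and one large sequential write.

```python
import os
import sys
import tempfile
import time


def small_file_benchmark(directory, count=200, size=4096):
    """Create, fsync, read back, and delete `count` small files.

    Returns the elapsed time in seconds. Many small files roughly mimic
    the access pattern of a configure/compile step.
    """
    payload = b"x" * size
    start = time.perf_counter()
    # The scratch directory (and every file in it) is removed on exit.
    with tempfile.TemporaryDirectory(dir=directory) as scratch:
        for i in range(count):
            path = os.path.join(scratch, f"f{i}.tmp")
            with open(path, "wb") as fh:
                fh.write(payload)
                fh.flush()
                os.fsync(fh.fileno())  # force the write out to storage
            with open(path, "rb") as fh:
                fh.read()
    return time.perf_counter() - start


def sequential_write_mb_per_s(directory, total_mb=32):
    """Write `total_mb` MiB sequentially and return the rate in MiB/s."""
    chunk = b"\0" * (1024 * 1024)
    with tempfile.NamedTemporaryFile(dir=directory) as fh:
        start = time.perf_counter()
        for _ in range(total_mb):
            fh.write(chunk)
        fh.flush()
        os.fsync(fh.fileno())
        elapsed = time.perf_counter() - start
    return total_mb / max(elapsed, 1e-9)


if __name__ == "__main__":
    # Pass the directories to compare; defaults to the system temp dir.
    targets = sys.argv[1:] or [tempfile.gettempdir()]
    for target in targets:
        secs = small_file_benchmark(target)
        rate = sequential_write_mb_per_s(target)
        print(f"{target}: small files {secs:.2f} s, sequential {rate:.1f} MiB/s")
```

Running it with two directories as arguments (the project directory on the network disk and a node-local scratch directory) gives a side-by-side comparison. If the small-file time is far worse on the network disk while sequential throughput looks reasonable, that would point to per-file latency rather than raw bandwidth. Page caching can flatter the read numbers, so treat the results as a rough indication only.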
Remark
My experience suggests that HPC clusters may be best suited for batch processing tasks like training large models, rather than interactive development or compilation-heavy workflows. The I/O limitations of the network disk can make tasks that touch many small files, such as configuring and compiling a project, feel cumbersome.