Revision as of 11:03, 21 July 2022
Launching Remote Kernels in Mathematica
This page assumes that you've already followed the instructions in Cluster SSH access and you have all of that working.
There are other ways to set this up, including lots of menus in Mathematica for the remote kernel settings, but the documentation on those is vague and confusing. For now we're just entering the information directly in the Mathematica code window, because it's simple and it works. Any suggestions for better ways to do this are welcome, and when we find a better way, we'll update this page.
Also, since this page was written by someone who is not very familiar with Mathematica syntax, the large 'LaunchKernels' command for 40 nodes shown below is done the long way. A loop that generates the list of machine strings passed to LaunchKernels would avoid having to cut and paste that giant command.
Here are the instructions:
On your Linux desktop, start up Mathematica.
We'll start by launching kernels on two of the other cruncher machines, since this is a simple example to start with.
For this example, we're going to start 4 kernels each on ramsey and fibonacci.
In the Mathematica window, type the following:
LaunchKernels[{"ssh://ramsey/?4","ssh://fibonacci/?4"}]
Then, right-click that cell and choose 'Evaluate Cell'. The system will launch the remote kernels, and show its progress while it does that. Once all the kernels are launched, you can run calculations on them. To close those kernels, do
CloseKernels[]
and evaluate the cell. The system will close the remote kernels. Please remember to do a CloseKernels[] at the end of your session. Closing Mathematica SHOULD close the remote kernels, but it's better to be sure.
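Once the kernels are up, you can confirm the count and run something on them. Here is a small sketch of a session; the Mersenne-prime test is just a placeholder workload, not part of the setup:

```mathematica
(* after LaunchKernels succeeds, confirm how many kernels are running *)
$KernelCount

(* run a sample calculation across the remote kernels;
   PrimeQ[2^n - 1] tests which exponents give Mersenne primes *)
ParallelTable[{n, PrimeQ[2^n - 1]}, {n, 1, 30}]

(* shut the remote kernels down when finished *)
CloseKernels[]
```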
Now, we'll launch four kernels each on the 40 node cluster. There's a better way to do this than copying and pasting this big command, like using a loop to generate the node names, but for now this works. Note that it will take a while to launch all of the kernels because it launches them one at a time, and this command should give you 160 remote kernels.
LaunchKernels[{"ssh://pnode01/?4","ssh://pnode02/?4","ssh://pnode03/?4","ssh://pnode04/?4","ssh://pnode05/?4","ssh://pnode06/?4","ssh://pnode07/?4","ssh://pnode08/?4","ssh://pnode09/?4","ssh://pnode10/?4","ssh://pnode11/?4","ssh://pnode12/?4","ssh://pnode13/?4","ssh://pnode14/?4","ssh://pnode15/?4","ssh://pnode16/?4","ssh://pnode17/?4","ssh://pnode18/?4","ssh://pnode19/?4","ssh://pnode20/?4","ssh://pnode21/?4","ssh://pnode22/?4","ssh://pnode23/?4","ssh://pnode24/?4","ssh://pnode25/?4","ssh://pnode26/?4","ssh://pnode27/?4","ssh://pnode28/?4","ssh://pnode29/?4","ssh://pnode30/?4","ssh://pnode31/?4","ssh://pnode32/?4","ssh://pnode33/?4","ssh://pnode34/?4","ssh://pnode35/?4","ssh://pnode36/?4","ssh://pnode37/?4","ssh://pnode38/?4","ssh://pnode39/?4","ssh://pnode40/?4"}]
Copy this command from this page and paste it into Mathematica. Note that if you're using X2go and you copy on the local machine, you may need to paste using the 'Paste' command under the 'Edit' menu, because different systems handle copying and pasting differently.
Once you have pasted that big command in there, right-click the cell and evaluate it. You'll see the progress as all of the remote kernels are started up. Once it's done, you'll have 160 remote kernels.
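As mentioned above, the long machine list can be generated with a loop instead of being pasted by hand. A sketch that builds the same 40-entry list, assuming the nodes are named pnode01 through pnode40 as in the command above:

```mathematica
(* build the list {"ssh://pnode01/?4", ..., "ssh://pnode40/?4"};
   IntegerString[i, 10, 2] zero-pads the node number to two digits *)
nodes = Table["ssh://pnode" <> IntegerString[i, 10, 2] <> "/?4", {i, 40}];
LaunchKernels[nodes]
```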
When you are finished using them, do a
CloseKernels[]
so the system can go and shut them all down properly.
Tips
Note that there is nothing stopping you from running remote kernels on both the cluster nodes and the crunchers, but this is not recommended. The cores on these machines run at very different speeds, so the faster machines may end up waiting for the slower ones to finish, which wastes resources. It's best to run things on the cluster OR the crunchers, but not both.
The number of kernels you run on each machine may give you different outcomes depending on your job. If you run 8 kernels per node, your job may run faster, or it may not. It's best to experiment with a subset of your job to find the optimum number of kernels per node.
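One way to run that experiment is to time a representative slice of the job at a few kernel counts. A rough sketch; the machine name and the Prime workload are placeholders for your own job:

```mathematica
(* try 2, 4, and 8 kernels on one machine and compare timings on a sample workload *)
Do[
  CloseKernels[];
  LaunchKernels[{"ssh://ramsey/?" <> IntegerString[k]}];
  Print[k, " kernels: ",
    First@AbsoluteTiming[ParallelTable[Prime[10^6 + i], {i, 2000}];]],
  {k, {2, 4, 8}}
]
CloseKernels[]
```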
For the cluster nodes, each one has 8 CPU threads, so ideally, if no one else is using the cluster, your job should load each node up to a load of 8 so that you're making full use of each node's CPU. If the load is less than 8, you're not using the whole CPU; if it's more than 8, some of your kernels are waiting for resources.
You can see the status of the cluster nodes here:
Cluster Pnodes (You must be on the Cornell network or VPN to use this link)
If you're running remote kernels on the Ryzen crunchers, each one has 32 CPU threads, so you would want to load one up to 32 to make full use of its CPU, assuming its load was zero before you started. Note that other people are using these machines, so be a good neighbor and don't hog an entire machine.
The status of the crunchers is here:
Crunchers (You must be on the Cornell network or VPN to use this link)
Power Consumption
If you really want to geek out, you can look at the server room power consumption when you're running your job.
Server Room Power (You must be on the Cornell network or VPN to use this link)
The purple circuit is the cluster. The power usage is around 650W when nothing is going on, and it goes up when the machines get busy. Note that if the power on the purple circuit goes above 2700 watts, the cluster may shut down. I'm worried about going over this limit, but so far it hasn't happened. Can you send the cluster enough math to kill it? Try it and let me know! The solution is easy: we can put some of the nodes on another circuit. For now they're all on the same circuit so we can more easily measure the cluster's power usage.
Note the room temperature! The room has a large cooler that uses Cornell Lake-source cooling to cool the room, but your job will still probably warm up the room by a few degrees.
The blue circuit is the Ryzen GPU cruncher machines. That circuit also has a limit of 2700 watts total. There should be enough headroom there to be able to load up all of those machines including their GPUs without overloading the circuit. Again, you might be able to make it shut down. That's ok, as long as you tell the system administrator. It's better to cause this problem now so we can rearrange things to avoid it in the future. But, so far, there seems to be enough capacity on this circuit to handle all of the cruncher machines.
Network Bottlenecks
The crunchers are on a 10Gb/s network. The cluster nodes each have a 1Gb/s link, but their uplink as a group to the main network is 10Gb/s. There is a limitation at this time: the crunchers and the cluster are on different subnets, and the connection between the subnets is limited to 1Gb/s. This should not impact your job, but you can check. Go to Crunchers, find the machine where your main Mathematica session is running, and look at its 'Network' graph. If that machine is maxing out at 1Gb/s during your job, then your job is being constrained by the subnet bottleneck. If this is happening, please email humphrey@cornell.edu and let me know. The bottleneck can be removed, but we haven't gotten to that yet.