At the Hadrian Hotel

Wednesday, October 22, 2008

Cluster Node-Locking with Torque and Maui

These are mostly notes to myself so that I can figure out how to do this more quickly next time...

We needed to add some nodes to a Rocks 4.1 cluster where members of a particular lab were to have exclusive use of the new hardware for a period of time. So, we had to find a way to let these folks submit jobs that would run only on the new nodes while preventing anybody else from running jobs on them. We chose a belt-and-suspenders approach using features of both Torque (PBS) and Maui.

Previously, we had a single "default" queue for all users of this cluster. We added a "vision" queue for the users of the new machines so that they would be able to explicitly request that their jobs run on the new hardware. This queue specifies ACLs for the node list as well as the users allowed to submit jobs to the queue. In addition, there is a "neednodes" resource specified that gives Maui a clue as to where any jobs in this queue can be run. Here are the commands we ran to set up the queue:

qmgr -c "create queue vision queue_type=execution"
qmgr -c "set queue vision resources_default.neednodes = vision"
qmgr -c "set queue vision acl_hosts=compute-0-22+compute-0-23+compute-0-24"
qmgr -c "set queue vision acl_host_enable = false"
qmgr -c "set queue vision acl_users=user1"
qmgr -c "set queue vision acl_users+=user2"
qmgr -c "set queue vision acl_users+=user3"
qmgr -c "set queue vision acl_user_enable=true"
qmgr -c "set queue vision enabled = True"
qmgr -c "set queue vision started = True"

The acl_host_enable = false setting causes Torque to use the acl_hosts list as nodes on which jobs should be queued, rather than as hosts allowed to run the qsub command. Note that there does not appear to be a way to set multiple acl_users in a single command. A "list queue" command will show the users in a comma-separated list, but trying to set the ACL that way produces a syntax error, and the same goes for joining the users with a plus sign as is done for the hosts ACL.

In addition to setting up the vision queue, changes were needed to the default queue and to the Torque nodes file, which in our case was /opt/torque/server_priv/nodes but generically would be found at $TORQUE_HOME/server_priv/nodes. We added a "neednodes" resource to the default queue as we did for the vision queue:
qmgr -c "set queue default resources_default.neednodes = general"


For each of the 3 new machines, we appended the word "vision" to the line defining the node like so:
compute-0-22.local np=4 vision

For the rest of the nodes in the file, we added the word "general" like so:
compute-0-0.local np=4 general
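
Putting that together, the relevant portion of the nodes file ends up looking something like the sketch below. Only compute-0-0 and compute-0-22 were shown above; compute-0-1 is a stand-in for the rest of the general nodes, and I'm assuming the other two vision machines are also np=4:

compute-0-0.local np=4 general
compute-0-1.local np=4 general
compute-0-22.local np=4 vision
compute-0-23.local np=4 vision
compute-0-24.local np=4 vision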

After restarting the pbs_server and maui daemons, the end result was that anybody could submit jobs to the default queue and they would run on any node except the 3 nodes dedicated to the vision lab. Only specific users could submit jobs to the vision queue, and those jobs would only run on the 3 new machines. This is just what we were looking for. If we ever want to allow everybody to use the new nodes from the default queue, I believe that it should be as simple as appending the word "general" to the "vision" nodes in the server_priv/nodes file.
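For reference, the restart and a quick test submission looked roughly like the following. The init script paths are what I'd expect on a stock Rocks install, and myjob.sh is just a placeholder job script, so adjust to taste:

/etc/init.d/pbs_server restart
/etc/init.d/maui restart
qsub -q vision myjob.sh
qsub myjob.sh

The first qsub should land on one of the three vision nodes (and should be rejected outright for anyone not in the acl_users list); the second goes to the default queue and stays on the general nodes.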
