In this article I list metrics and alerts one should have when monitoring a GPU cluster to ensure efficient utilization of resources.

GPU cluster monitoring is critical for organizations to optimally utilize the limited capacity they have.
Without monitoring it is easy for users to leave jobs running that do not use GPU resources, or do not use them efficiently.
In some cases GPU clusters use certain technologies that require the users to provide images with specific libraries, and not including those dependencies can result in significantly worse compute performance.

  • Allocated GPUs
    • Used to determine who (or which project) has GPUs allocated (i.e., currently assigned to a running workload)
  • GPU utilization
    • Used to determine whether the GPU is partially or fully used, and if it is partially used, to potentially identify the causes
  • GPU memory utilization
    • Used to determine if the GPU memory is partially or fully used
    • Used to identify out-of-memory issues and potential memory leaks
  • InfiniBand receive/transmit bytes
    • Used to determine if a workload is making use of the technology
  • Job launch wait duration
    • Used to determine when there's queueing of jobs due to compute being exhausted and how long it takes for jobs to start
  • Job duration
    • Used to gather statistics about the type of workload running on the cluster in order to make informed decisions

  • Allocated GPUs are used
    • Used to detect jobs that request multiple GPUs but end up using one or only a few of them
  • GPU utilization below threshold (<10%)
    • Used to detect workloads that do not make full use of the GPU or are allocated to an oversized GPU
  • GPU utilization above threshold (>90%)
    • Used to detect when the GPU is saturated
  • GPU utilization range above threshold (>25%)
    • Used to detect uneven distribution of GPU compute workload
  • GPU memory utilization below threshold (<10%)
    • Used to detect workloads that do not make full use of the GPU or are allocated to an oversized GPU
  • GPU memory utilization above threshold (>95%)
    • Used to detect when a job is about to run out of GPU memory
  • InfiniBand receive/transmit > 0 when running multi-node workloads
    • Used to identify workloads that are not properly configured to use InfiniBand
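
As a rough illustration, the alerts above can be sketched as a single function that checks one observation window of a job. The `JobSample` structure, the alert names, and the choice to aggregate per job (mean, max, range across its GPUs) are assumptions I made for the example; only the thresholds come from the list above.

```python
from dataclasses import dataclass
from statistics import mean
from typing import List


@dataclass
class JobSample:
    """One observation window for a job (hypothetical structure)."""
    job_id: str
    node_count: int
    gpu_utilization: List[float]         # percent, one value per allocated GPU
    gpu_memory_utilization: List[float]  # percent, one value per allocated GPU
    infiniband_bytes: int                # receive + transmit bytes in the window


def evaluate_alerts(sample: JobSample) -> List[str]:
    """Return the names of the alerts that fire for this sample."""
    alerts: List[str] = []
    util = sample.gpu_utilization
    mem = sample.gpu_memory_utilization
    if not util or not mem:
        return alerts  # nothing allocated, nothing to check

    # Allocated GPUs are used: assumption, a GPU at 0% counts as unused.
    if any(u <= 0.0 for u in util):
        alerts.append("allocated_gpu_unused")
    # GPU utilization below (<10%) and above (>90%) threshold.
    if mean(util) < 10.0:
        alerts.append("gpu_utilization_low")
    if mean(util) > 90.0:
        alerts.append("gpu_utilization_high")
    # GPU utilization range above threshold (>25%): uneven distribution.
    if max(util) - min(util) > 25.0:
        alerts.append("gpu_utilization_uneven")
    # GPU memory utilization below (<10%) and above (>95%) threshold.
    if mean(mem) < 10.0:
        alerts.append("gpu_memory_low")
    if max(mem) > 95.0:
        alerts.append("gpu_memory_high")
    # InfiniBand traffic expected when running on more than one node.
    if sample.node_count > 1 and sample.infiniband_bytes == 0:
        alerts.append("infiniband_unused")
    return alerts


sample = JobSample(
    job_id="job-1",
    node_count=2,
    gpu_utilization=[95.0, 0.0, 50.0, 40.0],
    gpu_memory_utilization=[50.0, 97.0, 20.0, 30.0],
    infiniband_bytes=0,
)
fired = evaluate_alerts(sample)
```

In a real setup these conditions would more likely live in the alerting rules of the monitoring system; the function is only meant to make the conditions explicit.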

All articles on this blog originate from my mind.
Most articles are written by me, but some are partially or entirely AI/LLM-generated.

Those articles will be tagged accordingly:

  • No tag for completely original content.
  • partially-ai-generated for articles with one or many AI-generated sentences or with some feedback provided by AI.
    This covers everything from articles where a single word was changed by AI to articles almost entirely written by AI but with some human input.
  • fully-ai-generated when all the content is AI-generated.
    This covers articles that are entirely written by AI without any human input (except for possibly removing sentences).

I also use additional tags in relation to AI usage, namely:

  • ai-feedback for articles that were edited following AI feedback.

I tag the articles with the LLMs that were involved.
Look for tags starting with llm=.

I use a variety of LLM providers (in order of frequency of use):

  • It's important to know what your goals are.
  • It's important to understand why they are your goals.
  • It's important to determine which goals are more important than others (goals priority).
  • It's important to know which goals are dependent on other goals (goals decomposition and dependency).
  • To reach a goal, you must first acquire the tools (knowledge, resources) to get to your objective.
  • It's important to know when to drop/abandon goals.

  • Sources of inefficiency

    • Repeating the same task without sufficient experience.
  • Always try to figure out the optimal path toward a goal

    • Observe others successful at achieving the goal you want to achieve.
    • Determine the differences between your state and theirs (what they know, what resources are available to them, etc.).
  • How to determine when it is not possible to reach a goal at a given moment in time?
    • Not enough time available
    • Too costly
    • Dependencies not resolved/ready

The workstack is a very simple idea I had while working. It is based on the concept of a stack as the name clearly implies. As you work, you, like a computer, process things one at a time and as new things need to be done, you either throw them in a todo list (a queue), or you start doing them right away (you stack them).

The workstack is a way to record notes about what you work on. As you work on a task, you can either work on it to completion, or be interrupted by the necessity of working on another task. In the first case, tasks are simply written one after the other with their begin and end time. In the second case, the interrupting tasks are also indented, such that it is possible to observe when a task forced you to "switch context".

An example of this note-taking format is as follows.


2018-05-18
Task 1 10:00-10:30
Task 2 10:35-10:50
Task 3 11:00-...
    Task 4 11:05-11:15
    Task 6 11:17-...
        Task 7 11:20-...
Task 5 (not begun)

In this case, the person started working on tasks 1 and 2, then began working on task 3. While working on it, they noticed that something else was necessary, which spawned task 4. While working on task 4, they observed something that could be done but did not have to be done right away, which spawned task 5. After completing task 4, they returned to task 3, but noticed that something else also had to be done, which spawned task 6. During task 6, something else interrupted them, which forced them to work on task 7; this could have been a coworker asking for immediate help. Task 5, by contrast, could be a coworker asking for help as soon as they are available, without wanting to interrupt.

Conceptually, you would always want to complete a stack of operations before moving to a new task. However, it is very common for a programmer to start going down such a stack while working on code and then never climb back up it, effectively leaving some of the started work unfinished.

This format thus allows a programmer (or anyone working on tasks that can spawn other tasks) to better track what they were doing and what they did and did not complete.
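
To make the "what did I not complete" part concrete, here is a minimal sketch that reads notes in the format shown earlier and lists the tasks that are still open (those whose end time is `...`). The regular expression, the `open_tasks` name, and the 4-space indentation width are assumptions of this sketch, not part of the format itself.

```python
import re
from typing import List, Tuple

# Matches lines such as "    Task 6 11:17-..." or "Task 1 10:00-10:30".
ENTRY = re.compile(
    r"^(?P<indent>\s*)(?P<name>.+?)\s+"
    r"(?P<start>\d{1,2}:\d{2})-(?P<end>\d{1,2}:\d{2}|\.\.\.)\s*$"
)

NOTES = """\
2018-05-18
Task 1 10:00-10:30
Task 2 10:35-10:50
Task 3 11:00-...
    Task 4 11:05-11:15
    Task 6 11:17-...
        Task 7 11:20-...
Task 5 (not begun)
"""


def open_tasks(notes: str, indent_width: int = 4) -> List[Tuple[int, str]]:
    """Return (depth, name) for every task that was started but not finished."""
    result = []
    for line in notes.splitlines():
        match = ENTRY.match(line)
        if match is None:
            continue  # date headers, "(not begun)" entries, blank lines
        if match.group("end") == "...":
            depth = len(match.group("indent")) // indent_width
            result.append((depth, match.group("name")))
    return result


still_open = open_tasks(NOTES)
```

Read from bottom to top, the result is the stack you still have to climb back up: the deepest entry is what you were doing last.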

  • Build a list of tasks/items
    List everything that you want to get out of your head. The goal here is to make explicit as much as possible.
  • Deconstruct tasks into their pre-requisites and follow-up tasks
    There are a couple of important things to consider when prioritizing a task list. One is that even if a task is at the top of the list, it might not be possible to do it until its dependencies are fulfilled. This in turn means that all of its dependencies automatically have a higher priority than the task itself.
    However, it frequently happens that what we consider dependencies can in fact be delayed or temporarily replaced by another solution which takes less time to implement or costs less (or for whatever other reason can replace the original dependency).
  • Split tasks into 2 groups (and repeat this process)
    The idea here is to quickly filter out as many tasks as possible. As you may have noticed, I have not specified the filtering predicate. It is up to you to filter out your tasks such that you will have the least amount to filter at once. Examples of predicates you could use are "will/will not do", "want/do not want", "need/do not need", "like/do not like" and so on.
  • Prioritize the tasks that will have to be done
    After a certain number of iterations of the previous step, you should arrive at a point where the items you have all need to be done, but you do not know in which order you have to do them (or want to do them).
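
The interaction between priorities and dependencies described above can be sketched with a topological sort: a task only becomes eligible once its dependencies are done, and among eligible tasks the one with the highest priority goes first. This is one possible way to do it, using `graphlib` from the Python standard library (3.9+). The `schedule` function, the task names, and the convention that a lower number means a higher priority are assumptions for illustration; every task is assumed to have a priority.

```python
from graphlib import TopologicalSorter
from typing import Dict, List, Set


def schedule(priorities: Dict[str, int],
             dependencies: Dict[str, Set[str]]) -> List[str]:
    """Order tasks so that dependencies come first, then by priority.

    Lower priority number means "do sooner". A dependency cycle raises
    graphlib.CycleError when the graph is prepared.
    """
    sorter = TopologicalSorter(
        {task: dependencies.get(task, set()) for task in priorities}
    )
    sorter.prepare()
    order: List[str] = []
    ready: List[str] = []
    while sorter.is_active():
        # Collect tasks whose dependencies are all done.
        ready.extend(sorter.get_ready())
        ready.sort(key=lambda task: priorities[task])
        task = ready.pop(0)
        order.append(task)
        sorter.done(task)  # may unlock tasks that depend on this one
    return order


priorities = {"write report": 1, "clean data": 2, "collect data": 3, "book room": 4}
dependencies = {
    "write report": {"clean data"},
    "clean data": {"collect data"},
}
plan = schedule(priorities, dependencies)
```

Here "write report" has the highest priority but ends up third, because its chain of dependencies has to be completed first.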

This method, also known as the Eisenhower Matrix, involves categorizing tasks into four groups: Urgent/Important, Not Urgent/Important, Urgent/Not Important, and Not Urgent/Not Important. By sorting tasks this way, you can focus on what truly matters and avoid spending time on less critical activities.
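
A minimal sketch of the sorting step, assuming each task is given as a `(name, urgent, important)` tuple; the function name and the sample tasks are made up for the example.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def eisenhower(tasks: Iterable[Tuple[str, bool, bool]]) -> Dict[str, List[str]]:
    """Group (name, urgent, important) tuples into the four quadrants."""
    groups: Dict[str, List[str]] = defaultdict(list)
    for name, urgent, important in tasks:
        quadrant = "{}/{}".format(
            "Urgent" if urgent else "Not Urgent",
            "Important" if important else "Not Important",
        )
        groups[quadrant].append(name)
    return dict(groups)


matrix = eisenhower([
    ("fix production outage", True, True),
    ("plan next quarter", False, True),
    ("answer a ringing phone", True, False),
    ("browse news", False, False),
])
```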

The Analytic Hierarchy Process (AHP) is a structured technique for organizing and analyzing complex decisions. It involves breaking down a problem into its components, comparing them pairwise, and assigning weights to determine the relative priority of each task.
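
As a sketch of the weighting step only: given a pairwise comparison matrix where entry `[i][j]` says how many times more important task `i` is than task `j`, one common approximation of the AHP priority vector is the normalized geometric mean of each row. The `ahp_weights` name and the sample matrix are assumptions for the example; the full method also includes a consistency check that is not shown here.

```python
import math
from typing import List


def ahp_weights(comparisons: List[List[float]]) -> List[float]:
    """Approximate AHP weights with the row geometric mean method.

    comparisons[i][j] is how many times more important item i is than
    item j, so comparisons[j][i] is expected to be its reciprocal.
    """
    n = len(comparisons)
    row_means = [math.prod(row) ** (1.0 / n) for row in comparisons]
    total = sum(row_means)
    return [value / total for value in row_means]


# Task A is twice as important as B and four times as important as C.
matrix = [
    [1.0, 2.0, 4.0],
    [1 / 2, 1.0, 2.0],
    [1 / 4, 1 / 2, 1.0],
]
weights = ahp_weights(matrix)
```

For this perfectly consistent matrix the weights come out in the ratio 4:2:1, which matches the judgments that went in.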

Using a binary search tree for prioritization means inserting tasks based on their priority value, allowing for efficient retrieval and reordering. This approach is useful for dynamically managing and updating a list of tasks as priorities change.
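
A minimal sketch of the idea, assuming each task already has a numeric priority value (lower means "do sooner"). The `Node`, `insert`, and `in_order` names are mine, and the tree is not self-balancing, so inserting tasks that are already sorted degrades it into a linked list.

```python
from dataclasses import dataclass
from typing import Iterator, Optional


@dataclass
class Node:
    priority: int
    task: str
    left: Optional["Node"] = None
    right: Optional["Node"] = None


def insert(root: Optional[Node], priority: int, task: str) -> Node:
    """Insert a task into the tree and return the (possibly new) root."""
    if root is None:
        return Node(priority, task)
    if priority < root.priority:
        root.left = insert(root.left, priority, task)
    else:
        root.right = insert(root.right, priority, task)
    return root


def in_order(root: Optional[Node]) -> Iterator[str]:
    """Yield tasks from lowest to highest priority value."""
    if root is not None:
        yield from in_order(root.left)
        yield root.task
        yield from in_order(root.right)


root = None
for priority, task in [(5, "refactor"), (2, "fix bug"), (8, "write docs"), (1, "deploy hotfix")]:
    root = insert(root, priority, task)
ordered = list(in_order(root))
```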

The Planning Game is a collaborative method often used in agile development, where stakeholders and team members estimate and prioritize tasks together. It encourages discussion, negotiation, and consensus to determine which tasks should be tackled first.

In the 100-Point Method, each participant is given 100 points to distribute among a list of tasks or requirements according to their perceived importance. The tasks with the highest total points are prioritized, reflecting the collective preferences of the group.
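
The tallying step is simple enough to sketch in a few lines. The function name, the participant names, and the rule of rejecting allocations that do not sum to exactly 100 are assumptions for the example.

```python
from collections import Counter
from typing import Dict, List, Tuple


def hundred_point_ranking(votes: Dict[str, Dict[str, int]]) -> List[Tuple[str, int]]:
    """Sum each participant's points per task and rank tasks by total."""
    totals: Counter = Counter()
    for participant, allocation in votes.items():
        spent = sum(allocation.values())
        if spent != 100:
            raise ValueError(f"{participant} allocated {spent} points, expected 100")
        totals.update(allocation)  # adds this participant's points per task
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)


ranking = hundred_point_ranking({
    "alice": {"task A": 50, "task B": 30, "task C": 20},
    "bob": {"task A": 10, "task B": 70, "task C": 20},
})
```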

  • What you feel like reading
  • Reading dependencies
  • ROI evaluation

  • Karlsson, Joachim, Claes Wohlin, and Björn Regnell. "An evaluation of methods for prioritizing software requirements." Information and Software Technology 39.14 (1998): 939-947.
  • Karlsson, Joachim, Stefan Olsson, and Kevin Ryan. "Improved practical support for large-scale requirements prioritising." Requirements Engineering 2.1 (1997): 51-60.
  • Ahl, Viggo. "An experimental comparison of five prioritization methods: investigating ease of use, accuracy and scalability." (2005).
  • Gill, Nasib Singh. "A Comparison among Various Techniques to Prioritize the Requirements." International Journal of Computer Science and Management Studies (IJCSMS) 1.12: 601-607.
  • http://www.gwern.net/Resorter