Cluster Monitoring in Python with Glances

Some years ago my wife bought me a few (6) Raspberry Pi 3 B+ single-board computers for Christmas so that I could experiment with creating and managing clusters of computers (it's what I'd asked for). Since then, I've discovered better ways to do high-performance computing with a few higher-end PCs that I use regularly. Thus, my Raspberry Pi cluster largely sits dormant, performing only a few small tasks (primarily gathering data for my stock price API and acting as an SSH access point for my home network). Before this, I had used the Raspberry Pis for many tasks and always found myself wondering the same thing: "Are those things still running?" Because of the case I was using, I couldn't physically see the devices to verify that the power and processing lights were on or flashing, so I really had no indication of their status. Instead of doing rudimentary things like pinging each Pi or attempting to log into each node, I decided to implement a more accessible solution. In this post, I describe the setup and usage of the Glances API and the display of certain data on a small LCD screen mounted to the outside of the case I use to store my Pi cluster.

Glances

Glances is a cross-platform monitoring tool written in Python. It provides a ton of information about what's going on on the server running the software. What I've found most useful is running Glances in web server mode, which provides the information via a nifty web interface. In this mode, Glances listens for connections on port 61208 (by default) and essentially gives information similar to what you'd see via top or htop, with a few other niceties. Start Glances in web server mode and navigate to the URL (i.e., http://<server ip>:61208) to see the web dashboard.

This is nice, but I had planned on running this software on multiple devices and wanted to be able to determine their availability. Two problems arise: i) having to remember the IP addresses of many devices, and ii) having to check (essentially) multiple websites whenever we want to determine device availability.

Fortunately, Glances also offers a RESTful API to retrieve the information displayed above (and more). In my case, I was interested in knowing if a node in my Raspberry Pi cluster was available and what the total cluster’s resource utilization was. Therefore, I needed to be able to ping the nodes, get each node’s CPU usage, and get each node’s memory (RAM) usage. This is easily done when running Glances in web server mode via the API endpoints made available. The base URL is:

http://<server ip>:<glances port>/api/3/<plugin>

where

  • server ip is the IP address of the Glances server
  • glances port is the port Glances is listening for connections on
  • and plugin, in my case, is replaced with cpu to get the CPU usage details or mem to get the memory usage details
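To sanity-check an endpoint before wiring up a full script, the URL can be built and queried directly. A minimal sketch using only the standard library (the IP address below is a placeholder for one of your nodes, and `plugin_url` is my own helper name):

```python
import json
import urllib.request

def plugin_url(ip, port, plugin):
    """Build a Glances v3 REST endpoint URL from its parts."""
    return "http://{ip}:{port}/api/3/{plugin}".format(ip=ip, port=port, plugin=plugin)

url = plugin_url("192.168.0.147", 61208, "cpu")
print(url)  # http://192.168.0.147:61208/api/3/cpu

# On a network where a Glances server is actually listening:
# with urllib.request.urlopen(url, timeout=5) as resp:
#     data = json.load(resp)
#     print(data["total"])  # overall CPU usage percentage
```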

The command below is used to start Glances in web server mode on each node in the cluster. The -w command-line option specifies web server mode while -p chooses the port. Although the default port is used, I also specify it on the command line in case the default changes in the future.

glances -w -p 61208

To make it easier, I just added this command to the /etc/rc.local file, which runs on startup on Raspbian. However, more recent Ubuntu/Debian-based distributions have done away with the rc.local file, so another approach is required (a cron job, a systemd service, etc.)
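On distributions without rc.local, a small systemd unit does the same job. The sketch below is my own example, not part of the original setup (the unit name "glances.service" and the layout are assumptions):

```shell
# Install a hypothetical systemd unit that starts Glances in
# web server mode on boot, then enable it immediately.
sudo tee /etc/systemd/system/glances.service > /dev/null <<'EOF'
[Unit]
Description=Glances in web server mode
After=network-online.target

[Service]
ExecStart=/usr/bin/env glances -w -p 61208
Restart=on-failure

[Install]
WantedBy=multi-user.target
EOF

sudo systemctl daemon-reload
sudo systemctl enable --now glances.service
```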

Gather CPU and Memory Usage

Two functions were used to fetch the memory and CPU usage, although both do more or less the same thing. The first, displayed below, uses the Glances API to get details about the amount of RAM available on the queried device, the amount being used, and the percentage of RAM being used.

First, the following libraries are imported and variables are defined:

import requests
#import urllib.request
import socket
import json
from i2cdriver import lcd
from datetime import datetime
from pytz import timezone

BASE_IP = '192.168.0.'
## Some splainin': the '000' address isn't a real pi. For some reason, the
## requests library hangs on the last request, no matter which IP it is.
## That is, if we stop at 108 or 178 or 151 the program just stops.
## Including a bad request, catching the error, then continuing allows
## us to reliably get the data every time.
SERVER_IPS = ['147', '104', '108', '151', '119', '161', '000']
PORT_NUMBER = 61208

The function to fetch memory data is given as:

def get_mem(online):
    avail = [0 for _ in range(len(SERVER_IPS))]
    used = [0 for _ in range(len(SERVER_IPS))]
    percent = [0 for _ in range(len(SERVER_IPS))]
    for count, ip in enumerate(SERVER_IPS):
        if online[count] == 1:
            url = base_url.format(ip=BASE_IP+ip, plugin='mem', port=PORT_NUMBER)
            try:
                response_json = requests.get(url, headers={'Connection': 'Close'}, stream=True, timeout=5).json()
            except Exception as err:
                print(err)  # print just in case it's a real error
                continue    # skip this node; response_json was never assigned
            avail[count] = response_json['available']
            used[count] = response_json['used']
            percent[count] = response_json['percent']
    return avail, used, percent

Here the IP addresses of the Pis in the cluster are iterated over, and a URL is built to submit a GET request to the Glances API. The online parameter is a list that stores a "1" if the node is online and a "0" otherwise. If a request fails, the node is simply skipped so stale data isn't recorded. The data returned from the API is stored in lists and returned from the function. Note that I've given each Raspberry Pi in the cluster a static IP address via my router's DHCP settings, so I won't need to update the list (they were all sequential at one point; I'm not sure what happened between then and now).
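For reference, the three fields this function reads can be converted to something human-readable. A tiny sketch with made-up sample values (the real response contains many more fields):

```python
# Sample of the fields get_mem() reads; the byte counts are placeholders.
sample = {"available": 612000000, "used": 288000000, "percent": 32.0}

TO_GB = 1000000000  # the script divides by 10^9, i.e., decimal gigabytes
print("{:.2f}GB used, {:.2f}GB available ({:.1f}%)".format(
    sample["used"] / TO_GB, sample["available"] / TO_GB, sample["percent"]))
# 0.29GB used, 0.61GB available (32.0%)
```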

Next, in the function below, the Glances API is queried to retrieve the CPU usage for each Pi.

def get_cpu(online):
    cpu_values = [0 for _ in range(len(SERVER_IPS))]
    for count, ip in enumerate(SERVER_IPS):
        if online[count] == 1:
            url = base_url.format(ip=BASE_IP+ip, plugin='cpu', port=PORT_NUMBER)
            try:
                response_json = requests.get(url, headers={'Connection': 'Close'}, stream=True, timeout=5).json()
            except Exception as err:
                print(err)  # print just in case it's a real error
                continue    # skip this node; response_json was never assigned
            cpu_values[count] = response_json['total']   # percent being used
    return cpu_values

This function is very similar to get_mem(); the only difference is that it returns the percent usage of each Pi's CPU instead.
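Both functions share the same request-and-parse pattern, so the fetching could be factored into one helper that simply returns None when a node can't be reached. This is a sketch of my own (fetch_plugin is not part of the original script), using the standard library's urllib in place of requests:

```python
import json
import urllib.request

def fetch_plugin(ip, port, plugin, timeout=5.0):
    """Fetch one Glances plugin's JSON; return None if the node is
    unreachable or the response isn't valid JSON."""
    url = "http://{}:{}/api/3/{}".format(ip, port, plugin)
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return json.load(resp)
    except Exception:
        return None

# An unreachable address yields None instead of raising mid-loop:
print(fetch_plugin("127.0.0.1", 1, "cpu", timeout=0.5))  # None
```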

Below is the main section of the monitoring script, which first checks whether the Raspberry Pis are running and reachable (by attempting a TCP connection to the Glances port on each one), then calls the functions above to get the usage data, and finally makes the data a little more readable and writes it to the screen.

if __name__ == '__main__':
    # docs: https://github.com/nicolargo/glances/wiki/The-Glances-RESTFULL-JSON-API
    base_url = 'http://{ip}:{port}/api/3/{plugin}'

    ## check which pis are still online by attempting a TCP connection
    ## to the Glances port on each one
    online = [0 for _ in range(len(SERVER_IPS))]
    for count, ip in enumerate(SERVER_IPS):
        sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        sock.settimeout(2)  # don't wait long on unreachable nodes
        addr = BASE_IP + ip
        online[count] = (1 if sock.connect_ex((addr, PORT_NUMBER)) == 0 else 0)
        sock.close()

    ## get stats from online pis
    avail, used, pct = get_mem(online)
    cpu_values = get_cpu(online)

    to_gb = 1000000000
    div = sum(online) or 1  # avoid dividing by zero if every node is offline
    avg_cpu = sum(cpu_values) / div
    avg_mem_pct = sum(pct) / div

    avg_mem_avail = sum(avail) / to_gb
    avg_mem_used = sum(used) / to_gb

    print("Pi Avail/Total: {}/{}".format(sum(online), len(SERVER_IPS)-1))
    print(SERVER_IPS)
    print(online)
    print("Online: {}".format(sum(online)))
    print("CPU: {:.2f}%".format(avg_cpu))
    print("RAM: {:.2f}GB/{:.2f}GB".format(avg_mem_used, avg_mem_avail))
    print("RAM Percent: {:.2f}%".format(avg_mem_pct))

The code provided here will display lines of text showing

  • the number of computers available vs the total number in the cluster,
  • the IP addresses of the cluster nodes,
  • a list of values with a “1” indicating the node is available and a “0” indicating it is not,
  • the average CPU usage amongst all nodes (as a percentage),
  • the total RAM available compared to the amount of RAM being used,
  • and the average percentage of the RAM being used.

This is a nifty script in and of itself, but it still needs to run continuously (or on a schedule) to provide updated values. In my case, I wanted to passively monitor the Pis in the cluster. Because of this, I ended up purchasing a small 20-character, 4-line I2C LCD display and hooking it up to an old Raspberry Pi 2. The code to make this work is provided below.

First I needed to grab an I2C interface driver for the Raspberry Pi. The one I used was taken from this GitHub page and used more or less as-is. The file of interest is RPi_I2C_driver.py. Since the file is almost 200 lines long, I'll exclude it from this post so it doesn't get too long, and because I didn't write the script or change it in any way (at least that I remember).

To write to the LCD screen using the driver discussed above the following code is added after the print statements in the main function above.

# hook up LCD I2C and print results (20x4 screen)
my_lcd = lcd()
time = datetime.now(timezone('America/Denver'))
#my_lcd.lcd_display_string(time.strftime("%b %d, %Y %I:%M"), 1)
my_lcd.lcd_display_string("147 104 108 151 119 161", 1)
#my_lcd.lcd_display_string("Pi Avail/Total: {}/{}".format(sum(online), len(SERVER_IPS)-1), 2)
my_lcd.lcd_display_string(str(online[:-1]).replace(' ', ''), 2)
#my_lcd.lcd_display_string("Online: {}".format(sum(online)), 2)
my_lcd.lcd_display_string("CPU:{:.2f}% ".format(avg_cpu) + "Time " + time.strftime("%I:%M"), 3)
my_lcd.lcd_display_string("RAM: {:.2f}GB/{:.2f}GB".format(avg_mem_used, avg_mem_avail), 4)

The text is somewhat condensed and not practical for large numbers of nodes since the LCD screen is so small, but for me it did the trick just fine.
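To keep the display current, the script has to be re-run on a schedule. One simple option (my own suggestion; the path below is a placeholder) is a cron entry on the Pi driving the LCD:

```shell
# Add with `crontab -e`: run the monitor once a minute.
# /home/pi/monitor.py is a placeholder path for the script below.
* * * * * python3 /home/pi/monitor.py
```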

Conclusion

In sum, the Glances library is a very powerful monitoring tool that can be run almost anywhere and provides great insight into the current status of remote or headless machines. The ability to run in web server mode, coupled with the fact that doing so opens up a RESTful API, makes monitoring many nodes in a cluster very easy, requiring just a few GET requests and some JSON parsing. Note that I'm using a very small subset of the data provided by these GET requests (primarily so it fits on the LCD screen), but much more is available and can be found in the Glances RESTful documentation. I built a custom case for my cluster out of an old small-form-factor PC case and cut a hole to mount the LCD screen. Now, by just glancing at the LCD (maybe that's how the library was named), I can gain a lot of valuable insight into the status of my cluster.

Full Code

Please use the links above to find the Raspberry Pi I2C driver code.

monitor.py

import requests
#import urllib.request
import socket
import json
from i2cdriver import lcd
from datetime import datetime
from pytz import timezone

BASE_IP = '192.168.0.'
## Some splainin': the '000' address isn't a real pi. For some reason, the
## requests library hangs on the last request, no matter which IP it is.
## That is, if we stop at 108 or 178 or 151 the program just stops.
## Including a bad request, catching the error, then continuing allows
## us to reliably get the data every time.
SERVER_IPS = ['147', '104', '108', '151', '119', '161', '000']
PORT_NUMBER = 61208

def get_mem(online):
    avail = [0 for _ in range(len(SERVER_IPS))]
    used = [0 for _ in range(len(SERVER_IPS))]
    percent = [0 for _ in range(len(SERVER_IPS))]
    for count, ip in enumerate(SERVER_IPS):
        if online[count] == 1:
            url = base_url.format(ip=BASE_IP+ip, plugin='mem', port=PORT_NUMBER)
            try:
                response_json = requests.get(url, headers={'Connection': 'Close'}, stream=True, timeout=5).json()
            except Exception as err:
                print(err)  # print just in case it's a real error
                continue    # skip this node; response_json was never assigned
            avail[count] = response_json['available']
            used[count] = response_json['used']
            percent[count] = response_json['percent']
    return avail, used, percent

def get_cpu(online):
    cpu_values = [0 for _ in range(len(SERVER_IPS))]
    for count, ip in enumerate(SERVER_IPS):
        if online[count] == 1:
            url = base_url.format(ip=BASE_IP+ip, plugin='cpu', port=PORT_NUMBER)
            try:
                response_json = requests.get(url, headers={'Connection': 'Close'}, stream=True, timeout=5).json()
            except Exception as err:
                print(err)  # print just in case it's a real error
                continue    # skip this node; response_json was never assigned
            cpu_values[count] = response_json['total']   # percent being used
    return cpu_values

if __name__ == '__main__':
    # docs: https://github.com/nicolargo/glances/wiki/The-Glances-RESTFULL-JSON-API
    base_url = 'http://{ip}:{port}/api/3/{plugin}'

    ## check which pis are still online by attempting a TCP connection
    ## to the Glances port on each one
    online = [0 for _ in range(len(SERVER_IPS))]
    for count, ip in enumerate(SERVER_IPS):
        sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        sock.settimeout(2)  # don't wait long on unreachable nodes
        addr = BASE_IP + ip
        online[count] = (1 if sock.connect_ex((addr, PORT_NUMBER)) == 0 else 0)
        sock.close()

    ## get stats from online pis
    avail, used, pct = get_mem(online)
    cpu_values = get_cpu(online)

    to_gb = 1000000000
    div = sum(online) or 1  # avoid dividing by zero if every node is offline
    avg_cpu = sum(cpu_values) / div
    avg_mem_pct = sum(pct) / div

    avg_mem_avail = sum(avail) / to_gb
    avg_mem_used = sum(used) / to_gb

    print("Pi Avail/Total: {}/{}".format(sum(online), len(SERVER_IPS)-1))
    print(SERVER_IPS)
    print(online)
    print("Online: {}".format(sum(online)))
    print("CPU: {:.2f}%".format(avg_cpu))
    print("RAM: {:.2f}GB/{:.2f}GB".format(avg_mem_used, avg_mem_avail))
    print("RAM Percent: {:.2f}%".format(avg_mem_pct))

    # hook up LCD I2C and print results (20x4 screen)
    my_lcd = lcd()
    time = datetime.now(timezone('America/Denver'))
    #my_lcd.lcd_display_string(time.strftime("%b %d, %Y %I:%M"), 1)
    my_lcd.lcd_display_string("147 104 108 151 119 161", 1)
    #my_lcd.lcd_display_string("Pi Avail/Total: {}/{}".format(sum(online), len(SERVER_IPS)-1), 2)
    my_lcd.lcd_display_string(str(online[:-1]).replace(' ', ''), 2)
    #my_lcd.lcd_display_string("Online: {}".format(sum(online)), 2)
    my_lcd.lcd_display_string("CPU:{:.2f}% ".format(avg_cpu) + "Time " + time.strftime("%I:%M"), 3)
    my_lcd.lcd_display_string("RAM: {:.2f}GB/{:.2f}GB".format(avg_mem_used, avg_mem_avail), 4)
