I've been running a file and media server at home to share stuff with family and friends for many years. As the collection of media has grown, and the configuration has gotten more complex, the thought of having to rebuild after a catastrophic failure is daunting.

Like any sensible person would, I planned for some form of redundancy. After looking at ZFS, Btrfs and mdadm, I opted for mdadm. Not for any particular reason, other than having the ability to add disks and grow the array. ZFS raidz + zpool seemed like too much work and I'm lazy. I'd rather have a single layer of storage array that I can add disks to without having to pool devs.

There's obviously an off-site backup (yay CrashPlan), but that's there for peace of mind for when shit really hits the fan. Day to day, I would like to avoid recovering terabytes from a cloud service. It's much quicker to rebuild a RAID array than to wait for everything to download from CrashPlan.

Initially, I simply set up smartmontools to send me a daily email with the health of the disks. This didn't prove very effective, as the emails got lost amongst countless ads for enlarging one's genitals. At one stage my RAID5 array was in limp mode for a week, until I noticed that my pretty Grafana graph of HDD temperatures was missing a drive. Looking further back, I saw that one of the drive temperatures had spiked suddenly, after which the drive disappeared from the graph. A quick search for the email did indeed show a drive failure.

I wanted to narrow down on the failed drive and see what the integrity of the array was like. Luckily, mdadm comes with a nice report. Simply running mdadm --detail /dev/md0 showed the failed drive and other details of the array's health.
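If you'd rather read that report from a script than by eye, the header part of it is mostly "Key : value" lines, which are easy to turn into a dictionary. Below is a minimal Python 3 sketch of that idea. The sample text is hand-written and abbreviated, in the general shape of what mdadm prints, not captured from a real array, and parse_detail is a name made up for this example.

```python
# Hand-written, abbreviated sample in the general shape of `mdadm --detail`
# output. It is NOT captured from a real array.
SAMPLE_REPORT = """\
/dev/md0:
     Raid Level : raid5
   Raid Devices : 4
          State : clean, degraded
 Active Devices : 3
Working Devices : 3
 Failed Devices : 1
  Spare Devices : 0
"""


def parse_detail(report):
    """Collect the 'Key : value' lines of a report into a dict of strings."""
    fields = {}
    for line in report.splitlines():
        # Split on the first " : " only, so values containing colons survive.
        key, sep, value = line.partition(" : ")
        if sep:  # lines without the separator (e.g. "/dev/md0:") are skipped
            fields[key.strip()] = value.strip()
    return fields


if __name__ == "__main__":
    info = parse_detail(SAMPLE_REPORT)
    print(info["State"], "| failed:", info["Failed Devices"])
```

On a real machine the report text would come from running mdadm itself (which usually needs root), so treat the parsing here as a starting point rather than something tested against every mdadm version.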

What I really wanted was a way of knowing the health of a given array at a glance, rather than an email report of frequently less-than-critical issues that would often get buried amongst less important stuff in my inbox.


As I'm already using Pushover to send notifications from many of the applications I use for media management (it's natively supported by Sonarr, Radarr, SABnzbd and many more), I figured I'd stick with it, as it has proven very reliable at delivering messages to my phone and other destinations. Best part: it lets you set a priority, so I don't get interrupted when nothing warrants immediate attention.

As a starting point, I'm only alerting myself on the number of failed drives. This can easily be obtained by running mdadm --detail /dev/md0 | awk '/Failed Devices : / {print $4;}'. The priority is set based on the returned value: anything >0 is undesirable and gets bumped to "High" priority. This tells the Pushover client app to ignore your phone's silent / do not disturb setting and send an audible notification.
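To make the awk one-liner less magic: it looks for the line containing "Failed Devices : " and prints its fourth whitespace-separated field, which is the count. Here is a rough Python 3 stand-in for that filter, together with the priority rule described above. The function names are invented for this sketch, and unlike awk it stops at the first matching line.

```python
def failed_devices(report):
    """Python stand-in for: awk '/Failed Devices : / {print $4;}'."""
    for line in report.splitlines():
        if "Failed Devices : " in line:
            # Whitespace-split fields are: "Failed", "Devices", ":", <count>
            return int(line.split()[3])
    return None  # no such line, e.g. mdadm printed an error instead


def pushover_priority(failed):
    """High priority (1) for any failed drive, low priority (-1) otherwise."""
    return 1 if failed > 0 else -1


if __name__ == "__main__":
    count = failed_devices(" Failed Devices : 1\n")
    print(count, pushover_priority(count))
```

Returning None when the line is missing is a choice made for this sketch; it lets the caller tell "zero failed drives" apart from "mdadm gave no usable output".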

The health check code itself is simple, as can be seen below. Feel free to use / modify / contribute. It can also be found on GitHub, as I'm planning on adding features. mdadm --detail returns many more useful stats, so there is room to grow, but it does the trick for now.

#!/usr/bin/env python
import argparse
import logging
import subprocess

try:  # Python 2
    import httplib
    from urllib import urlencode
except ImportError:  # Python 3
    import http.client as httplib
    from urllib.parse import urlencode

LOGGER = logging.getLogger('logger')

PO_MSG_ENDPOINT = "/1/messages.json"

def mdadm_check(args):
    for array in args.arrays:
        LOGGER.info('Checking array ' + array)
        # Let awk pull the failed-device count out of the mdadm report.
        cmd = "/sbin/mdadm --detail " + array + " | awk '/Failed Devices : / {print $4;}'"
        check = subprocess.Popen(cmd, stdout=subprocess.PIPE, shell=True,
                                 universal_newlines=True)
        output, _ = check.communicate()
        failed_drives = output.strip()  # drop the trailing newline

        # Empty or non-numeric output means mdadm could not report on the array.
        if not failed_drives.isdigit():
            LOGGER.error('Could not read failed drive count for ' + array)
            continue
        LOGGER.info('Found ' + failed_drives + ' failed drives, sending Pushover msg')

        if int(failed_drives) != 0:
            priority = 1  # Pushover "High" priority
            message = "CRITICAL: There are {} failed drives in {}".format(failed_drives, array)
        else:
            priority = -1  # Pushover "Low" priority
            message = "INFO: There are {} failed drives in {}".format(failed_drives, array)
        post_to_pushover(args.token, args.key, str(priority), message)

def post_to_pushover(token, key, priority, msg):
    try:
        LOGGER.info('Opening HTTPS connection to api.pushover.net...')
        po_api = httplib.HTTPSConnection("api.pushover.net:443")
        po_api.request("POST", PO_MSG_ENDPOINT,
                       urlencode({
                           "token": token,
                           "user": key,
                           "priority": priority,
                           "message": msg,
                       }), {"Content-type": "application/x-www-form-urlencoded"})
        response = po_api.getresponse()
        LOGGER.info("{}: {}".format(response.status, response.reason))
    except Exception as ex:
        LOGGER.error('Could not connect to Pushover: ' + str(ex))

if __name__ == '__main__':
    PARSER = argparse.ArgumentParser(description='Simple software RAID health check tool using mdadm and Pushover.')
    PARSER.add_argument('-a', '--array', dest='arrays', action='append', help='RAID array i.e /dev/md0', required=True)
    PARSER.add_argument('-t', '--token', dest='token', help='Pushover App Token', required=True)
    PARSER.add_argument('-k', '--key', dest='key', help='Pushover User Key', required=True)
    ARGS = PARSER.parse_args()
    # Without a configured handler, the INFO messages above are not shown.
    logging.basicConfig(level=logging.INFO)
    mdadm_check(ARGS)