Smartmontools – Comprehensive Monitoring Guide

Overview

Smartmontools is an essential tool for monitoring SSDs and HDDs in modern server environments. It identifies early signs of hardware degradation, provides detailed diagnostics, and integrates seamlessly into automated maintenance workflows. This guide delivers a production‑ready configuration, clear threshold recommendations, structured troubleshooting, operational best practices, and an automated daily health‑check script for the server at1.


1. Recommended smartd.conf (NVMe + SATA)

# NVMe (Samsung 980)
/dev/nvme0 -a -d nvme -m root -M diminishing -W 4,45,55 -s (S/../.././03|L/../01/./04)

# SATA SSDs (sda/sdb)
/dev/sda -a -n standby -m root -M diminishing -W 4,45,55 -s (S/../.././02|L/../01/./05)
/dev/sdb -a -n standby -m root -M diminishing -W 4,45,55 -s (S/../.././02|L/../01/./06)

Key Option Summary

  • -a – monitor all SMART attributes
  • -d nvme – NVMe driver mode
  • -n standby – skip checks while a drive is in standby so it is not spun up (SATA drives)
  • -m root – send alerts to the root mailbox
  • -M diminishing – suppresses repeated notifications
  • -W 4,45,55 – temperature tracking: report changes of 4°C or more, log at 45°C, send an alert at 55°C
  • -s (…) – scheduled self‑tests (daily short test, monthly long test)
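
After editing /etc/smartd.conf, the configuration can be validated and the daemon reloaded. A minimal sketch, assuming a systemd host where the unit is named smartmontools (on some distributions it is smartd):

# Parse the configuration, run one check cycle in the foreground and exit
smartd -q onecheck

# Apply the new directives
systemctl restart smartmontools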

2. Recommended SMART Thresholds for 24/7 Operation

Note: Many SMART attributes are vendor‑specific. Interpret them cautiously and observe long‑term trends rather than isolated values.

Temperature Guidelines

  • Info/log threshold (smartd -W): 45°C
  • Critical threshold (mail alert): 55°C
  • Ideal NVMe range: 40–50°C

NVMe Wear Indicators

  • Percentage Used > 80% → plan replacement
  • Media/Data Integrity Errors > 0 → critical
  • Critical Warning != 0 → immediate replacement

SATA Attributes That Matter

  • Reallocated_Sector_Ct > 0 → monitor closely
  • Uncorrectable Error Count > 0 → serious issue
  • CRC_Error_Count > 0 → often cable‑related
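
These counters can be spot‑checked from the shell. A small sketch using the device names from section 1; the grep patterns are assumptions that simply match the relevant lines of smartctl output:

# NVMe wear and integrity counters
smartctl -A /dev/nvme0 | grep -Ei 'percentage used|media and data integrity errors|critical warning'

# SATA reallocation, uncorrectable and CRC counters
for DEV in /dev/sda /dev/sdb; do
  echo "== ${DEV} =="
  smartctl -A "${DEV}" | grep -Ei 'reallocated_sector_ct|uncorrect|crc_error'
done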

3. Monitoring Strategy for Continuous Server Operation

Tip: Test email alerts by temporarily adding -M test to a device line in smartd.conf (for example -m root -M test) and restarting smartd; a single test message is sent to root when the daemon starts.

smartd (Automatic Monitoring)

  • Performs continuous attribute monitoring
  • Executes scheduled tests
  • Sends pre‑fail warnings

Automated Daily Health Check

Covers:

  • SMART summaries
  • Btrfs device statistics
  • Btrfs scrub status
  • Filesystem usage
  • Recent high‑priority journal logs

Monthly Manual Long Tests

smartctl -t long /dev/nvme0
smartctl -t long /dev/sda
smartctl -t long /dev/sdb
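
Self‑test progress and results can be reviewed once the tests have finished. A short sketch; the NVMe self‑test log needs a reasonably recent smartmontools release, so the generic -a report is used there:

# Self-test log for the SATA drives
smartctl -l selftest /dev/sda
smartctl -l selftest /dev/sdb

# Full report (health, attributes, logs) for the NVMe drive
smartctl -a /dev/nvme0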

Log Integration

  • System logs: /var/log/syslog
  • Service logs: journalctl -u smartmontools.service
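
A quick way to review recent smartd messages, assuming the Debian‑style unit name used above (it may be smartd.service elsewhere):

# Warnings and errors reported by smartd over the last week
journalctl -u smartmontools.service -p warning --since "7 days ago"

# Or search the classic syslog file
grep -i smartd /var/log/syslog | tail -n 50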

4. Diagnostic Guide (Handling SMART Warnings)

SMART warnings typically appear as pre‑failure flags, temperature alerts, or steadily increasing error counters. The following list highlights the most relevant indicators.

Reallocated_Sector_Ct > 0

Indicates physical degradation.

  • Action: plan replacement.

Current_Pending_Sector > 0

Read‑error candidates waiting for reallocation.

  • Action: run a long test and ensure backups.
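
A typical sequence for this case, sketched with /dev/sda as the affected drive:

smartctl -t long /dev/sda                                      # start the long self-test
smartctl -l selftest /dev/sda                                  # check progress / final result later
smartctl -A /dev/sda | grep -Ei 'current_pending|reallocated'  # re-read the counters afterwards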

CRC_Error_Count Increasing

Usually a cable or connection problem.

  • Action: replace/reattach cable.

NVMe Media/Data Integrity Errors > 0

Critical reliability failure.

  • Action: back up immediately and replace.

Temperature Warnings

  • Improve airflow
  • Adjust NVMe heatsink
  • Remove dust
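
When a temperature warning arrives, the current values can be read directly; a small sketch for the drives used in this guide:

for DEV in /dev/nvme0 /dev/sda /dev/sdb; do
  echo "== ${DEV} =="
  smartctl -A "${DEV}" | grep -i temperature
done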

5. Best Practices for Reliable 24/7 Operation

  • Use realistic SMART thresholds
  • NVMe drives do not require daily long self‑tests
  • Keep temperatures below 55°C
  • Run a monthly Btrfs scrub (see the example after this list)
  • Replace SSDs at 80%+ wear
  • Use Btrfs RAID1 to stay operational through a single‑drive failure; it is not a substitute for backups
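
A minimal monthly scrub routine for the RAID1 mount used later in the health‑check script (/mnt/raid1) could look like this:

btrfs scrub start /mnt/raid1     # runs in the background
btrfs scrub status /mnt/raid1    # progress and any corrected/uncorrectable errors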

6. Automated Health‑Check Script for at1

Tip: Run /root/scripts/at1_healthcheck.sh manually for testing. Logs are stored in /var/log/healthcheck/.

File: /root/scripts/at1_healthcheck.sh

#!/bin/bash
# Script Version: 01
# Description: Daily health check for at1 (SMART, Btrfs, disk usage, basic log summary)

set -euo pipefail

# Set variables
# ========
LOG_DIR=/var/log/healthcheck
LOG_FILE="${LOG_DIR}/health_$(date +%F).log"
HOSTNAME=$(hostname)
BTRFS_MOUNT=/mnt/raid1
NVME_DEV=/dev/nvme0
SATA1_DEV=/dev/sda
SATA2_DEV=/dev/sdb

DEBUG=1

log() {
  local LEVEL="$1"; shift
  echo "$(date '+%F %T') [${HOSTNAME}] [${LEVEL}] $*" | tee -a "${LOG_FILE}"
}

_debug() {
  if [ "${DEBUG}" -eq 1 ]; then
    log "DEBUG" "$*"
  fi
}

# Functions
# ========
init_log() {
  mkdir -p "${LOG_DIR}"
  touch "${LOG_FILE}"
}

check_smart() {
  log INFO "==== SMART Status (${NVME_DEV}, ${SATA1_DEV}, ${SATA2_DEV}) ===="
  for DEV in "${NVME_DEV}" "${SATA1_DEV}" "${SATA2_DEV}"; do
    if [ -b "${DEV}" ]; then
      _debug "Checking SMART for ${DEV}"
      echo "---- ${DEV} ----" | tee -a "${LOG_FILE}"
      # smartctl uses non-zero bitmask exit codes even for warnings; don't abort under set -e/pipefail
      smartctl -H -A "${DEV}" 2>&1 | tee -a "${LOG_FILE}" || true
    else
      log WARN "Device ${DEV} does not exist, skipping"
    fi
  done
}

check_btrfs_device_stats() {
  log INFO "==== Btrfs Device Stats (${BTRFS_MOUNT}) ===="
  if mount | grep -q " ${BTRFS_MOUNT} "; then
    btrfs device stats "${BTRFS_MOUNT}" 2>&1 | tee -a "${LOG_FILE}"
  else
    log WARN "${BTRFS_MOUNT} is not mounted, skipping device stats"
  fi
}

check_btrfs_scrub_status() {
  log INFO "==== Btrfs Scrub Status (${BTRFS_MOUNT}) ===="
  if mount | grep -q " ${BTRFS_MOUNT} "; then
    btrfs scrub status "${BTRFS_MOUNT}" 2>&1 | tee -a "${LOG_FILE}"
  else
    log WARN "${BTRFS_MOUNT} is not mounted, skipping scrub status"
  fi
}

check_df() {
  log INFO "==== Filesystem Usage (df -h) ===="
  df -h | tee -a "${LOG_FILE}"
}

check_journal_errors() {
  log INFO "==== Journal Summary (last 15 minutes, priority <= 3) ===="
  journalctl --since "15 min ago" -p 0..3 -o short-precise 2>/dev/null | tee -a "${LOG_FILE}" || true
}

# Main Process
# ========
init_log
log INFO "Starting health check for ${HOSTNAME}"

check_smart
check_btrfs_device_stats
check_btrfs_scrub_status
check_df
check_journal_errors

log INFO "Health check completed successfully"

Your monitoring setup is now clean, consistent, and ready for long‑term reliable operation.