Elasticsearch Garbage Collector

If you're running Elasticsearch, you'll sooner or later have to deal with disk space issues. The setup I currently manage ingests 200 to 300 million documents every 24 hours, so I needed a way to guarantee enough free disk space at all times so that Elasticsearch wouldn't fail.

The following Bash script is the latest incarnation of this quest for an automatic “ring buffer”:

#!/bin/bash

  LOCKFILE=/var/run/egc.lock

# Check if Lockfile exists

  if [ -e ${LOCKFILE} ] && kill -0 "$(cat ${LOCKFILE})" 2>/dev/null; then
    echo "EGC process already running"
    exit 1
  fi

# Make sure the Lockfile is removed 
# when we exit and then claim it

  trap "rm -f ${LOCKFILE}; exit" INT TERM EXIT

# Create Lockfile

  echo $$ > ${LOCKFILE}

# Always keep a minimum of 30GB free in logdata
# by sacrificing oldest index (ringbuffer)

  DF=$(/bin/df /dev/md0 | sed '1d' | awk '{print $4}')

  if [ ${DF} -le 30000000 ]; then
    INDEX=$(/bin/ls -1td /logdata/dntx-es/nodes/0/indices/logstash-* | tail -1 | xargs -n 1 basename)
    curl -XDELETE "http://localhost:9200/${INDEX}"
  fi

# Check & clean elasticsearch logs 
# if disk usage is > 10GB

  DU=$(/usr/bin/du -s /var/log/elasticsearch/ | awk '{print $1}')

  if [ ${DU} -ge 10000000 ]; then
    rm /var/log/elasticsearch/elasticsearch.log.20*
  fi

# Remove Lockfile

  rm -f ${LOCKFILE}

  exit 0

Make sure to check and modify the script to reflect your particular setup: your paths and device names are very likely different.
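One detail worth knowing when adapting it: the `ls -1td … | tail -1` line picks the oldest index by directory mtime, which can shift whenever Elasticsearch writes to an older index. Since Logstash index names embed their creation date (logstash-YYYY.MM.DD), a plain lexicographic sort gives the same chronological order without relying on mtimes. A sketch with made-up index names:

```shell
#!/bin/bash

# logstash-YYYY.MM.DD names sort chronologically when sorted as text,
# so the first line of the sorted list is always the oldest index,
# regardless of directory mtimes. The three names below are examples.
oldest=$(printf '%s\n' \
    logstash-2014.03.02 logstash-2014.02.27 logstash-2014.03.01 \
  | sort | head -n 1)

echo "${oldest}"
```

In the script itself that would mean replacing `ls -1td … | tail -1` with `ls -1d … | xargs -n 1 basename | sort | head -1`.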

It runs every 10 minutes (as a cron job) and checks the available space on the device where Elasticsearch stores its indices. In this example /dev/md0 is mounted on /logdata. If md0 has less than 30GB of free disk space, it automagically finds the oldest Elasticsearch index and drops it via Elasticsearch's REST API, with no service interruption (no stop/restart of Elasticsearch required).
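For reference, the 10-minute schedule could look like this in root's crontab (the script path /usr/local/bin/egc.sh is just an assumption; adjust it to wherever you keep the script):

```shell
# m/h  hr  dom mon dow  command
*/10   *   *   *   *    /usr/local/bin/egc.sh >> /var/log/egc.log 2>&1
```

Redirecting stdout and stderr to a log file keeps a record of which indices were dropped and when.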

A simple locking mechanism prevents multiple instances from running at once in case of timing issues. All the script needs is curl, and it keeps as much past data available as your storage allows, without the risk of a full disk or the hassle of manual monitoring and maintenance.
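If flock(1) from util-linux is available on your system, the PID-file dance can be replaced by a kernel-managed lock; the lock is released automatically when the process exits, so no trap or cleanup is needed. A sketch (/tmp/egc.lock is a placeholder path):

```shell
#!/bin/bash

# Open the lock file on a dedicated file descriptor and try to take an
# exclusive lock without blocking. If another instance already holds
# the lock, flock fails immediately and we bail out.
exec 200>/tmp/egc.lock

if ! flock -n 200; then
  echo "EGC process already running"
  exit 1
fi

echo "lock acquired, running garbage collection"
# ... disk checks and index deletion would go here ...
```

Unlike a PID file, this cannot leave a stale lock behind after a crash or reboot, since the kernel drops the lock as soon as the file descriptor is closed.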