Monday, October 14, 2013

Monitoring Hard Drive Health on Linux with smartmontools

S.M.A.R.T. is a system in modern hard drives designed to report conditions that may indicate impending failure. smartmontools is a free software package that can monitor S.M.A.R.T. attributes and run hard drive self-tests. Although smartmontools runs on a number of platforms, I will only cover installing and configuring it on Linux.

Why Use S.M.A.R.T.?

In short, S.M.A.R.T. may give you enough warning to safely back up all your data before your hard drive dies. There is conflicting information on the internet about how reliable the warnings are. The best research I found is a paper from Google that describes an internal study of hard drive failure. A quick summary: certain events, including reallocation events and failed self-tests, greatly increase the chance of hard drive failure, but only about 60% of the drives that failed in the study had any negative S.M.A.R.T. attributes. In other words, roughly 40% failed with no warning at all, so nothing replaces regular backups.
A good source for more information is the S.M.A.R.T. Wikipedia page.

Installation

On Debian or Ubuntu systems:
$ sudo apt-get install smartmontools
On Fedora:
$ sudo yum install smartmontools

Capabilities and Initial Tests

smartmontools comes with two programs: smartctl, which is meant for interactive use, and smartd, which continuously monitors S.M.A.R.T. Let’s look at smartctl first:
$ sudo smartctl -i /dev/sda
Replace /dev/sda with your hard drive’s device file in this command and all subsequent commands. If there’s only one hard drive in the system, it should be /dev/sda or /dev/hda. If this command fails, you may need to let smartctl know what type of hard drive interface you’re using:
$ sudo smartctl -d TYPE -i /dev/sda
where TYPE is usually one of ata, scsi, or sat (SCSI-to-ATA Translation, commonly needed for serial ATA drives that the kernel presents as SCSI devices). See the smartctl man page for more information. Note that if you need -d here, you will need to add it to all smartctl commands. Once the command succeeds, it should print information similar to:
=== START OF INFORMATION SECTION ===
Model Family:     SAMSUNG SpinPoint T133 series
Device Model:     SAMSUNG HD300LJ
Serial Number:    S0D7J1UL303628
Firmware Version: ZT100-12
User Capacity:    300,067,970,560 bytes
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   7
ATA Standard is:  ATA/ATAPI-7 T13 1532D revision 4a
Local Time is:    Fri Jan  2 03:08:20 2009 EST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
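If you want to script this check, the two “SMART support is:” lines are the ones that matter. The following is a minimal sketch, not part of smartmontools: smart_is_enabled is a hypothetical helper name, and it assumes the output format matches the sample above.

```shell
# Hypothetical helper: reads `smartctl -i` output on stdin and succeeds
# only if SMART is reported as both available and enabled.
# Assumes the "SMART support is:" lines look like the sample output above.
smart_is_enabled() {
    awk '
        /^SMART support is:[ ]*Available/ { available = 1 }
        /^SMART support is:[ ]*Enabled/   { enabled = 1 }
        END { exit !(available && enabled) }
    '
}

# Usage (replace /dev/sda with your device file):
#   sudo smartctl -i /dev/sda | smart_is_enabled && echo "SMART is on"
```

If the helper fails because support is available but not enabled, the next command turns it on.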
Now that smartctl can access the drive, let’s turn on some features. Run the following command:
$ sudo smartctl -s on -o on -S on /dev/sda
  • -s on: This turns on S.M.A.R.T. support or does nothing if it’s already enabled.
  • -o on: This turns on offline data collection. Offline data collection periodically updates certain S.M.A.R.T. attributes. Theoretically this could have a performance impact. However, from the smartctl man page:
    Normally, the disk will suspend offline testing while disk accesses are taking place, and then automatically resume it when the disk would otherwise be idle, so in practice it has little effect.
  • -S on: This enables “autosave of device vendor-specific Attributes”.
The command should return:
=== START OF ENABLE/DISABLE COMMANDS SECTION ===
SMART Enabled.
SMART Attribute Autosave Enabled.
SMART Automatic Offline Testing Enabled every four hours.
Next, let’s check the overall health:
$ sudo smartctl -H /dev/sda
This command should return:
=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED
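For scripts, smartctl also reports problems through its exit status, which the man page documents as a bitmask. The sketch below decodes it; describe_smartctl_status is a hypothetical helper name, and the bit meanings are paraphrased from the EXIT STATUS section of the smartctl man page, so check your installed version for the exact wording.

```shell
# Hypothetical helper: print a line for every bit set in a smartctl
# exit status. Meanings are paraphrased from the smartctl man page.
describe_smartctl_status() {
    status=$1
    if [ "$status" -eq 0 ]; then
        echo "no problems reported"
        return 0
    fi
    bit=0
    for meaning in \
        "command line did not parse" \
        "device open failed" \
        "a SMART command failed or returned bad data" \
        "SMART status check returned DISK FAILING" \
        "prefail attributes at or below threshold" \
        "attributes were at or below threshold in the past" \
        "device error log contains errors" \
        "self-test log contains errors"
    do
        # Test one bit of the status at a time, lowest bit first.
        if [ $(( ($status >> $bit) & 1 )) -eq 1 ]; then
            echo "bit $bit: $meaning"
        fi
        bit=$((bit + 1))
    done
}

# Usage (replace /dev/sda with your device file):
#   sudo smartctl -H /dev/sda; describe_smartctl_status $?
```

Checking the exit status is less fragile than searching the output for PASSED, since it does not depend on the exact wording of the report.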
If the health check doesn’t return PASSED, you should immediately back up all your data. Your hard drive is probably failing. Next, let’s make sure that the drive supports self-tests. I have yet to see a drive that doesn’t, but the following command also gives time estimates for each test:
$ sudo smartctl -c /dev/sda
I won’t list the complete output because it’s somewhat lengthy. Make sure “Self-test supported” appears in the “Offline data collection capabilities” section. Also, look for output similar to:
Short self-test routine
recommended polling time:   (   2) minutes.
Extended self-test routine
recommended polling time:   ( 127) minutes.
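To pull those numbers out of the lengthy output, something like the following can work. It is only a sketch: selftest_minutes is a hypothetical helper name, and it assumes each “self-test routine” line is followed by its “recommended polling time” line, as in the excerpt above.

```shell
# Hypothetical helper: reads `smartctl -c` output on stdin and prints
# "<test name>: <minutes>" for each self-test routine it finds.
selftest_minutes() {
    awk '
        /self-test routine/ { name = $1 }
        /recommended polling time:/ {
            minutes = $0
            gsub(/[^0-9]/, "", minutes)   # keep only the digits
            if (name != "" && minutes != "") print name ": " minutes
            name = ""
        }
    '
}

# Usage (replace /dev/sda with your device file):
#   sudo smartctl -c /dev/sda | selftest_minutes
```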
The recommended polling times are rough estimates of how long the short and long self-tests will take, respectively. Let’s run the short test:
$ sudo smartctl -t short /dev/sda
On my drive, this test should take 2 minutes, but this obviously varies. You can run:
$ sudo smartctl -l selftest /dev/sda
to check results. While a test is running, smartctl -c reports the percentage remaining under “Self-test execution status”; otherwise, just keep running the command above until the results show up. A successful run will look like:
=== START OF READ SMART DATA SECTION ===
SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Short offline       Completed without error       00%     21472         -
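If you would rather check the result from a script, the most recent entry is the one numbered 1. The helpers below are a sketch with hypothetical names (latest_selftest_status, selftest_passed); they assume the log columns are separated by two or more spaces, as in the sample above.

```shell
# Hypothetical helper: reads `smartctl -l selftest` output on stdin and
# prints the Status column of the most recent entry ("# 1").
# Assumes columns are separated by runs of two or more spaces.
latest_selftest_status() {
    awk -F '  +' '/^# *1 / { print $3 }'
}

# Succeeds only if the most recent self-test completed without error.
selftest_passed() {
    [ "$(latest_selftest_status)" = "Completed without error" ]
}

# Usage (replace /dev/sda with your device file):
#   sudo smartctl -l selftest /dev/sda | selftest_passed || echo "check the log"
```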
Now, do the same for the long self-test:
$ sudo smartctl -t long /dev/sda
The long test can take a significant amount of time. You might want to run it overnight and check for the results in the morning. If either test fails, you should immediately back up all your data and read the last section of this guide.

Configuring smartd

We’ve now enabled some features and run the basic tests. Instead of repeating the previous section daily, we can set up smartd to do it all automatically. If your system has an /etc/smartd.conf file, check for a line that begins with DEVICESCAN. If you find one, comment it out by adding a ‘#’ to the beginning of the line; smartd ignores every line that follows a DEVICESCAN entry. DEVICESCAN doesn’t work on my system, and specifying a device file is easy. Add the following line to /etc/smartd.conf:
/dev/sda -a -d sat -o on -S on -s (S/../.././02|L/../../6/03) -m root -M exec /usr/share/smartmontools/smartd-runner
Here’s what each option does:
  • /dev/sda: Replace this with the device file you’ve been using in smartctl commands.
  • -a: This enables some common options. You almost certainly want to use it.
  • -d sat: On my system, smartctl correctly guesses that I have a serial ATA drive. smartd, on the other hand, does not. If you had to add a “-d TYPE” parameter to the smartctl commands, you’ll almost certainly have to do the same here. If you didn’t, try leaving it out initially. You can add it later if smartd fails to start.
  • -o on, -S on: These have the same meaning as the smartctl equivalents.
  • -s (S/../.././02|L/../../6/03): This schedules the short and long self-tests. In this example, the short self-test will run daily at 2:00 A.M. The long test will run on Saturdays at 3:00 A.M. For more information, see the smartd.conf man page.
  • -m root: If any errors occur, smartd will send email to root. On my system, mail for root is forwarded to my normal email account. If you don’t have a similar setup, replace root with your normal email address. This option also requires a working email setup. Many Linux distributions come with working outbound email, but it is worth verifying with the test described below.
  • -M exec /usr/share/smartmontools/smartd-runner: This last part may be specific to the Debian and Ubuntu smartmontools packages. Check if your system has /usr/share/smartmontools/smartd-runner. If it doesn’t, remove this option. Instead of sending email directly, “-M exec” makes smartd run a different command when errors occur. On Debian, smartd-runner will run each script in /etc/smartmontools/run.d/, one of which emails the user specified by the “-m” option.
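The -s pattern is easier to read once you know what it is matched against. According to the smartd.conf man page, smartd compares the pattern with a string of the form T/MM/DD/d/HH, where T is the test type (S for short, L for long), d is the day of the week (1 for Monday through 7 for Sunday), and HH is the hour. The sketch below lets you try a pattern before putting it in the config file. The function names are hypothetical, and it assumes the pattern is meant to describe the whole string, as the example pattern does.

```shell
# Hypothetical helper: succeeds if the given T/MM/DD/d/HH stamp matches
# the schedule pattern. Uses grep's extended regular expressions and
# requires the whole stamp to match (an assumption of this sketch).
schedule_matches() {
    pattern=$1
    stamp=$2    # for example S/10/14/1/02
    printf '%s\n' "$stamp" | grep -Eqx "$pattern"
}

# Hypothetical helper: build the stamp for the current hour.
# %u is the day of the week with Monday as 1, matching smartd's "d".
current_stamp() {
    printf '%s/%s\n' "$1" "$(date +%m/%d/%u/%H)"
}

# Usage: would a long test be due right now?
#   schedule_matches '(S/../.././02|L/../../6/03)' "$(current_stamp L)" && echo "due"
```

With the example pattern, S/10/14/1/02 (a Monday at 2:00 A.M.) matches the short-test branch, while L/10/14/1/03 does not match because the long-test branch requires day 6.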
If you have more than one hard drive in your system, add a line for each one replacing /dev/sda with a different device file.
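With several drives, generating the lines avoids copy-and-paste mistakes. This is a sketch only: smartd_conf_lines is a hypothetical helper name, and it deliberately leaves out the system-specific “-d sat” and “-M exec” options described above, so add those back if your system needs them.

```shell
# Hypothetical helper: print one smartd.conf line per device file given
# as an argument, using the schedule and options from the example above
# (without the system-specific -d and -M exec options).
smartd_conf_lines() {
    for device in "$@"; do
        printf '%s -a -o on -S on -s (S/../.././02|L/../../6/03) -m root\n' "$device"
    done
}

# Usage: review the output, then append it to /etc/smartd.conf by hand.
#   smartd_conf_lines /dev/sda /dev/sdb
```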
Update on 2009-01-06:
Thanks to commenter robert for pointing out an omission on my part. If your system has the file /etc/default/smartmontools, uncomment the “#start_smartd=yes” line by removing the “#”.
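If you prefer not to edit the file by hand, sed can do the uncommenting. The sketch below uses a hypothetical helper name (enable_start_smartd) and writes the result to standard output rather than changing the file, so you can review it first.

```shell
# Hypothetical helper: print the given file with a commented-out
# "start_smartd=yes" line uncommented. The file itself is not modified.
enable_start_smartd() {
    sed 's/^# *start_smartd=yes/start_smartd=yes/' "$1"
}

# Usage: review the output before replacing the real file.
#   enable_start_smartd /etc/default/smartmontools
```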
Finally, restart smartd:
$ sudo /etc/init.d/smartmontools restart
If this command fails, the end of /var/log/daemon.log should have some diagnostic information. If smartd started fine, we should still test that email notifications are working. Add “-M test” to the end of the configuration line in /etc/smartd.conf. This will make smartd send out a test notification when it’s next started. Once again, restart smartd:
$ sudo /etc/init.d/smartmontools restart
You should receive an email similar to:
This email was generated by the smartd daemon running on:

   host name: polar
  DNS domain: shadypixel.com
  NIS domain: (none)

The following warning/error was logged by the smartd daemon:

TEST EMAIL from smartd for device: /dev/sda

For details see host's SYSLOG (default: /var/log/syslog).
Afterward, you can delete “-M test”.

What To Do If smartd Detects Problems

First, immediately back up everything. Depending on the error, your drive might be close to death, or it may still have a long life ahead. Consult the smartmontools FAQ, which has recommendations for specific errors. Otherwise, ask for help on the smartmontools-support mailing list.
