Hard drive software that IT administrators use to monitor drive health is highly inconsistent from drive to drive and manufacturer to manufacturer, according to data collected from nearly 40,000 spindles.
The data, released today by cloud service provider Backblaze, also indicated which five of the 70 metrics that SMART stats cover are likely to predict a hard drive failure.
SMART, or Self-Monitoring, Analysis, and Reporting Technology, is nearly ubiquitous firmware that vendors embed in drives to alert IT admins to impending problems.
Due to a lack of industrywide SMART software and hardware standards, SMART data cannot be exchanged between vendor products. Vendors can also use SMART data to analyze issues across drive lines.
For several years, Backblaze has collected data on hard drive failures. It has released that data in company blogs, highlighting which manufacturers’ drives failed more often than others.
Backblaze’s most recent study, the results of which were also published in a company blog post, delved into SMART alerts based on the 40,000 or so hard drives the company has in its data center.
It found that five SMART stats do predict drive failures, according to Backblaze CEO Gleb Budman.
One SMART stat that Backblaze found correlated with impending hard drive failures is 187, a stat that indicates the number of read errors that occur on a hard drive. As those errors increase, the drive’s annual failure rate also climbs.
SMART software reports drive issues as normalized values under numbered stats, or categories, which range from SMART stat 1 to 253 (not all numbers in between are used). For example, SMART stat 1 represents data read error rates, which are displayed as a decimal number. SMART stat 240 represents the amount of time that a drive spends positioning read/write heads.
Backblaze’s analysis of nearly 40,000 drives identified five SMART metrics that correlate strongly with impending disk drive failure.
Backblaze counts a drive as failed when it is removed from a storage array and replaced because it has totally stopped working or because it has shown evidence of failing soon.
A drive is considered to have stopped working when the drive appears physically dead (e.g. won’t power up), it doesn’t respond to console commands or the RAID system reports that the drive can’t be read or written.
“To determine if a drive is going to fail soon we use SMART statistics as evidence to remove a drive before it fails catastrophically or impedes the operation of the Storage Pod volume,” Budman said.
For example, SMART stat 187 reports the number of reads that could not be corrected using hardware error correction code (ECC). Drives with 0 uncorrectable errors hardly ever fail, Budman said, “but once SMART 187 goes above 0, we schedule the drive for replacement.”
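The replacement rule Budman describes for SMART 187 can be sketched in a few lines of Python. This is only an illustration of the stated policy, not Backblaze's actual tooling: the `DriveReading` structure, the function names and the choice to treat a missing stat as 0 are all assumptions made for the example.

```python
from dataclasses import dataclass

# SMART stat 187: reads that could not be corrected using hardware ECC.
SMART_UNCORRECTABLE = 187


@dataclass
class DriveReading:
    """One drive's latest SMART snapshot (hypothetical structure)."""
    serial: str
    raw_values: dict  # maps SMART stat number -> reported counter value


def needs_replacement(reading: DriveReading) -> bool:
    """Return True once SMART 187 is above 0."""
    # Assumption: a drive that does not report stat 187 is treated as 0.
    return reading.raw_values.get(SMART_UNCORRECTABLE, 0) > 0


def drives_to_replace(readings):
    """Return the serials of drives that should be scheduled for replacement."""
    return [r.serial for r in readings if needs_replacement(r)]
```

Given a list of readings, `drives_to_replace` returns only the serials of drives whose stat 187 is above 0, which mirrors the "goes above 0" threshold in the quote.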
SMART stat 12 relates to drives powering on, which should indicate long-term wear, but didn’t, according to Backblaze.
One problem with fully understanding SMART stats, Budman said, is that drive manufacturers don’t share specific details of use cases for them.
“If you look at the Wikipedia entry for SMART stat 1, for example, it says ‘vendor specific’ value. Seagate wants to track something, but only they know what that is. Western Digital uses SMART for something else – neither will tell you what it is,” Budman said.
“SMART 1 might seem correlated to drive failure rates, but actually it’s more of an indication that different drive vendors are using it themselves for different things,” he added.
Budman pointed to SMART stat 12 as another example of a metric that should indicate an impending drive failure but doesn’t. SMART 12 counts how many times a drive is powered up, which should correlate with long-term wear. At first, Budman said, the annual failure rate seemed to rise along with SMART 12 values, but then the failure rates leveled off and actually went down.
“So at first it looks correlated but it’s not. It doesn’t have a linear progression,” he said. “Whatever indicator they put in there [the SMART firmware], it is not consistent.”
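One simple way to spot the pattern Budman describes, a failure rate that climbs and then falls as a stat increases, is to check whether the rates ever drop from one bucket of stat values to the next. The Python sketch below assumes failure rates have already been grouped into buckets ordered by rising SMART 12 value. The numbers are invented for illustration and are not Backblaze data.

```python
def is_non_decreasing(values):
    """True if each value is at least as large as the one before it."""
    return all(a <= b for a, b in zip(values, values[1:]))


# Hypothetical annual failure rates (percent) per bucket of rising stat value.
rises_then_falls = [1.0, 2.5, 4.0, 3.0, 2.0]
steadily_rising = [1.0, 2.0, 3.5, 5.0]

print(is_non_decreasing(rises_then_falls))  # False
print(is_non_decreasing(steadily_rising))   # True
```

A stat whose failure rates fail this check is a weak candidate for a simple "higher is worse" replacement threshold.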