Hard Drive Test in Linux
Posted: Sep 29, 2015 by Bryan Tong
Ever wanted to see how well your hard drive actually performs? Do you burn in your hard drives before putting servers / computers into production? Here are some tips and some quick commands to help ensure your disks are operating properly.
Using dd
Read testing
This goes quickly if you have a small and fast drive. I also do a burn-in, especially on the OS drive, before releasing a server into production. Initially I just want to read all the way across the disk so that, if there are any bad blocks, we blow up now and not later.
$ dd if=/dev/sda of=/dev/null bs=1M
This assumes your drive is /dev/sda. If you are not sure where your disk is, try fdisk -l or ls -l /dev/sd*
This command safely reads the entire contents of the drive (including the partition table) into /dev/null, that is, into nothingness. If it completes without error, we can be reasonably confident that the drive is working well.
Here is the result on a 32GB Samsung mSATA drive.
root@svr:~# dd if=/dev/sda of=/dev/null bs=1M
30533+1 records in
30533+1 records out
32017047552 bytes (32 GB) copied, 84.5127 s, 379 MB/s
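For an unattended burn-in, the read pass can be wrapped so that a failure is hard to miss. The following is only a minimal sketch: the function name read_scan and its PASS/FAIL messages are my own invention, and it assumes GNU dd (for bs=1M).

```shell
# read_scan PATH: read PATH from start to end and throw the data away.
# Hypothetical helper; returns non-zero if the path is unreadable or dd
# reports an error part way through.
read_scan() {
    target=$1
    if [ ! -r "$target" ]; then
        echo "read_scan: cannot read $target" >&2
        return 2
    fi
    # dd prints its own summary on stderr; we only add a verdict.
    if dd if="$target" of=/dev/null bs=1M; then
        echo "PASS: $target read end to end"
    else
        echo "FAIL: read error on $target, check dmesg" >&2
        return 1
    fi
}

# Example (as root, on the real disk):
#   read_scan /dev/sda
```

Because the function only reads, it is as safe as the plain dd command above; the exit status simply makes it easier to chain into a larger burn-in script.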
A nice side effect of this test is that it also gives you a fair benchmark of the read speed. (However, if you are benchmarking, I would recommend adding iflag=direct so that the reads bypass the Linux page cache.)
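If you want the throughput as a number you can log or compare between drives, the dd summary line can be parsed. This is a rough sketch, assuming the GNU dd summary format shown above; dd_mbps is a made-up name, not part of any tool.

```shell
# dd_mbps: read dd's summary line on stdin and print the throughput in
# MB/s (decimal megabytes, the same unit dd reports).
dd_mbps() {
    LC_ALL=C awk '
        / copied, / {
            bytes = $1
            # The elapsed time is the field just before the "s," token,
            # which holds for both older and newer GNU dd output.
            for (i = 2; i <= NF; i++)
                if ($i == "s,") secs = $(i - 1)
            if (secs > 0) printf "%.1f\n", bytes / secs / 1000000
        }'
}

# Example (dd writes its summary to stderr, hence 2>&1):
#   dd if=/dev/sda of=/dev/null bs=1M iflag=direct 2>&1 | dd_mbps
```

The number is recomputed from the byte count and elapsed time, so it can differ from dd's own rounded figure in the last digit.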
Write Testing
We have to be a lot more careful with write testing as there are a lot of factors going into making sure the disk is actually being written to.
Depending on file size limitations I use a combination of tools to test writing.
$ cd /root
$ dd if=/dev/zero of=testfile bs=1M count=10k oflag=direct
Here I am reading from /dev/zero, which produces only zeros, and writing them to /root/testfile (this path needs to be somewhere on the disk you want to test). I use a 1 MiB block size (bs=1M) and write 10k blocks; dd treats k as 1024, so that is 10,240 blocks, or 10 GiB in total. Finally I add oflag=direct to tell dd to bypass the Linux page cache, so the data actually goes to the disk instead of sitting in memory. Make sure there is enough free space to write the file before continuing.
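The free-space check can be scripted so the write test never fills the disk by accident. A small sketch, assuming a POSIX df; has_free_space is a hypothetical helper name.

```shell
# has_free_space DIR MIB: succeed if the filesystem holding DIR has at
# least MIB mebibytes available.
has_free_space() {
    dir=$1
    need_mib=$2
    # df -P prints one POSIX-format line per filesystem; -k reports KiB.
    # Column 4 of the second line is the available space.
    avail_kib=$(df -Pk "$dir" | awk 'NR == 2 { print $4 }')
    [ -n "$avail_kib" ] && [ "$avail_kib" -ge $((need_mib * 1024)) ]
}

# Example: only run the 10 GiB write test if it fits, then clean up.
#   has_free_space /root 10240 &&
#       dd if=/dev/zero of=/root/testfile bs=1M count=10k oflag=direct
#   rm -f /root/testfile
```

Removing the test file afterwards is my addition; the file has no further use once dd has finished.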
Here is the result on the same mSATA disk, this time from a shorter count=2k (2 GiB) run.
root@svr:~# dd if=/dev/zero of=testfile bs=1M count=2k oflag=direct
2048+0 records in
2048+0 records out
2147483648 bytes (2.1 GB) copied, 24.5838 s, 87.4 MB/s
As a side note, this also gives you an idea of the write performance of the disk.
More Write Testing with Bonnie++
On Debian I install Bonnie++ first.
$ apt-get -y install bonnie++
Next I run a basic test.
$ cd /root
$ mkdir bonnie
$ bonnie++ -d /root/bonnie -u root
If you are running as root then you must add -u root or Bonnie++ will complain. If you are another user this flag can be omitted.
Bonnie++ will read from and write to the disk with intelligent patterns, and if there are any problems they should start to appear now and not when the server is in production.
Checking Disk Health with DMESG
Now that we have done some testing on both the read and the write side, it is time to see whether the disk we have been testing has any errors. The first place I check is dmesg.
$ dmesg | tail
You might see something like this:
end_request: I/O error, dev sda, sector 63
This tells you that an I/O request to that sector failed, which usually means the disk has a bad sector.
Or you might see something like this:
[ 681.472852] ata1.00: failed command: READ DMA EXT
[ 681.472856] ata1.00: cmd 25/00:00:f8:eb:bd/00:01:1d:00:00/e0 tag 0 dma 131072 in
[ 681.472856] res 51/84:b0:48:ec:bd/84:00:1d:00:00/e0 Emask 0x70 (host bus error)
[ 681.472858] ata1.00: status: { DRDY ERR }
[ 681.472859] ata1.00: error: { ICRC ABRT }
[ 681.472866] ata1.00: hard resetting link
[ 681.791147] ata1.01: hard resetting link
[ 682.818130] ata1.01: failed to resume link (SControl 0)
[ 682.974052] ata1.00: SATA link up 1.5 Gbps (SStatus 113 SControl 310)
[ 682.974067] ata1.01: SATA link down (SStatus 0 SControl 0)
[ 682.998511] ata1.00: configured for UDMA/33
[ 682.998861] ata1: EH complete
[ 683.215898] ata1.00: exception Emask 0x50 SAct 0x0 SErr 0x280900 action 0x6
[ 683.215901] ata1.00: BMDMA stat 0x26
[ 683.215904] ata1.00: SError: { UnrecovData HostInt 10B8B BadCRC }
[ 683.215906] ata1.00: failed command: READ DMA EXT
[ 683.215909] ata1.00: cmd 25/00:88:a0:16:cb/00:00:1d:00:00/e0 tag 0 dma 69632 in
[ 683.215909] res 51/84:38:f0:16:cb/84:00:1d:00:00/e0 Emask 0x70 (host bus error)
[ 683.215911] ata1.00: status: { DRDY ERR }
[ 683.215912] ata1.00: error: { ICRC ABRT }
[ 683.215918] ata1.00: hard resetting link
This tells you the kernel is having trouble communicating with the drive. In my case, with my mSATA drives, I see these messages when the adapter card isn't seated properly on the motherboard. However, these problems can also occur with faulty drives. Either way, I don't recommend putting servers into production if any of these errors are present.
If you have a very long dmesg I recommend using grep to filter for errors.
$ dmesg | grep -i ata
$ dmesg | grep -i sector
In my case the server I was testing on did have a problem, which can be seen in these messages:
[ 1676.371513] ata1.00: exception Emask 0x10 SAct 0x783fffff SErr 0x400000 action 0x6 frozen
[ 1676.371597] ata1.00: irq_stat 0x08000000, interface fatal error
[ 1676.371666] ata1: SError: { Handshk }
[ 1676.371727] ata1.00: failed command: WRITE FPDMA QUEUED
[ 1676.371797] ata1.00: cmd 61/00:00:00:30:35/04:00:01:00:00/40 tag 0 ncq 524288 out
[ 1676.371798] res 40/00:a8:00:c4:35/00:00:01:00:00/40 Emask 0x10 (ATA bus error)
[ 1676.371946] ata1.00: status: { DRDY }
[ 1676.372007] ata1.00: failed command: WRITE FPDMA QUEUED
[ 1676.372077] ata1.00: cmd 61/00:08:00:34:35/04:00:01:00:00/40 tag 1 ncq 524288 out
[ 1676.372078] res 40/00:a8:00:c4:35/00:00:01:00:00/40 Emask 0x10 (ATA bus error)
[ 1676.372329] ata1.00: status: { DRDY }
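The message types shown in this section can be folded into a single filter, which is handy at the end of a burn-in script. This is a sketch only: the pattern list is taken from the examples above and is certainly not exhaustive, and the names disk_errors and kernel_log_clean are hypothetical.

```shell
# disk_errors: read a kernel log on stdin and print the lines that look
# like the disk or link errors shown above.
disk_errors() {
    grep -Ei 'I/O error|failed command|hard resetting link|exception Emask|bus error'
}

# kernel_log_clean: exit 0 if no suspicious lines were found, 1 if some
# were (the matching lines are printed so you can inspect them).
kernel_log_clean() {
    if disk_errors; then
        return 1
    fi
    return 0
}

# Example:
#   dmesg | kernel_log_clean || echo "disk or link errors found, do not deploy"
```

A match is only a prompt to look closer; as noted above, the same messages can come from a badly seated adapter as well as from a failing drive.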
Checking Disk Health with SMART
Another very useful tool for checking whether your disk is healthy is smartmontools, which is packaged for nearly every Linux distribution.
On Debian it needs to be installed.
$ apt-get -y install smartmontools
Once we are done with that we can check our disk out.
$ smartctl -a /dev/sda
I will post my output and analyze some variables.
root@svr:~# smartctl -a /dev/sda
smartctl 5.41 2011-06-09 r3365 [x86_64-linux-3.2.0-4-amd64] (local build)
Copyright (C) 2002-11 by Bruce Allen, http://smartmontools.sourceforge.net
=== START OF INFORMATION SECTION ===
Device Model: SAMSUNG MZMPC032HBCD-000H1
Serial Number: S0Y6NSAC584XXX
LU WWN Device Id: 5 002538 043584XXX
Firmware Version: CXM12H1Q
User Capacity: 32,017,047,552 bytes [32.0 GB]
Sector Size: 512 bytes logical/physical
Device is: Not in smartctl database [for details use: -P showall]
ATA Version is: 8
ATA Standard is: ACS-2 revision 2
Local Time is: Tue Sep 29 12:54:47 2015 PDT
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED
General SMART Values:
Offline data collection status: (0x02) Offline data collection activity
was completed without error.
Auto Offline Data Collection: Disabled.
Self-test execution status: ( 0) The previous self-test routine completed
without error or no self-test has ever
been run.
Total time to complete Offline
data collection: ( 180) seconds.
Offline data collection
capabilities: (0x5b) SMART execute Offline immediate.
Auto Offline data collection on/off support.
Suspend Offline collection upon new
command.
Offline surface scan supported.
Self-test supported.
No Conveyance Self-test supported.
Selective Self-test supported.
SMART capabilities: (0x0003) Saves SMART data before entering
power-saving mode.
Supports SMART auto save timer.
Error logging capability: (0x01) Error logging supported.
General Purpose Logging supported.
Short self-test routine
recommended polling time: ( 2) minutes.
Extended self-test routine
recommended polling time: ( 3) minutes.
SCT capabilities: (0x003d) SCT Status supported.
SCT Error Recovery Control supported.
SCT Feature Control supported.
SCT Data Table supported.
SMART Attributes Data Structure revision number: 1
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
1 Raw_Read_Error_Rate 0x002f 200 200 002 Pre-fail Always - 0
5 Reallocated_Sector_Ct 0x0033 100 100 010 Pre-fail Always - 0
9 Power_On_Hours 0x0032 099 099 000 Old_age Always - 1132
12 Power_Cycle_Count 0x0032 099 099 000 Old_age Always - 798
170 Unknown_Attribute 0x0013 086 086 010 Pre-fail Always - 352
171 Unknown_Attribute 0x0032 100 100 010 Old_age Always - 0
172 Unknown_Attribute 0x0032 100 100 010 Old_age Always - 0
173 Unknown_Attribute 0x0013 096 096 017 Pre-fail Always - 134
174 Unknown_Attribute 0x0032 099 099 000 Old_age Always - 14
183 Runtime_Bad_Block 0x0032 099 099 001 Old_age Always - 2
184 End-to-End_Error 0x0033 100 100 097 Pre-fail Always - 0
187 Reported_Uncorrect 0x0032 100 100 000 Old_age Always - 0
188 Command_Timeout 0x0032 100 100 000 Old_age Always - 0
190 Airflow_Temperature_Cel 0x0022 063 033 000 Old_age Always - 37
196 Reallocated_Event_Count 0x0002 253 253 000 Old_age Always - 0
198 Offline_Uncorrectable 0x0030 100 100 000 Old_age Offline - 0
199 UDMA_CRC_Error_Count 0x003e 200 200 000 Old_age Always - 4
SMART Error Log Version: 1
No Errors Logged
SMART Self-test log structure revision number 1
Num Test_Description Status Remaining LifeTime(hours) LBA_of_first_error
# 1 Short captive Completed without error 00% 666 -
SMART Selective self-test log data structure revision number 1
SPAN MIN_LBA MAX_LBA CURRENT_TEST_STATUS
1 0 0 Not_testing
2 0 0 Not_testing
3 0 0 Not_testing
4 0 0 Not_testing
5 0 0 Not_testing
Selective self-test flags (0x0):
After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.
The first thing I check is Raw_Read_Error_Rate, and I make sure its raw value is 0.
Next I check Reallocated_Sector_Ct and also make sure it is 0.
I also check Runtime_Bad_Block and make sure it is 0 (in this case it wasn't, as this server is having link errors unrelated to the drive).
Offline_Uncorrectable should also be 0.
UDMA_CRC_Error_Count should be 0 as well (here it is 4, which is consistent with the link errors seen in dmesg).
Most of the other attributes can be ignored when checking whether your drive is truly malfunctioning. In fact, some of these counters can be above 0 on a healthy drive. However, if the drive is new (which can be gauged by looking at Power_On_Hours) and any of the counters I mentioned are increasing, there is reason for concern.
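Checking those five counters by eye gets tedious across many servers, so here is a rough sketch that prints only the ones with a non-zero raw value. It assumes the ten-column attribute table shown above; smart_nonzero is a made-up name.

```shell
# smart_nonzero: read smartctl attribute output on stdin and print
# NAME=RAW for each watched counter whose RAW_VALUE is not 0.
smart_nonzero() {
    awk '
        $2 ~ /^(Raw_Read_Error_Rate|Reallocated_Sector_Ct|Runtime_Bad_Block|Offline_Uncorrectable|UDMA_CRC_Error_Count)$/ &&
        $10 + 0 != 0 { print $2 "=" $10 }'
}

# Example (-A prints just the attribute table):
#   smartctl -A /dev/sda | smart_nonzero
```

Run against the output above, this would report Runtime_Bad_Block=2 and UDMA_CRC_Error_Count=4. Saving the result and comparing it again after the burn-in is one way to see whether a counter is increasing.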
Well, that is it! I have gone over several methods to check and confirm drive health on servers and desktops before depending on the storage media in the machine.