@cdmonkey
2017-05-11T10:03:19.000000Z
Nagios
https://github.com/v-zhuravlev/zbx-smartctl
SMART: Self Monitoring Analysis and Reporting Technology
Download: http://www.smartmontools.org/wiki/Download
Linux.cn: http://linux.cn/article-4461-weixin.html
http://my.oschina.net/u/877567/blog/307336
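Original post on hard disk checking: http://czmmiao.iteye.com/blog/1058215
Hard disks are the server components most prone to problems and failure, and they hold large amounts of important data, so the loss caused by a failure can be incalculable. At best, a great deal of time and effort goes into data recovery; at worst, the disk is a write-off and the important data on it cannot be fully recovered. Monitoring disk health is therefore especially important. To avoid being caught out by a failed disk, you can use the smartmontools package, which supervises storage hardware through SMART (Self-Monitoring, Analysis and Reporting Technology).
Install the build dependencies: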
yum install gcc gcc-c++
[root@oadb tools]# tar zxvf smartmontools-6.4.tar.gz
[root@oadb tools]# cd smartmontools-6.4
[root@oadb smartmontools-6.4]# ./configure --prefix=/usr/local --without-selinux
make && make install
------------------
# Yum install:
yum install -y smartmontools
# View the version information:
[root@oadb ~]# smartctl -V
smartctl 6.4 2015-06-04 r4109 [x86_64-linux-2.6.18-308.el5] (local build)
Copyright (C) 2002-15, Bruce Allen, Christian Franke, www.smartmontools.org
...
Testing on a server fitted with eight hard disks:
[root@kvm-test ~]# smartctl --scan-open
/dev/sda -d scsi # /dev/sda, SCSI device
/dev/bus/6 -d megaraid,0 # /dev/bus/6 [megaraid_disk_00], SCSI device
/dev/bus/6 -d megaraid,1 # /dev/bus/6 [megaraid_disk_01], SCSI device
/dev/bus/6 -d megaraid,2 # /dev/bus/6 [megaraid_disk_02], SCSI device
/dev/bus/6 -d megaraid,3 # /dev/bus/6 [megaraid_disk_03], SCSI device
/dev/bus/6 -d megaraid,4 # /dev/bus/6 [megaraid_disk_04], SCSI device
/dev/bus/6 -d megaraid,5 # /dev/bus/6 [megaraid_disk_05], SCSI device
/dev/bus/6 -d megaraid,6 # /dev/bus/6 [megaraid_disk_06], SCSI device
/dev/bus/6 -d megaraid,7 # /dev/bus/6 [megaraid_disk_07], SCSI device
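Here, "megaraid" appears to be the name of the RAID controller chip (in smartctl it is the device type used for disks attached to a MegaRAID controller), and the 0 in megaraid,0 is the number of the physical disk behind that controller.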
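The scan output above has a regular shape (device path, then -d, then device type), so it can be turned into device/type pairs for scripting. The following is a minimal sketch, not part of smartmontools: the function name list_scan_devices is hypothetical, and it assumes every scan line follows the "DEVICE -d TYPE # comment" layout shown above.

```shell
#!/bin/sh
# list_scan_devices (hypothetical helper): read `smartctl --scan-open` output
# on stdin and print "<device> <type>" for each detected device.
# Assumes each line looks like: /dev/bus/6 -d megaraid,0 # comment...
list_scan_devices() {
    awk '$2 == "-d" { print $1, $3 }'
}

# Possible usage on a live system (needs root), checking every disk in turn:
#   smartctl --scan-open | list_scan_devices | while read -r dev type; do
#       smartctl -H -d "$type" "$dev"
#   done
```

Reading from stdin keeps the helper independent of smartctl itself, so it can be tried against saved scan output first.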
Fetch the device's SMART information:
[root@kvm-test ~]# smartctl --all /dev/sda
smartctl 6.5 2016-05-07 r4318 [x86_64-linux-2.6.32-573.el6.x86_64] (local build)
Copyright (C) 2002-16, Bruce Allen, Christian Franke, www.smartmontools.org
=== START OF INFORMATION SECTION ===
Vendor: DELL
Product: PERC H710
Revision: 3.13
User Capacity: 2,096,078,258,176 bytes [2.09 TB]
Logical block size: 512 bytes
Logical Unit id: 0x6c81f660e79791001ec4a6da05d41165
Serial number: 006511d405daa6c41e009197e760f681
Device type: disk
Local Time is: Wed Apr 19 13:50:19 2017 CST
SMART support is: Unavailable - device lacks SMART capability.
=== START OF READ SMART DATA SECTION ===
Current Drive Temperature: 0 C
Drive Trip Temperature: 0 C
Error Counter logging not supported
Device does not support Self Test logging
As shown, the RAID controller itself does not support SMART. Now look at an individual disk:
[root@kvm-test ~]# smartctl -i -d megaraid,0 /dev/sda
smartctl 6.5 2016-05-07 r4318 [x86_64-linux-2.6.32-573.el6.x86_64] (local build)
Copyright (C) 2002-16, Bruce Allen, Christian Franke, www.smartmontools.org
=== START OF INFORMATION SECTION ===
Vendor: SEAGATE
Product: ST9300653SS
Revision: YS0A
Compliance: SPC-4
User Capacity: 300,000,000,000 bytes [300 GB]
Logical block size: 512 bytes
Rotation Rate: 15000 rpm
Form Factor: 2.5 inches
Logical Unit id: 0x5000c500716ff47f
Serial number: 6XN5SGF1
Device type: disk
Transport protocol: SAS (SPL-3)
Local Time is: Wed Apr 19 14:49:20 2017 CST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
Temperature Warning: Disabled or Not Supported
As shown, the disk itself does support SMART.
[root@oadb ~]# smartctl -i /dev/sda
smartctl 6.4 2015-06-04 r4109 [x86_64-linux-2.6.18-308.el5] (local build)
Copyright (C) 2002-15, Bruce Allen, Christian Franke, www.smartmontools.org
Smartctl open device: /dev/sda failed: DELL or MegaRaid controller, please try adding '-d megaraid,N'
# View disk information:
[root@oadb ~]# smartctl -i -d megaraid,0 /dev/sda
smartctl 6.4 2015-06-04 r4109 [x86_64-linux-2.6.18-308.el5] (local build)
Copyright (C) 2002-15, Bruce Allen, Christian Franke, www.smartmontools.org
=== START OF INFORMATION SECTION ===
Vendor: SEAGATE
Product: ST9300653SS
Revision: YS09
Compliance: SPC-4
User Capacity: 300,000,000,000 bytes [300 GB]
Logical block size: 512 bytes
Rotation Rate: 15000 rpm
Form Factor: 2.5 inches
Logical Unit id: 0x5000c5005fb0745b
Serial number: 6XN33ER0
Device type: disk
Transport protocol: SAS (SPL-3)
Local Time is: Thu Oct 22 19:02:55 2015 CST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
Temperature Warning: Disabled or Not Supported
# Check the hard disk health status:
[root@oadb ~]# smartctl -H /dev/sda -d megaraid,0
smartctl 6.4 2015-06-04 r4109 [x86_64-linux-2.6.18-308.el5] (local build)
Copyright (C) 2002-15, Bruce Allen, Christian Franke, www.smartmontools.org
=== START OF READ SMART DATA SECTION ===
SMART Health Status: OK
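SMART can only report that a disk is no longer healthy; it cannot tell you how long the disk will keep running after the warning. SMART alert thresholds usually leave some margin, so a disk does not fail the instant it reports an error and can generally hold out for a while. Some disks keep running for years after reporting errors, while others become unusable within days, so never count on luck.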
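Because the health verdict is a single line of output, it is easy to test from a script. The sketch below is an illustration only: the function name smart_health_ok is hypothetical, and it assumes the two verdict strings smartctl prints for a healthy disk, "SMART Health Status: OK" for SCSI/SAS disks (as in the output above) and "SMART overall-health self-assessment test result: PASSED" for ATA disks.

```shell
#!/bin/sh
# smart_health_ok (hypothetical helper): read `smartctl -H` output on stdin.
# Returns 0 if a healthy verdict line is present, non-zero otherwise.
smart_health_ok() {
    grep -Eq 'SMART Health Status: OK|SMART overall-health self-assessment test result: PASSED'
}

# Possible usage on a live system (needs root):
#   if ! smartctl -H -d megaraid,0 /dev/sda | smart_health_ok; then
#       echo "disk megaraid,0 is not reporting a healthy status" >&2
#   fi
```

Anything other than a recognised healthy verdict, including a failed command, counts as "not OK", which is the safer default for alerting.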
# Check the hard disk error log:
[root@oadb ~]# smartctl -l error -d megaraid,0 /dev/sda
smartctl 6.4 2015-06-04 r4109 [x86_64-linux-2.6.18-308.el5] (local build)
Copyright (C) 2002-15, Bruce Allen, Christian Franke, www.smartmontools.org
=== START OF READ SMART DATA SECTION ===
Error counter log:
           Errors Corrected by           Total   Correction     Gigabytes    Total
               ECC          rereads/    errors   algorithm      processed    uncorrected
           fast | delayed   rewrites  corrected  invocations   [10^9 bytes]  errors
read:    931486797        0         0  931486797          0      13847.275           0
write:           0        0         0          0          0       5962.098           0
verify: 3042089148        0         0 3042089148          0      18667.773           0
Non-medium error count: 2
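In the error counter log, the last column ("Total uncorrected errors") is usually the one worth watching, since corrected errors are handled by the drive itself. A minimal sketch for extracting it follows; the function name uncorrected_errors is hypothetical, and it assumes the SCSI/SAS log layout shown above, where each row starts with read:, write: or verify: and ends with the uncorrected count.

```shell
#!/bin/sh
# uncorrected_errors (hypothetical helper): read `smartctl -l error` output
# for a SCSI/SAS disk on stdin and print "<operation> <uncorrected errors>".
uncorrected_errors() {
    awk '/^ *(read|write|verify):/ {
        sub(":", "", $1)   # "read:" -> "read"
        print $1, $NF      # last field is the total uncorrected count
    }'
}

# Possible usage on a live system (needs root):
#   smartctl -l error -d megaraid,0 /dev/sda | uncorrected_errors
```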
[root@oadb ~]# smartctl -A -d megaraid,0 /dev/sda
smartctl 6.4 2015-06-04 r4109 [x86_64-linux-2.6.18-308.el5] (local build)
Copyright (C) 2002-15, Bruce Allen, Christian Franke, www.smartmontools.org
=== START OF READ SMART DATA SECTION ===
Current Drive Temperature: 31 C
Drive Trip Temperature: 68 C
Manufactured in week 10 of year 2013
Specified cycle count over device lifetime: 10000
Accumulated start-stop cycles: 26
Elements in grown defect list: 0
Vendor (Seagate) cache information
Blocks sent to initiator = 3753898349
Blocks received from initiator = 3003303195
Blocks read from cache and sent to initiator = 2259987377
Number of read and write commands whose size <= segment size = 40109835
Number of read and write commands whose size > segment size = 190
Vendor (Seagate/Hitachi) factory information
number of hours powered up = 16559.07
number of minutes until next internal SMART test = 19
[root@PBSSES01 ~]# smartctl -a /dev/bus/0 -d megaraid,3
smartctl 6.5 2016-05-07 r4318 [x86_64-linux-2.6.32-573.el6.x86_64] (local build)
Copyright (C) 2002-16, Bruce Allen, Christian Franke, www.smartmontools.org
=== START OF INFORMATION SECTION ===
Vendor: SEAGATE
Product: ST600MP0005
Revision: VT31
Compliance: SPC-4
User Capacity: 600,127,266,816 bytes [600 GB]
Logical block size: 512 bytes
Formatted with type 2 protection
LU is fully provisioned
Rotation Rate: 15000 rpm
Form Factor: 2.5 inches
Logical Unit id: 0x5000c5009599aff7
Serial number: S7M0WHMN
Device type: disk
Transport protocol: SAS (SPL-3)
Local Time is: Thu Apr 20 14:24:01 2017 CST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
Temperature Warning: Disabled or Not Supported
=== START OF READ SMART DATA SECTION ===
SMART Health Status: OK
Current Drive Temperature: 34 C
Drive Trip Temperature: 60 C
Manufactured in week 48 of year 2015
Specified cycle count over device lifetime: 10000
Accumulated start-stop cycles: 23
Specified load-unload count over device lifetime: 300000
Accumulated load-unload cycles: 489
Elements in grown defect list: 0 # Reportedly this is where bad sectors show up: defects that developed after manufacture, commonly called grown defects.
Vendor (Seagate) cache information
Blocks sent to initiator = 290310405
Blocks received from initiator = 583431357
Blocks read from cache and sent to initiator = 1944507
Number of read and write commands whose size <= segment size = 6490035
Number of read and write commands whose size > segment size = 51
Vendor (Seagate/Hitachi) factory information
number of hours powered up = 10990.77
number of minutes until next internal SMART test = 7
Error counter log:
           Errors Corrected by           Total   Correction     Gigabytes    Total
               ECC          rereads/    errors   algorithm      processed    uncorrected
           fast | delayed   rewrites  corrected  invocations   [10^9 bytes]  errors
read:   2472572503        0         0 2472572503          0        728.491           0
write:           0        0         0          0          0        315.816           0
verify:  597797863        0         0  597797863          0      39623.364           0
Non-medium error count: 2 # Non-medium errors: not a fault of the disk media itself, usually a cabling, transfer, or checksum issue; can generally be ignored.
SMART Self-test log
Num  Test              Status                 segment  LifeTime  LBA_first_err [SK ASC ASQ]
     Description                              number   (hours)
# 1  Background short  Completed                  80       3                 - [-   -    -]
# 2  Reserved(7)       Completed                  64       3                 - [-   -    -]
Long (extended) Self Test duration: 3360 seconds [56.0 minutes]
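The two annotated counters in the output above (grown defects and non-medium errors) can be pulled out of a full report for trending or alerting. This is a sketch under assumptions: the function name defect_counters and the key names it prints are hypothetical, and it relies on the "label: value" layout that smartctl uses for SCSI/SAS disks as shown above.

```shell
#!/bin/sh
# defect_counters (hypothetical helper): read `smartctl -a` output for a
# SCSI/SAS disk on stdin and print the two counters as key=value lines.
defect_counters() {
    awk -F': *' '
        /Elements in grown defect list/ { print "grown_defects=" ($2 + 0) }
        /Non-medium error count/        { print "non_medium_errors=" ($2 + 0) }
    '
}

# Possible usage on a live system (needs root):
#   smartctl -a /dev/bus/0 -d megaraid,3 | defect_counters
```

Adding 0 forces a numeric conversion, so any trailing text after the number on the same line is dropped.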
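Using the smartmontools configuration file to run periodic health checks on disks is like giving a person a regular physical: passing the check-up does not mean there is no illness (many diseases are beyond what check-up equipment can detect). This matches what Google's hard disk report describes: only about 60% of all failures can be detected by SMART, so its results should not be relied on completely.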
[root@PBSSES01 ~]# smartctl -l selftest /dev/bus/0 -d megaraid,3
smartctl 6.5 2016-05-07 r4318 [x86_64-linux-2.6.32-573.el6.x86_64] (local build)
Copyright (C) 2002-16, Bruce Allen, Christian Franke, www.smartmontools.org
=== START OF READ SMART DATA SECTION ===
SMART Self-test log
Num  Test              Status                 segment  LifeTime  LBA_first_err [SK ASC ASQ]
     Description                              number   (hours)
# 1  Background short  Completed                  80       3                 - [-   -    -]
# 2  Reserved(7)       Completed                  64       3                 - [-   -    -]
Long (extended) Self Test duration: 3360 seconds [56.0 minutes]
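smartctl is clearly a very handy tool, but running it by hand every time is tedious. Wouldn't it be better if it ran at a set interval and also notified the operations staff of the results? That capability already exists: the smartd background service (daemon) provides it.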
[root@oadb ~]# vim /etc/smartd.conf
# Sample configuration file for smartd. See man smartd.conf.
DEVICESCAN -H -m root
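In the line above, DEVICESCAN tells smartd to monitor every device it can find, -H checks the SMART health status, and -m root mails warnings to root. If you would rather list the disks behind the RAID controller explicitly, the entries can be generated from the scan output. The following is a sketch only: the function name scan_to_smartd_conf is hypothetical, and it simply reuses the same -H -m root directives for each device. Note that smartd ignores any lines that come after a DEVICESCAN line, so explicit entries need to replace it or be placed before it.

```shell
#!/bin/sh
# scan_to_smartd_conf (hypothetical helper): read `smartctl --scan-open`
# output on stdin and print one smartd.conf entry per detected device,
# using the same directives as the DEVICESCAN line (-H -m root).
scan_to_smartd_conf() {
    awk '$2 == "-d" { print $1, "-d", $3, "-H", "-m", "root" }'
}

# Possible usage on a live system (needs root); review the result before
# copying it into /etc/smartd.conf:
#   smartctl --scan-open | scan_to_smartd_conf
```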