@cdmonkey
2017-05-11T02:03:19.000000Z
Nagios
https://github.com/v-zhuravlev/zbx-smartctl
SMART: Self-Monitoring, Analysis and Reporting Technology
Download: http://www.smartmontools.org/wiki/Download
Linux.cn: http://linux.cn/article-4461-weixin.html
http://my.oschina.net/u/877567/blog/307336
Hard disk checking (original post): http://czmmiao.iteye.com/blog/1058215
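Hard disks are the server components most prone to problems and failures, and they hold large amounts of important data, so the loss caused by a failure can be incalculable. At best, data recovery costs a great deal of time and effort; at worst, the disk is a write-off and the important data on it cannot be fully recovered. Checking disk health is therefore especially important. To avoid being caught out by a failed disk, you can use the smartmontools package, which monitors storage hardware through Self-Monitoring, Analysis and Reporting Technology (SMART).
Install the build dependencies: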
yum install gcc gcc-c++
[root@oadb tools]# tar zxvf smartmontools-6.4.tar.gz
[root@oadb tools]# cd smartmontools-6.4
[root@oadb smartmontools-6.4]# ./configure --prefix=/usr/local --without-selinux
make && make install
------------------
# Yum install:
yum install -y smartmontools

# View the version information:
[root@oadb ~]# smartctl -V
smartctl 6.4 2015-06-04 r4109 [x86_64-linux-2.6.18-308.el5] (local build)
Copyright (C) 2002-15, Bruce Allen, Christian Franke, www.smartmontools.org
...
Testing on a server fitted with eight hard disks:
[root@kvm-test ~]# smartctl --scan-open
/dev/sda -d scsi # /dev/sda, SCSI device
/dev/bus/6 -d megaraid,0 # /dev/bus/6 [megaraid_disk_00], SCSI device
/dev/bus/6 -d megaraid,1 # /dev/bus/6 [megaraid_disk_01], SCSI device
/dev/bus/6 -d megaraid,2 # /dev/bus/6 [megaraid_disk_02], SCSI device
/dev/bus/6 -d megaraid,3 # /dev/bus/6 [megaraid_disk_03], SCSI device
/dev/bus/6 -d megaraid,4 # /dev/bus/6 [megaraid_disk_04], SCSI device
/dev/bus/6 -d megaraid,5 # /dev/bus/6 [megaraid_disk_05], SCSI device
/dev/bus/6 -d megaraid,6 # /dev/bus/6 [megaraid_disk_06], SCSI device
/dev/bus/6 -d megaraid,7 # /dev/bus/6 [megaraid_disk_07], SCSI device
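Here, "megaraid" is presumably the name of the RAID controller chip,
and the 0 in megaraid,0 is the number of the physical disk behind the "megaraid" controller.
Get the SMART information for the device: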
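The scan output above can be turned into a list of device/type pairs for later commands. The following is a minimal sketch, not part of the original notes: the function name `list_megaraid_disks` is hypothetical, and it assumes the scan lines keep the `<device> -d megaraid,N # ...` layout shown above.

```shell
# list_megaraid_disks (hypothetical helper): read `smartctl --scan-open`
# output on stdin and print one "<device> <type>" pair for every physical
# disk reported behind a MegaRAID controller. Other device types are skipped.
list_megaraid_disks() {
    awk '$2 == "-d" && $3 ~ /^megaraid,[0-9]+$/ { print $1, $3 }'
}

# Example (needs root and smartmontools installed):
#   smartctl --scan-open | list_megaraid_disks
```

Each output line can then be passed back to smartctl as `smartctl -i -d <type> <device>`.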
[root@kvm-test ~]# smartctl --all /dev/sda
smartctl 6.5 2016-05-07 r4318 [x86_64-linux-2.6.32-573.el6.x86_64] (local build)
Copyright (C) 2002-16, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Vendor: DELL
Product: PERC H710
Revision: 3.13
User Capacity: 2,096,078,258,176 bytes [2.09 TB]
Logical block size: 512 bytes
Logical Unit id: 0x6c81f660e79791001ec4a6da05d41165
Serial number: 006511d405daa6c41e009197e760f681
Device type: disk
Local Time is: Wed Apr 19 13:50:19 2017 CST
SMART support is: Unavailable - device lacks SMART capability.

=== START OF READ SMART DATA SECTION ===
Current Drive Temperature: 0 C
Drive Trip Temperature: 0 C

Error Counter logging not supported

Device does not support Self Test logging
As you can see, the RAID controller itself does not support SMART. Now look at an individual physical disk:
[root@kvm-test ~]# smartctl -i -d megaraid,0 /dev/sda
smartctl 6.5 2016-05-07 r4318 [x86_64-linux-2.6.32-573.el6.x86_64] (local build)
Copyright (C) 2002-16, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Vendor: SEAGATE
Product: ST9300653SS
Revision: YS0A
Compliance: SPC-4
User Capacity: 300,000,000,000 bytes [300 GB]
Logical block size: 512 bytes
Rotation Rate: 15000 rpm
Form Factor: 2.5 inches
Logical Unit id: 0x5000c500716ff47f
Serial number: 6XN5SGF1
Device type: disk
Transport protocol: SAS (SPL-3)
Local Time is: Wed Apr 19 14:49:20 2017 CST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
Temperature Warning: Disabled or Not Supported
As you can see, the physical disk does support SMART.
[root@oadb ~]# smartctl -i /dev/sda
smartctl 6.4 2015-06-04 r4109 [x86_64-linux-2.6.18-308.el5] (local build)
Copyright (C) 2002-15, Bruce Allen, Christian Franke, www.smartmontools.org

Smartctl open device: /dev/sda failed: DELL or MegaRaid controller, please try adding '-d megaraid,N'

# View disk information:
[root@oadb ~]# smartctl -i -d megaraid,0 /dev/sda
smartctl 6.4 2015-06-04 r4109 [x86_64-linux-2.6.18-308.el5] (local build)
Copyright (C) 2002-15, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Vendor: SEAGATE
Product: ST9300653SS
Revision: YS09
Compliance: SPC-4
User Capacity: 300,000,000,000 bytes [300 GB]
Logical block size: 512 bytes
Rotation Rate: 15000 rpm
Form Factor: 2.5 inches
Logical Unit id: 0x5000c5005fb0745b
Serial number: 6XN33ER0
Device type: disk
Transport protocol: SAS (SPL-3)
Local Time is: Thu Oct 22 19:02:55 2015 CST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
Temperature Warning: Disabled or Not Supported
Check the health status of the device:
# Check the hard disk health status:
[root@oadb ~]# smartctl -H /dev/sda -d megaraid,0
smartctl 6.4 2015-06-04 r4109 [x86_64-linux-2.6.18-308.el5] (local build)
Copyright (C) 2002-15, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF READ SMART DATA SECTION ===
SMART Health Status: OK
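With eight disks behind the controller, the same health query has to be repeated per disk. The sketch below shows one way to loop over them; both function names are hypothetical, and the disk numbering 0..COUNT-1 is an assumption based on the scan output earlier in these notes.

```shell
# health_status (hypothetical helper): read `smartctl -H` output on stdin
# and print only the verdict. SAS/SCSI disks report a line starting with
# "SMART Health Status:", ATA disks one starting with
# "SMART overall-health self-assessment test result:".
health_status() {
    sed -n \
        -e 's/^SMART Health Status: *//p' \
        -e 's/^SMART overall-health self-assessment test result: *//p'
}

# check_all_disks DEV COUNT (hypothetical helper): query disks
# megaraid,0 .. megaraid,COUNT-1 behind DEV and print one line per disk.
# Assumes the disks are numbered contiguously from 0.
check_all_disks() {
    dev=$1
    count=$2
    n=0
    while [ "$n" -lt "$count" ]; do
        status=$(smartctl -H -d "megaraid,$n" "$dev" | health_status)
        echo "megaraid,$n: ${status:-UNKNOWN}"
        n=$((n + 1))
    done
}

# Example (needs root): check_all_disks /dev/sda 8
```

A disk for which no verdict line is found is printed as UNKNOWN, so a failed query does not pass silently as healthy.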
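SMART can only report that a disk is no longer healthy; it cannot tell you how long the disk will keep running after the warning. SMART warning thresholds usually leave some margin, so a disk does not normally fail the moment it reports errors and can generally hold out for a while. Some disks have kept running for years after reporting errors, while others became unusable within days, so never count on luck.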
# Check the hard disk error log:
[root@oadb ~]# smartctl -l error -d megaraid,0 /dev/sda
smartctl 6.4 2015-06-04 r4109 [x86_64-linux-2.6.18-308.el5] (local build)
Copyright (C) 2002-15, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF READ SMART DATA SECTION ===
Error counter log:
           Errors Corrected by           Total   Correction     Gigabytes    Total
               ECC          rereads/    errors   algorithm      processed    uncorrected
           fast | delayed   rewrites  corrected  invocations   [10^9 bytes]  errors
read:   931486797        0         0  931486797          0      13847.275           0
write:          0        0         0          0          0       5962.098           0
verify: 3042089148        0         0  3042089148          0      18667.773           0

Non-medium error count:        2
[root@oadb ~]# smartctl -A -d megaraid,0 /dev/sda
smartctl 6.4 2015-06-04 r4109 [x86_64-linux-2.6.18-308.el5] (local build)
Copyright (C) 2002-15, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF READ SMART DATA SECTION ===
Current Drive Temperature: 31 C
Drive Trip Temperature: 68 C

Manufactured in week 10 of year 2013
Specified cycle count over device lifetime: 10000
Accumulated start-stop cycles: 26
Elements in grown defect list: 0

Vendor (Seagate) cache information
  Blocks sent to initiator = 3753898349
  Blocks received from initiator = 3003303195
  Blocks read from cache and sent to initiator = 2259987377
  Number of read and write commands whose size <= segment size = 40109835
  Number of read and write commands whose size > segment size = 190

Vendor (Seagate/Hitachi) factory information
  number of hours powered up = 16559.07
  number of minutes until next internal SMART test = 19
[root@PBSSES01 ~]# smartctl -a /dev/bus/0 -d megaraid,3
smartctl 6.5 2016-05-07 r4318 [x86_64-linux-2.6.32-573.el6.x86_64] (local build)
Copyright (C) 2002-16, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Vendor: SEAGATE
Product: ST600MP0005
Revision: VT31
Compliance: SPC-4
User Capacity: 600,127,266,816 bytes [600 GB]
Logical block size: 512 bytes
Formatted with type 2 protection
LU is fully provisioned
Rotation Rate: 15000 rpm
Form Factor: 2.5 inches
Logical Unit id: 0x5000c5009599aff7
Serial number: S7M0WHMN
Device type: disk
Transport protocol: SAS (SPL-3)
Local Time is: Thu Apr 20 14:24:01 2017 CST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
Temperature Warning: Disabled or Not Supported

=== START OF READ SMART DATA SECTION ===
SMART Health Status: OK

Current Drive Temperature: 34 C
Drive Trip Temperature: 60 C

Manufactured in week 48 of year 2015
Specified cycle count over device lifetime: 10000
Accumulated start-stop cycles: 23
Specified load-unload count over device lifetime: 300000
Accumulated load-unload cycles: 489
Elements in grown defect list: 0 # Reportedly this is where bad sectors show up: defects that developed in use, commonly called grown defects.

Vendor (Seagate) cache information
  Blocks sent to initiator = 290310405
  Blocks received from initiator = 583431357
  Blocks read from cache and sent to initiator = 1944507
  Number of read and write commands whose size <= segment size = 6490035
  Number of read and write commands whose size > segment size = 51

Vendor (Seagate/Hitachi) factory information
  number of hours powered up = 10990.77
  number of minutes until next internal SMART test = 7

Error counter log:
           Errors Corrected by           Total   Correction     Gigabytes    Total
               ECC          rereads/    errors   algorithm      processed    uncorrected
           fast | delayed   rewrites  corrected  invocations   [10^9 bytes]  errors
read:   2472572503        0         0  2472572503          0        728.491           0
write:          0        0         0          0          0        315.816           0
verify: 597797863        0         0  597797863          0      39623.364           0

Non-medium error count:        2 # Non-medium errors: not a problem with the disk media itself, usually cabling, transport, or checksum issues; can generally be ignored.

SMART Self-test log
Num  Test              Status                 segment  LifeTime  LBA_first_err [SK ASC ASQ]
     Description                              number   (hours)
# 1  Background short  Completed                  80       3                 - [-   -    -]
# 2  Reserved(7)       Completed                  64       3                 - [-   -    -]

Long (extended) Self Test duration: 3360 seconds [56.0 minutes]
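Of everything in the full report, the grown defect list, the uncorrected error totals, and the non-medium error count are the values most worth tracking over time. The following sketch pulls them out as key=value lines; the function name `scsi_counters` and the key names are hypothetical, and it assumes the SAS/SCSI report layout shown above (ATA disks use a different attribute table).

```shell
# scsi_counters (hypothetical helper): read `smartctl -a` output for a
# SAS/SCSI disk on stdin and print selected counters as key=value lines.
# For the read/write/verify rows, the last column is "Total uncorrected errors".
scsi_counters() {
    awk '
        /^Elements in grown defect list:/ { print "grown_defects=" $NF }
        /^Non-medium error count:/        { print "non_medium_errors=" $NF }
        /^(read|write|verify):/ {
            name = $1
            sub(/:$/, "", name)           # "read:" -> "read"
            print name "_uncorrected=" $NF
        }
    '
}

# Example (needs root):
#   smartctl -a -d megaraid,3 /dev/bus/0 | scsi_counters
```

The key=value form is easy to feed to a monitoring system such as Nagios or Zabbix, which can then alert when a counter rises above zero.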
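Modifying the smartmontools configuration file to check disk health periodically is like giving a person regular physical exams: passing the exam does not mean there is no illness (many diseases cannot be found by the exam equipment). This matches what Google's hard disk report describes: only about 60% of all failures can be detected by SMART, so its results cannot be relied on completely.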
[root@PBSSES01 ~]# smartctl -l selftest /dev/bus/0 -d megaraid,3
smartctl 6.5 2016-05-07 r4318 [x86_64-linux-2.6.32-573.el6.x86_64] (local build)
Copyright (C) 2002-16, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF READ SMART DATA SECTION ===
SMART Self-test log
Num  Test              Status                 segment  LifeTime  LBA_first_err [SK ASC ASQ]
     Description                              number   (hours)
# 1  Background short  Completed                  80       3                 - [-   -    -]
# 2  Reserved(7)       Completed                  64       3                 - [-   -    -]

Long (extended) Self Test duration: 3360 seconds [56.0 minutes]
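A self-test can be started by hand with `smartctl -t short` (or `-t long`) plus the same `-d megaraid,N` option, and the log above then shows its result. As a rough sketch of how that log might be checked automatically, the hypothetical function below prints every log entry whose status is not "Completed"; it assumes the entry layout shown above and will also list tests that are still in progress.

```shell
# selftest_not_completed (hypothetical helper): read
# `smartctl -l selftest` output on stdin and print each log entry
# (lines starting with "# <number>") that is not marked "Completed".
# Prints nothing when every logged test completed.
selftest_not_completed() {
    awk '/^# *[0-9]+ / && $0 !~ / Completed / { print }'
}

# Example (needs root):
#   smartctl -t short -d megaraid,3 /dev/bus/0       # start a short self-test
#   smartctl -l selftest -d megaraid,3 /dev/bus/0 | selftest_not_completed
```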
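smartctl looks like a very good tool, but running it by hand every time is tedious. Wouldn't it be better if it ran at a set interval and notified the operations staff of the results? That feature already exists: the smartd daemon provides it.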
[root@oadb ~]# vim /etc/smartd.conf
# Sample configuration file for smartd. See man smartd.conf.
DEVICESCAN -H -m root
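The DEVICESCAN line lets smartd find devices on its own. If you would rather list the disks behind the RAID controller explicitly, smartd.conf also accepts one line per device with the same `-d megaraid,N` option used with smartctl. The sketch below generates such lines; the function name `gen_smartd_conf` is hypothetical, and the device path, disk count, and mail recipient are assumptions to adapt to your own server.

```shell
# gen_smartd_conf DEV COUNT MAILTO (hypothetical helper): print one
# smartd.conf line per physical disk megaraid,0 .. megaraid,COUNT-1.
# Each line uses the same directives as the DEVICESCAN example:
# -H checks the health status, -m mails the given recipient on trouble.
gen_smartd_conf() {
    dev=$1
    count=$2
    mailto=$3
    n=0
    while [ "$n" -lt "$count" ]; do
        printf '%s -d megaraid,%d -H -m %s\n' "$dev" "$n" "$mailto"
        n=$((n + 1))
    done
}

# Example: review the output before putting it into /etc/smartd.conf
#   gen_smartd_conf /dev/sda 8 root
```

When explicit device lines are used, the DEVICESCAN line should be commented out or moved after them, since smartd ignores everything that follows DEVICESCAN in the file.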