
  Linux system monitoring and diagnostic tools: IO wait
  Add Date : 2017-08-31      
  Table of Contents

1, Issue:
2, Investigation:
2.1 vmstat
2.2 iostat
2.3 iotop
3, Final words: another way

1, Issue:

Recently I was working on real-time log synchronization. Before going live, we stress-tested a single online log stream, and the message queue, the client, and the machine all held up without problems. But once a second log was added, a problem appeared:

On one machine in the cluster, top showed an extremely high load. All machines in the cluster have the same hardware configuration and the same software deployment, yet only this one machine had a load problem, so the initial guess was a hardware fault.

At the same time, we still needed to track down the culprit behind the abnormal load, and then look for a solution at both the software level and the hardware level.

2, Investigation:

From top you can see that the load average is high, %wa is high, and %us is low.

From this we can roughly infer that we have hit an IO bottleneck. Next, we can use the relevant IO diagnostic tools below to verify and investigate in detail.
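
The %us and %wa values that top shows can also be derived by hand from the aggregate cpu line in /proc/stat. The following is a minimal sketch, not the way top itself is implemented: the function name cpu_breakdown is hypothetical, and it assumes the standard field order of that line (user, nice, system, idle, iowait, irq, softirq, steal).

```shell
# cpu_breakdown: read two aggregate "cpu ..." lines from /proc/stat on stdin
# (an earlier and a later sample) and print the share of time spent in
# user mode (us) and waiting for IO (wa) between the two samples.
cpu_breakdown() {
    awk '
    $1 == "cpu" {
        n++
        total[n] = 0
        # Fields 2..9: user nice system idle iowait irq softirq steal.
        # guest and guest_nice (fields 10, 11) are already counted in user/nice.
        for (i = 2; i <= 9 && i <= NF; i++) total[n] += $i
        user[n] = $2
        iowait[n] = $6
    }
    END {
        if (n < 2) { print "need two cpu samples"; exit 1 }
        dt = total[2] - total[1]
        if (dt <= 0) { print "no time elapsed between samples"; exit 1 }
        printf "us=%.1f wa=%.1f\n", 100 * (user[2] - user[1]) / dt, 100 * (iowait[2] - iowait[1]) / dt
    }'
}

# Possible usage on a live system (5-second window):
#   { grep '^cpu ' /proc/stat; sleep 5; grep '^cpu ' /proc/stat; } | cpu_breakdown
```

A high wa together with a low us over the sampled window is the same pattern described above.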

PS: If you are not familiar with top, see: Linux system monitoring and diagnostic tools: the top command in detail

Commonly used tool combinations fall into the following categories:

Use vmstat, sar, and iostat to detect a CPU bottleneck
Use free and vmstat to detect a memory bottleneck
Use iostat and dmesg to detect a disk I/O bottleneck
Use netstat to detect a network bandwidth bottleneck
2.1 vmstat

The name vmstat stands for "Virtual Memory Statistics", and the command reports virtual memory status, but it can also report on the overall running state of the system, including processes, memory, and I/O.

Its related fields are as follows:

Procs (processes)

r: number of processes in the run queue, that is, tasks currently running or waiting for CPU. This value can also indicate whether more CPU is needed: when it stays above the number of CPUs for a long time, there is a CPU bottleneck.
b: number of processes waiting for IO, that is, processes in uninterruptible sleep.
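
The kernel exposes the same two counters directly in /proc/stat as procs_running and procs_blocked. As a minimal sketch (the function name procs_pressure and the yes/no output format are hypothetical), a single snapshot can be checked like this:

```shell
# procs_pressure NCPU: read /proc/stat on stdin and report the kernel's
# procs_running (vmstat's r) and procs_blocked (vmstat's b) counters.
procs_pressure() {
    awk -v ncpu="$1" '
    $1 == "procs_running" { r = $2 }
    $1 == "procs_blocked" { b = $2 }
    END {
        # cpu_queue: more runnable tasks than CPUs in this snapshot.
        # io_blocked: at least one task is in uninterruptible sleep.
        printf "r=%d b=%d cpu_queue=%s io_blocked=%s\n", r, b,
               ((r + 0) > (ncpu + 0) ? "yes" : "no"), ((b + 0) > 0 ? "yes" : "no")
    }'
}

# Possible usage:
#   procs_pressure "$(nproc)" < /proc/stat
```

This is one instant only, and procs_running includes the process doing the reading, so the check is worth repeating over time before drawing a conclusion.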
Memory (RAM)

swpd: amount of virtual memory (swap) in use. If swpd is not 0 but si and so stay at 0 over the long term, system performance is not affected.
free: amount of free physical memory.
buff: amount of memory used as buffers.
cache: amount of memory used as cache. A large cache value means many files are cached; if frequently accessed files can be served from the cache, the disk read IO (bi) will be very small.
Swap (swap)

si: amount of memory swapped in from disk per second.
so: amount of memory swapped out to disk per second.
Note: When memory is sufficient, both values are 0. If they stay above 0 for a long time, system performance suffers, because swapping consumes disk IO and CPU resources. Some people see that free memory (free) is low or close to 0 and conclude that memory is insufficient. Do not look at free alone; check si and so as well. If free is low but si and so are also low (0 most of the time), there is no need to worry, and system performance is not affected.
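
The swap rates can also be computed from the pswpin and pswpout page counters in /proc/vmstat. The sketch below is an illustration under assumptions: the function name swap_rate is hypothetical, and the page size defaults to 4096 bytes unless a second argument is given.

```shell
# swap_rate SECONDS [PAGE_BYTES]: read two /proc/vmstat snapshots
# (concatenated, older first) on stdin and print swap-in and swap-out
# in KB/s, the same quantities vmstat reports as si and so.
swap_rate() {
    awk -v secs="$1" -v page="${2:-4096}" '
    $1 == "pswpin"  { in_n++;  pin[in_n]   = $2 }
    $1 == "pswpout" { out_n++; pout[out_n] = $2 }
    END {
        if (in_n < 2 || out_n < 2 || secs <= 0) {
            print "need two snapshots and a positive interval"; exit 1
        }
        kb = page / 1024      # kilobytes per page
        printf "si=%.1f so=%.1f\n", (pin[2] - pin[1]) * kb / secs, (pout[2] - pout[1]) * kb / secs
    }'
}

# Possible usage (5-second window, real page size from getconf):
#   { cat /proc/vmstat; sleep 5; cat /proc/vmstat; } | swap_rate 5 "$(getconf PAGESIZE)"
```

If both values stay at 0.0 across repeated runs, a low free value on its own is not a sign of memory shortage.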

IO (Input Output)

(On current Linux versions the block size is 1 KB)

bi: number of blocks read per second
bo: number of blocks written per second
Note: With random disk reads and writes, the larger these two values are (for example, above 1024k), the higher the CPU's IO wait value will be.

system (system)

in: number of interrupts per second, including clock interrupts.
cs: number of context switches per second.
Note: The larger these two values are, the more CPU time you will see consumed by the kernel.

cpu (CPU)

(Expressed as a percentage)

us: percentage of time spent running user processes (user time). A high us value means user processes consume a lot of CPU time; if it stays above 50% over the long term, consider optimizing the program or its algorithms.
sy: percentage of time spent running kernel code (system time). A high sy value means the kernel is consuming a lot of CPU resources, which is not a healthy sign; check the cause.
wa: percentage of time spent waiting for IO. A high wa value means IO wait is serious. It may be caused by heavy random disk access, or by a disk bottleneck (block operations).
id: percentage of idle time.

From vmstat you can see that the CPU spends most of its time waiting for IO, probably because of heavy random disk access or a disk bandwidth limit. bi and bo both exceed 1024k, so we have most likely hit an IO bottleneck.
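
The same reading of vmstat output can be automated. The sketch below averages us, wa, bi, and bo over all sampled rows and locates the columns by header name, since the column set differs slightly between vmstat versions. The function name vmstat_io_summary, the verdict labels, and the default 30% threshold are assumptions made for illustration, not values from the investigation above.

```shell
# vmstat_io_summary [WA_LIMIT]: read vmstat output on stdin, average the
# us, wa, bi and bo columns, and flag the run as io-bound when the average
# wa is at least WA_LIMIT percent (default 30) and also exceeds us.
vmstat_io_summary() {
    awk -v limit="${1:-30}" '
    # Header row names the columns; remember each column position.
    $1 == "r" && $2 == "b" { for (i = 1; i <= NF; i++) col[$i] = i; next }
    # Data rows start with a number; the "procs ---memory---" banner does not.
    ("wa" in col) && $1 ~ /^[0-9]+$/ {
        n++
        us += $(col["us"]); wa += $(col["wa"])
        bi += $(col["bi"]); bo += $(col["bo"])
    }
    END {
        if (n == 0) { print "no vmstat samples found"; exit 1 }
        us /= n; wa /= n; bi /= n; bo /= n
        verdict = (wa >= limit + 0 && wa > us) ? "io-bound" : "not-io-bound"
        printf "us=%.1f wa=%.1f bi=%.1f bo=%.1f verdict=%s\n", us, wa, bi, bo, verdict
    }'
}

# Possible usage:
#   vmstat 1 10 | vmstat_io_summary 30
```

Note that the first row vmstat prints is an average since boot, so a longer run gives a more representative result.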

2.2 iostat

Next, let's look at the statistics from a more specialized disk IO diagnostic tool.

Its related fields are as follows:

rrqm/s: number of read requests merged per second. That is, delta(rmerge)/s
wrqm/s: number of write requests merged per second. That is, delta(wmerge)/s
r/s: number of read I/O requests completed per second. That is, delta(rio)/s
w/s: number of write I/O requests completed per second. That is, delta(wio)/s
rsec/s: number of sectors read per second. That is, delta(rsect)/s
wsec/s: number of sectors written per second. That is, delta(wsect)/s
rkB/s: kilobytes read per second. Half of rsec/s, because each sector is 512 bytes. (Computed value)
wkB/s: kilobytes written per second. Half of wsec/s. (Computed value)
avgrq-sz: average size (in sectors) of each device I/O operation. That is, delta(rsect + wsect)/delta(rio + wio)
avgqu-sz: average I/O queue length. That is, delta(aveq)/s/1000 (because aveq is in milliseconds).
await: average wait time (in milliseconds) of each device I/O operation. That is, delta(ruse + wuse)/delta(rio + wio)
svctm: average service time (in milliseconds) of each device I/O operation. That is, delta(use)/delta(rio + wio)
%util: percentage of each second spent on I/O operations, in other words the fraction of each second during which the I/O queue is not empty. That is, delta(use)/s/1000 (because use is in milliseconds)
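
The formulas above can be tried directly against /proc/diskstats, which is where the rio, wio, rsect, wsect, ruse, wuse, use, and aveq counters come from. This is a minimal sketch rather than iostat's own code: the function name disk_metrics is hypothetical, and the field positions assume the documented /proc/diskstats layout (device name in field 3, counters from field 4 onward).

```shell
# disk_metrics SECONDS: read two /proc/diskstats lines for the same device
# (older first) on stdin and print iostat-style values from their deltas.
disk_metrics() {
    awk -v secs="$1" '
    {
        n++
        rio[n] = $4;  rsect[n] = $6;  ruse[n] = $7     # reads: count, sectors, ms
        wio[n] = $8;  wsect[n] = $10; wuse[n] = $11    # writes: count, sectors, ms
        use[n] = $13; aveq[n] = $14                    # busy ms, weighted queue ms
    }
    END {
        if (n != 2 || secs <= 0) { print "need exactly two samples and a positive interval"; exit 1 }
        d_rio = rio[2] - rio[1];     d_wio = wio[2] - wio[1]
        d_rsect = rsect[2] - rsect[1]; d_wsect = wsect[2] - wsect[1]
        d_wait = (ruse[2] - ruse[1]) + (wuse[2] - wuse[1])
        d_use = use[2] - use[1];     d_aveq = aveq[2] - aveq[1]
        ios = d_rio + d_wio
        # Per-request values are undefined when no request completed; print 0.
        rqsz  = ios ? (d_rsect + d_wsect) / ios : 0
        await = ios ? d_wait / ios : 0
        svctm = ios ? d_use / ios : 0
        printf "r/s=%.1f w/s=%.1f rkB/s=%.1f wkB/s=%.1f\n",
               d_rio / secs, d_wio / secs, d_rsect / 2 / secs, d_wsect / 2 / secs
        printf "avgrq-sz=%.1f avgqu-sz=%.2f await=%.1f svctm=%.1f util=%.1f%%\n",
               rqsz, d_aveq / secs / 1000, await, svctm, 100 * d_use / secs / 1000
    }'
}

# Possible usage for sdb over a 10-second window:
#   { grep -w sdb /proc/diskstats; sleep 10; grep -w sdb /proc/diskstats; } | disk_metrics 10
```

The util value is the %util formula multiplied by 100 so that it prints as a percentage.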
You can see that of the two hard disks, sdb's utilization has reached 100%, which is a serious IO bottleneck. The next step is to find out which process is reading and writing data on that disk.
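
Spotting the saturated disk can be scripted by filtering the extended iostat output on its %util column. A minimal sketch follows; the function name busy_disks and the default 90% threshold are assumptions. The column is located by its header name because the set of columns differs between sysstat versions.

```shell
# busy_disks [LIMIT]: read `iostat -dx` output on stdin and print every
# device whose %util is at least LIMIT percent (default 90).
busy_disks() {
    awk -v limit="${1:-90}" '
    # Header row: starts with "Device" (older sysstat prints "Device:").
    $1 ~ /^Device:?$/ {
        util = 0
        for (i = 1; i <= NF; i++) if ($i == "%util") util = i
        next
    }
    # Device rows: compare the %util field numerically against the limit.
    util && NF >= util && ($util + 0) >= (limit + 0) { print $1, $util }'
}

# Possible usage (LC_ALL=C keeps the decimal separator a dot):
#   LC_ALL=C iostat -dx 1 3 | busy_disks 90
```

A device that shows up in every interval, as sdb would here, is the one to investigate further.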

2.3 iotop

From the iotop results, we quickly traced the problem to the flume process, which was causing a large amount of IO wait.
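
Where iotop is not installed, a rough ranking can be built from the per-process /proc/<pid>/io files. This is only a sketch under assumptions: the function names io_bytes and top_io_procs are hypothetical, the numbers are cumulative since each process started (iotop shows current rates instead), and reading other users' io files normally requires root.

```shell
# io_bytes: read the contents of a /proc/<pid>/io file on stdin and print
# read_bytes + write_bytes, the bytes that actually reached the block layer.
io_bytes() {
    awk '$1 == "read_bytes:" || $1 == "write_bytes:" { sum += $2 }
         END { printf "%.0f\n", sum }'
}

# top_io_procs [N]: print "bytes pid command" for the N processes with the
# largest cumulative disk IO (default 5).
top_io_procs() {
    for dir in /proc/[0-9]*; do
        # Skip processes whose io file we are not allowed to read.
        [ -r "$dir/io" ] || continue
        # A process may exit between the check and the read; skip it then.
        bytes=$(io_bytes 2>/dev/null < "$dir/io") || continue
        printf '%s %s %s\n' "$bytes" "${dir#/proc/}" "$(cat "$dir/comm" 2>/dev/null)"
    done | sort -rn | head -n "${1:-5}"
}

# Possible usage (as root):
#   top_io_procs 5
```

Running it twice a few seconds apart and comparing the totals gives an approximation of the live view iotop provides.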

But as I said at the beginning, the machines in the cluster have the same configuration, and the deployment was copied over identically with rsync. Could the hard drive be faulty?

We asked our operations colleagues to verify this, and the final conclusion was:

sdb is a two-disk RAID 1 array on an "LSI Logic / Symbios Logic SAS1068E" RAID card, which has no cache. At nearly 400 IOPS, the load has reached the hardware limit. The other machines use an "LSI Logic / Symbios Logic MegaRAID SAS 1078" RAID card with a 256MB cache and have not hit a hardware bottleneck. The solution is to switch to a machine that can provide more IOPS; in the end we moved to a machine with a PERC6/i integrated RAID controller card. Note that the RAID information is stored both on the RAID card and in the disk firmware. The RAID information on the disk must match the format used by the RAID card; otherwise the card cannot recognize the disk and the disk has to be formatted.

IOPS essentially depends on the disk itself, but there are many ways to improve it; adding a hardware cache and using RAID arrays are common methods. For scenarios that need high IOPS, such as databases, it is now popular to replace traditional mechanical hard drives with SSDs.
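
To get a feel for why a mechanical disk tops out at a few hundred IOPS, a common rule of thumb estimates random IOPS as 1000 / (average seek time + average rotational latency), both in milliseconds, where the rotational latency is the time for half a revolution. The sketch below only illustrates that rule of thumb: the function name disk_iops is hypothetical, and the seek times in the usage lines are illustrative assumptions, not measurements from the machines discussed here.

```shell
# disk_iops RPM SEEK_MS: rule-of-thumb random IOPS for one mechanical disk,
# ignoring caches, queueing and transfer time.
disk_iops() {
    awk -v rpm="$1" -v seek="$2" 'BEGIN {
        if (rpm <= 0 || seek < 0) { print "rpm must be positive and seek time non-negative"; exit 1 }
        latency = 60000 / rpm / 2          # half a rotation, in milliseconds
        printf "%.0f\n", 1000 / (seek + latency)
    }'
}

# Possible usage (seek times are assumed, typical-looking values):
#   disk_iops 7200 8.5
#   disk_iops 15000 3.5
```

Estimates like these explain why a controller cache or an SSD changes the picture so much: both avoid paying the seek and rotation cost on every request.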

But as mentioned earlier, the point of approaching the problem from both the hardware side and the software side is to see whether each can offer a less costly solution.

Now that we know the hardware cause, we can try moving the read and write operations to another disk, and then look at the result.

3, Final words: another way

In fact, besides locating the problem with the specialized tools above, we can also find the relevant processes directly from their process states.

We know that a process can be in one of the following states:

D uninterruptible sleep (usually IO)
R running or runnable (on run queue)
S interruptible sleep (waiting for an event to complete)
T stopped, either by a job control signal or because it is being traced.
W paging (not valid since the 2.6.xx kernel)
X dead (should never be seen)
Z defunct ( "zombie") process, terminated but not reaped by its parent.
Of these, state D, the so-called "uninterruptible sleep", is generally caused by waiting for IO. We can start from this point and then locate the problem step by step:

# for x in `seq 10`; do ps -eo state,pid,cmd | grep "^D"; echo "----"; sleep 5; done
D 248 [jbd2/dm-0-8]
D 16528 bonnie++ -n 0 -u 0 -r 239 -s 478 -f -b -d /tmp
D 22 [kdmflush]
D 16528 bonnie++ -n 0 -u 0 -r 239 -s 478 -f -b -d /tmp

# or:
# while true; do date; ps auxf | awk '{if ($8 == "D") print $0;}'; sleep 1; done
Tue Aug 23 20:03:54 CLT 2011
root 302 0.0 0.0 0 0 ? D May22 2:58 \_ [kdmflush]
root 321 0.0 0.0 0 0 ? D May22 4:11 \_ [jbd2/dm-0-8]
Tue Aug 23 20:03:55 CLT 2011
Tue Aug 23 20:03:56 CLT 2011

# cat /proc/16528/io
rchar: 48752567
wchar: 549961789
syscr: 5967
syscw: 67138
read_bytes: 49020928
write_bytes: 549961728
cancelled_write_bytes: 0

# lsof -p 16528
bonnie++ 16528 root cwd DIR 252,0 4096 130597 /tmp
<truncated>
bonnie++ 16528 root 8u REG 252,0 501219328 131869 /tmp/Bonnie.16528
bonnie++ 16528 root 9u REG 252,0 501219328 131869 /tmp/Bonnie.16528
bonnie++ 16528 root 10u REG 252,0 501219328 131869 /tmp/Bonnie.16528
bonnie++ 16528 root 11u REG 252,0 501219328 131869 /tmp/Bonnie.16528
bonnie++ 16528 root 12u REG 252,0 501219328 131869 /tmp/Bonnie.16528

# df /tmp
Filesystem 1K-blocks Used Available Use% Mounted on
/dev/mapper/workstation-root 7667140 2628608 4653920 37% /

# fuser -vm /tmp
/tmp: db2fenc1 1067 ....m db2fmp
db2fenc1 1071 ....m db2fmp
db2fenc1 2560 .... m db2fmp
db2fenc1 5221 ....m db2fmp
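
When the sampling loop runs for a while, it helps to tally how often each process was caught in state D, since a process that is blocked in almost every sample is the likeliest culprit. A minimal sketch (the function name dstate_tally is hypothetical) that consumes the output of repeated ps -eo state,pid,cmd runs:

```shell
# dstate_tally: read the output of one or more `ps -eo state,pid,cmd` runs
# on stdin and print "count pid command" for every process seen in state D,
# most frequently blocked first.
dstate_tally() {
    awk '$1 ~ /^D/ {
        pid = $2
        $1 = ""; $2 = ""          # drop state and pid, keep the command line
        sub(/^ +/, "")            # remove the blanks left by the two fields
        seen[pid " " $0]++
    }
    END { for (k in seen) print seen[k], k }' | sort -k1,1nr -k2,2n
}

# Possible usage (10 samples, 5 seconds apart):
#   for x in $(seq 10); do ps -eo state,pid,cmd; sleep 5; done | dstate_tally
```

With the two samples shown above, bonnie++ (PID 16528) would be counted twice, while the kernel threads would appear once each.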
  CopyRight 2002-2016 newfreesoft.com, All Rights Reserved.