If the load on your Linux server suddenly surges and alerts start flooding your phone, how do you track down the performance problem in the shortest possible time? This blog post from the Netflix Performance Engineering Team shows how they diagnose a machine's performance in one minute using about ten commands.
By running the following commands, you can get a general picture of system resource usage within one minute.
uptime
dmesg | tail
vmstat 1
mpstat -P ALL 1
pidstat 1
iostat -xz 1
free -m
sar -n DEV 1
sar -n TCP,ETCP 1
top
Some of these commands require the sysstat package to be installed; others are provided by the procps package. Their output helps you quickly locate performance bottlenecks by checking the utilization, saturation, and error metrics of every resource (CPU, memory, disk I/O, and so on), an approach known as the USE method.
The sections below introduce each command in turn. For more on their options and usage, refer to each command's man page.
uptime
$ uptime
 23:51:26 up 21:31,  1 user,  load average: 30.02, 26.43, 19.02
This command gives a quick view of the machine's load. On Linux, these numbers count the processes running on or waiting for a CPU, plus the processes blocked in uninterruptible I/O (process state D). They give a high-level picture of how heavily the system's resources are being used.
The three numbers in the output are the load averages over the last 1, 5, and 15 minutes. Comparing them shows whether load is rising or easing. If the 1-minute average is high and the 15-minute average is low, the server is under high load right now, and you need to investigate where the CPU resources are going. Conversely, if the 15-minute average is high and the 1-minute average is low, the period of CPU pressure may already have passed.
In the example above, the most recent 1-minute load average is very high and well above the 15-minute average, so we need to keep investigating which processes are currently consuming so many resources. The vmstat and mpstat commands described below can be used for that.
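The comparison described above can be sketched in a few lines of Python. This is only an illustration: the function names and the ratio threshold of 1.2 are assumptions made for this example, not part of uptime or of the original post.

```python
import re


def parse_load_averages(uptime_line):
    """Extract the 1-, 5-, and 15-minute load averages from an uptime line."""
    match = re.search(
        r"load averages?:\s*([\d.]+),?\s+([\d.]+),?\s+([\d.]+)", uptime_line
    )
    if match is None:
        raise ValueError("no load averages found in: %r" % uptime_line)
    return tuple(float(value) for value in match.groups())


def load_trend(one_min, fifteen_min, ratio=1.2):
    """Classify the load trend.

    `ratio` is an arbitrary illustrative threshold (an assumption): the
    1-minute average must exceed the 15-minute average by this factor to
    count as rising, and the reverse to count as easing.
    """
    if one_min > fifteen_min * ratio:
        return "rising"
    if fifteen_min > one_min * ratio:
        return "easing"
    return "steady"


line = " 23:51:26 up 21:31,  1 user,  load average: 30.02, 26.43, 19.02"
one, five, fifteen = parse_load_averages(line)
print(load_trend(one, fifteen))  # prints "rising"
```

A result of "rising" means the load is current and worth investigating now, while "easing" suggests the pressure may already have passed.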
dmesg | tail
$ dmesg | tail
[1880957.563150] perl invoked oom-killer: gfp_mask=0x280da, order=0, oom_score_adj=0
[1880957.563400] Out of memory: Kill process 18694 (perl) score 246 or sacrifice child
[1880957.563408] Killed process 18694 (perl) total-vm:1972392kB, anon-rss:1953348kB, file-rss:0kB
[2320864.954447] TCP: Possible SYN flooding on port 7001. Dropping request.  Check SNMP counters.
This command prints the last 10 lines of the kernel log. In the example output you can see the kernel OOM killer terminating a process and TCP dropping a request. These messages can help when troubleshooting performance issues, so do not skip this step.
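As a rough illustration of what to look for, the following Python sketch scans kernel log lines for the two kinds of message shown in the example. The pattern table and the function name are assumptions made for this example; real kernels emit many other messages worth checking, so treat it as a starting point rather than a complete filter.

```python
import re

# Illustrative patterns based only on the two messages in the sample output;
# the labels are hypothetical names chosen for this sketch.
PATTERNS = {
    "oom_kill": re.compile(r"Killed process (\d+) \(([^)]+)\)"),
    "syn_flood": re.compile(r"Possible SYN flooding on port (\d+)"),
}


def scan_kernel_log(lines):
    """Return (label, captured_groups) for every line matching a known pattern."""
    findings = []
    for line in lines:
        for label, pattern in PATTERNS.items():
            match = pattern.search(line)
            if match:
                findings.append((label, match.groups()))
    return findings


sample = [
    "[1880957.563150] perl invoked oom-killer: gfp_mask=0x280da, order=0, oom_score_adj=0",
    "[1880957.563400] Out of memory: Kill process 18694 (perl) score 246 or sacrifice child",
    "[1880957.563408] Killed process 18694 (perl) total-vm:1972392kB, anon-rss:1953348kB, file-rss:0kB",
    "[2320864.954447] TCP: Possible SYN flooding on port 7001. Dropping request.  Check SNMP counters.",
]

for label, details in scan_kernel_log(sample):
    print(label, details)
```

On the sample lines this reports one OOM kill (process 18694, perl) and one SYN flooding warning on port 7001.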
vmstat 1
$ vmstat 1
procs -----------memory----------- ---swap-- -----io---- --system-- ------cpu-----
 r  b swpd      free   buff  cache   si   so    bi    bo    in   cs us sy id wa st
34  0    0 200889792  73708 591828    0    0     0     5     6   10 96  1  3  0  0
32  0    0 200889920  73708 591860    0    0     0   592 13284 4282 98  1  1  0  0
32  0    0 200890112  73708 591860    0    0     0     0  9501 2154 99  1  0  0  0
32  0    0 200889568  73712 591856    0    0     0    48 11900 2459 99  0  0  0  0
32  0    0 200890208  73712 591860    0    0     0     0 15898 4840 98  1  1  0  0
Each line of vmstat(8) output shows a set of core system metrics, giving a more detailed picture of the system's state. The trailing argument 1 makes it print statistics once per second. The header row gives the meaning of each column; the columns most relevant to performance tuning are:
r: the number of processes running on a CPU or waiting for one. This reflects CPU load better than the load average does, because it does not include processes waiting on I/O. If the value is greater than the number of CPU cores, the machine's CPUs are saturated.
free: the amount of free memory, in kilobytes. If too little memory remains, the system will run into performance problems. The free command, introduced below, gives a more detailed view of memory usage.
si, so: the amount of memory swapped in and swapped out. If these values are nonzero, the system is actively swapping and the machine is short of physical memory.
us, sy, id, wa, st: the breakdown of CPU time: user time (user), system or kernel time (sys), idle time (idle), I/O wait time (wait), and stolen time (stolen, usually consumed by other virtual machines).
This breakdown quickly shows whether the CPUs are busy. In general, if user time plus system time is very high, the CPUs are busy executing instructions. If I/O wait stays high, the bottleneck is probably disk I/O.
In the example output, a large share of CPU time is spent in user mode, meaning that user-level applications are the ones consuming CPU. This is not necessarily a performance problem; it has to be analyzed together with the r column (the run queue).
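The rules of thumb above can be combined into a small Python sketch that parses one vmstat data row and lists what it suggests. Everything here is illustrative: the function names, the busy_pct and iowait_pct thresholds, and the CPU count of 32 passed in the usage line are assumptions for this example, and the parser expects the 17-column layout shown in the header (some vmstat versions print extra columns).

```python
# Column names in the order of the vmstat header shown in this section.
VMSTAT_COLUMNS = ["r", "b", "swpd", "free", "buff", "cache", "si", "so",
                  "bi", "bo", "in", "cs", "us", "sy", "id", "wa", "st"]


def parse_vmstat_row(row):
    """Turn one whitespace-separated vmstat data row into a dict of ints."""
    values = [int(field) for field in row.split()]
    if len(values) != len(VMSTAT_COLUMNS):
        raise ValueError("expected %d columns, got %d"
                         % (len(VMSTAT_COLUMNS), len(values)))
    return dict(zip(VMSTAT_COLUMNS, values))


def assess(sample, cpu_count, busy_pct=90, iowait_pct=20):
    """Apply the rules of thumb; both thresholds are arbitrary assumptions."""
    notes = []
    if sample["r"] > cpu_count:
        notes.append("CPU saturated: run queue exceeds CPU count")
    if sample["si"] or sample["so"]:
        notes.append("swapping: physical memory is short")
    if sample["us"] + sample["sy"] >= busy_pct:
        notes.append("CPUs busy executing instructions")
    if sample["wa"] >= iowait_pct:
        notes.append("high I/O wait: disk may be the bottleneck")
    return notes


row = "34 0 0 200889792 73708 591828 0 0 0 5 6 10 96 1 3 0 0"
# cpu_count=32 is a hypothetical value chosen for this example.
for note in assess(parse_vmstat_row(row), cpu_count=32):
    print(note)
```

With these assumed values the row is flagged both as CPU saturated (r is 34 against 32 CPUs) and as CPU busy (us plus sy is 97), which matches the reading that high user time only matters once it is considered together with the run queue.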