One, CPU
Good condition indicators
CPU utilization: User Time <= 70%, System Time <= 35%, User Time + System Time <= 70%.
Context switching: evaluate this together with CPU utilization; as long as CPU utilization is in good shape, a large number of context switches is still acceptable.
Runnable queue: each processor's run queue should hold <= 3 threads (a quick way to check the utilization and run-queue thresholds is sketched below).
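A minimal sketch of an automated check against these thresholds, assuming a Linux box with vmstat and nproc available and the default vmstat column layout (r in column 1, us and sy in columns 13 and 14):

# Sample vmstat once per second for 5 intervals, skip the two header lines
# and the first (since-boot) line, then compare r and us+sy to the thresholds.
ncpu=$(nproc)
vmstat 1 6 | tail -n +4 | awk -v ncpu="$ncpu" '
{
    if ($1 > 3 * ncpu)  print "run queue too long: r=" $1
    if ($13 + $14 > 70) print "CPU busy: us+sy=" ($13 + $14) "%"
}'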
Monitoring Tools
vmstat
$ vmstat 1
The fields line up with the header row. The following sample was taken from someone else's server:
procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu------
 r  b   swpd   free   buff  cache   si   so    bi    bo   in    cs us sy id wa st
14  0    140 2904316 341912 3952308   0    0     0   460 1106  9593 36 64  1  0  0
17  0    140 2903492 341912 3951780   0    0     0     0 1037  9614 35 65  1  0  0
20  0    140 2902016 341912 3952000   0    0     0     0 1046  9739 35 64  1  0  0
17  0    140 2903904 341912 3951888   0    0     0    76 1044  9879 37 63  0  0  0
16  0    140 2904580 341912 3952108   0    0     0     0 1055  9808 34 65  1  0  0
Important parameters:
r, run queue: the number of runnable processes waiting in the queue; they are ready to run but no CPU is currently free for them.
b, the number of blocked processes, waiting on IO requests to complete.
in, interrupts: the number of interrupts handled.
cs, context switch: the number of context switches the system performed.
us, the percentage of CPU time spent in user space.
sy, the percentage of CPU time spent in the kernel and handling interrupts.
id, the percentage of time the CPU is completely idle.
From the example above we can conclude:
sy is much higher than us and context switches (cs) are frequent, which indicates that this application is making a large number of system calls.
On this 4-core machine r should stay below 12; it is currently at 14 or more threads, so the CPU load is heavy.
Viewing the CPU usage of a particular process
$ while :; do ps -eo pid,ni,pri,pcpu,psr,comm | grep 'db_server_login'; sleep 1; done
  PID  NI PRI %CPU PSR COMMAND
28577 0 23 0.0 0 db_server_login
28578 0 23 0.0 3 db_server_login
28579 0 23 0.0 2 db_server_login
28581 0 23 0.0 2 db_server_login
28582 0 23 0.0 3 db_server_login
28659 0 23 0.0 0 db_server_login
......
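If the sysstat package is installed (it also provides the sar command used later in this article), pidstat offers a similar per-process view without the shell loop. A minimal sketch, using one of the example PIDs above:

# Report CPU usage (%usr, %system, %CPU) and the processor the task last ran on,
# sampled every second for the given PID.
$ pidstat -u -p 28577 1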
Two, Memory
Good condition indicators
swap in (si) == 0, swap out (so) == 0
Memory in use by applications / system physical memory <= 70% (a rough check using free is sketched below).
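A rough way to compute this ratio with free, assuming the classic free -m layout shown further below, where "used" minus buffers and cached approximates the memory applications actually hold; the 70% threshold and the column positions come from this article's guideline and sample output, not from any standard:

# Application memory as a share of physical memory, from free -m.
free -m | awk '/^Mem:/ {
    app_used = $3 - $6 - $7                  # used minus buffers minus cached
    pct = app_used * 100 / $2
    printf "applications hold %d MB of %d MB (%.0f%%)\n", app_used, $2, pct
    if (pct > 70) print "above the 70% guideline"
}'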
Monitoring Tools
vmstat
$ vmstat 1
procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu------
 r  b   swpd   free   buff  cache    si    so    bi    bo   in   cs us sy  id  wa st
 0  3 252696   2432    268   7148  3604  2368  3608  2372  288  288  0  0  21  78  1
 0  2 253484   2216    228   7104  5368  2976  5372  3036  930  519  0  0   0 100  0
 0  1 259252   2616    128   6148 19784 18712 19784 18712 3821 1853  0  1   3  95  1
 1  2 260008   2188    144   6824 11824  2584 12664  2584 1347 1174 14  0   0  86  0
 2  1 262140   2964    128   5852 24912 17304 24952 17304 4737 2341 86 10   0   0  4
Important parameters:
swpd, the amount of SWAP space in use, in KB.
free, the amount of available physical memory, in KB.
buff, the amount of physical memory used as buffers for read and write operations, in KB.
cache, the amount of physical memory used to cache process address space, in KB.
si, the amount of data read from SWAP into RAM (swap in), in KB.
so, the amount of data written from RAM out to SWAP (swap out), in KB.
From the example above we can conclude:
The available physical memory (free) shows essentially no significant change while swpd steadily increases, indicating that the minimum available memory is being kept at around 2.56MB out of the 256MB of physical memory, and that once dirty pages reach 10% the system starts to use swap heavily.
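A minimal sketch for catching this kind of swap activity as it happens, assuming the standard vmstat column order shown above (si in column 7, so in column 8):

# Run vmstat for 10 one-second intervals and report only the samples where
# the system actually swapped; the first (since-boot) sample is skipped.
vmstat 1 10 | awk 'NR > 3 && ($7 + $8 > 0) { print "swapping: si=" $7 " KB/s, so=" $8 " KB/s" }'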
free
$ free -m
             total       used       free     shared    buffers     cached
Mem:          8111       7185        926          0        243       6299
-/+ buffers/cache:        643       7468
Swap:         8189          0       8189
Three, Disk IO
Good condition indicators
%iowait < 20%
A simple way to improve the cache hit rate is to enlarge the file cache area: the larger the cache, the more pages it holds in advance, and the higher the hit rate.
The Linux kernel tries to turn page faults into minor page faults (served from the file cache) as far as possible and to avoid major page faults (served from disk), so as page faults accumulate the file cache grows steadily; only when the system is left with just a small amount of free physical memory does Linux start releasing unused pages.
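The split between minor and major faults can be observed directly; a sketch assuming the sysstat package (for sar -B) and procps ps are installed, reusing the db_server_login example from the CPU section:

# System-wide page-fault activity: fault/s counts all page faults,
# majflt/s only the ones that had to be read from disk.
$ sar -B 2 3

# Cumulative minor/major fault counts per process since it started.
$ ps -o pid,min_flt,maj_flt,comm -C db_server_login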
Monitoring Tools
Viewing physical memory and file cache usage
$ cat /proc/meminfo
MemTotal: 8182776 kB
MemFree: 3053808 kB
Buffers: 342704 kB
Cached: 3972748 kB
This server has a total of 8GB of physical memory (MemTotal), about 3GB of free memory (MemFree), about 343MB used as buffers for disk I/O (Buffers), and about 4GB used as file cache (Cached).
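A small sketch that turns the same /proc/meminfo fields into percentages of total memory (field names as shown above; values are in kB):

# Express MemFree, Buffers and Cached as shares of MemTotal.
awk '/^MemTotal:/ {t=$2} /^MemFree:/ {f=$2} /^Buffers:/ {b=$2} /^Cached:/ {c=$2}
     END { printf "free %.1f%%, buffers %.1f%%, cached %.1f%%\n", 100*f/t, 100*b/t, 100*c/t }' /proc/meminfo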
sar
$ sar -d 2 3
Linux 2.6.9-42.ELsmp (webserver) 11/30/2008 _i686_ (8 CPU)
11:09:33 PM       DEV       tps  rd_sec/s  wr_sec/s  avgrq-sz  avgqu-sz     await     svctm     %util
11:09:35 PM    dev8-0      0.00      0.00      0.00      0.00      0.00      0.00      0.00      0.00
11:09:35 PM       DEV       tps  rd_sec/s  wr_sec/s  avgrq-sz  avgqu-sz     await     svctm     %util
11:09:37 PM    dev8-0      1.00      0.00     12.00     12.00      0.00      0.00      0.00      0.00
11:09:37 PM       DEV       tps  rd_sec/s  wr_sec/s  avgrq-sz  avgqu-sz     await     svctm     %util
11:09:39 PM    dev8-0      1.99      0.00     47.76     24.00      0.00      0.50      0.25      0.05
Average:          DEV       tps  rd_sec/s  wr_sec/s  avgrq-sz  avgqu-sz     await     svctm     %util
Average:       dev8-0      1.00      0.00     19.97     20.00      0.00      0.33      0.17      0.02
Important parameters:
await: the average time each device I/O operation spends waiting, in milliseconds.
svctm: the average time each device I/O operation spends being serviced, in milliseconds.
%util: the percentage of each second that is spent doing I/O operations.
If svctm is very close to await, there is almost no I/O wait and disk performance is good; if await is much higher than svctm, the I/O queue is waiting too long and the applications running on the system will slow down.
If %util is close to 100%, too many disk I/O requests are being generated, the I/O system is already running at full capacity, and the disk may be the bottleneck.
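The two rules above can be applied automatically; a rough sketch, assuming the 12-hour sar -d layout shown above (await in field 9, svctm in field 10, %util in field 11, per-sample device lines starting with "dev") and using arbitrary example thresholds:

# Flag samples where the disk looks saturated or queues are building up.
sar -d 2 30 | awk '$3 ~ /^dev/ {
    if ($11 > 90)              print $0, "<-- %util close to 100%"
    else if ($9 > 2 * $10 + 1) print $0, "<-- await far above svctm"
}'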
Four, Network IO
For UDP
Good condition indicators
The receive and send buffers should not hold network packets waiting to be processed for any length of time.
Monitoring Tools
netstat
For UDP services, view the state of all listening UDP ports:
$ watch netstat -lunp
Proto Recv-Q Send-Q Local Address Foreign Address State PID / Program name
udp 0 0 0.0.0.0:64000 0.0.0.0:* -
udp 0 0 0.0.0.0:38400 0.0.0.0:* -
udp 0 0 0.0.0.0:38272 0.0.0.0:* -
udp 0 0 0.0.0.0:36992 0.0.0.0:* -
udp 0 0 0.0.0.0:17921 0.0.0.0:* -
udp 0 0 0.0.0.0:11777 0.0.0.0:* -
udp 0 0 0.0.0.0:14721 0.0.0.0:* -
udp 0 0 0.0.0.0:36225 0.0.0.0:* -
It is normal for Recv-Q and Send-Q to be 0, or at least not to stay above 0 for any length of time.
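A small sketch that narrows the watch above to only the sockets with packets queued, assuming Recv-Q and Send-Q are columns 2 and 3 as in the output shown (it mirrors the while-loop style used in the CPU section):

# Print only the UDP sockets whose receive or send queue is non-empty.
while :; do
    netstat -lunp | awk '$1 == "udp" && ($2 > 0 || $3 > 0)'
    sleep 1
done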
For UDP services, view packet loss (packets the NIC has received but the application layer has not processed):
$ watch netstat -su
Udp:
278073881 packets received
4083356897 packets to unknown port received.
2474435364 packet receive errors
1079038030 packets sent
If the packet receive errors counter keeps increasing, packets are being dropped.
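A minimal sketch that samples the counter and prints how much it grows, so a steady increase is easy to spot (the 5-second interval is arbitrary):

# Track the growth of the "packet receive errors" counter over time.
prev=$(netstat -su | awk '/packet receive errors/ {print $1}')
while :; do
    sleep 5
    cur=$(netstat -su | awk '/packet receive errors/ {print $1}')
    echo "$(date +%T) packet receive errors: $cur (+$((cur - prev)) in the last 5s)"
    prev=$cur
done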
For TCP
Good condition indicators
For TCP, packets are not dropped because of a shortage of buffer space; packets that are lost for network reasons are retransmitted by the protocol layer, which guarantees they reach the other side.
So for TCP we pay more attention to the retransmission rate.
Monitoring Tools
# cat /proc/net/snmp | grep Tcp:
Tcp: RtoAlgorithm RtoMin RtoMax MaxConn ActiveOpens PassiveOpens AttemptFails EstabResets CurrEstab InSegs OutSegs RetransSegs InErrs OutRsts
Tcp: 1 200 120000 -1 105112 76272 620 23185 6 2183206 2166093 550 6 968812
Retransmission rate = RetransSegs / OutSegs
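A sketch that computes this rate directly from /proc/net/snmp; it reads the field names from the header line shown above rather than relying on fixed column positions:

# Retransmission rate = RetransSegs / OutSegs, as a percentage.
awk '/^Tcp:/ {
    if (!seen) { for (i = 1; i <= NF; i++) col[$i] = i; seen = 1 }
    else if ($(col["OutSegs"]) > 0)
        printf "retransmission rate: %.2f%%\n", 100 * $(col["RetransSegs"]) / $(col["OutSegs"])
}' /proc/net/snmp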
What range of values counts as acceptable depends on the specific business.
The business side usually cares more about response time.