When environmental issues come up, comparison is a good way to analyze them. If the comparison is done properly, we can avoid repeating the same problem and infer what might happen under similar circumstances. If it is done poorly, it is easy to draw the wrong conclusions. Here are a few simple examples to illustrate the point.
Comparing a MySQL restart
Some time ago a standby machine had a hardware failure. Fortunately the standby machine was supposed to be carrying a replica, but on closer inspection the "replica" on the standby machine turned out to have no relationship to the primary at all, which was enough to break a cold sweat. When we then set out to rebuild the replica, we found that the primary did not even have the binlog enabled, so there was no way around a restart. After evaluation we found another environment with exactly the same problem, so I applied for a maintenance window to do the restarts and took the opportunity to do some database-level parameter tuning at the same time. The restart therefore involved two environments, one on 5.5 and one on 5.6; to be safe, the 5.6 instance kept the original 5.5-style configuration and GTID was not enabled. The restart itself was nothing technical, but afterwards the replica's error log carried some warnings, as follows:
2015-12-22 07:42:23 26782 [Warning] Aborted connection 1238 to db: 'unconnected' user: 'unauthenticated' host: 'gate_app_4.172' (Got an error reading communication packets)
2015-12-22 07:42:30 26782 [Warning] Aborted connection 1242 to db: 'unconnected' user: 'unauthenticated' host: 'gate_app_131.41' (Got an error reading communication packets)
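This warning generally means a client went away without closing its connection cleanly. As a rough way to gauge how widespread the symptom is, something like the following can be used; the error log path here is an assumption, not taken from the environment described in this post:

# count the aborted-connection warnings per client host (log path is an assumption)
grep "Aborted connection" /var/lib/mysql/error.log | grep -o "host: '[^']*'" | sort | uniq -c | sort -rn
# server-side counters that track the same symptom
mysql -e "SHOW GLOBAL STATUS LIKE 'Aborted%';"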
This caught us quite by surprise. From the comparison point of view, there are several scenarios to consider.
Comparison scenario 1: is this a parameter difference between 5.5 and 5.6, or is it a 5.6 bug, given that the warnings appear on the 5.6 server?
Clearly not: apart from enabling the binlog I did not change any other 5.6 parameter, precisely to keep the original configuration safe, and the changes made in the two environments were identical. Comparing the two configurations directly is sketched below.
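Since the point here is comparing configurations between the 5.5 and 5.6 instances, a minimal sketch of how to diff them, with the host names as placeholders:

# dump runtime variables from both instances and compare them (host names are placeholders)
mysql -h db55_host -e "SHOW GLOBAL VARIABLES;" > vars_55.txt
mysql -h db56_host -e "SHOW GLOBAL VARIABLES;" > vars_56.txt
diff vars_55.txt vars_56.txt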
Comparison scenario 2: was something wrong with the environment when the 5.6 instance was restarted?
This can also be ruled out: both servers went through the same restart, and the other one showed no similar problem.
Comparison scenario 3: for this problem, we need to check on the application side whether there are long connections that are never released.
This check was carried out as well; from the application side no related problem was found, and there really are a lot of environments involved. A quick check from the database side is sketched below.
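On the database side, a quick look for connections that sit idle for a long time can complement the application-side check; this is only a sketch, and the one-hour threshold is an arbitrary choice:

# connections idle for more than an hour (the threshold is arbitrary)
mysql -e "SELECT id, user, host, db, time FROM information_schema.processlist WHERE command = 'Sleep' AND time > 3600 ORDER BY time DESC;"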
Comparison scenario 4: there have recently been some changes to the network; did the DNS changes have an impact?
We asked the systems group to help check this as well, but they found nothing related in the logs.
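If DNS really were a factor, two quick things to look at would be whether the client hosts still resolve from the database server and whether the server does reverse lookups at all; this is only a sketch, and the host name is a placeholder:

# does the client host still resolve from the database server? (host name is a placeholder)
nslookup gate_app_host
# if skip_name_resolve is ON, reverse DNS is not involved in authentication
mysql -e "SHOW GLOBAL VARIABLES LIKE 'skip_name_resolve';"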
Comparison scenario 5: compare the state after the restart with the state before it.
Is there a big discrepancy between the log before the restart and the log after it? Looking at the time stamps in error.log, the warnings ran on for several pages; after paging back through four or five screens of the same warnings I thought to check the even older log, and found that the same problem had already occurred before the restart.
So this comparison gave us a reference point. Comparing against other environments can also yield some useful conclusions, but comparing the current environment before and after the restart reflects the root of the problem better: since the same problem existed before the restart, it is a historical issue, and these applications had apparently been failing to connect all along. A quick way to do this before/after comparison is sketched below.
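The before/after comparison itself can be as simple as counting the warnings per day around the restart; a minimal sketch, assuming the default error log location:

# warnings per day, to see whether the restart changed the pattern (log path is an assumption)
grep "Aborted connection" /var/lib/mysql/error.log | awk '{print $1}' | sort | uniq -c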
Importing a MySQL dump
A while back I migrated several sets of data to cloud-based MySQL servers and ran into a few problems that bothered me quite a bit.
Because of the data volume, we used mysqldump to take logical exports and then imported them directly into the target environment. Since it was a brand-new environment, some databases imported without any problem, but one database would always abort partway through the import.
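The export and import were plain mysqldump and mysql client runs; a minimal sketch of the kind of commands involved, with the dump options, database name and file name as assumptions:

# logical export on the source (options and names are assumptions)
mysqldump --single-transaction --routines --triggers appdb > appdb.dmp
# logical import on the target
mysql appdb < appdb.dmp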
The error reported was:
ERROR 2013 (HY000) at line 8441: Lost connection to MySQL server during query
Of course, for this problem, let's again use a few scenarios and work through it by comparison.
First, the environment has 16 GB of memory, and there were three dumps of roughly 10 GB, 20 GB and 30 GB. To save time at the very beginning, I started three nohup processes to import the data concurrently into different databases, roughly as sketched below.
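A minimal sketch of that concurrent import, with file and database names as placeholders:

# three imports started concurrently into different databases (names are placeholders)
nohup mysql db1 < dump_10g.dmp > imp_db1.log 2>&1 &
nohup mysql db2 < dump_20g.dmp > imp_db2.log 2>&1 &
nohup mysql db3 < dump_30g.dmp > imp_db3.log 2>&1 &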
Scenario 1: importing the three dumps concurrently failed.
Scenario 2: the serial approach also reported errors; importing the dumps one by one still failed, and since the failures only surfaced when checking the logs afterwards, it was not even certain which imports had completed successfully.
Scenario 3: since the investigation was still ongoing, I had some time to run a few tests, and importing the 20 GB dump again still failed.
Scenario 4: following the comparison idea, the 30 GB dump surely would not import either, and indeed it could not; but I noticed that the 30 GB import always failed at one particular partitioned table.
Scenario 5: I tried importing that partitioned table from the 30 GB dump on its own, and the problem was still there. Fortunately, this is when I started looking at the system log and found it was the work of the oom-killer, closely related to how little memory remained, and of course to swap as well.
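Confirming that the oom-killer is the culprit is usually just a matter of checking the kernel log; on CentOS the messages file is the typical place to look, though the exact path can vary:

# look for oom-killer activity and the process it killed
dmesg | grep -iE "oom|killed process"
grep -i "out of memory" /var/log/messages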
Scenario 6: after discovering that these cloud servers had no swap configured at all, I added a swap file and imported the 10 GB dump, and it succeeded.
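Adding the swap file looked roughly like the following; the 16 GB size matches what is mentioned below, and the path matches the swapoff error further down, but the exact commands are an assumption:

# create and enable a 16 GB swap file (exact commands are an assumption)
dd if=/dev/zero of=/home/swapfile bs=1M count=16384
chmod 600 /home/swapfile
mkswap /home/swapfile
swapon /home/swapfile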
Scenario 7: importing the 20 GB dump also succeeded, but swap usage sat at around 10 GB. The swap was configured at 16 GB, so why did usage hover around 10 GB? It comes down to the default swap configuration: the default swappiness is 60%, which works out to roughly 9.6 GB and matches the 10 GB observed. As for why so much swap was being consumed at all, my initial suspicion was that because the investigation had dragged on, the business had to go live and the application could not be stopped, so the data was being imported into an online system.
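The swappiness setting mentioned above can be checked and changed like this; the value 10 is only an illustrative choice, not something that was done in this migration:

# current swappiness (the default is usually 60)
cat /proc/sys/vm/swappiness
# make the kernel prefer to keep pages in RAM (the value 10 is an illustrative choice)
sysctl -w vm.swappiness=10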
Scenario 8: So then import 30g of dump, if it is still successful, unfortunately this still failed, because of a oom-killer, the corresponding thread is terminated, swap complete release, swap usage suddenly reset 5M up.
Scenario 9: I tried importing the 30 GB dump once more, and this time there was no problem, but because the import ran against a live system there were some lock waits, resource consumption was genuinely high, and swap usage climbed to around 10 GB.
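While such an online import is running, memory and swap can be watched with ordinary tools; just a sketch:

# memory, swap in/out and I/O sampled every 5 seconds during the import
vmstat 5
# point-in-time view of memory and swap usage
free -m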
Scenario 10: the dump had been imported successfully, so why was the swap not released? One option is to remount the swap partition, but, frustratingly, that also failed because of insufficient memory and reported the following error:
# swapoff -a
swapoff: /home/swapfile: swapoff failed: Cannot allocate memory
So how could this situation be changed? As things stood, the only reliable fix was to restart the application. I contacted the application owners about a restart but could not get it coordinated, so the matter dragged on for a few days.
Scenario 11: a few days later I took another look and found that the swap had been reset automatically, and the reason for the reset was, once again, the oom-killer. It appears a connection was forcibly terminated, that triggered the oom-killer, and then the swap was completely released.
After so many seemingly trivial scenarios, one question remained: why was memory never enough in the first place? Besides swap there had to be another reason, and one was finally found: innodb_buffer_pool_size was set far too large. The machine has only 16 GB of memory, yet the buffer pool ended up configured at 24 GB. How could such a silly mistake happen? Tracing it back, the template we used only checked for Red Hat, not for CentOS, and this server happens to run CentOS, so during initialization the parameter was set straight to 24 GB. The template clearly has a problem; its validation logic is not strict enough.
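The kind of sanity check the template was missing might look like this: detect the OS in a way that covers both Red Hat and CentOS, and refuse a buffer pool larger than physical memory. This is only a sketch, and the 70% cap is an illustrative choice, not a value from the original setup:

# /etc/redhat-release exists on both Red Hat and CentOS
grep -qiE "red hat|centos" /etc/redhat-release && echo "RHEL-family OS detected"
# physical memory in MB and an illustrative 70% ceiling for innodb_buffer_pool_size
mem_mb=$(awk '/MemTotal/ {print int($2/1024)}' /proc/meminfo)
echo "innodb_buffer_pool_size should stay below $((mem_mb * 70 / 100))M on a ${mem_mb}M host"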
Comparing back and forth, we find that comparison can sometimes help us analyze a problem and sometimes mislead us; everything is a matter of degree. Of course, if you do a piece of work and it yields no output and no conclusions, it has no practical value.