A week ago, on October 18, Intel and Red Hat jointly held the Shanghai Ceph Day. At the meeting, a number of experts gave more than a dozen excellent talks. This article summarizes, based on my own understanding, the Ceph performance optimization knowledge and methods mentioned in those talks.
0. Conventional Ceph performance optimization
(1) Hardware level
Hardware planning: CPU, memory, network
SSD selection: use SSDs for journal storage
BIOS settings: enable Hyper-Threading (HT), disable power saving, disable NUMA, and so on
(2) Software level
Linux OS: MTU, read_ahead, etc.
Ceph configuration and PG number adjustment: calculate the PG count with the formula Total PGs = (Total_number_of_OSD * 100) / max_replication_count.
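The PG formula can be worked through with a short sketch. The function names and the 40-OSD, 3-replica example are assumptions chosen for illustration. Rounding the result up to a power of two is a commonly recommended extra step and is not part of the formula as quoted.

```python
def total_pgs(num_osds: int, max_replication_count: int, pgs_per_osd: int = 100) -> int:
    """Raw PG total: (Total_number_of_OSD * 100) / max_replication_count."""
    if num_osds < 1 or max_replication_count < 1:
        raise ValueError("OSD count and replication count must be positive")
    return (num_osds * pgs_per_osd) // max_replication_count


def round_up_to_power_of_two(n: int) -> int:
    """Smallest power of two >= n (assumed rounding step, not in the quoted formula)."""
    if n < 1:
        return 1
    return 1 << (n - 1).bit_length()


# Hypothetical cluster: 40 OSDs, pools replicated 3 times.
raw = total_pgs(40, 3)                    # 4000 // 3 = 1333
rounded = round_up_to_power_of_two(raw)   # 2048
print(raw, rounded)
```

The raw value is a cluster-wide target; how it is split across pools is a separate decision that this sketch does not cover.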
For more information, refer to the following articles:
Ceph performance optimization summary (v0.94)
Measure Ceph RBD performance in a quantitative way 1,2
Ceph tuning --Journal and tcmalloc
Official Ceph blog: Ceph Cuttlefish vs Bobtail Part 1: Introduction and RADOS Bench
1. Use a tiered caching layer - Tiered Cache
This is obviously not a new Ceph feature. At the meeting, experts in this area described in detail the principles and usage of this feature, as well as the details of combining it with erasure coding.
Each cache tier uses one RADOS pool. The cache pool must be of the replicated type, while the backing pool may be either replicated or erasure-coded.
Different tiers use different hardware media, and the media used by the cache pool must be faster than the media used by the backing pool: for example, ordinary storage media such as conventional HDDs or SATA SSDs in the backing pool, and fast media such as PCIe SSDs in the cache pool.
Each tier uses its own CRUSH rules, so that data is written to the different storage media.
librados supports tiered caching internally. In most cases it knows which tier the client's data should be placed on, so no changes are needed in the RBD, CephFS, or RGW clients.
The OSDs independently handle the movement of data between the two tiers: promotion (HDD -> SSD) and eviction (SSD -> HDD). However, this data movement is expensive and time-consuming (the cache takes a long time to "warm up").
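Promotion and eviction between the two tiers can be illustrated with a toy model. This is only a sketch: the class name, the LRU eviction policy, and the object-count capacity are all assumptions for illustration and do not reflect how Ceph's cache tiering agent actually chooses objects.

```python
from collections import OrderedDict


class ToyCacheTier:
    """Toy two-tier store: a small fast cache in front of a slow backing store."""

    def __init__(self, capacity: int):
        self.capacity = capacity          # max number of objects in the cache tier
        self.cache = OrderedDict()        # fast tier, kept in LRU order
        self.backing = {}                 # slow tier
        self.hits = 0
        self.promotions = 0
        self.evictions = 0

    def write(self, key, value):
        # Writes land in the cache tier first.
        self.cache[key] = value
        self.cache.move_to_end(key)
        self._evict_if_needed()

    def read(self, key):
        if key in self.cache:
            self.hits += 1
            self.cache.move_to_end(key)
            return self.cache[key]
        value = self.backing[key]         # raises KeyError if the object is unknown
        self.cache[key] = value           # promotion: slow tier -> fast tier
        self.promotions += 1
        self._evict_if_needed()
        return value

    def _evict_if_needed(self):
        # Eviction: push least recently used objects down to the slow tier.
        while len(self.cache) > self.capacity:
            old_key, old_value = self.cache.popitem(last=False)
            self.backing[old_key] = old_value
            self.evictions += 1


tier = ToyCacheTier(capacity=2)
for name in ("a", "b", "c"):
    tier.write(name, name.upper())
tier.read("a")   # miss: "a" is promoted and "b" is evicted
tier.read("a")   # hit
print(tier.hits, tier.promotions, tier.evictions)
```

Even in this toy model every miss costs a read from the slow tier plus a possible write back, which is why a cold cache performs poorly until the working set has been promoted.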
2. Use better SSDs - Intel NVM Express (NVMe) SSD
Ceph clusters often use SSDs as journal and caching media to improve cluster performance. In the results presented, a cluster using SSDs for the journal showed 64K sequential write speed increased 1.5 times and 4K random write speed increased 32 times compared with an all-HDD cluster.
Placing the journal and the OSD data on separate SSDs, rather than on the same SSD, also improves performance. In the results presented, with both on the same SATA SSD, compared with using two separate SSDs (journal on a PCIe SSD, OSD data on a SATA SSD), 64K sequential write speed dropped by 40% and 4K random write speed dropped by 13%.
Therefore, more advanced SSDs naturally improve Ceph cluster performance further. SSD media (flash chips) have gone through roughly three generations so far, each more advanced than the last, with higher density (larger capacity) and faster reads and writes. Currently, the most advanced is the Intel NVMe SSD, which has the following characteristics:
A standardized software interface customized for PCIe drives
Customized for SSDs (other interfaces were made for PCIe in general)
The ratio of SSD journals to HDD OSDs rises from the conventional 1:5 to 1:20
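The effect of the journal ratio on hardware planning can be shown with a small calculation. The function name and the 60-HDD node are assumptions for illustration; only the 1:5 and 1:20 ratios come from the text.

```python
import math


def journal_ssds_needed(num_hdd_osds: int, hdds_per_journal_ssd: int) -> int:
    """Number of journal SSDs needed when one SSD serves a fixed number of HDD OSDs."""
    if num_hdd_osds < 0 or hdds_per_journal_ssd < 1:
        raise ValueError("invalid OSD count or ratio")
    return math.ceil(num_hdd_osds / hdds_per_journal_ssd)


# Hypothetical deployment with 60 HDD OSDs.
conventional = journal_ssds_needed(60, 5)    # 1:5 ratio  -> 12 journal SSDs
nvme = journal_ssds_needed(60, 20)           # 1:20 ratio -> 3 journal SSDs
print(conventional, nvme)
```

A higher ratio means fewer journal devices, but also that more OSDs are affected if a single journal device fails.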
For all-SSD clusters, an all-NVMe-SSD Ceph cluster naturally performs best, but its cost is too high and its performance is often limited by NIC / network bandwidth. So in an all-SSD environment, the recommended configuration is to use NVMe SSDs for the journal and conventional SSDs for the OSD data.
Intel SSDs can also be combined with Intel Cache Acceleration Software (CAS), which intelligently places data on SSD or HDD according to the characteristics of the data:
Test configuration: Intel NVMe SSD as the cache, using Intel CAS Linux 3.0 with the hinting feature (to be released later this year)
Test results: with a 5% cache, throughput doubled and latency was halved
3. Use better network equipment - Mellanox NICs, switches, etc.
3.1 Higher-bandwidth, lower-latency NICs
Mellanox is a company headquartered in Israel with approximately 1,900 employees worldwide, focusing on high-end network equipment; its 2014 revenue was 463.6M. (Just today I saw on the Mizuki BBS that compensation at the company's China branch is also very good.) Its main ideas and products:
Ceph's scale-out nature requires higher network throughput and lower latency for replication, sharing, and metadata (file) traffic
10 GbE (Gigabit Ethernet) can no longer meet the requirements of high-performance Ceph clusters (it generally cannot keep up with clusters of more than 20 SSDs), and the industry has begun to enter the 25, 50, and 100 GbE era. Currently, the cost of 25GbE is relatively high.
Most network equipment companies use Qualcomm chips, whereas Mellanox uses self-developed chips, with the industry's lowest latency (220 ns)
A Ceph cluster needs two high-speed networks: the public network for client access, and the cluster network for heartbeat, replication, recovery, and rebalancing.
SSDs are now widely used in Ceph clusters, and fast storage devices need faster network equipment
Actual test:
(1) Test environment: the cluster network uses a 40GbE switch; the public network uses 10 GbE and 40GbE equipment in turn for comparison
(2) Test results: the cluster using 40GbE equipment achieved 2.5 times the throughput of the 10 GbE cluster, and IOPS increased by 15%.
Some companies already use Mellanox network equipment to build all-SSD Ceph servers. For example, SanDisk's InfiniFlash uses the company's 40GbE NICs, with two Dell R720 servers as OSD nodes and 512 TB of SSD, for a total throughput of 71.6 Gb/s. Other examples include Fujitsu and Monash University.
3.2 RDMA technology
Traditionally, accessing hard disk storage took tens of milliseconds, while the network and protocol stack took a few hundred microseconds. In that period, 1 Gb/s networks were common, with the SCSI protocol used to access local storage and iSCSI used to access remote storage. With SSDs, the time to access local storage has dropped to a few hundred microseconds, so if the network and protocol stack do not improve correspondingly, they become the performance bottleneck. This calls for higher network bandwidth, such as 40 Gb/s or even 100 Gb/s. Remote storage is still accessed over iSCSI, but TCP is no longer sufficient, and this is where RDMA technology comes in. RDMA stands for Remote Direct Memory Access, and was created to address the latency of server-side data processing in network transfers. It transfers data over the network directly into a computer's memory, moving data quickly from one system into the memory of a remote system without involving the operating system, so it needs little of the computer's processing power. It eliminates external memory copies and context switches, freeing up bus capacity and CPU cycles to improve application performance. The conventional approach, by contrast, requires the system to analyze and tag incoming data before storing it in the correct area.
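The bottleneck argument can be made concrete with a small calculation. The specific latencies below (20 ms for an HDD access, 300 microseconds for an SSD access and for the network and protocol stack) are assumed values picked from within the ranges mentioned above, not measurements.

```python
def network_share(storage_latency_us: float, network_latency_us: float) -> float:
    """Fraction of total request latency spent in the network and protocol stack."""
    total = storage_latency_us + network_latency_us
    if total <= 0:
        raise ValueError("total latency must be positive")
    return network_latency_us / total


NETWORK_US = 300.0     # assumed: "a few hundred microseconds"
HDD_US = 20_000.0      # assumed: "tens of milliseconds"
SSD_US = 300.0         # assumed: "a few hundred microseconds"

hdd_share = network_share(HDD_US, NETWORK_US)
ssd_share = network_share(SSD_US, NETWORK_US)
print(f"HDD era: network is {hdd_share:.1%} of total latency")
print(f"SSD era: network is {ssd_share:.1%} of total latency")
```

Under these assumptions the network goes from a negligible share of request latency to roughly half of it, which is the motivation for lower-overhead transports such as RDMA.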
Mellanox is the industry leader in this technology. It is implemented through kernel bypass and protocol offload, providing high bandwidth, low latency, and low CPU usage. The company has implemented XioMessenger in Ceph, so that Ceph messages travel over RDMA instead of TCP, which improves cluster performance. This implementation is available in the Ceph Hammer release.
4. Use better software - Intel SPDK and related technologies
4.1 Mid-Tier Cache scheme
This scheme adds a caching layer between client applications and the Ceph cluster to improve client access performance. The layer has the following characteristics:
Provides iSCSI / NVMF / NFS protocol support to Ceph clients;
Uses two or more nodes to improve reliability;
Adds a cache to speed up access;
Uses a write log to ensure data consistency across multiple nodes;
Connects to the backend Ceph cluster using RBD.
4.2 Use Intel DPDK and UNS technology
With this technology, Intel implements entirely in user space a DPDK NIC driver, a TCP / IP protocol stack (UNS), an iSCSI target, and an NVMe driver, to improve the performance of iSCSI access to Ceph. Benefits:
Compared with the Linux-IO Target (LIO), its CPU overhead is only 1/7.
The user-space NVMe driver uses 90% less CPU than the kernel-space NVMe driver.
A major feature of this scheme is its use of a user-mode NIC. To avoid conflicts with the kernel-mode NIC, in an actual deployment SR-IOV can be used to virtualize one physical NIC into multiple virtual NICs, which are then assigned to applications such as OSDs. Being implemented completely in user mode also avoids dependence on the kernel version.
Currently, Intel provides reference solutions optimized with Intel DPDK, UNS, and its storage stack; using them requires signing a usage agreement with Intel. The user-mode NVMe driver is already open source.
4.3 CPU data storage acceleration - ISA-L technology
This code library uses the new instruction sets of the Intel E5-2600 / 2400 and Atom C2000 product family CPUs to implement the corresponding algorithms, making maximum use of the CPU and greatly improving data access speed. However, it currently supports only single-core x64 Xeon and Atom CPUs. In the example presented, erasure coding (EC) speed increased several-fold, and overall cost was reduced by 25 to 30 percent.
5. Use systematic tools and methods - summary of Ceph performance testing and tuning tools
A number of Ceph performance testing and tuning tools were also presented at the meeting.
5.1 Intel CeTune
This Intel tool can be used to deploy, benchmark, analyze, and tune Ceph clusters, and its code is now open source. Key features include:
Users can configure CeTune through its WebUI
Deployment module: deploy Ceph using the CeTune CLI or GUI
Performance test module: supports qemurbd, fiorbd, cosbench, and others for performance testing
Analysis module: iostat, sar, interrupt, performance counter, and other analysis tools
Report view: supports configuration download and chart view
5.2 Common performance testing and tuning tools
Ceph software stack (possible failure points and performance tuning points):
Summary of performance visibility tools:
Summary of benchmarking tools:
Summary of tuning tools:
Compared with traditional performance optimization methods, the approaches above each have their own innovations. Among them:
Better hardware, including SSDs and network devices, naturally brings better performance, but cost rises accordingly and the size of the performance gain varies, so a tradeoff is needed among use scenario, cost, and optimization benefit;
Better software is mostly not yet open source and mostly still in beta, some way from production use, and is tightly bound to Intel hardware;
More comprehensive methods are something most Ceph professionals need to study carefully and apply in everyday work, so that performance problems can be located and solved more efficiently;
Intel's investment in Ceph is very large. If customers have Ceph cluster performance problems, they can send the relevant data to Intel, who will provide recommendations accordingly.
Note: All of the above comes from the presentations at this meeting and the materials distributed afterward. If any content published here is inappropriate, please contact me. Thanks again to Intel and Red Hat for organizing this meeting.