|
We have entered the era of big data: the continuing development of large-scale Internet (Web 2.0) applications and of cloud computing demands massive computation and massive storage, and traditional relational databases can no longer meet this demand. NoSQL databases, which continue to develop and mature, can satisfy the requirements of mass-storage and mass-computation applications. This article focuses on MongoDB, one of the NoSQL databases, and describes its application to mass data storage.
1 Introduction
NoSQL, which stands for "Not Only SQL", refers to non-relational databases. Such databases share several main characteristics: they are non-relational, distributed, open source, and horizontally scalable. Originally aimed at large-scale web applications, this new wave of databases was proposed early on as a revolutionary movement and has gained increasing momentum since 2009. Typical non-relational data stores exhibit features such as schema freedom, support for simple replication, simple APIs, eventual consistency (rather than ACID), and very large data volumes. They come in many varieties: column stores (Hadoop/HBase, Cassandra, Hypertable, Amazon SimpleDB, etc.), document databases (MongoDB, CouchDB, OrientDB, etc.), key-value stores (Azure Table Storage, Membase, Redis, Berkeley DB, MemcacheDB, etc.), graph databases (Neo4j, InfiniteGraph, Sones, Bigdata, etc.), object-oriented databases (db4o, Versant, Objectivity, Starcounter, etc.), grid and cloud databases (GigaSpaces, Queplix, Hazelcast, etc.), XML databases (MarkLogic Server, EMC Documentum xDB, BaseX, Berkeley DB XML, etc.), multi-value databases (U2, OpenInsight, OpenQM, etc.), and other non-relational databases (such as FileDB).
MongoDB is a NoSQL database: an open-source, schema-free, document-oriented, distributed database provided by the company 10gen, and a product that sits between relational and non-relational databases. Written in C++, it is designed to provide a scalable, high-performance data-storage solution for web applications. The data structures it supports are very loose, in a JSON-like format called BSON, so it can store fairly complex data types.
MongoDB runs on Solaris, Linux, Windows and OS X, and supports both 32-bit and 64-bit applications; in a 32-bit environment a single database is limited to a maximum capacity of 2 GB, while in a 64-bit environment storage is limited only by the available disk space. It provides drivers for Java, C#, PHP, C, C++, JavaScript, Python, Ruby, Perl and other languages. The latest production version is 2.0, and the official download address is http://www.mongodb.org/downloads. More than one hundred sites and companies currently use it, including Visual China, Dianping, Taobao, Shanda, Foursquare, Wordnik, OpenShift, SourceForge, GitHub and others.
As enterprise data continues to accumulate and Web 2.0 applications keep developing, we have entered the era of personal information. A medium or large enterprise may generate large amounts of data every day across its various systems, such as documents of all kinds (OA documents, project documents, etc.), design drawings, high-definition pictures and video; employees, for their part, care more and more about the storage and processing of their personal information. When the amount of such information grows large enough, or when real-time analysis of the data is required, the traditional centralized approach can hardly meet the demand, so distributed storage and computation become an inevitable choice: on the one hand to solve the problem of mass storage, and on the other to solve the problem of massive computation. MongoDB can effectively address such applications; this article focuses on the distributed application of MongoDB to mass data storage.
2 Overview
2.1 Main features of MongoDB
(1) Files are stored in BSON format, with a JSON-style syntax that is easy to grasp and understand. Compared with JSON, BSON has better performance, mainly faster traversal, easier manipulation, and additional data types.
(2) Schema-free: documents and sub-arrays can be embedded without creating data structures in advance. This denormalized data model helps improve query speed (see the example document after this list).
(3) Dynamic queries: rich query expressions in JSON-style notation make it easy to query embedded objects, arrays and sub-documents.
(4) Full index support, including indexes on embedded objects and arrays, as well as full-text indexing; the MongoDB query optimizer analyzes a query expression and generates an efficient query plan.
(5) Efficient binary storage for large objects (such as high-definition pictures and video).
(6) Multiple replication modes providing redundancy and automatic failover: Master-Slave, Replica Pairs/Replica Sets, and limited Master-Master modes are supported.
(7) Server-side scripting and Map/Reduce, enabling computation over massive data, that is, cloud-computing capability.
(8) High performance and speed: in most cases query speed is much faster than MySQL while CPU usage remains very low, and deployment is simple, with almost zero configuration.
(9) Automatic sharding: the sharding function enables horizontally scalable database clusters, to which nodes can be added or removed dynamically.
(10) Built-in GridFS, supporting mass storage.
(11) Network access: MongoDB's efficient wire protocol outperforms HTTP/REST-based access in terms of performance.
(12) Rich third-party support: the MongoDB community is active, more and more companies and websites use MongoDB in production to optimize their technology infrastructure, and strong technical support is available from the official company, 10gen.
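To make the schema-free, denormalized model of feature (2) concrete, the following is a small illustrative example (the collection name, field names and values are hypothetical, not taken from this paper's test data). A single document embeds a sub-document and an array directly, with no schema defined in advance, and the embedded fields can be queried with dot notation as noted in feature (3):

db.employee.insert({
    id: 1,
    name: "Smith",
    age: 30,
    contact: {city: "Beijing", phone: "010-12345678"},  // embedded sub-document
    projects: ["OA", "ERP", "CRM"]                      // embedded array
})
db.employee.find({"contact.city": "Beijing"})           // query an embedded field directly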
2.2 Application scenarios of MongoDB
The main goal of MongoDB is to bridge the gap between key/value stores (which offer high performance and high scalability) and traditional RDBMS systems (which offer rich functionality), combining the advantages of both.
(1) Website data: MongoDB is well suited to real-time inserts, updates and queries, and provides the replication and high scalability required for real-time website data storage.
(2) Caching: because of its high performance, MongoDB is also suitable as the caching layer of an information infrastructure. After a system restart, a persistent cache layer built on MongoDB prevents the underlying data source from being overloaded.
(3) Large-volume, low-value data: storing such data in a traditional relational database can be expensive, and programmers have often resorted to plain file storage instead.
(4) Highly scalable scenarios: MongoDB is well suited to databases composed of dozens or hundreds of servers, and its roadmap already includes built-in support for a MapReduce engine.
(5) Storage of objects and JSON data: MongoDB's BSON format is ideal for storing and querying document-style data.
2.3 Architecture of MongoDB
A MongoDB database consists of a series of physical files (data files, log files, etc.) and the corresponding set of logical structures (collections, documents, etc.).
The logical structure of MongoDB is a hierarchy composed of three levels: the document (equivalent to a row in a relational database), the collection (equivalent to a table), and the database (equivalent to a relational database's database).
A single MongoDB instance supports multiple databases. Internally, each database consists of one .ns file and several data files. MongoDB pre-allocates space so that spare room is always available in additional data files, which effectively avoids the disk pressure caused by sudden data growth. Each pre-allocated file is filled with zeros, and each newly allocated data file is twice the size of the previous one, up to a maximum of 2 GB per data file.
2.4 Comparison of MongoDB and MS SQL Server statements
MongoDB provides rich query expressions that can perform the functions of the vast majority of relational SQL statements. The comparison is illustrated with the example table employee (id, name, age).
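The paper's original comparison table is not reproduced here; the following pairs are a brief sketch of typical equivalences on the employee (id, name, age) table (the exact statements compared in the original table may differ):

SQL:     SELECT * FROM employee
MongoDB: db.employee.find()

SQL:     SELECT * FROM employee WHERE age > 30
MongoDB: db.employee.find({age: {$gt: 30}})

SQL:     INSERT INTO employee (id, name, age) VALUES (1, 'Smith', 30)
MongoDB: db.employee.insert({id: 1, name: "Smith", age: 30})

SQL:     UPDATE employee SET age = 31 WHERE id = 1
MongoDB: db.employee.update({id: 1}, {$set: {age: 31}})

SQL:     DELETE FROM employee WHERE id = 1
MongoDB: db.employee.remove({id: 1})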
3 Process Analysis and Testing
3.1 GridFS Overview
The size of a BSON object in MongoDB is limited: before version 1.7 the maximum size of a single BSON object was 4 MB, and from version 1.7 onwards it is 16 MB [5]. For ordinary file storage a single-object capacity of 4 to 16 MB is sufficient, but it cannot accommodate large files such as high-definition pictures, design drawings and video. For mass data storage MongoDB therefore provides the built-in GridFS, which splits a large file into multiple smaller documents; the chunk size can be specified, and the splitting is transparent to the user. GridFS uses two collections to store data: files (containing the metadata objects) and chunks (containing the binary blocks and related information). So that multiple GridFS stores can be named within a single database, the files and chunks collections are given a prefix; the default prefix is fs, and users can change it.
GridFS provides good API support for Java, C#, Perl, PHP, Python, Ruby and other programming languages.
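To make the two collections concrete, the following is a small sketch of how a stored file appears in them when the default fs prefix is used (the file name and sizes are illustrative, not this paper's test data); with the filedocs prefix used later in this paper the collections are named filedocs.files and filedocs.chunks instead:

// one metadata document per stored file
db.fs.files.findOne()
{ "_id": ObjectId("..."), "filename": "design.dwg", "length": 52428800,
  "chunkSize": 262144, "uploadDate": ISODate("..."), "md5": "..." }

// the binary content is split across many chunk documents (data field omitted here)
db.fs.chunks.findOne({}, {data: 0})
{ "_id": ObjectId("..."), "files_id": ObjectId("..."), "n": 0 }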
3.2 GridFS-based mass data storage test
This paper uses MongoDB version 2.0 with the latest official C# driver for testing; the C# driver can be downloaded from https://github.com/mongodb/mongo-csharp-driver.
MongoDB provides a range of useful tools in its bin directory that make operation and maintenance management very convenient:
(1) bsondump: converts BSON dump files to JSON format.
(2) mongo: the command-line client tool, which supports JavaScript syntax.
(3) mongod: the database server; each instance starts one process and can be run in the background (forked).
(4) mongodump: database backup tool.
(5) mongorestore: database recovery tool.
(6) mongoexport: data export tool.
(7) mongoimport: data import tool.
(8) mongofiles: GridFS management tool for storing and retrieving binary files (a brief usage example follows below).
(9) mongos: the sharding router; when the sharding function is used, applications connect to mongos rather than to mongod.
(10) mongosniff: a tool similar to tcpdump, except that it monitors only MongoDB wire-protocol traffic and prints the requests in a readable form.
(11) mongostat: real-time performance monitoring tool.
In addition, several graphical client tools provided by third parties, such as MongoVUE, RockMongo and MongoHub, facilitate management and maintenance.
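As a brief example of the mongofiles tool mentioned in item (8), the following commands, a sketch assuming a local server on the default port, the default fs prefix and a database named ecDocs (the database used later in this paper), store and then list a file in GridFS from the command line:

c:\mongodb 2.0.0\bin> mongofiles -d ecDocs put design.dwg
c:\mongodb 2.0.0\bin> mongofiles -d ecDocs list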
Combining GridFS with sharding and automatic replication yields a high-performance distributed database cluster architecture for mass data storage.
A MongoDB sharding cluster requires three main roles:
(1) Shard Server: stores the actual sharded data. Each shard can be a single mongod instance or a Replica Set built from a group of mongod instances.
(2) Config Server: stores the configuration information of the whole cluster, including all shard nodes, the shard-key ranges of each chunk, the distribution of chunks across the shards, and the sharding configuration of every database and collection.
(3) Route Process: a front-end router through which clients access the cluster. It asks the Config Servers which shard a query or write should go to, connects to the appropriate shard, and returns the final result to the client. All of this is transparent to the client, which does not need to care which shard a record is stored on.
To simplify testing, a simple sharding cluster is built below on a single physical machine.
The test environment is configured as follows.
Two Shard servers and one Config server are simulated on the same machine, 127.0.0.1, distinguished only by port:
(1) Shard Server 1: 127.0.0.1:27020.
(2) Shard Server 2: 127.0.0.1:27021.
(3) Config Server: 127.0.0.1:27022.
(4) Route Process: 127.0.0.1:27017.
Start the related service processes:
c:\mongodb 2.0.0\bin> mongod --shardsvr --dbpath "c:\mongodb 2.0.0\db" --port 27020
d:\mongodb 2.0.0\bin> mongod --shardsvr --dbpath "d:\mongodb 2.0.0\db" --port 27021
e:\mongodb 2.0.0\bin> mongod --configsvr --dbpath "e:\mongodb 2.0.0\db" --port 27022
e:\mongodb 2.0.0\bin> mongos --configdb 127.0.0.1:27022
Configure sharding:
(1) e:\mongodb 2.0.0\bin> mongo
(2) use admin
(3) db.runCommand({addshard: "127.0.0.1:27020", allowLocal: 1, maxSize: 2, minKey: 1, maxKey: 10})
(4) db.runCommand({addshard: "127.0.0.1:27021", allowLocal: 1, minKey: 100})
(5) config = connect("127.0.0.1:27022")
(6) config = config.getSisterDB("config")
(7) ecDocs = db.getSisterDB("ecDocs")
(8) db.runCommand({enablesharding: "ecDocs"})
(9) db.runCommand({shardcollection: "ecDocs.filedocs.chunks", key: {files_id: 1}})
(10) db.runCommand({shardcollection: "ecDocs.filedocs.files", key: {_id: 1}})
In the commands above, ecDocs is the database name and filedocs is the user-defined name of the GridFS file collection; the default GridFS collection name is fs.
To use the official C# driver, the program must reference MongoDB.Driver.dll and MongoDB.Bson.dll. The test code repeatedly adds the same file to GridFS in a loop.
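The paper's original sample code is not reproduced here; the following is a minimal sketch of what such an upload loop might look like with the legacy 1.x C# driver, where the connection string, file path, file names and loop count are illustrative assumptions rather than the actual test code:

using System;
using System.IO;
using MongoDB.Driver;
using MongoDB.Driver.GridFS;

class GridFsUploadTest
{
    static void Main()
    {
        // Connect through the mongos route process configured above (port 27017 is assumed).
        var server = MongoServer.Create("mongodb://127.0.0.1:27017");
        var database = server.GetDatabase("ecDocs");

        // Use the user-defined "filedocs" prefix instead of the default "fs".
        var gridFsSettings = new MongoGridFSSettings { Root = "filedocs" };
        var gridFs = database.GetGridFS(gridFsSettings);

        // Upload the same large file repeatedly; because filedocs.files and
        // filedocs.chunks are sharded (see the commands above), the stored
        // chunks are distributed across Shard Server 1 and Shard Server 2.
        for (int i = 0; i < 100; i++)
        {
            using (var stream = File.OpenRead(@"c:\testdata\design.dwg"))
            {
                gridFs.Upload(stream, "design_" + i + ".dwg");
            }
        }
    }
}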
The test environment is configured as follows:
Operating system: Windows XP Professional 32-bit SP3.
Processor (CPU): Intel Xeon W3503 @ 2.40 GHz.
Memory: 3567 MB (DDR3 1333 MHz).
Hard drive: Seagate ST3250318AS (250 GB, 7200 rpm).
Since the machine runs a 32-bit operating system, a single GridFS service instance supports a file capacity of only about 0.9 GB; with two Shard service instances the total supported file capacity is therefore about 1.8 GB. A 64-bit operating system does not have this limitation.
This paper uses a loop test to measure the performance of inserting large files into GridFS and the resulting size of each shard; the test results are shown in Figure 5.
As can be seen from Figure 5, in steps 1-3 only a single file was added each time and Shard2 received no chunk data; only in step 4, when 100 copies of the same file were added in succession, did Shard2 receive chunk data. Adding thirty to forty single files took just over 11 seconds to complete, whereas even copying files of this size through the file system takes at least twenty to thirty seconds, which shows that MongoDB delivers very high performance for large-capacity file storage.
Entering the db.printShardingStatus() command in the mongo client tool shows the detailed sharding status, as shown in Figure 6.
As can be seen from Figure 6, six chunks are assigned to shard1 and seven chunks to shard2, so the sharded data is distributed fairly evenly.
The above tests show that GridFS can store massive amounts of data, can be deployed as a large-scale cluster of inexpensive servers, is very easy to scale out, and requires very little program code; it can therefore effectively support cloud-storage applications and meet the requirements of large-scale data storage.
4 Conclusion
With the continuous expansion of enterprise and personal data and the rapid development of cloud computing, more and more applications need to store massive amounts of data and place ever higher demands on high concurrency and on processing that data; traditional relational databases cannot meet these requirements in such scenarios. MongoDB, as one of the NoSQL databases, can fully satisfy and solve the needs of mass data storage, and more and more large websites and companies are choosing MongoDB in place of MySQL for storage.
|
|
|