  HBase Application Development Review and Summary of Series
     
  Add Date : 2017-01-08      
         
         
         
  Outline

I have been studying HBase for about six months. My knowledge is neither thorough nor systematic, but I have at least become fairly absorbed in it. As the department's pathfinder for big data technology, I have also taken on the responsibility of spreading the technology, so throughout this exploration I have kept reviewing and testing. Along the way I slowly accumulated some material, which I have now organized into a series of technical documents, tentatively titled "HBase Application Development Review and Summary." It cannot be called profound, but in the spirit of open source and sharing I am happy to post it piece by piece. In addition, I recommend "HBase: The Definitive Guide" as a relatively good book on HBase.

Here is the catalog of the series:

Chapter 1: HBase design specifications

Introduces the design specifications recommended for HBase application development, mainly at the development level.

Chapter 2: RowKey (row key) design specifications

Introduces the design specifications and characteristics of the RowKey. Of course, a concrete row key design still has to follow the specific business, and benefits from rich design experience.

Chapter 3: RowKey (row key) generator

Designs a RowKey generator: a row key generation strategy can be composed through an interactive interface and serialized into a local policy file; the local policy file can later be deserialized back into a policy object, which then dynamically generates row keys in batches from incoming row data.

Chapter 4: HBase configuration management interface design

Designs HBase configuration tools, including how to load and read the hbase configuration file and generate Configuration objects.

Chapter 5: HBase table information management interface design

Designs HBase table information management tools, including management interfaces for namespaces, table information, column families, and so on.

Chapter 6: HBase table write interface design

Designs several model classes for writing data to HBase; these model classes make it easy to organize data during development and write it into the HBase database.

Chapter 7: HBase table read interface design

Designs several model classes for reading HBase data and introduces various data retrieval schemes, including batch retrieval, range retrieval, and version retrieval.

Chapter 8: HBase filter application design

Describes several commonly used filter types and the details to watch when using them, including pagination filters and prefix filters.

Chapter 9: HBase lightweight ORM design

Imitating Hibernate's object mapping, a lightweight ORM was designed for HBase; practicality was not the main concern, it is a prototype for testing.

Chapter 10: HBase table data browser

A comprehensive application of the preceding chapters: the design of an HBase table data browser, including table information navigation, conditional pagination queries, and multi-version queries.

1. HBase design specifications

Calling these "design specifications" is a bit presumptuous; after all, I am myself a beginner in big data technology and am in no position to formulate real specifications. So please forgive my boldness: these design specifications are only my own formulation and bind no one else.

So far, the HBase project and a large number of experts have summarized parts of an HBase design specification. I collected these, enriched them with my own understanding, and put together a set of norms that I feel our own development should follow.

The logical model of an HBase table structure involves the following terms: namespace, table, column family, column, row key, version, and so on. These are the elements used to build an HBase table. Based on these key terms, the relevant specifications are stated below.

1.1. Namespace design

Simply put, a namespace can be regarded as a group of tables (similar to a tablespace in Oracle). The basis for grouping is not fixed: it can be by business type, or by time period. For example, for electric power meteorological data tables, you can create a namespace named DLQX and organize all power-weather-related tables under it. The advantage of introducing namespaces is the convenience of organizing and managing tables.

HBase's default namespace is named default. By default, if you do not explicitly specify a namespace when creating a table, the table is created in the default namespace. If a table belongs to a non-default namespace, you must specify the namespace whenever you reference the table (for example, when reading its data), otherwise an error similar to "Cannot locate table" occurs. The fully qualified table name has the format "namespace:table name", for example "DLQX:SYSTEM_USER". For the default namespace, the "default:" prefix of the full table name can be omitted and the table name written directly as SYSTEM_USER.

The relationship between namespaces and tables

A namespace has a one-to-many relationship with tables: a namespace can contain multiple HBase tables, but an HBase table belongs to exactly one namespace. When you create a table without specifying a namespace (or with an empty namespace), the system places the table under the default namespace (default).

Also, before you delete a namespace, you must first delete all HBase tables in it; otherwise the namespace cannot be removed.

1.2. Table design

HBase has several advanced features you can use when designing tables. These features are not necessarily tied to the schema or to row key design, but they define the behavior of some aspects of a table.

1.2.1 Tall-thin versus short-fat tables

As a column-oriented database, HBase, according to the official documentation, handles "tall-thin" tables better than "short-fat" tables in terms of performance and efficiency. A "tall-thin" table has a small number of columns but a large number of rows, giving the table a tall, thin shape. A "short-fat" table has many columns but a limited number of rows, giving it a short, fat shape. Although an HBase table is said to be able to hold millions of columns, that is only a theoretical limit. In practice, try to build "tall-thin" tables, and test the number of columns actually needed, to avoid too many columns hurting read and write performance.

1.2.2 Pre-creating regions

By default, one region is automatically created when an HBase table is created. During data import, all HBase clients write to this single region until it grows large enough to be split. One way to speed up bulk writes is to pre-create some empty regions, so that when data is written to HBase it is distributed across the regions according to the region partitioning scheme, load-balancing the data within the cluster.
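As a sketch of how pre-split boundaries might be computed (illustrative only; the fixed-length hex hash prefix convention and the helper function are my own assumptions, not HBase API):

```python
def presplit_keys(num_regions, prefix_len=4):
    """Evenly spaced hex split keys for pre-splitting a table whose row
    keys start with a fixed-length hex hash prefix (an assumed convention)."""
    space = 16 ** prefix_len              # number of possible prefixes
    step = space // num_regions
    # N regions need N - 1 boundary keys between them
    return [format(i * step, "0{}x".format(prefix_len)).upper()
            for i in range(1, num_regions)]

print(presplit_keys(4))  # ['4000', '8000', 'C000']
```

The boundary strings would then be supplied as split keys when creating the table, for example via the SPLITS option of the shell's create command.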

1.2.3 Number of column families

Do not define too many column families in a table. Currently HBase does not handle tables with more than two or three column families well, because when one column family is flushed, its adjacent column families are triggered to flush as well, eventually causing the system to produce more I/O. Therefore, following the official recommendation, create a single column family per HBase table.

1.2.4 Configurable data block size

The HFile data block size can be set at the column-family level. This data block is different from the HDFS block. The default value is 65,536 bytes (64 KB). The HFile block index stores the start key of each data block. The block size setting affects the size of the block index: the smaller the blocks, the larger the index, and the more memory it occupies. Because smaller blocks are loaded into memory, random lookups perform better. But if you need better sequential scan performance, being able to load more data into memory per HFile read is more reasonable, which means the block size should be set to a larger value; the index then becomes smaller, and you pay the price in random read performance.
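The index/block-size trade-off can be put in rough numbers (illustrative arithmetic only; real HFile index entries also carry keys and offsets, which this ignores):

```python
def index_entries(hfile_bytes, block_bytes):
    """Approximate block-index entry count: one entry per data block."""
    return hfile_bytes // block_bytes

one_gb = 1 << 30
print(index_entries(one_gb, 64 * 1024))  # 16384 entries at the 64 KB default
print(index_entries(one_gb, 8 * 1024))   # 131072 entries at 8 KB: 8x the index
```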

1.2.5 Data block cache

Putting data into the read cache does not improve performance for every workload. For example, if a table or a column family is only accessed by sequential scans, or is rarely accessed at all, you will not mind whether a Scan or Get takes a little longer. In that case you can turn off caching for that column family. If you only perform many sequential scans, you churn the cache repeatedly, evicting data that should stay cached and could actually improve performance. Turning off caching not only avoids that, but also leaves more cache for the other column families of the same table and for other tables.

1.2.6 Aggressive caching

You can choose some column families and give them a higher priority in the data block cache (LRU cache). If you expect one column family to receive more random reads than another, this feature will come in handy sooner or later.

The IN_MEMORY parameter defaults to false. Because HBase provides no additional guarantee for such a column family beyond caching its blocks more aggressively than those of other column families, setting the parameter to true does not change much in practice.

When creating a table, you can call HColumnDescriptor.setInMemory(true) to keep the table in the RegionServer cache, so that reads are served by cache hits.

1.2.7 Bloom filter (Bloom filters)

The block index provides an effective way to find the HFile block that should be read when accessing a particular row. But its usefulness is limited. The default HFile block size is 64 KB, and this size is usually not adjusted much either.

If you are looking up a short row, the index cannot give you fine-grained information; it only holds the start key of each data block. For example, if your rows occupy 100 bytes of storage each, a 64 KB data block contains (64 * 1024) / 100 = 655.36, roughly 655 rows, yet only the start row goes into the index. The row you are looking for may fall into the key interval of a particular block, but that does not mean it is actually stored in that block. There are several possibilities: the row does not exist in the table, or it is stored in another HFile, or it is even still in the MemStore. In those cases, reading the block from disk brings I/O overhead and pollutes the block cache. This hurts performance, especially when you face a huge data set with many concurrent readers.

Bloom filters let you do a reverse test on the data stored in each block. When a row is requested, the Bloom filter is checked first to see whether the row is not in that block. The Bloom filter either answers definitively that the row is not there, or answers that it does not know. That is why we call it a reverse test. Bloom filters can also be applied at the cell level: the same reverse test is performed before accessing a column qualifier.

Bloom filters are not free. This extra layer of indexing takes extra space. Bloom filters grow with their underlying data, so a row-level Bloom filter takes less space than a row+column qualifier level one. When space is not a problem, they can help you squeeze out the system's potential performance.
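The reverse test can be illustrated with a toy Bloom filter (a conceptual sketch only, in no way tied to HBase's actual implementation; sizes and hashing scheme are arbitrary choices):

```python
import hashlib

class TinyBloom:
    """Toy Bloom filter: answers 'definitely absent' or 'maybe present'."""
    def __init__(self, bits=1024, hashes=3):
        self.bits, self.hashes = bits, hashes
        self.array = bytearray(bits // 8)

    def _positions(self, key):
        for i in range(self.hashes):
            h = hashlib.sha256(key + bytes([i])).digest()
            yield int.from_bytes(h[:4], "big") % self.bits

    def add(self, key):
        for p in self._positions(key):
            self.array[p // 8] |= 1 << (p % 8)

    def might_contain(self, key):
        return all(self.array[p // 8] & (1 << (p % 8))
                   for p in self._positions(key))

bf = TinyBloom()
bf.add(b"row-00017")
print(bf.might_contain(b"row-00017"))  # True: every added key is 'maybe present'
print(bf.might_contain(b"row-99999"))  # a key never added: 'definitely absent' for almost all keys
```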

You can enable a Bloom filter on a column family as follows:

hbase (main)> create 'mytable', {NAME => 'colfam1', BLOOMFILTER => 'ROWCOL'}

The BLOOMFILTER parameter defaults to NONE. A row-level Bloom filter is enabled with ROW, and a row+column qualifier level Bloom filter with ROWCOL. A row-level Bloom filter checks whether a particular row key is absent from a data block; a ROWCOL filter checks whether the combination of row key and column qualifier is absent. The ROWCOL Bloom filter costs more than the ROW Bloom filter.

1.2.8 Time to Live (TTL)

Applications often need to remove old data from the database. Because it is difficult to let a database grow beyond a certain size, traditional databases have built many flexible mechanisms for this. For example, in TwitBase you would not want to delete any twits users generated while using the application: it is user-generated data that might someday be useful for advanced analysis. But you do not need to keep all twits available for real-time access, so twits older than a certain time can be archived in flat files.

HBase lets you set a TTL, in seconds, at the column-family level. Data older than the specified TTL value is deleted at the next major compaction. If you have multiple time versions of the same cell, versions older than the TTL are deleted. You can turn the TTL off, that is, keep data forever, by setting it to INT.MAX_VALUE (2147483647), which is the default. You can set the TTL when creating the table, as follows:

hbase (main)> create 'mytable', {NAME => 'colfam1', TTL => '18000'}

This command sets a TTL of 18,000 seconds = 5 hours on the colfam1 column family. Data in colfam1 older than 5 hours is deleted at the next major compaction.

1.2.9 Data Compression

HFiles can be stored compressed in HDFS. This helps save disk I/O, but compressing and decompressing raises CPU utilization when reading and writing data. Compression is part of the table definition and can be set at table creation or through a schema change. Unless you are sure you will not benefit from compression, we recommend enabling it for the table. Only when the data cannot be compressed, or the server's CPU utilization is limited for some reason, might you turn the compression feature off.

HBase can use several compression codecs, including LZO, Snappy and GZIP. LZO [1] and Snappy [2] are the two most popular. Snappy was released by Google in 2011, and shortly afterwards the Hadoop and HBase projects began to support it. Before that, LZO was the codec of choice. The LZO native libraries used by Hadoop are governed by the GPLv2 license and cannot be shipped in any Hadoop or HBase release; they must be installed separately. Snappy, on the other hand, is BSD-licensed, so it is easier to bundle with Hadoop and HBase releases. LZO and Snappy have roughly comparable compression ratios and compression/decompression speeds.

When creating the table you can enable compression on a column family, as follows:

hbase (main)> create 'mytable', {NAME => 'colfam1', COMPRESSION => 'SNAPPY'}

Note that only data on disk is compressed. Data in memory (MemStore or BlockCache) and data in network transmission is not compressed.

Changing the compression codec should not happen often, but if you do need to change a column family's codec, it can be done directly: change the table definition and set the new codec. After subsequent compactions, the resulting HFiles will all be compressed with the new codec. This process does not require creating a new table and copying data. But you must make sure the old codec's libraries are not removed from the cluster until all old HFiles using the old codec have been compacted away after the change.

1.2.10 Data splitting

In HBase, updates are first written to the WAL log (HLog) and to memory (MemStore). Data in the MemStore is kept sorted. When the MemStore accumulates past a threshold, a new MemStore is created and the old one is added to the flush queue, where a separate thread flushes it to disk to become a StoreFile. At the same time, the system records a redo point in ZooKeeper, indicating that changes before this point have been persisted (minor compaction).

A StoreFile is read-only and can never be modified once created. So HBase updates are in fact continual appends. When the number of StoreFiles in a Store reaches a threshold, a merge (major compaction) is performed: modifications to the same key are merged together to form one large StoreFile. When the StoreFile size reaches a threshold, a split is performed, dividing the data into two.

Since table updates are continually appended, processing a read request requires accessing all StoreFiles of the Store as well as the MemStore and merging them by row key. Because both the StoreFiles and the MemStore are sorted, and StoreFiles carry in-memory indexes, the merge process is usually fast.

In practice you can, if necessary, trigger a major compaction manually, merging modifications of the same row key to form one large StoreFile. At the same time, you can set the StoreFile size larger to reduce the occurrence of splits.

1.2.11 Cell time versions

By default HBase maintains three time versions of each cell. This property can be changed. If you only need one version, it is recommended to set the table to maintain a single version, so the system does not retain multiple time versions of updated cells. Time versions are also set at the column-family level; you can set them when instantiating the table:

hbase (main)> create 'mytable', {NAME => 'colfam1', VERSIONS => 1}

You can specify multiple attributes for a column family in the same create statement, as follows:

hbase (main)> create 'mytable', {NAME => 'colfam1', VERSIONS => 1, TTL => '18000'}

You can also specify the minimum number of time versions a column family stores, as follows:

hbase (main)> create 'mytable', {NAME => 'colfam1', VERSIONS => 5, MIN_VERSIONS => '1'}

Setting MIN_VERSIONS comes in handy together with a TTL on the column family. If all currently stored versions are older than the TTL, at least the MIN_VERSIONS most recent versions are retained. This ensures your queries still return results when all the data is older than the TTL.

1.3. ColumnFamily (column family) design

Column families group multiple columns; the basis for grouping is not fixed. Although in theory an HBase table can have multiple column families, the official guidance suggests not creating more than one column family per table. Testing shows that write and read efficiency with a single column family is much better than with multiple families. Each column family is stored in its own StoreFile, and the multiple files corresponding to multiple column families put greater pressure on the server at split time. The recommendation is to create one column family per table.

Keep the column family name short. The family name is stored with every column, so a long family name wastes considerable storage space.

When you delete a column family, the columns and column values under that column family are deleted as well.

When you create a table, at least one column family must be created. After the table is created, you can add more column families.

Versions are per column family: if a table has multiple column families, you can set a different version count for each one. For example, column family A can keep up to five versions, while column family B keeps up to three.

1.4. Qualifier (column) design

An obvious difference between HBase and a traditional relational database is that you do not need to create columns when creating the table; columns are created dynamically when data is written. Moreover, empty columns do not actually occupy storage space.

Column content is encapsulated in KeyValue objects, from which several pieces of information can be obtained as follows:

// Row key
String rowKey = Bytes.toString(kv.getRow());
// Column family
String family = Bytes.toString(kv.getFamily());
// Column qualifier (column name)
String qualifier = Bytes.toString(kv.getQualifier());
// Column value
String value = Bytes.toString(kv.getValue());
// Timestamp (version number)
long timestamp = kv.getTimestamp();

1.5. Version design

If a table's column family involves multiple versions, you must specify the maximum number of versions (MaxVersions) when creating the column family. Although HBase's default version count is 3, you still need to state it explicitly at table creation; otherwise HBase may keep only one version, as if you did not want to enable multi-versioning for that column family.

You can specify the version number when writing data; if you do not, the default version number, that is, the timestamp, is used.

When reading data, if you do not specify a version number, only the data of the latest version is read, not that of earlier versions.

1.6. HBase naming conventions

Item: Namespace

Rules: Use combinations of English words and Arabic numerals; words must be uppercase, and the first character must be an English letter, not a digit. Splicing multiple words with an underscore is not recommended; a simple meaning can be a single word, while a compound meaning can splice the first letters of several words. Try to keep the length between 4 and 8 characters. A namespace is generally consistent with the project name, organization name, and so on.

Example: a namespace named after the project, such as DLQX (spliced from the initials of "electric power meteorology"), brief and clear. Long namespace names are not recommended, for example forms like USER_INFO_MANAGE.

Item: Table name

Rules: Use combinations of English words, Arabic numerals, and the underscore (_); words must be uppercase, the first character must be an English letter, and a digit must not start a multi-word splice. Try to keep the length between 8 and 16 characters. Prefer English words with a clear meaning; pinyin or pinyin initials are not recommended.

Compliant table names: USER_INFO_MANAGE, WEATHER_DATA, T_ELECTRIC_GATHER, and the like.

Item: Column family name

Rules: Use combinations of English words and Arabic numerals; words must be uppercase, and the first character must be an English letter, not a digit. Try to keep the length between 1 and 6 characters; an over-long family name occupies more storage space.

Compliant column family names: D1, D2, DATA, and so on. Not recommended: USER_INFO, D_1, and the like.

Item: Column name

Rules: Use combinations of English words, Arabic numerals, and the underscore (_); words must be uppercase, the first character must be an English letter, and a digit must not start the name. Try to keep the length between 1 and 16 characters. Prefer English words with a clear meaning; pinyin initials are not recommended.

Compliant column names: USER_ID, DATA_1, REMARK, and the like. Not recommended: UserID, 1_DAT.

2. RowKey row key design specifications

2.1. Four characteristics of the RowKey

2.1.1 String type

Although row keys are stored in HBase as byte[] arrays, it is recommended to treat the RowKey as a String during system development, to ensure generality. If the RowKey were defined as some other type during development, such as Long, the length of the data could be constrained by the compilation environment.

Common row key string forms include:

A pure digit string, such as 9559820140512;
Digits plus a special delimiter, such as 95598-20140512;
Digits plus letters, such as city20140512;
Letters plus digits plus a special delimiter, such as city_20140512.
2.1.2 Has a clear meaning

The main role of the RowKey is to uniquely identify a data record, but uniqueness is not its only property. A row key with a clear meaning has special value for application development, data retrieval, and so on. For example, the digit string 9559820140512 above actually means: 95598 (the power grid customer-service phone number) + 20140512 (a date).

A row key is often composed of several values, and the position and order of those values affect data storage and retrieval efficiency. So when designing a row key, you need a fairly deep understanding of future business and application development, and some forward prediction, in order to design a row key that retrieves efficiently.

2.1.3 Is ordered

RowKeys are stored in lexicographic order. Therefore, when designing the RowKey, make full use of this sorting characteristic: store together data that is often read together, and place data that is likely to be accessed recently in one block.

For example: if the most recently written data in an HBase table is the most likely to be accessed, you can make the timestamp part of the RowKey. Since the sort is lexicographic, you can use Long.MAX_VALUE - timestamp as the RowKey, which ensures that newly written data sorts first and can be read with quick hits.
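The Long.MAX_VALUE - timestamp trick can be sketched as follows (the 19-character key width and the helper function are illustrative assumptions):

```python
LONG_MAX = 2**63 - 1  # Java's Long.MAX_VALUE

def reverse_ts_key(ts_millis, width=19):
    """Zero-padded (Long.MAX_VALUE - timestamp): newer data sorts first."""
    return str(LONG_MAX - ts_millis).zfill(width)

older = reverse_ts_key(1_400_000_000_000)
newer = reverse_ts_key(1_500_000_000_000)
# Lexicographically the newer row key is smaller, so a scan hits it first.
assert newer < older
```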

2.1.4 Has a fixed length

Orderliness relies on a fixed length. For example, 20140512080500 and 20140512083000 are date-time strings that increase together no matter what the seconds are, because both are written as 14 digits. If we strip the trailing zeros, we get 201405120805 and 20140512083; as numbers 20,140,512,083 is smaller than 201,405,120,805, but as strings "20140512083" sorts after "201405120805", so the ordering has changed. Therefore we suggest that the row key design always fix the key length.
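A quick sketch of why fixed length matters for lexicographic order, using the date strings from the example above:

```python
# Two times on the same day, as fixed 14-digit strings and with trailing zeros stripped.
a_full, b_full = "20140512080500", "20140512083000"
a_cut, b_cut = a_full.rstrip("0"), b_full.rstrip("0")

# Fixed width: string order agrees with numeric (time) order.
assert (a_full < b_full) == (int(a_full) < int(b_full))
# Variable width: string order and numeric order disagree.
assert (a_cut < b_cut) != (int(a_cut) < int(b_cut))
print(a_cut, b_cut)  # 201405120805 20140512083
```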

 

 

2.2. RowKey Design Principles

2.2.1 RowKey length principle

The Rowkey is a binary stream. Many developers suggest designing the Rowkey length between 10 and 100 bytes, but the recommendation is: as short as possible, preferably not more than 16 bytes.

The reasons are as follows:

(1) Data is persisted in HFile files as KeyValues. If the Rowkey is 100 bytes long, then for 10 million rows the Rowkeys alone occupy 100 bytes * 10 million = 1 billion bytes, nearly 1 GB of data, which greatly hurts HFile storage efficiency.

(2) The MemStore caches part of the data in memory. If the Rowkey field is too long, the effective utilization of memory drops, the system cannot cache as much data, and retrieval efficiency falls. So keep the Rowkey's byte length as short as possible.

(3) Current operating systems are 64-bit and align memory on 8-byte boundaries. Keeping the Rowkey within 16 bytes, an integer multiple of 8 bytes, makes the best use of this property of the operating system.

 

2.2.2 RowKey hash principle

If the Rowkey is generated from an incrementing timestamp, do not put the time at the front of the key. It is recommended to use a program-generated hash as the high-order field of the Rowkey, with the time field in the low-order position. This increases the chance that data is distributed evenly across RegionServers, balancing the load. If there is no hash field and the first field is the time information directly, all newly generated data accumulates on one RegionServer (a hot-spot phenomenon), so at retrieval time the load concentrates on individual RegionServers, reducing query efficiency.
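A sketch of such a salted (hash-prefixed) row key; the bucket count, the md5 choice, and the key layout are illustrative assumptions:

```python
import hashlib

def salted_key(ts_millis, buckets=16):
    """Put a small deterministic hash bucket in the key's high-order
    position so consecutive timestamps spread over regions instead of
    all landing on one RegionServer."""
    salt = int(hashlib.md5(str(ts_millis).encode()).hexdigest(), 16) % buckets
    return "{:02d}_{:013d}".format(salt, ts_millis)

print(salted_key(1_400_000_000_000))
```

Reading back a full time range then requires one scan per bucket, which is the usual price of salting.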

 

2.2.3 RowKey uniqueness principle

Uniqueness must be ensured in the design.

 

 

2.3. RowKey scenarios

Based on the three Rowkey design principles above, different scenarios call for different Rowkey designs.

2.3.1 RowKey design for transaction data

Transaction data has a time attribute, and it is recommended to include the time information in the Rowkey, which helps query and retrieval speed. For transaction data it is recommended to build one table per day by default; the benefits of this design are manifold. With per-day tables, the date part of the time information can be dropped, keeping only hours, minutes, seconds, and milliseconds, which fits in 4 bytes. Adding a 2-byte hash field yields a unique Rowkey of 6 bytes in total. As shown below:

Transaction data Rowkey design:

Bytes 0-1: Hash field, 0 ~ 65535 (0x0000 ~ 0xFFFF)
Bytes 2-5: Time field (ms of day), 0 ~ 86399999 (0x00000000 ~ 0x05265BFF)
Bytes 6+: Extension field
This design cannot save memory at the operating-system level, because a 64-bit operating system aligns on 8-byte boundaries, but for the persisted Rowkey it saves 25% of the overhead. Some may ask why the time field is not stored in the high-order position so it could double as a hash field. The reason is that data within a time range should stay as contiguous as possible: data within the same time window has a high probability of being looked up together, so range queries retrieve well, and a separate hash field therefore works better. For certain applications, the hash field can wholly or partly carry some business-field information, as long as the value is unique within the same time (millisecond).
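The 6-byte layout above can be sketched with a struct pack (field order follows the table; the helper name and sample values are illustrative):

```python
import struct

def txn_rowkey(hash16, ms_of_day):
    """2-byte hash + 4-byte millisecond-of-day, big-endian so byte
    order matches numeric order within a hash bucket."""
    assert 0 <= hash16 <= 0xFFFF and 0 <= ms_of_day <= 86_399_999
    return struct.pack(">HI", hash16, ms_of_day)

key = txn_rowkey(0x95A2, 30_600_000)  # sample hash; 08:30:00.000
print(len(key), key.hex())  # 6 95a201d2eb40
```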


2.3.2 RowKey design for statistical data

Statistical data has a time attribute whose smallest unit is the minute (pre-computed statistics at second granularity make no sense). Again we default to per-day tables for statistical data; the benefits of this design need no repetition. With per-day tables, the time information keeps only hours and minutes, so the range 0 ~ 1439 occupies just 2 bytes. Since some statistical dimensions have very many values, a 4-byte sequence field is needed, so the hash field doubles as the sequence field, giving a unique Rowkey of 6 bytes in total. As shown below:

Statistics Rowkey design:

Bytes 0-3: Hash field (sequence field), 0x00000000 ~ 0xFFFFFFFF
Bytes 4-5: Time field (min of day), 0 ~ 1439 (0x0000 ~ 0x059F)
Bytes 6+: Extension field

 
Again, this design cannot save memory at the operating-system level because of 8-byte alignment, but for the persisted Rowkey it saves 25% of the overhead. Pre-computed statistics may need to be recalculated repeatedly, so invalid data must be removable without disturbing the balance of the hash distribution; this requires special handling.
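The statistics layout differs only in field widths; a matching sketch (helper name and sample values are illustrative):

```python
import struct

def stats_rowkey(seq32, minute_of_day):
    """4-byte hash/sequence field + 2-byte minute-of-day, big-endian."""
    assert 0 <= seq32 <= 0xFFFFFFFF and 0 <= minute_of_day <= 1439
    return struct.pack(">IH", seq32, minute_of_day)

print(stats_rowkey(42, 510).hex())  # 0000002a01fe  (sequence 42, 08:30)
```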

 

2.3.3 RowKey design for general data

General data uses an incrementing sequence as its unique primary key. You can choose per-day sub-tables or a single-table mode. This mode requires ensuring that the hash field (sequence field) remains unique while multiple loader modules run at the same time; different loader modules can be given different unique factors. The structure of the design is shown below.

General data Rowkey design:

Bytes 0-3: Hash field (sequence field), 0x00000000 ~ 0xFFFFFFFF
Bytes 4+: Extension field (kept within 12 bytes), composed of multiple user-defined fields

 

2.3.4 Support RowKey design of multi-criteria query

HBase retrieves records matching given conditions with the scan method, which has the following characteristics:

(1) scan speed can be increased via the setCaching and setBatch methods (trading space for time);

(2) the scan range can be bounded with setStartRow and setEndRow; the smaller the range, the higher the performance.

Through clever RowKey design we can place the records of a batch retrieval next to each other (ideally within the same Region), which gives good performance when traversing the results.

(3) filters can be added with the setFilter method, which is the basis of paging and multi-condition queries.

Having satisfied the principles of fixed length, uniqueness and so on, we next consider how a clever RowKey design can exploit scan's range feature so that retrieving multiple records becomes faster.
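For instance, to scan every record sharing a given RowKey prefix, the exclusive stop row can be derived from the prefix itself. The helper below is a common technique rather than code from this project; the resulting byte arrays would be handed to scan.setStartRow(prefix) and scan.setStopRow(...) in the HBase client API:

```java
import java.util.Arrays;

public class PrefixScan {
    // Exclusive stop row for a prefix scan: copy the prefix and increment
    // its last non-0xFF byte, dropping any trailing 0xFF bytes.
    public static byte[] prefixStopRow(byte[] prefix) {
        byte[] stop = Arrays.copyOf(prefix, prefix.length);
        for (int i = stop.length - 1; i >= 0; i--) {
            if (stop[i] != (byte) 0xFF) {
                stop[i]++;
                return Arrays.copyOf(stop, i + 1);
            }
        }
        // All bytes were 0xFF: scan to the end of the table (empty stop row).
        return new byte[0];
    }
}
```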

A so-called row-key generator lets developers define row-key generation strategies through a software tool; the strategy information is saved to a policy file, and when needed the local policy file is deserialized into a row-key-generation strategy object which, given a row of data, produces the RowKey automatically.

So why design such a row-key generator? It began with a requirement to move a number of large Oracle tables into HBase. The problem: with so many tables, each with its own RowKey generation rules, would we really have to hand-write a key-generation method for every table?!

Of course not. To solve the problem once and for all, we designed a row-key generator tool: developers use it to produce policy files, and those policy files can be packed into a jar for distribution.

Below I describe the design ideas in detail.

First, an HBase row key is usually composed of several pieces of data, in most cases column fields of an existing relational database table. For example, suppose we want to import the data of PUBLISH_DATA_INFO (a publication info table) into an HBase table whose row key is composed of PUBLISH_TIME and DATA_TYPE; the first step is to determine these source fields of the key.

 

Second, recall a few of the HBase row-key generation principles: fixed length, uniqueness, and so on. The data that makes up the row key therefore passes through conventional formatting steps such as removing spaces, replacing special characters, padding at the front or back, truncating, reversing, etc. In Java these amount to a handful of methods: trim, replace, substring and the like; for special cases regular expressions can be used as well. I refer to these formatting steps collectively as the configuration policies.
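A sketch of one such formatting step, assuming a hypothetical policy that trims, strips a set of characters, truncates and right-pads to a fixed length (formatField is my own helper name, not the project's API; with the article's sample values it yields 20150812163500 and D010):

```java
public class PolicyFormat {
    // Apply a simple formatting policy: trim, strip the characters in
    // removeChars, truncate to length, then right-pad with padChar.
    public static String formatField(String value, String removeChars,
                                     int length, char padChar) {
        String s = value.trim();
        for (char c : removeChars.toCharArray()) {
            s = s.replace(String.valueOf(c), "");
        }
        if (s.length() > length) {
            s = s.substring(0, length);
        }
        StringBuilder sb = new StringBuilder(s);
        while (sb.length() < length) {
            sb.append(padChar);
        }
        return sb.toString();
    }
}
```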

 

Third, once the row-key generation strategy has been defined, it needs to be persisted so that other people and systems can use it. There are many options: saving it to an Oracle or MySQL table enforces uniqueness and makes it shareable over the network, which is arguably the best way to preserve it; alternatively it can be serialized to a local file (XML, JSON, etc.). The version I designed serializes the strategy information to a local JSON file, which looks like this:

[{"DATA_TYPE": "DATA_TYPE", "PUBLISH_TIME": "PUBLISH_TIME"},
 {"columnName": "PUBLISH_TIME", "length": 14, "numberStep": 1, "prefixChar": "", "prefixNumber": 0, "replaceChar": "", "replaceSourceChar": "-:.", "splitChar": "", "startNumber": 1, "suffixChar": "0", "suffixNumber": 0, "value": "2015-12-26 12:24:00"},
 {"columnName": "DATA_TYPE", "length": 4, "numberStep": 1, "prefixChar": "", "prefixNumber": 0, "replaceChar": "", "replaceSourceChar": "", "splitChar": "", "startNumber": 1, "suffixChar": "0", "suffixNumber": 0, "value": "D1"}]
Fourth, how are these row-key generation strategies used? At startup, the strategy information (the policy file) is loaded into memory via an interface method; the row-key field values are then collected into a map and passed to the designated interface method, which finally generates the row key. Sample code:

// Load the local row-key policy file
String policyFilePath = "D:\\PMS_EQUIP_INFO.policy";
RowKeyPolicy rowKeyPolicy = RowKeyPolicy.openRowKeyGeneratorPolicyFile(policyFilePath);

// Build a row of test data
Map<String, String> row = new HashMap<String, String>();
row.put("PUBLISH_TIME", "2015-08-12 16:35:00");
row.put("DATA_TYPE", "D01");
String rowKey = rowKeyPolicy.getRowKey(row, false);
LogInfoUtil.printLog("RowKey = " + rowKey);

row.put("PUBLISH_TIME", "2015-09-12 16:35:00");
row.put("DATA_TYPE", "D02");
rowKey = rowKeyPolicy.getRowKey(row, false);
LogInfoUtil.printLog("RowKey = " + rowKey);

// The log output is as follows:
// ********** RowKey = 20150812163500.D010
// ********** RowKey = 20150912163500.D020
Fifth, the project files are available for download; the code was written during development and has not yet been optimized, but interested readers are welcome to take a look.

When developing HBase applications with Eclipse, at least the following three configuration items must be specified:

#hbase config
#Host and port of the HMaster service
hbase.master=hdp-wuyong:60010
#ZooKeeper client port
hbase.zookeeper.property.clientPort=2181
#Hosts on which the ZooKeeper service is deployed
hbase.zookeeper.quorum=hdp-songjiang,hdp-lujunyi,hdp-wuyong
We put this configuration information into the file hbase.config.properties and load it to initialize the configuration before any HBase interface method is called.
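The loading step itself is ordinary java.util.Properties handling. A self-contained sketch (ConfigLoadSketch stands in for the project's PropertiesUtil; here the sample file is written to a temp location first so the snippet can run on its own):

```java
import java.io.File;
import java.io.FileInputStream;
import java.io.FileWriter;
import java.io.IOException;
import java.io.InputStream;
import java.io.Writer;
import java.util.Properties;

public class ConfigLoadSketch {
    // Load a .properties file into a Properties map.
    public static Properties load(String path) throws IOException {
        Properties props = new Properties();
        try (InputStream in = new FileInputStream(path)) {
            props.load(in);
        }
        return props;
    }

    public static void main(String[] args) throws IOException {
        // Write a sample file, then load it back
        // (a stand-in for hbase.config.properties).
        File f = File.createTempFile("hbase.config", ".properties");
        try (Writer w = new FileWriter(f)) {
            w.write("hbase.zookeeper.property.clientPort=2181\n");
            w.write("hbase.zookeeper.quorum=hdp-songjiang,hdp-lujunyi,hdp-wuyong\n");
        }
        Properties props = load(f.getPath());
        System.out.println(props.getProperty("hbase.zookeeper.property.clientPort"));
        f.delete();
    }
}
```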

To manage the HBase configuration information we designed a utility class named HBaseConfigUtil. Its main features are loading the HBase configuration at initialization, building the HBase Configuration instance, checking communication with the HBase cluster, and closing connections. The main code is as follows:

import java.io.File;
import java.util.HashMap;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HBaseAdmin;
import org.apache.hadoop.hbase.client.HConnectionManager;

import com.hnepri.common.util.PropertiesUtil;

/**
 * Description: HBase configuration management utility.
 * Copyright: Copyright (c) 2015
 * Company: Henan Electric Power Research Institute Smart Grid
 * @author shangbingbing 2015-01-01
 * @version 1.0
 */
public class HBaseConfigUtil {

    private static Configuration configuration = null;

    /**
     * HBase configuration item list; the key stores the parameter name
     * and the value stores the parameter value,
     * e.g. hbase.master -> hdp-wuyong:60010.
     */
    private static HashMap<String, String> hbaseConfigItemList = new HashMap<String, String>();

    /**
     * Parse and load the custom HBase configuration information.
     * Call this method at system startup; otherwise the default settings
     * are used and the HBase cluster cannot be reached.
     */
    public static void loadHBaseConfigProperties() {
        HashMap<String, String> pps = PropertiesUtil.readProperties("hbase.config.properties");
        HBaseConfigUtil.setHbaseConfigItemList(pps);
    }

    /**
     * Get the HBase configuration item list.
     */
    public static HashMap<String, String> getHBaseConfigItemList() {
        return hbaseConfigItemList;
    }

    /**
     * Set the HBase configuration item list.
     */
    public static void setHbaseConfigItemList(HashMap<String, String> hbaseConfigItemList) {
        HBaseConfigUtil.hbaseConfigItemList = hbaseConfigItemList;
    }

    /**
     * Add an HBase configuration item (overwriting any existing value).
     */
    public static void addHBaseConfigItem(String key, String value) {
        hbaseConfigItemList.put(key, value);
    }

    /**
     * Remove an HBase configuration item.
     */
    public static void removeHBaseConfigItem(String key) {
        hbaseConfigItemList.remove(key);
    }

    /**
     * Get the HBase Configuration object, creating it on first use.
     */
    public static Configuration getHBaseConfig() {
        if (configuration == null) {
            configuration = HBaseConfiguration.create();
            try {
                // Work around the missing winutils.exe problem when
                // running the client on Windows.
                File workaround = new File(".");
                System.getProperties().put("hadoop.home.dir", workaround.getAbsolutePath());
                new File("./bin").mkdirs();
                new File("./bin/winutils.exe").createNewFile();
                // configuration.addResource("hbase-site.xml");
                // Apply the ZooKeeper-related configuration items.
                if (hbaseConfigItemList != null && hbaseConfigItemList.size() > 0) {
                    for (String key : hbaseConfigItemList.keySet()) {
                        configuration.set(key, hbaseConfigItemList.get(key));
                    }
                }
            } catch (Exception ex) {
                System.out.println(ex.toString());
            }
        }
        return configuration;
    }

    /**
     * Reset the HBase Configuration object so it is rebuilt on next use.
     */
    public static void initHBaseConfig() {
        configuration = null;
    }

    /**
     * Close all connections.
     */
    public static void closeAllConnections() {
        HConnectionManager.deleteAllConnections();
    }

    /**
     * Close the current connection.
     */
    public static void closeConnection() {
        HConnectionManager.deleteConnection(configuration);
    }

    /**
     * Check communication between the client and the HBase cluster.
     * @return true if the cluster is reachable, false otherwise.
     */
    public static boolean checkHBaseAvailable() {
        try {
            HBaseAdmin.checkHBaseAvailable(getHBaseConfig());
            return true;
        } catch (Exception e) {
            return false;
        }
    }
}
     
         
         
         