Home PC Games Linux Windows Database Network Programming Server Mobile  
           
  Home \ Programming \ Handle large data problems Bit-map method     - Cryptography development environment to build under Ubuntu (Linux)

- To install Git on Ubuntu systems (Server)

- Use mysqldump MySQL database backup - Linux Shell Scripting (Database)

- Matters Oracle 11.2 single instance when connecting ASM need to pay attention and deal with the problem (Database)

- Erlang concurrency and foundation (Programming)

- Github Remote Assistance (Linux)

- Install mono offline on CentOS (Server)

- Linux reserves the rest of the file to delete several (Linux)

- Multi-core CPU, multi-threading and parallel computation (Linux)

- How to fix Ubuntu / Mint can not add PPA source of error (Linux)

- Graphical development environment to build Android under Ubuntu 11.04 (Linux)

- Use FFmpeg processing high quality GIF (Programming)

- How to create a secure and easy to remember password (Linux)

- Memcached distributed caching (Server)

- Using Libreoffice under ubuntu (Linux)

- Linux System Getting Started Learning: Linux in the last command (Linux)

- Default permissions Linux file and directory permissions and hide - umask, chattr, lsattr, SUID, SGID, SBIT, file (Linux)

- Category prevent DoS attacks against Linux (Linux)

- installation of Piwik under Ubuntu (Programming)

- JBPM6 Tutorial - Fast Fun JBPM table (Linux)

 
         
  Handle large data problems Bit-map method
     
  Add Date : 2017-08-31      
         
         
         
  Problems introduced:

1. 4 billion to integer not repeat unsigned int, and did not sorted, and then give a number, how to quickly determine whether this number 4000000000 that the number of them?
2. Given a set of integers million level the amount of data to determine which elements are repeated.
3. Given an array of shaping a million level the amount of data, sort them.
4. Find Unique integers (note that this is not enough memory to accommodate 500 million integer) at 500 million integers.

Judging from the amount of data, using a conventional method (common sorting algorithm, compare apples to apples, etc.) clearly inappropriate, so here we introduce a new solution is Bitmap.

Bitmap is to use one bit to mark an element corresponding Value, and Key is the bit of the bit sequence. As a result of Bit units to store data, it can greatly save storage space. Through a bit bitmap represents a state, such as: int type is 2 ^ 32 numbers that 4G digits, each digit one state, that is 2 ^ 32 bit, that is 512 MB (that is, with 512 trillion storage space can handle 4G data, namely 40+ one hundred million data).

Here is what I use C ++ to write a bitmap class, you can scale the incoming data object is constructed by dynamic memory required for the application, and handle large amounts of data users:

#include < iostream>
#include < fstream>
#include < ctime>
using namespace std;
const unsigned SIZE = 512000000; // 512 MB static memory data can be processed 4.096 billion

class Bitmap {
    typedef struct Byte {
        unsigned char bit8;
        static const unsigned char mask [9]; // used to get a byte of every auxiliary array
        Byte ()
        {
            bit8 = 0;
        }
        // This bit is set, the number is stored
        void set1 (unsigned at)
        {
            bit8 | = mask [at];
        }
        // Read it if there are several
        bool get1 (unsigned at)
        {
            return bit8 & mask [at];
        }
    } Byte;
    Byte * m_byte;
    unsigned m_size;
public:
    Bitmap (unsigned _size)
    {
        m_byte = new Byte [(_ size + 7) / 8];
        m_size = _size;
    }
    virtual ~ Bitmap ()
    {
        delete [] m_byte;
        m_size = 0;
    }
    A data storage //
    bool push (unsigned data)
    {
        if (data> = m_size)
            return false;
        m_byte [data / 8] .set1 (data% 8);
        return true;
    }
    // Read whether a data exists
    bool find (unsigned data)
    {
        return data> = m_size 0: m_byte [data / 8] .get1 (data% 8);?
    }
    // Returns the number of data that can be stored
    unsigned size ()
    {
        return m_size;
    }
    // Overloaded operators implement common functions
    A data storage //
    bool operator >> (unsigned data)
    {
        return push (data);
    }
    // Read whether a data exists
    bool operator << (unsigned data)
    {
        return find (data);
    }
    // Access to a data block
    Byte & operator [] (unsigned i)
    {
        if (i> = m_size / 8)
            throw "index out of range";
        return m_byte [i];
    }
};
const unsigned char Bitmap :: Byte :: mask [9] = {0x80,0x40,0x20,0x10,0x8,0x4,0x2,0x1}; // used to get a byte of every auxiliary array

int main ()
{
    Bitmap bitmap (8 * SIZE); // can store one hundred million data 40+
    ifstream file ( "in.txt"); // Input integers from file
    unsigned read, i = 0, t1 = clock ();
    for (i = 0; i         if (file >> read)
            bitmap >> read;
        else
            break;
    file.close ();
    cout << "total store" << i / 10000 << "W data," << "Processed:" << clock () - t1 << "ms" << endl;
    t1 = clock ();
    for (i = 0; i <1000000; ++ i)
        if (bitmap << i)
            ;
    cout << "access" << i / 10000 << "W Data Total time:" << clock () - t1 << "ms" << endl;
    cout << "Please enter the data needs to be retrieved:" << endl;
    while (cin >> read) {
        if (bitmap << read)
            cout << "stored" << read << endl;
        else
            cout << "Error: No storage" << read << endl;
    }
    return 0;
}

In the program, first read 6W integers a randomly generated text file, saved to the bitmap, and then test it from this well-established bitmap data looks 100W consuming cases (about 11ms), the next section the user can manually enter some integers, whether the program will automatically retrieve the data stored in the bitmap.

Introduce the topic so that you can solve the first problem, the text data input to the known data can be 4 billion (4 billion data entry may take a while, about 1300 seconds).

Here are problem-solving ideas introduced in the remaining three issues.

Question 2: first create a large enough Bitmap object, and then turn the data entry before entry if a bit has been found that the data is 1, the data duplication, data can be sequentially repeated.

Question 3: first create a large enough Bitmap object, and then sequentially input data traversing Bitmap From the start position, if someone is not 0, it indicates that there is data sequentially output is not zero bit 0 is the sort of order good array (output not much sense to convert the output to a file is written, the new data file is sorted).

Question 4: 1, the establishment of two large enough Bitmap object, followed by data entry before entry to determine whether the data in bitmap1 whether there is (that is, whether the corresponding bit is 1), does not exist, entered into bitmap1, there would Entry to bitmap2 in; after all input sequentially traverse bitmap1 each one, if a bit = 1 but bitmap2 corresponding bit is not 1, it indicates that the data appears only once, to sequentially output.

Method 2, to establish a large enough Bitmap object, but with two showing a data, 00 indicates that the data does not exist, the data 01 occurs once, 10 indicates that the data appear multiple times. 11 it? Side cool go, do not you, ha ha. This in turn when entering data, if the corresponding bit (actually two) is 00 to 01,01 to 10,10 to go on. After the entry is complete, traverse the entire bitmap, to find the 01 outputs.

Well, the common big data topics on the adoption of bitmap this amazing structure to solve, but the bitmap is not a panacea, it is clear, it being only suitable for storing integer data, of course, here only consider the unsigned data type, if it is of type int then click on the corresponding map can also no problem. But even so, it can only handle one billion level data, if a larger amount of data, not just the type of plastic it?

For example: the need to write a web spider (web crawler). Since the link between networks intricate spider crawling across the network is likely to form a "ring." In order to avoid the formation of "rings", you need to know what those spiders have visited URL. To a URL, how do you know if the spider has visited?

Difficult to think of several options as follows:

1. Place all visited URL saved to the database;

2. The visited URL and save them in HashSet. Just close that the cost of O (1) can be found on whether a URL is accessed;

3. URL through MD5 or SHA-1 one-way hash, etc. then saved to the database or HashSet.

4. Bit-Map method. Establish a BitSet, each URL through a hash function to map a bit.

1 to 3 are the visited URL intact, only a mapping method 4-bit tag URL.

In the above method is small amount of data can be the perfect solution to the problem, but when the amount of data becomes very large question came.

Method 1: After the amount of data becomes very large relational database query efficiency will be very low. And each to a URL to start a database query is not too over done?

Method 2: too consume memory. With the increase of the URL, the memory will be more and more. Even if only 100 million URL, each URL count only 50 characters, you need to 5GB of memory.

Method 3: Because the string after the MD5 message digest length processed only 128Bit, SHA-1 treatment after only 160Bit, so the ratio of Tier 3 Tier 2 several times to save memory.

Method 4: memory consumption is relatively small, but the drawback is the probability that a single hash function conflict is too high. Remember the data structure of school class conflict over Hash table various solutions it? To reduce the probability of conflict to one percent, will be the length of the BitSet is set to 100 times the number of URL.

But we can consider if you ignore false positives in the situation to some extent, it is not possible to achieve this functionality through 4 improved method? In fact, this is the thinking Bloom Filter algorithm: Bloom Filter is a function of the Bloom Duo Haxi proposed in 1970 mapped fast search algorithm. Typically used in some need to quickly determine whether an element belongs to the set, but does not strictly require 100% correct occasion. The idea is in the process of 4 on the basis of some improvements, not mapped to one, but by mapping K hash function to the K position, so that only when the new URL calculated K bits are 1 when the judge for the URL has been visited (there is the possibility of false positives, but there are studies to prove, that allows to obtain false positives when appropriate K value and the bitmap bits so small that can be ignored, see details)

Of course, it can be handled by map-reduce, but after all, people mapreduce expert, professional large data processing technology it!
     
         
         
         
  More:      
 
- Linux-based Heartbeat high availability configuration httpd service (Server)
- 2016, the new Node project Precautions (Programming)
- Getting Started with Linux system to learn: how to check in a package is installed on Ubuntu (Linux)
- Oracle 11g RAC manually playing GI PSU patch (11.2.0.4.8) (Database)
- Use DB2 federated access Oracle (Database)
- How to use the process on the desktop xkill end Linux (Linux)
- Linux common network tools: batch scanning of hosting services netcat (Linux)
- Install the free open source financial software GnuCash 2.6.6 under Ubuntu (Linux)
- Linux linux system security (Linux)
- The array of C language (Programming)
- Linux modify environment variables method (Linux)
- Oracle how to maintain the consistency of read? (Database)
- How to Debian Linux the default Python version switch to alternative version (Linux)
- Oracle Client Dedicated and Shared connection mode (Database)
- numpy and SciPy installation under Python for scientific computing package (Linux)
- LMMS 1.03 install on Ubuntu 14.04 (Linux)
- Ubuntu ADSL dial-up Internet access (Linux)
- Expand an existing RAID arrays and remove the failed disk in a RAID (Linux)
- To compile install and test Swift under Linux (Linux)
- Linux dual physical network card set to a virtual NIC (Linux)
     
           
     
  CopyRight 2002-2020 newfreesoft.com, All Rights Reserved.