Home IT Linux Windows Database Network Programming Server Mobile  
           
  Home \ Programming \ Handle large data problems Bit-map method     - How to use Linux to download music from Grooveshark (Linux)

- How ONLYOFFICE collaborative editing document on Linux (Linux)

- The traffic monitoring system: cacti (Linux)

- Linux kernel VLAN study notes (Programming)

- Install Python 3.3.4 under CentOS 6.4 (Linux)

- OGG-01496 OGG-01031 Error Resolution (Database)

- MySQL password on those things you should know (Database)

- Linux with Windows Explorer as a security system (Linux)

- Linux modify the system time (Linux)

- Linux C source code (Ascii HexToBinary: Converts hexadecimal string format ASCII codes) (Programming)

- Some security configuration of Linux systems (Linux)

- Git version rollback (Linux)

- Source code compiled by the installation program under Linux (Linux)

- How to use Java to read OpenOffice document (Programming)

- Linux use additional rights (Linux)

- CentOS 7 How to install MySQL Server (Database)

- Nginx Keepalived Nginx monitoring scripts (Server)

- Linux Proc File System Experiment (Linux)

- Install apr support for Tomcat on Linux (Server)

- Spring declarative transaction management (Programming)

 
         
  Handle large data problems Bit-map method
     
  Add Date : 2017-08-31      
         
       
         
  Problems introduced:

1. 4 billion to integer not repeat unsigned int, and did not sorted, and then give a number, how to quickly determine whether this number 4000000000 that the number of them?
2. Given a set of integers million level the amount of data to determine which elements are repeated.
3. Given an array of shaping a million level the amount of data, sort them.
4. Find Unique integers (note that this is not enough memory to accommodate 500 million integer) at 500 million integers.

Judging from the amount of data, using a conventional method (common sorting algorithm, compare apples to apples, etc.) clearly inappropriate, so here we introduce a new solution is Bitmap.

Bitmap is to use one bit to mark an element corresponding Value, and Key is the bit of the bit sequence. As a result of Bit units to store data, it can greatly save storage space. Through a bit bitmap represents a state, such as: int type is 2 ^ 32 numbers that 4G digits, each digit one state, that is 2 ^ 32 bit, that is 512 MB (that is, with 512 trillion storage space can handle 4G data, namely 40+ one hundred million data).

Here is what I use C ++ to write a bitmap class, you can scale the incoming data object is constructed by dynamic memory required for the application, and handle large amounts of data users:

#include < iostream>
#include < fstream>
#include < ctime>
using namespace std;
const unsigned SIZE = 512000000; // 512 MB static memory data can be processed 4.096 billion

class Bitmap {
    typedef struct Byte {
        unsigned char bit8;
        static const unsigned char mask [9]; // used to get a byte of every auxiliary array
        Byte ()
        {
            bit8 = 0;
        }
        // This bit is set, the number is stored
        void set1 (unsigned at)
        {
            bit8 | = mask [at];
        }
        // Read it if there are several
        bool get1 (unsigned at)
        {
            return bit8 & mask [at];
        }
    } Byte;
    Byte * m_byte;
    unsigned m_size;
public:
    Bitmap (unsigned _size)
    {
        m_byte = new Byte [(_ size + 7) / 8];
        m_size = _size;
    }
    virtual ~ Bitmap ()
    {
        delete [] m_byte;
        m_size = 0;
    }
    A data storage //
    bool push (unsigned data)
    {
        if (data> = m_size)
            return false;
        m_byte [data / 8] .set1 (data% 8);
        return true;
    }
    // Read whether a data exists
    bool find (unsigned data)
    {
        return data> = m_size 0: m_byte [data / 8] .get1 (data% 8);?
    }
    // Returns the number of data that can be stored
    unsigned size ()
    {
        return m_size;
    }
    // Overloaded operators implement common functions
    A data storage //
    bool operator >> (unsigned data)
    {
        return push (data);
    }
    // Read whether a data exists
    bool operator << (unsigned data)
    {
        return find (data);
    }
    // Access to a data block
    Byte & operator [] (unsigned i)
    {
        if (i> = m_size / 8)
            throw "index out of range";
        return m_byte [i];
    }
};
const unsigned char Bitmap :: Byte :: mask [9] = {0x80,0x40,0x20,0x10,0x8,0x4,0x2,0x1}; // used to get a byte of every auxiliary array

int main ()
{
    Bitmap bitmap (8 * SIZE); // can store one hundred million data 40+
    ifstream file ( "in.txt"); // Input integers from file
    unsigned read, i = 0, t1 = clock ();
    for (i = 0; i         if (file >> read)
            bitmap >> read;
        else
            break;
    file.close ();
    cout << "total store" << i / 10000 << "W data," << "Processed:" << clock () - t1 << "ms" << endl;
    t1 = clock ();
    for (i = 0; i <1000000; ++ i)
        if (bitmap << i)
            ;
    cout << "access" << i / 10000 << "W Data Total time:" << clock () - t1 << "ms" << endl;
    cout << "Please enter the data needs to be retrieved:" << endl;
    while (cin >> read) {
        if (bitmap << read)
            cout << "stored" << read << endl;
        else
            cout << "Error: No storage" << read << endl;
    }
    return 0;
}

In the program, first read 6W integers a randomly generated text file, saved to the bitmap, and then test it from this well-established bitmap data looks 100W consuming cases (about 11ms), the next section the user can manually enter some integers, whether the program will automatically retrieve the data stored in the bitmap.

Introduce the topic so that you can solve the first problem, the text data input to the known data can be 4 billion (4 billion data entry may take a while, about 1300 seconds).

Here are problem-solving ideas introduced in the remaining three issues.

Question 2: first create a large enough Bitmap object, and then turn the data entry before entry if a bit has been found that the data is 1, the data duplication, data can be sequentially repeated.

Question 3: first create a large enough Bitmap object, and then sequentially input data traversing Bitmap From the start position, if someone is not 0, it indicates that there is data sequentially output is not zero bit 0 is the sort of order good array (output not much sense to convert the output to a file is written, the new data file is sorted).

Question 4: 1, the establishment of two large enough Bitmap object, followed by data entry before entry to determine whether the data in bitmap1 whether there is (that is, whether the corresponding bit is 1), does not exist, entered into bitmap1, there would Entry to bitmap2 in; after all input sequentially traverse bitmap1 each one, if a bit = 1 but bitmap2 corresponding bit is not 1, it indicates that the data appears only once, to sequentially output.

Method 2, to establish a large enough Bitmap object, but with two showing a data, 00 indicates that the data does not exist, the data 01 occurs once, 10 indicates that the data appear multiple times. 11 it? Side cool go, do not you, ha ha. This in turn when entering data, if the corresponding bit (actually two) is 00 to 01,01 to 10,10 to go on. After the entry is complete, traverse the entire bitmap, to find the 01 outputs.

Well, the common big data topics on the adoption of bitmap this amazing structure to solve, but the bitmap is not a panacea, it is clear, it being only suitable for storing integer data, of course, here only consider the unsigned data type, if it is of type int then click on the corresponding map can also no problem. But even so, it can only handle one billion level data, if a larger amount of data, not just the type of plastic it?

For example: the need to write a web spider (web crawler). Since the link between networks intricate spider crawling across the network is likely to form a "ring." In order to avoid the formation of "rings", you need to know what those spiders have visited URL. To a URL, how do you know if the spider has visited?

Difficult to think of several options as follows:

1. Place all visited URL saved to the database;

2. The visited URL and save them in HashSet. Just close that the cost of O (1) can be found on whether a URL is accessed;

3. URL through MD5 or SHA-1 one-way hash, etc. then saved to the database or HashSet.

4. Bit-Map method. Establish a BitSet, each URL through a hash function to map a bit.

1 to 3 are the visited URL intact, only a mapping method 4-bit tag URL.

In the above method is small amount of data can be the perfect solution to the problem, but when the amount of data becomes very large question came.

Method 1: After the amount of data becomes very large relational database query efficiency will be very low. And each to a URL to start a database query is not too over done?

Method 2: too consume memory. With the increase of the URL, the memory will be more and more. Even if only 100 million URL, each URL count only 50 characters, you need to 5GB of memory.

Method 3: Because the string after the MD5 message digest length processed only 128Bit, SHA-1 treatment after only 160Bit, so the ratio of Tier 3 Tier 2 several times to save memory.

Method 4: memory consumption is relatively small, but the drawback is the probability that a single hash function conflict is too high. Remember the data structure of school class conflict over Hash table various solutions it? To reduce the probability of conflict to one percent, will be the length of the BitSet is set to 100 times the number of URL.

But we can consider if you ignore false positives in the situation to some extent, it is not possible to achieve this functionality through 4 improved method? In fact, this is the thinking Bloom Filter algorithm: Bloom Filter is a function of the Bloom Duo Haxi proposed in 1970 mapped fast search algorithm. Typically used in some need to quickly determine whether an element belongs to the set, but does not strictly require 100% correct occasion. The idea is in the process of 4 on the basis of some improvements, not mapped to one, but by mapping K hash function to the K position, so that only when the new URL calculated K bits are 1 when the judge for the URL has been visited (there is the possibility of false positives, but there are studies to prove, that allows to obtain false positives when appropriate K value and the bitmap bits so small that can be ignored, see details)

Of course, it can be handled by map-reduce, but after all, people mapreduce expert, professional large data processing technology it!
     
         
       
         
  More:      
 
- 64-bit Windows Server 2012 R2 install Oracle 10g Second Edition (Database)
- iptraf: A Practical TCP / UDP network monitoring tools (Linux)
- CoreOS Linux introduces Kubernetes kubelet (Server)
- open V switch port mirror in OpenStack neutron (Server)
- MySQL error: ERROR 1175: You are using safe update mode solution (Database)
- Linux / CentOS 7.0 installation and configuration under Tomcat 8.0 (Server)
- How to remove the files inside the privacy of data on Linux (Linux)
- Install Ubuntu Software Center App Grid (Linux)
- Java implementation of stacks and queues (Programming)
- How to Install terminator 0.98 on Ubuntu and Linux Mint (Linux)
- Android determine the device network connection status, and determine the connection (Programming)
- A detailed introduction to the Hadoop ecosystem (Server)
- Fedora 20 users install the Mate 1.8 desktop (Linux)
- How to track performance issues when using the Unity Game Development Android (Programming)
- Ubuntu installation module Python rq (Linux)
- to install the deployment of LVS under CentOS 7.0 (Server)
- Ubuntu 14.04 build Android 5.1 development environment and compiler (Linux)
- NFS-based services and service utilization Corosync DRBD high availability cluster configuration, respectively (Server)
- Linux 10 useful examples of command-line completion (Linux)
- Use OpenSSL carried BASE64 encoding and decoding (Linux)
     
           
     
  CopyRight 2002-2016 newfreesoft.com, All Rights Reserved.