  Association Analysis: FP-Growth algorithm
     
  Add Date : 2018-11-21      
         
         
         
  Association analysis, also known as association mining, looks for frequent patterns, associations, correlations, or causal structures among sets of items or objects in transaction data, relational data, or other information carriers. A typical example is market basket analysis, which studies customers' buying habits by finding links between the different goods they put into their baskets. For example, 67% of the customers who buy diapers also buy beer. Knowing which products customers frequently buy together helps retailers develop marketing strategies. Association analysis also applies to other fields, such as bioinformatics, medical diagnosis, web mining, and scientific data analysis.

1. Problem Definition
Market basket data represents customers' purchases: each row records one customer's shopping trip and corresponds to a transaction, and each column corresponds to an item. Let I = {i1, i2, ..., id} be the set of all items in the basket data and T = {t1, t2, ..., tN} the set of all transactions. The items contained in each transaction ti form a subset of I. In association analysis, a set of zero or more items is called an itemset. An association rule is an expression of the form X → Y, where X and Y are disjoint itemsets. Two concepts are central to association analysis: support and confidence. Support measures how often a rule applies to a given data set, and confidence measures how frequently Y appears in transactions that contain X. Support (s) and confidence (c) are defined as:

    s(X → Y) = σ(X ∪ Y) / N
    c(X → Y) = σ(X ∪ Y) / σ(X)

where σ(X) is the number of transactions that contain the itemset X and N is the total number of transactions. A rule with very low support occurs only occasionally and carries little meaning. Confidence, on the other hand, measures how reliable the inference made by a rule is. Most association analysis algorithms therefore adopt a two-step strategy:

(1) Frequent itemset generation: find all itemsets that meet the minimum support threshold. These itemsets are called frequent itemsets.

(2) Rule generation: extract all high-confidence rules from the frequent itemsets found in the previous step. These rules are called strong rules.
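
The two measures defined above can be computed directly from a list of transactions. The following is a minimal sketch, not part of the original article: the names `Itemset`, `supportCount`, `support`, `confidence` and the toy baskets in `toyBaskets` are all hypothetical and chosen only for illustration.

```cpp
#include <algorithm>
#include <cassert>
#include <set>
#include <string>
#include <vector>

using Itemset = std::set<std::string>;

// Support count sigma(X): number of transactions that contain every item of X.
int supportCount(const std::vector<Itemset>& transactions, const Itemset& x) {
    int count = 0;
    for (const Itemset& t : transactions) {
        // std::includes needs sorted ranges; std::set iterates in sorted order.
        if (std::includes(t.begin(), t.end(), x.begin(), x.end())) {
            ++count;
        }
    }
    return count;
}

// s(X -> Y) = sigma(X u Y) / N
double support(const std::vector<Itemset>& transactions,
               const Itemset& x, const Itemset& y) {
    assert(!transactions.empty());
    Itemset both = x;
    both.insert(y.begin(), y.end());
    return static_cast<double>(supportCount(transactions, both)) / transactions.size();
}

// c(X -> Y) = sigma(X u Y) / sigma(X)
double confidence(const std::vector<Itemset>& transactions,
                  const Itemset& x, const Itemset& y) {
    const int countX = supportCount(transactions, x);
    assert(countX > 0);  // the rule is undefined if X never occurs
    Itemset both = x;
    both.insert(y.begin(), y.end());
    return static_cast<double>(supportCount(transactions, both)) / countX;
}

// Hypothetical toy data, used only to exercise the functions above.
std::vector<Itemset> toyBaskets() {
    return {
        {"bread", "milk"},
        {"bread", "diapers", "beer", "eggs"},
        {"milk", "diapers", "beer", "cola"},
        {"bread", "milk", "diapers", "beer"},
        {"bread", "milk", "diapers", "cola"},
    };
}
```

On this toy data the rule {diapers} → {beer} has support 3/5 = 0.6 and confidence 3/4 = 0.75, because diapers occur in four baskets and three of those also contain beer.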

 

2. Construction of FP-tree

The FP-growth algorithm compresses the transaction information of the database into an FP-tree so that frequent itemsets can be generated more efficiently. An FP-tree is essentially a prefix tree in which items are ordered by descending support: the higher the support of a frequent item, the closer it sits to the root, so that more frequent items can share prefixes.
Consider a transactional database for market basket analysis, in which a, b, ..., p denote the items purchased by customers. First, scan the database once to count the support of every item. Then sort the items of each record in descending order of support and keep only the frequent items, discarding those below the support threshold. With a support threshold of 3, this yields <(f: 4), (c: 4), (a: 3), (b: 3), (m: 3), (p: 3)>. (Because N in the support formula is constant, only the numerators need to be compared.) Figure 2 shows the sorted records.
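
The first scan can be sketched as follows. This is an illustration under stated assumptions rather than the article's own code: the function names are hypothetical, and the five transactions are assumed to be those of the well-known running example from Han et al.'s FP-growth paper, which reproduce the counts quoted above. Ties are broken alphabetically here, so c is listed before f.

```cpp
#include <algorithm>
#include <cassert>
#include <map>
#include <string>
#include <utility>
#include <vector>

using Transaction = std::vector<std::string>;

// Assumed example database (the running example of the FP-growth paper).
std::vector<Transaction> exampleDatabase() {
    return {
        {"f", "a", "c", "d", "g", "i", "m", "p"},
        {"a", "b", "c", "f", "l", "m", "o"},
        {"b", "f", "h", "j", "o"},
        {"b", "c", "k", "s", "p"},
        {"a", "f", "c", "e", "l", "p", "m", "n"},
    };
}

// First scan: count every item, keep those with count >= minSupport,
// and sort by descending count (alphabetically on ties).
// Assumes an item occurs at most once per transaction.
std::vector<std::pair<std::string, int>> frequentItems(
        const std::vector<Transaction>& db, int minSupport) {
    assert(minSupport > 0);
    std::map<std::string, int> counts;
    for (const Transaction& t : db) {
        for (const std::string& item : t) {
            ++counts[item];
        }
    }
    std::vector<std::pair<std::string, int>> result;
    for (const auto& entry : counts) {
        if (entry.second >= minSupport) {
            result.push_back(entry);
        }
    }
    std::sort(result.begin(), result.end(),
              [](const std::pair<std::string, int>& a,
                 const std::pair<std::string, int>& b) {
                  if (a.second != b.second) return a.second > b.second;
                  return a.first < b.first;
              });
    return result;
}
```

With a threshold of 3, `frequentItems(exampleDatabase(), 3)` returns six items: c and f with count 4, followed by a, b, m and p with count 3.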

The root of the FP-tree is null and represents no item. Next, scan the transactional database a second time and build the FP-tree:

The first record <f, c, a, m, p> becomes the first branch of the FP-tree, <(f: 1), (c: 1), (a: 1), (m: 1), (p: 1)>.
The second record <f, c, a, b, m> shares the prefix <f, c, a> with the first record, so the support counts of f, c and a are each increased by one, and the new nodes (b: 1) and (m: 1) are added under node (a: 2). The second branch of the FP-tree is therefore <(f: 2), (c: 2), (a: 2), (b: 1), (m: 1)>.
The third record <f, b> shares only the prefix <f> with the previous two records, so only the node (b: 1) is added, under node (f: 3).
The fourth record <c, b, p> has no common prefix with any previous record, so the nodes (c: 1), (b: 1) and (p: 1) are added under the root.
Similarly, the fifth record <f, c, a, m, p> is added as a branch of the FP-tree, and the support counts of the nodes it passes through are updated.
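
The insertion steps above can be sketched compactly. This is a simplified illustration, not the article's implementation: the `Node` struct, `insertRecord`, `countAt` and `buildExampleTree` are hypothetical names, children are kept in a `std::map` for brevity, and the five sorted records are assumed to be the ones implied by the branches described above.

```cpp
#include <cassert>
#include <map>
#include <memory>
#include <string>
#include <vector>

// Simplified FP-tree node (no header-table links in this sketch).
struct Node {
    std::string name;   // empty for the root
    int count = 0;      // support count
    Node* parent = nullptr;
    std::map<std::string, std::unique_ptr<Node>> children;
};

// Inserts one sorted record: shared prefixes only get their counts
// incremented, the remaining items become new nodes.
void insertRecord(Node& root, const std::vector<std::string>& record) {
    Node* current = &root;
    for (const std::string& item : record) {
        assert(!item.empty());  // the empty name is reserved for the root
        std::unique_ptr<Node>& child = current->children[item];
        if (!child) {
            child.reset(new Node());
            child->name = item;
            child->parent = current;
        }
        ++child->count;
        current = child.get();
    }
}

// Returns the count stored at the end of `path`, or 0 if the path is absent.
int countAt(const Node& root, const std::vector<std::string>& path) {
    const Node* current = &root;
    for (const std::string& item : path) {
        auto it = current->children.find(item);
        if (it == current->children.end()) return 0;
        current = it->second.get();
    }
    return current->count;
}

// Builds the tree from the five sorted records of the walk-through.
std::unique_ptr<Node> buildExampleTree() {
    const std::vector<std::vector<std::string>> sortedRecords = {
        {"f", "c", "a", "m", "p"},
        {"f", "c", "a", "b", "m"},
        {"f", "b"},
        {"c", "b", "p"},
        {"f", "c", "a", "m", "p"},
    };
    std::unique_ptr<Node> root(new Node());
    for (const std::vector<std::string>& record : sortedRecords) {
        insertRecord(*root, record);
    }
    return root;
}
```

After all five insertions the root has two children, f with count 4 and c with count 1, and the path f, c, a, m, p ends in a node with count 2.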
To make traversing the whole tree easier, an item header table is built. The first column of this table lists the frequent items in descending order of support. The second column holds a pointer to the position of that item's node in the FP-tree. Each FP-tree node also has a pointer to the next node with the same name.
To sum up, an FP-tree node can be defined as:

class TreeNode {

private:
    std::string name;                  // node name
    int count;                         // support count
    TreeNode *parent;                  // parent node
    std::vector<TreeNode *> children;  // child nodes
    TreeNode *nextHomonym;             // next node with the same name

    ...
};
3. Mining frequent patterns from the FP-tree (Frequent Patterns)

We mine frequent patterns from the FP-tree starting at the bottom of the header table. The FP-tree contains two node chains that end in p: <(f: 4), (c: 3), (a: 3), (m: 2), (p: 2)> and <(c: 1), (b: 1), (p: 1)>. The first chain means that the item list <f, c, a, m, p> appears twice in the database. Note that although <f, c, a> appears three times in the first chain and f alone appears four times, they appear together with p only twice, so in the conditional FP-tree <(f: 4), (c: 3), (a: 3), (m: 2), (p: 2)> is recorded as <(f: 2), (c: 2), (a: 2), (m: 2), (p: 2)>. Similarly, the second chain means that the item list <c, b, p> appears only once in the database. The prefix chains of p, <(f: 2), (c: 2), (a: 2), (m: 2)> and <(c: 1), (b: 1)>, are called the conditional pattern base of p. We treat the conditional pattern base of p as a new transaction database in which each row stores one prefix chain of p, and build an FP-tree by the procedure of section 2: count the support of every item, sort in descending order of support, and keep only the frequent items, discarding those below the support threshold. The resulting tree is called the conditional FP-tree of p. In this base only c reaches the threshold (c: 3, against f: 2, a: 2, m: 2 and b: 1), so the conditional FP-tree of p has the single node (c: 3), and the frequent itemsets ending in p are {(p: 3), (cp: 3)}.
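
The conditional pattern base of p can be reproduced with a short sketch. As a shortcut for illustration, it is computed here directly from the sorted records rather than by walking the node links of the tree; both give the same prefix chains, because records with the same sorted prefix share one path in the FP-tree. All names (`Record`, `conditionalPatternBase`, `frequentInBase`, `sortedExampleRecords`) are hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <map>
#include <string>
#include <vector>

using Record = std::vector<std::string>;

// Sorted, filtered records assumed from the tree construction in section 2.
std::vector<Record> sortedExampleRecords() {
    return {
        {"f", "c", "a", "m", "p"},
        {"f", "c", "a", "b", "m"},
        {"f", "b"},
        {"c", "b", "p"},
        {"f", "c", "a", "m", "p"},
    };
}

// Maps every distinct prefix that precedes `item` to how often it occurs.
std::map<Record, int> conditionalPatternBase(
        const std::vector<Record>& sortedRecords, const std::string& item) {
    std::map<Record, int> base;
    for (const Record& record : sortedRecords) {
        auto pos = std::find(record.begin(), record.end(), item);
        if (pos == record.end() || pos == record.begin()) {
            continue;  // item absent, or it has no prefix
        }
        ++base[Record(record.begin(), pos)];
    }
    return base;
}

// Items whose accumulated count inside the base reaches minSupport.
std::map<std::string, int> frequentInBase(
        const std::map<Record, int>& base, int minSupport) {
    assert(minSupport > 0);
    std::map<std::string, int> counts;
    for (const auto& entry : base) {
        for (const std::string& item : entry.first) {
            counts[item] += entry.second;
        }
    }
    for (auto it = counts.begin(); it != counts.end();) {
        if (it->second < minSupport) {
            it = counts.erase(it);
        } else {
            ++it;
        }
    }
    return counts;
}
```

For p this yields the two prefixes <f, c, a, m> with count 2 and <c, b> with count 1, and with a threshold of 3 only c survives, with count 3.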
Unlike that of p, the conditional FP-tree of m has three nodes, so frequent itemsets must be mined recursively with mine(<(f: 3), (c: 3), (a: 3) | (m: 3)>). Following the order <(a: 3), (c: 3), (f: 3)>, this recursively calls mine(<(f: 3), (c: 3) | a, m>), mine(<(f: 3) | c, m>) and mine(null | f, m). Since (m: 3) meets the support threshold, (m: 3) is itself a frequent itemset ending in m.
As can be seen, the conditional FP-tree of node (a, m) has two nodes, which requires the further recursive calls mine(<(f: 3) | c, a, m>) and mine(null | f, a, m). The call mine(<(f: 3) | c, a, m>) in turn produces mine(null | f, c, a, m). The frequent itemsets ending in (a, m) are therefore {(am: 3), (fam: 3), (cam: 3), (fcam: 3)}.
The conditional FP-tree of node (c, m) has only one node, so a single recursive call mine(null | f, c, m) is enough. The frequent itemsets ending in (c, m) are therefore {(cm: 3), (fcm: 3)}. Similarly, the frequent itemset ending in (f, m) is {(fm: 3)}.

The FP-tree contains three node chains that end in b: <(f: 4), (c: 3), (a: 3), (b: 1)>, <(f: 4), (b: 1)> and <(c: 1), (b: 1)>. No item in the conditional pattern base of b, which consists of <(f: 1), (c: 1), (a: 1)>, <(f: 1)> and <(c: 1)>, meets the support threshold, so no recursion is needed. The only frequent itemset ending in b is therefore (b: 3).

In the same way, the frequent itemsets ending in a are {(fa: 3), (ca: 3), (fca: 3), (a: 3)}, those ending in c are {(fc: 3), (c: 4)}, and the one ending in f is {(f: 4)}.
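
The itemsets listed in this section can be cross-checked without any tree at all. The sketch below is deliberately not FP-growth: it is a brute-force enumeration of every subset of the six frequent items, feasible only because the example is tiny. The function names are hypothetical, and the database is assumed to be the running example of Han et al.'s paper, with items stored as single characters.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <set>
#include <string>
#include <vector>

// Assumed example database, one std::set<char> per transaction.
std::vector<std::set<char>> exampleDb() {
    return {
        {'f', 'a', 'c', 'd', 'g', 'i', 'm', 'p'},
        {'a', 'b', 'c', 'f', 'l', 'm', 'o'},
        {'b', 'f', 'h', 'j', 'o'},
        {'b', 'c', 'k', 's', 'p'},
        {'a', 'f', 'c', 'e', 'l', 'p', 'm', 'n'},
    };
}

// Brute force: tries every non-empty subset of `items` and keeps those
// contained in at least minSupport transactions. Exponential in items.size().
std::map<std::string, int> bruteForceFrequent(
        const std::vector<std::set<char>>& db,
        const std::string& items, int minSupport) {
    assert(items.size() < 20);  // guard against an infeasible enumeration
    std::map<std::string, int> result;
    const unsigned total = 1u << items.size();
    for (unsigned mask = 1; mask < total; ++mask) {
        std::string candidate;
        for (std::size_t i = 0; i < items.size(); ++i) {
            if (mask & (1u << i)) candidate += items[i];
        }
        int count = 0;
        for (const std::set<char>& transaction : db) {
            bool containsAll = true;
            for (char item : candidate) {
                if (transaction.count(item) == 0) {
                    containsAll = false;
                    break;
                }
            }
            if (containsAll) ++count;
        }
        if (count >= minSupport) result[candidate] = count;
    }
    return result;
}
```

Calling `bruteForceFrequent(exampleDb(), "fcabmp", 3)` finds 18 frequent itemsets, which matches the lists above: 2 ending in p, 8 ending in m, 1 ending in b, 4 ending in a, 2 ending in c and 1 ending in f.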

4. Algorithm

Declaration of the FP-tree node:

#include <string>
#include <vector>
using namespace std;

class TreeNode
{

    // Constructors and destructor
public:
    TreeNode();
    TreeNode(string);
    ~TreeNode();

    // Member variables
private:
    string nodeName;                   // item name; empty for the root
    int supportCount;                  // support count of this node
    TreeNode *parentNode;              // parent node
    vector<TreeNode *> childNodeList;  // child nodes
    TreeNode *nextHomonymNode;         // next node with the same name

    // Member functions
public:

    string getName() const;
    void setName(string);

    int getSupportCount() const;
    void setSupportCount(int);

    TreeNode *getParentNode() const;
    void setParentNode(TreeNode *);

    vector<TreeNode *> getChildNodeList() const;
    void addChild(TreeNode *);
    TreeNode *findChildNode(string) const;
    void setChildren(vector<TreeNode *>);
    void printChildrenNames() const;

    TreeNode *getNextHomonym() const;
    void setNextHomonym(TreeNode *nextHomonym);

    void countIncrement(int);
};
Building the header table:

// Build the header table from the transaction records
vector<TreeNode *> FPTree::buildHeaderTable(vector<vector<string>> transRecords)
{
    // F1 holds the nodes that meet the support threshold, in descending order of support.
    // Ties are broken alphabetically, so the FP-tree built here differs from the one in
    // the paper, but the resulting frequent itemsets are the same.
    vector<TreeNode *> F1;
    if (transRecords.size() > 0)
    {
        map<string, TreeNode *> mp;

        // Count the support of every item in transRecords
        for (vector<string> record : transRecords)
        {
            for (string item : record)
            {
                // If item is not in the map, insert it with a support count of one
                if (mp.find(item) == mp.end())
                {
                    TreeNode *node = new TreeNode(item);
                    node->setSupportCount(1);
                    mp.insert(map<string, TreeNode *>::value_type(item, node));
                }

                // If item is already in the map, increment its support count
                else
                {
                    mp.find(item)->second->countIncrement(1);
                }
            }
        }

        // Put the TreeNodes whose supportCount is at least minSupportThreshold into F1
        for (auto iterator = mp.begin(); iterator != mp.end(); iterator++)
        {
            if (iterator->second->getSupportCount() >= minSupportThreshold)
            {
                // cout << "iterator->second = " << iterator->second->getSupportCount() << endl;
                F1.push_back(iterator->second);
            }
        }

        // Sort F1 by supportCount
        sort(F1.begin(), F1.end(), sortBySupportCount);
    }
    return F1;
}
Building the FP-tree:

TreeNode *FPTree::buildTree(vector<vector<string>> transRecords, vector<TreeNode *> F1)
{

    TreeNode *root = new TreeNode(); // the root of the tree
    for (vector<string> transRecord : transRecords)
    {
        // Copy transRecord to record
        vector<string> record;
        for (auto iter = transRecord.begin(); iter != transRecord.end(); iter++)
        {
            record.push_back(*iter);
        }

        // Sort record in descending order of support according to F1 and keep only
        // the frequent items, dropping those below the support threshold
        record = sortedByF1(record, F1);

        // Walk down the tree along record: while the next item of record already exists
        // as a child node, increment that node's support count and continue with the
        // next item; then call addNodes to attach the remaining items as new nodes
        TreeNode *subTreeRoot = root;
        TreeNode *tmpRoot = nullptr;
        if (!root->getChildNodeList().empty())
        {
            while (!record.empty() && (tmpRoot = subTreeRoot->findChildNode(*(record.begin()))) != nullptr)
            {
                tmpRoot->countIncrement(1);
                subTreeRoot = tmpRoot;
                record.erase(record.begin());
            }
        }
        addNodes(subTreeRoot, &record, F1);
    }
    return root;
}
Adding nodes:

void FPTree::addNodes(TreeNode *ancestor, vector<string> *record, vector<TreeNode *> F1)
{
    if (!record->empty())
    {
        while (!record->empty())
        {
            // Take the first remaining item and attach it as a new leaf under ancestor
            string item = *(record->begin());
            record->erase(record->begin());
            TreeNode *leafNode = new TreeNode(item);
            leafNode->setSupportCount(1);
            leafNode->setParentNode(ancestor);
            ancestor->addChild(leafNode);

            // Append the new leaf to the end of the same-name chain that starts in the header table
            for (TreeNode *f1 : F1)
            {
                if (f1->getName() == item)
                {
                    while (f1->getNextHomonym() != nullptr)
                    {
                        f1 = f1->getNextHomonym();
                    }

                    f1->setNextHomonym(leafNode);
                    break;
                }
            }

            // Attach the rest of the record below the new leaf
            addNodes(leafNode, record, F1);
        }
    }
}
sortedByF1:

vector<string> FPTree::sortedByF1(vector<string> transRecord, vector<TreeNode *> F1)
{
    // A frequent item has an index in F1; sort the items by that index and drop the rest
    map<string, int> mp;
    for (string item : transRecord)
    {
        for (int i = 0; i < static_cast<int>(F1.size()); i++)
        {
            TreeNode *tNode = F1[i];
            if (tNode->getName() == item)
            {
                mp.insert(map<string, int>::value_type(item, i));
            }
        }
    }
    vector<pair<string, int>> vec;
    for (auto iterator = mp.begin(); iterator != mp.end(); iterator++)
    {
        vec.push_back(make_pair(iterator->first, iterator->second));
    }
    sort(vec.begin(), vec.end(), sortByF1);
    vector<string> rest;
    for (auto iterator = vec.begin(); iterator != vec.end(); iterator++)
    {
        rest.push_back((*iterator).first);
    }
    return rest;
}
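
The listings call two comparators, `sortBySupportCount` and `sortByF1`, that the article does not show. The sketch below is one plausible way to write them, assuming the ordering stated in the header-table comment (descending support, alphabetical on ties) and ascending F1 index for `sortByF1`. The `TreeNode` class here is only a minimal stand-in with the two getters the comparators need, not the full class declared earlier.

```cpp
#include <algorithm>
#include <cassert>
#include <string>
#include <utility>
#include <vector>

// Minimal stand-in for the article's TreeNode (assumption: only these
// two getters matter for the comparators).
class TreeNode {
public:
    TreeNode(std::string name, int count)
        : name_(std::move(name)), count_(count) {}
    std::string getName() const { return name_; }
    int getSupportCount() const { return count_; }

private:
    std::string name_;
    int count_;
};

// Orders header-table nodes by descending support count and
// alphabetically by name when the counts are equal.
bool sortBySupportCount(TreeNode* lhs, TreeNode* rhs) {
    assert(lhs != nullptr && rhs != nullptr);
    if (lhs->getSupportCount() != rhs->getSupportCount()) {
        return lhs->getSupportCount() > rhs->getSupportCount();
    }
    return lhs->getName() < rhs->getName();
}

// Orders (item, index-in-F1) pairs by ascending index, which is the
// same as descending support.
bool sortByF1(const std::pair<std::string, int>& lhs,
              const std::pair<std::string, int>& rhs) {
    return lhs.second < rhs.second;
}
```

Both functions are strict weak orderings, which `std::sort` requires of its comparator.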
Recursive FP-Growth call to mine frequent itemsets:

// postPattern stores the suffix; for example, when mining starts from node p of the header table, postPattern is p
void FPTree::FPGrowth(vector<vector<string>> transRecords, vector<string> postPattern)
{
    vector<TreeNode *> headerTable = buildHeaderTable(transRecords); // build the header table
    TreeNode *treeRoot = buildTree(transRecords, headerTable);       // build the FP-tree

    // Recursion exit condition: the root node has no children
    if (treeRoot->getChildNodeList().size() == 0)
    {
        return;
    }
    // Output frequent itemsets: support count, item, then the suffix
    if (!postPattern.empty())
    {
        for (TreeNode *header : headerTable)
        {
            cout << header->getSupportCount() << ' ' << header->getName() << ' ';
            for (string str : postPattern)
            {
                cout << str << ' ';
            }
            cout << endl;
        }
    }

    // Traverse the header table
    for (TreeNode *header : headerTable)
    {
        vector<string> newPostPattern;
        newPostPattern.push_back(header->getName());

        // Append the original suffix
        if (!postPattern.empty())
        {
            for (string str : postPattern)
            {
                newPostPattern.push_back(str);
            }
        }
        // newTransRecords stores the prefix node chains
        vector<vector<string>> newTransRecords;
        TreeNode *backNode = header->getNextHomonym();

        // Visit the nodes with the same name through getNextHomonym and collect each prefix node chain through getParentNode
        while (backNode != nullptr)
        {
            int supportCount = backNode->getSupportCount();
            vector<string> preNodes;
            TreeNode *parent = backNode;
            // Climb towards the root, whose name is empty
            while ((parent = parent->getParentNode())->getName().length() != 0)
            {
                preNodes.push_back(parent->getName());
            }
            // Add the prefix chain once per occurrence
            while (supportCount-- > 0)
            {
                newTransRecords.push_back(preNodes);
            }
            backNode = backNode->getNextHomonym();
        }
        FPGrowth(newTransRecords, newPostPattern); // recursively build the conditional FP-tree
    }
}
5. Discussion

Before Professor Jiawei Han proposed the FP-growth algorithm, association analysis commonly relied on the Apriori algorithm and its variants. However, Apriori and its variants need to scan the database many times and may generate an exponential number of candidate itemsets, so their performance is unsatisfactory. The FP-growth algorithm uses the efficient FP-tree data structure: it scans the database only twice and no longer needs to generate a large number of candidates.

When an FP-tree consists of a single path, no recursion is needed: the frequent itemsets can be generated directly by enumerating the combinations of the nodes on the path. Professor Jiawei Han describes this single-path optimization in his paper. The paper also discusses how to adapt the FP-growth algorithm to very large data sets.
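
The single-path shortcut can be sketched as a plain enumeration of node combinations. This is an illustration under assumptions, not the code from the paper: the function name `singlePathPatterns`, the representation of the path as (name, count) pairs from root to leaf, and the concatenation of item names into one string are all hypothetical choices. The support of a combination is the smallest count among the nodes it picks.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// path:   nodes of a single-path conditional FP-tree, root to leaf
// suffix: the items the tree is conditioned on (e.g. "m")
// Returns every non-empty combination of path nodes joined with the
// suffix, together with its support.
std::vector<std::pair<std::string, int>> singlePathPatterns(
        const std::vector<std::pair<std::string, int>>& path,
        const std::string& suffix) {
    assert(path.size() < 31);  // keeps the bit mask inside an unsigned int
    std::vector<std::pair<std::string, int>> patterns;
    const unsigned total = 1u << path.size();
    for (unsigned mask = 1; mask < total; ++mask) {
        std::string items;
        int support = 0;
        bool first = true;
        for (std::size_t i = 0; i < path.size(); ++i) {
            if (mask & (1u << i)) {
                items += path[i].first;
                support = first ? path[i].second
                                : std::min(support, path[i].second);
                first = false;
            }
        }
        patterns.emplace_back(items + suffix, support);
    }
    return patterns;
}
```

For the conditional FP-tree of m from section 3, the path <(f: 3), (c: 3), (a: 3)> with suffix m gives the seven itemsets fm, cm, fcm, am, fam, cam and fcam, each with support 3. Together with (m: 3) itself these are the eight frequent itemsets ending in m.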
     
         
         
         