  Handling data skew caused by count distinct in Hive
     
  Add Date : 2018-11-21      
         
         
         
  Problem Description

The query below suffers from data skew, but the usual remedies, such as a map-side join or excluding special keys, cannot be applied here.

set hive.groupby.skewindata = true;

insert overwrite table ad_overall_day partition (part_time = '99', part_date = '2015-11-99')
select account_id, nvl(client_id, -1), nvl(track_id, 'total'), sum(if(type = 3, 1, 0)) as imp_cnt,
sum(if(type = 4, 1, 0)) as click_cnt, count(distinct if(type = 3, zid, NULL)) as imp_uv,
count(distinct if(type = 4, zid, NULL)) as click_uv
from derived_di_v3
where year = '2015' and month = '11'
group by account_id, client_id, track_id
grouping sets ((account_id, client_id, track_id), (account_id, client_id), (account_id));
But the query fails with an error:
FAILED: SemanticException [Error 10022]: DISTINCT on different columns not supported with skew in data
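
For context, hive.groupby.skewindata = true asks Hive to aggregate in two stages: rows are first spread randomly across reducers and partially aggregated, and the partial results are then regrouped by the real group-by key. The Python sketch below illustrates that idea for a plain sum. It is only an illustration, not Hive's implementation, and the names two_stage_sum and num_buckets are made up for this example. A distinct count cannot be split this way unless equal values always land in the same bucket, which is presumably why Hive rejects DISTINCT on two different expressions when this setting is on.

```python
import random
from collections import defaultdict


def two_stage_sum(rows, num_buckets=4, seed=0):
    """Sum values per key in two stages; rows is an iterable of (key, value)."""
    rng = random.Random(seed)

    # Stage 1: assign every row to a random bucket, so a hot key is split
    # across several workers, and pre-aggregate inside each bucket.
    partial = defaultdict(int)
    for key, value in rows:
        bucket = rng.randrange(num_buckets)
        partial[(bucket, key)] += value

    # Stage 2: regroup the much smaller partial results by the real key.
    final = defaultdict(int)
    for (_bucket, key), subtotal in partial.items():
        final[key] += subtotal
    return dict(final)


# One heavily skewed key and one rare key (hypothetical data).
rows = [("hot", 1)] * 1000 + [("cold", 1)] * 3
totals = two_stage_sum(rows)
print(totals)
```

The totals do not depend on how rows were assigned to buckets, which is what makes the first stage safe for sums and counts.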

Next attempt: compute each count distinct in its own group-by subquery and join the results.

set hive.groupby.skewindata = true;
set hive.exec.parallel = true;

insert overwrite table ad_overall_day partition (part_time = '99', part_date = '2015-11-99')
select coalesce(t1.account_id, t2.account_id), coalesce(t1.client_id, t2.client_id),
    coalesce(t1.track_id, t2.track_id), t1.imp_cnt, t1.imp_uv, t2.click_cnt, t2.click_uv
from
(select account_id, nvl(client_id, -1) as client_id, nvl(track_id, 'total') as track_id,
sum(if(type = 3, 1, 0)) as imp_cnt, count(distinct if(type = 3, zid, NULL)) as imp_uv
from derived_di_v3 where year = '2015' and month = '11'
group by account_id, client_id, track_id
grouping sets ((account_id, client_id, track_id), (account_id, client_id), (account_id))) t1
full outer join
(select account_id, nvl(client_id, -1) as client_id, nvl(track_id, 'total') as track_id,
sum(if(type = 4, 1, 0)) as click_cnt, count(distinct if(type = 4, zid, NULL)) as click_uv
from derived_di_v3 where year = '2015' and month = '11'
group by account_id, client_id, track_id
grouping sets ((account_id, client_id, track_id), (account_id, client_id), (account_id))) t2
on t1.account_id = t2.account_id and t1.client_id = t2.client_id and t1.track_id = t2.track_id;
Hive does not run this as two MapReduce jobs

Unfortunately, Hive does not compile each group-by in this HQL into two MapReduce jobs, and the hive.groupby.skewindata parameter seems to have no effect.

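Change the HQL so that an inner query deduplicates the rows first and the outer query uses a plain count instead of count distinct: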

insert overwrite table ad_overall_day partition (part_time = '99', part_date = '2015-11-99')
select account_id, nvl(client_id, -1) as client_id, nvl(track_id, 'total') as track_id,
    sum(imp1) as imp_cnt, count(imp2) as imp_uv, sum(click1) as click_cnt, count(click2) as click_uv
-- inner query: one row per (account_id, client_id, track_id, type, zid)
from (select account_id, client_id, track_id,
     -- sum() keeps the number of raw rows; a bare if(type = 3, 1, 0) would be
     -- collapsed to a single 1 per zid by the group by and undercount imp_cnt
     sum(if(type = 3, 1, 0)) as imp1, if(type = 3, zid, NULL) as imp2,
     sum(if(type = 4, 1, 0)) as click1, if(type = 4, zid, NULL) as click2
    from dmp.derived_di_v3 where year = '2015' and month = '11'
    group by account_id, client_id, track_id, type, zid) t
group by account_id, client_id, track_id
grouping sets ((account_id, client_id, track_id), (account_id, client_id), (account_id));
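
The deduplicate-then-count idea can be checked on a small data set. The following is a minimal sketch that uses Python's built-in sqlite3 module instead of Hive. The table name di and its rows are invented for the example, CASE replaces Hive's if(), and grouping sets are left out because SQLite does not support them. The inner query also carries a row count n (an addition of this sketch) so that the non-distinct totals survive the deduplication.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("create table di (account_id, client_id, track_id, type, zid)")
conn.executemany(
    "insert into di values (?, ?, ?, ?, ?)",
    [
        (1, 10, "a", 3, "u1"),
        (1, 10, "a", 3, "u1"),  # duplicate impression by the same zid
        (1, 10, "a", 3, "u2"),
        (1, 10, "a", 4, "u1"),
        (1, 10, "b", 3, "u1"),  # u1 shows up again under another track_id
        (1, 10, "b", 4, "u3"),
        (1, 10, "b", 4, "u3"),  # duplicate click by the same zid
    ],
)

# Reference result: count(distinct ...) evaluated directly.
DIRECT = """
select track_id,
       sum(case when type = 3 then 1 else 0 end),
       count(distinct case when type = 3 then zid end),
       sum(case when type = 4 then 1 else 0 end),
       count(distinct case when type = 4 then zid end)
from di
group by account_id, client_id, track_id
order by track_id
"""

# Stage 1: one row per (account_id, client_id, track_id, type, zid).
DEDUP = """
select account_id, client_id, track_id, type, zid, count(*) as n
from di
group by account_id, client_id, track_id, type, zid
"""

# Stage 2: a plain count over the deduplicated rows replaces count(distinct).
TWO_STAGE = f"""
select track_id,
       sum(case when type = 3 then n else 0 end),
       count(case when type = 3 then zid end),
       sum(case when type = 4 then n else 0 end),
       count(case when type = 4 then zid end)
from ({DEDUP}) t
group by account_id, client_id, track_id
order by track_id
"""

direct = conn.execute(DIRECT).fetchall()
two_stage = conn.execute(TWO_STAGE).fetchall()
print(direct == two_stage)

# Rolled up across track_id, the plain count no longer matches the distinct count.
rollup_direct = conn.execute(
    "select count(distinct case when type = 3 then zid end) from di"
).fetchone()[0]
rollup_two_stage = conn.execute(
    f"select count(case when type = 3 then zid end) from ({DEDUP}) t"
).fetchone()[0]
print(rollup_direct, rollup_two_stage)
```

At the finest grouping level both queries agree. At a rolled-up level the plain count is larger whenever the same zid occurs under several track_id values, so the (account_id, client_id) and (account_id) rows produced by the rewritten HQL are worth checking against the original definition of imp_uv and click_uv.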
     
         
         
         
  CopyRight 2002-2022 newfreesoft.com, All Rights Reserved.