Home PC Games Linux Windows Database Network Programming Server Mobile  
           
  Home \ Database \ Hive handle count distinct inclination to produce data processing     - How to migrate MySQL to MariaDB under linux (Database)

- Standard and IO redirection (Linux)

- SQL Server 2012 failover looksalive check and is alive check (Database)

- MySQL high availability cluster fragmentation of deployment uses Fabric (Database)

- Use Python automatically cleared Android Engineering excess resources (Programming)

- Chkconfig command Detailed service is added and shut down the system in two ways to start service under Linux (Linux)

- MySQL partition table Comments (Database)

- 20 Advanced Java interview questions summary (Programming)

- To install Scribus 1.4.4 under ubuntu (Linux)

- Different between Linux file path and the windows (Linux)

- Multi-core CPU, multi-threading and parallel computation (Linux)

- Android Studio commonly used shortcuts and how to follow the Eclipse Shortcuts (Linux)

- Android Studio and Git Git configuration file status (Linux)

- PL / SQL in forall simple test (Database)

- Ubuntu 14.10 / 14.04 / 12.04 installation GNOME Pie 0.5.6 (Linux)

- Some of the bibliographic management tools to good use on linux (Linux)

- Linux common commands: nslookup, ls md5sum, uname, history, etc. (Linux)

- MySQL and MariaDB new master from the cluster configuration GTID (Database)

- How to use Linux iptables tool for network sharing (Linux)

- Analytical Ceph: handle network layer (Server)

 
         
  Hive handle count distinct inclination to produce data processing
     
  Add Date : 2018-11-21      
         
         
         
  Problem Description

Data skew problem areas, but can not join the Map end, excluding special Key and other methods of treatment.

set hive.groupby.skewindata = true;

insert overwrite table ad_overall_day partition (part_time = '99 ', part_date =' 2015-11-99 ')
select account_id, nvl (client_id, -1), nvl (track_id, 'total'), sum (if (type = 3,1,0)) as imp_cnt,
sum (if (type = 4,1,0)) as click_cnt, count (distinct if (type = 3, zid, NULL)) as imp_uv,
count (distinct if (type = 4, zid, NULL)) as click_uv
from derived_di_v3
where year = '2015' and month = '11 '
group by account_id, client_id, track_id
grouping sets ((account_id, client_id, track_id), (account_id, client_id), (account_id))
But error find.
FAILED: SemanticException [Error 10022]: DISTINCT on different columns not supported with skew in data

separate group-by and join the results.

set hive.groupby.skewindata = true;
set hive.exec.parallel = true;

insert overwrite table ad_overall_day partition (part_time = '99 ', part_date =' 2015-11-99 ')
SELECT COALESCE (t1.account_id, t2.account_id), COALESCE (t1.client_id, t2.client_id),
    COALESCE (t1.track_id, t2.track_id), t1.imp_cnt, t1.imp_uv, t2.click_cnt, t2.click_uv
FROM
(Select account_id, nvl (client_id, -1) as client_id, nvl (track_id, 'total') as track_id,
sum (if (type = 3,1,0)) as imp_cnt, count (distinct if (type = 3, zid, NULL)) as imp_uv
FROM derived_di_v3 where year = '2015' and month = '11 '
group by account_id, client_id, track_id
grouping sets ((account_id, client_id, track_id), (account_id, client_id), (account_id))) t1
FULL OUTER JOIN
(Select account_id, nvl (client_id, -1) as client_id, nvl (track_id, 'total') as track_id,
sum (if (type = 4,1,0)) as click_cnt, count (distinct if (type = 4, zid, NULL)) as click_uv
FROM derived_di_v3 where year = '2015' and month = '11 '
group by account_id, client_id, track_id
grouping sets ((account_id, client_id, track_id), (account_id, client_id), (account_id))) t2
ON t1.account_id = t2.account_id and t1.client_id = t2.client_id and t1.track_id = t2.track_id;
Cann't run two MapReduce Job

Unfortunately, hive does not explain the hql to two mapreduce job.
The parameter hive.groupby.skewindata seems has no affect.

Change the hql:

insert overwrite table ad_overall_day partition (part_time = '99 ', part_date =' 2015-11-99 ')
select account_id, nvl (client_id, -1) as client_id, nvl (track_id, 'total') as track_id,
    sum (imp1) as imp_cnt, count (imp2) as imp_uv, sum (click1) as click_cnt, count (click2) as click_uv
FROM (select account_id, client_id, track_id,
     if (type = 3,1,0) as imp1, if (type = 3, zid, NULL) as imp2,
     if (type = 4,1,0) as click1, if (type = 4, zid, NULL) as click2
    FROM dmp.derived_di_v3 where year = '2015' and month = '11 '
    group by account_id, client_id, track_id, type, zid) t
group by account_id, client_id, track_id
grouping sets ((account_id, client_id, track_id), (account_id, client_id), (account_id))
     
         
         
         
  More:      
 
- Installation configuration CUDA under Ubuntu 14.04 (Linux)
- Java keyword final, static (Programming)
- Oracle Enterprise Linux 64-bit install apache-tomcat-7.0.53 step (Server)
- Linux Command Line Art (Linux)
- Installation under Linux Mint system guidelines for Gtk (Linux)
- Virtual Judge environment to build and configure under Ubuntu (Server)
- Use $ BASH ENV variable to mention the right way under Linux (Linux)
- C language keywords Comments (Programming)
- Linux System Getting Started Learning: Linux how to install 7zip (Linux)
- Oracle SDE and maintain common commands - Display space (Database)
- CentOS6 install Redis (Database)
- How to install Ubuntu applications Device 7 (Linux)
- CentOS7 set boot directly into the command line interface (Linux)
- Why do I prefer Git (Linux)
- Linux the best download manager uGet (Linux)
- How to make GRub instead of the default Ubuntu software center (Linux)
- 64-bit Windows Server 2012 R2 install Oracle 10g Second Edition (Database)
- Perl said method B if A judge (Programming)
- How to install or upgrade to the Linux kernel in Ubuntu 4.2 (Linux)
- Android Studio simple setup (Linux)
     
           
     
  CopyRight 2002-2022 newfreesoft.com, All Rights Reserved.