Problem Description
The job suffers from data skew, but the usual remedies, such as a map-side join or filtering out the special keys, cannot be applied here.
set hive.groupby.skewindata = true;
insert overwrite table ad_overall_day partition (part_time = '99', part_date = '2015-11-99')
select account_id, nvl(client_id, -1), nvl(track_id, 'total'), sum(if(type = 3, 1, 0)) as imp_cnt,
sum(if(type = 4, 1, 0)) as click_cnt, count(distinct if(type = 3, zid, NULL)) as imp_uv,
count(distinct if(type = 4, zid, NULL)) as click_uv
from derived_di_v3
where year = '2015' and month = '11'
group by account_id, client_id, track_id
grouping sets ((account_id, client_id, track_id), (account_id, client_id), (account_id));
But this fails with an error:
FAILED: SemanticException [Error 10022]: DISTINCT on different columns not supported with skew in data
Workaround: split the aggregation into two separate GROUP BY subqueries, one per DISTINCT expression, and join the results.
set hive.groupby.skewindata = true;
set hive.exec.parallel = true;
insert overwrite table ad_overall_day partition (part_time = '99', part_date = '2015-11-99')
select coalesce(t1.account_id, t2.account_id), coalesce(t1.client_id, t2.client_id),
coalesce(t1.track_id, t2.track_id), t1.imp_cnt, t1.imp_uv, t2.click_cnt, t2.click_uv
from
(select account_id, nvl(client_id, -1) as client_id, nvl(track_id, 'total') as track_id,
sum(if(type = 3, 1, 0)) as imp_cnt, count(distinct if(type = 3, zid, NULL)) as imp_uv
from derived_di_v3 where year = '2015' and month = '11'
group by account_id, client_id, track_id
grouping sets ((account_id, client_id, track_id), (account_id, client_id), (account_id))) t1
full outer join
(select account_id, nvl(client_id, -1) as client_id, nvl(track_id, 'total') as track_id,
sum(if(type = 4, 1, 0)) as click_cnt, count(distinct if(type = 4, zid, NULL)) as click_uv
from derived_di_v3 where year = '2015' and month = '11'
group by account_id, client_id, track_id
grouping sets ((account_id, client_id, track_id), (account_id, client_id), (account_id))) t2
on t1.account_id = t2.account_id and t1.client_id = t2.client_id and t1.track_id = t2.track_id;
The GROUP BY does not run as two MapReduce jobs
Unfortunately, Hive does not compile this HQL into two MapReduce jobs for each GROUP BY.
The parameter hive.groupby.skewindata seems to have no effect.
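One way to check how many jobs Hive plans is to prefix a statement with EXPLAIN and read the stage list in the output. The sketch below is an assumption about how to inspect the plan, not the original author's procedure. It uses a simplified form of the first subquery (no grouping sets, no nvl) to keep the plan short. The exact stage layout depends on the Hive version and execution engine, so no expected output is shown.

```sql
set hive.groupby.skewindata = true;

-- EXPLAIN prints the query plan without running the query.
-- On the MapReduce engine, each "Map Reduce" stage in the plan is one job.
explain
select account_id, client_id, track_id,
       sum(if(type = 3, 1, 0)) as imp_cnt,
       count(distinct if(type = 3, zid, NULL)) as imp_uv
from derived_di_v3
where year = '2015' and month = '11'
group by account_id, client_id, track_id;
```

Comparing the plan with the flag set to true and to false shows whether the setting changes anything for a given query.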
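Rewrite the HQL so that an inner GROUP BY deduplicates (type, zid) first and the outer query no longer needs DISTINCT: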
insert overwrite table ad_overall_day partition (part_time = '99', part_date = '2015-11-99')
select account_id, nvl(client_id, -1) as client_id, nvl(track_id, 'total') as track_id,
sum(imp1) as imp_cnt, count(imp2) as imp_uv, sum(click1) as click_cnt, count(click2) as click_uv
from (select account_id, client_id, track_id,
if(type = 3, 1, 0) as imp1, if(type = 3, zid, NULL) as imp2,
if(type = 4, 1, 0) as click1, if(type = 4, zid, NULL) as click2
from dmp.derived_di_v3 where year = '2015' and month = '11'
group by account_id, client_id, track_id, type, zid) t
group by account_id, client_id, track_id
grouping sets ((account_id, client_id, track_id), (account_id, client_id), (account_id));
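One thing to watch in this rewrite: the inner GROUP BY collapses all rows that share the same (account_id, client_id, track_id, type, zid), so sum(imp1) and sum(click1) count distinct zid values rather than raw rows. Below is a minimal sketch that keeps the raw counts by carrying count(1) out of the inner query. The alias cnt is introduced here for illustration, and the statement is shown as a bare SELECT rather than an INSERT.

```sql
select account_id, nvl(client_id, -1) as client_id, nvl(track_id, 'total') as track_id,
       sum(if(type = 3, cnt, 0))      as imp_cnt,   -- raw impression rows
       count(if(type = 3, zid, NULL)) as imp_uv,    -- one row per zid after the inner dedup
       sum(if(type = 4, cnt, 0))      as click_cnt, -- raw click rows
       count(if(type = 4, zid, NULL)) as click_uv
from (
  -- one row per (account_id, client_id, track_id, type, zid), with its raw row count
  select account_id, client_id, track_id, type, zid, count(1) as cnt
  from dmp.derived_di_v3
  where year = '2015' and month = '11'
  group by account_id, client_id, track_id, type, zid
) t
group by account_id, client_id, track_id
grouping sets ((account_id, client_id, track_id), (account_id, client_id), (account_id));
```

The UV columns match count(distinct ...) only at the finest grouping level. At the (account_id, client_id) and (account_id) levels, a zid that appears under several track_id values is counted once per track_id, so those rollup UVs can come out higher than a true distinct count.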