HiveQL查询连续三天有销售记录的店铺
点击关注上方“
知了小巷
“
,
设为“置顶或星标”,第一时间送达干货。
背景
有一份关于各个店铺的销售数据,简单起见,只有店铺ID、记录时间、销售数量三个字段。
shopid | dt | sale |
店铺ID | 时间 | 销售数量 |
问题(Ask Question)
连续三天有销售记录的店铺
要点
1.时间连续,当前行的时间减去当前行所在的排序序号得到的结果相同即时间连续
2.把差值相同的行分组聚合计算count,数量大于等于3即连续三天
2016-1 = 2015
2017-2 = 2015
2018-3 = 2015
数据集
1.原始数据 sale.dat
1001,2019-10-11,300
1001,2019-10-12,200
1001,2019-10-13,100
1001,2019-10-15,100
1001,2019-10-16,300
1001,2019-10-17,150
1001,2019-10-18,340
1001,2019-10-19,360
1005,2019-10-11,400
1005,2019-10-12,200
1005,2019-10-15,600
1006,2019-10-11,350
1006,2019-10-13,250
1006,2019-10-14,300
1006,2019-10-15,400
1006,2019-10-16,200
1008,2019-10-13,500
1010,2019-10-14,600
1010,2019-10-15,500
1008,2019-10-14,600
2.分析 :
给每个用户一个编号,用日期减去编号,如果是同一天,那么就是连续的
1001,2019-10-11,300,1,2019-10-10
1001,2019-10-12,200,2,2019-10-10
1001,2019-10-13,100,3,2019-10-10
1001,2019-10-15,100,4,2019-10-11
1001,2019-10-16,300,5,2019-10-11
1001,2019-10-17,150,6,2019-10-11
1001,2019-10-18,340,7,2019-10-11
1001,2019-10-19,360,8,2019-10-11
1005,2019-10-11,400
1005,2019-10-12,200
1005,2019-10-15,600
1006,2019-10-11,350
1006,2019-10-13,250
1006,2019-10-14,300
1006,2019-10-15,400
1006,2019-10-16,200
1008,2019-10-13,500
1010,2019-10-14,600
1010,2019-10-15,500
1008,2019-10-14,600
数据表和数据准备
0.建表,加载数据
create table t_jd ( shopid string, dt string, sale int ) row format delimited fields terminated by ',';
hive> load data local inpath '/Users/xxx/Development/logs/sale.dat' into table t_jd; Loading data to table default.t_jd OK
SQL流程
1.打编号-排序序号
select shopid, dt, sale, row_number() over(partition by shopid order by dt) as rn from t_jd;
结果:
... MapReduce Jobs Launched: Stage-Stage-1: HDFS Read: 13776 HDFS Write: 1774 SUCCESS Total MapReduce CPU Time Spent: 0 msec OK 1001 2019-10-11 300 1 1001 2019-10-12 200 2 1001 2019-10-13 100 3 1001 2019-10-15 100 4 1001 2019-10-16 300 5 1001 2019-10-17 150 6 1001 2019-10-18 340 7 1001 2019-10-19 360 8 1005 2019-10-11 400 1 1005 2019-10-12 200 2 1005 2019-10-15 600 3 1006 2019-10-11 350 1 1006 2019-10-13 250 2 1006 2019-10-14 300 3 1006 2019-10-15 400 4 1006 2019-10-16 200 5 1008 2019-10-13 500 1 1008 2019-10-14 600 2 1010 2019-10-14 600 1 1010 2019-10-15 500 2
2.根据编号,生成连续日期
select shopid, dt, sale, rn, date_sub(to_date(dt), rn) from ( select shopid, dt, sale, row_number() over(partition by shopid order by dt) as rn from t_jd ) tmp;
结果:
... MapReduce Jobs Launched: Stage-Stage-1: HDFS Read: 14576 HDFS Write: 1774 SUCCESS Total MapReduce CPU Time Spent: 0 msec OK 1001 2019-10-11 300 1 2019-10-10 1001 2019-10-12 200 2 2019-10-10 1001 2019-10-13 100 3 2019-10-10 1001 2019-10-15 100 4 2019-10-11 1001 2019-10-16 300 5 2019-10-11 1001 2019-10-17 150 6 2019-10-11 1001 2019-10-18 340 7 2019-10-11 1001 2019-10-19 360 8 2019-10-11 1005 2019-10-11 400 1 2019-10-10 1005 2019-10-12 200 2 2019-10-10 1005 2019-10-15 600 3 2019-10-12 1006 2019-10-11 350 1 2019-10-10 1006 2019-10-13 250 2 2019-10-11 1006 2019-10-14 300 3 2019-10-11 1006 2019-10-15 400 4 2019-10-11 1006 2019-10-16 200 5 2019-10-11 1008 2019-10-13 500 1 2019-10-12 1008 2019-10-14 600 2 2019-10-12 1010 2019-10-14 600 1 2019-10-13 1010 2019-10-15 500 2 2019-10-13
4 分组,求count
select shopid, count(1) as cnt from ( select shopid, dt, sale, rn, date_sub(to_date(dt), rn) as flag from ( select shopid, dt, sale, row_number() over(partition by shopid order by dt) as rn from t_jd ) tmp ) tmp2 group by shopid, flag;
结果:
... MapReduce Jobs Launched: Stage-Stage-1: HDFS Read: 800 HDFS Write: 0 SUCCESS Stage-Stage-2: HDFS Read: 800 HDFS Write: 0 SUCCESS Total MapReduce CPU Time Spent: 0 msec OK 1001 3 1001 5 1005 2 1005 1 1006 1 1006 4 1008 2 1010 2
5 筛选出连续天数大于等于3的
select shopid from ( select shopid, count(1) as cnt from ( select shopid, dt, sale, rn, date_sub(to_date(dt), rn) as flag from ( select shopid, dt, sale, row_number() over(partition by shopid order by dt) as rn from t_jd ) tmp ) tmp2 group by shopid, flag ) tmp3 where tmp3.cnt >= 3;
结果:
... MapReduce Jobs Launched: Stage-Stage-1: HDFS Read: 1600 HDFS Write: 0 SUCCESS Stage-Stage-2: HDFS Read: 1600 HDFS Write: 0 SUCCESS Total MapReduce CPU Time Spent: 0 msec OK 1001 1001 1006
6 去重
select distinct shopid from ( select shopid, count(1) as cnt from ( select shopid, dt, sale, rn, date_sub(to_date(dt), rn) as flag from ( select shopid, dt, sale, row_number() over(partition by shopid order by dt) as rn from t_jd ) tmp ) tmp2 group by shopid, flag ) tmp3 where tmp3.cnt >= 3;
结果:
... OK 1001 1006
推荐阅读:
Hive-DML(Data Manipulation Language)数据操作语言
Hive-DDL(Data Definition Language)数据定义
欢迎关注我的公众号“
知了小巷
”,如果喜欢,麻烦点一下“
在看
”~