HiveQL查询连续三天有销售记录的店铺

点击关注上方“


知了小巷













设为“置顶或星标”,第一时间送达干货。

背景
有一份关于各个店铺的销售数据,简单起见,只有店铺ID、记录时间、销售数量三个字段。

shopid dt sale
店铺ID 时间 销售数量

问题(Ask Question)
连续三天有销售记录的店铺
要点
1.时间连续,当前行的时间减去当前行所在的排序序号得到的结果相同即时间连续
2.把差值相同的行分组聚合计算count,数量大于等于3即连续三天
2016-1 = 2015
2017-2 = 2015
2018-3 = 2015
数据集
1.原始数据 sale.dat
1001,2019-10-11,300
1001,2019-10-12,200
1001,2019-10-13,100
1001,2019-10-15,100
1001,2019-10-16,300
1001,2019-10-17,150
1001,2019-10-18,340
1001,2019-10-19,360
1005,2019-10-11,400
1005,2019-10-12,200
1005,2019-10-15,600
1006,2019-10-11,350
1006,2019-10-13,250
1006,2019-10-14,300
1006,2019-10-15,400
1006,2019-10-16,200
1008,2019-10-13,500
1010,2019-10-14,600
1010,2019-10-15,500
1008,2019-10-14,600

2.分析 :
给每个用户一个编号,用日期减去编号,如果是同一天,那么就是连续的

1001,2019-10-11,300,1,2019-10-10
1001,2019-10-12,200,2,2019-10-10
1001,2019-10-13,100,3,2019-10-10
1001,2019-10-15,100,4,2019-10-11
1001,2019-10-16,300,5,2019-10-11
1001,2019-10-17,150,6,2019-10-11
1001,2019-10-18,340,7,2019-10-11
1001,2019-10-19,360,8,2019-10-11
1005,2019-10-11,400
1005,2019-10-12,200
1005,2019-10-15,600
1006,2019-10-11,350
1006,2019-10-13,250
1006,2019-10-14,300
1006,2019-10-15,400
1006,2019-10-16,200
1008,2019-10-13,500
1010,2019-10-14,600
1010,2019-10-15,500
1008,2019-10-14,600
数据表和数据准备
0.建表,加载数据

create table t_jd (

    shopid string,

    dt string,

    sale int

) row format delimited fields terminated by ',';

hive> load data local inpath '/Users/xxx/Development/logs/sale.dat' into table t_jd; Loading data to table default.t_jd OK

SQL流程
1.打编号-排序序号

select

    shopid,

    dt,

    sale,

    row_number() over(partition by shopid order by dt) as rn

from t_jd;

结果:

...

MapReduce Jobs Launched: 

Stage-Stage-1:  HDFS Read: 13776 HDFS Write: 1774 SUCCESS

Total MapReduce CPU Time Spent: 0 msec

OK

1001  2019-10-11  300  1

1001  2019-10-12  200  2

1001  2019-10-13  100  3

1001  2019-10-15  100  4

1001  2019-10-16  300  5

1001  2019-10-17  150  6

1001  2019-10-18  340  7

1001  2019-10-19  360  8

1005  2019-10-11  400  1

1005  2019-10-12  200  2

1005  2019-10-15  600  3

1006  2019-10-11  350  1

1006  2019-10-13  250  2

1006  2019-10-14  300  3

1006  2019-10-15  400  4

1006  2019-10-16  200  5

1008  2019-10-13  500  1

1008  2019-10-14  600  2

1010  2019-10-14  600  1

1010  2019-10-15  500  2

2.根据编号,生成连续日期

select

    shopid,

    dt,

    sale,

    rn,

    date_sub(to_date(dt), rn)

from (

    select

        shopid,

        dt,

        sale,

        row_number() over(partition by shopid order by dt) as rn

    from t_jd

) tmp;

结果:

...

MapReduce Jobs Launched: 

Stage-Stage-1:  HDFS Read: 14576 HDFS Write: 1774 SUCCESS

Total MapReduce CPU Time Spent: 0 msec

OK

1001  2019-10-11  300  1  2019-10-10

1001  2019-10-12  200  2  2019-10-10

1001  2019-10-13  100  3  2019-10-10

1001  2019-10-15  100  4  2019-10-11

1001  2019-10-16  300  5  2019-10-11

1001  2019-10-17  150  6  2019-10-11

1001  2019-10-18  340  7  2019-10-11

1001  2019-10-19  360  8  2019-10-11

1005  2019-10-11  400  1  2019-10-10

1005  2019-10-12  200  2  2019-10-10

1005  2019-10-15  600  3  2019-10-12

1006  2019-10-11  350  1  2019-10-10

1006  2019-10-13  250  2  2019-10-11

1006  2019-10-14  300  3  2019-10-11

1006  2019-10-15  400  4  2019-10-11

1006  2019-10-16  200  5  2019-10-11

1008  2019-10-13  500  1  2019-10-12

1008  2019-10-14  600  2  2019-10-12

1010  2019-10-14  600  1  2019-10-13

1010  2019-10-15  500  2  2019-10-13

4 分组,求count

select

    shopid,

    count(1) as cnt

from (

    select

        shopid,

        dt,

        sale,

        rn,

        date_sub(to_date(dt), rn) as flag

    from (

        select

            shopid,

            dt,

            sale,

            row_number() over(partition by shopid order by dt) as rn

        from t_jd

    ) tmp

) tmp2

group by shopid, flag;

结果:

...

MapReduce Jobs Launched: 

Stage-Stage-1:  HDFS Read: 800 HDFS Write: 0 SUCCESS

Stage-Stage-2:  HDFS Read: 800 HDFS Write: 0 SUCCESS

Total MapReduce CPU Time Spent: 0 msec

OK

1001  3

1001  5

1005  2

1005  1

1006  1

1006  4

1008  2

1010  2

5 筛选出连续天数大于等于3的

select

    shopid

from (

    select

        shopid,

        count(1) as cnt

    from (

        select

            shopid,

            dt,

            sale,

            rn,

            date_sub(to_date(dt), rn) as flag

        from (

            select

                shopid,

                dt,

                sale,

                row_number() over(partition by shopid order by dt) as rn

            from t_jd

        ) tmp

    ) tmp2

    group by shopid, flag

) tmp3

where tmp3.cnt >= 3;

结果:

...

MapReduce Jobs Launched: 

Stage-Stage-1:  HDFS Read: 1600 HDFS Write: 0 SUCCESS

Stage-Stage-2:  HDFS Read: 1600 HDFS Write: 0 SUCCESS

Total MapReduce CPU Time Spent: 0 msec

OK

1001

1001

1006

6 去重

select

    distinct shopid

from (

    select

        shopid,

        count(1) as cnt

    from (

        select

            shopid,

            dt,

            sale,

            rn,

            date_sub(to_date(dt), rn) as flag

        from (

            select

                shopid,

                dt,

                sale,

                row_number() over(partition by shopid order by dt) as rn

            from t_jd

        ) tmp

    ) tmp2

    group by shopid, flag

) tmp3

where tmp3.cnt >= 3;

结果:

...

OK

1001

1006

推荐阅读:



HiveQL实战蚂蚁森林低碳用户排名分析:解法二



HiveQL实战蚂蚁森林低碳用户排名分析:解法一


HiveQL实战蚂蚁森林植物申领统计分析


深入理解Hive数据存储格式和压缩要点


Hive-函数

Hive-查询


Hive-DML(Data Manipulation Language)数据操作语言


Hive-DDL(Data Definition Language)数据定义

Hive优化(整理版)

Spark Core之Shuffle解析

数据仓库开发规范

欢迎关注我的公众号“

知了小巷

”,如果喜欢,麻烦点一下“
在看
”~