HQL练习某视频网站的常规TopN指标分析

点击关注上方“
知了小巷
”,

设为“置顶或星标”,第一时间送达干货。

某视频网站的TopN指标需求
1.统计视频观看数Top10
2.统计视频类别热度Top10
3.统计视频观看数Top20所属类别
4.统计视频观看数Top50所关联视频的所属类别Rank排名
5.统计每个类别中的视频热度Top10
6.统计每个类别中视频流量Top10
7.统计上传视频最多的用户Top10以及他们上传的视频
8.统计每个类别视频观看数Top10
指标分析所涉及的核心数据结构
1.视频表

字段 备注 详细描述
video_id 视频唯一id 11位字符串
uploader 视频上传者 上传视频的用户名String
age 视频年龄 视频在平台上的整数天
category 视频类别 上传视频指定的视频分类
length 视频长度 整形数字标识的视频长度
views 观看次数 视频被浏览的次数
rate 视频评分 满分5分
ratings 流量 视频的流量,整型数字
coments 评论数 一个视频的整数评论数
related_id 相关视频id 相关视频的id,最多20个

2.用户表

字段 备注 字段类型
uploader 上传者用户名 string
videos 上传视频数 int
friends 朋友数量 int

建表和数据准备
创建表(load原始数据)
ods_video_ori
ods_video_user_ori
创建表(默认TEXTFILE转到ORC格式)
ods_video_orc
ods_video_user_orc
ods_video_ori

create table ods_video_ori (

    video_id string, 

    uploader string, 

    age int, 

    category array, 

    length int, 

    views int,

    rate float,

    ratings int, 

    comments int,

    related_id array

) row format delimited 

fields terminated by "\t"

collection items terminated by "&"

stored as textfile;

ods_video_user_ori

create table ods_video_user_ori (

    uploader string,

    videos int,

    friends int

) row format delimited 

fields terminated by "\t" 

stored as textfile;

然后把原始数据插入到ORC表中
ods_video_orc

create table ods_video_orc (

    video_id string, 

    uploader string, 

    age int, 

    category array, 

    length int, 

    views int, 

    rate float, 

    ratings int, 

    comments int,

    related_id array

) row format delimited fields terminated by "\t" 

collection items terminated by "&" 

stored as orc;

ods_video_user_orc

create table ods_video_user_orc (

    uploader string,

    videos int,

    friends int

) row format delimited 

fields terminated by "\t" 

stored as orc;

导入数据
ods_video_ori(一个文件夹下多个文件)

hive> load data local inpath "/Users/xxx/Development/logs/video/" into table ods_video_ori;

Loading data to table default.ods_video_ori

OK

hive> select * from ods_video_ori limit 5; OK LKh7zAJ4nwo TheReceptionist 653 ["Entertainment"] 424 13021 4.34 1305 744 ["DjdA-5oKYFQ"] 7D0Mf4Kn4Xk periurban 583 ["Music"] 201 6508 4.19 687 312 ["e2k0h6tPvGc"] n1cEq1C8oqQ Pipistrello 525 ["Comedy"] 125 1687 4.01 363 141 ["eprHhmurMHg"] OHkEzL4Unck ichannel 638 ["Comedy"] 299 8043 4.4 518 371 ["eyUSTmEUQRg"] -boOvAGNKUc mrpitifulband 639 ["Music"] 287 7548 4.48 606 386 ["fmUwUURgsX0"] Time taken: 0.293 seconds, Fetched: 5 row(s)

ods_video_user_ori

hive> load data local inpath "/Users/xxx/Development/logs/user.txt" into table ods_video_user_ori;
hive> select * from ods_video_user_ori limit 5;

OK

barelypolitical 151 5106

bonk65  89  144

camelcars   26  674

cubskickass34   13  126

boydism08   32  50

Time taken: 2.855 seconds, Fetched: 5 row(s)

向ORC表插入数据

ods_video_orc

insert into table ods_video_orc select * from ods_video_ori;

ods_video_user_orc insert into table ods_video_user_orc select * from ods_video_user_ori;

业务分析

统计视频观看数Top10

思路:使用order by按照views字段做一个全局排序即可,同时我们设置只显示前10条。
【全局排序】
最终sql:

select 

    video_id, 

    uploader, 

    age, 

    category, 

    length, 

    views, 

    rate, 

    ratings, 

    comments 

from ods_video_orc 

order by views desc limit 10;

需要注意内存不够的话会报错:

Ended Job = job_local1559464187_0005 with errors

Error during job, obtaining debugging information...

FAILED: Execution Error, return code 2 from org.apache.hadoop.hive.ql.exec.mr.MapRedTask

统计视频类别热度Top10

思路:某类别下的视频越多则热度越高
1.即统计每个类别有多少个视频,显示出包含视频最多的前10个类别。
2.我们需要按照类别group by聚合,然后count组内的video_id个数即可。
3.因为当前表结构为:一个视频对应一个或多个类别。所以如果要group by类别,需要先将类别进行列转行(展开),然后再进行count即可。
4.最后按照热度排序,显示前10条。
【lateral view explode的使用】
最终sql:

select 

    category_name as category, 

    count(t1.video_id) as hot 

from (

    select

        video_id,

        category_name

    from ods_video_orc lateral view explode(category) t_catetory as category_name

) t1 

group by t1.category_name 

order by hot desc limit 10;

... MapReduce Jobs Launched: Stage-Stage-1: HDFS Read: 13812458 HDFS Write: 0 SUCCESS Stage-Stage-2: HDFS Read: 13812458 HDFS Write: 0 SUCCESS Total MapReduce CPU Time Spent: 0 msec OK Music 179049 Entertainment 127674 Comedy 87818 Film 73293 Animation 73293 Sports 67329 Games 59817 Gadgets 59817 Blogs 48890 People 48890 ...

统计出视频观看数最高的20个视频的所属类别以及类别包含Top20视频的个数

思路:
1.先找到观看数最高的20个视频所属条目的所有信息,降序排列
2.把这20条信息中的category分裂出来(列转行)
3.最后查询视频分类名称和该分类下有多少个Top20的视频
最终sql:

select 

    category_name as category, 

    count(t2.video_id) as hot_with_views 

from (

    select 

        video_id, 

        category_name

    from (

        select

            *

        from 

            ods_video_orc 

        order by 

            views 

        desc limit 

            20

    ) t1 lateral view explode(category) t_catetory as category_name

) t2 

group by category_name 

order by hot_with_views desc;

... MapReduce Jobs Launched: Stage-Stage-1: HDFS Read: 30471366 HDFS Write: 0 SUCCESS Stage-Stage-2: HDFS Read: 30471366 HDFS Write: 0 SUCCESS Stage-Stage-3: HDFS Read: 30471366 HDFS Write: 0 SUCCESS Total MapReduce CPU Time Spent: 0 msec OK Entertainment 6 Comedy 6 Music 5 People 2 Blogs 2 UNA 1 ...

统计视频观看数Top50所关联视频的所属类别排序

思路:
1.查询出观看数最多的前50个视频的所有信息(当然包含了每个视频对应的关联视频),记为临时表t1
t1:观看数前50的视频

select 

    * 

from ods_video_orc 

order by views desc limit 50;

2.将找到的50条视频信息的相关视频related_id列转行,记为临时表t2
t2:将相关视频的id进行列转换操作

select 

    explode(related_id) as videoId 

from t1;

3.将相关视频的id和ods_video_orc表进行inner join操作
t5:得到两列数据,一列是category,一列是之前查询出来的相关视频id

(

    select 

        distinct(t2.video_id), 

        t3.category 

    from t2 inner join ods_video_orc t3 on t2.videoId = t3.videoId

) t4 lateral view explode(category) t_catetory as category_name;

4.按照视频类别进行分组,统计每组视频个数,然后排行
最终sql:

select 

    category_name as category, 

    count(t5.video_id) as hot 

from (

    select 

        video_id, 

        category_name 

    from (

        select 

            distinct(t2.video_id), 

            t3.category 

        from (

            select 

                explode(related_id) as video_id 

            from (

                select 

                    * 

                from ods_video_orc order by views desc limit 50

            ) t1

        ) t2 inner join ods_video_orc t3 on t2.video_id = t3.video_id

    ) t4 lateral view explode(category) t_catetory as category_name

) t5

group by category_name order by hot desc;

统计每个类别中的视频热度Top10,以Music为例

思路:
1.要想统计Music类别中的视频热度Top10,需要先找到Music类别,那么就需要将category展开,所以可以创建一张表用于存放category_id展开的数据。
2.向category展开的表中插入数据。
3.统计对应类别(Music)中的视频热度。
最终sql:
创建表类别表:

create table ods_video_category(

    video_id string, 

    uploader string, 

    age int, 

    category_id string, 

    length int, 

    views int, 

    rate float, 

    ratings int, 

    comments int, 

    related_id array)

row format delimited 

fields terminated by "\t" 

collection items terminated by "&" 

stored as orc;

向类别表中插入数据:

insert into table ods_video_category  

select 

    video_id,

    uploader,

    age,

    category_id,

    length,

    views,

    rate,

    ratings,

    comments,

    related_id 

from ods_video_orc lateral view explode(category) catetory as category_id;

hive> select count(1) from ods_video_category; OK 1019206

统计Music类别的Top10(也可以统计其他)

select 

    video_id, 

    views

from ods_video_category 

where category_id = "Music" 

order by views desc limit 10;

统计每个类别中视频流量Top10,以Music为例

思路:
1.创建视频类别展开表(category_id列转行后的表)
2.按照ratings排序即可
最终sql:

select 

    video_id,

    views,

    ratings 

from ods_video_category 

where category_id = "Music" 

order by ratings desc limit 10;

统计上传视频最多的用户Top10以及他们上传的观看次数在前20的视频

思路:
1.先找到上传视频最多的10个用户的用户信息

select 

    * 

from ods_video_user_orc 

order by videos desc limit 10;

2.通过uploader字段与ods_video_orc表进行join,得到的信息按照views观看次数进行排序即可。
最终sql:

select 

    t2.video_id, 

    t2.views,

    t2.ratings,

    t1.videos,

    t1.friends 

from (

    select 

        * 

    from 

        ods_video_user_orc 

    order by videos desc 

    limit 10

) t1 join ods_video_orc t2 on t1.uploader = t2.uploader 

order by views desc limit 20;

统计每个类别视频观看数Top10

思路:
1.先得到category_id展开的表数据
2.子查询按照category_id进行分区,然后分区内排序,并生成递增数字,该递增数字这一列起名为rank列
3.通过子查询产生的临时表,查询rank值小于等于10的数据行即可。
最终sql:

select 

    t1.* 

from (

    select 

        video_id,

        category_id,

        views,

        row_number() over(partition by category_id order by views desc

    ) rank 

    from ods_video_category

) t1 

where rank <= 10;

推荐阅读:


Hive企业级调优


HiveQL查询连续三天有销售记录的店铺


HiveQL实战蚂蚁森林低碳用户排名分析:解法一


HiveQL实战蚂蚁森林低碳用户排名分析:解法二


HiveQL实战蚂蚁森林植物申领统计分析


Hive-函数


Hive-查询


Hive-DML(Data Manipulation Language)数据操作语言


Hive-DDL(Data Definition Language)数据定义


Hive优化(整理版)


Spark Core之Shuffle解析


数据仓库开发规范

欢迎关注我的公众号“

知了小巷

”,如果喜欢,可以点一下“
在看
”~
谢谢~~