11.4.4 统计视频观看数Top50所关联视频的所属类别Rank
目标:
- 找到 top50 视频的关联视频 id, 然后对这些关联视频的类别分组统计视频的个数, 然后排序 Rank
分析:
- 找到 top50 视频: 只相关视频就可以了. 记为t1
- 炸裂相关视频id. 记为表t2
- t2与原表 join 查找到每个相关视频的类别. 记为表t3
- 炸裂类别, 按照类别分类,统计每个类别出现的次数
- 添加rank
- 找到 top50 视频: 只view和相关视频就可以了. // t1
select
views,
relatedId
from gulivideo_orc
order by views desc
limit 50; // t1
- 炸裂相关视频id. 记为表t1
select
distinct(videoId)
from t1
lateral view explode(relatedId) temp as videoId; // t2
- t2与原表 join 查找到每个相关视频的类别. 记为表t3
select
t2.videoId,
gulivideo_orc.category
from t2 join gulivideo_orc
on t2.videoId = gulivideo_orc.videoId; // t3
- 炸裂类别, 统计每个类别出现的次数
select
category_name,
count(*) num
from t3
lateral view explode(category) tmp as category_name
group by category_name; // t4
- 添加 rank 排名
select
category_name,
num,
rank() over(order by num)
from t4;
合并成一个语句:
select
category_name,
num,
rank() over(order by num desc)
from(
select
category_name,
count(*) num
from(
select
t2.videoId,
gulivideo_orc.category
from (
select
distinct(videoId)
from (
select
views,
relatedId
from gulivideo_orc
order by views desc
limit 50
) t1
lateral view explode(relatedId) temp as videoId
)t2 join gulivideo_orc
on t2.videoId = gulivideo_orc.videoId
)t3
lateral view explode(category) tmp as category_name
group by category_name
)t4;
+----------------+------+----------------+
| category_name | num | rank_window_0 |
+----------------+------+----------------+
| Comedy | 232 | 1 |
| Entertainment | 216 | 2 |
| Music | 195 | 3 |
| Blogs | 51 | 4 |
| People | 51 | 4 |
| Film | 47 | 6 |
| Animation | 47 | 6 |
| News | 22 | 8 |
| Politics | 22 | 8 |
| Games | 20 | 10 |
| Gadgets | 20 | 10 |
| Sports | 19 | 12 |
| Howto | 14 | 13 |
| DIY | 14 | 13 |
| UNA | 13 | 15 |
| Places | 12 | 16 |
| Travel | 12 | 16 |
| Animals | 11 | 18 |
| Pets | 11 | 18 |
| Autos | 4 | 20 |
| Vehicles | 4 | 20 |
+----------------+------+----------------+
21 rows selected (100.953 seconds)
注意:
如果抛异常: Java heap space
在 yarn-site.xml
添加如下配置
<property>
<name>yarn.scheduler.maximum-allocation-mb</name>
<value>2048</value>
</property>
<property>
<name>yarn.scheduler.minimum-allocation-mb</name>
<value>2048</value>
</property>
<property>
<name>yarn.nodemanager.vmem-pmem-ratio</name>
<value>2.1</value>
</property>
<property>
<name>mapred.child.java.opts</name>
<value>-Xmx1024m</value>
</property>