Hive Windowing and Analytics Function - Part 3
1.1 订单表统计
已知一个表orders,有如下字段:dt(date),order_id,user_id,amount。
数据样例:2021-01-01,10029028,1000003251,33.57。
1.1.1 给出 2021年每个月的订单数、用户数、总成交金额
思路:
先根据日期计算月份,再根据月份分组统计:
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
0: jdbc:hive2://0.0.0.0:10000> select month,
. . . . . . . . . . . . . . .> count(order_id) as order_count,
. . . . . . . . . . . . . . .> count(distinct(user_id)) as user_count,
. . . . . . . . . . . . . . .> sum(amount) as order_amount
. . . . . . . . . . . . . . .> from ( select order_id, user_id, amount, dt,
. . . . . . . . . . . . . . .> from_unixtime(unix_timestamp(dt, 'yyyy-MM-dd'),'yyyy-MM') as month from orders
. . . . . . . . . . . . . . .> ) t1 group by month;
...
INFO : OK
INFO : Concurrency mode is disabled, not creating a lock manager
+----------+--------------+-------------+---------------+
| month | order_count | user_count | order_amount |
+----------+--------------+-------------+---------------+
| 2018-11 | 1 | 1 | 234.00 |
| 2021-01 | 3 | 2 | 100.71 |
| 2021-02 | 3 | 2 | 100.71 |
| 2021-11 | 1 | 1 | 234.00 |
+----------+--------------+-------------+---------------+
4 rows selected (12.162 seconds)
1.1.2 给出2021年11月的新客数(指在11月才有第一笔订单)
思路:
- 1,根据日期转换月份;
- 2,按照用户分区,并按日期排序;
- 3,分区内第一行月份是2021-11的用户即为新用户;
1 2 3 4 5 6 7 8 9 10 11 12 13 14
0: jdbc:hive2://0.0.0.0:10000> select user_id, month, dt,rank from ( . . . . . . . . . . . . . . .> select user_id, dt,month, . . . . . . . . . . . . . . .> row_number() over(partition by user_id order by dt asc) rank from ( . . . . . . . . . . . . . . .> select order_id, user_id, amount, dt, from_unixtime(unix_timestamp(`dt`, 'yyyy-MM-dd'),'yyyy-MM') as month from orders . . . . . . . . . . . . . . .> )t1 . . . . . . . . . . . . . . .> )t2 where rank=1 and month='2021-11'; INFO : OK INFO : Concurrency mode is disabled, not creating a lock manager +------------+----------+-------------+-------+ | user_id | month | dt | rank | +------------+----------+-------------+-------+ | 100003253 | 2021-11 | 2021-11-02 | 1 | +------------+----------+-------------+-------+ 1 row selected (12.425 seconds)
1.2 用户年龄段电影偏好统计
需求:
有一个5000万的用户文件(字段:user_id,name,age)
一个2亿记录的用户看电影的记录文件(字段:user_id,url)
根据年龄段观看电影的次数进行排序?
思路:
1、表1中根据年龄计算年龄段;
2、表2中计算相同用户的观看次数;
3、join前两步结果,得到年龄段、观看次数;
4、再次合并相同年龄段的观看次数、排序;
注意:要先计算相同用户的观看次数再做join,大幅度降低结果积的大小。
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
0: jdbc:hive2://0.0.0.0:10000> select t1.age_phase, sum(t2.cnt) as count from
. . . . . . . . . . . . . . .> (select user_id, case
. . . . . . . . . . . . . . .> when age < 10 then '<10'
. . . . . . . . . . . . . . .> when age < 20 then '11-20'
. . . . . . . . . . . . . . .> when age < 30 then '21-30'
. . . . . . . . . . . . . . .> when age < 40 then '31-40'
. . . . . . . . . . . . . . .> when age < 50 then '41-50'
. . . . . . . . . . . . . . .> else '>50'
. . . . . . . . . . . . . . .> end as age_phase from test4user) t1
. . . . . . . . . . . . . . .> join
. . . . . . . . . . . . . . .> (select user_id, count(url) as cnt from test4log group by user_id) t2
. . . . . . . . . . . . . . .> on t1.user_id=t2.user_id
. . . . . . . . . . . . . . .> group by age_phase order by count desc;
...
INFO : OK
+---------------+--------+
| t1.age_phase | count |
+---------------+--------+
| 11-20 | 22 |
| 21-30 | 7 |
| 41-50 | 4 |
| 31-40 | 3 |
| >50 | 2 |
+---------------+--------+
1.3 用户首次购买的金额
需求:
请用sql写出所有用户中在今年10月份第一次购买商品的金额,
表orders1字段:
userid,money,paymenttime(格式:2021-10-01),orderid
思路:
1、先将付款日期转换为月份;
2、窗口函数以用户分区、按支付日期排序;
3、过滤出10月份的第一个订单
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
0: jdbc:hive2://0.0.0.0:10000> select * from (
. . . . . . . . . . . . . . .> select *, row_number() over(partition by userid order by paymenttime asc) rank from (
. . . . . . . . . . . . . . .> select *, substr(paymenttime, 0,7) as payment_month from orders1
. . . . . . . . . . . . . . .> ) t1
. . . . . . . . . . . . . . .> )t2 where rank=1 and payment_month='2021-10';
...
INFO : OK
+------------+-----------+-----------------+-------------+-------------------+----------+
| t2.userid | t2.money | t2.paymenttime | t2.orderid | t2.payment_month | t2.rank |
+------------+-----------+-----------------+-------------+-------------------+----------+
| 001 | 100.00 | 2021-10-01 | 123 | 2021-10 | 1 |
| 002 | 500.00 | 2021-10-01 | 125 | 2021-10 | 1 |
+------------+-----------+-----------------+-------------+-------------------+----------+
2 rows selected (13.927 seconds)