1.[Hive]统计用户月访问量
统计每个用户 [月总访问次数],和 [从1月累计到当前月的总访问次数]
uid,date,vcnt
u01,2023-01-01,5
u02,2023-01-02,6
u02,2023-01-03,7
u02,2023-02-04,8
u03,2023-01-05,5
u04,2023-02-06,8
u01,2023-01-07,9
u01,2023-01-08,11
u02,2023-03-09,5
202,2023-04-10,2
统计每个用户 [月总访问次数]和[从1月累计到当前月的总访问次数]
uid,month,month_cnt,total_cnt
u01,2023-01,25,25
u01,2023-04,2,27
使用
substr
函数从日期字符串中提取年份和月份。使用
SUM
窗口函数计算每个用户每个月的访问总数以及从1月累计到当前月的总访问次数。
WITH monthly_data AS (
SELECT
uid,
CONCAT(substr(date, 1, 4), '-', substr(date, 6, 2)) as month,
SUM(vcnt) OVER (PARTITION BY uid, CONCAT(substr(date, 1, 4), '-', substr(date, 6, 2))) as month_cnt,
SUM(vcnt) OVER (PARTITION BY uid ORDER BY CONCAT(substr(date, 1, 4), '-', substr(date, 6, 2))
RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) as total_cnt
FROM your_table_name
)
SELECT
uid,
month,
month_cnt,
total_cnt
FROM (
SELECT
*,
ROW_NUMBER() OVER (PARTITION BY uid, month ORDER BY month DESC) as rn
FROM monthly_data
) tmp
WHERE rn = 1
ORDER BY uid, month;
在这个查询中:
我们首先创建了一个名为
monthly_data
的CTE(公用表表达式),它包含uid
,month
,month_cnt
(每月的访问次数),和total_cnt
(从1月累计到当前月的总访问次数)。我们使用
SUM
窗口函数来计算每个用户的month_cnt
和total_cnt
,并通过PARTITION BY
子句将数据分区到每个用户和每个月。然后,我们从CTE中选择数据,并使用
ROW_NUMBER()
函数消除重复的行,因为对于每个用户和每个月,我们只想在最终结果中有一行。最后,我们根据
uid
和month
对结果进行排序。
请注意,您需要将your_table_name
替换为您的实际表名。此外,这个查询假设date
字段是字符串类型,格式为'YYYY-MM-DD'。如果date
的格式不同,或者它是日期类型,您需要相应地调整substr
函数的使用。
2.[Hive]一月连续登陆问题
with build_order as (
select '1' as user_id,'2022-10-01' AS dt
UNION ALL
select '1' as user_id,'2022-10-02' AS dt
UNION ALL
select '1' as user_id,'2022-10-03' AS dt
UNION ALL
select '2' as user_id,'2022-10-01' AS dt
UNION ALL
select '2' as user_id,'2022-10-02' AS dt
UNION ALL
select '3' as user_id,'2022-10-01' AS dt
UNION ALL
select '3' as user_id,'2022-10-02' AS dt
UNION ALL
select '3' as user_id,'2022-10-03' AS dt
UNION ALL
select '4' as user_id,'2022-10-01' AS dt
UNION ALL
select '5' as user_id,'2022-10-01' AS dt
UNION ALL
select '5' as user_id,'2022-10-02' AS dt
UNION ALL
select '6' as user_id,'2022-10-01' AS dt
UNION ALL
select '6' as user_id,'2022-10-02' AS dt
UNION ALL
select '6' as user_id,'2022-10-03' AS dt
UNION ALL
select '6' as user_id,'2022-10-04' AS dt
UNION ALL
select '6' as user_id,'2022-10-05' AS dt
)
,
--记录(先排重),升序排序
data_order as (
select
user_id,
dt as sign_date,
rank() over(partition by user_id order by dt asc) as rk
from
build_order
)
,
--日期-排序序号,若连续打卡则结果相同
--连续行为则日期相同,用来分组
cal_data_sim as (
select
user_id,
sign_date,
date_sub(sign_date, rk) as sign_group
from
data_order
)
--统计每次连续的次数
select user_id,sign_group,
min(sign_date) as sign_start_date,
max(sign_date) as sign_end_date,
count(1) as days
from cal_data_sim
group by user_id,sign_group
;
连续N天登陆
with t1 as (
select
userid,
logintime,
--本次登陆日期的第三天
date_add(logintime,2) as nextday,
--按照用户id分区,按照登陆日期排序,取下下一次登陆时间,取不到就为0
lead(logintime,2,0) over (partition by userid order by logintime) as nextlogin
from tb_login )
select distinct userid from t1 where nextday = nextlogin;
3.[Hive]时间交叉问题
如下为平台商品促销数据表:test1,
字段为:自增id,品牌,打折开始日期,打折结束日期
id,brand,stt,edt
1,oppo,2021-06-05,2021-06-09
2,oppo,2021-06-11,2021-06-21
3,vivo,2021-06-05,2021-06-15
4,vivo,2021-06-09,2021-06-21
5,redmi,2021-06-05,2021-06-21
6,redmi,2021-06-09,2021-06-15
7,redmi,2021-06-17,2021-06-26
8,huawei,2021-06-05,2021-06-26
9,huawei,2021-06-09,2021-06-15
10,huawei,2021-06-17,2021-06-21
计算每个品牌总的打折销售天数,注意其中的交叉日期,比如 vivo 品牌,第一次活动时间为 2021-06-05 到 2021-06-15,第二次活动时间为 2021-06-09 到 2021-06-21 其中 9 号到 15号为重复天数,只统计一次,即 vivo 总打折天数为 2021-06-05 到 2021-06-21 共计 17 天。
select
brand,
1 + sum(if(mdt >= edt,0,if(stt >= mdt, datediff(edt, stt), datediff(edt, mdt)))) as active_days
from
(
select
brand,
stt,
edt,
nvl(max(edt) over(partition by brand order by stt rows between unbounded preceding and 1 preceding),stt) as mdt
from
b
) t1
group by
brand
4.[Java]中文金额转阿拉伯金额
# constants for chinese_to_arabic
# author: kean
CN_NUM = {
'〇': 0, '一': 1, '二': 2, '三': 3, '四': 4, '五': 5, '六': 6, '七': 7, '八': 8, '九': 9, '零': 0,
'壹': 1, '贰': 2, '叁': 3, '肆': 4, '伍': 5, '陆': 6, '柒': 7, '捌': 8, '玖': 9, '貮': 2, '两': 2,
}
CN_UNIT = {
'十': 10,
'拾': 10,
'百': 100,
'佰': 100,
'千': 1000,
'仟': 1000,
'万': 10000,
'萬': 10000,
'亿': 100000000,
'億': 100000000,
'兆': 1000000000000,
}
def chinese_to_arabic(cn: str) -> int:
unit = 0 # current
ldig = [] # digest
for cndig in reversed(cn):
if cndig in CN_UNIT:
unit = CN_UNIT.get(cndig)
if unit == 10000 or unit == 100000000:
ldig.append(unit)
unit = 1
else:
dig = CN_NUM.get(cndig)
if unit:
dig *= unit
unit = 0
ldig.append(dig)
if unit == 10:
ldig.append(10)
val, tmp = 0, 0
for x in reversed(ldig):
if x == 10000 or x == 100000000:
val += tmp * x
tmp = 0
else:
tmp += x
val += tmp
return val
# TODO: make a full unittest
def test():
test_dig = ['八',
'十一',
'一百二十三',
'一千二百零三',
'一万一千一百零一',
'十万零三千六百零九',
'一百二十三万四千五百六十七',
'一千一百二十三万四千五百六十七',
'一亿一千一百二十三万四千五百六十七',
'一百零二亿五千零一万零一千零三十八']
for cn in test_dig:
x = chinese_to_arabic(cn)
print(cn, x)
# assert x == 10250011038
if __name__ == '__main__':
test()
5.[Hive]累积金额问题
Function (arg1,…, argn) 可以是下面的函数
Aggregate Functions: 聚合函数,比如:sum(…)、 max(…)、min(…)、avg(…)等.
Sort Functions: 数据排序函数, 比如 :rank(…)、row_number(…)等.
Analytics Functions: 统计和比较函数, 比如:lead(…)、lag(…)、 first_value(…)等.
OVER ([PARTITION BY <…>] [ORDER BY <….>] 其中包括以下可选项
PARTITION BY 表示将数据先按 字段 进行分区
ORDER BY 表示将各个分区内的数据按 排序字段 进行排序
window_expression 窗口边界
名词 | 含义 |
---|---|
preceding | 往前 |
following | 往后 |
current row | 当前行 |
unbounded | 起点 |
unbounded preceding | 从前面的起点 |
unbounded following | 到后面的终点 |
窗口边界使用详解
如果不指定 PARTITION BY,则不对数据进行分区,换句话说,所有数据看作同一个分区;
如果不指定 ORDER BY,则不对各分区做排序,通常用于那些顺序无关的窗口函数,例如 SUM()
如果不指定窗口子句,则默认采用以下的窗口定义:
a、若不指定 ORDER BY,默认使用分区内所有行 ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING.
b、若指定了 ORDER BY,默认使用分区内第一行到当前值 ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW.
6.[Hive]相互关注的人
用户相互关注表结构:id,from_user,to_user 分别表示唯一id,本用户,关注的用户,数据为样式为:
1,A,B
2,B,A
3,C,B
查找相互关注用户逻辑
-- 处理相互关注的前后问题,处理使feature值一样
WITH concat_user AS (
select
from_user,
to_user,
if(hash(from_user) > hash(to_user),concat(from_user, to_user),concat(to_user, from_user)) as feather
from
table_user
)
-- 通过concat查找关注关系
select
from_user,
to_user
from
concat_user
where
if(sum(1) over(partition by feather) > 1, 1, 0) = 1;
7.[Java]合并两个有序链表
public static Node mergeTwoOrderlyNode(Node node1, Node node2) {
Node newNode = new Node(-1);
Node temp = newNode;
while (node1 != null && node2 != null) {
if(node1.val< node2.val){
temp.next = node1;
node1 = node1.next;
}else {
temp.next = node2;
node2 = node2.next;
}
temp = temp.next;
}
if (node1==null){
temp.next = node1;
}
if (node2==null){
temp.next = node2;
}
return newNode.next;
}
8.[Hive]查询销售额排行的前、后3名产品ID
select
pid,
ROW_NUMBER() over (order by sum(amount)) rk1,
ROW_NUMBER() over (order by sum(amount) desc) rk2
from
pro
group by
pid
9.[Hive]统计在线高峰期人数及时间段
with
t2 as
(
select at_time
,lead(at_time, 1, at_time) over(order by at_time) lead_time
,sum(if(type = 'login', 1, -1)) over(order by at_time rows between unbounded preceding and current row) online_ct
from (
select id,start_time at_time,'login' type from player
union all
select id,end_time at_time,'logout' type from player
order by at_time
) t1
)
select at_time,lead_time,online_ct
from t2
where t2.online_ct in (select max(online_ct) max_ct from t2)
10.[Java]股票最大交易额
public static int maxProfit(int[] prices) {
//1. minPrice记录股票历史最低值
int minPrice = prices[0];
//2. maxProfit记录股票最大收益
int maxProfit = 0;
//3. 遍历股票prices,动态修改minPrice和maxProfit
int buying = 0;
// 买入时间
int sale = 0;
// 卖出时间
int editIndex = 0;
// 记录天数
for (int price : prices) {
editIndex+=1;
//4. 如果当天的price小于minPrice,则修改minPrice;
if (price < minPrice) {
minPrice = price;
buying = editIndex;
}
//5. 如果当天的收益大于maxProfit,则修改maxProfit
int profit = price - minPrice;
if (profit > maxProfit) {
maxProfit = profit;
sale = editIndex;
}
}
return maxProfit;
}
11.[Hive]连续重复数据
-- 加载数据前一行和后一行处理
with cal_lad_lead_tab as (
select `timestamp`,id,grp,lag(grp) OVER w as prev_grp,lead(grp) OVER w as next_grp from df WINDOW w AS ( ORDER BY Timestamp)
),
-- 计算相同数据
cal_alike as (
select Timestamp
,id
,grp
,sum(case
when (grp = prev_grp and grp = next_grp) then 0
when (grp = prev_grp and grp != next_grp) then 1
when (grp = prev_grp and (grp != next_grp or next_grp is null)) then 0 else 1
over(order by `timestamp` rows between UNBOUNDED PRECEDING and CURRENT ROW)+1) as event_seq
from cal_lad_lead_tab order by `timestamp`
),
-- 查询最终数据
select id,collect_list(grp) as cli_grp from cal_lad_lead_tab group by event_seq,id;
12.[Java]两个链表之和
public ListNode addTwoNumbers(ListNode l1, ListNode l2) {
int carry = 0;//表示进位
ListNode head = null, tail = null;
while (l1 != null || l2 != null) {
//如果其中有一个到达结尾了,那么这个链表这一位的的数字就为0。
int x = l1 == null ? 0 : l1.val;
int y = l2 == null ? 0 : l2.val;
int sum = x + y + carry;
if (head == null) {//第一次插入
head = tail = new ListNode(sum % 10);
} else {
tail.next = new ListNode(sum % 10);
tail = tail.next;
}
carry = sum / 10;
if (l1 != null) {
l1 = l1.next;
}
if (l2 != null) {
l2 = l2.next;
}
}
if (carry == 1) tail.next = new ListNode(carry);
return head;
}
13.[Java]最长连续序列
【动态规划】O(n) 复杂度。
用哈希表存储每个端点值对应连续区间的长度
若数已在哈希表中:跳过不做处理
若是新数加入:
取出其左右相邻数已有的连续区间长度 left 和 right
计算当前数的区间长度为:cur_length = left + right + 1
根据 cur_length 更新最大长度 max_length 的值
更新区间两端点的长度值
class Solution(object):
def longestConsecutive(self, nums):
hash_dict = dict()
max_length = 0
for num in nums:
if num not in hash_dict:
left = hash_dict.get(num - 1, 0)
right = hash_dict.get(num + 1, 0)
cur_length = 1 + left + right
if cur_length > max_length:
max_length = cur_length
hash_dict[num] = cur_length
hash_dict[num - left] = cur_length
hash_dict[num + right] = cur_length
return max_length
14、[Hive]行转列 列转行
有数据表t1有以下数据内容
user_name course score
user1 语文 89
user1 数学 92
user1 英语 87
user2 语文 96
user2 数学 84
user2 英语 99
1.将course, score多行转一列
用户 课程 分数
user1 | 语文,数学,英语 | 89,92,87
user2 | 语文,数学,英语 | 89,92,87
2. 将上面的结果再还原回去
15、[Hive]连续三天登陆的用户
user_id,login_date
1,2021-10-01
2,2021-10-03
1,2021-10-02
1,2021-10-03
2,2021-10-06
2,2021-10-07
2,2021-10-08
3,2021-10-08
求连续3天登录的所有用户个数
16、[Hive]说出答案并说说为什么
表t1中字段a数据如下:
null
111
222
333
问题一:
select sum(a), avg(a), min(a), max(a), count(a), count(1), count(*) from t1;
问题二:
select 1 + null + 2;