1.[Hive]统计用户月访问量

统计每个用户 [月总访问次数],和 [从1月累计到当前月的总访问次数]

uid,date,vcnt

u01,2023-01-01,5

u02,2023-01-02,6

u02,2023-01-03,7

u02,2023-02-04,8

u03,2023-01-05,5

u04,2023-02-06,8

u01,2023-01-07,9

u01,2023-01-08,11

u02,2023-03-09,5

202,2023-04-10,2

统计每个用户 [月总访问次数]和[从1月累计到当前月的总访问次数]

uid,month,month_cnt,total_cnt

u01,2023-01,25,25

u01,2023-04,2,27
  1. 使用substr函数从日期字符串中提取年份和月份。

  2. 使用SUM窗口函数计算每个用户每个月的访问总数以及从1月累计到当前月的总访问次数。

WITH monthly_data AS (
    SELECT 
        uid, 
        CONCAT(substr(date, 1, 4), '-', substr(date, 6, 2)) as month, 
        SUM(vcnt) OVER (PARTITION BY uid, CONCAT(substr(date, 1, 4), '-', substr(date, 6, 2))) as month_cnt,
        SUM(vcnt) OVER (PARTITION BY uid ORDER BY CONCAT(substr(date, 1, 4), '-', substr(date, 6, 2)) 
                        RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) as total_cnt
    FROM your_table_name
)
SELECT 
    uid, 
    month, 
    month_cnt, 
    total_cnt 
FROM (
    SELECT 
        *,
        ROW_NUMBER() OVER (PARTITION BY uid, month ORDER BY month DESC) as rn 
    FROM monthly_data
) tmp 
WHERE rn = 1 
ORDER BY uid, month;

在这个查询中:

  • 我们首先创建了一个名为monthly_data的CTE(公用表表达式),它包含uidmonthmonth_cnt(每月的访问次数),和total_cnt(从1月累计到当前月的总访问次数)。

  • 我们使用SUM窗口函数来计算每个用户的month_cnttotal_cnt,并通过PARTITION BY子句将数据分区到每个用户和每个月。

  • 然后,我们从CTE中选择数据,并使用ROW_NUMBER()函数消除重复的行,因为对于每个用户和每个月,我们只想在最终结果中有一行。

  • 最后,我们根据uidmonth对结果进行排序。

请注意,您需要将your_table_name替换为您的实际表名。此外,这个查询假设date字段是字符串类型,格式为'YYYY-MM-DD'。如果date的格式不同,或者它是日期类型,您需要相应地调整substr函数的使用。

2.[Hive]一月连续登陆问题

with build_order as (
    select '1' as user_id,'2022-10-01' AS dt
    UNION ALL 
    select '1' as user_id,'2022-10-02' AS dt
    UNION ALL 
    select '1' as user_id,'2022-10-03' AS dt
    UNION ALL 
    select '2' as user_id,'2022-10-01' AS dt
    UNION ALL 
    select '2' as user_id,'2022-10-02' AS dt
    UNION ALL 
    select '3' as user_id,'2022-10-01' AS dt
    UNION ALL 
    select '3' as user_id,'2022-10-02' AS dt
    UNION ALL 
    select '3' as user_id,'2022-10-03' AS dt
    UNION ALL 
    select '4' as user_id,'2022-10-01' AS dt
    UNION ALL 
    select '5' as user_id,'2022-10-01' AS dt
    UNION ALL 
    select '5' as user_id,'2022-10-02' AS dt
    UNION ALL 
    select '6' as user_id,'2022-10-01' AS dt
    UNION ALL 
    select '6' as user_id,'2022-10-02' AS dt
    UNION ALL 
    select '6' as user_id,'2022-10-03' AS dt
    UNION ALL 
    select '6' as user_id,'2022-10-04' AS dt
    UNION ALL 
    select '6' as user_id,'2022-10-05' AS dt
    )
    ,
    --记录(先排重),升序排序 
    data_order as (
    select
        user_id,
        dt as sign_date,
        rank() over(partition by user_id order by dt asc) as rk
    from
        build_order
    )
    ,
    --日期-排序序号,若连续打卡则结果相同 
    --连续行为则日期相同,用来分组
    cal_data_sim as (
    select
        user_id,
        sign_date,
        date_sub(sign_date, rk) as sign_group
    from
        data_order
    )
    
--统计每次连续的次数 
select user_id,sign_group,
       min(sign_date) as sign_start_date,
       max(sign_date) as sign_end_date,
       count(1) as days 
from  cal_data_sim
group by user_id,sign_group 
;


连续N天登陆
with t1 as (
select
  userid,
  logintime,
  --本次登陆日期的第三天
  date_add(logintime,2) as nextday,
  --按照用户id分区,按照登陆日期排序,取下下一次登陆时间,取不到就为0
  lead(logintime,2,0) over (partition by userid order by logintime) as nextlogin
from tb_login )
select distinct userid from t1 where nextday = nextlogin;

3.[Hive]时间交叉问题

如下为平台商品促销数据表:test1,

字段为:自增id,品牌,打折开始日期,打折结束日期

id,brand,stt,edt

1,oppo,2021-06-05,2021-06-09

2,oppo,2021-06-11,2021-06-21

3,vivo,2021-06-05,2021-06-15

4,vivo,2021-06-09,2021-06-21

5,redmi,2021-06-05,2021-06-21

6,redmi,2021-06-09,2021-06-15

7,redmi,2021-06-17,2021-06-26

8,huawei,2021-06-05,2021-06-26

9,huawei,2021-06-09,2021-06-15

10,huawei,2021-06-17,2021-06-21

计算每个品牌总的打折销售天数,注意其中的交叉日期,比如 vivo 品牌,第一次活动时间为 2021-06-05 到 2021-06-15,第二次活动时间为 2021-06-09 到 2021-06-21 其中 9 号到 15号为重复天数,只统计一次,即 vivo 总打折天数为 2021-06-05 到 2021-06-21 共计 17 天。

select
    brand,
    1 + sum(if(mdt >= edt,0,if(stt >= mdt, datediff(edt, stt), datediff(edt, mdt)))) as active_days
from
    (
        select
            brand,
            stt,
            edt,
            nvl(max(edt) over(partition by brand order by stt rows between unbounded preceding and 1 preceding),stt) as mdt
        from
            b
    ) t1
group by
    brand

4.[Java]中文金额转阿拉伯金额

# constants for chinese_to_arabic
# author: kean
CN_NUM = {
    '〇': 0, '一': 1, '二': 2, '三': 3, '四': 4, '五': 5, '六': 6, '七': 7, '八': 8, '九': 9, '零': 0,
    '壹': 1, '贰': 2, '叁': 3, '肆': 4, '伍': 5, '陆': 6, '柒': 7, '捌': 8, '玖': 9, '貮': 2, '两': 2,
}

CN_UNIT = {
    '十': 10,
    '拾': 10,
    '百': 100,
    '佰': 100,
    '千': 1000,
    '仟': 1000,
    '万': 10000,
    '萬': 10000,
    '亿': 100000000,
    '億': 100000000,
    '兆': 1000000000000,
}


def chinese_to_arabic(cn: str) -> int:
    unit = 0  # current
    ldig = []  # digest
    for cndig in reversed(cn):
        if cndig in CN_UNIT:
            unit = CN_UNIT.get(cndig)
            if unit == 10000 or unit == 100000000:
                ldig.append(unit)
                unit = 1
        else:
            dig = CN_NUM.get(cndig)
            if unit:
                dig *= unit
                unit = 0
            ldig.append(dig)
    if unit == 10:
        ldig.append(10)
    val, tmp = 0, 0
    for x in reversed(ldig):
        if x == 10000 or x == 100000000:
            val += tmp * x
            tmp = 0
        else:
            tmp += x
    val += tmp
    return val


# TODO: make a full unittest
def test():
    test_dig = ['八',
                '十一',
                '一百二十三',
                '一千二百零三',
                '一万一千一百零一',
                '十万零三千六百零九',
                '一百二十三万四千五百六十七',
                '一千一百二十三万四千五百六十七',
                '一亿一千一百二十三万四千五百六十七',
                '一百零二亿五千零一万零一千零三十八']
    for cn in test_dig:
        x = chinese_to_arabic(cn)
        print(cn, x)
        # assert x == 10250011038


if __name__ == '__main__':
    test()

5.[Hive]累积金额问题

Function (arg1,…, argn) 可以是下面的函数

Aggregate Functions: 聚合函数,比如:sum(…)、 max(…)、min(…)、avg(…)等.
Sort Functions: 数据排序函数, 比如 :rank(…)、row_number(…)等.
Analytics Functions: 统计和比较函数, 比如:lead(…)、lag(…)、 first_value(…)等.

OVER ([PARTITION BY <…>] [ORDER BY <….>] 其中包括以下可选项

PARTITION BY 表示将数据先按 字段 进行分区
ORDER BY 表示将各个分区内的数据按 排序字段 进行排序

window_expression 窗口边界

名词

含义

preceding

往前

following

往后

current row

当前行

unbounded

起点

unbounded preceding

从前面的起点

unbounded following

到后面的终点

窗口边界使用详解

  1. 如果不指定 PARTITION BY,则不对数据进行分区,换句话说,所有数据看作同一个分区;

  2. 如果不指定 ORDER BY,则不对各分区做排序,通常用于那些顺序无关的窗口函数,例如 SUM()

  3. 如果不指定窗口子句,则默认采用以下的窗口定义:

a、若不指定 ORDER BY,默认使用分区内所有行 ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING.

b、若指定了 ORDER BY,默认使用分区内第一行到当前值 ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW.

6.[Hive]相互关注的人

用户相互关注表结构:id,from_user,to_user 分别表示唯一id,本用户,关注的用户,数据为样式为:

1,A,B

2,B,A

3,C,B

查找相互关注用户逻辑


-- 处理相互关注的前后问题,处理使feature值一样
WITH concat_user AS (
select
    from_user,
    to_user,
    if(hash(from_user) > hash(to_user),concat(from_user, to_user),concat(to_user, from_user)) as feather
from
    table_user 
)

-- 通过concat查找关注关系
select
    from_user,
    to_user
from
    concat_user
where
    if(sum(1) over(partition by feather) > 1, 1, 0) = 1;

7.[Java]合并两个有序链表

    public static Node mergeTwoOrderlyNode(Node node1, Node node2) {
        Node newNode = new Node(-1);
        Node temp = newNode;
        while (node1 != null && node2 != null) {
            if(node1.val< node2.val){
                temp.next = node1;
                node1 = node1.next;
            }else {
                temp.next = node2;
                node2 = node2.next;
            }
            temp = temp.next;
        }
        if (node1==null){
            temp.next = node1;
        }
        if (node2==null){
            temp.next = node2;
        }
        return newNode.next;
    }

8.[Hive]查询销售额排行的前、后3名产品ID

select
    pid,
    ROW_NUMBER() over (order by sum(amount)) rk1,
    ROW_NUMBER() over (order by sum(amount) desc) rk2
from
    pro
group by
    pid

9.[Hive]统计在线高峰期人数及时间段

with
t2 as 
(
    select  at_time
            ,lead(at_time, 1, at_time) over(order by at_time) lead_time
            ,sum(if(type = 'login', 1, -1)) over(order by at_time rows between unbounded preceding and current row) online_ct
    from    (
        select  id,start_time at_time,'login' type from    player
        union all 
        select  id,end_time at_time,'logout' type from    player
        order by at_time
    ) t1
)

select  at_time,lead_time,online_ct
from    t2
where   t2.online_ct in (select max(online_ct) max_ct from t2)

10.[Java]股票最大交易额

    public static int maxProfit(int[] prices) {
        //1. minPrice记录股票历史最低值
        int minPrice = prices[0];
        //2. maxProfit记录股票最大收益
        int maxProfit = 0;
        //3. 遍历股票prices,动态修改minPrice和maxProfit
        int buying = 0;
        // 买入时间
        int sale = 0;
        // 卖出时间
        int editIndex = 0;
        // 记录天数
        for (int price : prices) {
            editIndex+=1;
            //4. 如果当天的price小于minPrice,则修改minPrice;
            if (price < minPrice) {
                minPrice = price;
                buying = editIndex;
            }
            //5. 如果当天的收益大于maxProfit,则修改maxProfit
            int profit = price - minPrice;
            if (profit > maxProfit) {
                maxProfit = profit;
                sale = editIndex;
            }
        }
        return maxProfit;
    }

11.[Hive]连续重复数据

-- 加载数据前一行和后一行处理
with cal_lad_lead_tab as (
select `timestamp`,id,grp,lag(grp) OVER w as prev_grp,lead(grp) OVER w as next_grp  from df  WINDOW w AS ( ORDER BY Timestamp)
),
-- 计算相同数据
cal_alike as (
select Timestamp
       ,id
       ,grp
       ,sum(case 
        when (grp = prev_grp and grp = next_grp) then 0
        when (grp = prev_grp and grp != next_grp) then 1
        when (grp = prev_grp and (grp != next_grp or next_grp is null)) then 0 else 1
        over(order by `timestamp` rows between UNBOUNDED PRECEDING and CURRENT ROW)+1) as event_seq
       from cal_lad_lead_tab order by `timestamp`
),
-- 查询最终数据
select id,collect_list(grp) as cli_grp from cal_lad_lead_tab group by event_seq,id;

12.[Java]两个链表之和

public ListNode addTwoNumbers(ListNode l1, ListNode l2) {
        int carry = 0;//表示进位
        ListNode head = null, tail = null;

        while (l1 != null || l2 != null) {
            //如果其中有一个到达结尾了,那么这个链表这一位的的数字就为0。
            int x = l1 == null ? 0 : l1.val;
            int y = l2 == null ? 0 : l2.val;
            int sum = x + y + carry;

            if (head == null) {//第一次插入
                head = tail = new ListNode(sum % 10);
            } else {
                tail.next = new ListNode(sum % 10);
                tail = tail.next;
            }
            carry = sum / 10;
            if (l1 != null) {
                l1 = l1.next;
            }
            if (l2 != null) {
                l2 = l2.next;
            }
        }
        if (carry == 1) tail.next = new ListNode(carry);
        return head;
    }

13.[Java]最长连续序列

【动态规划】O(n) 复杂度。

  • 用哈希表存储每个端点值对应连续区间的长度

  • 若数已在哈希表中:跳过不做处理

  • 若是新数加入:

    • 取出其左右相邻数已有的连续区间长度 left 和 right

    • 计算当前数的区间长度为:cur_length = left + right + 1

    • 根据 cur_length 更新最大长度 max_length 的值

    • 更新区间两端点的长度值

class Solution(object):
    def longestConsecutive(self, nums):
        hash_dict = dict()
        max_length = 0
        for num in nums:
            if num not in hash_dict:
                left = hash_dict.get(num - 1, 0)
                right = hash_dict.get(num + 1, 0)
                cur_length = 1 + left + right
                if cur_length > max_length:
                    max_length = cur_length
                hash_dict[num] = cur_length
                hash_dict[num - left] = cur_length
                hash_dict[num + right] = cur_length
        return max_length

14、[Hive]行转列 列转行

有数据表t1有以下数据内容

user_name course score 
user1 语文 89
user1 数学 92
user1 英语 87
user2 语文 96
user2 数学 84
user2 英语 99

1.将course, score多行转一列

用户 课程 分数
user1 | 语文,数学,英语 | 89,92,87
user2 | 语文,数学,英语 | 89,92,87

2. 将上面的结果再还原回去

15、[Hive]连续三天登陆的用户

user_id,login_date
1,2021-10-01
2,2021-10-03
1,2021-10-02
1,2021-10-03
2,2021-10-06
2,2021-10-07
2,2021-10-08
3,2021-10-08

求连续3天登录的所有用户个数

16、[Hive]说出答案并说说为什么

表t1中字段a数据如下:

null
111
222
333

问题一:

select sum(a), avg(a), min(a), max(a), count(a), count(1), count(*) from t1;

问题二:

select 1 + null + 2;