面试-笔试题集锦

1.[Hive]统计用户月访问量

统计每个用户 [月总访问次数]，和 [从1月累计到当前月的总访问次数]

uid,date,vcnt

u01,2023-01-01,5

u02,2023-01-02,6

u02,2023-01-03,7

u02,2023-02-04,8

u03,2023-01-05,5

u04,2023-02-06,8

u01,2023-01-07,9

u01,2023-01-08,11

u02,2023-03-09,5

202,2023-04-10,2

统计每个用户 [月总访问次数]和[从1月累计到当前月的总访问次数]

uid,month,month_cnt,total_cnt

u01,2023-01,25,25

u01,2023-04,2,27

使用substr函数从日期字符串中提取年份和月份。
使用SUM窗口函数计算每个用户每个月的访问总数以及从1月累计到当前月的总访问次数。

WITH monthly_data AS (
    SELECT 
        uid, 
        CONCAT(substr(date, 1, 4), '-', substr(date, 6, 2)) as month, 
        SUM(vcnt) OVER (PARTITION BY uid, CONCAT(substr(date, 1, 4), '-', substr(date, 6, 2))) as month_cnt,
        SUM(vcnt) OVER (PARTITION BY uid ORDER BY CONCAT(substr(date, 1, 4), '-', substr(date, 6, 2)) 
                        RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) as total_cnt
    FROM your_table_name
)
SELECT 
    uid, 
    month, 
    month_cnt, 
    total_cnt 
FROM (
    SELECT 
        *,
        ROW_NUMBER() OVER (PARTITION BY uid, month ORDER BY month DESC) as rn 
    FROM monthly_data
) tmp 
WHERE rn = 1 
ORDER BY uid, month;

在这个查询中：

我们首先创建了一个名为monthly_data的CTE（公用表表达式），它包含uid，month，month_cnt（每月的访问次数），和total_cnt（从1月累计到当前月的总访问次数）。
我们使用SUM窗口函数来计算每个用户的month_cnt和total_cnt，并通过PARTITION BY子句将数据分区到每个用户和每个月。
然后，我们从CTE中选择数据，并使用ROW_NUMBER()函数消除重复的行，因为对于每个用户和每个月，我们只想在最终结果中有一行。
最后，我们根据uid和month对结果进行排序。

请注意，您需要将your_table_name替换为您的实际表名。此外，这个查询假设date字段是字符串类型，格式为'YYYY-MM-DD'。如果date的格式不同，或者它是日期类型，您需要相应地调整substr函数的使用。

2.[Hive]一月连续登陆问题

with build_order as (
    select '1' as user_id,'2022-10-01' AS dt
    UNION ALL 
    select '1' as user_id,'2022-10-02' AS dt
    UNION ALL 
    select '1' as user_id,'2022-10-03' AS dt
    UNION ALL 
    select '2' as user_id,'2022-10-01' AS dt
    UNION ALL 
    select '2' as user_id,'2022-10-02' AS dt
    UNION ALL 
    select '3' as user_id,'2022-10-01' AS dt
    UNION ALL 
    select '3' as user_id,'2022-10-02' AS dt
    UNION ALL 
    select '3' as user_id,'2022-10-03' AS dt
    UNION ALL 
    select '4' as user_id,'2022-10-01' AS dt
    UNION ALL 
    select '5' as user_id,'2022-10-01' AS dt
    UNION ALL 
    select '5' as user_id,'2022-10-02' AS dt
    UNION ALL 
    select '6' as user_id,'2022-10-01' AS dt
    UNION ALL 
    select '6' as user_id,'2022-10-02' AS dt
    UNION ALL 
    select '6' as user_id,'2022-10-03' AS dt
    UNION ALL 
    select '6' as user_id,'2022-10-04' AS dt
    UNION ALL 
    select '6' as user_id,'2022-10-05' AS dt
    )
    ,
    --记录(先排重)，升序排序 
    data_order as (
    select
        user_id,
        dt as sign_date,
        rank() over(partition by user_id order by dt asc) as rk
    from
        build_order
    )
    ,
    --日期-排序序号，若连续打卡则结果相同 
    --连续行为则日期相同，用来分组
    cal_data_sim as (
    select
        user_id,
        sign_date,
        date_sub(sign_date, rk) as sign_group
    from
        data_order
    )
    
--统计每次连续的次数 
select user_id,sign_group,
       min(sign_date) as sign_start_date,
       max(sign_date) as sign_end_date,
       count(1) as days 
from  cal_data_sim
group by user_id,sign_group 
;


连续N天登陆
with t1 as (
select
  userid,
  logintime,
  --本次登陆日期的第三天
  date_add(logintime,2) as nextday,
  --按照用户id分区，按照登陆日期排序，取下下一次登陆时间，取不到就为0
  lead(logintime,2,0) over (partition by userid order by logintime) as nextlogin
from tb_login )
select distinct userid from t1 where nextday = nextlogin;

3.[Hive]时间交叉问题

如下为平台商品促销数据表：test1，

字段为：自增id，品牌，打折开始日期，打折结束日期

id,brand,stt,edt

1,oppo,2021-06-05,2021-06-09

2,oppo,2021-06-11,2021-06-21

3,vivo,2021-06-05,2021-06-15

4,vivo,2021-06-09,2021-06-21

5,redmi,2021-06-05,2021-06-21

6,redmi,2021-06-09,2021-06-15

7,redmi,2021-06-17,2021-06-26

8,huawei,2021-06-05,2021-06-26

9,huawei,2021-06-09,2021-06-15

10,huawei,2021-06-17,2021-06-21

计算每个品牌总的打折销售天数，注意其中的交叉日期，比如 vivo 品牌，第一次活动时间为 2021-06-05 到 2021-06-15，第二次活动时间为 2021-06-09 到 2021-06-21 其中 9 号到 15号为重复天数，只统计一次，即 vivo 总打折天数为 2021-06-05 到 2021-06-21 共计 17 天。

select
    brand,
    1 + sum(if(mdt >= edt,0,if(stt >= mdt, datediff(edt, stt), datediff(edt, mdt)))) as active_days
from
    (
        select
            brand,
            stt,
            edt,
            nvl(max(edt) over(partition by brand order by stt rows between unbounded preceding and 1 preceding),stt) as mdt
        from
            b
    ) t1
group by
    brand

4.[Java]中文金额转阿拉伯金额

# constants for chinese_to_arabic
# author: kean
CN_NUM = {
    '〇': 0, '一': 1, '二': 2, '三': 3, '四': 4, '五': 5, '六': 6, '七': 7, '八': 8, '九': 9, '零': 0,
    '壹': 1, '贰': 2, '叁': 3, '肆': 4, '伍': 5, '陆': 6, '柒': 7, '捌': 8, '玖': 9, '貮': 2, '两': 2,
}

CN_UNIT = {
    '十': 10,
    '拾': 10,
    '百': 100,
    '佰': 100,
    '千': 1000,
    '仟': 1000,
    '万': 10000,
    '萬': 10000,
    '亿': 100000000,
    '億': 100000000,
    '兆': 1000000000000,
}


def chinese_to_arabic(cn: str) -> int:
    unit = 0  # current
    ldig = []  # digest
    for cndig in reversed(cn):
        if cndig in CN_UNIT:
            unit = CN_UNIT.get(cndig)
            if unit == 10000 or unit == 100000000:
                ldig.append(unit)
                unit = 1
        else:
            dig = CN_NUM.get(cndig)
            if unit:
                dig *= unit
                unit = 0
            ldig.append(dig)
    if unit == 10:
        ldig.append(10)
    val, tmp = 0, 0
    for x in reversed(ldig):
        if x == 10000 or x == 100000000:
            val += tmp * x
            tmp = 0
        else:
            tmp += x
    val += tmp
    return val


# TODO: make a full unittest
def test():
    test_dig = ['八',
                '十一',
                '一百二十三',
                '一千二百零三',
                '一万一千一百零一',
                '十万零三千六百零九',
                '一百二十三万四千五百六十七',
                '一千一百二十三万四千五百六十七',
                '一亿一千一百二十三万四千五百六十七',
                '一百零二亿五千零一万零一千零三十八']
    for cn in test_dig:
        x = chinese_to_arabic(cn)
        print(cn, x)
        # assert x == 10250011038


if __name__ == '__main__':
    test()

5.[Hive]累积金额问题

Function (arg1,…, argn) 可以是下面的函数

Aggregate Functions: 聚合函数,比如：sum(…)、 max(…)、min(…)、avg(…)等.
Sort Functions: 数据排序函数, 比如 ：rank(…)、row_number(…)等.
Analytics Functions: 统计和比较函数, 比如：lead(…)、lag(…)、 first_value(…)等.

OVER ([PARTITION BY <…>] [ORDER BY <….>] 其中包括以下可选项

PARTITION BY 表示将数据先按 字段 进行分区
ORDER BY 表示将各个分区内的数据按 排序字段 进行排序

window_expression 窗口边界

名词	含义
preceding	往前
following	往后
current row	当前行
unbounded	起点
unbounded preceding	从前面的起点
unbounded following	到后面的终点

窗口边界使用详解

如果不指定 PARTITION BY，则不对数据进行分区，换句话说，所有数据看作同一个分区；
如果不指定 ORDER BY，则不对各分区做排序，通常用于那些顺序无关的窗口函数，例如 SUM()
如果不指定窗口子句，则默认采用以下的窗口定义：

a、若不指定 ORDER BY，默认使用分区内所有行 ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING.

b、若指定了 ORDER BY，默认使用分区内第一行到当前值 ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW.

6.[Hive]相互关注的人

用户相互关注表结构：id,from_user,to_user 分别表示唯一id，本用户，关注的用户，数据为样式为：

1,A,B

2,B,A

3,C,B

查找相互关注用户逻辑


-- 处理相互关注的前后问题，处理使feature值一样
WITH concat_user AS (
select
    from_user,
    to_user,
    if(hash(from_user) > hash(to_user),concat(from_user, to_user),concat(to_user, from_user)) as feather
from
    table_user 
)

-- 通过concat查找关注关系
select
    from_user,
    to_user
from
    concat_user
where
    if(sum(1) over(partition by feather) > 1, 1, 0) = 1;

7.[Java]合并两个有序链表

    public static Node mergeTwoOrderlyNode(Node node1, Node node2) {
        Node newNode = new Node(-1);
        Node temp = newNode;
        while (node1 != null && node2 != null) {
            if(node1.val< node2.val){
                temp.next = node1;
                node1 = node1.next;
            }else {
                temp.next = node2;
                node2 = node2.next;
            }
            temp = temp.next;
        }
        if (node1==null){
            temp.next = node1;
        }
        if (node2==null){
            temp.next = node2;
        }
        return newNode.next;
    }

8.[Hive]查询销售额排行的前、后3名产品ID

select
    pid,
    ROW_NUMBER() over (order by sum(amount)) rk1,
    ROW_NUMBER() over (order by sum(amount) desc) rk2
from
    pro
group by
    pid

9.[Hive]统计在线高峰期人数及时间段

with
t2 as 
(
    select  at_time
            ,lead(at_time, 1, at_time) over(order by at_time) lead_time
            ,sum(if(type = 'login', 1, -1)) over(order by at_time rows between unbounded preceding and current row) online_ct
    from    (
        select  id,start_time at_time,'login' type from    player
        union all 
        select  id,end_time at_time,'logout' type from    player
        order by at_time
    ) t1
)

select  at_time,lead_time,online_ct
from    t2
where   t2.online_ct in (select max(online_ct) max_ct from t2)

10.[Java]股票最大交易额

    public static int maxProfit(int[] prices) {
        //1. minPrice记录股票历史最低值
        int minPrice = prices[0];
        //2. maxProfit记录股票最大收益
        int maxProfit = 0;
        //3. 遍历股票prices，动态修改minPrice和maxProfit
        int buying = 0;
        // 买入时间
        int sale = 0;
        // 卖出时间
        int editIndex = 0;
        // 记录天数
        for (int price : prices) {
            editIndex+=1;
            //4. 如果当天的price小于minPrice，则修改minPrice；
            if (price < minPrice) {
                minPrice = price;
                buying = editIndex;
            }
            //5. 如果当天的收益大于maxProfit，则修改maxProfit
            int profit = price - minPrice;
            if (profit > maxProfit) {
                maxProfit = profit;
                sale = editIndex;
            }
        }
        return maxProfit;
    }

11.[Hive]连续重复数据

-- 加载数据前一行和后一行处理
with cal_lad_lead_tab as (
select `timestamp`,id,grp,lag(grp) OVER w as prev_grp,lead(grp) OVER w as next_grp  from df  WINDOW w AS ( ORDER BY Timestamp)
),
-- 计算相同数据
cal_alike as (
select Timestamp
       ,id
       ,grp
       ,sum(case 
        when (grp = prev_grp and grp = next_grp) then 0
        when (grp = prev_grp and grp != next_grp) then 1
        when (grp = prev_grp and (grp != next_grp or next_grp is null)) then 0 else 1
        over(order by `timestamp` rows between UNBOUNDED PRECEDING and CURRENT ROW)+1) as event_seq
       from cal_lad_lead_tab order by `timestamp`
),
-- 查询最终数据
select id,collect_list(grp) as cli_grp from cal_lad_lead_tab group by event_seq,id;

12.[Java]两个链表之和

public ListNode addTwoNumbers(ListNode l1, ListNode l2) {
        int carry = 0;//表示进位
        ListNode head = null, tail = null;

        while (l1 != null || l2 != null) {
            //如果其中有一个到达结尾了，那么这个链表这一位的的数字就为0。
            int x = l1 == null ? 0 : l1.val;
            int y = l2 == null ? 0 : l2.val;
            int sum = x + y + carry;

            if (head == null) {//第一次插入
                head = tail = new ListNode(sum % 10);
            } else {
                tail.next = new ListNode(sum % 10);
                tail = tail.next;
            }
            carry = sum / 10;
            if (l1 != null) {
                l1 = l1.next;
            }
            if (l2 != null) {
                l2 = l2.next;
            }
        }
        if (carry == 1) tail.next = new ListNode(carry);
        return head;
    }

13.[Java]最长连续序列

【动态规划】O(n) 复杂度。

用哈希表存储每个端点值对应连续区间的长度
若数已在哈希表中：跳过不做处理
若是新数加入：
- 取出其左右相邻数已有的连续区间长度 left 和 right
- 计算当前数的区间长度为：cur_length = left + right + 1
- 根据 cur_length 更新最大长度 max_length 的值
- 更新区间两端点的长度值

class Solution(object):
    def longestConsecutive(self, nums):
        hash_dict = dict()
        max_length = 0
        for num in nums:
            if num not in hash_dict:
                left = hash_dict.get(num - 1, 0)
                right = hash_dict.get(num + 1, 0)
                cur_length = 1 + left + right
                if cur_length > max_length:
                    max_length = cur_length
                hash_dict[num] = cur_length
                hash_dict[num - left] = cur_length
                hash_dict[num + right] = cur_length
        return max_length

14、[Hive]行转列列转行

有数据表t1有以下数据内容

user_name course score 
user1 语文 89
user1 数学 92
user1 英语 87
user2 语文 96
user2 数学 84
user2 英语 99

1.将course, score多行转一列

用户 课程 分数
user1 ｜ 语文，数学，英语 ｜ 89，92，87
user2 ｜ 语文，数学，英语 ｜ 89，92，87

2. 将上面的结果再还原回去

15、[Hive]连续三天登陆的用户

user_id,login_date
1,2021-10-01
2,2021-10-03
1,2021-10-02
1,2021-10-03
2,2021-10-06
2,2021-10-07
2,2021-10-08
3,2021-10-08

求连续3天登录的所有用户个数

16、[Hive]说出答案并说说为什么

表t1中字段a数据如下：

null
111
222
333

问题一：

select sum(a), avg(a), min(a), max(a), count(a), count(1), count(*) from t1;

问题二：

select 1 + null + 2;