9.3.2 空 key 过滤

有时 join 超时是因为某些 key 对应的数据太多,而相同 key 对应的数据都会发送到相同的 reducer上,从而导致内存不够。

此时我们应该仔细分析这些异常的 key,很多情况下,这些 key对应的数据是异常数据,我们需要在SQL语句中进行过滤。

例如key对应的字段为空,这种情况的值一般都是异常的值

测试如下:

步骤1: 建 3 张表: 原始数据表、空id表、合并后数据表

// 创建原始表
create table ori(id bigint, time bigint, uid string, keyword string, url_rank int, click_num int, click_url string) row format delimited fields terminated by '\t';
// 创建空id表
create table nullidtable(id bigint, time bigint, uid string, keyword string, url_rank int, click_num int, click_url string) row format delimited fields terminated by '\t';
// 创建join后表的语句
create table jointable(id bigint, time bigint, uid string, keyword string, url_rank int, click_num int, click_url string) row format delimited fields terminated by '\t';

步骤2: 加载数据

load data local inpath '/opt/module/datas/ori_nullid/ori' into table ori;
load data local inpath '/opt/module/datas/ori_nullid/nullid' into table nullidtable;

步骤3: 测试不过滤空 id 的 join 操作

insert overwrite table jointable 
    select n.* 
    from nullidtable n left join ori o 
    on n.id = o.id;

步骤4: 测试过滤空 id 的 join 操作

insert overwrite table jointable 
    select n.* 
    from (select * from nullidtable where id is not null ) n  left join ori o 
    on n.id = o.id;

Copyright © 尚硅谷大数据 2019 all right reserved,powered by Gitbook
该文件最后修订时间: 2018-11-20 18:14:22

results matching ""

    No results matching ""