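8.5 Comparison Experiment of Mainstream File Formats

This experiment compares the storage formats on two measures: compression ratio and query speed.

Step 1: Prepare the data

The text file contains 100,000 records.

Before it is loaded into a table, the file is about 19 MB.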


Step 2: Create a table stored as TEXTFILE

create table log_text (
    track_time string,
    url string,
    session_id string,
    referer string,
    ip string,
    end_user_id string,
    city_id string
)
row format delimited fields terminated by '\t'
stored as textfile ;

Load data into the table:

load data local inpath '/opt/module/datas/log.data' into table log_text ;

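The stored file is the same size as the original, because load data simply copies the text file into the table's directory without re-encoding it.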
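
One way to confirm this is to list the table's directory in HDFS from the Hive CLI. The sketch below assumes the table lives in the default database under the default warehouse path /user/hive/warehouse; that path does not appear in the original, so substitute the real location if hive.metastore.warehouse.dir is set differently.

```sql
-- Assumption: default database and default warehouse directory.
-- Prints the size of each file under the table directory in human-readable units.
dfs -du -h /user/hive/warehouse/log_text;
```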


Step 3: Create a table stored as ORC

create table log_orc(
    track_time string,
    url string,
    session_id string,
    referer string,
    ip string,
    end_user_id string,
    city_id string
)
row format delimited fields terminated by '\t'
stored as orc ;

Insert data into the table:

insert into table log_orc select * from log_text ;

Note that this insert launches a MapReduce job, because the rows have to be read from log_text and rewritten in ORC format.

Data size:
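
The size can be read from HDFS once the insert has finished. As a rough sketch: describe formatted reports the table's actual Location, which can then be passed to dfs -du. The path shown here is an assumed default, not a value taken from the original.

```sql
-- Show table metadata, including the Location of the data files.
describe formatted log_orc;

-- Assumption: Location is the default warehouse path;
-- replace it with the value reported by the statement above.
dfs -du -h /user/hive/warehouse/log_orc;
```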


Step 4: Create a table stored as Parquet

create table log_parquet(
    track_time string,
    url string,
    session_id string,
    referer string,
    ip string,
    end_user_id string,
    city_id string
)
row format delimited fields terminated by '\t'
stored as parquet;

Load data into the table:

insert into table log_parquet select * from log_text ;
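
To compare the three formats side by side, the same size check can be run against each table directory. This is only a sketch: it assumes all three tables sit in the default database under /user/hive/warehouse, which the original does not state.

```sql
-- Assumption: all three tables are under the default warehouse directory.
-- -s prints one summary total per directory instead of one line per file.
dfs -du -s -h /user/hive/warehouse/log_text;
dfs -du -s -h /user/hive/warehouse/log_orc;
dfs -du -s -h /user/hive/warehouse/log_parquet;
```

A smaller total for the same rows means a higher compression ratio.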

Compression ratio summary (higher means smaller files): ORC > Parquet > TextFile


Step 5: Test the query speed of each storage format:

select count(*) from log_text;
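
The same query can be repeated against the other two tables so that the elapsed times can be compared. This sketch simply reuses the tables created in steps 3 and 4; the set statement is an optional addition, not part of the original, and is meant to keep Hive from answering count(*) from stored table statistics instead of reading the files.

```sql
-- Optional (assumption): force a real scan rather than a statistics lookup.
set hive.compute.query.using.stats=false;

-- Each query should return the same count as log_text.
select count(*) from log_orc;
select count(*) from log_parquet;
```

The Hive CLI prints a "Time taken" line after each query, which is the figure to compare across the three formats.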

Copyright © 尚硅谷大数据 2019. All rights reserved. Powered by Gitbook
Last revised: 2018-11-20 18:14:22
