ClickHouse Getting Started Notes
Installation
I wrote an install script and put it in a gist. You can run the following command to install automatically:
curl -L https://gist.githubusercontent.com/emacsist/ebb98839c814cdb2f16e20efcef9474e/raw/aa8e035ad2e61fa74f4c904a7953806c62155fa3/install-clickhouse.sh | bash
Here is the full content of the script:
#!/usr/bin/env bash
# ClickHouse requires SSE 4.2; abort if the CPU does not support it
grep -q sse4_2 /proc/cpuinfo && echo "SSE 4.2 supported" || echo "SSE 4.2 not supported"
grep -q sse4_2 /proc/cpuinfo || exit 1
# Add the Yandex apt repository (use tee so the redirect runs with sudo)
echo "deb http://repo.yandex.ru/clickhouse/deb/stable/ main/" | sudo tee /etc/apt/sources.list.d/clickhouse.list
sudo apt-key adv --keyserver keyserver.ubuntu.com --recv E0C56BD4
sudo apt-get update -y
sudo apt-get install clickhouse-client clickhouse-server -y
Start and stop
sudo service clickhouse-server start
sudo service clickhouse-server stop
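After starting the server, a quick sanity check is to open clickhouse-client on the same machine and run, for example:
SELECT version();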
Configuration
The server configuration file is at:
/etc/clickhouse-server/config.xml
The user configuration file is at:
/etc/clickhouse-server/users.xml
Changing the data directory
In config.xml, find:
<!-- Path to data directory, with trailing slash. -->
<path>/home/username/clickhouse/</path>
<!-- Path to temporary data for processing hard queries. -->
<tmp_path>/home/username/clickhouse/tmp/</tmp_path>
These are the database data directory and the directory for temporary files used when processing heavy queries; change them as needed. By default they live under /var/lib/clickhouse/.
Enabling remote access
Edit config.xml and find:
<listen_host></listen_host>
and change it to:
<listen_host>::</listen_host>
You can also explicitly specify the IP address you want to bind to.
Then edit users.xml and make the user's networks section look like this (allow connections from any address):
<networks incl="networks" replace="replace">
<ip>::/0</ip>
</networks>
Then restart ClickHouse, and you can connect from other machines with clickhouse-client -h <IP>. Usernames and passwords are defined in users.xml: each child node of <users> is one user.
Common SQL
Creating a table
CREATE TABLE b6logs(
eventDate Date,
impid UInt64,
uid String,
idfa String,
imei String
) ENGINE=MergeTree(eventDate, (impid, eventDate), 8192)
In general the MergeTree engine is recommended. With this legacy syntax the engine requires a Date column to use as the index/partition column, which is eventDate above.
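For reference, here is the same table written with the newer PARTITION BY / ORDER BY syntax (a sketch; partitioning by month matches the legacy default):
CREATE TABLE b6logs(
    eventDate Date,
    impid UInt64,
    uid String,
    idfa String,
    imei String
) ENGINE = MergeTree
PARTITION BY toYYYYMM(eventDate)
ORDER BY (impid, eventDate)
SETTINGS index_granularity = 8192;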
Importing CSV data
cat xxx.csv | clickhouse-client --query="INSERT INTO b6logs FORMAT CSV";
Specifying the delimiter
cat xxx.csv | clickhouse-client --format_csv_delimiter="|" --query="INSERT INTO b6logs FORMAT CSV";
Ignoring errors during import
clickhouse-client --input_format_allow_errors_num=100000 --input_format_allow_errors_ratio=0.2
--input_format_allow_errors_num: how many errors are allowed
--input_format_allow_errors_ratio: what error ratio is allowed, in the range [0, 1]
Exporting CSV data
clickhouse-client --query="select uid, idfa, imei from (select impid, uid from b2logs where impid >= 15289903030261609347 and impid <= 15289904230261609347) any inner join (select impid, idfa, imei from b6logs where impid >= 15289903030261609347 and impid <= 15289904230261609347) using(impid) format CSV" > 9c9dc608-269b-4f02-b122-ef5dffb2669d.log
That is, the syntax is select ... format CSV.
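When running the query from clickhouse-client, INTO OUTFILE can also write the file directly; a minimal sketch using the b6logs table from above (the file name is arbitrary):
SELECT uid, idfa, imei
FROM b6logs
WHERE eventDate = yesterday()
INTO OUTFILE 'b6logs.csv'
FORMAT CSV;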
Renaming a table
rename table tbl1 to tbl2;
Dropping a table
drop table tbl;
Adding a column
alter table dsp_statis add column cost UInt32 default 0;
Showing a table's structure
desc tbl;
For more syntax, see the official docs: https://clickhouse.yandex/docs/en/query_language/queries/
Dropping partitions in the MergeTree engine
Note that by default the MergeTree engine partitions by month, so the partition identifier used when dropping looks like 201808.
If you want to partition by day instead, specify it when creating the table:
ENGINE = MergeTree PARTITION BY eventDate ORDER BY imp_id SETTINGS index_granularity = 8192;
Then you can run:
alter table xxx drop partition '2018-08-08';
By default, ClickHouse does not allow dropping a partition or table larger than 50 GB. You can change this permanently in the server config file, or override it temporarily without restarting the service.
Permanent configuration
sudo vim /etc/clickhouse-server/config.xml
Then uncomment (or add) the following two settings:
<max_table_size_to_drop>0</max_table_size_to_drop>
<max_partition_size_to_drop>0</max_partition_size_to_drop>
0 means no limit; alternatively, set them to the maximum size you want to allow.
Temporary override
Create a flag file:
sudo touch '/home/username/clickhouse/flags/force_drop_table' && sudo chmod 666 '/home/username/clickhouse/flags/force_drop_table'
Once the flag file exists (it lives in the flags/ directory under the data path), you can run the drop partition / drop table commands above.
Checking table sizes
SELECT
database,
table,
formatReadableSize(size) AS size,
formatReadableSize(bytes_on_disk) AS bytes_on_disk,
formatReadableSize(data_uncompressed_bytes) AS data_uncompressed_bytes,
formatReadableSize(data_compressed_bytes) AS data_compressed_bytes,
compress_rate,
rows,
days,
formatReadableSize(avgDaySize) AS avgDaySize
FROM
(
SELECT
database,
table,
sum(bytes) AS size,
sum(rows) AS rows,
min(min_date) AS min_date,
max(max_date) AS max_date,
sum(bytes_on_disk) AS bytes_on_disk,
sum(data_uncompressed_bytes) AS data_uncompressed_bytes,
sum(data_compressed_bytes) AS data_compressed_bytes,
(data_compressed_bytes / data_uncompressed_bytes) * 100 AS compress_rate,
max_date - min_date AS days,
size / (max_date - min_date) AS avgDaySize
FROM system.parts
WHERE active and database = currentDatabase()
GROUP BY
database,
table
ORDER BY
database ASC,
size DESC
)
Tracking parts and partitions
SELECT
database,
table,
count() AS parts,
uniq(partition) AS partitions,
sum(marks) AS marks,
sum(rows) AS rows,
formatReadableSize(sum(data_compressed_bytes)) AS compressed,
formatReadableSize(sum(data_uncompressed_bytes)) AS uncompressed,
round((sum(data_compressed_bytes) / sum(data_uncompressed_bytes)) * 100., 2) AS percentage
FROM system.parts
WHERE active AND (database = currentDatabase())
GROUP BY
database,
table
Checking data size and compression ratio per table
SELECT
table,
formatReadableSize(sum(data_compressed_bytes)) AS tc,
formatReadableSize(sum(data_uncompressed_bytes)) AS tu,
round((sum(data_compressed_bytes) / sum(data_uncompressed_bytes)) * 100, 2) AS ratio
FROM system.columns
WHERE database = currentDatabase()
GROUP BY table
ORDER BY sum(data_compressed_bytes) ASC
Checking per-column data size in a table
SELECT
column,
any(type),
sum(column_data_compressed_bytes) AS compressed,
sum(column_data_uncompressed_bytes) AS uncompressed,
sum(rows)
FROM system.parts_columns
WHERE (table = 'daily_uuid') AND active
GROUP BY column
ORDER BY column ASC
Executing a SQL file
clickhouse-client -d <database> --multiquery < /tmp/your.sql.file
Viewing partition info
select partition, name, active from system.parts WHERE table = 'visits'
Performance notes
New UVs yesterday
For example: compared with all previously seen uids, how many new UVs did yesterday bring?
Intersection = clickhouse-client --query="select count(distinct uid) from all_uid where uid in (select uid from uniq_uid where eventDate = yesterday())"
Yesterday's total = clickhouse-client --query="select count(distinct uid) from uniq_uid where eventDate = yesterday()"
New UV = yesterday's total - intersection
If you simply write it directly as below instead, it can easily OOM:
clickhouse-client --query="select count(distinct uid) from uniq_uid where eventDate = yesterday() and uid not in (select uid from all_uid)"
Join performance
Remember: join the large table to the small one, i.e. put the big table on the left as the driving side. (I don't know the exact reason, but empirically the big table on the left is far faster than the small one; presumably because ClickHouse builds the in-memory hash table from the right-hand side, so the right side should be small. In MySQL it is the opposite: the small table drives the big one.)
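A sketch of the pattern, reusing the b2logs/b6logs tables from the export example above (the impid range is just an illustrative filter): the big table sits on the left, and the right-hand side is a filtered subquery.
SELECT count()
FROM b6logs
ANY INNER JOIN
(
    SELECT impid
    FROM b2logs
    WHERE impid >= 15289903030261609347 AND impid <= 15289904230261609347
) USING (impid);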
Optimizing distinct count
Before
select yob, count(), count(distinct uid, idfa, imei) from nginx_bid_log where eventDate='2018-09-01' group by yob;
After
select yob, count(), count(distinct(sipHash64(concat(uid, idfa, imei)))) from nginx_bid_log where eventDate='2018-09-01' group by yob;
Viewing data distribution
select histogram(100)(upstream_resp_time) from (select upstream_resp_time from nginx_bid_log where eventDate = '2018-12-13') format CSV;
histogram(100) means 100 buckets (i.e. the values are split into a 100-bin distribution); upstream_resp_time is the column whose values are binned.
bar
select upstream_resp_time, bar(column, min, max, width) from tableXX;
Draws a simple text bar chart.
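For example (a sketch against the nginx_bid_log table used earlier; the 0/1000 range and the width of 50 are arbitrary):
SELECT upstream_resp_time, bar(upstream_resp_time, 0, 1000, 50)
FROM nginx_bid_log
WHERE eventDate = '2018-12-13'
LIMIT 20;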
Converting hex to decimal
SELECT reinterpretAsInt64(reverse(unhex('123')));
Partitioning by md5
-- Option 1: take the first hex character of the md5 and convert it to a number
SELECT reinterpretAsInt64(reverse(unhex(substring(md5_field, 1, 1))));
-- Option 2: md5 => hex => decimal => modulo
SELECT modulo(reinterpretAsInt64(reverse(unhex(substring(md5_field, 1, 1)))), 5);
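For example, the modulo expression can be used directly as a partition key when creating a table (a sketch; the table and column names here are hypothetical):
CREATE TABLE md5_partitioned(
    md5_field String,
    eventDate Date
) ENGINE = MergeTree
PARTITION BY modulo(reinterpretAsInt64(reverse(unhex(substring(md5_field, 1, 1)))), 5)
ORDER BY eventDate;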
Tracing query logs
clickhouse-client --send_logs_level=trace
# text_log must be enabled in config.xml first
select * from system.text_log
# server log file
/var/log/clickhouse-server/clickhouse-server.log
Using multiple threads to speed up queries
By default max_threads is the number of physical CPU cores (i.e. half the logical cores on a machine with hyper-threading).
set max_threads=2
set max_threads=4
SELECT * FROM system.settings where name like '%threads%';
prewhere
- When PREWHERE is used, only the columns referenced in the PREWHERE condition are read first; the other columns are read on demand for the rows that pass (see the sketch below)
- Typically used with a non-index column that nevertheless provides a strong filter
- It is applied before WHERE
- Related setting:
optimize_move_to_prewhere: if set to 1 and the query has no PREWHERE, the server automatically decides whether to move conditions into PREWHERE (the default is 1)
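A minimal sketch against the nginx_bid_log table used earlier (the response-time threshold is arbitrary): upstream_resp_time is not part of the index, but it filters out most rows, so it goes into PREWHERE.
SELECT uid, idfa, imei
FROM nginx_bid_log
PREWHERE upstream_resp_time > 1000
WHERE eventDate = '2018-12-13';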
join
When joining, filter the data before the join to reduce the amount of data scanned.
Number of partitions
Keep it in the hundreds, not the thousands.
index_granularity
ClickHouse reads data by granules. This table-engine setting is how many rows one granule (one primary-index mark) covers.
The default is 8192.
The ORDER BY clause in CREATE TABLE
Choosing the right ORDER BY columns is key to performance:
- better compression
- better index selection
- better PREWHERE performance
encoding
https://www.altinity.com/blog/2019/7/new-encodings-to-improve-clickhouse
Use column encodings/codecs to reduce data size. The default compression is LZ4.
CREATE TABLE codec_example
(
dt Date CODEC(ZSTD),
ts DateTime CODEC(LZ4HC),
float_value Float32 CODEC(NONE),
double_value Float64 CODEC(LZ4HC(9)),
value Float32 CODEC(Delta, ZSTD)
)
ENGINE = MergeTree()
ORDER BY dt;
- LowCardinality: for String columns with fewer than ~10K distinct values. https://www.altinity.com/blog/2019/3/27/low-cardinality (see the sketch below)
- Delta: time-series style values
- DoubleDelta: increasing counters
- Gorilla: gauge data
- T64: integers that are not random hashes
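A sketch of declaring a LowCardinality column (the table and column names here are hypothetical):
CREATE TABLE lc_example(
    eventDate Date,
    country LowCardinality(String)
) ENGINE = MergeTree
ORDER BY eventDate;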
Using materialized views
CREATE [MATERIALIZED] VIEW [IF NOT EXISTS] [db.]table_name [TO [db.]name] [ENGINE = engine] [POPULATE] AS SELECT ...
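A minimal sketch (the view name is hypothetical): a materialized view that maintains a per-day row count for the b6logs table defined earlier, using SummingMergeTree so counts are merged incrementally.
CREATE MATERIALIZED VIEW b6logs_daily_mv
ENGINE = SummingMergeTree
PARTITION BY toYYYYMM(eventDate)
ORDER BY eventDate
POPULATE
AS SELECT eventDate, count() AS cnt
FROM b6logs
GROUP BY eventDate;
POPULATE backfills the view from existing data; rows inserted into b6logs afterwards are added to the view automatically.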
ReplacingMergeTree: a deduplicating table engine
CREATE TABLE bid_user(
uuid String,
yob Int32,
gender Int8 DEFAULT -1
) ENGINE = ReplacingMergeTree()
PARTITION by substring(uuid, 1, 1)
order by (yob, gender)
Note: it deduplicates rows that have the same partition key (PARTITION BY) and the same ORDER BY key.
- Deduplication happens in the background, through merge tasks
- It does not guarantee that the data you read right now is already deduplicated
- You can run optimize table bid_user [final]; to suggest that ClickHouse perform the merge/dedup immediately (still not guaranteed); to force deduplicated results at read time, see the FINAL sketch below
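For example, to force deduplicated results at query time (slower, since the merge is applied while reading):
SELECT * FROM bid_user FINAL;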
Miscellaneous
optimize
set optimize_throw_if_noop = 1;
OPTIMIZE TABLE T1 FINAL;
If the OPTIMIZE is a no-op, this makes ClickHouse throw an exception explaining why.
kill query
SELECT query_id, query FROM system.processes;
KILL QUERY WHERE query_id = '<id>';
Notes on table-creation parameters
ENGINE = MergeTree PARTITION BY eventDate ORDER BY imp_id SETTINGS index_granularity = 8192;
partition by: how the table is partitioned
order by: how data is indexed and sorted within each partition
References
- http://www.clickhouse.com.cn/topic/5ad9521b9d28dfde2ddc5f86
- https://clickhouse.yandex/docs/en/query_language/queries/
- https://github.com/yandex/ClickHouse/blob/master/dbms/tests/queries/0_stateless/00418_input_format_allow_errors.sh
- https://groups.google.com/forum/m/#!topic/clickhouse/U_Uc4omvoqc
- https://www.slideshare.net/Altinity/clickhouse-query-performance-tips-and-tricks-by-robert-hodges-altinity-ceo
- https://medium.com/@f1yegor/clickhouse-primary-keys-2cf2a45d7324
- https://www.altinity.com/blog/2019/7/new-encodings-to-improve-clickhouse
- https://github.com/ClickHouse/ClickHouse/issues/4796