ClickHouse Getting Started Notes
Installation
I wrote an install script and put it in a gist. You can run the following command to install automatically:
curl -L https://gist.githubusercontent.com/emacsist/ebb98839c814cdb2f16e20efcef9474e/raw/aa8e035ad2e61fa74f4c904a7953806c62155fa3/install-clickhouse.sh | bash
Here is the full content of the script:
#!/usr/bin/env bash
# ClickHouse requires SSE 4.2; abort if the CPU does not support it
grep -q sse4_2 /proc/cpuinfo && echo "SSE 4.2 supported" || echo "SSE 4.2 not supported"
grep -q sse4_2 /proc/cpuinfo || exit 1
# Add the Yandex apt repository (use tee so the redirect runs with sudo)
echo "deb http://repo.yandex.ru/clickhouse/deb/stable/ main/" | sudo tee /etc/apt/sources.list.d/clickhouse.list
sudo apt-key adv --keyserver keyserver.ubuntu.com --recv E0C56BD4
sudo apt-get update -y
sudo apt-get install clickhouse-client clickhouse-server -y
Start and stop
sudo service clickhouse-server start
sudo service clickhouse-server stop
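After starting the server, a quick sanity check is to open clickhouse-client on the same machine and run, for example:
SELECT version();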
Configuration
The server configuration file is at:
/etc/clickhouse-server/config.xml
The user configuration file is at:
/etc/clickhouse-server/users.xml
Changing the data directory
In config.xml, find:
<!-- Path to data directory, with trailing slash. -->
<path>/home/username/clickhouse/</path>
<!-- Path to temporary data for processing hard queries. -->
<tmp_path>/home/username/clickhouse/tmp/</tmp_path>
These are the database data directory and the directory for temporary files used when processing heavy queries; change them as needed. By default they live under /var/lib/clickhouse/.
Enabling remote access
Edit config.xml and find:
<listen_host></listen_host>
and change it to:
<listen_host>::</listen_host>
You can also explicitly specify the IP address you want to bind to.
Then edit users.xml and make the user's networks section look like this (allow connections from any address):
<networks incl="networks" replace="replace">
<ip>::/0</ip>
</networks>
Then restart ClickHouse, and you can connect from other machines with clickhouse-client -h <IP>. Usernames and passwords are defined in users.xml: each child node of <users> is one user.
Common SQL
Creating a table
CREATE TABLE b6logs(
eventDate Date,
impid UInt64,
uid String,
idfa String,
imei String
) ENGINE=MergeTree(eventDate, (impid, eventDate), 8192)
In general the MergeTree engine is recommended. With this legacy syntax the engine requires a Date column to use as the index/partition column, which is eventDate above.
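For reference, here is the same table written with the newer PARTITION BY / ORDER BY syntax (a sketch; partitioning by month matches the legacy default):
CREATE TABLE b6logs(
    eventDate Date,
    impid UInt64,
    uid String,
    idfa String,
    imei String
) ENGINE = MergeTree
PARTITION BY toYYYYMM(eventDate)
ORDER BY (impid, eventDate)
SETTINGS index_granularity = 8192;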
Importing CSV data
cat xxx.csv | clickhouse-client --query="INSERT INTO b6logs FORMAT CSV";
Specifying the delimiter
cat xxx.csv | clickhouse-client --format_csv_delimiter="|" --query="INSERT INTO b6logs FORMAT CSV";
Ignoring errors during import
clickhouse-client --input_format_allow_errors_num=100000 --input_format_allow_errors_ratio=0.2
--input_format_allow_errors_num: how many errors are allowed
--input_format_allow_errors_ratio: what error ratio is allowed, in the range [0, 1]
Exporting CSV data
clickhouse-client --query="select uid, idfa, imei from (select impid, uid from b2logs where impid >= 15289903030261609347 and impid <= 15289904230261609347) any inner join (select impid, idfa, imei from b6logs where impid >= 15289903030261609347 and impid <= 15289904230261609347) using(impid) format CSV" > 9c9dc608-269b-4f02-b122-ef5dffb2669d.log
That is, the syntax is select ... format CSV.
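When running the query from clickhouse-client, INTO OUTFILE can also write the file directly; a minimal sketch using the b6logs table from above (the file name is arbitrary):
SELECT uid, idfa, imei
FROM b6logs
WHERE eventDate = yesterday()
INTO OUTFILE 'b6logs.csv'
FORMAT CSV;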
Renaming a table
rename table tbl1 to tbl2;
Dropping a table
drop table tbl;
Adding a column
alter table dsp_statis add column cost UInt32 default 0;
Showing a table's structure
desc tbl;
For more syntax, see the official docs: https://clickhouse.yandex/docs/en/query_language/queries/
Dropping partitions in the MergeTree engine
Note that by default the MergeTree engine partitions by month, so the partition identifier used when dropping looks like 201808.
If you want to partition by day instead, specify it when creating the table:
ENGINE = MergeTree PARTITION BY eventDate ORDER BY imp_id SETTINGS index_granularity = 8192;
Then you can run:
alter table xxx drop partition '2018-08-08';
By default, ClickHouse does not allow dropping a partition or table larger than 50 GB. You can change this permanently in the server config file, or override it temporarily without restarting the service.
Permanent configuration
sudo vim /etc/clickhouse-server/config.xml
Then uncomment (or add) the following two settings:
<max_table_size_to_drop>0</max_table_size_to_drop>
<max_partition_size_to_drop>0</max_partition_size_to_drop>
0 means no limit; alternatively, set them to the maximum size you want to allow.
Temporary override
Create a flag file:
sudo touch '/home/username/clickhouse/flags/force_drop_table' && sudo chmod 666 '/home/username/clickhouse/flags/force_drop_table'
Once the flag file exists (it lives in the flags/ directory under the data path), you can run the drop partition / drop table commands above.
Checking table sizes
SELECT
database,
table,
formatReadableSize(size) AS size,
formatReadableSize(bytes_on_disk) AS bytes_on_disk,
formatReadableSize(data_uncompressed_bytes) AS data_uncompressed_bytes,
formatReadableSize(data_compressed_bytes) AS data_compressed_bytes,
compress_rate,
rows,
days,
formatReadableSize(avgDaySize) AS avgDaySize
FROM
(
SELECT
database,
table,
sum(bytes) AS size,
sum(rows) AS rows,
min(min_date) AS min_date,
max(max_date) AS max_date,
sum(bytes_on_disk) AS bytes_on_disk,
sum(data_uncompressed_bytes) AS data_uncompressed_bytes,
sum(data_compressed_bytes) AS data_compressed_bytes,
(data_compressed_bytes / data_uncompressed_bytes) * 100 AS compress_rate,
max_date - min_date AS days,
size / (max_date - min_date) AS avgDaySize
FROM system.parts
WHERE active and database = currentDatabase()
GROUP BY
database,
table
ORDER BY
database ASC,
size DESC
)
Tracking parts and partitions
SELECT
database,
table,
count() AS parts,
uniq(partition) AS partitions,
sum(marks) AS marks,
sum(rows) AS rows,
formatReadableSize(sum(data_compressed_bytes)) AS compressed,
formatReadableSize(sum(data_uncompressed_bytes)) AS uncompressed,
round((sum(data_compressed_bytes) / sum(data_uncompressed_bytes)) * 100., 2) AS percentage
FROM system.parts
WHERE active AND (database = currentDatabase())
GROUP BY
database,
table
Checking data size and compression ratio per table
SELECT
table,
formatReadableSize(sum(data_compressed_bytes)) AS tc,
formatReadableSize(sum(data_uncompressed_bytes)) AS tu,
round((sum(data_compressed_bytes) / sum(data_uncompressed_bytes)) * 100, 2) AS ratio
FROM system.columns
WHERE database = currentDatabase()
GROUP BY table
ORDER BY sum(data_compressed_bytes) ASC
Checking per-column data size in a table
SELECT
column,
any(type),
sum(column_data_compressed_bytes) AS compressed,
sum(column_data_uncompressed_bytes) AS uncompressed,
sum(rows)
FROM system.parts_columns
WHERE (table = 'daily_uuid') AND active
GROUP BY column
ORDER BY column ASC
Executing a SQL file
clickhouse-client -d <database> --multiquery < /tmp/your.sql.file
Viewing partition info
select partition, name, active from system.parts WHERE table = 'visits'
Performance notes
New UVs yesterday
For example: compared with all previously seen uids, how many new UVs did yesterday bring?
Intersection = clickhouse-client --query="select count(distinct uid) from all_uid where uid in (select uid from uniq_uid where eventDate = yesterday())"
Yesterday's total = clickhouse-client --query="select count(distinct uid) from uniq_uid where eventDate = yesterday()"
New UV = yesterday's total - intersection
If you simply write it directly as below instead, it can easily OOM:
clickhouse-client --query="select count(distinct uid) from uniq_uid where eventDate = yesterday() and uid not in (select uid from all_uid)"
Join performance
Remember: join the large table to the small one, i.e. put the big table on the left as the driving side. (I don't know the exact reason, but empirically the big table on the left is far faster than the small one; presumably because ClickHouse builds the in-memory hash table from the right-hand side, so the right side should be small. In MySQL it is the opposite: the small table drives the big one.)
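A sketch of the pattern, reusing the b2logs/b6logs tables from the export example above (the impid range is just an illustrative filter): the big table sits on the left, and the right-hand side is a filtered subquery.
SELECT count()
FROM b6logs
ANY INNER JOIN
(
    SELECT impid
    FROM b2logs
    WHERE impid >= 15289903030261609347 AND impid <= 15289904230261609347
) USING (impid);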
Optimizing distinct count
Before
select yob, count(), count(distinct uid, idfa, imei) from nginx_bid_log where eventDate='2018-09-01' group by yob;
After
select yob, count(), count(distinct(sipHash64(concat(uid, idfa, imei)))) from nginx_bid_log where eventDate='2018-09-01' group by yob;
Viewing data distribution
select histogram(100)(upstream_resp_time) from (select upstream_resp_time from nginx_bid_log where eventDate = '2018-12-13') format CSV;
histogram(100) means 100 buckets (i.e. the values are split into a 100-bin distribution); upstream_resp_time is the column whose values are binned.
bar
select upstream_resp_time, bar(column, min, max, width) from tableXX;
Draws a simple text bar chart.
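For example (a sketch against the nginx_bid_log table used earlier; the 0/1000 range and the width of 50 are arbitrary):
SELECT upstream_resp_time, bar(upstream_resp_time, 0, 1000, 50)
FROM nginx_bid_log
WHERE eventDate = '2018-12-13'
LIMIT 20;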
Converting hex to decimal
SELECT reinterpretAsInt64(reverse(unhex('123')));
Partitioning by md5
-- Option 1: take the first hex character of the md5 and convert it to a number
SELECT reinterpretAsInt64(reverse(unhex(substring(md5_field, 1, 1))));
-- Option 2: md5 => hex => decimal => modulo
SELECT modulo(reinterpretAsInt64(reverse(unhex(substring(md5_field, 1, 1)))), 5);
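For example, the modulo expression can be used directly as a partition key when creating a table (a sketch; the table and column names here are hypothetical):
CREATE TABLE md5_partitioned(
    md5_field String,
    eventDate Date
) ENGINE = MergeTree
PARTITION BY modulo(reinterpretAsInt64(reverse(unhex(substring(md5_field, 1, 1)))), 5)
ORDER BY eventDate;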
Tracing query logs
clickhouse-client --send_logs_level=trace
# text_log must be enabled in config.xml first
select * from system.text_log
# server log file
/var/log/clickhouse-server/clickhouse-server.log
Using multiple threads to speed up queries
By default max_threads is the number of physical CPU cores (i.e. half the logical cores on a machine with hyper-threading).
set max_threads=2
set max_threads=4
SELECT * FROM system.settings where name like '%threads%';
prewhere
- When PREWHERE is used, only the columns referenced in the PREWHERE condition are read first; the other columns are read on demand for the rows that pass (see the sketch below)
- Typically used with a non-index column that nevertheless provides a strong filter
- It is applied before WHERE
- Related setting:
optimize_move_to_prewhere: if set to 1 and the query has no PREWHERE, the server automatically decides whether to move conditions into PREWHERE (the default is 1)
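A minimal sketch against the nginx_bid_log table used earlier (the response-time threshold is arbitrary): upstream_resp_time is not part of the index, but it filters out most rows, so it goes into PREWHERE.
SELECT uid, idfa, imei
FROM nginx_bid_log
PREWHERE upstream_resp_time > 1000
WHERE eventDate = '2018-12-13';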
join
When joining, filter the data before the join to reduce the amount of data scanned.
Number of partitions
Keep it in the hundreds, not the thousands.
index_granularity
ClickHouse reads data by granules. This table-engine setting is how many rows one granule (one primary-index mark) covers.
The default is 8192.
The ORDER BY clause in CREATE TABLE
Choosing the right ORDER BY columns is key to performance:
- better compression
- better index selection
- better PREWHERE performance
encoding
https://www.altinity.com/blog/2019/7/new-encodings-to-improve-clickhouse
Use column encodings/codecs to reduce data size. The default compression is LZ4.
CREATE TABLE codec_example
(
dt Date CODEC(ZSTD),
ts DateTime CODEC(LZ4HC),
float_value Float32 CODEC(NONE),
double_value Float64 CODEC(LZ4HC(9)),
value Float32 CODEC(Delta, ZSTD)
)
ENGINE = MergeTree()
ORDER BY dt;
- LowCardinality: for String columns with fewer than ~10K distinct values. https://www.altinity.com/blog/2019/3/27/low-cardinality (see the sketch below)
- Delta: time-series style values
- DoubleDelta: increasing counters
- Gorilla: gauge data
- T64: integers that are not random hashes
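A sketch of declaring a LowCardinality column (the table and column names here are hypothetical):
CREATE TABLE lc_example(
    eventDate Date,
    country LowCardinality(String)
) ENGINE = MergeTree
ORDER BY eventDate;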
Using materialized views
CREATE [MATERIALIZED] VIEW [IF NOT EXISTS] [db.]table_name [TO [db.]name] [ENGINE = engine] [POPULATE] AS SELECT ...
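A minimal sketch (the view name is hypothetical): a materialized view that maintains a per-day row count for the b6logs table defined earlier, using SummingMergeTree so counts are merged incrementally.
CREATE MATERIALIZED VIEW b6logs_daily_mv
ENGINE = SummingMergeTree
PARTITION BY toYYYYMM(eventDate)
ORDER BY eventDate
POPULATE
AS SELECT eventDate, count() AS cnt
FROM b6logs
GROUP BY eventDate;
POPULATE backfills the view from existing data; rows inserted into b6logs afterwards are added to the view automatically.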
ReplacingMergeTree: a deduplicating table engine
CREATE TABLE bid_user(
uuid String,
yob Int32,
gender Int8 DEFAULT -1
) ENGINE = ReplacingMergeTree()
PARTITION by substring(uuid, 1, 1)
order by (yob, gender)
Note: it deduplicates rows that have the same partition key (PARTITION BY) and the same ORDER BY key.
- Deduplication happens in the background, through merge tasks
- It does not guarantee that the data you read right now is already deduplicated
- You can run optimize table bid_user [final]; to suggest that ClickHouse perform the merge/dedup immediately (still not guaranteed); to force deduplicated results at read time, see the FINAL sketch below
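For example, to force deduplicated results at query time (slower, since the merge is applied while reading):
SELECT * FROM bid_user FINAL;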
Miscellaneous
optimize
set optimize_throw_if_noop = 1;
OPTIMIZE TABLE T1 FINAL;
If the OPTIMIZE is a no-op, this makes ClickHouse throw an exception explaining why.
kill query
SELECT query_id, query FROM system.processes;
KILL QUERY WHERE query_id = '<id>';
Notes on table-creation parameters
ENGINE = MergeTree PARTITION BY eventDate ORDER BY imp_id SETTINGS index_granularity = 8192;
partition by: how the table is partitioned
order by: how data is indexed and sorted within each partition
References
- http://www.clickhouse.com.cn/topic/5ad9521b9d28dfde2ddc5f86
- https://clickhouse.yandex/docs/en/query_language/queries/
- https://github.com/yandex/ClickHouse/blob/master/dbms/tests/queries/0_stateless/00418_input_format_allow_errors.sh
- https://groups.google.com/forum/m/#!topic/clickhouse/U_Uc4omvoqc
- https://www.slideshare.net/Altinity/clickhouse-query-performance-tips-and-tricks-by-robert-hodges-altinity-ceo
- https://medium.com/@f1yegor/clickhouse-primary-keys-2cf2a45d7324
- https://www.altinity.com/blog/2019/7/new-encodings-to-improve-clickhouse
- https://github.com/ClickHouse/ClickHouse/issues/4796