Installation

I wrote an install script and put it in a gist.

You can run the following command to install automatically:

curl -L https://gist.githubusercontent.com/emacsist/ebb98839c814cdb2f16e20efcef9474e/raw/aa8e035ad2e61fa74f4c904a7953806c62155fa3/install-clickhouse.sh | bash

Here is the full content of the script:


#!/usr/bin/env bash

# ClickHouse requires SSE 4.2; report whether this CPU supports it
grep -q sse4_2 /proc/cpuinfo && echo "SSE 4.2 supported" || echo "SSE 4.2 not supported"

# Abort if SSE 4.2 is missing
grep -q sse4_2 /proc/cpuinfo || exit 1

# Add the Yandex APT repository. Note: `sudo echo ... > file` would not work,
# because the redirection runs as the current user; tee runs under sudo.
echo "deb http://repo.yandex.ru/clickhouse/deb/stable/ main/" | sudo tee /etc/apt/sources.list.d/clickhouse.list

# Import the repository signing key
sudo apt-key adv --keyserver keyserver.ubuntu.com --recv E0C56BD4

sudo apt-get update -y

sudo apt-get install clickhouse-client clickhouse-server -y

Starting and Stopping

sudo service clickhouse-server start

sudo service clickhouse-server stop
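
To check whether the server is running, or to restart it after a configuration change, the same service script should also accept status and restart (a sketch assuming the init script shipped with the Debian package):

sudo service clickhouse-server status

sudo service clickhouse-server restart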

Configuration

The server configuration file is at:

/etc/clickhouse-server/config.xml

The user configuration file is at:

/etc/clickhouse-server/users.xml

Changing the Data Directory

In config.xml, find:

    <!-- Path to data directory, with trailing slash. -->
    <path>/home/username/clickhouse/</path>

    <!-- Path to temporary data for processing hard queries. -->
    <tmp_path>/home/username/clickhouse/tmp/</tmp_path>

These are, respectively, the database's data directory and the directory for temporary files; change them as needed. By default both live under /var/lib/clickhouse/.
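
If you move the data directory, the new location must be writable by the user the server runs as. A minimal sketch, assuming the Debian package's default clickhouse service user (adjust the user and paths to your setup):

    # Stop the server before touching the data directory
    sudo service clickhouse-server stop
    # Create the new directories and hand them to the service user
    sudo mkdir -p /home/username/clickhouse/tmp
    sudo chown -R clickhouse:clickhouse /home/username/clickhouse
    # Edit config.xml as shown above, then start again
    sudo service clickhouse-server start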

Enabling Remote Access

Edit config.xml and find:

<listen_host></listen_host>

and change it to:

<listen_host>::/0</listen_host>

You can also explicitly specify the IP address you want to bind to.

Then edit users.xml and change the <networks> block to:

<networks incl="networks" replace="replace">
    <ip>::/0</ip>
</networks>

Then restart ClickHouse, and you can connect from other machines with clickhouse-client -h <IP>. Usernames and passwords are the child nodes of <users> in users.xml; each child node defines one user.
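
For example, from another machine (the IP address, user, and password below are placeholders):

clickhouse-client -h 192.168.1.100 -u default --password mypassword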

Common SQL

Creating a Table

CREATE TABLE b6logs(
    eventDate Date,
    impid UInt64,
    uid String,
    idfa String,
    imei String
) ENGINE=MergeTree(eventDate, (impid, eventDate), 8192)

In general, the MergeTree engine is recommended. With this syntax the engine requires a Date column to serve as the index, which is eventDate above.
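
As a quick sanity check that the new table works, insert a row and read it back (the values below are made up):

INSERT INTO b6logs VALUES ('2018-08-08', 1, 'uid-1', 'idfa-1', 'imei-1');

SELECT * FROM b6logs WHERE eventDate = '2018-08-08';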

Importing CSV Data

cat xxx.csv | clickhouse-client --query="INSERT INTO b6logs FORMAT CSV";

Specifying a Delimiter

cat xxx.csv | clickhouse-client --format_csv_delimiter="|" --query="INSERT INTO b6logs FORMAT CSV";

Ignoring Errors During Import

cat xxx.csv | clickhouse-client --input_format_allow_errors_num=100000 --input_format_allow_errors_ratio=0.2 --query="INSERT INTO b6logs FORMAT CSV";

--input_format_allow_errors_num: the maximum number of errors to tolerate

--input_format_allow_errors_ratio: the maximum fraction of rows allowed to fail, in the range [0, 1]

Exporting CSV Data

clickhouse-client --query="select uid, idfa, imei from (select impid, uid from b2logs where impid >= 15289903030261609347 and impid <= 15289904230261609347) any inner join (select impid, idfa, imei from b6logs where impid >= 15289903030261609347 and impid <= 15289904230261609347) using(impid) format CSV" > 9c9dc608-269b-4f02-b122-ef5dffb2669d.log

That is, the export syntax is select ... format CSV.
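
Other output formats work the same way, e.g. TabSeparated or JSONEachRow (the query below is just an illustration using the table from earlier):

clickhouse-client --query="select uid, idfa, imei from b6logs limit 10 format TabSeparated" > sample.tsv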

Renaming a Table

rename table tbl1 to tbl2;

Dropping a Table

drop table tbl;

Viewing a Table's Structure

desc tbl;

For more syntax, see the official documentation: https://clickhouse.yandex/docs/en/query_language/queries/

Dropping Partitions in the MergeTree Engine

Note that by default the MergeTree engine partitions by month, so partition names look like 201808.
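
For example, to drop one month of data from a table that uses the default monthly scheme:

alter table xxx drop partition '201808';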

If you want to partition by day instead, add this when creating the table:

ENGINE = MergeTree PARTITION BY eventDate ORDER BY imp_id SETTINGS index_granularity = 8192;
Then you can run:
alter table xxx drop partition '2018-08-08';

By default, ClickHouse refuses to drop partitions or tables larger than 50 GB. You can lift this limit permanently by editing the server configuration file, or temporarily, without restarting the service.

Permanent Configuration

sudo vim /etc/clickhouse-server/config.xml

Then find these two commented-out settings and uncomment them:

    <max_table_size_to_drop>0</max_table_size_to_drop>
    <max_partition_size_to_drop>0</max_partition_size_to_drop>

A value of 0 means no limit; alternatively, set each to the maximum size you want to allow.

Temporary Setting

Create a flag file:

sudo touch '/home/username/clickhouse/flags/force_drop_table' && sudo chmod 666 '/home/username/clickhouse/flags/force_drop_table'

Once the file exists (note that the flags directory lives under the data path configured earlier), you can run the drop-partition or drop-table command above.

Checking Table Sizes

SELECT
    database,
    table,
    formatReadableSize(size) AS size,
    formatReadableSize(bytes_on_disk) AS bytes_on_disk,
    formatReadableSize(data_uncompressed_bytes) AS data_uncompressed_bytes,
    formatReadableSize(data_compressed_bytes) AS data_compressed_bytes,
    compress_rate,
    rows,
    days,
    formatReadableSize(avgDaySize) AS avgDaySize
FROM
(
    SELECT
        database,
        table,
        sum(bytes) AS size,
        sum(rows) AS rows,
        min(min_date) AS min_date,
        max(max_date) AS max_date,
        sum(bytes_on_disk) AS bytes_on_disk,
        sum(data_uncompressed_bytes) AS data_uncompressed_bytes,
        sum(data_compressed_bytes) AS data_compressed_bytes,
        (data_compressed_bytes / data_uncompressed_bytes) * 100 AS compress_rate,
        max_date - min_date AS days,
        size / (max_date - min_date) AS avgDaySize
    FROM system.parts
    WHERE active -- only live parts; excludes parts superseded by merges
    GROUP BY
        database,
        table
    ORDER BY
        database ASC,
        size DESC
)

Performance Notes

Join Performance

Remember: have the big table join the small table, i.e. put the big table on the left. (I don't know exactly why, but empirically, using the big table as the driving table is far faster than using the small table. In MySQL it is the opposite: the small table drives the big one.)
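
As a sketch with the tables used earlier (assuming b6logs is far larger than the aggregated subquery on the right):

select impid, uid, cnt from b6logs any inner join (select impid, count() as cnt from b2logs group by impid) using(impid);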

Optimizing distinct count

Before:

select yob, count(), count(distinct uid, idfa, imei) from nginx_bid_log where eventDate='2018-09-01' group by yob;

After:

select yob, count(), count(distinct(sipHash64(concat(uid, idfa, imei)))) from nginx_bid_log where eventDate='2018-09-01' group by yob;
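
If an approximate count is acceptable, the built-in uniq aggregate is another option worth benchmarking (this alternative is my suggestion, not part of the original comparison):

select yob, count(), uniq(uid, idfa, imei) from nginx_bid_log where eventDate='2018-09-01' group by yob;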

References