<-- home

如何加速你的 count 查询

2019/07/28

背景
问题
解决
延伸

背景

项目中，经常会有一些统计类的查询，大概模式是:

select count(*) from table_name where ...

如何加速此类查询，大部分人首先想到的办法可能就是，「根据 where 条件」给表加索引。
那么如果是整表 count 呢，还有提高速度的办法吗？答案是，有的。

问题

线上有一张 news 表，会有爬虫不断的向该表插入爬取的新闻数据（所以下文中每次count的结果总不相同是因为数据确实增加了）

create table news
(
    id          int auto_increment primary key,
    type        varchar(20) default ''                not null comment '新闻分类',
    title       text                                  null,
    description text                                  null,
    content     mediumtext                            null,
    pics        json                                  null,
    source      varchar(255)                          null,
    create_time timestamp   default CURRENT_TIMESTAMP null
) collate = utf8mb4_unicode_ci;

直接执行

mysql> select count(*) from news;
+----------+
| count(*) |
+----------+
|   678532 |
+----------+
1 row in set (1 min 3.57 sec)

mysql> select count(*) from news;
+----------+
| count(*) |
+----------+
|   678538 |
+----------+
1 row in set (50.18 sec)

可以看到，执行该 SQL 非常的耗时，我们查看一下执行计划：

mysql> explain select count(*) from news \G;
*************************** 1. row ***************************
           id: 1
  select_type: SIMPLE
        table: news
   partitions: NULL
         type: index
possible_keys: NULL
          key: PRIMARY
      key_len: 4
          ref: NULL
         rows: 440632
     filtered: 100.00
        Extra: Using index
1 row in set, 1 warning (0.00 sec)

执行计划说，该查询使用了主键索引，由于是主键索引，该查询自然而然的也是覆盖索引，并且 filtered 也是 100.00。那么如何对这句SQL进行加速呢？

解决

由于 count(*) 类型的查询不需要去任何一列的数据，主键索引(clustered index)的叶子节点存储了完整的行数据（上例中甚至还有很多的 blob 类字段，数据页都存储不下了），遍历主键索引会带来较大的磁盘 IO，所以解决办法就呼之欲出了———随便找个列，建一个小巧的辅助索引。

我们选择 source 列来建索引。

create index news_source_index on news (source);

索引建好后，我们查看执行计划：

mysql> explain select count(*) from news \G;
*************************** 1. row ***************************
           id: 1
  select_type: SIMPLE
        table: news
   partitions: NULL
         type: index
possible_keys: NULL
          key: news_source_index
      key_len: 1023
          ref: NULL
         rows: 440697
     filtered: 100.00
        Extra: Using index
1 row in set, 1 warning (0.02 sec)

根据 Extra 字段，你会发现，该查询仍然是覆盖索引，但是 InnoDB 选择了使用我们刚刚创建的辅助索引，这是因为，辅助索引远小于主键索引，遍历辅助索引的代价较小一些。再一次执行查询：

mysql> select count(*) from news;
+----------+
| count(*) |
+----------+
|   678129 |
+----------+
1 row in set (1.32 sec)

mysql> select count(*) from news;
+----------+
| count(*) |
+----------+
|   678129 |
+----------+
1 row in set (0.09 sec)

连续两次执行该SQL的时间分别从 1 min 3.57 sec、50.18 sec 变成了 1.32 sec、0.09 sec

延伸

上文讲解的例子是一个没有 where 条件的 SQL，提高 count(*) 速度的方法是随便建一个代价较小的辅助索引。如果是带有查询条件的查询呢？我们是不是一定要建立精准命中查询条件的索引呢？答案是否定的。

对于 count 型查询，我们不一定需要一个完美切合查询条件的索引。

下面用一个例子来说明，首先还是举一个创建精确索引的例子。

还是上面的 news 表，我们想执行一个范围查询：

mysql> select count(*) from news where create_time > curdate();
+----------+
| count(*) |
+----------+
|      996 |
+----------+
1 row in set (1 min 44.76 sec)

该查询用不到任何索引，所以速度非常慢，因此，我们首先想到的是，给 create_time 字段建立索引。

mysql> create index news_create_time_index on news (create_time);
Query OK, 0 rows affected (1 min 16.94 sec)
Records: 0  Duplicates: 0  Warnings: 0

索引建好后，我们查看执行计划，可以用上这个索引，并且查询时间已经可以忽略不计了：

mysql> explain select count(*) from news where create_time > curdate() \G;
*************************** 1. row ***************************
           id: 1
  select_type: SIMPLE
        table: news
   partitions: NULL
         type: range
possible_keys: news_create_time_index
          key: news_create_time_index
      key_len: 5
          ref: NULL
         rows: 1104
     filtered: 100.00
        Extra: Using where; Using index
1 row in set, 1 warning (0.00 sec)

mysql> select count(*) from news where create_time > curdate();
+----------+
| count(*) |
+----------+
|     1104 |
+----------+
1 row in set (0.01 sec)

（注意下面的转折）

然而，假设如果这张表已经有了包含 create_time 的联合索引，并且 create_time 不是联合索引的第一列，那么 InnoDB 可否使用那个联合索引来优化这个查询呢？

比如假设我们已经有了(source, create_time) 的联合索引，仅仅看 where 条件，是对 create_time 进行范围查找，显然是用不上这个联合索引的，因为联合索引只有第一列是有序的，其它列只相对于前一列保持有序。下面我们来实验一下，看看实际情况如何。

首先我们删除已经存在的 create_time 的索引，然后建立一个 (source, create_time) 的联合索引：

mysql> create index news_source_create_time_index on news (source, create_time);
Query OK, 0 rows affected (40.95 sec)
Records: 0  Duplicates: 0  Warnings: 0

mysql> drop index news_create_time_index on news;
Query OK, 0 rows affected (0.02 sec)
Records: 0  Duplicates: 0  Warnings: 0

(吃了个中饭回来)
然后我们看一看这条查询的执行计划：

mysql> explain select count(*) from news where create_time > curdate() \G;
*************************** 1. row ***************************
           id: 1
  select_type: SIMPLE
        table: news
   partitions: NULL
         type: index
possible_keys: news_source_create_time_index
          key: news_source_create_time_index
      key_len: 1028
          ref: NULL
         rows: 441733
     filtered: 33.33
        Extra: Using where; Using index
1 row in set, 1 warning (0.03 sec)

根据执行计划我们发现，对 create_time 的范围查询，竟然可以用到 (source, create_time) 的联合索引，实际执行该查询：

mysql> select count(*) from news where create_time > curdate();
+----------+
| count(*) |
+----------+
|     1173 |
+----------+
1 row in set (0.27 sec)

查询仍然非常的高效，这是为什么呢？

原因在于，尽管 create_time 是无序的，但是对于 count(*) 类型的查询，我们需要使用到的字段其实只有 create_time 而已，而辅助索引 (source, create_time) 的 B+ 树中，已经包含了所有的 create_time 信息，可以「覆盖」我们的查询条件，足够我们对「记录数」进行统计，因此，不需要去遍历巨大的主键索引的 B+ 树。然而，由于 create_time 是无序的，我们仍然需要对所有的索引进行排序之后再筛选出满足条件的记录，所以 Extra 中包含了 Using where，但这比直接在主键索引中进行筛选，已经减少了很多的磁盘IO，所以仍然保持了查询性能。