关于索引,那些你可能不知道的事
关于索引,那些你可能不知道的事
Things You Didn't Know About Indexes

读取变快,写入变慢 / Reads get faster, writes get slower
Let's start with things you probably did know.
让我们从你可能确实知道的事情开始。
A database index is similar to the index you found at the back of your science textbooks in school. You want to find the pages that talk about Phosphorus? Head to the back and you'll find an alphabetical list of topics alongside page numbers that reference them.
数据库索引类似于你在学校科学教科书背面找到的索引。你想找到谈论磷的页面?翻到背面,你会找到一个按字母顺序排列的主题列表,旁边有引用它们的页码。
Oxygen ................. 42, 88, 103
Periodic table .......... 12–15
Phosphorus .............. 67, 91
Photosynthesis .......... 54, 72, 110
Potassium ............... 33, 78
Let's imagine we have a database table of Pokémon, like so:
假设我们有一个 Pokémon 数据库表,像这样:
id | name | type_1 | type_2 | generation | is_legendary | base_attack
----+-----------+----------+---------+------------+--------------+------------
1 | Bulbasaur | Grass | Poison | 1 | false | 49
4 | Charmander| Fire | NULL | 1 | false | 52
...
25 | Pikachu | Electric | NULL | 1 | false | 55
...
150| Mewtwo | Psychic | NULL | 1 | true | 110
Without an index, finding Pikachu means the database reads every row, one by one, checking the name column for each row. On four rows, that would be nearly instant. On ten million, it could be problematic. That kind of read is called a full table scan, and it's about as fast as it sounds (not very).
没有索引,找到皮卡丘意味着数据库逐行读取每一行,检查每行的 name 列。在四行上,这几乎是瞬间的。在一千万行上,这可能就有问题了。这种读取被称为全表扫描,速度和听起来一样快(不太快)。
But, if we add an index on name we get something that conceptually looks a bit like our book index above:
但是,如果我们在 name 上添加索引,我们会得到一些概念上看起来像上面的书籍索引的东西:
name → row
------------------
Bulbasaur → 1
Charmander → 4
Mewtwo → 150
Pikachu → 25
It's sorted, and so now our database can binary-search for the name and find the given row, rather than scanning the entire table. Postgres stores this as a B-tree under the hood, but the idea is the same as the textbook: sorted data you can search quickly.
它是排序的,所以现在我们的数据库可以二分查找这个名字并找到给定的行,而不是扫描整个表。Postgres 在底层将其存储为 B-tree,但想法和教科书一样:你可以快速搜索的排序数据。
So, let's index all the things, right?
那么,让我们把所有东西都加上索引,对吧?
The Cost of Indexing
索引的代价
As tempting as it may be to index everything, now we've seen what they can do for our queries, there are trade-offs to consider. With indexes, always keep in mind:
尽管看到索引能为查询做什么后很想给所有东西都加上索引,但有一些权衡需要考虑。使用索引时,始终记住:
Reads get faster, writes get slower.
读取变快,写入变慢。
With our shiny new index, every INSERT, UPDATE or DELETE now has to update the index too. When we add a new Pokémon, the database has to find the right spot in the sorted name index and slot it in. And we can have multiple indexes, so you can multiply that by each one.
有了我们闪亮的新索引,每个 INSERT、UPDATE 或 DELETE 现在也必须更新索引。当我们添加一个新的宝可梦时,数据库必须在排序的名称索引中找到正确的位置并将其插入。我们可以有多个索引,所以你可以乘以每一个。
Additionally, indexes are real data structures and we must store them. They live on disk and are pulled into memory. A table with eight indexes has nine things to keep warm in cache, not just one.
此外,索引是真实的数据结构,我们必须存储它们。它们存在于磁盘上并被拉入内存。一个有八个索引的表需要在缓存中保持温暖的东西有九个,而不只是一个。
And let's not forget the query planner. When you run a query, the planner's job is to pick the cheapest way to answer it. The more indexes you have, the more options it weighs, so planning time grows and could even exceed execution time on fast lookups.
让我们不要忘记查询规划器。当你运行查询时,规划器的工作是选择最便宜的方式来回答它。你拥有的索引越多,它权衡的选项就越多,所以规划时间会增长,甚至在快速查找时可能超过执行时间。
Why Your Index Isn't Working
为什么你的索引不起作用
You've got as far as weighing the trade-offs and deciding an index is right for the job. Excellent. But it's not working as you'd hoped. You're not seeing an improvement, or worse, you're seeing speed diminish.
你已经权衡了权衡并决定索引适合这项工作。太好了。但它没有像你希望的那样工作。你没有看到改善,或者更糟,你看到速度下降。
Let's cover some common gotchas.
让我们介绍一些常见的陷阱。
Composite Indexes Care About Order
复合索引关心顺序
Looking back at our Pokémon table, you may have decided a good index is one on type_1 and type_2. After all, "show me all the Pokémon that are Water and Flying types" is a perfectly reasonable query.
回顾我们的宝可梦表,你可能决定一个好的索引是 type_1 和 type_2 上的索引。毕竟,"显示所有是水和飞行类型的宝可梦"是一个完全合理的查询。
CREATE INDEX ON pokemon (type_1, type_2);
This index will definitely help with queries like these:
这个索引肯定会帮助这样的查询:
SELECT * FROM pokemon WHERE type_1 = 'Water';
SELECT * FROM pokemon WHERE type_1 = 'Water' AND type_2 = 'Flying';
But this one? Not so much:
但是这一个?就不太多了:
SELECT * FROM pokemon WHERE type_2 = 'Flying';
Do you find that surprising?
你觉得这令人惊讶吗?
When you create a composite index (type_1, type_2) you are asking the database to create a structure that looks something like this:
当你创建一个复合索引 (type_1, type_2) 时,你要求数据库创建一个看起来像这样的结构:
Grass → Poison → [Bulbasaur, Oddish, ...]
→ ...
Fire → NULL → [Charmander, Vulpix, ...]
→ Flying → [Charizard, Moltres, ...]
→ ...
Water → NULL → [Squirtle, Psyduck, ...]
→ Flying → [Wingull, Pelipper, ...]
→ ...
It sorts first by type_1, and then by type_2 within each group. That makes the index great for queries on type_1, and even better for queries on type_1 AND type_2 together, but won't be used the way you hope for queries on type_2 alone. Look at Flying: it's scattered under Grass, Fire, Water with no single place for the database to jump to.
它首先按 type_1 排序,然后在每个组内按 type_2 排序。这使得索引对于 type_1 上的查询非常好,对于 type_1 AND type_2 一起的查询甚至更好,但对于单独在 type_2 上的查询不会像你希望的那样使用。看看 Flying:它分散在 Grass、Fire、Water 下,数据库没有单一的地方可以跳转。
What we must ask ourselves is: What are the common queries going to be? If we will query on type_2 alone as frequently as type_1, then we should create a second index on type_2.
我们必须问自己的是:常见的查询会是什么?如果我们会像查询 type_1 一样频繁地单独查询 type_2,那么我们应该在 type_2 上创建第二个索引。
Functions Defeat Your Index
函数会击败你的索引
Case-insensitive search is a feature we often add without thinking. Users don't care whether they type "Pikachu", "pikachu" or "PiKaChU" - they just want the result. So we might reach for something like this:
不区分大小写的搜索是我们经常不假思索就添加的功能。用户不在乎他们输入的是"Pikachu"、"pikachu"还是"PiKaChU"——他们只想要结果。所以我们可能会使用这样的东西:
SELECT * FROM pokemon WHERE lower(name) = 'pikachu';
I don't know about you, but I wouldn't look twice at a query like that. It seems absolutely fine. We've got our index on name from earlier, so this query should fly. But, of course, it doesn't.
我不知道你怎么想,但我不会再看这样的查询。它看起来绝对没问题。我们从之前就有 name 上的索引,所以这个查询应该飞起来。但是,当然,它没有。
Why? The index is on name, not lower(name).
为什么?索引在 name 上,而不是 lower(name) 上。
As far as the database is concerned, those are two completely different things. Your index is a sorted list of values like Bulbasaur, Charmander, Squirtle, Pikachu not bulbasaur, charmander, squirtle, pikachu. So when you ask for rows where lower(name) = 'pikachu', the database has no sorted structure to consult and falls back to scanning the whole table, lowercasing each name as it goes.
就数据库而言,这是两个完全不同的东西。你的索引是一个排序的值列表,如 Bulbasaur、Charmander、Squirtle、Pikachu,而不是 bulbasaur、charmander、squirtle、pikachu。所以当你请求 lower(name) = 'pikachu' 的行时,数据库没有可咨询的排序结构,只能回退到扫描整个表,一边走一边将每个名称小写。
This applies to any function wrapping the column. If the database can't see the raw indexed column on the left side of the comparison, the index is off the table.
这适用于包装列的任何函数。如果数据库在比较的左侧看不到原始索引列,索引就不在考虑范围内。
And here's a real gotcha: Implicit conversions count too. When you compare some text against an integer, or a timestamp against a string, it can trigger a silent conversion that has the same effect as wrapping a column in a function.
这里有一个真正的陷阱:隐式转换也算。当你将一些文本与整数进行比较,或将时间戳与字符串进行比较时,它可能会触发一个静默转换,其效果与在函数中包装列相同。
How can we get around this? Well, fortunately, we can build an index on an expression. An index on lower(name) is perfectly valid, and the query above would use it without complaint. We'll come back to this.
我们如何绕过这个?好吧,幸运的是,我们可以在表达式上建立索引。lower(name) 上的索引是完全有效的,上面的查询会毫无怨言地使用它。我们稍后会回到这个话题。
How Can I Avoid These Pitfalls?
如何避免这些陷阱?
The short answer here is, don't guess. Measure. How? By asking the database, of course!
这里的简短回答是,不要猜测。测量。怎么做?当然是问数据库!
Postgres ships with a tool called EXPLAIN, which tells you exactly how it plans to run a query. Slap it in front of any SELECT and it will return a report of what the database is about to do:
Postgres 附带了一个名为 EXPLAIN 的工具,它可以准确地告诉你它计划如何运行查询。把它放在任何 SELECT 前面,它会返回数据库即将做什么的报告:
EXPLAIN SELECT * FROM pokemon WHERE name = 'Pikachu';
Index Scan using pokemon_name_idx on pokemon
Index Cond: (name = 'Pikachu'::text)
That Index Scan is what you want to see. It means the database is using your index. Compare that with our problem query from earlier:
那个 Index Scan 就是你想要看到的。这意味着数据库正在使用你的索引。把它和我们之前的问题查询比较一下:
EXPLAIN SELECT * FROM pokemon WHERE lower(name) = 'pikachu';
Seq Scan on pokemon
Filter: (lower(name) = 'pikachu'::text)
Seq Scan means it's reading every row. No index involved.
Seq Scan 意味着它正在读取每一行。没有涉及索引。
If you want real timings to go with that rather than the planner's estimates, use EXPLAIN ANALYZE. It actually runs the query and reports what happened. Try it on a query you think you understand. What comes back can often be surprising.
如果你想要真实的计时而不是规划器的估计,请使用 EXPLAIN ANALYZE。它实际上运行查询并报告发生了什么。在你认为你理解的查询上试试。返回的结果往往会令人惊讶。
Things Nobody Told You Existed
没人告诉过你的东西
We've covered the basic single column index and its composite sibling. Between them, they cover the majority of cases. But there are a few other kinds of index that we can field, and they're sometimes just what you need.
我们已经介绍了基本的单列索引及其复合兄弟。在它们之间,它们涵盖了大多数情况。但是还有几种其他类型的索引我们可以使用,它们有时正是你需要的。
Functional Indexes
函数索引
We touched on this earlier, with this query:
我们之前提到了这个,通过这个查询:
SELECT * FROM pokemon WHERE lower(name) = 'pikachu';
We established that an index on name won't help, because our database is looking for rows where lower(name) equals something, and it has no sorted structure for lower(name). The fix is to give it one:
我们确定 name 上的索引没有帮助,因为我们的数据库正在寻找 lower(name) 等于某个东西的行,并且它没有 lower(name) 的排序结构。解决方法就是给它一个:
CREATE INDEX ON pokemon (lower(name));
That is a functional index (also called an expression index). Instead of indexing the raw column, you index the result of an expression applied to the column.
这就是函数索引(也称为表达式索引)。你不是索引原始列,而是索引应用于列的表达式的结果。
This is not limited to lower(), of course. You can index any deterministic expression:
这当然不限于 lower()。你可以索引任何确定性表达式:
CREATE INDEX ON pokemon (lower(name));
CREATE INDEX ON users ((created_at::date));
CREATE INDEX ON pokemon ((base_attack * 2));
You can use any deterministic and immutable function.
你可以使用任何确定性和不可变的函数。
Watch out though. Reaching for functional indexes as your first resort can often be a smell. If we're querying on the lowercase form of the name so often, we should ask ourselves why we haven't stored the name in lowercase. We can always apply lower to the input.
不过要小心。将函数索引作为首选方案往往是一种坏味道。如果我们经常以名称的小写形式查询,我们应该问自己为什么我们没有以小写形式存储名称。我们总是可以对输入应用 lower。
Partial Indexes
部分索引
Most indexes cover every row in the table. But sometimes you only ever query a small slice of it, and indexing the entirety is wasteful. Remember, indexes are not free.
大多数索引覆盖表中的每一行。但有时你只查询其中的一小部分,索引整个表是浪费的。记住,索引不是免费的。
In our Pokémon table, we have a column called is_legendary. For those unfamiliar, at the time of writing, approximately 1000 species of Pokémon exist and of those, only 80 or so are considered legendary. Less than 10%.
在我们的宝可梦表中,我们有一个名为 is_legendary 的列。对于那些不熟悉的人来说,在撰写本文时,大约有 1000 种宝可梦物种,其中只有大约 80 种被认为是传说中的。不到 10%。
You can imagine an application with an option to "show me all legendaries". Initially, one might consider creating a compound index like so:
你可以想象一个带有"显示所有传说宝可梦"选项的应用程序。最初,人们可能会考虑创建一个像这样的复合索引:
CREATE INDEX ON pokemon (is_legendary, name);
This works, but it's overkill. The index now contains an entry for every pokemon, of which the vast majority are not legendary. We'll be paying the storage and write cost for 1000 rows to serve queries that only care about 80. In fact, because is_legendary is a boolean, we end up with a structure where we initially sort into two groups of False and True, where the former half is 920 entries of deadweight, paying their share of every write to the table for queries that ignore them.
这行得通,但这是过杀。索引现在包含每个宝可梦的条目,其中绝大多数不是传说中的。我们将为 1000 行支付存储和写入成本,只为服务只关心 80 行的查询。事实上,因为 is_legendary 是布尔值,我们最终得到一个结构,我们首先排序成 False 和 True 两个组,前者是 920 个死重条目,为忽略它们的查询支付对表的每次写入的份额。
A partial index covers only the rows that match a condition:
部分索引只覆盖匹配条件的行:
CREATE INDEX ON pokemon (name) WHERE is_legendary = true;
Now the index has just 80 entries instead of 1000. It's smaller, faster to query; and queries like WHERE is_legendary = true use it happily. Queries that don't match the condition (WHERE is_legendary = false) fall back to a full table scan, which is exactly what you'd want. Those queries match almost every row anyway, so an index isn't helping much.
现在索引只有 80 个条目而不是 1000 个。它更小,查询更快;像 WHERE is_legendary = true 这样的查询会愉快地使用它。不匹配条件的查询(WHERE is_legendary = false)会回退到全表扫描,这正是你想要的。这些查询无论如何都匹配几乎每一行,所以索引没有太大帮助。
Any time you have a column where you mostly query a single value is a time you might consider a partial index. Imagine indexing the email column of your users table, where you've implemented soft-delete:
任何时候你有一个主要查询单个值的列,都可能是你考虑部分索引的时候。想象一下索引用户表的电子邮件列,在那里你实现了软删除:
CREATE INDEX ON users (email) WHERE deleted_at IS NULL;
Soft-deleted rows are going to be almost never queried, but would still bloat your indexes if you don't filter them out.
软删除的行几乎永远不会被查询,但如果你不过滤它们,仍然会使你的索引膨胀。
Covering Indexes
覆盖索引
When a database uses an index to find a row, that's only half the work. The index points to where the row lives, but then the database must go to the table itself and fetch the rest of the columns the query asked for. Two trips.
当数据库使用索引来查找行时,那只是工作的一半。索引指向行所在的位置,但然后数据库必须转到表本身并获取查询请求的其余列。两次行程。
But what if I told you it doesn't have to?
但如果我告诉你不必这样呢?
If the index already contains every column the query needs, the database can answer the query from the index alone, never touching the table. This is a covering index. When viewed with EXPLAIN, you'll see it as an Index Only Scan, three of the sweetest words EXPLAIN can output.
如果索引已经包含查询所需的每一列,数据库可以仅从索引回答查询,从不接触表。这就是覆盖索引。当用 EXPLAIN 查看时,你会看到它被标记为 Index Only Scan,这是 EXPLAIN 可以输出的最甜美的三个词。
Here's a query:
这是一个查询:
SELECT name FROM pokemon WHERE is_legendary = true;
Our partial index from the previous section happens to cover this query perfectly. The index is on name, filtered to legendaries and the query asks for name from legendaries. Everything the query needs is already in the index. No trip to the table required.
我们上一节的部分索引恰好完美地覆盖了这个查询。索引在 name 上,过滤到传说宝可梦,查询从传说宝可梦中请求 name。查询需要的所有东西都已经在索引中了。不需要去表。
We can also build covering indexes deliberately. Postgres lets you tack extra columns onto an index using INCLUDE:
我们也可以刻意构建覆盖索引。Postgres 允许你使用 INCLUDE 将额外的列附加到索引上:
CREATE INDEX ON pokemon (name) INCLUDE (base_attack);
Now a query like this can be answered entirely from the index:
现在这样的查询可以完全从索引中回答:
SELECT name, base_attack FROM pokemon WHERE name = 'Pikachu';
Why not just put base_attack in the indexed columns themselves? Because the database treats indexed columns as things to sort and search by. If you add base_attack to the index proper, you're telling the database to sort the index by name and then by base_attack, work it has to do on every write. Work that has no benefit when you only ever search by name. INCLUDE is a way to say "carry this column along for the ride, but don't bother sorting by it".
为什么不直接把 base_attack 放在索引列本身中?因为数据库将索引列视为排序和搜索的东西。如果你把 base_attack 添加到索引中,你就是告诉数据库按 name 排序索引,然后按 base_attack 排序,这是它必须在每次写入时做的工作。当你只按 name 搜索时,这项工作没有好处。INCLUDE 是一种说法,"带着这一列一起走,但别费心按它排序"。
Indexes can feel like an afterthought to the novice database developer. Even seasoned engineers often create them without much thought, or leave them off entirely. But intelligent indexing will make or break your database performance.
对于新手数据库开发人员来说,索引可能感觉像是事后的想法。即使是经验丰富的工程师也经常不假思索地创建它们,或者完全不使用它们。但智能索引将决定你的数据库性能的成败。
Want to go deeper? Use The Index, Luke will teach you everything you could ever want to know.
想要更深入?Use The Index, Luke 会教你所有你可能想知道的东西。
原文链接: Things you didn't know about indexes
作者: jon.chrt.dev

查询模式必须匹配索引结构 / Query patterns must match index structure