Interview Questions: Database Indexes
Continuing my series on interview questions, I'm going to spend some time covering ops and sysadmin questions. We'll start by writing up an introduction to database indexes and their structure.
Most consumer-facing web startups these days use one of the major open source databases, either MySQL or PostgreSQL, to some degree. If you want to prove your worth it's a good idea to get down to the nitty gritty and gain some understanding about these databases' internals.
So, the question: "Explain to me what databases indexes are and how they work."
In a nutshell a database index is an auxiliary data structure which allows for faster retrieval of data stored in the database. They are keyed off of a specific column so that queries like "Give me all people with a last name of 'Smith'" are fast.
Database tables, at least conceptually, look something like this:
id age last_name hometown -- -- -- -- 1 10 Johnson San Francisco, CA 2 27 Smith San Joe, CA 3 15 Rose Palo Alto, CA 4 64 Farmer Mill Valley, CA 5 55 Pauling San Francisco, CA 6 17 Smith Oakland, CA ... ... ... ... 100 49 Meyer Berkeley, CA 101 30 Wayne Monterey, CA 102 18 Schwartz San Francisco, CA 104 6 Johnson San Francisco, CA ... ... ... ... 10000 41 Fetterman Mountain View, CA 10001 25 Breyer Redwood City, CA
That is, a table is a collection of tuplesFor bonus points, the "relational" in "relational database" comes from this fact, not from the idea that there are "relations" between tables.. If we have a file like this sitting on disk how do we get all records that have a last name of 'Smith?'
The code would wind up looking something like this:
results =  for row in rows: if row == 'Smith': results.append[row]
Finding the appropriate records requires checking the conditions (here, having a last name of 'Smith') for each row. This is linear in the number of rows which, for many databases, could be millions or billions of rows. Bad news.
How can we make it faster?
Any type of data structure that allows for (potentially) faster access can be considered an index. Let's look at some.
Take the same example from above, finding all people with a last name of 'Smith.' One solution would be to create a hash table. The keys of the hash would be based off of the last_name field and the values would be pointers to the database row.
This type of index is called, unsurprisingly, a "hash index." Most databases support them but they're generally not the default type. Why?
Well, consider a query like this: "Find all people who are younger than 45." Hashes can deal with equality but not inequality. That is, given the hashes of two fields, there's just no way for me to tell which is greater than the other, only whether they're equal or not.
The data structure most commonly used for database indexes are B-trees, a specific kind of self-balancing tree. A picture's worth a thousand words, so here's an example.
The main benefit of a B-tree is that it allows logarithmic selections, insertions, and deletions in the worst case scenario. And unlike hash indexes it stores the data in an ordered way, allowing for faster row retrieval when the selection conditions include things like inequalities or prefixes.
For example, using the tree above, to get the records for all people younger than 13 requires looking at only the left branch of the tree root.
Hash indexes and B-tree indexes are the most common types of database indexes, but there are others, too. MySQL supports R-tree indexes, which are used to query spatial data, e.g., "Show me all cities within ten miles of San Francisco, CA."
There are also bitmap indexes, which allow for almost instantaneous read operations but are expensive to change and take up a lot of space. They are best for columns which have only a few possible values.
Indexes don't come for free. What you gain for in retrieval speed you lose in insertion and deletion speed because every time you alter a table the indexes must be updated accordingly. If your table is updating frequently it's possible that having indexes will cause overall performance of your database to suffer.
There is also a space penalty, as the indexes take up space in memory or on disk. A single index is smaller than the table because it doesn't contain all the data, only pointers to the data, but in general the larger the table the larger the indexTechnically the size of an index is going to be proportional to the cardinality of the column being indexed..
Nodes in a B-tree contain a value and a number of pointers to children nodes. For database indexes the "value" is really a pair of values: the indexed field and a pointer to a database row. That is, rather than storing the row data right in the index, you store a pointer to the row on disk.
For example, if we have an index on an age column, the value in the B-tree might be something like (34, 0x875900). 34 is the age and 0x875900 is a reference to the location of the data, rather than the data itself.
This often allows indexes to be stored in memory even for tables that are so large they can only be stored on disk.
Furthermore, B-tree indexes are typically designed so that each node takes up one disk block. This allows each node to be read in with a single disk operation.
Also, for the pedants among us, many databases use B+ trees rather than classic B-trees for generic database indexes. InnoDB's BTREE index type is closer to a B+ tree than a B-tree, for example.
Database indexes are auxiliary data structures that allow for quicker retrieval of data. The most common type of index is a B-tree index because it has very good general performance characteristics and allows a wide range of comparisons, including both equality and inequalities.
The penalty for having a database index is the cost required to update the index, which must happen any time the table is altered. There is also a certain about of space overhead, although indexes will be smaller than the table they index.
For specific data types different indexes might be better suited than a B-tree. R-trees, for example, allow for quicker retrieval of spatial data. For fields with only a few possible values bitmap indexes might be appropriate.
Good Question, Bad Question
I like this question because it shows whether the interviewee is curious enough to dive into these details. For certain higher-level engineering positions knowing this should be second-nature, but even for a generic web development position knowing how your database works will only help you improve the performance of your web application.
Also, it's just arcane enough that you can go through the motions without knowing it, but not so arcane that it's inaccessible to someone without an advanced education. Any decent programmer should be able to understand it — the exceptional ones will go out of their way to learn it.