visit
B-Tree stores the columns in an ordered manner which makes them useful for searching over a range of data, ORDER BY
and GROUP BY
operations.
CREATE TABLE EMPLOYEE (
first_name varchar(40) not null,
last_name varchar(40) not null,
department varchar(40) not null,
joining_date date not null
);
first_name | last_name | department | joining_date |
---|---|---|---|
Jaxon | Armstrong | Engineering | 2005-05-20 |
Donnas | Mckee | Sales | 2015-08-02 |
Donnas | Walton | Finance | 2005-01-10 |
Donnas | Hardy | Engineering | 2023-05-20 |
Donnas | Benton | Product | 2010-07-28 |
Donnas | Sawyer | Sales | 2005-03-29 |
Donnas | Schaefer | Finance | 2019-01-18 |
Donnas | Bauer | Engineering | 2020-09-14 |
Tomas | Pennington | Engineering | 2019-05-20 |
ALTER TABLE EMPLOYEE (
ADD INDEX first_name,
ADD INDEX department,
ADD INDEX joining_date
);
Query 1:
SELECT * FROM EMPLOYEE WHERE first_name = 'Donna' and department = 'Engineering'
-- Uses the Index on first_name
Even though there is an Index on the department
column, we only use the index on first_name
as the three individual indexes act as three separate tables.
Now when we run this query, MySQL will have to scan across seven rows where the first_name
is Donna; imagine the same scan across a huge table where around 1 million people have a similar name. Adding a single-column index won't help a lot.
ALTER TABLE EMPLOYEE (
ADD INDEX (department, first_name, joining_date)
);
The following table contains a composite index on (department, first_name, joining_date)
.
Query 1:
SELECT * FROM EMPLOYEE WHERE department = 'Engineering' AND first_name = 'Donna'
-- Uses the Index on (department, first_name)
As compared to single column index, this will be faster as the number of rows to scan will get decreased by using the composite index on (department, first_name)
Query 2:
SELECT * FROM EMPLOYEE WHERE last_name = 'Hardy'
-- Doesn't use any Index
Query 3:
SELECT * FROM EMPLOYEE
WHERE first_name = 'Donna'
AND joining_date > 2010-07-28
-- Doesn't use any Index
-- since the lookup follows the leftmost side of indexed columns
-- we are not mentioning the department
-- hence composite index is of no use
Query 4:
SELECT * FROM EMPLOYEE
WHERE department = 'Engineering'
AND first_name LIKE '%S%'
AND joining_date = 2010-07-28
-- Uses the Index on (department, first_name)
-- since for first_name we have a range query
-- it will only use first two columns of index
For the Multi-column Indexes in this Table, we have index search capabilities on (department)
, (department, first_name)
and (department, first_name, joining_date)
SELECT * FROM EMPLOYEE WHERE department = 'Engineering' ORDER BY last_name
This will use the department
index to give us the records and then perform a sort on the records to sort them by last_name
. This is because we didn't have any index last_name
hence the records were not sorted by that column. This type of sort fetches the data in a temporary buffer before returning it (File Sort), which leads to extra computation.
This can be avoided by adding an index on last_name
. The new composite index for us would be (department, first_name, joining_date, last_name)
SELECT * FROM EMPLOYEE WHERE department = 'Engineering' GROUP BY first_name
The records would already be sorted by department
hence speeding up the filtering. After these results are returned, they are also then sorted by first_name
due to the second part of the index, and so they are already grouped for us. This won't require any other sorting hence making our query fast.
Index selectivity is the ratio of a number of distinct indexed values(cardinality) to the total number of rows in the table(#T). The range for selectivity is from 1/#T to 1.
Consider the same Employee Table Schema with the multicolumn index on (department, first_name, joining_date, last_name)
. This composite index will perform better than a single column on queries, but let's check if this is the perfect column order for the given set of values in the Employee Table.
first_name
: 3/9 = 0.33Department
: 4/9 = 0.44last_name
: 9/9 = 1
Since joining_date
is a date field, and we might want to do range queries on it; it is good to have it last in the order of the index.
Based on the Selectivity, the ideal order looks like this (last_name, department, first_name, joining_date)
.
If we see the Query Examples above, the initial order we defined works well for us as we can use Indexes for most queries. Hence along with selectivity, the use case of the WHERE
clause also matters.
WHERE
conditions to effectively lookup data
Use of ORDER BY
and GROUP BY
Use of Range
conditions in the query
Index Selectivity
when unsure about the column order
The columns we want to SELECT
in the Query
It is important to note that the indexes shouldn't be optimized only for the WHERE clause but for the entire query.
Bad Query:
SELECT * FROM EMPLOYEE
WHERE department = 'Engineering'
AND YEAR(joining_date) = 2010
The above query will only use the index department
even though an index is present on joining_date
. This is because we have enclosed joining_date
within a function.
INSERT
, UPDATE
, and DELETE
operations. Every time we perform a write on a table, the indexes have to be maintained. Hence, it's important to choose Index wisely.EXPLAIN
statement to understand how the query is executing and reveal why it is slow. Explain statement shows values such as the indexes used, order used for joining the table, rows accessed, use of Filesort or not, and a temporary table used or not.
Also published here.