Indexes. An introduction.

‘The acceleration of databases through indexing,’ a notion that often floats like mist around the topics of databases and speed. Not incorrect, yet the truth is decidedly more complex. An index is, at its core, a mere replica of a subset of the data held within the table. Notably, the physical storage location of the data on the storage medium bears no relation to the logical arrangement of the index. Where exactly the data is physically stored has become increasingly obscure, especially now that we’re moving away from spinning magnetic storage to solid-state drives and cloud hosting.

To navigate and update an index efficiently, a logical structure is imperative. For this purpose, we employ two technologies: the doubly linked list to interconnect the so-called leaf nodes. Each node is housed within a data block or a page. All data blocks within an index are uniform in size. In SQL Server, for instance, 8K pages (Hello to: ‘NTFS cluster size = 64K’). The database crams as many index entries as possible into each block. But the order of these indices must be maintained on two levels: within the individual blocks and across the blocks through the linking.

The second structure, the B-Tree, a balanced tree data structure that maintains sorted data and allows searches, sequential access, insertions, and deletions in logarithmic time

The B-Tree is a self-balancing tree that keeps data sorted and allows for efficient insertion, deletion, and lookup operations. It’s the backbone of index performance in many database systems, including SQL Server. The beauty of a B-Tree lies in its ability to maintain balance with each operation, ensuring the path from the root to any leaf node is of the same length, thus guaranteeing efficient data retrieval operations.

When a B-Tree index is searched, it begins at the root and travels down to the appropriate leaf node. The nodes contain entries that act as separators, determining which path to follow.

The entries of the root node are compared in sequence with the search term ’42’ until an entry is found that is greater than or equal to >= the search term. The RDBMS follows the reference to the corresponding branch node and repeats the procedure until it reaches a leaf node.

The B-Tree excels when it comes to range queries. Thanks to its sorted nature, once the starting point of the range is found, it is a simple matter of following the leaf node chain to retrieve all the relevant entries.

Access

Accessing an index typically involves three steps: traversing the tree, following the leaf node chain, and accessing the table. While the tree structure limits the number of read operations required in the first step, the subsequent steps may necessitate reading many blocks, potentially slowing down access. Herein lies the conundrum of the index – it promises speed, but this promise is fully delivered in only one of the steps. The other steps can be slow, giving rise to the fallacy of the ‘broken index’.

Unique Index Seek

A ‘Unique Index Seek’ occurs when a query is looking for a unique value that is enforced by a uniqueness constraint, such as a primary key or a unique index. The operation traverses the B-Tree structure to find a specific entry that matches a unique key value. It’s quick because it precisely locates the data point, you’re after.

Index Seek & Index Scan

In contrast, an ‘Index Scan’ might iterate through the entire index if a range of values is being sought, which can be slow for large datasets. An ‘Index Seek’ is more refined, targeting only the specified range within the index.

Lookups

When a query uses non-clustered indexes, SQL Server might perform a ‘Key Lookup’ (or ‘RID Lookup’ for heap tables) to retrieve the actual data rows. This is where performance can be hindered:

Key Lookup: SQL Server uses the clustered index to find the primary key value and then retrieves the rest of the row from the table. It’s relatively quick because there’s a direct path to the full data row.

RID Lookup: For heap tables (tables without a clustered index), SQL Server uses a Row Identifier to find the data. This process can be slower because it involves searching through the heap to match the RID, which is less structured than a B-Tree.

Don’t do heaps

Good luck finding the part you are looking for.

RID Lookups are slower due to the lack of a structured path to locate the data. Without a clustered index, SQL Server must sift through the heap – an unordered table – to find the matching row. This is, you guessed it, less efficient then your alternative. A clustered index table. There it is again. The index.

Conclusion

So only the ‘Unique Index Seek’ is definitely quick, the Index Seek can be quick, depending on the size the Index SQL Server needs to go through. The Index scan runs through the index sequentially. This can be time consuming, depending on the size of the index. After that you might need to get additional data, not contained in the index. MS SQL Server fetches this data from the table. For every index hit, SQL Server retrieves any additional data specified in the query. This can lead to a query execution not as swift as one might anticipate. It is through comprehending the architecture of an index that we discover its potential to be less than rapid; an insightful revelation indeed.

Deep Dive into AI and a Sea of Data

I cannot deny my background as a database administrator. Initially, when I started delving into neural networks, I tried to

A closer look

Once in a while, you stumble upon a nice feature of SQL Server that you knew existed, but never really