Understanding Database Fundamentals: Tables, Indexes, and Storage

In databases, the way data is stored, accessed, and queried plays a critical role in performance and efficiency. This blog explores the foundational concepts behind tables, indexes, and their storage on disk. If you're diving into databases, understanding these basics will help you grasp the mechanics that influence query speed and cost.

Tables in Databases: Rows and Columns

A table is the primary structure for organizing and storing data in relational databases. It consists of two main components:

  • Rows: Represent individual records in the table.
  • Columns: Define the properties or fields of those records.

For example, in an employees table:

| Employee ID | Name  | Department | Hire Date  |
|-------------|-------|------------|------------|
| 1           | Alice | HR         | 2022-05-01 |
| 2           | Bob   | IT         | 2021-07-15 |

Each row here represents one employee, and each column captures a specific attribute of that employee.

Row IDs and Unique Identification

Every row in a table must be uniquely identifiable. This is critical for efficient storage, retrieval, and management of data. Databases handle this in different ways:

  1. System-Maintained Identifiers:
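    • Some databases, like PostgreSQL, use a system-maintained identifier called a tuple identifier (TID), exposed as the ctid system column.
    • A TID records the row's physical location (the page number and the position within that page), so it acts like an internal pointer that helps the database locate the row on disk. It can change when the row is updated or the table is rewritten, so it is not a stable long-term key.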
  2. Primary Keys as Row IDs:
    • In MySQL (with its default InnoDB storage engine) and some other databases, the primary key (a user-defined unique identifier) often serves the same purpose as a Row ID.
    • For example, in the table above, Employee ID could be the primary key, uniquely identifying each row.

These identifiers are not always visible to the user but are essential for database operations. They ensure that even if two rows contain similar data, the database can distinguish between them.
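As a rough illustration of a system-maintained identifier, the sketch below uses SQLite, chosen only because it ships with Python; it is not one of the databases discussed above, and its hidden rowid column is an analogue of a TID rather than the same thing. The table and values are made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employees (name TEXT, department TEXT)")

# Insert two rows whose column values are identical.
conn.executemany(
    "INSERT INTO employees (name, department) VALUES (?, ?)",
    [("Alice", "HR"), ("Alice", "HR")],
)

# rowid is SQLite's hidden, system-maintained row identifier.
rows = conn.execute(
    "SELECT rowid, name, department FROM employees ORDER BY rowid"
).fetchall()
for row in rows:
    print(row)
```

Even though both rows hold the same data, each one gets its own identifier, which is what lets the database tell them apart.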

Translation to Bits and Bytes

At their core, all tables and their rows and columns are eventually stored as bits and bytes on disk. Here’s how this works:

  • Each row is broken into smaller pieces of data corresponding to its columns.
  • These pieces are encoded into binary form (0s and 1s).
  • The binary data is then written to pages, the fixed-size blocks that the database reads from and writes to disk.

For example:

  • A row like Employee ID: 1, Name: Alice is stored in a page as binary data.
  • The database uses the Row ID or Tuple ID to locate this specific row within the page.
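The encoding step can be sketched with Python's struct module. The layout here (a 4-byte integer followed by a 10-byte name field) is an assumption made for illustration; real databases use their own, more elaborate row formats.

```python
import struct

# Hypothetical fixed layout: 4-byte unsigned int id + 10-byte name field,
# little-endian, no padding between fields.
ROW_FORMAT = "<I10s"

def encode_row(employee_id: int, name: str) -> bytes:
    # struct pads the name with zero bytes up to 10 bytes.
    return struct.pack(ROW_FORMAT, employee_id, name.encode("utf-8"))

def decode_row(raw: bytes) -> tuple[int, str]:
    employee_id, name_bytes = struct.unpack(ROW_FORMAT, raw)
    return employee_id, name_bytes.rstrip(b"\x00").decode("utf-8")

raw = encode_row(1, "Alice")
print(len(raw))         # 14
print(raw.hex())        # 01000000416c6963650000000000
print(decode_row(raw))  # (1, 'Alice')
```

The 14 bytes printed here are what would actually sit inside a page for this row under the assumed layout.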

Pages: The Building Blocks of Storage

A page is the smallest unit of data storage that a database reads or writes to disk. Databases don't operate on individual rows or columns when interacting with the disk. Instead, they read and write pages, which are fixed-size chunks of data.

Why Use Pages?

  • Efficiency: Reading or writing one small piece of data (like a single row) would be highly inefficient due to the overhead of disk operations. Instead, the database fetches a whole page that contains multiple rows in one I/O operation.
  • Disk Structure: Pages align with the way disks are organized, making it easier for databases to manage data.

How Are Pages Organized?

Most databases use fixed-size pages because they simplify storage management and optimize disk usage.

  • PostgreSQL: Default page size is 8KB.
  • MySQL: Default page size is 16KB (configurable).

A single page can hold multiple rows of a table, but the number of rows depends on:

  • Row size: Larger rows mean fewer rows per page.
  • Data types: With variable-length columns (e.g., VARCHAR), row size depends on the values actually stored, so the number of rows per page varies; longer values mean fewer rows per page.
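A back-of-the-envelope calculation shows how row size drives rows per page. The 24-byte page overhead and the row sizes are assumed numbers for illustration, and per-row bookkeeping is ignored, so treat the results as rough estimates.

```python
PAGE_SIZE = 8 * 1024  # 8KB, the PostgreSQL default mentioned above
PAGE_OVERHEAD = 24    # assumed bytes reserved per page (illustrative)

def rows_per_page(row_size: int, page_size: int = PAGE_SIZE,
                  overhead: int = PAGE_OVERHEAD) -> int:
    # Whole rows only: a row that does not fit goes to another page.
    return (page_size - overhead) // row_size

print(rows_per_page(80))   # 102
print(rows_per_page(400))  # 20
```

Making rows five times larger cuts the rows per page to roughly a fifth, which means five times as many pages to read for the same number of rows.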

The Role of Pages in I/O

What Happens During an I/O Operation?

  • When a database retrieves data from disk, it doesn’t fetch just the row you requested. Instead, it fetches the entire page containing that row.
    • For example, if a page contains 10 rows and you query for row 5, the database fetches all 10 rows in that page.

Why Is This Important?

  • Fetching more data than necessary can slow down performance, especially if the data is spread across many pages.
  • Efficient queries aim to minimize the number of pages read, which reduces the total I/O cost.
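The page-at-a-time behaviour can be mimicked with a small in-memory model. The PagedTable class and the figure of 10 rows per page are hypothetical; the point is only that the page is the unit being counted.

```python
ROWS_PER_PAGE = 10  # assumed for illustration

class PagedTable:
    def __init__(self, rows):
        # Split the rows into fixed-size pages.
        self.pages = [rows[i:i + ROWS_PER_PAGE]
                      for i in range(0, len(rows), ROWS_PER_PAGE)]
        self.pages_read = 0

    def read_page(self, page_no):
        # Every access is counted per page, never per row.
        self.pages_read += 1
        return self.pages[page_no]

    def get_row(self, row_no):
        page = self.read_page(row_no // ROWS_PER_PAGE)
        return page[row_no % ROWS_PER_PAGE]

table = PagedTable([f"row-{i}" for i in range(30)])
fetched = table.get_row(5)
print(fetched)           # row-5
print(table.pages_read)  # 1
```

Asking for one row cost one page read, and that page brought the other nine rows along with it.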

How Pages Impact Query Performance

Let’s explore a few scenarios:

  1. Small Table:

    • Suppose a table with 10 rows fits entirely within one page.
    • Any query on this table involves just one I/O operation to fetch the page.
  2. Large Table:

    • In a table with 10,000 rows, each page might hold 100 rows, so the table spans about 100 pages.
    • To find a specific row, the database may need to scan all of those pages unless an index is used.
  3. Example Query:

SELECT * FROM employees WHERE employee_id = 123;

  • If the table doesn’t have an index, the database scans all pages sequentially to locate the row.
  • If the table has an index on employee_id, the index tells the database which page contains the row, reducing the number of pages accessed.
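One way to watch a database switch from a scan to an index lookup is to ask it for its query plan. The sketch below uses SQLite through Python's sqlite3 module as a stand-in; the table, the employee_id column name, and the idx_employee_id index name are assumptions, and the exact plan wording differs between SQLite versions and between databases.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employees (employee_id INTEGER, name TEXT)")
conn.executemany(
    "INSERT INTO employees VALUES (?, ?)",
    [(i, f"emp-{i}") for i in range(1, 1001)],
)

QUERY = "SELECT * FROM employees WHERE employee_id = 123"

def plan(sql):
    # The last column of each EXPLAIN QUERY PLAN row is a readable description.
    return [row[-1] for row in conn.execute("EXPLAIN QUERY PLAN " + sql)]

plan_without_index = plan(QUERY)
conn.execute("CREATE INDEX idx_employee_id ON employees (employee_id)")
plan_with_index = plan(QUERY)

print(plan_without_index)  # describes a full scan of the table
print(plan_with_index)     # describes a search using idx_employee_id
```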

Optimizing Page Usage

  1. Indexes: Indexes allow the database to quickly identify the page containing the desired row, reducing the number of I/O operations.
  2. Minimizing Data Fetch: Avoid using SELECT * unless necessary. Fetch only the columns you need to reduce the size of data transferred.
  3. Efficient Page Layout: Use appropriate data types to ensure rows are compact, allowing more rows to fit in each page.

Imagine a table with 1,000,000 rows stored across 10,000 pages. If a query without an index scans every page, it reads up to 10,000 pages. With a well-designed index, the same lookup typically needs only a few page reads (a handful of index pages plus one heap page), and often just 1 or 2 physical I/O operations when the upper levels of the index are already cached in memory.
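The arithmetic behind that comparison can be sketched as follows. The fanout of 100 keys per index page is an assumption; real values depend on key size and page size, and caching of the upper index levels usually lowers the number of physical reads further.

```python
def btree_height(num_keys: int, fanout: int) -> int:
    """Smallest number of levels such that fanout ** levels >= num_keys."""
    height, capacity = 0, 1
    while capacity < num_keys:
        capacity *= fanout
        height += 1
    return height

TOTAL_ROWS = 1_000_000
TOTAL_PAGES = 10_000
FANOUT = 100  # assumed number of keys per index page

full_scan_reads = TOTAL_PAGES
# One read per index level, plus one read for the heap page holding the row.
index_reads = btree_height(TOTAL_ROWS, FANOUT) + 1

print(full_scan_reads)  # 10000
print(index_reads)      # 4
```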

By understanding how pages work, you can design better queries and optimize storage layouts to make your database interactions faster and more efficient.

What is a Heap?

A heap is the structure that many databases (PostgreSQL, for example) use by default to store the contents of a table. It is essentially an unordered collection of pages where the table's data is written.

Characteristics of Heap Storage

  1. Unordered:
    • Data in the heap is not stored in any particular order. New rows are simply appended to available space in pages.
    • This makes heaps straightforward to manage for writes but inefficient for searching.
  2. Complete Data Storage:
    • Every piece of table data is stored in the heap, including all rows and columns.
    • This means the heap holds all the table's information in its raw form.
  3. Page-Based Storage:
    • The heap is divided into fixed-size pages (e.g., 8KB in PostgreSQL).
    • Each page contains multiple rows, depending on the size of the rows and the database configuration.

Why Are Heaps Still Useful?

Despite being unordered, heaps have advantages:

  1. Fast Writes:
    • Adding new rows to a heap is quick because the database doesn’t need to maintain any order. The new row is simply written to the next available space in a page.
  2. Simple Design:
    • The heap's simplicity makes it the default storage method for many databases when no specific ordering or indexing is required.
  3. Flexibility:
    • Heaps are a good choice for tables that experience frequent inserts, updates, and deletes, especially when queries don’t need to search for specific rows.
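A toy heap makes the trade-off visible: inserts are cheap appends, while a search has to visit every page. The Heap class and the 3-rows-per-page limit are hypothetical and chosen only to keep the example small.

```python
class Heap:
    def __init__(self, rows_per_page=3):
        self.rows_per_page = rows_per_page
        self.pages = [[]]

    def insert(self, row):
        # Append to the last page; start a new page when it is full.
        if len(self.pages[-1]) >= self.rows_per_page:
            self.pages.append([])
        self.pages[-1].append(row)
        # Return the row's location as (page number, slot number).
        return len(self.pages) - 1, len(self.pages[-1]) - 1

    def find(self, predicate):
        # There is no ordering to exploit, so every page is visited.
        matches, pages_scanned = [], 0
        for page in self.pages:
            pages_scanned += 1
            matches.extend(row for row in page if predicate(row))
        return matches, pages_scanned

heap = Heap()
for emp_id in [7, 3, 9, 1, 4, 8, 2]:
    location = heap.insert({"employee_id": emp_id})

print(location)  # (2, 0)
matches, scanned = heap.find(lambda row: row["employee_id"] == 3)
print(matches, scanned)  # [{'employee_id': 3}] 3
```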

What Are Indexes?

An index in a database is like the index at the back of a book. Instead of flipping through every page to find a specific topic, you look it up in the index to get the exact page number. Similarly, database indexes help locate data in the heap without scanning every row.

Why Use Indexes?

  • Faster Lookups: They provide a direct pointer to the data, reducing search time.
  • Reduced I/O: By narrowing down the search to specific pages, indexes minimize the number of disk reads (I/O operations).

How Indexes Work

  1. Creating the Map:

    • When you create an index on a column (e.g., Employee ID), the database organizes the values in that column in a separate data structure, typically a B-tree.
    • Each entry in the index includes:
      • The column value (e.g., Employee ID = 123).
      • A pointer to the corresponding row or page in the heap where the full data is stored.
  2. Searching with an Index:

    • When you query a column with an index:
      • The database looks up the column value in the index.
      • It retrieves the pointer (e.g., the page and row ID) from the index.
      • It directly accesses the relevant page in the heap to fetch the data.

Example: Suppose you have the following table:

| Employee ID | Name  | Department | Hire Date  |
|-------------|-------|------------|------------|
| 1           | Alice | HR         | 2022-05-01 |
| 2           | Bob   | IT         | 2021-07-15 |
| 10000       | Carol | Finance    | 2020-10-10 |

If you query SELECT * FROM employees WHERE employee_id = 10000, the index helps the database:

  • Skip scanning rows 1 and 2.
  • Directly locate the page containing Employee ID = 10000.
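The lookup steps above can be sketched with a sorted list standing in for the index (a real database would typically use a B-tree). The heap_pages layout and the (page, slot) pointer format are assumptions for illustration.

```python
import bisect

# Hypothetical heap: page number -> rows stored on that page, in no order.
heap_pages = {
    0: [(10000, "Carol"), (2, "Bob")],
    1: [(1, "Alice")],
}

# Build the index: sorted (key, (page_no, slot_no)) entries.
index = sorted(
    (row[0], (page_no, slot_no))
    for page_no, rows in heap_pages.items()
    for slot_no, row in enumerate(rows)
)
keys = [key for key, _ in index]

def lookup(employee_id):
    # Binary search over the sorted keys instead of scanning the heap.
    pos = bisect.bisect_left(keys, employee_id)
    if pos == len(keys) or keys[pos] != employee_id:
        return None
    page_no, slot_no = index[pos][1]
    return heap_pages[page_no][slot_no]  # a single page access

print(lookup(10000))  # (10000, 'Carol')
print(lookup(42))     # None
```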

Types of Indexes

  1. Clustered Indexes

The clustered index determines the physical order of rows on disk. The table itself is organized based on the clustered index. There can only be one clustered index per table because the physical order of rows can follow only one sequence.

How It Works: In a table with a clustered index on Employee ID, the rows are stored in the order of Employee ID values. Example: If rows are inserted with Employee IDs 10, 5, 1, the database stores them physically in the order 1, 5, 10.

Advantages:

  • Faster range queries (e.g., WHERE Employee ID BETWEEN 1 AND 100) because data is stored in the same order as the index.
  • Saves I/O operations since fewer pages need to be fetched.

MySQL's InnoDB storage engine uses the primary key as the clustered index by default.
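To see why key order helps range queries, the sketch below counts how many pages contain matching rows under two layouts of the same data. The 4-rows-per-page figure and the insertion order are made up for illustration.

```python
ROWS_PER_PAGE = 4  # assumed for illustration

def paginate(ids):
    return [ids[i:i + ROWS_PER_PAGE] for i in range(0, len(ids), ROWS_PER_PAGE)]

def pages_touched(pages, low, high):
    # Count the pages holding at least one key in the range [low, high].
    return sum(1 for page in pages if any(low <= key <= high for key in page))

insert_order = [10, 5, 1, 12, 7, 3, 9, 2, 11, 6, 4, 8]

unordered_pages = paginate(insert_order)          # rows kept in insertion order
clustered_pages = paginate(sorted(insert_order))  # rows kept in key order

print(pages_touched(unordered_pages, 1, 4))  # 3
print(pages_touched(clustered_pages, 1, 4))  # 1
```

With the rows in key order, the four matching rows sit together on one page; in insertion order they are scattered across all three.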

  2. Secondary Indexes

A secondary index is an additional structure that exists independently of the heap's physical order. It points to the rows in the heap using Row IDs.

How It Works: Secondary indexes store:

  • The indexed column value (e.g., Name = Bob).
  • A pointer to the heap's Row ID (or page number).

The heap remains unordered, but the secondary index provides a quick way to locate data.

Advantages:

  • Useful for lookups on non-primary key columns (e.g., searching by Name or Hire Date).

Disadvantages:

  • Updates or inserts can be slower because secondary indexes need to be updated when the table changes.
  • More disk space is required to store additional index structures.
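A minimal sketch of a secondary index on Name, assuming a list as the heap and a dictionary as the index (both hypothetical simplifications), shows where the extra write and the extra storage come from.

```python
from collections import defaultdict

heap = []                       # row id = position in this list
name_index = defaultdict(list)  # secondary index: name -> row ids

def insert(employee_id, name):
    heap.append((employee_id, name))
    row_id = len(heap) - 1
    # Extra work on every insert: the index must be kept in sync.
    name_index[name].append(row_id)
    return row_id

def find_by_name(name):
    # Follow the row ids from the index straight to the heap rows.
    return [heap[row_id] for row_id in name_index.get(name, [])]

insert(1, "Alice")
insert(2, "Bob")
insert(3, "Bob")

print(find_by_name("Bob"))  # [(2, 'Bob'), (3, 'Bob')]
print(find_by_name("Zed"))  # []
```

The heap keeps its insertion order, while name_index is a second structure that has to be stored and updated alongside it.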

Clustered vs. Secondary Indexes

| Feature        | Clustered Index                            | Secondary Index                      |
|----------------|--------------------------------------------|--------------------------------------|
| Order of Data  | Physically reorganizes the table           | Does not affect the table's order    |
| Number Allowed | One per table                              | Multiple per table                   |
| Use Case       | Primary key or main sorting column         | Non-primary key lookups              |
| Performance    | Faster for range queries                   | Slower than clustered for large scans |
| Maintenance    | Changes affect the table's physical layout | Changes update only the index        |

How B-Trees Support Indexes

The B-tree is the most commonly used data structure for database indexes because it balances speed and space efficiency.

Key Features of B-Trees:

  1. Hierarchical Structure:
    • Data is stored in a tree with a root node, intermediate nodes, and leaf nodes.
    • Each node contains a sorted list of key values (e.g., Employee IDs) and pointers to child nodes.
  2. Fast Search:
    • To locate Employee ID = 10,000, the database traverses the tree:
      • Compares 10,000 with the values in the root node.
      • Follows the pointer to the appropriate child node.
      • Repeats until the value is found in a leaf node.
  3. Balanced Tree:
    • The tree is balanced, meaning all leaf nodes are at the same depth.
    • This ensures consistent performance for lookups, even as the dataset grows.
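The traversal described above can be sketched on a tiny hand-built tree. The Node class, the two-level tree, and the "page/slot" pointer strings are all hypothetical; a real B-tree also handles inserts, splits, and rebalancing, which are left out here.

```python
import bisect
from dataclasses import dataclass, field

@dataclass
class Node:
    keys: list
    children: list = field(default_factory=list)  # empty for leaf nodes
    pointers: list = field(default_factory=list)  # leaf only: row locations

def search(node, key):
    levels = 1
    while node.children:
        # Keys >= a separator live in the child to its right.
        node = node.children[bisect.bisect_right(node.keys, key)]
        levels += 1
    pos = bisect.bisect_left(node.keys, key)
    if pos < len(node.keys) and node.keys[pos] == key:
        return node.pointers[pos], levels
    return None, levels

leaf1 = Node(keys=[1, 2], pointers=["page0/slot0", "page0/slot1"])
leaf2 = Node(keys=[500, 900], pointers=["page4/slot2", "page7/slot0"])
leaf3 = Node(keys=[10000, 12000], pointers=["page90/slot3", "page95/slot1"])
root = Node(keys=[500, 10000], children=[leaf1, leaf2, leaf3])

print(search(root, 10000))  # ('page90/slot3', 2)
print(search(root, 42))     # (None, 2)
```

Both searches visit the same number of levels, whether or not the key exists, which is the consistency that a balanced tree provides.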

The Cost of Indexes

While indexes improve query speed, they come with trade-offs:

  1. Storage Overhead:
    • Indexes consume additional disk space. More indexes mean higher storage requirements.
  2. Update Overhead:
    • Inserts, updates, and deletes require the index to be updated, which can slow down write operations.
  3. Index Selection:
    • Choosing the wrong columns for indexing can result in poor performance and wasted resources.
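The storage overhead can be observed directly by counting pages before and after creating an index. This sketch again uses SQLite via Python's sqlite3 module as a stand-in; the table, the row count, and the idx_name index are assumptions, and the actual page counts depend on the SQLite build and page size.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employees (employee_id INTEGER, name TEXT)")
conn.executemany(
    "INSERT INTO employees VALUES (?, ?)",
    [(i, f"employee-{i}") for i in range(5000)],
)
conn.commit()

def page_count():
    # Total number of pages currently used by the database.
    return conn.execute("PRAGMA page_count").fetchone()[0]

pages_before = page_count()
conn.execute("CREATE INDEX idx_name ON employees (name)")
conn.commit()
pages_after = page_count()

print(pages_before, pages_after)
```

The second number should be larger than the first: the index is a separate structure that occupies its own pages, and every later insert has to update it too.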

Summary

In this blog, we covered how tables, pages, and indexes work in a database. Understanding these fundamentals helps backend engineers write efficient queries.