Accessing data stored on a disk drive takes on average three orders of magnitude longer than accessing data stored in main memory. It is therefore not surprising that the time for moving blocks of data from the disk to main memory usually dominates query execution time in DBMSs [20].

The storage manager is a component of a
database that determines how data is written on and read from the disk drive.
We define its primary purpose to be the *strategic placement of data on the
disk drive such that the average query execution time can be minimal.* Of
course, execution time depends on other factors as well. Without prejudice to
its primary purpose, the storage manager shall *store data as compactly as
possible on the disk*.

In Section 2.2, we presented the mechanic properties of disk drives and showed how the time to service a read or write request can be broken down into data access time and data transfer time. Ignoring head and track switch time and multiple zone recording, data transfer time is just a constant determined by the rotational speed of the disk and the number of sectors per track. That leaves the storage manager with the optimization of data access time; that is, seek time and rotational latency.

To minimize seek time, data items that are frequently accessed together should be placed in the same cylinder, or in cylinders that are as close as possible.

Data placement for minimizing rotational latency is more complex. In general, data items that are frequently accessed together should be placed in blocks that are as close as possible on the circumference. If the items are placed in the same cylinder in different tracks, offsets for head switch time must be added. If the items are placed in different cylinders, offsets for seek time or track switch time must be added. Figure 5 illustrates the complexity of finding rotational latency-minimizing placements. It shows several locations of data item that, given the placement of data item and assuming that is accessed after , all may yield zero rotational latency.

Ideally, the storage manager analyzes the disk drive and derives a cost model to evaluate different placements of sets of data items based on access time. Sets of data items that are frequently accessed together are assigned a higher weight in the model.

In practice, storage managers in modern database systems typically do not tackle the specific placement in cylinders, surfaces, and tracks. Instead, they implicitly assume a model of linear storage based on logical block addresses (see Section 2.2 and especially Figure 2). Data that is frequently accessed together is placed either in the same block or in blocks that are as close as possible by their disk addresses. While this might not always yield an optimal data placement, it greatly reduces complexity. The model is hardware-independent, as the mapping of sectors to logical block addresses is supplied by the disk drive.

The storage manager should utilize existing information to find an optimal placement. In a relational database, for instance, all records in a table could be considered items that should be stored together. For several types of data, the storage manager may be able to identify deeper structures. Integrity constraints and foreign key relationships can contain hints.

Graph data by its nature is highly structured and has abundant potential for storage optimization. The most basic desideratum for a storage manager for graph data should be to

(1) store vertices together with incident edges.

We have seen in Sections 3.2 and 4.2 that graph algorithms frequently access the labels and edges of a vertex together. A storage manager for graph data should therefore store them as close as possible. As close as possible means in the same block. If it is not possible to store them in the same block (e.g., because the vertex has more edges than can fit in a block), they should be stored in blocks that are adjacent by their disk addresses.

Henceforth, we take storing a vertex to mean storing its labels and edges together. We use to denote the number of bytes that the encoding of an arbitrary vertex uses on the disk, and we use to denote the number of bytes in a block that can hold vertex data.

At the beginning of this section, we proposed that, without prejudice to the overriding objective of finding an optimal data placement, the storage manager shall place data as compactly on the disk as possible. Applied to desideratum (1), the storage manager may store several vertices in the same block, but it may not break up a vertex into two blocks if .

In Section 4.1.1, we argued that vertex labels and edges should be represented in separate tables in a relational database. Here, we are thinking about how the two tables should be stored on the disk, which is a decision of the storage manager and hidden from the user. Desideratum (1) suggests that the edges table should be stored grouped on column a and interleaved with the vertices table. This may adversely affect the execution time of simple SELECT queries over one table, but it can increase the execution speed of queries that involve graph navigation.

Graph navigation is at the heart of most graph algorithms. The second desideratum is therefore to

(2) store vertices together with vertices in their neighborhood.

For most graphs, it will not be possible to place each vertex together with its neighbors in the same block or in adjacent blocks. An optimal placement is then one in which as many vertices as possible are stored as close as possible. Let us define an optimization problem:

Let be the graph that we are trying to store on the disk. If contains unordered pairs of vertices, convert them to symmetric directed pairs. Let us assume that . Moreover, let us assume that the disk has enough free space to store the entire graph in consecutive blocks, and let these blocks be numbered 0, 1, …. Let ) be a function that maps the vertices of to these blocks. We use to denote the set , and we use to denote the sum .

We propose that an optimal placement by desideratum (2) solves the following problem:

Put differently, let the cost of a particular placement be the sum of the edges that cross block boundaries multiplied by the distance between blocks. An optimal placement by desideratum (2) minimizes this cost.

It is easy to see that if is connected, can only be minimal if maps to consecutive blocks. Even if contains multiple components, some queries (e.g., SELECT over ) benefit if the components are stored in groups of blocks adjacent to each other. Let us therefore redefine to be a surjective function , where is not known.

Our optimization problem is related to a
problem that first appeared in [9], known as the *minimum linear arrangement
(MinLA) problem*. For a given graph ,
the MinLA problem is to find a bijective function that
minimizes cost .

The MinLA problem has been proved to be NP-hard, and its decision problem to be NP-complete [8]. The MinLA problem would be equivalent to our optimization problem if we knew which vertices to put together in blocks, but not the order in which to put the blocks on the disk.

Several heuristic algorithms have been proposed that find near optimal solutions for MinLA problems. Among the most successful are spectral sequencing [11], multilevel-based algorithms [16, 22], a divide-and-conquer algorithm [2], and simulated annealing [19, 18]. Petit [19] compiled a suite of experiments with several of these algorithms. For the largest graph with 10,240 vertices and 30,380 edges, execution times varied between 4 seconds (spectral sequencing) and 12.83 hours (simulated annealing).

MinLA algorithms can be applied to our placement problem in two ways. Let be the graph that we are trying to store on the disk:

[A] Create a list of possible mappings from to consecutive blocks. Convert each mapping to a graph: Each block is a vertex and each edge is an edge between the blocks that and are mapped to. The graphs may contain loops and parallel edges.

Solve the MinLA problem for each graph. The mapping for the graph with the overall lowest , together with its ordering function define a function that solves our placement problem.

[B] Solve the MinLA problem for . Set two integer variables and to 0. While do the following:

• Set to be the vertex with .

• If , increment .

• Set and increment .

Suppose the MinLA algorithm was capable of finding an optimal linear arrangement for every graph. Then approach [A] was guaranteed to find a function that solved our placement problem. Unfortunately, the number of possible mappings of to blocks increases exponentially with . A modern computer could probably not even enumerate all mappings for a large graph. Approach [A] is therefore not useful for a storage manager.

Approach [B] is not guaranteed to find an optimal . However, the resulting is likely to be close to optimal. The MinLA algorithm must be capable of dealing with very large graphs since it is applied directly to the input graph.

Figure 6 shows two linear arrangements for
an example graph. The arrangement shown under (a) is the MinLA. The
edge-weighted graph below each arrangement illustrates how the algorithm
described above under [B] would convert the arrangement into a disk
representation. We call such a graph a *block graph*. Each vertex in a
block graph corresponds to one block on the disk. The block graph contains an
edge if , , , . The weight of edge is the cardinality of set . The block graph preserves enough
information to calculate . Notice that the
placement derived from the optimal arrangement
is inferior to the placement shown in (b).

Suppose we could find a placement that solved the minimization problem for . Would it be optimal? assumes that accessing blocks and together takes exactly half of the time to access blocks and together. In practice, deviations from this assumption may be substantial.

Suppose that , , , and . By , this placement is optimal if and cannot be placed in the same block. Assume a graph traversal algorithm that prints the labels of vertices it finds explores . After retrieving block from the disk, it is quite possible that block has just passed under the disk head when the algorithm returns from printing the label of . In the worst case, the algorithm has to wait for a whole rotation of the disk until block passes under the disk head again.

The read-ahead functionality of the disk drive may mitigate the problem. However, read-ahead works only in forward direction. If and , there is no guarantee that block is in the disk’s cache after block has been read. does not differentiate between edge and edge .

Moreover, read-ahead is usually only performed when the disk drive’s request queue is empty. If the algorithm requests several different blocks at the same time, read-ahead data may only be available for the last block read.

A storage manager that makes placement decisions based on should implement its own buffer and buffer management. Whenever a block is read, a uniform number of blocks before and after block should be retrieved from the disk.

Instead of relying only on caching and buffering, we suggest refining the placement. The third and fourth desiderata for a storage manager for graph data are to

(3) minimize the total edge weight in the block graph, and to

(4) minimize the number of edges in the block graph.

Again, we can express the desiderata as optimization problems over and :

Notice the similarity between and . penalizes edges that join vertices in different blocks by the distance between blocks. penalizes such edges uniformly. Put differently, rewards vertices that are stored close to most of their neighbors, while rewards vertices that are stored in the same block with most of their neighbors.

Minimizing tends to leave short edges between blocks, minimizing tends to create clusters of vertices in the blocks. Clusters are useful, as they may enable graph algorithms to run several iterations based on the data in just a few blocks. Especially DFS algorithms benefit, as they need only a single unexplored neighbor in each iteration to continue.

Minimizing channels the edges in to as little edges in the block graph as possible. The function ignores the number of edges in that span across two blocks. The reasoning is the following: Reading a block from the disk takes a certain amount of time, regardless of how many vertices in the block are needed in the graph algorithm. It is therefore sensible to try to reduce the number of blocks that the neighbors of the vertices in a block are stored in.

Minimizing is
a special case of the *graph partitioning problem*, a fundamental and
well-researched problem in graph theory. Given a graph ,
its objective is to divide into disjoint subsets, ,
such that the number of edges that join vertices from different partitions is
minimal, and , . and are given. Each set is
called a *partition*.

The graph partitioning problem has been formulated in several versions. A version for edge-weighted graphs minimizes not the number but the total weight of edges that cross partitions. A version for vertex-weighted graphs uses the following as the balancing constraint: , where is the weight of vertex . In another version, the maximum partition size can be specified per partition. In [15], the problem is extended for graphs where each vertex has multiple weights.

The problem is relevant to many applications, including parallel scientific computing, task scheduling, and VLSI design. Graph partitioning is used, for instance, in scientific simulations, where computation is performed iteratively on each element of a physical 2D or 3D mesh model (a model of an airfoil, e.g.), to map the mesh elements to processors such that the load is roughly equal and inter-processor communication is minimized.

The graph partitioning problem is NP-complete [8], but because of its relevance, a large number of heuristic algorithms have been developed. A detailed survey with examples can be found in [23].

The graph partitioning problem has traditionally been approached with recursive bisection. That is, by first dividing into two partitions, then dividing each of these partitions into two partitions, and so on. In the mid-90s, Karypis and Kumar [13] proposed a multilevel algorithm that achieved a -way partitioning in one run. This has since become the second standard approach to graph partitioning.

Because most of the proposed algorithms are complex, several authors have published open source ready-to-use tools that partition arbitrary graphs. The two most widely used tools are Chaco version 2.0 (Hendrickson and Leland, published 1995, approx. 30,000 lines of C code [10]) and Metis version 4.0 (Karypis and Kumar, published 1998, approx. 22,000 lines of C code [14]).

The optimization problem for is not a partitioning problem because , the number of blocks in the disk representation, is not known. To use an existing partitioning algorithm, has to be guessed. Given , a partitioning algorithm for vertex-weighted graphs can be used. The weight of each vertex is given by . To get on the right side of the balancing constraint, parameter has to be set to . must be chosen such that is non-negative. Choosing such that means that the algorithm cannot leave any block with free space. This might not be possible. In general, several different values for should be tried.

Unfortunately, most partitioning algorithms have been designed for a relatively small number of partitions. Chaco implements only , , and . Metis (algorithm kmetis, default parameters) has a memory complexity of , which makes it impracticable to get a result for large . For reference, a database with 5000 blocks at 16 KB per block can merely hold 80 MB of data.

Finally, existing tools should always be used with care. Metis, for instance, has several cases based on hard-coded parameters built in that allow individual partitions to violate the balancing constraint; that is, grow too large. For our purpose, the balancing constraint has to be strict.

We have yet to tackle the optimization problem for . Both Chaco and Metis implement procedures that try to reduce the number of partitions that each partition is linked to. Unfortunately, these features are not well documented. However, it shows that the minimization problems for and can be addressed together.

Figure 7 shows three placements of an example graph with 16 vertices into disk blocks. It shows the block graph for each placement, as well as the cost according to , , and . Boldface indicates the lowest value for each function.

The example graph is undirected. For simplicity, the block graphs are shown as undirected graphs as well. To determine the values of the cost functions, each edge is treated as two symmetric directed edges with the same weight.

Which of the three placements is preferred? And how does it relate to functions , , and ? These questions are difficult and we cannot answer them definitively. Which placement is preferred depends on a number of factors, including the type of query that is most often run, the characteristics of the graph, and the available buffer memory. We might prefer alternative (b) for BFS queries because regardless of which vertex is the source vertex, there will only be two disk accesses: One to retrieve the block in which the source vertex is stored, and one to retrieve all remaining blocks. We might prefer alternative (c) if block 2 can be kept in the buffer permanently.

A storage manager should combine the cost functions in an intelligent way. Unfortunately, , , and are based on different scales and difficult to compare. always holds. The next chapter describes how G-Store uses the functions.