Accessing data stored on a disk drive takes on average three orders of magnitude longer than accessing data stored in main memory. It is therefore not surprising that the time for moving blocks of data from the disk to main memory usually dominates query execution time in DBMSs [20].
The storage manager is a component of a database that determines how data is written on and read from the disk drive. We define its primary purpose to be the strategic placement of data on the disk drive such that the average query execution time can be minimal. Of course, execution time depends on other factors as well. Without prejudice to its primary purpose, the storage manager shall store data as compactly as possible on the disk.
In Section 2.2, we presented the mechanic properties of disk drives and showed how the time to service a read or write request can be broken down into data access time and data transfer time. Ignoring head and track switch time and multiple zone recording, data transfer time is just a constant determined by the rotational speed of the disk and the number of sectors per track. That leaves the storage manager with the optimization of data access time; that is, seek time and rotational latency.
To minimize seek time, data items that are frequently accessed together should be placed in the same cylinder, or in cylinders that are as close as possible.
Data placement for minimizing rotational
latency is more complex. In general, data items that are frequently accessed
together should be placed in blocks that are as close as possible on the
circumference. If the items are placed in the same cylinder in different
tracks, offsets for head switch time must be added. If the items are placed in
different cylinders, offsets for seek time or track switch time must be added.
Figure 5 illustrates the complexity of finding rotational latency-minimizing
placements. It shows several locations of data item  that, given the placement of data item
 that, given the placement of data item  and assuming that
 and assuming that  is accessed after
 is accessed after  , all may yield zero
rotational latency.
, all may yield zero
rotational latency.

Ideally, the storage manager analyzes the disk drive and derives a cost model to evaluate different placements of sets of data items based on access time. Sets of data items that are frequently accessed together are assigned a higher weight in the model.
In practice, storage managers in modern database systems typically do not tackle the specific placement in cylinders, surfaces, and tracks. Instead, they implicitly assume a model of linear storage based on logical block addresses (see Section 2.2 and especially Figure 2). Data that is frequently accessed together is placed either in the same block or in blocks that are as close as possible by their disk addresses. While this might not always yield an optimal data placement, it greatly reduces complexity. The model is hardware-independent, as the mapping of sectors to logical block addresses is supplied by the disk drive.
The storage manager should utilize existing information to find an optimal placement. In a relational database, for instance, all records in a table could be considered items that should be stored together. For several types of data, the storage manager may be able to identify deeper structures. Integrity constraints and foreign key relationships can contain hints.
Graph data by its nature is highly structured and has abundant potential for storage optimization. The most basic desideratum for a storage manager for graph data should be to
(1) store vertices together with incident edges.
We have seen in Sections 3.2 and 4.2 that graph algorithms frequently access the labels and edges of a vertex together. A storage manager for graph data should therefore store them as close as possible. As close as possible means in the same block. If it is not possible to store them in the same block (e.g., because the vertex has more edges than can fit in a block), they should be stored in blocks that are adjacent by their disk addresses.
Henceforth, we take storing a vertex to
mean storing its labels and edges together. We use  to
denote the number of bytes that the encoding of an arbitrary vertex
 to
denote the number of bytes that the encoding of an arbitrary vertex  uses on the disk, and we use
 uses on the disk, and we use
 to denote the number of bytes in a
block that can hold vertex data.
 to denote the number of bytes in a
block that can hold vertex data.
At the beginning of this section, we
proposed that, without prejudice to the overriding objective of finding an optimal
data placement, the storage manager shall place data as compactly on the disk
as possible. Applied to desideratum (1), the storage manager may store several
vertices in the same block, but it may not break up a vertex  into two blocks if
 into two blocks if  .
.
In Section 4.1.1, we argued that vertex labels and edges should be represented in separate tables in a relational database. Here, we are thinking about how the two tables should be stored on the disk, which is a decision of the storage manager and hidden from the user. Desideratum (1) suggests that the edges table should be stored grouped on column a and interleaved with the vertices table. This may adversely affect the execution time of simple SELECT queries over one table, but it can increase the execution speed of queries that involve graph navigation.
Graph navigation is at the heart of most graph algorithms. The second desideratum is therefore to
(2) store vertices together with vertices in their neighborhood.
For most graphs, it will not be possible to place each vertex together with its neighbors in the same block or in adjacent blocks. An optimal placement is then one in which as many vertices as possible are stored as close as possible. Let us define an optimization problem:
Let  be
the graph that we are trying to store on the disk. If
 be
the graph that we are trying to store on the disk. If  contains unordered pairs of vertices, convert
them to symmetric directed pairs. Let us assume that
 contains unordered pairs of vertices, convert
them to symmetric directed pairs. Let us assume that  .
Moreover, let us assume that the disk has enough free space to store the entire
graph in consecutive blocks, and let these blocks be numbered 0, 1, …. Let
.
Moreover, let us assume that the disk has enough free space to store the entire
graph in consecutive blocks, and let these blocks be numbered 0, 1, …. Let  ) be a function that maps the vertices
of
) be a function that maps the vertices
of  to these blocks. We
use
 to these blocks. We
use  to denote the set
 to denote the set  , and we use
, and we use  to
denote the sum
 to
denote the sum  .
.
We propose that an optimal placement by desideratum (2) solves the following problem:

Put differently, let the cost of a particular placement be the sum of the edges that cross block boundaries multiplied by the distance between blocks. An optimal placement by desideratum (2) minimizes this cost.
It is easy to see that if  is connected,
 is connected,  can only be minimal if
 can only be minimal if  maps
 maps  to consecutive blocks. Even if
 to consecutive blocks. Even if  contains multiple components, some
queries (e.g., SELECT over
 contains multiple components, some
queries (e.g., SELECT over  ) benefit if the components are stored in
groups of blocks adjacent to each other. Let us therefore redefine
) benefit if the components are stored in
groups of blocks adjacent to each other. Let us therefore redefine  to be a surjective function
 to be a surjective function  , where
, where  is not known.
 is not known.
Our optimization problem is related to a
problem that first appeared in [9], known as the minimum linear arrangement
(MinLA) problem. For a given graph  ,
the MinLA problem is to find a bijective function
,
the MinLA problem is to find a bijective function  that
minimizes cost
 that
minimizes cost  .
.
The MinLA problem has been proved to be NP-hard, and its decision problem to be NP-complete [8]. The MinLA problem would be equivalent to our optimization problem if we knew which vertices to put together in blocks, but not the order in which to put the blocks on the disk.
Several heuristic algorithms have been proposed that find near optimal solutions for MinLA problems. Among the most successful are spectral sequencing [11], multilevel-based algorithms [16, 22], a divide-and-conquer algorithm [2], and simulated annealing [19, 18]. Petit [19] compiled a suite of experiments with several of these algorithms. For the largest graph with 10,240 vertices and 30,380 edges, execution times varied between 4 seconds (spectral sequencing) and 12.83 hours (simulated annealing).
MinLA algorithms can be applied to our
placement problem in two ways. Let  be the graph that
we are trying to store on the disk:
 be the graph that
we are trying to store on the disk: 
[A]  Create a list of possible mappings
from  to consecutive
blocks. Convert each mapping to a graph: Each block is a vertex and each edge
 to consecutive
blocks. Convert each mapping to a graph: Each block is a vertex and each edge  is an edge between the blocks that
 is an edge between the blocks that  and
 and  are mapped to. The graphs may contain loops
and parallel edges.
 are mapped to. The graphs may contain loops
and parallel edges.
Solve the MinLA
problem for each graph. The mapping for the graph with the overall lowest  , together with its ordering function
, together with its ordering function  define a function
 define a function  that solves our placement problem.
 that solves our placement problem.
[B]  Solve the MinLA problem for  . Set two integer variables
. Set two integer variables  and
 and  to
0. While
 to
0. While  do the following:
 do the following: 
•    Set  to be the vertex with
 to be the vertex with  .
.
•    If  , increment
, increment  .
. 
•    Set  and increment
 and increment  .
. 
Suppose the MinLA algorithm was capable of
finding an optimal linear arrangement for every graph. Then approach [A] was
guaranteed to find a function  that solved our
placement problem. Unfortunately, the number of possible mappings of
 that solved our
placement problem. Unfortunately, the number of possible mappings of  to blocks increases
exponentially with
 to blocks increases
exponentially with  . A modern computer could
probably not even enumerate all mappings for a large graph. Approach [A] is
therefore not useful for a storage manager.
. A modern computer could
probably not even enumerate all mappings for a large graph. Approach [A] is
therefore not useful for a storage manager.
Approach [B] is not guaranteed to find an
optimal  . However, the resulting
. However, the resulting  is likely to be close to optimal. The
MinLA algorithm must be capable of dealing with very large graphs since it is
applied directly to the input graph.
 is likely to be close to optimal. The
MinLA algorithm must be capable of dealing with very large graphs since it is
applied directly to the input graph.
Figure 6 shows two linear arrangements for
an example graph. The arrangement shown under (a) is the MinLA. The
edge-weighted graph below each arrangement illustrates how the algorithm
described above under [B] would convert the arrangement into a disk
representation. We call such a graph a block graph. Each vertex in a
block graph corresponds to one block on the disk. The block graph contains an
edge  if
 if  ,
,  ,
,  ,
,  . The weight of edge
. The weight of edge  is the cardinality of set
 is the cardinality of set  . The block graph preserves enough
information to calculate
. The block graph preserves enough
information to calculate  . Notice that the
placement derived from the
. Notice that the
placement derived from the  optimal arrangement
is inferior to the placement shown in (b).
 optimal arrangement
is inferior to the placement shown in (b). 

Suppose we could find a placement that
solved the minimization problem for  . Would it be
optimal?
. Would it be
optimal?   assumes that accessing blocks
 assumes that accessing blocks  and
 and  together
takes exactly half of the time to access blocks
 together
takes exactly half of the time to access blocks  and
 and  together.
In practice, deviations from this assumption may be substantial.
 together.
In practice, deviations from this assumption may be substantial.
Suppose that  ,
,  ,
,  ,
and
,
and  . By
. By  ,
this placement is optimal if
,
this placement is optimal if  and
 and  cannot be placed in the same block. Assume a
graph traversal algorithm that prints the labels of vertices it finds explores
 cannot be placed in the same block. Assume a
graph traversal algorithm that prints the labels of vertices it finds explores  . After retrieving block
. After retrieving block  from the disk, it is quite
possible that block
 from the disk, it is quite
possible that block  has
just passed under the disk head when the algorithm returns from printing the
label of
 has
just passed under the disk head when the algorithm returns from printing the
label of  . In the worst
case, the algorithm has to wait for a whole rotation of the disk until block
. In the worst
case, the algorithm has to wait for a whole rotation of the disk until block  passes under the disk head
again.
 passes under the disk head
again.
The read-ahead functionality of the disk
drive may mitigate the problem. However, read-ahead works only in forward
direction. If  and
 and  ,
there is no guarantee that block
,
there is no guarantee that block  is in the disk’s cache after block
 is in the disk’s cache after block  has been read.
 has been read.  does not differentiate between edge
 does not differentiate between edge  and edge
 and edge  .
. 
Moreover, read-ahead is usually only performed when the disk drive’s request queue is empty. If the algorithm requests several different blocks at the same time, read-ahead data may only be available for the last block read.
A storage manager that makes placement
decisions based on  should implement its own
buffer and buffer management. Whenever a block
 should implement its own
buffer and buffer management. Whenever a block  is read, a uniform number of blocks before
and after block
 is read, a uniform number of blocks before
and after block  should
be retrieved from the disk.
 should
be retrieved from the disk.
Instead of relying only on caching and buffering, we suggest refining the placement. The third and fourth desiderata for a storage manager for graph data are to
(3) minimize the total edge weight in the block graph, and to
(4) minimize the number of edges in the block graph.
Again, we can express the desiderata as
optimization problems over  and
 and  :
:

Notice the similarity between  and
 and  .
.  penalizes edges that join vertices in
different blocks by the distance between blocks.
 penalizes edges that join vertices in
different blocks by the distance between blocks.  penalizes
such edges uniformly. Put differently,
 penalizes
such edges uniformly. Put differently,  rewards
vertices that are stored close to most of their neighbors, while
 rewards
vertices that are stored close to most of their neighbors, while  rewards vertices that are stored in
the same block with most of their neighbors.
 rewards vertices that are stored in
the same block with most of their neighbors. 
Minimizing  tends
to leave short edges between blocks, minimizing
 tends
to leave short edges between blocks, minimizing  tends
to create clusters of vertices in the blocks. Clusters are useful, as they may
enable graph algorithms to run several iterations based on the data in just a
few blocks. Especially DFS algorithms benefit, as they need only a single
unexplored neighbor in each iteration to continue.
 tends
to create clusters of vertices in the blocks. Clusters are useful, as they may
enable graph algorithms to run several iterations based on the data in just a
few blocks. Especially DFS algorithms benefit, as they need only a single
unexplored neighbor in each iteration to continue.
Minimizing  channels
the edges in
 channels
the edges in  to as
little edges in the block graph as possible. The function ignores the number of
edges in
 to as
little edges in the block graph as possible. The function ignores the number of
edges in  that span
across two blocks. The reasoning is the following: Reading a block from the
disk takes a certain amount of time, regardless of how many vertices in the
block are needed in the graph algorithm. It is therefore sensible to try to
reduce the number of blocks that the neighbors of the vertices in a block are
stored in.
 that span
across two blocks. The reasoning is the following: Reading a block from the
disk takes a certain amount of time, regardless of how many vertices in the
block are needed in the graph algorithm. It is therefore sensible to try to
reduce the number of blocks that the neighbors of the vertices in a block are
stored in.
Minimizing  is
a special case of the graph partitioning problem, a fundamental and
well-researched problem in graph theory. Given a graph
 is
a special case of the graph partitioning problem, a fundamental and
well-researched problem in graph theory. Given a graph  ,
its objective is to divide
,
its objective is to divide  into
 into  disjoint subsets,
 disjoint subsets,  ,
such that the number of edges that join vertices from different partitions is
minimal, and
,
such that the number of edges that join vertices from different partitions is
minimal, and  ,
,  .
.  and
 and  are given. Each set
 are given. Each set  is
called a partition.
 is
called a partition.
The graph partitioning problem has been
formulated in several versions. A version for edge-weighted graphs minimizes
not the number but the total weight of edges that cross partitions. A version
for vertex-weighted graphs uses the following as the balancing constraint:  , where
, where  is
the weight of vertex
 is
the weight of vertex  .
In another version, the maximum partition size can be specified per partition.
In [15], the problem is extended for graphs where each vertex has multiple
weights.
.
In another version, the maximum partition size can be specified per partition.
In [15], the problem is extended for graphs where each vertex has multiple
weights.
The problem is relevant to many applications, including parallel scientific computing, task scheduling, and VLSI design. Graph partitioning is used, for instance, in scientific simulations, where computation is performed iteratively on each element of a physical 2D or 3D mesh model (a model of an airfoil, e.g.), to map the mesh elements to processors such that the load is roughly equal and inter-processor communication is minimized.
The graph partitioning problem is NP-complete [8], but because of its relevance, a large number of heuristic algorithms have been developed. A detailed survey with examples can be found in [23].
The graph partitioning problem has
traditionally been approached with recursive bisection. That is, by first
dividing  into two
partitions, then dividing each of these partitions into two partitions, and so
on. In the mid-90s, Karypis and Kumar [13] proposed a multilevel algorithm that
achieved a
 into two
partitions, then dividing each of these partitions into two partitions, and so
on. In the mid-90s, Karypis and Kumar [13] proposed a multilevel algorithm that
achieved a  -way
partitioning in one run. This has since become the second standard approach to
graph partitioning.
-way
partitioning in one run. This has since become the second standard approach to
graph partitioning.
Because most of the proposed algorithms are complex, several authors have published open source ready-to-use tools that partition arbitrary graphs. The two most widely used tools are Chaco version 2.0 (Hendrickson and Leland, published 1995, approx. 30,000 lines of C code [10]) and Metis version 4.0 (Karypis and Kumar, published 1998, approx. 22,000 lines of C code [14]).
The optimization problem for  is not a partitioning problem because
 is not a partitioning problem because  , the number of blocks in the
disk representation, is not known. To use an existing partitioning algorithm,
, the number of blocks in the
disk representation, is not known. To use an existing partitioning algorithm,  has to be guessed. Given
 has to be guessed. Given  , a partitioning algorithm
for vertex-weighted graphs can be used. The weight of each vertex
, a partitioning algorithm
for vertex-weighted graphs can be used. The weight of each vertex  is given by
 is given by  . To get
. To get  on
the right side of the balancing constraint, parameter
 on
the right side of the balancing constraint, parameter  has to be set to
 has to be set to  .
.  must be chosen such that
 must be chosen such that  is non-negative. Choosing
 is non-negative. Choosing  such that
 such that  means that the algorithm cannot
leave any block with free space. This might not be possible. In general,
several different values for
 means that the algorithm cannot
leave any block with free space. This might not be possible. In general,
several different values for  should be tried.
 should be tried.
Unfortunately, most partitioning algorithms
have been designed for a relatively small number of partitions. Chaco
implements only  ,
,  , and
, and  . Metis (algorithm kmetis, default
parameters) has a memory complexity of
. Metis (algorithm kmetis, default
parameters) has a memory complexity of  ,
which makes it impracticable to get a result for large
,
which makes it impracticable to get a result for large  . For reference, a database with 5000 blocks
at 16 KB per block can merely hold 80 MB of data.
. For reference, a database with 5000 blocks
at 16 KB per block can merely hold 80 MB of data.
Finally, existing tools should always be used with care. Metis, for instance, has several cases based on hard-coded parameters built in that allow individual partitions to violate the balancing constraint; that is, grow too large. For our purpose, the balancing constraint has to be strict.
We have yet to tackle the optimization
problem for  . Both Chaco and Metis implement
procedures that try to reduce the number of partitions that each partition is
linked to. Unfortunately, these features are not well documented. However, it
shows that the minimization problems for
. Both Chaco and Metis implement
procedures that try to reduce the number of partitions that each partition is
linked to. Unfortunately, these features are not well documented. However, it
shows that the minimization problems for  and
 and
 can be addressed together.
 can be addressed together.

Figure 7 shows three placements of an
example graph with 16 vertices into disk blocks. It shows the block graph for
each placement, as well as the cost according to  ,
,  , and
, and  .
Boldface indicates the lowest value for each function.
.
Boldface indicates the lowest value for each function.
The example graph is undirected. For simplicity, the block graphs are shown as undirected graphs as well. To determine the values of the cost functions, each edge is treated as two symmetric directed edges with the same weight.
Which of the three placements is
preferred?  And how does it relate to functions  ,
,  , and
, and  ? 
These questions are difficult and we cannot answer them definitively. Which
placement is preferred depends on a number of factors, including the type of
query that is most often run, the characteristics of the graph, and the
available buffer memory. We might prefer alternative (b) for BFS queries
because regardless of which vertex is the source vertex, there will only be two
disk accesses: One to retrieve the block in which the source vertex is stored,
and one to retrieve all remaining blocks. We might prefer alternative (c) if
block 2 can be kept in the buffer permanently.
? 
These questions are difficult and we cannot answer them definitively. Which
placement is preferred depends on a number of factors, including the type of
query that is most often run, the characteristics of the graph, and the
available buffer memory. We might prefer alternative (b) for BFS queries
because regardless of which vertex is the source vertex, there will only be two
disk accesses: One to retrieve the block in which the source vertex is stored,
and one to retrieve all remaining blocks. We might prefer alternative (c) if
block 2 can be kept in the buffer permanently.
A storage manager should combine the cost
functions in an intelligent way. Unfortunately,  ,
,  , and
, and  are
based on different scales and difficult to compare.
 are
based on different scales and difficult to compare.  always
holds. The next chapter describes how G-Store uses the functions.
 always
holds. The next chapter describes how G-Store uses the functions.