Accessing data stored on a disk drive takes on average three orders of magnitude longer than accessing data stored in main memory. It is therefore not surprising that the time for moving blocks of data from the disk to main memory usually dominates query execution time in DBMSs [20].
The storage manager is the component of a database system that determines how data is written to and read from the disk drive. We define its primary purpose to be the strategic placement of data on the disk such that the average query execution time is minimized. Of course, execution time depends on other factors as well. Without prejudice to its primary purpose, the storage manager shall store data as compactly as possible on the disk.
In Section 2.2, we presented the mechanical properties of disk drives and showed how the time to service a read or write request can be broken down into data access time and data transfer time. Ignoring head and track switch times and multiple zone recording, data transfer time is simply a constant determined by the rotational speed of the disk and the number of sectors per track. That leaves the storage manager with the optimization of data access time, that is, of seek time and rotational latency.
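As a rough summary of this breakdown (the notation here is ours, not that of Section 2.2), the time to service a request for $n$ consecutive sectors on a track with $S$ sectors is approximately
$$T_{\text{service}} \approx T_{\text{seek}} + T_{\text{rot}} + \frac{60}{\text{rpm}} \cdot \frac{n}{S},$$
where only $T_{\text{seek}}$ and $T_{\text{rot}}$ depend on the placement of the data.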
To minimize seek time, data items that are frequently accessed together should be placed in the same cylinder, or in cylinders that are as close as possible.
Data placement for minimizing rotational latency is more complex. In general, data items that are frequently accessed together should be placed in blocks that are as close as possible on the circumference. If the items are placed in the same cylinder but in different tracks, offsets for head switch time must be added. If the items are placed in different cylinders, offsets for seek time or track switch time must be added. Figure 5 illustrates the complexity of finding rotational latency-minimizing placements. It shows several locations for a data item that, given the placement of another data item accessed immediately before it, may all yield zero rotational latency.
Ideally, the storage manager analyzes the disk drive and derives a cost model to evaluate different placements of sets of data items based on access time. Sets of data items that are frequently accessed together are assigned a higher weight in the model.
In practice, storage managers in modern database systems typically do not tackle the specific placement in cylinders, surfaces, and tracks. Instead, they implicitly assume a model of linear storage based on logical block addresses (see Section 2.2 and especially Figure 2). Data that is frequently accessed together is placed either in the same block or in blocks that are as close as possible by their disk addresses. While this might not always yield an optimal data placement, it greatly reduces complexity. The model is hardware-independent, as the mapping of sectors to logical block addresses is supplied by the disk drive.
The storage manager should utilize existing information to find an optimal placement. In a relational database, for instance, all records in a table could be considered items that should be stored together. For several types of data, the storage manager may be able to identify deeper structures. Integrity constraints and foreign key relationships can contain hints.
Graph data by its nature is highly structured and has abundant potential for storage optimization. The most basic desideratum for a storage manager for graph data should be to
(1) store vertices together with incident edges.
We have seen in Sections 3.2 and 4.2 that graph algorithms frequently access the labels and edges of a vertex together. A storage manager for graph data should therefore store them as close together as possible, which means, ideally, in the same block. If it is not possible to store them in the same block (e.g., because the vertex has more edges than fit in a block), they should be stored in blocks that are adjacent by their disk addresses.
Henceforth, we take storing a vertex to mean storing its labels and edges together. We use $|v|$ to denote the number of bytes that the encoding of an arbitrary vertex $v$ uses on the disk, and we use $B$ to denote the number of bytes in a block that can hold vertex data.
At the beginning of this section, we proposed that, without prejudice to the overriding objective of finding an optimal data placement, the storage manager shall place data as compactly on the disk as possible. Applied to desideratum (1), the storage manager may store several vertices in the same block, but it may not break up a vertex into two blocks if $|v| \le B$.
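For illustration, a minimal sketch (the record layout, names, and the 16 KB figure are our assumptions, not a prescription of an actual format) of a vertex record that keeps labels and edges together, together with the rule that a record fitting into one block may share the block but must never be split:

from dataclasses import dataclass
from typing import List

BLOCK_BYTES = 16 * 1024   # assumed value of B; the real block payload size is system-specific

@dataclass
class VertexRecord:
    vertex_id: int
    labels: bytes        # encoded labels of the vertex
    edges: List[int]     # ids of the adjacent vertices, stored with the vertex

    def encoded_size(self) -> int:
        # a crude stand-in for |v|: 8 bytes for the id, the label bytes,
        # and 8 bytes per edge entry
        return 8 + len(self.labels) + 8 * len(self.edges)

def fits_in_block(record: VertexRecord, free_bytes: int) -> bool:
    """A record with |v| <= B may share a block with other records,
    but it is never split across two blocks."""
    size = record.encoded_size()
    if size > BLOCK_BYTES:
        return False   # only such oversized records spill into adjacent blocks
    return size <= free_bytes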
In Section 4.1.1, we argued that vertex labels and edges should be represented in separate tables in a relational database. Here, we are thinking about how the two tables should be stored on the disk, which is a decision of the storage manager and hidden from the user. Desideratum (1) suggests that the edges table should be stored grouped on column a and interleaved with the vertices table. This may adversely affect the execution time of simple SELECT queries over one table, but it can increase the execution speed of queries that involve graph navigation.
Graph navigation is at the heart of most graph algorithms. The second desideratum is therefore to
(2) store vertices together with vertices in their neighborhood.
For most graphs, it will not be possible to place each vertex together with its neighbors in the same block or in adjacent blocks. An optimal placement is then one in which as many vertices as possible are stored as close as possible. Let us define an optimization problem:
Let $G = (V, E)$ be the graph that we are trying to store on the disk. If $E$ contains unordered pairs of vertices, convert them to symmetric directed pairs. Let us assume that $|v| \le B$ for all $v \in V$. Moreover, let us assume that the disk has enough free space to store the entire graph in consecutive blocks, and let these blocks be numbered $0, 1, \ldots$ Let $f: V \to \{0, 1, \ldots\}$ be a function that maps the vertices of $G$ to these blocks. We use $V_i$ to denote the set $\{v \in V : f(v) = i\}$, and we use $\|V_i\|$ to denote the sum $\sum_{v \in V_i} |v|$.
We propose that an optimal placement by desideratum (2) solves the following problem: minimize $c_2(f) = \sum_{(u,v) \in E} |f(u) - f(v)|$ subject to $\|V_i\| \le B$ for all blocks $i$. Put differently, let the cost of a particular placement be the sum, over all edges that cross block boundaries, of the distance between the two blocks involved. An optimal placement by desideratum (2) minimizes this cost.
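This cost can be transcribed directly; a minimal sketch in Python (the names c2 and f simply mirror the notation above):

def c2(edges, f):
    """Cost of a placement by desideratum (2): every (directed) edge contributes
    the distance between the blocks of its endpoints; edges inside one block
    contribute nothing."""
    return sum(abs(f[u] - f[v]) for u, v in edges)

# toy example: four vertices spread over blocks 0 and 1
f = {"a": 0, "b": 0, "c": 1, "d": 1}
edges = [("a", "b"), ("b", "c"), ("c", "d"), ("a", "d")]
print(c2(edges, f))   # ("b","c") and ("a","d") cross the boundary -> cost 2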
It is easy to see that if $G$ is connected, $c_2(f)$ can only be minimal if $f$ maps $V$ to consecutive blocks. Even if $G$ contains multiple components, some queries (e.g., a SELECT over all vertices) benefit if the components are stored in groups of blocks adjacent to each other. Let us therefore redefine $f$ to be a surjective function $f: V \to \{0, 1, \ldots, k-1\}$, where $k$ is not known.
Our optimization problem is related to a problem that first appeared in [9], known as the minimum linear arrangement (MinLA) problem. For a given graph $G = (V, E)$, the MinLA problem is to find a bijective function $\varphi: V \to \{1, \ldots, |V|\}$ that minimizes the cost $\sum_{(u,v) \in E} |\varphi(u) - \varphi(v)|$.
The MinLA problem has been proved to be NP-hard, and its decision problem to be NP-complete [8]. The MinLA problem would be equivalent to our optimization problem if we knew which vertices to put together in blocks, but not the order in which to put the blocks on the disk.
Several heuristic algorithms have been proposed that find near optimal solutions for MinLA problems. Among the most successful are spectral sequencing [11], multilevel-based algorithms [16, 22], a divide-and-conquer algorithm [2], and simulated annealing [19, 18]. Petit [19] compiled a suite of experiments with several of these algorithms. For the largest graph with 10,240 vertices and 30,380 edges, execution times varied between 4 seconds (spectral sequencing) and 12.83 hours (simulated annealing).
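Spectral sequencing is simple enough to sketch here: order the vertices by their entries in the Fiedler vector, i.e., the eigenvector of the graph Laplacian belonging to the second-smallest eigenvalue. A minimal dense implementation with numpy, suitable only for small graphs:

import numpy as np

def spectral_sequence(n, edges):
    """Return a linear arrangement of vertices 0..n-1, ordered by the Fiedler vector."""
    A = np.zeros((n, n))
    for u, v in edges:
        A[u, v] = A[v, u] = 1.0               # treat the graph as undirected
    L = np.diag(A.sum(axis=1)) - A            # graph Laplacian L = D - A
    _, eigenvectors = np.linalg.eigh(L)       # eigenvalues come back in ascending order
    fiedler = eigenvectors[:, 1]              # eigenvector of the 2nd-smallest eigenvalue
    return list(np.argsort(fiedler))          # position i holds the i-th vertex of the arrangement

# example: a path 0-1-2-3 is recovered in (possibly reversed) path order
arrangement = spectral_sequence(4, [(0, 1), (1, 2), (2, 3)])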
MinLA algorithms can be applied to our placement problem in two ways. Let $G = (V, E)$ be the graph that we are trying to store on the disk:
[A] Create a list of possible mappings from $V$ to consecutive blocks. Convert each mapping to a graph: each block is a vertex, and each edge $(u, v) \in E$ becomes an edge between the blocks that $u$ and $v$ are mapped to. The graphs may contain loops and parallel edges. Solve the MinLA problem for each graph. The mapping for the graph with the overall lowest cost, together with its ordering function $\varphi$, defines a function $f$ that solves our placement problem.
[B] Solve the MinLA problem for $G$. Set two integer variables $i$ and $j$ to 0. While $j < |V|$, do the following:
• Set $v$ to be the vertex with $\varphi(v) = j + 1$.
• If $\|V_i\| + |v| > B$, increment $i$.
• Set $f(v) = i$ and increment $j$.
Suppose the MinLA algorithm were capable of finding an optimal linear arrangement for every graph. Then approach [A] would be guaranteed to find a function $f$ that solves our placement problem. Unfortunately, the number of possible mappings of $V$ to blocks increases exponentially with $|V|$. A modern computer could probably not even enumerate all mappings for a large graph. Approach [A] is therefore not useful for a storage manager.
Approach [B] is not guaranteed to find an optimal $f$. However, the resulting $f$ is likely to be close to optimal. The MinLA algorithm must be capable of dealing with very large graphs, since it is applied directly to the input graph.
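A minimal sketch of approach [B] in Python, assuming the arrangement $\varphi$ is given as a dictionary from vertices to positions $1, \ldots, |V|$ and that $|v| \le B$ holds for every vertex:

def arrangement_to_placement(varphi, size, B):
    """Approach [B]: turn a linear arrangement into a block placement f.

    varphi -- dict: vertex -> position in the arrangement (1, ..., |V|)
    size   -- dict: vertex -> number of bytes |v| of the vertex encoding
    B      -- number of bytes per block available for vertex data
    """
    f = {}
    i = 0        # current block
    used = 0     # bytes already assigned to block i
    for v in sorted(varphi, key=varphi.get):   # vertices in arrangement order
        if used + size[v] > B:                 # v does not fit: open the next block
            i += 1
            used = 0
        f[v] = i
        used += size[v]
    return f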
Figure 6 shows two linear arrangements for an example graph. The arrangement shown under (a) is the MinLA. The edge-weighted graph below each arrangement illustrates how the algorithm described above under [B] would convert the arrangement into a disk representation. We call such a graph a block graph. Each vertex in a block graph corresponds to one block on the disk. The block graph contains an edge $(i, j)$ if there exist vertices $u$ and $v$ with $(u, v) \in E$, $f(u) = i$, $f(v) = j$, and $i \neq j$. The weight of edge $(i, j)$ is the cardinality of the set $\{(u, v) \in E : f(u) = i, f(v) = j\}$. The block graph preserves enough information to calculate $c_2(f)$. Notice that the placement derived from the optimal arrangement in (a) is inferior to the placement shown in (b).
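A block graph can be derived mechanically from $f$ and $E$; a minimal sketch (the helper names are ours):

from collections import Counter

def block_graph(edges, f):
    """Weights of the block graph: one unit per edge (u, v) of the input graph
    whose endpoints lie in different blocks."""
    weights = Counter()
    for u, v in edges:
        i, j = f[u], f[v]
        if i != j:
            weights[(i, j)] += 1
    return weights

def c2_from_block_graph(weights):
    # the block graph retains enough information to recompute c2
    return sum(w * abs(i - j) for (i, j), w in weights.items())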
Suppose we could find a placement that solved the minimization problem for $c_2$. Would it be optimal? $c_2$ assumes that accessing blocks $i$ and $i+1$ together takes exactly half of the time of accessing blocks $i$ and $i+2$ together. In practice, deviations from this assumption may be substantial.
Suppose that $(u, v) \in E$, $f(u) = i$, $f(v) = j$, and $j = i + 1$. By $c_2$, this placement is optimal if $u$ and $v$ cannot be placed in the same block. Assume that a graph traversal algorithm that prints the labels of the vertices it finds explores edge $(u, v)$. After retrieving block $i$ from the disk, it is quite possible that block $j$ has just passed under the disk head when the algorithm returns from printing the label of $u$. In the worst case, the algorithm has to wait for a whole rotation of the disk until block $j$ passes under the disk head again.
The read-ahead functionality of the disk drive may mitigate the problem. However, read-ahead works only in the forward direction. If $j < i$, that is, if $v$ is stored in a block before $u$, there is no guarantee that block $j$ is in the disk's cache after block $i$ has been read; $c_2$ does not differentiate between edge $(u, v)$ and edge $(v, u)$.
Moreover, read-ahead is usually only performed when the disk drive’s request queue is empty. If the algorithm requests several different blocks at the same time, read-ahead data may only be available for the last block read.
A storage manager that makes placement decisions based on $c_2$ should implement its own buffer and buffer management. Whenever a block $i$ is read, a uniform number of blocks before and after block $i$ should be retrieved from the disk.
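A minimal sketch of such a buffer (the class name, the window size, and the read_blocks callback are our assumptions; eviction is omitted):

class ReadAroundBuffer:
    """Toy buffer that fetches a symmetric window of blocks around every miss."""

    def __init__(self, read_blocks, window=4):
        self.read_blocks = read_blocks   # callable: (first_block, count) -> list of block contents
        self.window = window             # number of blocks to prefetch on each side
        self.cache = {}                  # block number -> block contents

    def get(self, i):
        if i not in self.cache:
            first = max(0, i - self.window)
            count = i + self.window - first + 1
            for offset, data in enumerate(self.read_blocks(first, count)):
                self.cache[first + offset] = data
        return self.cache[i]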
Instead of relying only on caching and buffering, we suggest refining the placement. The third and fourth desiderata for a storage manager for graph data are to
(3) minimize the total edge weight in the block graph, and to
(4) minimize the number of edges in the block graph.
Again, we can express the desiderata as optimization problems over $G$ and $f$: minimize $c_3(f) = |\{(u, v) \in E : f(u) \neq f(v)\}|$, and minimize $c_4(f) = |\{(i, j) : i \neq j \text{ and } \exists (u, v) \in E \text{ with } f(u) = i, f(v) = j\}|$, each subject to $\|V_i\| \le B$ for all blocks $i$. Notice the similarity between $c_3$ and $c_2$: $c_2$ penalizes edges that join vertices in different blocks by the distance between the blocks, whereas $c_3$ penalizes such edges uniformly. Put differently, $c_2$ rewards vertices that are stored close to most of their neighbors, while $c_3$ rewards vertices that are stored in the same block as most of their neighbors.
Minimizing $c_2$ tends to leave short edges between blocks; minimizing $c_3$ tends to create clusters of vertices within blocks. Clusters are useful, as they may enable graph algorithms to run several iterations based on the data in just a few blocks. DFS algorithms benefit especially, as they need only a single unexplored neighbor in each iteration to continue.
Minimizing $c_4$ channels the edges in $E$ into as few edges in the block graph as possible. The function ignores the number of edges in $E$ that span two blocks. The reasoning is the following: reading a block from the disk takes a certain amount of time, regardless of how many vertices in the block are needed by the graph algorithm. It is therefore sensible to try to reduce the number of blocks in which the neighbors of the vertices of a block are stored.
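Both costs can be computed directly from a placement; a minimal sketch that complements the c2 function given earlier (note that $c_4(f) \le c_3(f) \le c_2(f)$ always holds):

def c3(edges, f):
    """Desideratum (3): total edge weight of the block graph, i.e. the number
    of edges whose endpoints lie in different blocks."""
    return sum(1 for u, v in edges if f[u] != f[v])

def c4(edges, f):
    """Desideratum (4): number of distinct edges in the block graph."""
    return len({(f[u], f[v]) for u, v in edges if f[u] != f[v]})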
Minimizing $c_3$ is a special case of the graph partitioning problem, a fundamental and well-researched problem in graph theory. Given a graph $G = (V, E)$, its objective is to divide $V$ into $k$ disjoint subsets $V_1, \ldots, V_k$ such that the number of edges that join vertices from different partitions is minimal, and $|V_i| \le (1 + \epsilon) \frac{|V|}{k}$ for $1 \le i \le k$. $k$ and $\epsilon$ are given. Each set $V_i$ is called a partition.
The graph partitioning problem has been formulated in several versions. A version for edge-weighted graphs minimizes not the number but the total weight of the edges that cross partitions. A version for vertex-weighted graphs uses the following as the balancing constraint: $\sum_{v \in V_i} w(v) \le (1 + \epsilon) \frac{\sum_{v \in V} w(v)}{k}$, where $w(v)$ is the weight of vertex $v$.
In another version, the maximum partition size can be specified per partition.
In [15], the problem is extended for graphs where each vertex has multiple
weights.
The problem is relevant to many applications, including parallel scientific computing, task scheduling, and VLSI design. Graph partitioning is used, for instance, in scientific simulations, where computation is performed iteratively on each element of a physical 2D or 3D mesh model (a model of an airfoil, e.g.), to map the mesh elements to processors such that the load is roughly equal and inter-processor communication is minimized.
The graph partitioning problem is NP-complete [8], but because of its relevance, a large number of heuristic algorithms have been developed. A detailed survey with examples can be found in [23].
The graph partitioning problem has traditionally been approached with recursive bisection, that is, by first dividing $V$ into two partitions, then dividing each of these partitions into two partitions, and so on. In the mid-90s, Karypis and Kumar [13] proposed a multilevel algorithm that achieves a $k$-way partitioning in one run. This has since become the second standard approach to graph partitioning.
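For illustration, recursive bisection can be sketched with the Kernighan-Lin bisection routine that ships with the networkx Python library; this toy version ignores vertex weights and the block capacity constraint:

import networkx as nx
from networkx.algorithms.community import kernighan_lin_bisection

def recursive_bisection(G, depth):
    """Split G into 2**depth partitions by repeatedly bisecting every part."""
    parts = [set(G.nodes)]
    for _ in range(depth):
        next_parts = []
        for part in parts:
            if len(part) > 1:
                a, b = kernighan_lin_bisection(G.subgraph(part))
                next_parts += [set(a), set(b)]
            else:
                next_parts.append(part)
        parts = next_parts
    return parts

# example: a 4x4 grid graph split into four partitions
parts = recursive_bisection(nx.grid_2d_graph(4, 4), depth=2)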
Because most of the proposed algorithms are complex, several authors have published open source ready-to-use tools that partition arbitrary graphs. The two most widely used tools are Chaco version 2.0 (Hendrickson and Leland, published 1995, approx. 30,000 lines of C code [10]) and Metis version 4.0 (Karypis and Kumar, published 1998, approx. 22,000 lines of C code [14]).
The optimization problem for $c_3$ is not a partitioning problem because $k$, the number of blocks in the disk representation, is not known. To use an existing partitioning algorithm, $k$ has to be guessed. Given $k$, a partitioning algorithm for vertex-weighted graphs can be used. The weight of each vertex $v$ is given by $|v|$. To get $B$ on the right side of the balancing constraint, parameter $\epsilon$ has to be set to $\frac{kB}{\sum_{v \in V} |v|} - 1$. $k$ must be chosen such that $\epsilon$ is non-negative. Choosing $k$ such that $\epsilon = 0$ means that the algorithm cannot leave any block with free space. This might not be possible. In general, several different values for $k$ should be tried.
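A small helper makes the relationship between $k$, $B$, and $\epsilon$ concrete (a sketch under the formula above; the example numbers echo the 16 KB block size used for reference in the next paragraph):

def epsilon_for(k, total_bytes, B):
    """Imbalance parameter that puts the block capacity B on the right-hand side
    of the balancing constraint: (1 + eps) * total_bytes / k == B."""
    eps = k * B / total_bytes - 1
    if eps < 0:
        raise ValueError("k too small: k must be at least total_bytes / B")
    return eps

# example: 80 MB of vertex data and 16 KB blocks need at least 5000 blocks;
# guessing k = 5200 leaves roughly 4% slack per block
print(epsilon_for(k=5200, total_bytes=80_000_000, B=16_000))   # ~0.04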
Unfortunately, most partitioning algorithms have been designed for a relatively small number of partitions. Chaco only implements cuts into $k = 2$, $4$, or $8$ partitions per step. Metis (algorithm kmetis, default parameters) has a memory footprint that grows with $k$, which makes it impracticable to get a result for a large $k$. For reference, a database with 5000 blocks at 16 KB per block holds merely 80 MB of data.
Finally, existing tools should always be used with care. Metis, for instance, contains several hard-coded special cases that allow individual partitions to violate the balancing constraint, that is, to grow too large. For our purpose, the balancing constraint has to be strict.
We have yet to tackle the optimization problem for $c_4$. Both Chaco and Metis implement procedures that try to reduce the number of partitions that each partition is linked to. Unfortunately, these features are not well documented. Their existence shows, however, that the minimization problems for $c_3$ and $c_4$ can be addressed together.
Figure 7 shows three placements of an example graph with 16 vertices into disk blocks. It shows the block graph for each placement, as well as the cost according to $c_2$, $c_3$, and $c_4$. Boldface indicates the lowest value for each function.
The example graph is undirected. For simplicity, the block graphs are shown as undirected graphs as well. To determine the values of the cost functions, each edge is treated as two symmetric directed edges with the same weight.
Which of the three placements is preferred? And how does the preference relate to the functions $c_2$, $c_3$, and $c_4$? These questions are difficult, and we cannot answer them definitively. Which placement is preferred depends on a number of factors, including the type of query that is run most often, the characteristics of the graph, and the available buffer memory. We might prefer alternative (b) for BFS queries because, regardless of which vertex is the source vertex, there will only be two disk accesses: one to retrieve the block in which the source vertex is stored, and one to retrieve all remaining blocks. We might prefer alternative (c) if block 2 can be kept in the buffer permanently.
A storage manager should combine the cost functions in an intelligent way. Unfortunately, $c_2$, $c_3$, and $c_4$ are based on different scales and are difficult to compare; only $c_4(f) \le c_3(f) \le c_2(f)$ always holds. The next chapter describes how G-Store uses the functions.