Many modern applications are based on graph data. Social networks, for instance, are based on graphs that describe relationships among people. Paths of disease outbreaks form a graph, as do airline routes, and citations among academic papers. These graphs often contain a massive amount of data.
Relational databases are today’s system of choice for storing and querying large amounts of data. Their use has been backed up by decades of research and commercially successful database management systems such as Oracle, IBM DB2, and PostgreSQL. Relational databases are extremely fast in finding, filtering, inserting, deleting, and updating information in a database table. Joining information from several tables, on the other hand, often takes orders of magnitude longer.
From a theoretical point of view, the relational database model is not a good fit for representing highly interconnected data. Regardless, as businesses and processes around the world get more interconnected, relational databases are increasingly used for storing such data. Perhaps one reason is the lack of a suitable, stable, and well-supported alternative.
G-Store is a prototype of a storage manager for large vertex-labeled graphs. It has been designed and implemented between May and August 2010 and consists of approximately 12,000 lines of C/C++ code. G-Store exploits the structure of the graph to derive a data placement on disk that is optimized for access patterns found in graph queries. The placement strategy is based on a multilevel algorithm that partitions the graph into blocks and arranges these blocks on disk to minimize the distance on disk between adjacent vertices. G-Store has a built-in query engine that supports depth-first traversal, reachability testing, shortest path search, and shortest path tree search.
This thesis is organized as follows:
• Chapter 2 introduces definitions and concepts that are used throughout the thesis. Section 2.1 is a primer on graph theory, Section 2.2 describes the memory levels in a computer, and Section 2.3 presents the mechanic properties of disk drives.
• Chapters 3 and 4 discuss traditional approaches to managing graph data. Chapter 3 focuses on main memory, Chapter 4 focuses on relational databases. Several aspects of the discussion are directly relevant to G-Store and the remaining thesis. Notably, Sections 3.2.1 and 3.2.2 introduce depth-first search and breadth-first search.
• Chapter 5 defines the role of the storage manager and develops the theoretical foundation for G-Store’s storage algorithm.
• Chapter 6 presents G-Store’s storage algorithm, the main contribution of G-Store.
• Chapter 7 presents important parts of G-Store’s query engine and shows how G-Store is used.
• Chapter 8 shows the results of first experiments with G-Store.
• Chapter 9 concludes the thesis and looks out to G-Store version 0.2.
Apart from the discussion of G-Store and its theoretical foundation, this thesis contributes:
• A survey and comparison of important main memory graph data structures (Section 3.1).
• An idea for a modification of the adjacency lists data structure that improves upon its asymptotic complexity for a specific graph operation (Section 3.1.3).
• An idea for representing an adjacency matrix in a relational database (Section 4.1.2).
• A survey of queries for traversing a graph stored in a relational database (Section 4.2).
• Block-first search, an algorithm for testing reachability in G-Store (Section 7.6).