The lack of advanced data types and flexible program control greatly reduces the number of viable graph representations in a relational database. The edge list data structure is the de facto standard. It is introduced in Section 4.1.1. Section 4.1.2 presents an attempt to create an adjacency matrix in a relational database.
The social network graph shown below is
used as an example in this chapter. Vertices represent people and edges mark
who-knows-whom. Each vertex is labeled with a unique identifier, a person’s
name, and his age. Edges are not labeled and parallel edges and loops are not
allowed.
The edge list data structure is the de facto standard for storing graphs in a relational database. It is the database equivalent of the unordered edge list (see Section 3.1.1). Since a table in a relational database by definition stores an unordered set of records, the term “unordered” is redundant.
Several aspects should be considered when setting up the data structure: First, to comply with the domain/key normal form, vertex labels should be represented separately from edges. If vertex labels and edges were represented in the same table, complex constraints would be necessary to enforce data consistency.
The primary key for the vertices table is the unique vertex identifier. The edges table has two columns holding foreign keys. Since parallel edges are not allowed in the example graph, the columns can be a joined primary key for the edges table. Additionally, this prevents the user from illegally inserting an edge that already exists in the graph. We should add another constraint to prevent the insertion of loops.
Finally, both tables should be properly supported by indexes. Most database management systems (DBMSs) automatically create indexes on primary key columns to enforce primary key uniqueness. To speed up graph navigation, an index on each column of the edges relation should be added. To speed up the creation of the graph, it is advisable to add the indexes after the table has been filled.
In PostgreSQL, the data structure can be created as follows.
CREATE TABLE vertices(
id INTEGER PRIMARY KEY,
name VARCHAR(20),
age SMALLINT
);
CREATE TABLE edges(
a INTEGER NOT NULL REFERENCES vertices(id),
b INTEGER NOT NULL REFERENCES vertices(id),
PRIMARY KEY (a, b),
CONSTRAINT no_loops CHECK (a != b)
);
CREATE INDEX idx_a ON edges(a);
CREATE INDEX idx_b ON edges(b);
After inserting the example social network, finding who in the network is known to ‘Robert’ is a simple task:
Output A B ID NAME AGE
1 3 3 Erich 18
1 4 4 Jacques 62
SELECT * FROM edges JOIN vertices ON b = id
WHERE a = (SELECT id FROM vertices WHERE name = 'Robert');
The relational adjacency matrix is an attempt to create an adjacency matrix (see Section 3.1.4) in a relational database. Just like a matrix, a table in a relational database consists of rows and columns. The only structural difference is that the rows in a database table are identified by primary keys, which can have an arbitrary type and value, while the rows in a matrix are identified by consecutive integers. We can overcome this difference by choosing appropriate primary keys.
There a several aspects to consider: First, since it is not possible in SQL to select columns based on values in rows, we should transpose the adjacency matrix. In other words, all edges incident on vertex should be stored in column instead of in row .
Second, some DBMSs, including Oracle, do not supply a data type for representing booleans. In such systems, a CHAR(1) type constrained to ‘0’ and ‘1’ can be used. PostgreSQL does supply a boolean type, BOOLEAN. It uses one byte of storage.
Finally, most s limit the number of columns a table can have. PostgreSQL’s limit lies between 250 and 1600, depending on the types of the columns. Oracle’s limit is 1000. As a solution, we suggest storing several columns of the matrix in a character string, and using string functions to filter for a specific column. If the SQL function SUBSTR() is implemented efficiently, an individual character in a character string can be accessed in constant time.
Which data type to use depends on the DBMS and the size of the graph. PostgreSQL’s TEXT data type can store character strings of unlimited length. Oracle’s VARCHAR data type can store up to 4000 characters. Oracle’s CLOB (“Character Large Object”) data type can store up to 4 GB of text. This data type has special string functions (e.g., DBMS_LOB.SUBSTR()).
Experiments should determine if the columns of the adjacency matrix are better represented in one large column or in multiple smaller columns. To create a data structure for the example graph, we use a single column and PostgreSQL’s TEXT data type:
CREATE TABLE adjmat(
id INTEGER PRIMARY KEY,
cols TEXT
);
--01234
INSERT INTO adjmat VALUES(0,'00000');
INSERT INTO adjmat VALUES(1,'10100'); --(0,1), (2,1)
INSERT INTO adjmat VALUES(2,'00010'); --(3,2)
INSERT INTO adjmat VALUES(3,'01000'); --(1,3)
INSERT INTO adjmat VALUES(4,'01100'); --(1,4), (2,4)
The following query finds the neighbors of ‘Robert’:
Output ID COLS NAME AGE
3 01000 Erich 18
4 01100 Jacques 62
SELECT * FROM adjmat NATURAL JOIN vertices
WHERE SUBSTR(cols, 1 + (SELECT id FROM vertices
WHERE name = 'Robert'), 1) = '1';
The relational adjacency matrix is an experimental data structure. Since it does not rely on joins for graph navigation, we think that it may have the potential to perform better than the edge list data structure in some applications. Procedural language extensions for SQL may have to be used to take advantage of the relational adjacency matrix’ capabilities.
In this section, we present four approaches for traversing a graph stored in an edge list data structure in a relational database. In main memory, we could utilize the manifold functionality of modern programming languages. In a relational database, we are bound to the functionality of SQL and the DBMS. The example queries in this section have been created in PostgreSQL 8.4, except where stated otherwise.
The simplest way to navigate breadth-first in a graph is to repeatedly join the edges relation on itself. Every join corresponds to a run of the outermost for loop in the BFS algorithm presented in Section 3.2.1. With a single join, vertices at level 2 are reached. With two joins, vertices at level 3 are reached, and so on. The query below shows how joins can be used to find a list of people known to people known to ‘Robert’. The columns A A B contain the identifiers of the vertices along the paths.
Output ID NAME AGE A A B
2 Edward 32 1 3 2
SELECT * FROM vertices JOIN
(SELECT lvl_1.a, lvl_2.a, lvl_2.b FROM edges lvl_1
JOIN edges lvl_2 ON lvl_1.b = lvl_2.a
WHERE lvl_1.a = (SELECT id FROM vertices
WHERE name = 'Robert')) lvl_2
ON id = lvl_2.b;
The query returns a list of people at the target level. The query below includes each person along each path. A person’s name is padded with spaces according to his level in the path. The three common table expressions (WITH ...) are used to prevent the DMBS from retrieving the same path multiple times, and to make the query more readable.
Output NAME AGE
Robert 39
Erich 18
Jacques 62
Edward 32
WITH lvl_1 AS (SELECT id FROM vertices WHERE name = 'Robert'),
lvl_2 AS (SELECT b FROM edges
WHERE a = (SELECT * FROM lvl_1)),
lvl_3 AS (SELECT b FROM edges
WHERE a IN (SELECT * FROM lvl_2))
SELECT LPAD(' ', lvl) || name AS name, age FROM
(SELECT 0 lvl, name, age FROM vertices
WHERE id = (SELECT * FROM lvl_1)
UNION ALL
SELECT 1 lvl, name, age FROM vertices
WHERE id IN (SELECT * FROM lvl_2)
UNION ALL
SELECT 2 lvl, name, age FROM vertices
WHERE id IN (SELECT * FROM lvl_3)) q
ORDER BY lvl;
Notice how the order of the output rows might lead us to believe that ‘Robert’ was connected to ‘Edward’ through ‘Jacques’. DFS can find a list with a more intuitive order, as we shall see below.
When navigating short distances in a graph, manual joins may well be used. For larger distances, queries with manual joins become very large and error-prone. One solution is to install a temporary table that holds frequently visited paths or the transitive closure of the graph. A disadvantage of such a table is that it has to be maintained and may use a considerable amount of space. Materialized views may mitigate the maintenance issue. Transitive reduction [1] may help to reduce the size of such a table.
Recursive common table expressions (CTEs) are a relatively new feature in the SQL standard. PostgreSQL added support in version 8.4, Oracle in version 11g. Strictly speaking, recursive CTEs implement iteration not recursion, and Oracle’s version of recursive CTEs also does not use the keyword RECURSIVE.
A recursive CTE consists of two parts, separated by a UNION or UNION ALL. The first part generates the initial set of rows. The second part is the recursive term and can contain a reference to its own output. The query below shows how a recursive CTE can be utilized to find all paths in the example graph. Notice the mechanism that prevents the query from falling into an infinite recursion:
Output ROOT DEST LEN PATH CYCLE
0 1 1 {0,1} f
0 1 4 {0,1,3,2,1} t
0 2 3 {0,1,3,2} f
0 3 2 {0,1,3} f
0 4 2 {0,1,4} f
0 4 4 {0,1,3,2,4} f
1 1 3 {1,3,2,1} t
1 2 2 {1,3,2} f
... (21 rows in total)
WITH RECURSIVE all_paths(root, dest, len, path, cycle) AS(
SELECT a, b, 1, ARRAY[a] || b, FALSE FROM edges
UNION ALL
SELECT root, b, len + 1, path || b, b = ANY(path)
FROM edges JOIN all_paths ON a = dest
WHERE NOT cycle)
SELECT * FROM all_paths
ORDER BY root, dest, len;
When the ORDER BY clause is omitted, the query returns output rows in non-decreasing order of LEN. Although SQL does not guarantee any order without ORDER BY, the natural order of the output is often an indication of how a query is executed by the DBMS. In this case, it is an indication of BFS.
The following query solves the single-source shortest path problem:
SELECT root, dest, MIN(len), path, cycle FROM all_paths
GROUP BY root, dest, path, cycle
ORDER BY root, dest;
To find paths for a specific source vertex, a WHERE filter can be added in the first part of the recursive CTE. If the filter is added outside of the CTE, the DBMS could retrieve all paths first, and then filter for the specified source vertex.
To limit the search to paths of a given length, a WHERE filter can be added in the second part of the recursive CTE. Again, if the filter is added outside of the CTE, the DBMS could apply it after retrieving all paths.
We are not aware of a technique to stop a recursive CTE as soon as a path has been found. This would be useful for answering reachability queries or shortest path queries in large graphs.
The query below finds all paths between ‘Alan’ and ‘Jacques’ with a length of at most 3. Notice how this query is more portable than the previous query, as it does not use the PostgreSQL-specific data types ARRAY and BOOLEAN:
Output ROOT DEST LEN PATH CYCLE
0 4 2 -0-1-4 0
WITH RECURSIVE all_paths(root, dest, len, path, cycle) AS(
SELECT a, b, 1, '-' || a || '-' || b, '0' FROM edges
WHERE a = (SELECT id FROM vertices WHERE name = 'Alan')
UNION ALL
SELECT root, b, len + 1, path || '-' || b,
CASE WHEN path LIKE '%-' || b || '%' THEN '1' ELSE '0' END
FROM edges JOIN all_paths ON a = dest
WHERE NOT cycle = '1' AND len + 1 <= 3)
SELECT * FROM all_paths
WHERE dest = (SELECT id FROM vertices WHERE name = 'Jacques')
Depth-first traversal is more difficult in a relational database. In this section, we present a recursive approach based on PL/pgSQL, PostgreSQL’s procedural extension language for SQL. PL/pgSQL is very similar to PL/SQL, Oracle’s procedural SQL extension, and the functions shown in this section can be easily converted.
Similar to the C++ implementation presented in Section 3.2.2, the PL/pgSQL implementation consists of two functions: An initialization function dfs(), and a recursive function dfs1(). The functions use two additional database tables: A results table that must be created by the user, dfs_results, and a cycle detection table that is automatically created and deleted, dfs_tmp.
The structure of dfs_result depends on the purpose of DFS. To find a hierarchical list of people that can be reached from a given person in the example social network, dfs_result can be created as follows:
CREATE TABLE dfs_result(
rownum SERIAL,
id INTEGER NOT NULL REFERENCES vertices(id),
level INTEGER,
cycle BOOLEAN
);
SERIAL is an integer data type in PostgreSQL that is automatically incremented with each new record in a table. We use it to keep track of the order in which records are inserted.
SELECT dfs(1); starts the search for source vertex ‘Robert’. 1 is the unique identifier for ‘Robert’. After the function returns, the user has to join dfs_results on the vertices table:
Output NAME AGE CYCLE
Robert 39 f
Erich 18 f
Edward 32 f
Jacques 62 f
Robert 39 t
Jacques 62 f
SELECT LPAD(' ', level) || name AS name, age, cycle
FROM dfs_result NATURAL JOIN vertices
ORDER BY rownum;
The list has an intuitive order. It tells us, for instance, that ‘Jacques’ is connected to ‘Robert’ both directly and through ‘Erich’ and ‘Edward’. The PL/pgSQL functions are:
CREATE FUNCTION dfs1(me int, lvl int) RETURNS void AS $$
DECLARE
c RECORD;
BEGIN
UPDATE dfs_tmp SET active = TRUE WHERE id = me;
FOR c IN
SELECT b, active FROM edges JOIN dfs_tmp ON b = id
WHERE a = me
LOOP
IF c.active = TRUE THEN
INSERT INTO dfs_result (id, level, cycle)
VALUES (c.b, lvl + 1, TRUE);
ELSE
INSERT INTO dfs_result (id, level, cycle)
VALUES (c.b, lvl + 1, FALSE);
PERFORM dfs1(c.b, lvl + 1);
END IF;
END LOOP;
UPDATE dfs_tmp SET active = FALSE WHERE id = me;
END;
$$ LANGUAGE plpgsql;
CREATE FUNCTION dfs(s int) RETURNS void AS $$
BEGIN
DELETE FROM dfs_result;
CREATE TABLE dfs_tmp AS SELECT id, FALSE active
FROM vertices;
INSERT INTO dfs_result (id, level, cycle)
VALUES (s, 0, FALSE);
PERFORM dfs1(s, 0);
DROP TABLE dfs_tmp;
END;
$$ LANGUAGE plpgsql;
The CONNECT BY clause is an SQL extension for depth-first tree and graph traversal in Oracle and EnterpriseDB databases. Before recursive CTEs were introduced, it was by far the most convenient method for navigating in graphs stored in a relational database. The query below has been created in an Oracle 11g databases and shows how CONNECT BY can be used to find all paths with source ‘Robert’.
Output S D NAME A C L PATH
1 3 Erich 18 0 0 Robert-Erich
1 2 Edward 32 0 0 Robert-Erich-Edward
1 1 Robert 39 1 1 Robert-Erich-Edward-Robert
1 4 Jacques 62 0 1 Robert-Erich-Edward-Robert-Jacques
1 4 Jacques 62 0 1 Robert-Erich-Edward-Jacques
1 4 Jacques 62 0 1 Robert-Jacques
SELECT CONNECT_BY_ROOT a s, b d,
LPAD(' ', LEVEL) || name name, age a,
CONNECT_BY_ISCYCLE c, CONNECT_BY_ISLEAF l,
'Robert' || SYS_CONNECT_BY_PATH(name, '-') path
FROM edges JOIN vertices ON b = id
START WITH a = (SELECT id FROM vertices WHERE name = 'Robert')
CONNECT BY NOCYCLE PRIOR b = a;
Documentation of the clause and its associated pseudocolumns is included with every Oracle database. Notice the CONNECT_BY_ISLEAF pseudocolumn. It shows ‘1’ if no unexplored vertex is found during the scan of the neighborhood of a vertex.
Compare the output of this query with the output of the PL/pgSQL query in the previous section. Notice how the PL/pgSQL query stopped when it detected a cycle in path Robert-Erich-Edward-Robert, while the CONNECT BY query continued and found path Robert-Erich-Edward-Robert-Jacques.
To our knowledge, it is not possible to limit the traversal depth of a CONNECT BY query in Oracle. A WHERE condition on the LEVEL pseudocolumn is evaluated after traversal. This is different in G-Store’s query engine.