Saturday, September 15, 2012

Finding Shortest Path in a Graph inside PostgreSQL Database

@cegme and I recently tackled this interesting problem of finding the shortest route between two nodes in a graph from inside a database.

We have a list of research papers and their citations which we call "links". The idea is to start from a given paper and follow its links recursively till reach the specified destination paper. This is a typical graph problem and can be solved using well known shortest-path finding algorithms. However our data was in the database consequently we wanted to solve the problem using SQL queries which was a challenge.

We solved this problem inside the database using WITH queries (Common Table Expressions) which was introduced in PostgreSQL version 8.4. The following extract explains WITH keyword.
WITH provides a way to write subqueries for use in a larger SELECT query. The subqueries, which are often referred to as Common Table Expressions or CTEs, can be thought of as defining temporary tables that exist just for this query. One use of this feature is to break down complicated queries into simpler parts.

This is how we went about solving the problem:

We have the following table (in a PostgresSQL database):
Reference(pid integer, citation integer);         // paper "pid" cites paper "citation"

We created a User Defined Function (UDF) called:
search_path(start integer, destination integer)

Where start is the paper id of the starting paper and destination is the paper id of the last paper.


Detailed Explanation

Let me explain what happens in this UDF:

1. We declare a UDF :
CREATE OR REPLACE FUNCTION search_path(start integer, destination integer)

2. It returns a table.
RETURNS TABLE(depth integer, path integer[]) AS $$

3. We use the WITH RECURSIVE keyword available in Postgres and form a table containing the starting paper (id), the next paper in the path (link), the length of the path (depth), the list of traversed paper (route), and a boolean cycle variable to keep a note of whether we are forming a cycle or not.
WITH RECURSIVE search_paper(id, link, data, depth, route, cycle)

4. We start off with the starting paper id. This is the base case and gets executed only once.
SELECT r.pid, r.citation, 'data', 1,
ARRAY[r.pid],
false
FROM reference r WHERE r.pid=start


5. We recursively add the next link in the path, check if a cycle is formed or not, increase the depth. If a cycle is formed, we immediately exit the loop. This is controlled by the cycle variable. This piece of code gets executed multiple times until cycle != true.
SELECT r.pid, r.citation, 'data', sp.depth+1, route || r.pid,
r.pid = ANY(route)
FROM reference r, search_paper sp
WHERE r.pid = sp.link AND NOT cycle

6. Finally if we hit our destination paper id in the search, we exit the loop. We only select the depth and the route (as an array of integers) and display it. Note that we order the result with increasing depth since we may obtain more than one route.
SELECT sp.depth, (sp.route || sp.link) AS route FROM search_paper AS sp
WHERE link = destination AND NOT cycle ORDER BY depth ASC;

No comments: