Hi, I don’t post to here much. I’m one of the devs working on SONAR, focusing on mostly theme extraction.
As with many applications, SONAR’s data crunching is basically relational database driven. We keep thinking about experimenting with graph DB based approaches, but never manage to find quite a compelling enough reason - there’s no way we’d give up the relational approach entirely, so it needs to be a really big win to be worth the annoyance of having to maintain two different types of database in synch with eachother.
Unfortunately this sometimes leaves us in the unenviable position of having to do graph algorithms in SQL. This is about as much fun as you might imagine it to be. Most recent challenge: Computing the connected components of a graph in SQL.
There’s always the option of loading it into memory and doing it there of course. But the graph in question is rather large. With our little demo data sets it would be fine to do that, but any larger (e.g. on a real live sonar deployment) and this starts to sound like a really bad idea.
It turns out this is surprisingly simple to do once you have the key insight. I couldn’t find anything on the web explaining this though, so thought I’d write a post about it in case anyone else needs to do the same. It’s not rocket science, but hopefully this will save someone some time.
Consider the following setup:
create table if not exists items( id int primary key, component_id int ); create table if not exists links( first int references items (id), second int references items (id) );
We consider entries in links as undirected edges in a graph and we want to update items so that all items in the same component have the same component_id and distinct components have distinct component ids.
We’ll do this incrementally and merge. In order to do this we need a new table which we’ll use as scratch space (this should be a temporary table, but MySQL has irritating restrictions on temporary tables which make this not work):
create table if not exists components_to_merge( component1 int, component2 int);
(Side note: All SQL here is tested only on MySQL. It shouldn’t be hard to make it work on any other database though).
The idea is that at each step we’ll merge components, using components_to_merge to map components to the component they’ll be merged with.
So we start with a set of candidate components. That’s simple enough: We take each node as a potential starting component.
Now at each stage we merge components by finding links between them. For every potential component we look at all other potential components it’s connected to via some link. We insert all component-component links into components to merge. This is straightforward enough:
insert into components_to_merge
select distinct t1.component_id c1, t2.component_id c2
from links
join items t1
on links.first = t1.id
join items t2
on links.second = t2.id
where t1.component_id != t2.component_id
insert into components_to_merge
select component2, component1 from components_to_merge; -- ensure symmetricity
So, now we have a list of groups to merge in this table. If the table is empty then we’re done - all the groups are maximal (and because of the way we constructed them they’re connected - at each point they were built by joining together two connected sets which were the connected to eachother), so the component_ids currently in items describe the actual components. If not, we now reassign components:
update items
join (
select component1 source, min(component2) target
from components_to_merge
group by source
) new_components
on new_components.source = component_id
set items.component_id = least(items.component_id, target)
This step merges each component with another one (although “merging” conveys a slightly inaccurate sense of what happens. Consider a graph 1-2-3. The first step would result in 1 and 2 being assigned component_id 1, and 3 being assigned component_id 2. So the {3} component took the place of the {2} component).
What’s the complexity of this code? Well, it’s not amazing, but it’s not terrible either. The complexity of each step in the loop is probably somewhere around O(n log(n)) depending on exactly what the database does to it. The worst case number of queries is the size of the largest component: It’s obviously an upper bound as the number groups decreases by at least one each time; In order to see that it’s achieved, consider an extension of the 1-2-3 example where we have a graph 1-2-…-n. Then at each stage what happens is that we end up with components [1], [2], …, [n] -> [1, 2], [3], …, [n] -> [1, 2, 3], [4], …, [n], etc, taking n steps to terminate. On the other hand, a complete graph terminates in one step.
I suspect that the expected run time is O(log(n)), with each part of the component chosen approximately doubling in size each time, but I confess to not actually having bothered to run the maths: For our particular use case at the moment this is fine - it turns out the graph we’re considering is fairly sparse and tends to have small components, so for the moment this is more than fast enough. On the other hand, it would be nice to have a better guaranteed time, so if anyone has a smarter approach I’d love to hear it.
Anyway, here’s some sample code that ties all of this together: http://code.trampolinesystems.com/components.rb

November 19th, 2008 at 10:03 am
Awesome!
November 19th, 2008 at 5:18 pm
I don’t have much experience doing this stuff, but I have a use-case coming up that will call for it. So I found an interesting book by Joe Celko that deals with the issue of modeling trees and graphs in SQL. If you haven’t seen it yet, you may find something of interest there:
http://www.amazon.com/Hierarchies-Smarties-Kaufmann-Management-Systems/dp/1558609202
November 20th, 2008 at 9:34 am
Thanks Phillip. I’m aware of the book and keep meaning to read it, but I don’t think it’s actually relevant to this particular problem: It focuses primarily on directed acyclic graphs, whileas this is an undirected cyclic one.
November 21st, 2008 at 8:22 pm
I am a bit slow and am not 100% sure of the goal. Is the idea to consider all nodes with the same external connections (i.e. to everyone but each other) as the same node?
November 23rd, 2008 at 5:05 pm
No, the idea is that we want to find communicating groups.
E.g. suppose A talks to B and B talks to C, but none of them talk to D, even though D talks to E (this isn’t actually how we use this, but it illustrates the point). We want to find the two groups
{A, B, C} and {D, E}.
i.e. despite the fact that A never talks to C directly, C is part of the group of people with which A communicates. However there are no communications reaching from A to D or E, so D and E are in a different group.
What this algorithm does is it assigns a unique number to each person such that people with the same number are in the same group of communications.
November 24th, 2008 at 6:47 am
the underlying concern of storing the graph and the merge results (ie. the communicating groups) in the database can still be satisfied while performing the actual merge (transitive entailment) outside the db with an in-memory solution. it should be blazing fast ( groups * log (<n) ) too. btw, doesn’t your select on min(component2) create a table scan?
November 24th, 2008 at 10:34 am
Well the problem isn’t the performance of the in memory solution so much as the fact that it’s in memory. The graph in question is really on the largish side (I mean sure, it’s only a few tens of thousands of nodes at the maximum in the data sets I’m running with at the moment, but that will easily bump up an order of magnitude or three in a production system). Loading all of that into memory is asking for trouble.
I don’t think the select min(component2) creates a full table scan as long as component2 is sensibly indexed. I’d have to take a look at the explain to be sure though. I didn’t talk about indices in this article, but the example code I linked to should have the right ones.
November 24th, 2008 at 2:34 pm
yah, that’s a fairly large set of object to represent in ruby - especially if you are on a fixed slice. i’m pretty sure the linked-in guys keep their graph in memory, of course they are on the jvm. “explain” command will tell you how mysql plans to process that query. http://dev.mysql.com/doc/refman/5.0/en/explain.html
November 24th, 2008 at 2:42 pm
Yes, I know it will. That’s why I said I’d have to take a look at the explain again. :-) I just wasn’t in front of a database at the time. And explain confirms that it uses the index rather than a table scan.
We could probably keep the graph in memory but I’m not really convinced it’s a good idea, particularly given that we can run this as a once a day process. Maybe at some point it will become a better idea, but for now this seems preferable.
November 24th, 2008 at 7:30 pm
sorry - explain and example read the same in haste ;)
i foresee ACID’able subgraphs in our 64bit, large memory future, or at the very least a specific case of stm for graphs.
November 24th, 2008 at 10:36 pm
Well, there are graph databases. We’re just not using one for various reasons. :-) (Although most of the existing graph databases don’t look like they’d make operations like “find all connected components” particularly easy)
January 11th, 2010 at 5:36 am
Hey, I found your blog while searching on Google your post looks very interesting for me. I will add a backlink and bookmark your site. Keep up the good work!
I’m Out! :)