Blog | Archive for November, 2008
rails 2.2 + jruby + jetty = win
By jan | Thursday, November 27th, 2008
In case you missed it, rails 2.2 recently got released, finally promising thread safety among some other things. Thread safety has always been neglected by the rails core team, the standard way to scale up in rails (pre 2.2) is to run multiple processes, which makes deployment a lot harder (I think there’re at least 10 different ways to deploy rails apps at the moment, and people still come up with new solutions: apache+fcgi, mongrel, mongrel_cluster, thin, phusion, rack…).Why has thread safety become a priority all of a sudden? I suspect one of the drivers is JRuby, which is now a viable alternative to MRI Ruby, and which also has the nice property of mapping Ruby threads to native threads. Another factor might be the arrival of merb, the new kid on the ‘ruby web framework’ block. Merb has been designed with thread safety in mind, and is now starting to get a lot of attention (1.0 has just been released).
Now, with a thread safe rails JRuby might become the platform of choice for deploying rails apps, especially given the performance progress the JRuby team is making. Having real threads does make a huge difference, reducing the memory footprint and making better use of multi core cpus.
There’re a couple of possibilites to deploy a rails application in JRuby, glassfish seems to be the recommended choice at the moment. However glassfish is anything but easily embeddable so I tried jetty as an option. Compared to glassfish, jetty is solid and proven (version 7 will be released soon), small and easily embeddable.
I didn’t want to use warbler (no web.xml please!), instead I used a combination of JRuby-Rack + Jetty7 and tied everything together with a simple JRuby script.
server = org.mortbay.jetty.Server.new
thread_pool = org.mortbay.thread.QueuedThreadPool.new
thread_pool.min_threads = 5 # adjust as needed
thread_pool.max_threads = 50
server.set_thread_pool(thread_pool)
connector = org.mortbay.jetty.nio.SelectChannelConnector.new
connector.port = 3000
context = org.mortbay.jetty.servlet.Context.new(nil, "/",
org.mortbay.jetty.servlet.Context::NO_SESSIONS)
context.add_filter("org.jruby.rack.RackFilter", "/*",
org.mortbay.jetty.Handler::DEFAULT)
context.set_resource_base(RAILS_DIR)
context.add_event_listener(org.jruby.rack.rails.RailsServletContextListener.new)
context.set_init_params(java.util.HashMap.new(
'rails.root'=> '.', 'public.root' => 'public',
'org.mortbay.jetty.servlet.Default.relativeResourceBase' => '/public',
'jruby.max.runtimes' => '1'))
context.add_servlet(org.mortbay.jetty.servlet.ServletHolder.new(
org.mortbay.jetty.servlet.DefaultServlet.new), "/")
server.set_handler(context)
server.start
This will run jetty on port 3000, dispatching all requests for dynamic content to a single JRuby instance. It is important to set “‘jruby.max.runtimes” to 1, so it’ll create a shared application runtime for you, otherwise you’ll get the old one runtime per thread model.
On the rails side you need “config.threadsafe!” in the configuration file. Autoloading of classes will then be disabled, be sure to load all your dependencies upfront in environment.rb. We haven’t actually used this in production, but some initial tests look very promising (mongrel: 23req/s, jetty: 50 req/s). Also, deployment will be a lot easier, because static and dynamic content can be served by one single process.
Posted in Coding | 26 Comments »
Calculating Pearson's correlation coefficient in SQL
By david | Sunday, November 23rd, 2008
Apparently I’m crazy.Well, it’s not the first time this accusation has been levelled. But it’s one of the more amusing ones.
My colleague Jan Berkel was at Ruby Manor, where one of the talks was on acts_as_recommendable. Here’s what he had to say on the subject:
a plugin for rails to generate recommendations. lets you do things like user.similar_users / books.similar_books etc. it’s based on pearson correlation. the guy mentioned he ran into memory problems using large datasets for doing the computation (load all objects , calculate, store objects). i asked whether they tried performing the calculation on SQL level to avoid runtrips to the db (the way david implemented it). he first didn’t understand what i was saying, then someone from the audience said “you’d have to be crazy to do that”… i guess this elegant solution is not the ruby way.. anyway.
This entertained me, as one of the things I introduced recently to our theme extraction process is using pearson’s coefficient to rate similarity of themes. It never even occurred to me that loading it all to the client side doing it in Ruby and then saving it back might be considered an option.
Pearsons is very far from being the most powerful metric in the world, but remarkably useful for all that: It gives a decent scaled measure of similarity and is cheap and easy to calculate.
Ok. Well. Easy to calculate. And cheap to calculate if you don’t do clever things like loading the entire data set into memory. Anyway…
All joking aside, it wasn’t entirely trivial to do, and clearly this is something which is of interest to more people than us, so this post is an explanation of how it works.
The Pearson product-moment corellation coefficient of two distributions is a measurement of their similarity. Given random variables X, Y we define the covariance of the two to be:
Cov(X, Y) = E(XY) – E(X) E(Y)
The reason for this slightly strange looking expression is that random variables have the nice property that for two independent random variables E(XY) = E(X) E(Y). So if X and Y are independent then Cov(X, Y) = E(XY) – E(X) E(Y) = E(X) E(Y) – E(X) E(Y) = 0. The converse isn’t generally true (it’s true under some circumstances), but nevertheless we can use Cov(X, Y) as a cheap measure of degree of independents.
It has some other nice properties: Suppose Y = aX. Then Cov(X, Y) = a Var(X). So if Y is a positive multiple of X then the covariance is positive, if Y is a negative multiple of X the covariance is negative.
This also shows us that there’s no inherent significance to the size of the covariance. It depends on the variance of both X and Y – in fact, scaling either will increase the covariance proportionately.
However it does have the following nice property:
|Cov(X, Y)| <= StdDev(X) StdDev(Y)
(When one is a multiple of the other, equality is achieved).
This allows us to define the pearson’s coefficient (normally written rho):
rho(X, Y) = Cov(X, Y) / (StdDev(X) * StdDev(Y))
Which by the above property takes values between -1 and 1 (it will take value 1 when one is a positive multiple of the other and value -1 when one is a negative multiple of the other).
That’s all there is to it. It’s a nice scaled measure of how similar the two distributions are.
So, why do we care and how do we use this?
Well, the details of how we actually use this are cunning and proprietary. So I can’t really tell you those (I’m hoping that we may be able to disclose some of them at some point, but not just yet). So let’s pretend we’re trying to win the netflix challenge (I think some solutions actually do use this, but this example has the nice property of having absolutely nothing to do with how we use it in SONAR). We have movies, we have people, we have ratings of movies by people. Given two movies we want to determine how similar they are, based on the people who liked them.
Sound interesting now?
We can do this by considering the ratings of each movie as a random variable. A person is a sample of this rating space. So by considering the set of all people as a sample we can calculate the pearsons corellation over this sample (I know I only explained how it works for random variables, but it carries over in the obvious way for samples).
So we’ve got a setup: People, movies, ratings. We’re going to calculate the correlation between movies and store the results in the database (I’d like to use views for this, but MySQL does horrible bad things to view join performance).
Before I proceed, there are a few subtle points:
- Not everyone has rated every movie. I’m going to treat movies which have not been rated as implicitly having a rating of 0. This is sortof justified: Not watching a movie is often a stance about what sort of movie you like. Mostly it just makes the maths convenient.
- I’m going to ignore pairs of movies which have no rating in common. These have a corellation which is easy to work out (rho(x, y) = – mean(x) * mean(y) / (std_dev(x), std_dev(y))), and there tend to be a lot of them. This simplifies calculations and reduces storage overhead. Heavily anticorrelated movies also tend to be more because of holes in the data than interesting facts.
So let’s imagine our schema is set up as follows:
create table people( id int primary key ) create table movies( id int primary key ) create table ratings( person int references people (id), movie int references movies (id), rating float not null, unique key (person, movie) )
We start out by computing mean and standard deviations for movies. Normally we’d use the built in functions for this, but unfortunately because we’ve got all those implicit 0 ratings floating around we have to do it ourselves.
Here’s the table in which we’ll store per movie stats:
create table movie_statistics(
movie int primary key references movies(id),
mean float,
stddev float
)
We’ll only actually include stats for movies which have been rated at some point.
First we’ll calculate the mean:
insert into movie_statistics (movie, mean) select movie, sum(rating) / (select count(*) from people) mean from ratings group by movie
Then the standard deviation:
update movie_statistics
set stddev = (
select sqrt(
sum(rating * rating) / (select count(*) from people)
- mean * mean
) stddev
from ratings
where ratings.movie = movie_statistics.movie
group by ratings.movie)
All perfectly straightforward, if a little annoyingly wordy.
Now we get to the actual interesting bit: Calculating correlations. First we create a table in which to store the results:
create table movie_correlations(
movie1 int references movies(id),
movie2 int references movies(id),
rho float,
unique key (movie1, movie2),
key (movie1),
key (movie2)
)
Now we need to populate it. This will proceed in essentially three parts. First, we need to calculate E(XY), so for every pair of co-occurring movies we’ll get a sum of their products over all people who have rated them. That’s Sum(XY). We then divide by the number of people to get the average value of XY (implicit 0s again). Then we’ll use our already calculated statistics to give us the covariance (E(XY) – E(X) * E(Y)), and then the correlation (divide by standard deviations).Or, to put it altogether in one massive great big query:
insert into movie_correlations (
movie1,
movie2,
rho
)
select sf.m1,
sf.m2,
(sf.sum / (select count(*) from people)
- stats1.mean * stats2.mean
) -- covariance
/ (stats1.stddev * stats2.stddev)
from (
select r1.movie m1,
r2.movie m2,
sum(r1.rating * r2.rating) sum
from ratings r1
join ratings r2
on r1.person = r2.person
group by m1, m2
) sf
join movie_statistics stats1
on stats1.movie = sf.m1
join movie_statistics stats2
on stats2.movie = sf.m2
There, that wasn’t so bad was it? The inner query does the summing: By joining the ratings table to itself on the people we get all pairs of movies where we have a rating for both. We group by pairs of movies, thus summing over all the people who rated them. It’s then a simple matter to calculate the covariance and thus the correlation.
Or maybe I am crazy to look at a 23 line SQL query and say “that wasn’t so bad, was it?”. But at least I’m a crazy person whose code doesn’t run out of memory nearly as soon.
Here’s some code that ties all of this together: http://code.trampolinesystems.com/pearsons.rb
Posted in Uncategorized | 2 Comments »
Computing connected graph components via SQL
By david | Wednesday, November 19th, 2008
Hi, I don’t post to here much. I’m one of the devs working on SONAR, focusing on mostly theme extraction.As with many applications, SONAR’s data crunching is basically relational database driven. We keep thinking about experimenting with graph DB based approaches, but never manage to find quite a compelling enough reason – there’s no way we’d give up the relational approach entirely, so it needs to be a really big win to be worth the annoyance of having to maintain two different types of database in synch with eachother.
Unfortunately this sometimes leaves us in the unenviable position of having to do graph algorithms in SQL. This is about as much fun as you might imagine it to be. Most recent challenge: Computing the connected components of a graph in SQL.
There’s always the option of loading it into memory and doing it there of course. But the graph in question is rather large. With our little demo data sets it would be fine to do that, but any larger (e.g. on a real live sonar deployment) and this starts to sound like a really bad idea.
It turns out this is surprisingly simple to do once you have the key insight. I couldn’t find anything on the web explaining this though, so thought I’d write a post about it in case anyone else needs to do the same. It’s not rocket science, but hopefully this will save someone some time.
Consider the following setup:
create table if not exists items( id int primary key, component_id int ); create table if not exists links( first int references items (id), second int references items (id) );
We consider entries in links as undirected edges in a graph and we want to update items so that all items in the same component have the same component_id and distinct components have distinct component ids.
We’ll do this incrementally and merge. In order to do this we need a new table which we’ll use as scratch space (this should be a temporary table, but MySQL has irritating restrictions on temporary tables which make this not work):
create table if not exists components_to_merge( component1 int, component2 int);
(Side note: All SQL here is tested only on MySQL. It shouldn’t be hard to make it work on any other database though).
The idea is that at each step we’ll merge components, using components_to_merge to map components to the component they’ll be merged with.
So we start with a set of candidate components. That’s simple enough: We take each node as a potential starting component.
Now at each stage we merge components by finding links between them. For every potential component we look at all other potential components it’s connected to via some link. We insert all component-component links into components to merge. This is straightforward enough:
insert into components_to_merge
select distinct t1.component_id c1, t2.component_id c2
from links
join items t1
on links.first = t1.id
join items t2
on links.second = t2.id
where t1.component_id != t2.component_id
insert into components_to_merge
select component2, component1 from components_to_merge; -- ensure symmetricity
So, now we have a list of groups to merge in this table. If the table is empty then we’re done – all the groups are maximal (and because of the way we constructed them they’re connected – at each point they were built by joining together two connected sets which were the connected to eachother), so the component_ids currently in items describe the actual components. If not, we now reassign components:
update items
join (
select component1 source, min(component2) target
from components_to_merge
group by source
) new_components
on new_components.source = component_id
set items.component_id = least(items.component_id, target)
This step merges each component with another one (although “merging” conveys a slightly inaccurate sense of what happens. Consider a graph 1-2-3. The first step would result in 1 and 2 being assigned component_id 1, and 3 being assigned component_id 2. So the {3} component took the place of the {2} component).
What’s the complexity of this code? Well, it’s not amazing, but it’s not terrible either. The complexity of each step in the loop is probably somewhere around O(n log(n)) depending on exactly what the database does to it. The worst case number of queries is the size of the largest component: It’s obviously an upper bound as the number groups decreases by at least one each time; In order to see that it’s achieved, consider an extension of the 1-2-3 example where we have a graph 1-2-…-n. Then at each stage what happens is that we end up with components [1], [2], …, [n] -> [1, 2], [3], …, [n] -> [1, 2, 3], [4], …, [n], etc, taking n steps to terminate. On the other hand, a complete graph terminates in one step.
I suspect that the expected run time is O(log(n)), with each part of the component chosen approximately doubling in size each time, but I confess to not actually having bothered to run the maths: For our particular use case at the moment this is fine – it turns out the graph we’re considering is fairly sparse and tends to have small components, so for the moment this is more than fast enough. On the other hand, it would be nice to have a better guaranteed time, so if anyone has a smarter approach I’d love to hear it.
Anyway, here’s some sample code that ties all of this together: http://code.trampolinesystems.com/components.rb
Posted in Coding | 12 Comments »
BBC’s Free Thinking Festival 2008
By Charles Armstrong | Saturday, November 1st, 2008
Liverpool
As a part of the BBC’s Free Thinking Festival 2008, Trampoline CEO Charles Armstrong will be taking part in a debate on “What is the Value of Experience?”. Charles will be joined on the panel by Booker nominated novelist Andrew O’Hagan, brain scientist and writer Susan Blackmore and philosopher turned management consultant Robert Smith. The debate is taking place in Liverpool on 1st November and will be broadcast on BBC Radio 3 in the following weeks.
For further information, or to book tickets for the event please visit the event page, or the BBC’s Free Thinking Festival website.
Posted in Events | No Comments »








