Blog | Archive for November, 2008

rails 2.2 + jruby + jetty = win

By jan | Thursday, November 27th, 2008

In case you missed it, rails 2.2 recently got released, finally promising thread safety among some other things. Thread safety has always been neglected by the rails core team, the standard way to scale up in rails (pre 2.2) is to run multiple processes, which makes deployment a lot harder (I think there’re at least 10 different ways to deploy rails apps at the moment, and people still come up with new solutions: apache+fcgi, mongrel, mongrel_cluster, thin, phusion, rack…).

Why has thread safety become a priority all of a sudden? I suspect one of the drivers is JRuby, which is now a viable alternative to MRI Ruby, and which also has the nice property of mapping Ruby threads to native threads. Another factor might be the arrival of merb, the new kid on the ‘ruby web framework’ block. Merb has been designed with thread safety in mind, and is now starting to get a lot of attention (1.0 has just been released).

Now, with a thread safe rails JRuby might become the platform of choice for deploying rails apps, especially given the performance progress the JRuby team is making. Having real threads does make a huge difference, reducing the memory footprint and making better use of multi core cpus.

There’re a couple of possibilites to deploy a rails application in JRuby, glassfish seems to be the recommended choice at the moment. However glassfish is anything but easily embeddable so I tried jetty as an option. Compared to glassfish, jetty is solid and proven (version 7 will be released soon), small and easily embeddable.

I didn’t want to use warbler (no web.xml please!), instead I used a combination of JRuby-Rack + Jetty7 and tied everything together with a simple JRuby script.

 server = org.mortbay.jetty.Server.new
 thread_pool = org.mortbay.thread.QueuedThreadPool.new
 thread_pool.min_threads  = 5  # adjust as needed
 thread_pool.max_threads  = 50
 server.set_thread_pool(thread_pool)
 connector = org.mortbay.jetty.nio.SelectChannelConnector.new
 connector.port = 3000

 context = org.mortbay.jetty.servlet.Context.new(nil, "/",
    org.mortbay.jetty.servlet.Context::NO_SESSIONS)
 context.add_filter("org.jruby.rack.RackFilter", "/*",
    org.mortbay.jetty.Handler::DEFAULT)
 context.set_resource_base(RAILS_DIR)
 context.add_event_listener(org.jruby.rack.rails.RailsServletContextListener.new)
 context.set_init_params(java.util.HashMap.new(
    'rails.root'=> '.', 'public.root' => 'public',
    'org.mortbay.jetty.servlet.Default.relativeResourceBase' => '/public',
    'jruby.max.runtimes' => '1'))
 context.add_servlet(org.mortbay.jetty.servlet.ServletHolder.new(
     org.mortbay.jetty.servlet.DefaultServlet.new), "/")

 server.set_handler(context)
 server.start

This will run jetty on port 3000, dispatching all requests for dynamic content to a single JRuby instance. It is important to set “‘jruby.max.runtimes” to 1, so it’ll create a shared application runtime for you, otherwise you’ll get the old one runtime per thread model.

On the rails side you need “config.threadsafe!” in the configuration file. Autoloading of classes will then be disabled, be sure to load all your dependencies upfront in environment.rb. We haven’t actually used this in production, but some initial tests look very promising (mongrel: 23req/s, jetty: 50 req/s). Also, deployment will be a lot easier, because static and dynamic content can be served by one single process.


Calculating Pearson's correlation coefficient in SQL

By david | Sunday, November 23rd, 2008

Apparently I’m crazy.

Well, it’s not the first time this accusation has been levelled. But it’s one of the more amusing ones.

My colleague Jan Berkel was at Ruby Manor, where one of the talks was on acts_as_recommendable. Here’s what he had to say on the subject:

a plugin for rails to generate recommendations. lets you do things like user.similar_users / books.similar_books etc. it’s based on pearson correlation. the guy mentioned he ran into memory problems using large datasets for doing the computation (load all objects , calculate, store objects). i asked whether they tried performing the calculation on SQL level to avoid runtrips to the db (the way david implemented it). he first didn’t understand what i was saying, then someone from the audience said “you’d have to be crazy to do that”… i guess this elegant solution is not the ruby way.. anyway.

This entertained me, as one of the things I introduced recently to our theme extraction process is using pearson’s coefficient to rate similarity of themes. It never even occurred to me that loading it all to the client side doing it in Ruby and then saving it back might be considered an option.

Pearsons is very far from being the most powerful metric in the world, but remarkably useful for all that: It gives a decent scaled measure of similarity and is cheap and easy to calculate.

Ok. Well. Easy to calculate. And cheap to calculate if you don’t do clever things like loading the entire data set into memory. Anyway…

All joking aside, it wasn’t entirely trivial to do, and clearly this is something which is of interest to more people than us, so this post is an explanation of how it works.

The Pearson product-moment corellation coefficient of two distributions is a measurement of their similarity. Given random variables X, Y we define the covariance of the two to be:

Cov(X, Y) = E(XY) – E(X) E(Y)

The reason for this slightly strange looking expression is that random variables have the nice property that for two independent random variables E(XY) = E(X) E(Y). So if X and Y are independent then Cov(X, Y) = E(XY) – E(X) E(Y) = E(X) E(Y) – E(X) E(Y) = 0. The converse isn’t generally true (it’s true under some circumstances), but nevertheless we can use Cov(X, Y) as a cheap measure of degree of independents.

It has some other nice properties: Suppose Y = aX. Then Cov(X, Y) = a Var(X). So if Y is a positive multiple of X then the covariance is positive, if Y is a negative multiple of X the covariance is negative.

This also shows us that there’s no inherent significance to the size of the covariance. It depends on the variance of both X and Y – in fact, scaling either will increase the covariance proportionately.

However it does have the following nice property:

|Cov(X, Y)| <= StdDev(X) StdDev(Y)

(When one is a multiple of the other, equality is achieved).

This allows us to define the pearson’s coefficient (normally written rho):

rho(X, Y) = Cov(X, Y) / (StdDev(X) * StdDev(Y))

Which by the above property takes values between -1 and 1 (it will take value 1 when one is a positive multiple of the other and value -1 when one is a negative multiple of the other).

That’s all there is to it. It’s a nice scaled measure of how similar the two distributions are.

So, why do we care and how do we use this?

Well, the details of how we actually use this are cunning and proprietary. So I can’t really tell you those (I’m hoping that we may be able to disclose some of them at some point, but not just yet). So let’s pretend we’re trying to win the netflix challenge (I think some solutions actually do use this, but this example has the nice property of having absolutely nothing to do with how we use it in SONAR). We have movies, we have people, we have ratings of movies by people. Given two movies we want to determine how similar they are, based on the people who liked them.

Sound interesting now?

We can do this by considering the ratings of each movie as a random variable. A person is a sample of this rating space. So by considering the set of all people as a sample we can calculate the pearsons corellation over this sample (I know I only explained how it works for random variables, but it carries over in the obvious way for samples).

So we’ve got a setup: People, movies, ratings. We’re going to calculate the correlation between movies and store the results in the database (I’d like to use views for this, but MySQL does horrible bad things to view join performance).

Before I proceed, there are a few subtle points:

  • Not everyone has rated every movie. I’m going to treat movies which have not been rated as implicitly having a rating of 0. This is sortof justified: Not watching a movie is often a stance about what sort of movie you like. Mostly it just makes the maths convenient. :-)
  • I’m going to ignore pairs of movies which have no rating in common. These have a corellation which is easy to work out (rho(x, y) = – mean(x) * mean(y) / (std_dev(x), std_dev(y))), and there tend to be a lot of them. This simplifies calculations and reduces storage overhead. Heavily anticorrelated movies also tend to be more because of holes in the data than interesting facts.

So let’s imagine our schema is set up as follows:

create table people(
  id int primary key
)

create table movies(
  id int primary key
)

create table ratings(
  person int references people (id),
  movie int references movies (id),
  rating float not null,
  unique key (person, movie)
)

We start out by computing mean and standard deviations for movies. Normally we’d use the built in functions for this, but unfortunately because we’ve got all those implicit 0 ratings floating around we have to do it ourselves.

Here’s the table in which we’ll store per movie stats:

  create table movie_statistics(
    movie int primary key references movies(id),
    mean float,
    stddev float
  )

We’ll only actually include stats for movies which have been rated at some point.

First we’ll calculate the mean:

  insert into movie_statistics (movie, mean)
  select movie, sum(rating) / (select count(*) from people) mean
  from ratings
  group by movie

Then the standard deviation:

  update movie_statistics
  set stddev = (
    select sqrt(
            sum(rating * rating) / (select count(*) from people)
            - mean * mean
           ) stddev
    from ratings
    where ratings.movie = movie_statistics.movie
    group by ratings.movie)

All perfectly straightforward, if a little annoyingly wordy.

Now we get to the actual interesting bit: Calculating correlations. First we create a table in which to store the results:

create table movie_correlations(

movie1 int references movies(id),

movie2 int references movies(id),
rho float,
unique key (movie1, movie2),
key (movie1),
key (movie2)
)
Now we need to populate it. This will proceed in essentially three parts. First, we need to calculate E(XY), so for every pair of co-occurring movies we’ll get a sum of their products over all people who have rated them. That’s Sum(XY). We then divide by the number of people to get the average value of XY (implicit 0s again). Then we’ll use our already calculated statistics to give us the covariance (E(XY) – E(X) * E(Y)), and then the correlation (divide by standard deviations).Or, to put it altogether in one massive great big query:

  insert into movie_correlations (
    movie1,
    movie2,
    rho
  )
  select sf.m1,
         sf.m2,
         (sf.sum / (select count(*) from people)
          - stats1.mean * stats2.mean
         ) -- covariance
         / (stats1.stddev * stats2.stddev)
  from (
    select r1.movie m1,
           r2.movie m2,
           sum(r1.rating * r2.rating) sum
    from ratings r1
    join ratings r2
    on r1.person = r2.person
    group by m1, m2
  ) sf
  join movie_statistics stats1
  on stats1.movie = sf.m1
  join movie_statistics stats2
  on stats2.movie = sf.m2

There, that wasn’t so bad was it? The inner query does the summing: By joining the ratings table to itself on the people we get all pairs of movies where we have a rating for both. We group by pairs of movies, thus summing over all the people who rated them. It’s then a simple matter to calculate the covariance and thus the correlation.

Or maybe I am crazy to look at a 23 line SQL query and say “that wasn’t so bad, was it?”. But at least I’m a crazy person whose code doesn’t run out of memory nearly as soon. :-)

Here’s some code that ties all of this together: http://code.trampolinesystems.com/pearsons.rb


Computing connected graph components via SQL

By david | Wednesday, November 19th, 2008

Hi, I don’t post to here much. I’m one of the devs working on SONAR, focusing on mostly theme extraction.

As with many applications, SONAR’s data crunching is basically relational database driven. We keep thinking about experimenting with graph DB based approaches, but never manage to find quite a compelling enough reason – there’s no way we’d give up the relational approach entirely, so it needs to be a really big win to be worth the annoyance of having to maintain two different types of database in synch with eachother.

Unfortunately this sometimes leaves us in the unenviable position of having to do graph algorithms in SQL. This is about as much fun as you might imagine it to be. Most recent challenge: Computing the connected components of a graph in SQL.

There’s always the option of loading it into memory and doing it there of course. But the graph in question is rather large. With our little demo data sets it would be fine to do that, but any larger (e.g. on a real live sonar deployment) and this starts to sound like a really bad idea.

It turns out this is surprisingly simple to do once you have the key insight. I couldn’t find anything on the web explaining this though, so thought I’d write a post about it in case anyone else needs to do the same. It’s not rocket science, but hopefully this will save someone some time.

Consider the following setup:

create table if not exists items(
  id int primary key,
  component_id int
);

create table if not exists links(
  first int references items (id),
  second int references items (id)
);

We consider entries in links as undirected edges in a graph and we want to update items so that all items in the same component have the same component_id and distinct components have distinct component ids.

We’ll do this incrementally and merge. In order to do this we need a new table which we’ll use as scratch space (this should be a temporary table, but MySQL has irritating restrictions on temporary tables which make this not work):

create table if not exists components_to_merge(
  component1 int,
  component2 int);

(Side note: All SQL here is tested only on MySQL. It shouldn’t be hard to make it work on any other database though).

The idea is that at each step we’ll merge components, using components_to_merge to map components to the component they’ll be merged with.

So we start with a set of candidate components. That’s simple enough: We take each node as a potential starting component.

Now at each stage we merge components by finding links between them. For every potential component we look at all other potential components it’s connected to via some link. We insert all component-component links into components to merge. This is straightforward enough:

insert into components_to_merge
            select distinct t1.component_id c1, t2.component_id c2
            from links
            join items t1
            on links.first = t1.id
            join items t2
            on links.second = t2.id
            where t1.component_id != t2.component_id

insert into components_to_merge
select component2, component1 from components_to_merge; -- ensure symmetricity

So, now we have a list of groups to merge in this table. If the table is empty then we’re done – all the groups are maximal (and because of the way we constructed them they’re connected – at each point they were built by joining together two connected sets which were the connected to eachother), so the component_ids currently in items describe the actual components. If not, we now reassign components:

    update items
    join (
      select component1 source, min(component2) target
      from components_to_merge
      group by source
    ) new_components
    on new_components.source = component_id
    set items.component_id = least(items.component_id, target)

This step merges each component with another one (although “merging” conveys a slightly inaccurate sense of what happens. Consider a graph 1-2-3. The first step would result in 1 and 2 being assigned component_id 1, and 3 being assigned component_id 2. So the {3} component took the place of the {2} component).

What’s the complexity of this code? Well, it’s not amazing, but it’s not terrible either. The complexity of each step in the loop is probably somewhere around O(n log(n)) depending on exactly what the database does to it. The worst case number of queries is the size of the largest component: It’s obviously an upper bound as the number groups decreases by at least one each time; In order to see that it’s achieved, consider an extension of the 1-2-3 example where we have a graph 1-2-…-n. Then at each stage what happens is that we end up with components [1], [2], …, [n] -> [1, 2], [3], …, [n] -> [1, 2, 3], [4], …, [n], etc, taking n steps to terminate. On the other hand, a complete graph terminates in one step.

I suspect that the expected run time is O(log(n)), with each part of the component chosen approximately doubling in size each time, but I confess to not actually having bothered to run the maths: For our particular use case at the moment this is fine – it turns out the graph we’re considering is fairly sparse and tends to have small components, so for the moment this is more than fast enough. On the other hand, it would be nice to have a better guaranteed time, so if anyone has a smarter approach I’d love to hear it.

Anyway, here’s some sample code that ties all of this together: http://code.trampolinesystems.com/components.rb


Community vs. Large Scale Barcamps

By emma | Monday, November 3rd, 2008

Before I even consider this, I think my title is ineffectual in actually grasping the differences I want to point out. This comes from a distorted sense of scale we have when it comes to barcamps. What makes a large scale camp large scale? 50 people, 100, 1000? And what makes a community camp community based? The people, the location, the feeling?

Some camps which I would consider very much community based, such as barcampcork are actually quite large (over 100 people), whereas others are much smaller (such as barcampnortheast with around 30 attendees), and yet both maintain a strong community feeling. On the other hand, barcamplondon 3 and 4 (neither of which were attended by significantly more than 100 people), didn’t come close to exuding that sense of community. How ironic that I live in London, and that’s where my local community is. I’m not knocking the London camps for not having that community feeling, I’m not even sure that it is something that organisers can even control, but is drawn out from the people that attend. And in reality, the types of people drawn to barcamplondon and quite different from the people drawn to barcampcork. Almost every attendee at Cork, with the notable exception of myself, has a vested interest in the Cork and wider Ireland geek community, whereas the London attendees either seemed somewhat blasé about the whole barcamp thing (are we that spoilt over here…?), or were in town for the barcamp.

We can also consider super-sized camps such as barcampberlin which I attended a few weeks ago, which was due to be attended by 600 people, and drew somewhere around the 450 mark (my guesstimate). It’s actually quite hard for me to make a comparison with berlin because of the nature of my attendance at the event. I travelled to berlin with a few friends, and very quickly we started hanging out with other Londoners at the event. I don’t believe I spoke to more than 5 local, or even Germany based, attendees throughout the entire event. This came down to the fact that I was very comfortable knowing my 10 friends, and knowing all of their friends, who happened, mostly, to be my friends too, and there was no need to break out from our group. Additionally none of the sessions were in any way discussion group oriented, and so there was no permission to engage with total strangers about mutually interesting topics. This makes it very hard to meet people.

Maybe it all comes down to my social networks at these various events, but I do have a strong suspicion, that this is influenced by the style of event, the type of people attracted to it, and the pre existing connections you go with. Could Berlin have been made as warm and welcoming at Cork? Probably not, but simply because the logistics of making an event that large work on a small community scale are virtually impossible. What about London? I think engaging the community in organisation would have been a big win. There was a feeling of isolation between the organisation of the camp and it’s attendees, which somewhat undermines some of the core principles. Ultimately I think any barcamp has the potential to feel “just right,” but the parameters for this kind of success don’t always lend themselves to “popular” destination such as Berlin and London.

All that said, I’d love to organise a community based barcamplondon. Who want’s to join me?


BBC’s Free Thinking Festival 2008

By Charles Armstrong | Saturday, November 1st, 2008

Liverpool
Free Thinking 08As a part of the BBC’s Free Thinking Festival 2008, Trampoline CEO Charles Armstrong will be taking part in a debate on “What is the Value of Experience?”. Charles will be joined on the panel by Booker nominated novelist Andrew O’Hagan, brain scientist and writer Susan Blackmore and philosopher turned management consultant Robert Smith. The debate is taking place in Liverpool on 1st November and will be broadcast on BBC Radio 3 in the following weeks.

For further information, or to book tickets for the event please visit the event page, or the BBC’s Free Thinking Festival website.



New York Office
234 5th Avenue, 4th Floor
New York, NY
10001, USA

London Office
The Trampery
8-15 Dereham Place
London EC2A 3HJ
United Kingdom