Trampoline Systems

* Trampoline Description Here

Trampoline Systems

* Trampoline Description Here


Content

Machines

Ideas, thoughts and observations from Trampoline's technical brains

Archive for the ‘Uncategorized’ Category

jan

type discussion on irc (my eyez)

By Jan Berkel on February 5th, 2009

A java programmer, a scala dev and a ruby guy meet on #irc. Says the java guy to the… wait, this is not the beginning of a joke.

jan: Map<String,Map<String,String>>  argggghh
thepete: mmm, readable code, yummy
David: Map<String, Map<String, String>> is perfectly reasonable. :)
       The real problem is that Java makes you declare it twice.
jan: Map<String, Map<String, Object>> m =
       new HashMap<String, Map<String, Object>>();
mccraig: stop that !
mccraig: my eyez
David: val m = new HashMap[String, Map[String, Object]]
thepete: ze goggles;
David: or val m = ne wHashMap[(String, String), Object] if you prefer. :)
jan: what type will m have? Map?
David: val m : Map[(String, String), Object] = new HashMap
mccraig: m = {}
David: m["1"] = "stuff"; m[1] # Why is this returning nil??? :(
mccraig: yeah i know, but whatever
jan: MapWithIndifferentAccess
mccraig: my eyez hurt less
David: my eyez weep
mccraig: u can get eyedrops for that
David: You can get gogglez. :-)
jan: can i blog this? :)

daniel

JRuby + Clojure’s Immutable Data Structures = Easy to maintain, application data-model.

By Daniel Kwiecinski on January 22nd, 2009

Implementing an application with rich data-model which can be updated by multiple UI controls, many concurrent threads with undo/redo functionality may be somewhat cumbersome. In order to ease this task, the functional programming paradigm with the immutable data structures turned out to be useful.

Because all good developers are lazy, one should seek for reuse rather than reinventing required tools, especially when there is good existing one. I tried to follow that path. Since we are using JRuby as our language of choice here at Trampoline, I decided to look more closely at clojure’s immutable data structures. It is straightforward to use Java classes from JRuby which is described in many places on the web already (here, here & here). The unknown to me was how can I use clojure’s objects from Jruby. Apparently clojure data structures are delivered as pre-compiled java classes and no runtime interpretation/compilation of clojure scripts is needed. The task turned out to be very easy.

The simple implementation of graph data structure with no deletion functionality looks as simple as:

basic_graph1

 
 

In order to have Clojure collections look more like Ruby ones one can define aliases for their methods:

persistent_map

 

Unfortunately (or fortunately due to different contract) we can not do it with all the methods. Particularly with mutating ones. That’s because Ruby’s = (assign operator) semantics is to return the value being assign. It is analogous to []= method as well. So even if we redefine the []=(key, val) method so that the method returns the updated version of the collection, the Ruby interpreter will step into the scene and wrap the whole method, so that it eventually returns val. Anyway, whether this is good or bad is the topic for a whole other post.

david

Calculating Pearson’s correlation coefficient in SQL

By David MacIver on November 23rd, 2008

Apparently I’m crazy.

Well, it’s not the first time this accusation has been levelled. But it’s one of the more amusing ones.

My colleague Jan Berkel was at Ruby Manor, where one of the talks was on acts_as_recommendable. Here’s what he had to say on the subject:

a plugin for rails to generate recommendations. lets you do things like user.similar_users / books.similar_books etc. it’s based on pearson correlation. the guy mentioned he ran into memory problems using large datasets for doing the computation (load all objects , calculate, store objects). i asked whether they tried performing the calculation on SQL level to avoid runtrips to the db (the way david implemented it). he first didn’t understand what i was saying, then someone from the audience said “you’d have to be crazy to do that”…  i guess this elegant solution is not the ruby way.. anyway.

This entertained me, as one of the things I introduced recently to our theme extraction process is using pearson’s coefficient to rate similarity of themes. It never even occurred to me that loading it all to the client side doing it in Ruby and then saving it back might be considered an option.

Pearsons is very far from being the most powerful metric in the world, but remarkably useful for all that: It gives a decent scaled measure of similarity and is cheap and easy to calculate.

Ok. Well. Easy to calculate. And cheap to calculate if you don’t do clever things like loading the entire data set into memory. Anyway…

All joking aside, it wasn’t entirely trivial to do, and clearly this is something which is of interest to more people than us, so this post is an explanation of how it works.

The Pearson product-moment corellation coefficient of two distributions is a measurement of their similarity. Given random variables X, Y we define the covariance of the two to be:

Cov(X, Y) = E(XY) - E(X) E(Y)

The reason for this slightly strange looking expression is that random variables have the nice property that for two independent random variables E(XY) = E(X) E(Y). So if X and Y are independent then Cov(X, Y) = E(XY) - E(X) E(Y) = E(X) E(Y) - E(X) E(Y) = 0. The converse isn’t generally true (it’s true under some circumstances), but nevertheless we can use Cov(X, Y) as a cheap measure of degree of independents.

It has some other nice properties: Suppose Y = aX. Then Cov(X, Y) = a Var(X). So if Y is a positive multiple of X then the covariance is positive, if Y is a negative multiple of X the covariance is negative.

This also shows us that there’s no inherent significance to the size of the covariance. It depends on the variance of both X and Y - in fact, scaling either will increase the covariance proportionately.

However it does have the following nice property:

|Cov(X, Y)| <= StdDev(X) StdDev(Y)

(When one is a multiple of the other, equality is achieved).

This allows us to define the pearson’s coefficient (normally written rho):

rho(X, Y) = Cov(X, Y) / (StdDev(X) * StdDev(Y))

Which by the above property takes values between -1 and 1 (it will take value 1 when one is a positive multiple of the other and value -1 when one is a negative multiple of the other).

That’s all there is to it. It’s a nice scaled measure of how similar the two distributions are.

So, why do we care and how do we use this?

Well, the details of how we actually use this are cunning and proprietary. So I can’t really tell you those (I’m hoping that we may be able to disclose some of them at some point, but not just yet). So let’s pretend we’re trying to win the netflix challenge (I think some solutions actually do use this, but this example has the nice property of having absolutely nothing to do with how we use it in SONAR). We have movies, we have people, we have ratings of movies by people. Given two movies we want to determine how similar they are, based on the people who liked them.

Sound interesting now?

We can do this by considering the ratings of each movie as a random variable. A person is a sample of this rating space. So by considering the set of all people as a sample we can calculate the pearsons corellation over this sample (I know I only explained how it works for random variables, but it carries over in the obvious way for samples).

So we’ve got a setup: People, movies, ratings. We’re going to calculate the correlation between movies and store the results in the database (I’d like to use views for this, but MySQL does horrible bad things to view join performance).

Before I proceed, there are a few subtle points:

  • Not everyone has rated every movie. I’m going to treat movies which have not been rated as implicitly having a rating of 0. This is sortof justified: Not watching a movie is often a stance about what sort of movie you like. Mostly it just makes the maths convenient. :-)
  • I’m going to ignore pairs of movies which have no rating in common. These have a corellation which is easy to work out (rho(x, y) = - mean(x) * mean(y) / (std_dev(x), std_dev(y))), and there tend to be a lot of them. This simplifies calculations and reduces storage overhead. Heavily anticorrelated movies also tend to be more because of holes in the data than interesting facts.

So let’s imagine our schema is set up as follows:

create table people(
  id int primary key
)

create table movies(
  id int primary key
)

create table ratings(
  person int references people (id),
  movie int references movies (id),
  rating float not null,
  unique key (person, movie)
)

We start out by computing mean and standard deviations for movies. Normally we’d use the built in functions for this, but unfortunately because we’ve got all those implicit 0 ratings floating around we have to do it ourselves.

Here’s the table in which we’ll store per movie stats:

  create table movie_statistics(
    movie int primary key references movies(id),
    mean float,
    stddev float
  )

We’ll only actually include stats for movies which have been rated at some point.

First we’ll calculate the mean:

  insert into movie_statistics (movie, mean)
  select movie, sum(rating) / (select count(*) from people) mean
  from ratings
  group by movie

Then the standard deviation:

  update movie_statistics
  set stddev = (
    select sqrt(
            sum(rating * rating) / (select count(*) from people)
            - mean * mean
           ) stddev
    from ratings
    where ratings.movie = movie_statistics.movie
    group by ratings.movie)

All perfectly straightforward, if a little annoyingly wordy.

Now we get to the actual interesting bit: Calculating correlations. First we create a table in which to store the results:

create table movie_correlations(

movie1 int references movies(id),

movie2 int references movies(id),
rho float,
unique key (movie1, movie2),
key (movie1),
key (movie2)
)
Now we need to populate it. This will proceed in essentially three parts. First, we need to calculate E(XY), so for every pair of co-occurring movies we’ll get a sum of their products over all people who have rated them. That’s Sum(XY). We then divide by the number of people to get the average value of XY (implicit 0s again). Then we’ll use our already calculated statistics to give us the covariance (E(XY) - E(X) * E(Y)), and then the correlation (divide by standard deviations).Or, to put it altogether in one massive great big query:

  insert into movie_correlations (
    movie1,
    movie2,
    rho
  )
  select sf.m1,
         sf.m2,
         (sf.sum / (select count(*) from people)
          - stats1.mean * stats2.mean
         ) -- covariance
         / (stats1.stddev * stats2.stddev)
  from (
    select r1.movie m1,
           r2.movie m2,
           sum(r1.rating * r2.rating) sum
    from ratings r1
    join ratings r2
    on r1.person = r2.person
    group by m1, m2
  ) sf
  join movie_statistics stats1
  on stats1.movie = sf.m1
  join movie_statistics stats2
  on stats2.movie = sf.m2

There, that wasn’t so bad was it? The inner query does the summing: By joining the ratings table to itself on the people we get all pairs of movies where we have a rating for both. We group by pairs of movies, thus summing over all the people who rated them. It’s then a simple matter to calculate the covariance and thus the correlation.

Or maybe I am crazy to look at a 23 line SQL query and say “that wasn’t so bad, was it?”. But at least I’m a crazy person whose code doesn’t run out of memory nearly as soon. :-)

Here’s some code that ties all of this together: http://code.trampolinesystems.com/pearsons.rb

jon

Produce a members report for all your Mailman lists

By Jon Cowie on October 14th, 2008

I recently had cause to produce a report on the membership of all our Mailman mailing lists, so rather than doing it manually I knocked together the following handy bash script…change mailman location and output file as desired :)

OUTPUTFILE="/tmp/mailman_report"
CURRMONTH=`date +%m-%Y`
LISTS=`/usr/local/mailman/bin/list_lists  | awk '{print $1}' | grep -v [!0-9]`
rm ${OUTPUTFILE}
echo "Mailman Report for ${CURRMONTH}" > ${OUTPUTFILE}
echo >> ${OUTPUTFILE}
for x in ${LISTS}
do
        echo "Members of List ${x}:" >> ${OUTPUTFILE}
        LIST_MEMBERS=`/usr/local/mailman/bin/list_members ${x}`
        for mems in ${LIST_MEMBERS}
        do
                echo ${mems} >> ${OUTPUTFILE}
        done
echo >> ${OUTPUTFILE}
done
/bin/mail -s "Mailman_Report_for_${CURRMONTH}" foo@foo.com -c blah@blah.com < ${OUTPUTFILE}