Blog | Archive for the ‘Coding’ Category

Open source term extraction

By david | Monday, August 17th, 2009

This is just a quick announcement to let people know that we’ve open sourced our JRuby library for term extraction. You can get the code from my GitHub page.

Unlike a lot of term extraction libraries, this doesn’t take any stance as to the “significance” of the terms it extracts. It’s purely about looking at the syntax and determining where good boundaries for terms are. There are a couple reasons for this, but basically we’ve found that it’s more effective to separate the two steps and makes it easier to tinker around with them independently. The criteria for “interestingness” of terms seem to be largely distinct from those for terms which simply make sense linguistically. So we have a two stage pipeline, one which extracts semantically meaningful terms and one which determines what terms are actually interesting in the context of the document. The second step is much more complicated, and we’re not open sourcing that (yet? probably not any time soon, if ever. Even if we wanted to, it relies on a lot more global information across the document corpus and so is very tied in with how SONAR operates, making it much harder to isolate).

So, how does it work? Black magic and voodoo!

Actually, no. It’s pretty straightforward. It builds on top of the excellent OpenNLP library, using its tools for part of speech tagging, sentence splitting (a much harder problem than you’d imagine) and phrase chunking. On top of that it’s currently a rule-based system, because while you’re still figuring things out it makes much more sense to stick with something that is easy to fine-tune. Our expectation is that we’ll gradually replace bits of it with machine learning techniques as we start to hit the limitations of a rule-based system, but for now it’s working pretty well.
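To give a flavour of what a rule over part of speech tags can look like, here is a minimal sketch in plain Ruby. It is not the library's actual rule set: the tokens are hand-tagged (Penn Treebank style tags) instead of coming from OpenNLP, and the single rule, extract_terms and the constant names are all made up for illustration.

```ruby
# Hand-tagged tokens standing in for the output of a part of speech tagger.
# Tags are Penn Treebank style: JJ adjective, NN/NNS noun, IN preposition, VBP verb.
TAGGED = [
  ['good', 'JJ'], ['boundaries', 'NNS'], ['for', 'IN'],
  ['terms', 'NNS'], ['are', 'VBP'], ['hard', 'JJ']
].freeze

NOUN_TAGS = %w[NN NNS NNP NNPS].freeze
MODIFIER_TAGS = %w[JJ CD].freeze

# Hypothetical rule: a term is a run of modifiers and nouns that ends in a noun.
def extract_terms(tagged)
  terms = []
  run = []
  flush = lambda do
    # Drop trailing modifiers so that the run ends on a noun.
    run.pop while run.any? && !NOUN_TAGS.include?(run.last[1])
    terms << run.map(&:first).join(' ') if run.any?
    run = []
  end
  tagged.each do |word, tag|
    if NOUN_TAGS.include?(tag) || MODIFIER_TAGS.include?(tag)
      run << [word, tag]
    else
      flush.call # any other tag closes the current candidate term
    end
  end
  flush.call
  terms
end

puts extract_terms(TAGGED).inspect
```

Rules like this are easy to tweak one at a time, which is the main appeal of staying rule-based while the approach is still settling.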

Let’s have an example. If we feed the second paragraph of this post into the term extractor, we get the following terms back:

term extraction libraries
stance
terms
syntax
good boundaries
couple reasons
two steps
steps
criteria
interestingness
sense
two stage pipeline
stage pipeline
semantically meaningful terms
context
context of the document
document
second step
open sourcing
time
document corpus
SONAR

Hope you find this useful. Let us know if you build anything cool with it!


Open sourcing Pearson’s Correlation calculations

By david | Wednesday, April 22nd, 2009

As you might recall, I did some articles on calculating Pearson’s in SQL.

It turns out that this is a hilariously bad idea. The performance you get for it is terrible when the numbers get large. Switching to PostgreSQL seemed to help a bit here, but even then the numbers are not great (and we still aren’t planning on a port to PostgreSQL anyway). So we needed to find a better solution. Doing it in memory would be fast, but it would just fall over on a large dataset.

Anyway, after some tinkering around I came up with a slightly unholy solution. It’s a mix of bash, awk, standard Unix tools and Java (the Java parts may be rewritten in something else later). The design offloads much of the heavy lifting to sort, which sorts externally and so doesn’t need to hold the whole dataset in memory; everything else processes the data one line at a time. This keeps memory usage very reasonable and, in my fairly informal tests, makes it about 50 times faster than the SQL version.
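As an illustration of why a line-oriented pass needs so little memory, here is a minimal sketch in Ruby. It is not the released code (which is bash, awk and Java); pearson_from_lines and the "x y" input format are assumptions made for this example. It keeps only six running totals and applies the usual sums-based form of Pearson's coefficient.

```ruby
require 'stringio'

# Streams lines of the form "x y" and keeps only running totals,
# so memory use does not grow with the number of lines.
def pearson_from_lines(io)
  n = 0
  sx = sy = sxx = syy = sxy = 0.0
  io.each_line do |line|
    x, y = line.split.map { |v| Float(v) }
    n += 1
    sx += x
    sy += y
    sxx += x * x
    syy += y * y
    sxy += x * y
  end
  numerator = n * sxy - sx * sy
  denominator = Math.sqrt((n * sxx - sx**2) * (n * syy - sy**2))
  numerator / denominator
end

# A perfectly linear toy dataset; in a pipeline this would be $stdin.
input = StringIO.new("1 2\n2 4\n3 6\n4 8\n")
puts pearson_from_lines(input)
```

In a shell pipeline the same function could read from $stdin after sort has grouped the lines that belong together.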

We’re releasing the code under a BSD license and making it available on GitHub. It’s in a bit of a rough state at the moment, but is usable as is.


ActiveRecord-JDBC plugin for working with MySQL master-slave configurations

By mccraig | Friday, March 20th, 2009

Here’s a little plugin for ActiveRecord-JDBC which enables simple use of MySQL master-slave configurations:

active-record-jdbc-mysql-master-slave


Porting Pearsons to Postgres. Performance?

By david | Thursday, January 29th, 2009

I’ve uploaded a version of the Pearson’s Coefficient code which runs on PostgreSQL. You can download it here. I wrote this as an experiment to see if Postgres could help us with some of our MySQL performance woes.

Some brief experimentation suggests that once you fix PostgreSQL’s ridiculous default configuration the performance story is relatively happy. At small sizes MySQL is moderately faster, but as the sizes get large PostgreSQL seems to take the lead. I don’t have any sort of formal benchmark yet: This needs much more testing before I can definitively claim either is faster than the other, but for now the signs in favour of PostgreSQL are promising.


Has Many + non default primary key loads incorrect data in Rails 2.2.2

By emma | Thursday, January 29th, 2009

I found an interesting bug in Rails 2.2.2 yesterday. I couldn’t find a similar bug on the Rails Lighthouse, so I created a new ticket. What was most interesting, though, was how quickly the Rails core team picked up the bug and assigned it to someone.

It turns out that the bug had already been fixed in the current master branch of the Rails git repo, though apparently no one had noticed its existence, because I can’t find any references to it anywhere. I guess the fix in ActiveRecord, which is almost identical to my fix below, will form part of the next release, whenever that is.

I assume this is probably also the case for other has_* relationships, but have not verified.

I have a has_many association from class Foo to class Bar where, for this specific relationship, the primary key on Foo is not id, nor is the foreign key on Bar the default foo_id.

class Foo < ActiveRecord::Base
  has_many :bars, :primary_key => 'a_non_standard_key_name', :foreign_key => 'another_non_standard_key_name'
end

The relationship is one way, I have no need to navigate from Bar back to Foo, but only call a_foo.bars.

This works fine when working with a single object, but breaks down when you want to use eager association preloading to avoid the n+1 query problem of loading bars for many foos.

When performing the following, you find that the preloaded bars are not the ones you expect:

foos = Foo.find :all, :include => :bars
foos.first.bars  # => [SOMETHING_UNEXPECTED]

The reason is that ActiveRecord creates the preloading query based on the default primary key of Foo (normally id).

It queries for Bar.another_non_standard_key_name matching Foo.id, not Foo.a_non_standard_key_name.

This causes seriously unexpected behaviour, and could easily go unnoticed since no errors are thrown.
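To make the failure mode concrete, here is a small simulation in plain Ruby. It does not use ActiveRecord at all: the Structs, the sample data and the preload helper are hypothetical stand-ins that only mimic "fetch all bars once, then hand them out by key".

```ruby
# Plain Structs standing in for the two models; no ActiveRecord involved.
Foo = Struct.new(:id, :a_non_standard_key_name)
Bar = Struct.new(:name, :another_non_standard_key_name)

foos = [Foo.new(1, 20), Foo.new(2, 10)]
bars = [
  Bar.new('belongs to first foo', 20),
  Bar.new('belongs to second foo', 10),
  Bar.new('matches an id by accident', 1)
]

# Simplified stand-in for eager loading: group all bars by their foreign
# key, then give each foo the group whose key equals foo[key].
def preload(foos, bars, key)
  by_foreign_key = bars.group_by(&:another_non_standard_key_name)
  foos.to_h { |foo| [foo.id, by_foreign_key.fetch(foo[key], []).map(&:name)] }
end

buggy = preload(foos, bars, :id)                      # what 2.2.2 effectively does
fixed = preload(foos, bars, :a_non_standard_key_name) # what the association asks for

puts "matching on id:             #{buggy.inspect}"
puts "matching on the custom key: #{fixed.inspect}"
```

Note that the buggy variant raises nothing; it simply returns the wrong rows (or none), which is why this is easy to miss.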

I have found the hook in ActiveRecord where this functionality should be included and monkey patched it for my system, because I need it now. I can’t vouch for its correctness, but we have many, many specs for our product and none of them have broken because of this.

I’m running frozen rails 2.2.2

vendor/activerecord/lib/active_record/association_preload.rb, line 221

Change

primary_key_name = reflection.through_reflection_primary_key_name

to

primary_key_name = reflection.through_reflection_primary_key_name || reflection.options[:primary_key]

Hope this helps someone!


Yet another MySQL Fail

By david | Monday, January 26th, 2009

mysql> create table stuff (name varchar(32));
Query OK, 0 rows affected (0.24 sec)

mysql> insert into stuff values ('foo'), ('1'), ('0');
Query OK, 3 rows affected (0.00 sec)
Records: 3  Duplicates: 0  Warnings: 0

mysql> select * from stuff;
+------+
| name |
+------+
| foo  |
| 1    |
| 0    |
+------+
3 rows in set (0.00 sec)

mysql> delete from stuff where name = 0;
Query OK, 2 rows affected (0.09 sec)

mysql> select * from stuff;
+------+
| name |
+------+
| 1    |
+------+
1 row in set (0.00 sec)

mysql> WTF????
    -> ;
ERROR 1064 (42000): You have an error in your SQL syntax; check the manual that corresponds to your MySQL server version for the right syntax to use near 'WTF????' at line 1

So, what’s going on here? I said to delete everything where the name was 0, but it deleted the row ‘foo’ as well.

The following might help:

mysql> create table more_stuff(id int);
Query OK, 0 rows affected (0.19 sec)

mysql> insert into more_stuff values('foo');
Query OK, 1 row affected, 1 warning (0.00 sec)

mysql> select * from more_stuff;
+------+
| id   |
+------+
|    0 |
+------+
1 row in set (0.00 sec)

When you use a string where MySQL expects a number, it converts the string, and non-numeric strings become zero. So when you test name = 0, each name is converted to a number before the comparison, and ‘foo’ becomes 0. Consequently, any string which can’t be parsed as a number makes this test true.
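The effect can be imitated in a few lines of Ruby. This is only an analogy: String#to_f stands in for MySQL's implicit conversion (it also yields zero for strings that do not start with a number), and loose_equal? is a name invented for this sketch.

```ruby
# Rough emulation of comparing a string column with a numeric literal.
# Assumption: String#to_f is a stand-in for MySQL's implicit conversion.
def loose_equal?(string_value, number)
  string_value.to_f == number
end

rows = ['foo', '1', '0']

# Rows that "delete from stuff where name = 0" matches under this conversion.
deleted = rows.select { |name| loose_equal?(name, 0) }
remaining = rows - deleted

# Comparing string with string involves no conversion at all.
strict_match = rows.select { |name| name == '0' }

puts "loose match:  #{deleted.inspect}"
puts "remaining:    #{remaining.inspect}"
puts "strict match: #{strict_match.inspect}"
```

The practical takeaway is to quote the literal (where name = '0'), so that both sides of the comparison are strings.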

At this point I would rant about how mindbogglingly stupid this behaviour is, but I don’t think I can really be bothered.


JRuby + Clojure’s Immutable Data Structures = Easy to maintain, application data-model.

By Daniel Kwiecinski | Thursday, January 22nd, 2009

Implementing an application with a rich data model that can be updated by multiple UI controls and many concurrent threads, with undo/redo functionality on top, can be somewhat cumbersome. The functional programming paradigm, with its immutable data structures, turned out to be useful in easing this task.

Because all good developers are lazy, one should look to reuse rather than reinvent the required tools, especially when a good one already exists. I tried to follow that path. Since we are using JRuby as our language of choice here at Trampoline, I decided to look more closely at Clojure’s immutable data structures. Using Java classes from JRuby is straightforward and is already described in many places on the web (here, here & here). What I didn’t know was how to use Clojure’s objects from JRuby. It turns out that Clojure’s data structures are delivered as pre-compiled Java classes, so no runtime interpretation or compilation of Clojure scripts is needed. The task turned out to be very easy.

A simple implementation of a graph data structure with no deletion functionality looks like this:

[image: basic_graph1]

To make Clojure collections look more like Ruby ones, one can define aliases for their methods:

[image: persistent_map]

Unfortunately (or fortunately, given the different contract) we cannot do this with all the methods, particularly the mutating ones. That’s because in Ruby an assignment expression always evaluates to the value being assigned, and the same applies to the []= method. So even if we redefine []=(key, val) to return the updated version of the collection, the Ruby interpreter steps in and the expression coll[key] = val still evaluates to val. Whether this is good or bad is a topic for a whole other post.
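The behaviour is easy to reproduce in plain Ruby, without Clojure on the classpath. PersistentBox below is a hypothetical stand-in for an immutable collection wrapper, and assoc is simply a regular method name chosen for this sketch.

```ruby
# A hypothetical immutable wrapper; not part of Clojure or JRuby.
class PersistentBox
  attr_reader :data

  def initialize(data = {})
    @data = data.freeze
  end

  # Returns a NEW box, but Ruby discards this return value whenever
  # the method is invoked through assignment syntax.
  def []=(key, val)
    PersistentBox.new(@data.merge(key => val))
  end

  # A regular method name keeps its return value.
  def assoc(key, val)
    PersistentBox.new(@data.merge(key => val))
  end
end

box = PersistentBox.new
result_of_assignment = (box[:a] = 1)     # the expression evaluates to 1
result_of_send = box.send(:[]=, :a, 1)   # an explicit send returns the new box
updated = box.assoc(:a, 1)

puts result_of_assignment.inspect
puts result_of_send.class
puts updated.data.inspect
puts box.data.inspect                    # the original box is untouched
```

So for persistent collections the update has to go through an ordinarily named method, since the value produced by coll[key] = val can never be the new collection.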


mysql cast to floating point

By mccraig | Wednesday, January 14th, 2009

discovered another mysql trick

if you are experiencing underflow with mysql fixed point arithmetic, you may need to force a floating point evaluation. cast() does not support cast to floating point so multiply by 1e0 instead

e.g.

select 1000000000000000000000000000000 * ( 1 / ( 1000000000000000000000000000000 ));

returns the wrong answer, whereas :

select 1000000000000000000000000000000 * ( 1 / ( 1e0 * 1000000000000000000000000000000 ));

is fine and dandy
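The difference between the two queries can be imitated in Ruby. This is a sketch under an assumption: the fixed-point quotient is cut to four decimal places (which, as far as I know, corresponds to the default div_precision_increment), emulated here with BigDecimal#round. The variable names are made up for the example.

```ruby
require 'bigdecimal'

BIG = BigDecimal('1000000000000000000000000000000')

# Fixed-point style: the quotient is cut to four decimal places
# (an assumed scale), so 1e-30 collapses to zero before the multiply.
fixed_quotient = (BigDecimal('1') / BIG).round(4)
fixed_result = BIG * fixed_quotient

# Floating-point style: multiplying by 1e0 turns the operand into a Float,
# so the tiny quotient survives as an approximation.
float_result = 1e30 * (1 / (1e0 * 1e30))

puts "fixed point:    #{fixed_result.to_s('F')}"
puts "floating point: #{float_result}"
```

The floating point answer is only approximate, but it is at least the right order of magnitude, whereas the fixed point one has lost the value entirely.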


Flow of control in Debian maintainer scripts

By jan | Wednesday, January 7th, 2009

The Debian package installation process (as described in the Debian policy) is fairly complicated, at least internally. While building a Debian package for our software I often had to check the policy manual for the order in which the various maintainer scripts (e.g. postinst, prerm etc.) are called. To complicate things further, both the old and new scripts get called (at least during an upgrade). There are various “error-unwinds” (= rollbacks) and final error states.

It struck me that a visual representation would make things a lot easier, so here’s one I knocked together in OmniGraffle, for future reference:

maintainer scripts call sequence



fixtures are evil, but so is mysql

By jan | Tuesday, December 9th, 2008

MySQL is not getting much love at the office, and today was another of those days.

A little bit of background: we were in the process of replacing our fixture-based Rails specs with “rspec scenarios”, a small extension we wrote for Rails/RSpec (to be released soon). The idea is that you create a scenario programmatically rather than have static, hard to change fixtures in YAML. Each spec is run inside a transaction which gets rolled back, in the same way Rails handles this.

One particular spec was leaving the database in an inconsistent state, i.e. a transaction got committed. Debugging this problem took more time than I’m willing to admit here, but some of it was spent making the process a bit easier, using a logfilter for MySQL:

It expects mysqld.log from stdin and will print logging output from separate transactions in different colours, as well as highlight the transaction demarcation points. However, everything looked ok in there, the transaction was properly demarcated + rolled back but still ended up being committed!

It turns out that some SQL statements perform an implicit, silent commit, effectively ignoring the boundaries you defined. In our case TRUNCATE TABLE was the culprit. The right behaviour here would seem to be either to roll back the current transaction or at least to produce some informational logging as to why the transaction got committed. The default behaviour just seems to be completely wrong and unintuitive (and cannot be disabled).


