Trampoline Systems

Machines

Ideas, thoughts and observations from Trampoline's technical brains

Archive for the ‘Code’ Category

david

Open source term extraction

By David MacIver on August 17th, 2009

This is just a quick announcement to let people know that we’ve open sourced our JRuby library for term extraction. You can get the code from my github page.

Unlike a lot of term extraction libraries, this doesn’t take any stance as to the “significance” of the terms it extracts. It’s purely about looking at the syntax and determining where good boundaries for terms are. There are a couple reasons for this, but basically we’ve found that it’s more effective to separate the two steps and makes it easier to tinker around with them independently. The criteria for “interestingness” of terms seem to be largely distinct from those for terms which simply make sense linguistically. So we have a two stage pipeline, one which extracts semantically meaningful terms and one which determines what terms are actually interesting in the context of the document. The second step is much more complicated, and we’re not open sourcing that (yet? probably not any time soon, if ever. Even if we wanted to, it relies on a lot more global information across the document corpus and so is very tied in with how SONAR operates, making it much harder to isolate).

So, how does it work? Black magic and voodoo!

Actually, no. It’s pretty straightforward. It builds on top of the excellent OpenNLP library, using its tools for part of speech tagging, sentence splitting (a much harder problem than you’d imagine) and phrase chunking. On top of that it’s currently a rules based system, because while you’re still figuring things out it makes much more sense to stick with something that is so easy to fine tune. Our expectation is that we’ll gradually start replacing bits of it with machine learning based techniques as we start to hit the limitations of a rules based system, but for now it’s working pretty well.
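To give a flavour of what a rule on top of part of speech tags can look like, here is a minimal Ruby sketch. It is not the code from the library: the method name `candidate_terms` is made up, the tags are supplied by hand in Penn Treebank style rather than coming from OpenNLP, and the single rule (keep runs of adjectives and nouns that end in a noun) is only an illustration.

```ruby
# Extracts candidate terms from hand-tagged tokens.
# `tagged` is an array of [word, tag] pairs; the tags are assumed to be
# Penn Treebank style (NN, NNS, JJ, ...), supplied by the caller.
def candidate_terms(tagged)
  terms = []
  current = []

  # Closes the run collected so far and records it as a term.
  flush = lambda do
    # Drop trailing adjectives so that every term ends in a noun.
    current.pop while !current.empty? && !current.last[1].start_with?('NN')
    terms << current.map(&:first).join(' ') unless current.empty?
    current = []
  end

  tagged.each do |word, tag|
    if tag.start_with?('NN') || tag == 'JJ'
      current << [word, tag]   # nouns and adjectives extend the current run
    else
      flush.call               # anything else is a term boundary
    end
  end
  flush.call
  terms
end
```

A real rule set needs many more cases than this (prepositions inside terms, for instance, as in “context of the document”), which is exactly why something easy to tweak is handy at this stage.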

Let’s have an example. If we feed the second paragraph of this post into the term extractor, we get the following terms back:

term extraction libraries
stance
terms
syntax
good boundaries
couple reasons
two steps
steps
criteria
interestingness
sense
two stage pipeline
stage pipeline
semantically meaningful terms
context
context of the document
document
second step
open sourcing
time
document corpus
SONAR

Hope you find this useful. Let us know if you build anything cool with it!

david

Open sourcing Pearson’s Correlation calculations

By David MacIver on April 22nd, 2009

As you might recall, I did some articles on calculating Pearson’s in SQL.

It turns out that this is a hilariously bad idea. The performance you get for it is terrible when the numbers get large. Switching to PostgreSQL seemed to help a bit here, but even then the numbers are not great (and we still aren’t planning on a port to PostgreSQL anyway). So we needed to find a better solution. Doing it in memory would be fast, but it would just fall over on a large dataset.

Anyway, after some tinkering around I came up with a slightly unholy solution. It’s a mix of bash, awk, standard unix tools and Java (the Java parts may be rewritten in something else later). The design is such that much of the heavy lifting is offloaded to sort, which works on disk and so doesn’t need to load the whole dataset into memory, and everything else processes things in a line oriented manner. This lets it get by with very reasonable memory usage and, in my fairly informal tests, perform about 50 times faster than the SQL version.
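The reason a line oriented design can get away with so little memory is that Pearson’s coefficient only needs a handful of running sums. A rough Ruby sketch of that idea follows. It is not the released code (which is bash, awk and Java), and the method name `streaming_pearson` and the input format of one whitespace separated x y pair per line are assumptions for illustration.

```ruby
# Computes Pearson's correlation in a single pass over lines of "x y" pairs.
# Only six running totals are kept, so memory use does not grow with the input.
def streaming_pearson(lines)
  n = 0
  sum_x = sum_y = sum_xx = sum_yy = sum_xy = 0.0

  lines.each do |line|
    x, y = line.split.map { |field| Float(field) }
    n += 1
    sum_x += x
    sum_y += y
    sum_xx += x * x
    sum_yy += y * y
    sum_xy += x * y
  end

  var_x = n * sum_xx - sum_x**2
  var_y = n * sum_yy - sum_y**2
  # The coefficient is undefined for empty input or when a variable is constant.
  return nil if var_x <= 0 || var_y <= 0

  (n * sum_xy - sum_x * sum_y) / Math.sqrt(var_x * var_y)
end
```

Because it only ever looks at one line at a time, it can be fed something like `$stdin.each_line` at the end of a shell pipeline.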

We’re releasing the code under a BSD license and making it available on github. It’s in a bit of a rough state at the moment, but is usable as is.

mccraig

ActiveRecord-JDBC plugin for working with MySQL master-slave configurations

By craig mcmillan on March 20th, 2009

here’s a little plugin for ActiveRecord-JDBC which enables simple use of MySQL master-slave configurations

active-record-jdbc-mysql-master-slave

jan

type discussion on irc (my eyez)

By Jan Berkel on February 5th, 2009

A java programmer, a scala dev and a ruby guy meet on #irc. Says the java guy to the… wait, this is not the beginning of a joke.

jan: Map<String,Map<String,String>>  argggghh
thepete: mmm, readable code, yummy
David: Map<String, Map<String, String>> is perfectly reasonable. :)
       The real problem is that Java makes you declare it twice.
jan: Map<String, Map<String, Object>> m =
       new HashMap<String, Map<String, Object>>();
mccraig: stop that !
mccraig: my eyez
David: val m = new HashMap[String, Map[String, Object]]
thepete: ze goggles;
David: or val m = new HashMap[(String, String), Object] if you prefer. :)
jan: what type will m have? Map?
David: val m : Map[(String, String), Object] = new HashMap
mccraig: m = {}
David: m["1"] = "stuff"; m[1] # Why is this returning nil??? :(
mccraig: yeah i know, but whatever
jan: MapWithIndifferentAccess
mccraig: my eyez hurt less
David: my eyez weep
mccraig: u can get eyedrops for that
David: You can get gogglez. :-)
jan: can i blog this? :)

jan

java signed types fail

By Jan Berkel on January 29th, 2009

Ok, another fail post, this time regarding java.

Remember, Java has only signed types (int, byte, short, long) - char is the only exception.

WTF?

James Gosling says (source):

Quiz any C developer about unsigned, and pretty soon you discover that almost no C developers actually understand what goes on with unsigned, what unsigned arithmetic is.
Things like that made C complex. The language part of Java is, I think, pretty simple.

Ok, so he is basically saying that J. Random Developer is too stupid to care about differences in signedness, and so he decided to make all the types signed, maybe for consistency reasons or who knows. But sometimes signed values are the last thing you want. Say for example you’re dealing with any sort of decoding/encoding problem (like network protocols). Now you have to make sure that you don’t trip over sign problems (esp. with implicit casts). Here’s a snippet from the JRuby codebase (base64 decoding routines):

   private static byte safeGet(ByteBuffer encode) {
     return encode.hasRemaining() ? encode.get() : 0;
   }
   ...
   int s = safeGet(encode);
   while (((a = b64_xtable[s]) == -1) && encode.hasRemaining()) {
      // do something
   }

In Java, a byte can have a value from -128 to 127. The code above works fine, except when you feed it data which is not just in the ASCII range, because any byte of 0x80 or above makes the variable “s” negative and causes an ArrayIndexOutOfBoundsException. So the general recommendation is to use the next bigger signed type (short instead of byte, long instead of int). If you’re using someone else’s API (java.nio.ByteBuffer in this case) you don’t have the choice and therefore you have to be extra careful when using it. In this case the right thing to do is a bitwise AND with 0xFF, which strips the sign extension from the int and leaves a value between 0 and 255.

   while (((a = b64_xtable[s & 0xff]) == -1) && encode.hasRemaining()) {
      // do something
   }
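The arithmetic is easy to check outside Java. The sketch below uses Ruby (the helper names `as_java_byte` and `mask_to_unsigned` are made up) to reinterpret an unsigned octet the way a Java byte stores it, and then applies the same mask. Note that Ruby arrays accept negative indices, so this only demonstrates the sign handling, not the exception.

```ruby
# Reinterprets an octet (0..255) as a signed 8-bit value, as Java's byte does.
def as_java_byte(octet)
  [octet].pack('C').unpack('c').first
end

# Undoes the sign extension by keeping only the low eight bits.
def mask_to_unsigned(signed_value)
  signed_value & 0xff
end

signed = as_java_byte(200)            # 200 does not fit in a signed byte
restored = mask_to_unsigned(signed)   # the mask recovers the original octet
puts "200 as a byte: #{signed}, after & 0xff: #{restored}"
```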

Here’s the JRuby bug report and a general introduction to java sign issues by Sean R. Owens.

david

Porting Pearsons to Postgres. Performance?

By David MacIver on January 29th, 2009

I’ve uploaded a version of the Pearson’s Coefficient code which runs on postgresql. You can download it here. I wrote this as an experiment to see if Postgres could help us with some of our MySQL performance woes.

Some brief experimentation suggests that once you fix PostgreSQL’s ridiculous default configuration the performance story is relatively happy. At small sizes MySQL is moderately faster, but as the sizes get large PostgreSQL seems to take the lead. I don’t have any sort of formal benchmark yet: This needs much more testing before I can definitively claim either is faster than the other, but for now the signs in favour of PostgreSQL are promising.

emma

Has Many + non default primary key loads incorrect data in Rails 2.2.2

By Emma Persky on January 29th, 2009

 

I found an interesting bug in Rails 2.2.2 yesterday. I couldn’t find a similar bug on the rails lighthouse so created a new ticket. What was most interesting though, was how quickly the rails core team picked up the bug and assigned it to someone.

It turns out that the bug had already been fixed in the current master branch of the rails git repo, though apparently no one had noticed its existence because I can’t find any references to this anywhere. I guess the fix in activerecord, which is almost identical to my fix below, will form part of the next release whenever that is.

I assume this is probably also the case for other has_* relationships, but have not verified.

I have a has_many association from class Foo to class Bar, where, for this specific relationship, the primary key on Foo is not id, nor is the foreign key on Bar id.

class Foo < ActiveRecord::Base
  has_many :bars, :primary_key => 'a_non_standard_key_name', :foreign_key => 'another_non_standard_key_name'
end

The relationship is one way, I have no need to navigate from Bar back to Foo, but only call a_foo.bars.

This works fine when working with a single object, but breaks down when you want to do eager association preloading to avoid the n+1 query problem of loading bars for many foos.

When performing the following, you find that the preloaded bars are not the ones you expect:

foos = Foo.find :all, :include => :bars
foos.first.bars  # => [SOMETHING_UNEXPECTED]

The reason is that ActiveRecord creates the preloading query based on the default primary key of Foo (normally id).

It queries for Bar.another_non_standard_key_name matching Foo.id, not Foo.a_non_standard_key_name.

This causes seriously unexpected behaviour, and could easily go unnoticed since no errors are thrown.
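To make the mismatch concrete, here is a small plain Ruby model of what the preloader does. It does not use ActiveRecord: the Struct definitions, the `preload_bars` helper and the sample rows are all invented for illustration, and only the two key names come from the example above.

```ruby
# Stand-ins for the two models; these are plain structs, not ActiveRecord classes.
Foo = Struct.new(:id, :a_non_standard_key_name)
Bar = Struct.new(:another_non_standard_key_name, :label)

# Groups bars by their foreign key and attaches them to each foo, looking the
# group up by whichever attribute of foo is named in `owner_key`.
# Returns a hash of foo id => array of bars.
def preload_bars(foos, bars, owner_key)
  bars_by_foreign_key = bars.group_by(&:another_non_standard_key_name)
  foos.map { |foo| [foo.id, bars_by_foreign_key.fetch(foo[owner_key], [])] }.to_h
end

foos = [Foo.new(1, 20), Foo.new(2, 10)]
bars = [Bar.new(20, 'belongs to foo 1'), Bar.new(10, 'belongs to foo 2'),
        Bar.new(1, 'unrelated')]

buggy = preload_bars(foos, bars, :id)                        # what 2.2.2 does
fixed = preload_bars(foos, bars, :a_non_standard_key_name)   # what it should do
puts buggy[1].map(&:label).inspect
puts fixed[1].map(&:label).inspect
```

In the buggy case the first foo silently picks up the unrelated bar and the second gets nothing at all, with no error anywhere.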

I have found the hook in ActiveRecord where this functionality should be included and monkey patched it for my system, because I need it now. I can’t vouch for its correctness, but we have many many specs for our product and none of them have broken because of this.

I’m running frozen rails 2.2.2

vendor/activerecord/lib/active_record/association_preload.rb, line 221

Change

primary_key_name = reflection.through_reflection_primary_key_name

to

primary_key_name = reflection.through_reflection_primary_key_name || reflection.options[:primary_key]

Hope this helps someone!

david

Yet another MySQL Fail

By David MacIver on January 26th, 2009

mysql> create table stuff (name varchar(32));
Query OK, 0 rows affected (0.24 sec)

mysql> insert into stuff values ('foo'), ('1'), ('0');
Query OK, 3 rows affected (0.00 sec)
Records: 3  Duplicates: 0  Warnings: 0

mysql> select * from stuff;
+------+
| name |
+------+
| foo  |
| 1    |
| 0    |
+------+
3 rows in set (0.00 sec)

mysql> delete from stuff where name = 0;
Query OK, 2 rows affected (0.09 sec)

mysql> select * from stuff;
+------+
| name |
+------+
| 1    |
+------+
1 row in set (0.00 sec)

mysql> WTF????
-> ;
ERROR 1064 (42000): You have an error in your SQL syntax; check the manual that corresponds to your MySQL server version for the right syntax to use near 'WTF????' at line 1

So, what’s going on here? I said to delete everything where the name was 0, but it also deleted the row 'foo'.

The following might help:

mysql> create table more_stuff(id int);
Query OK, 0 rows affected (0.19 sec)

mysql> insert into more_stuff values('foo');
Query OK, 1 row affected, 1 warning (0.00 sec)

mysql> select * from more_stuff;
+------+
| id   |
+------+
|    0 |
+------+
1 row in set (0.00 sec)

When you try to use a string as a number in MySQL, it takes non numeric strings and turns them into zero. So when you test name = 0, it converts each name into a number, and 'foo' becomes 0. Consequently every string which can’t be parsed as a number results in true for this test.
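The effect can be approximated in Ruby, whose `String#to_f` behaves in a similar way: it parses a leading numeric prefix and returns 0.0 when there isn’t one. This is only a model of the comparison, not MySQL’s actual implementation, and the method names are made up.

```ruby
# Model of `where name = 0`: every string is converted to a number first.
def rows_matching_numeric_zero(names)
  names.select { |name| name.to_f == 0 }
end

# Model of `where name = '0'`: a plain string comparison, no conversion.
def rows_matching_string_zero(names)
  names.select { |name| name == '0' }
end

names = %w[foo 1 0]
puts rows_matching_numeric_zero(names).inspect
puts rows_matching_string_zero(names).inspect
```

Quoting the literal, i.e. comparing against '0' rather than 0, keeps both sides as strings and only matches the row you meant.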

At this point I would rant about how mindbogglingly stupid this behaviour is, but I don’t think I can really be bothered.

mccraig

mysql cast to floating point

By craig mcmillan on January 14th, 2009

discovered another mysql trick

if you are experiencing underflow with mysql fixed point arithmetic, you may need to force a floating point evaluation. cast() does not support cast to floating point so multiply by 1e0 instead

e.g.

select  1000000000000000000000000000000 *  ( 1 / ( 1000000000000000000000000000000 ));

returns the wrong answer, whereas : 

select  1000000000000000000000000000000 *  ( 1 / ( 1e0 * 1000000000000000000000000000000 ));

is fine and dandy
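here’s one way to picture what goes wrong, as a ruby sketch. it assumes that the fixed point division keeps only a limited number of fractional digits (four in this sketch, which is an assumption for illustration and not a statement about your server’s settings), so a very small quotient is rounded to zero before the multiplication happens. the constant and method names are made up.

```ruby
BIG = 10**30

# Fixed point model: the quotient is rounded to `scale` fractional digits,
# so 1 / 10**30 collapses to zero before it is multiplied back up.
def fixed_point_result(scale = 4)
  quotient = Rational(1, BIG).round(scale)
  BIG * quotient
end

# Floating point model: multiplying by 1e0 turns the divisor into a Float,
# so the tiny quotient survives as roughly 1.0e-30.
def floating_point_result
  BIG * (1 / (1e0 * BIG))
end

puts "fixed point: #{fixed_point_result.to_f}, floating point: #{floating_point_result}"
```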

jan

fixtures are evil, but so is mysql

By Jan Berkel on December 9th, 2008

MySQL is not getting much love at the office, today was another of those days.

A little bit of background: we were in the process of replacing our fixture-based rails specs with “rspec scenarios”, a small extension we wrote for rails/rspec (to be released soon). The idea is that you create a scenario programmatically rather than have static, hard to change fixtures in yaml. Each spec is run inside a transaction which gets rolled back, in the same way Rails handles this.

One particular spec was leaving the database in an inconsistent state, i.e. a transaction got committed. Debugging this problem took more time than I’m willing to admit here, but some of it was spent making the process a bit easier, using a logfilter for mysql.

It expects mysqld.log from stdin and will print logging output from separate transactions in different colours, as well as highlight the transaction demarcation points. However, everything looked ok in there, the transaction was properly demarcated + rolled back but still ended up being committed!

It turns out that some sql statements perform an implicit, silent commit, effectively ignoring the boundaries you defined. In our case TRUNCATE TABLE was the culprit. The right behaviour here would seem to be either to roll back the current transaction or at least to produce some informational logging as to why the transaction got committed. The default behaviour just seems to be completely wrong and unintuitive (and cannot be disabled).
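To show why the rollback at the end of the spec no longer helps, here is a toy model in Ruby. It is not MySQL and not our spec code: the `ToyConnection` class and its methods are invented, and the only behaviour it models is that truncating a table commits whatever was pending first.

```ruby
# Toy model of a connection on which TRUNCATE commits implicitly.
class ToyConnection
  def initialize
    @committed = Hash.new { |hash, table| hash[table] = [] }  # durable rows
    @pending = []                                             # open transaction
  end

  def insert(table, row)
    @pending << [table, row]
  end

  # Models the implicit commit: pending work is committed before the table
  # is emptied, and neither step can be undone by a later rollback.
  def truncate(table)
    commit
    @committed[table].clear
  end

  def commit
    @pending.each { |table, row| @committed[table] << row }
    @pending.clear
  end

  def rollback
    @pending.clear
  end

  # Rows that are durably stored for the given table.
  def rows(table)
    @committed[table].dup
  end
end

conn = ToyConnection.new
conn.insert(:users, 'alice')   # written inside the spec's transaction
conn.truncate(:sessions)       # silently commits the insert above
conn.rollback                  # too late, nothing is left to roll back
puts conn.rows(:users).inspect
```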