Trampoline Systems

* Trampoline Description Here

Trampoline Systems

* Trampoline Description Here


Content

Machines

Ideas, thoughts and observations from Trampoline's technical brains

Archive for December, 2008

jan

fixtures are evil, but so is mysql

By Jan Berkel on December 9th, 2008

MySQL is not getting much love at the office, today was another of those days.

A little bit of background: we were in the process of replacing our fixture-based rails specs with “rspec scenarios”, a small extension we wrote for rails/rspec (to be released soon). The idea is that you create a scenario programatically rather than have static, hard to change fixtures in yaml. Each spec is run inside a transaction which gets rolled back, in the same way Rails handles this.

One particular spec was leaving the database in a inconsistent state, i.e. a transaction got committed. Debugging this problem took more time that I’m willing to admit here, but some of it was spend making the process a bit easier, using a logfilter for mysql:

It expects mysqld.log from stdin and will print logging output from separate transactions in different colours, as well as highlight the transaction demarcation points. However, everything looked ok in there, the transaction was properly demarcated + rolled back but still ended up being committed!

It turns out that some sql statements perform an implicit silent commit, effectively ignoring your defined boundaries. In our case TRUNCATE table was the culprit. The right behaviour here seems to either roll back the current transaction or at least produce some informational logging as to why the transaction got committed. The default behaviour just seems to be completely wrong and unintuitive (and cannot be disabled).

david

Pearsons in the database, part 2

By David MacIver on December 9th, 2008

Before I explain what this is about, the following tweets provide useful context for how I feel about this:

http://twitter.com/DRMacIver/status/1047320819

http://twitter.com/DRMacIver/status/1047321174

We’ve been discovering that the SQL query I wrote for calculating pearsons, while it works fine for small datasets (say a few hundred thousand ratings), once you get to around the million rating mark starts being unusably slow, even for a nightly job. Basically if you look at the query plan it ends up doing a filesort. This is not nice, as it involves an awful lot of data paging to and from disk. Having tried to optimise it directly and failed, I’ve spent the last two weeks writing nasty hacks to try to make it fast and was going to follow up to the blog post with my final solution (which involves a temporary table and a stored procedure. It’s pretty grim).

But today I was doing something else which involved a similar sort of query and got really pissed off that I was having exactly the same problems. I boiled it down to a very simple example which illustrated it and logged onto freenode’s #mysql to see if anyone could help me figure out what’s going on. Last time I tried this I was not successful, but one lives in hope.

Well, as it turns out, someone could. Here’s a snippet of conversation:

16:20 < DRMacIver> I'm finding this pattern comes up a lot in what I'm doing at
                   the moment, and I simply can't figure out a sensible way to
                   optimise it. Any suggestions? http://pastebin.com/m5be98ace
16:21 < DRMacIver> The self join on the same column followed by a group by seems
                   to produce a filesort no matter what I do. As per example,
                   basically all indices that could exist on the table do.
16:22 < jbalint> DRMacIver: i think you can order by null
16:22  * DRMacIver boggles
16:22 < DRMacIver> That works. Thank you.

So, by updating one line in the SQL I’ve posted previously the query becomes dramatically faster. I’ve also updated it to use an update join (which is MySQL specific) instead of the dependent subquery in the set (which is not, but which MySQL runs appallingly slowly) in order to get the standard deviation calculations to run in reasonable time.

Why is this happening? Well, http://dev.mysql.com/doc/refman/5.0/en/order-by-optimization.html contains the answer. It turns out that if you have a group by then it implicitly orders by that group. Even if the group clause has no index on it (which it can’t in this case because it’s a join across two tables), and ordering a large dataset by a non-indexed key will cause a filesort. If you add the clause “order by null” then it will not order by the group by clause, and the filesort goes away and the query becomes much faster.

We are not amused.

As another general tip, we’ve also found that this rapidly starts to generate unacceptably large amount of data, but that if you set some sensible lower bounds (such as only generating the statistics for things which cooccur more than once and have a pearsons > 0.1) it can easily be reduced to sensible levels.