Trampoline Systems

* Trampoline Description Here

Trampoline Systems

* Trampoline Description Here


Content

Machines

Ideas, thoughts and observations from Trampoline's technical brains

Archive for the ‘SQL’ Category

david

Porting Pearsons to Postgres. Performance?

By David MacIver on January 29th, 2009

I’ve uploaded a version of the Pearson’s Coefficient code which runs on postgresql. You can download it here.I wrote this as an experiment to see if Postgres could help us with some of our MySQL performance woes.

Some brief experimentation suggests that once you fix PostgreSQL’s ridiculous default configuration the performance story is relatively happy. At small sizes MySQL is moderately faster, but as the sizes get large PostgreSQL seems to take the lead. I don’t have any sort of formal benchmark yet: This needs much more testing before I can definitively claim either is faster than the other, but for now the signs in favour of PostgreSQL are promising.

david

Yet another MySQL Fail

By David MacIver on January 26th, 2009

mysql> create table stuff (name varchar(32));
Query OK, 0 rows affected (0.24 sec)

mysql> insert into stuff values (’foo’), (’1′), (’0′);
Query OK, 3 rows affected (0.00 sec)
Records: 3  Duplicates: 0  Warnings: 0

mysql> select * from stuff;
+——+
| name |
+——+
| foo  |
| 1    |
| 0    |
+——+
3 rows in set (0.00 sec)

mysql> delete from stuff where name = 0;
Query OK, 2 rows affected (0.09 sec)

mysql> select * from stuff;
+——+
| name |
+——+
| 1    |
+——+
1 row in set (0.00 sec)

mysql> create table stuff (name varchar(32));
Query OK, 0 rows affected (0.24 sec)

mysql> insert into stuff values (’foo’), (’1′), (’0′);
Query OK, 3 rows affected (0.00 sec)
Records: 3  Duplicates: 0  Warnings: 0

mysql> select * from stuff;
+——+
| name |
+——+
| foo  |
| 1    |
| 0    |
+——+
3 rows in set (0.00 sec)

mysql> delete from stuff where name = 0;
Query OK, 2 rows affected (0.09 sec)

mysql> select * from stuff;
+——+
| name |
+——+
| 1    |
+——+
1 row in set (0.00 sec)

mysql> WTF????
-> ;
ERROR 1064 (42000): You have an error in your SQL syntax; check the manual that corresponds to your MySQL server version for the right syntax to use near ‘WTF????’ at line 1

So, what’s going on here? I said to delete everything where the name was 0, but it deleted the row ‘foo’.

The following might help:

mysql> create table more_stuff(id int);
Query OK, 0 rows affected (0.19 sec)

mysql> insert into more_stuff values(’foo’);
Query OK, 1 row affected, 1 warning (0.00 sec)

mysql> select * from more_stuff;
+——+
| id   |
+——+
|    0 |
+——+
1 row in set (0.00 sec)

When you try to use a string as an integer in MySQL, it takes non numeric strings and turns them into zero. So when you test name = 0, it converts name into an integer and turns that into 0. Consequently strings which can’t be parsed as an integer result in true for this test.

At this point I would rant about how mindbogglingly stupid this behaviour is, but I don’t think I can really be bothered.

mccraig

mysql cast to floating point

By craig mcmillan on January 14th, 2009

discovered another mysql trick

if you are experiencing underflow with mysql fixed point arithmetic, you may need to force a floating point evaluation. cast() does not support cast to floating point so multiply by 1e0 instead

e.g.

select  1000000000000000000000000000000 *  ( 1 / ( 1000000000000000000000000000000 ));

returns the wrong answer, whereas : 

select  1000000000000000000000000000000 *  ( 1 / ( 1e0 * 1000000000000000000000000000000 ));

is fine and dandy

jan

fixtures are evil, but so is mysql

By Jan Berkel on December 9th, 2008

MySQL is not getting much love at the office, today was another of those days.

A little bit of background: we were in the process of replacing our fixture-based rails specs with “rspec scenarios”, a small extension we wrote for rails/rspec (to be released soon). The idea is that you create a scenario programatically rather than have static, hard to change fixtures in yaml. Each spec is run inside a transaction which gets rolled back, in the same way Rails handles this.

One particular spec was leaving the database in a inconsistent state, i.e. a transaction got committed. Debugging this problem took more time that I’m willing to admit here, but some of it was spend making the process a bit easier, using a logfilter for mysql:

It expects mysqld.log from stdin and will print logging output from separate transactions in different colours, as well as highlight the transaction demarcation points. However, everything looked ok in there, the transaction was properly demarcated + rolled back but still ended up being committed!

It turns out that some sql statements perform an implicit silent commit, effectively ignoring your defined boundaries. In our case TRUNCATE table was the culprit. The right behaviour here seems to either roll back the current transaction or at least produce some informational logging as to why the transaction got committed. The default behaviour just seems to be completely wrong and unintuitive (and cannot be disabled).

david

Pearsons in the database, part 2

By David MacIver on December 9th, 2008

Before I explain what this is about, the following tweets provide useful context for how I feel about this:

http://twitter.com/DRMacIver/status/1047320819

http://twitter.com/DRMacIver/status/1047321174

We’ve been discovering that the SQL query I wrote for calculating pearsons, while it works fine for small datasets (say a few hundred thousand ratings), once you get to around the million rating mark starts being unusably slow, even for a nightly job. Basically if you look at the query plan it ends up doing a filesort. This is not nice, as it involves an awful lot of data paging to and from disk. Having tried to optimise it directly and failed, I’ve spent the last two weeks writing nasty hacks to try to make it fast and was going to follow up to the blog post with my final solution (which involves a temporary table and a stored procedure. It’s pretty grim).

But today I was doing something else which involved a similar sort of query and got really pissed off that I was having exactly the same problems. I boiled it down to a very simple example which illustrated it and logged onto freenode’s #mysql to see if anyone could help me figure out what’s going on. Last time I tried this I was not successful, but one lives in hope.

Well, as it turns out, someone could. Here’s a snippet of conversation:

16:20 < DRMacIver> I'm finding this pattern comes up a lot in what I'm doing at
                   the moment, and I simply can't figure out a sensible way to
                   optimise it. Any suggestions? http://pastebin.com/m5be98ace
16:21 < DRMacIver> The self join on the same column followed by a group by seems
                   to produce a filesort no matter what I do. As per example,
                   basically all indices that could exist on the table do.
16:22 < jbalint> DRMacIver: i think you can order by null
16:22  * DRMacIver boggles
16:22 < DRMacIver> That works. Thank you.

So, by updating one line in the SQL I’ve posted previously the query becomes dramatically faster. I’ve also updated it to use an update join (which is MySQL specific) instead of the dependent subquery in the set (which is not, but which MySQL runs appallingly slowly) in order to get the standard deviation calculations to run in reasonable time.

Why is this happening? Well, http://dev.mysql.com/doc/refman/5.0/en/order-by-optimization.html contains the answer. It turns out that if you have a group by then it implicitly orders by that group. Even if the group clause has no index on it (which it can’t in this case because it’s a join across two tables), and ordering a large dataset by a non-indexed key will cause a filesort. If you add the clause “order by null” then it will not order by the group by clause, and the filesort goes away and the query becomes much faster.

We are not amused.

As another general tip, we’ve also found that this rapidly starts to generate unacceptably large amount of data, but that if you set some sensible lower bounds (such as only generating the statistics for things which cooccur more than once and have a pearsons > 0.1) it can easily be reduced to sensible levels.

david

Computing connected graph components via SQL

By David MacIver on November 19th, 2008

Hi, I don’t post to here much. I’m one of the devs working on SONAR, focusing on mostly theme extraction.

As with many applications, SONAR’s data crunching is basically relational database driven. We keep thinking about experimenting with graph DB based approaches, but never manage to find quite a compelling enough reason - there’s no way we’d give up the relational approach entirely, so it needs to be a really big win to be worth the annoyance of having to maintain two different types of database in synch with eachother.

Unfortunately this sometimes leaves us in the unenviable position of having to do graph algorithms in SQL. This is about as much fun as you might imagine it to be. Most recent challenge: Computing the connected components of a graph in SQL.

There’s always the option of loading it into memory and doing it there of course. But the graph in question is rather large. With our little demo data sets it would be fine to do that, but any larger (e.g. on a real live sonar deployment) and this starts to sound like a really bad idea.

It turns out this is surprisingly simple to do once you have the key insight. I couldn’t find anything on the web explaining this though, so thought I’d write a post about it in case anyone else needs to do the same. It’s not rocket science, but hopefully this will save someone some time.

Consider the following setup:

create table if not exists items(
  id int primary key,
  component_id int
);

create table if not exists links(
  first int references items (id),
  second int references items (id)
);

We consider entries in links as undirected edges in a graph and we want to update items so that all items in the same component have the same component_id and distinct components have distinct component ids.

We’ll do this incrementally and merge. In order to do this we need a new table which we’ll use as scratch space (this should be a temporary table, but MySQL has irritating restrictions on temporary tables which make this not work):

create table if not exists components_to_merge(
  component1 int,
  component2 int);

(Side note: All SQL here is tested only on MySQL. It shouldn’t be hard to make it work on any other database though).

The idea is that at each step we’ll merge components, using components_to_merge to map components to the component they’ll be merged with.

So we start with a set of candidate components. That’s simple enough: We take each node as a potential starting component.

Now at each stage we merge components by finding links between them. For every potential component we look at all other potential components it’s connected to via some link. We insert all component-component links into components to merge. This is straightforward enough:

insert into components_to_merge
            select distinct t1.component_id c1, t2.component_id c2
            from links
            join items t1
            on links.first = t1.id
            join items t2
            on links.second = t2.id
            where t1.component_id != t2.component_id

insert into components_to_merge
select component2, component1 from components_to_merge; -- ensure symmetricity

So, now we have a list of groups to merge in this table. If the table is empty then we’re done - all the groups are maximal (and because of the way we constructed them they’re connected - at each point they were built by joining together two connected sets which were the connected to eachother), so the component_ids currently in items describe the actual components. If not, we now reassign components:

    update items
    join (
      select component1 source, min(component2) target
      from components_to_merge
      group by source
    ) new_components
    on new_components.source = component_id
    set items.component_id = least(items.component_id, target)

This step merges each component with another one (although “merging” conveys a slightly inaccurate sense of what happens. Consider a graph 1-2-3. The first step would result in 1 and 2 being assigned component_id 1, and 3 being assigned component_id 2. So the {3} component took the place of the {2} component).

What’s the complexity of this code? Well, it’s not amazing, but it’s not terrible either. The complexity of each step in the loop is probably somewhere around O(n log(n)) depending on exactly what the database does to it. The worst case number of queries is the size of the largest component: It’s obviously an upper bound as the number groups decreases by at least one each time; In order to see that it’s achieved, consider an extension of the 1-2-3 example where we have a graph 1-2-…-n. Then at each stage what happens is that we end up with components [1], [2], …, [n] -> [1, 2], [3], …, [n] -> [1, 2, 3], [4], …, [n], etc, taking n steps to terminate. On the other hand, a complete graph terminates in one step.

I suspect that the expected run time is O(log(n)), with each part of the component chosen approximately doubling in size each time, but I confess to not actually having bothered to run the maths: For our particular use case at the moment this is fine - it turns out the graph we’re considering is fairly sparse and tends to have small components, so for the moment this is more than fast enough. On the other hand, it would be nice to have a better guaranteed time, so if anyone has a smarter approach I’d love to hear it.

Anyway, here’s some sample code that ties all of this together: http://code.trampolinesystems.com/components.rb