Blog | Archive for the ‘Coding’ Category
Pearsons in the database, part 2
By david | Tuesday, December 9th, 2008
Before I explain what this is about, the following tweets provide useful context for how I feel about this:http://twitter.com/DRMacIver/status/1047320819
http://twitter.com/DRMacIver/status/1047321174
We’ve been discovering that the SQL query I wrote for calculating pearsons, while it works fine for small datasets (say a few hundred thousand ratings), once you get to around the million rating mark starts being unusably slow, even for a nightly job. Basically if you look at the query plan it ends up doing a filesort. This is not nice, as it involves an awful lot of data paging to and from disk. Having tried to optimise it directly and failed, I’ve spent the last two weeks writing nasty hacks to try to make it fast and was going to follow up to the blog post with my final solution (which involves a temporary table and a stored procedure. It’s pretty grim).
But today I was doing something else which involved a similar sort of query and got really pissed off that I was having exactly the same problems. I boiled it down to a very simple example which illustrated it and logged onto freenode’s #mysql to see if anyone could help me figure out what’s going on. Last time I tried this I was not successful, but one lives in hope.
Well, as it turns out, someone could. Here’s a snippet of conversation:
16:20 < DRMacIver> I'm finding this pattern comes up a lot in what I'm doing at
the moment, and I simply can't figure out a sensible way to
optimise it. Any suggestions? http://pastebin.com/m5be98ace
16:21 < DRMacIver> The self join on the same column followed by a group by seems
to produce a filesort no matter what I do. As per example,
basically all indices that could exist on the table do.
16:22 < jbalint> DRMacIver: i think you can order by null
16:22 * DRMacIver boggles
16:22 < DRMacIver> That works. Thank you.
So, by updating one line in the SQL I’ve posted previously the query becomes dramatically faster. I’ve also updated it to use an update join (which is MySQL specific) instead of the dependent subquery in the set (which is not, but which MySQL runs appallingly slowly) in order to get the standard deviation calculations to run in reasonable time.
Why is this happening? Well, http://dev.mysql.com/doc/refman/5.0/en/order-by-optimization.html contains the answer. It turns out that if you have a group by then it implicitly orders by that group. Even if the group clause has no index on it (which it can’t in this case because it’s a join across two tables), and ordering a large dataset by a non-indexed key will cause a filesort. If you add the clause “order by null” then it will not order by the group by clause, and the filesort goes away and the query becomes much faster.
We are not amused.
As another general tip, we’ve also found that this rapidly starts to generate unacceptably large amount of data, but that if you set some sensible lower bounds (such as only generating the statistics for things which cooccur more than once and have a pearsons > 0.1) it can easily be reduced to sensible levels.
Posted in Coding | 1 Comment »
rails 2.2 + jruby + jetty = win
By jan | Thursday, November 27th, 2008
In case you missed it, rails 2.2 recently got released, finally promising thread safety among some other things. Thread safety has always been neglected by the rails core team, the standard way to scale up in rails (pre 2.2) is to run multiple processes, which makes deployment a lot harder (I think there’re at least 10 different ways to deploy rails apps at the moment, and people still come up with new solutions: apache+fcgi, mongrel, mongrel_cluster, thin, phusion, rack…).Why has thread safety become a priority all of a sudden? I suspect one of the drivers is JRuby, which is now a viable alternative to MRI Ruby, and which also has the nice property of mapping Ruby threads to native threads. Another factor might be the arrival of merb, the new kid on the ‘ruby web framework’ block. Merb has been designed with thread safety in mind, and is now starting to get a lot of attention (1.0 has just been released).
Now, with a thread safe rails JRuby might become the platform of choice for deploying rails apps, especially given the performance progress the JRuby team is making. Having real threads does make a huge difference, reducing the memory footprint and making better use of multi core cpus.
There’re a couple of possibilites to deploy a rails application in JRuby, glassfish seems to be the recommended choice at the moment. However glassfish is anything but easily embeddable so I tried jetty as an option. Compared to glassfish, jetty is solid and proven (version 7 will be released soon), small and easily embeddable.
I didn’t want to use warbler (no web.xml please!), instead I used a combination of JRuby-Rack + Jetty7 and tied everything together with a simple JRuby script.
server = org.mortbay.jetty.Server.new
thread_pool = org.mortbay.thread.QueuedThreadPool.new
thread_pool.min_threads = 5 # adjust as needed
thread_pool.max_threads = 50
server.set_thread_pool(thread_pool)
connector = org.mortbay.jetty.nio.SelectChannelConnector.new
connector.port = 3000
context = org.mortbay.jetty.servlet.Context.new(nil, "/",
org.mortbay.jetty.servlet.Context::NO_SESSIONS)
context.add_filter("org.jruby.rack.RackFilter", "/*",
org.mortbay.jetty.Handler::DEFAULT)
context.set_resource_base(RAILS_DIR)
context.add_event_listener(org.jruby.rack.rails.RailsServletContextListener.new)
context.set_init_params(java.util.HashMap.new(
'rails.root'=> '.', 'public.root' => 'public',
'org.mortbay.jetty.servlet.Default.relativeResourceBase' => '/public',
'jruby.max.runtimes' => '1'))
context.add_servlet(org.mortbay.jetty.servlet.ServletHolder.new(
org.mortbay.jetty.servlet.DefaultServlet.new), "/")
server.set_handler(context)
server.start
This will run jetty on port 3000, dispatching all requests for dynamic content to a single JRuby instance. It is important to set “‘jruby.max.runtimes” to 1, so it’ll create a shared application runtime for you, otherwise you’ll get the old one runtime per thread model.
On the rails side you need “config.threadsafe!” in the configuration file. Autoloading of classes will then be disabled, be sure to load all your dependencies upfront in environment.rb. We haven’t actually used this in production, but some initial tests look very promising (mongrel: 23req/s, jetty: 50 req/s). Also, deployment will be a lot easier, because static and dynamic content can be served by one single process.
Posted in Coding | 26 Comments »
Computing connected graph components via SQL
By david | Wednesday, November 19th, 2008
Hi, I don’t post to here much. I’m one of the devs working on SONAR, focusing on mostly theme extraction.As with many applications, SONAR’s data crunching is basically relational database driven. We keep thinking about experimenting with graph DB based approaches, but never manage to find quite a compelling enough reason – there’s no way we’d give up the relational approach entirely, so it needs to be a really big win to be worth the annoyance of having to maintain two different types of database in synch with eachother.
Unfortunately this sometimes leaves us in the unenviable position of having to do graph algorithms in SQL. This is about as much fun as you might imagine it to be. Most recent challenge: Computing the connected components of a graph in SQL.
There’s always the option of loading it into memory and doing it there of course. But the graph in question is rather large. With our little demo data sets it would be fine to do that, but any larger (e.g. on a real live sonar deployment) and this starts to sound like a really bad idea.
It turns out this is surprisingly simple to do once you have the key insight. I couldn’t find anything on the web explaining this though, so thought I’d write a post about it in case anyone else needs to do the same. It’s not rocket science, but hopefully this will save someone some time.
Consider the following setup:
create table if not exists items( id int primary key, component_id int ); create table if not exists links( first int references items (id), second int references items (id) );
We consider entries in links as undirected edges in a graph and we want to update items so that all items in the same component have the same component_id and distinct components have distinct component ids.
We’ll do this incrementally and merge. In order to do this we need a new table which we’ll use as scratch space (this should be a temporary table, but MySQL has irritating restrictions on temporary tables which make this not work):
create table if not exists components_to_merge( component1 int, component2 int);
(Side note: All SQL here is tested only on MySQL. It shouldn’t be hard to make it work on any other database though).
The idea is that at each step we’ll merge components, using components_to_merge to map components to the component they’ll be merged with.
So we start with a set of candidate components. That’s simple enough: We take each node as a potential starting component.
Now at each stage we merge components by finding links between them. For every potential component we look at all other potential components it’s connected to via some link. We insert all component-component links into components to merge. This is straightforward enough:
insert into components_to_merge
select distinct t1.component_id c1, t2.component_id c2
from links
join items t1
on links.first = t1.id
join items t2
on links.second = t2.id
where t1.component_id != t2.component_id
insert into components_to_merge
select component2, component1 from components_to_merge; -- ensure symmetricity
So, now we have a list of groups to merge in this table. If the table is empty then we’re done – all the groups are maximal (and because of the way we constructed them they’re connected – at each point they were built by joining together two connected sets which were the connected to eachother), so the component_ids currently in items describe the actual components. If not, we now reassign components:
update items
join (
select component1 source, min(component2) target
from components_to_merge
group by source
) new_components
on new_components.source = component_id
set items.component_id = least(items.component_id, target)
This step merges each component with another one (although “merging” conveys a slightly inaccurate sense of what happens. Consider a graph 1-2-3. The first step would result in 1 and 2 being assigned component_id 1, and 3 being assigned component_id 2. So the {3} component took the place of the {2} component).
What’s the complexity of this code? Well, it’s not amazing, but it’s not terrible either. The complexity of each step in the loop is probably somewhere around O(n log(n)) depending on exactly what the database does to it. The worst case number of queries is the size of the largest component: It’s obviously an upper bound as the number groups decreases by at least one each time; In order to see that it’s achieved, consider an extension of the 1-2-3 example where we have a graph 1-2-…-n. Then at each stage what happens is that we end up with components [1], [2], …, [n] -> [1, 2], [3], …, [n] -> [1, 2, 3], [4], …, [n], etc, taking n steps to terminate. On the other hand, a complete graph terminates in one step.
I suspect that the expected run time is O(log(n)), with each part of the component chosen approximately doubling in size each time, but I confess to not actually having bothered to run the maths: For our particular use case at the moment this is fine – it turns out the graph we’re considering is fairly sparse and tends to have small components, so for the moment this is more than fast enough. On the other hand, it would be nice to have a better guaranteed time, so if anyone has a smarter approach I’d love to hear it.
Anyway, here’s some sample code that ties all of this together: http://code.trampolinesystems.com/components.rb
Posted in Coding | 12 Comments »
removing global fixtures for ruby tests
By mccraig | Wednesday, July 2nd, 2008
global fixtures are evil, but we’ve got a bunch of unit tests depending on them, so we still need them aroundhere’s a neat [and generally fast, though a degenerate O(#tables^2) case is possible] way of deleting all fixtures without invoking db dependent ways of ignoring foreign-key constraints, and without loading all the objects into memory :
classes = ActiveRecord::Base.connection.tables.map {|t|
t.singularize.camelize.constantize rescue nil
}.compact.reject{|cls| !cls.ancestors.include?(ActiveRecord::Base)}
while classes.size > 0
classes = classes.select{|c|
begin
c.delete_all
false
rescue
true
end
}.reverse!
end
Posted in Coding | 1 Comment »
Open Visualisation Workshop at the Trampery
By jan | Wednesday, May 21st, 2008
Trampoline Systems is hosting a visualisation workshop this coming Saturday, 24th May, organised by the Open Knowledge Foundation. Come along if you’re interested in open source visualisation technologies (Prefuse, Flare etc.). The goal is to have a very informal setting to talk about various aspects of visualising data. Find out more in the official announcement.Hope to see you here!
Posted in Coding | No Comments »
Creating DMG Files Without MacOS X
By jon | Monday, May 19th, 2008
I’ve put together a script for creating DMG files without using OS X…it requires Linux, I’ve tested it on Kubuntu 7.10 but it should work on anything recent.
Run the following commands:
# This gets and builds a patched version of Apple's diskdev_cmds package which will work on Linux wget http://www.mythic-beasts.com/resources/appletv/mb_boot_tv/diskdev_cmds-332.14.tar.gz wget http://www.ecl.udel.edu/~mcgee/diskdev_cmds/diskdev_cmds-332.14.patch.bz2 tar xzf diskdev_cmds-332.14.tar.gz bunzip2 -c diskdev_cmds-332.14.patch.bz2 | patch -p0 cd diskdev_cmds-332.14 make -f Makefile.lnx # Create symlinks to the mkfs and fsck commands for HFS+ sudo cp newfs_hfs.tproj/newfs_hfs /sbin/mkfs.hfsplus sudo cp fsck_hfs.tproj/fsck_hfs /sbin/fsck.hfsplus # Get and enable the hfsplus kernel module sudo apt-get install hfsplus sudo modprobe hfsplus Now that's done, you can use the following handy bash script (must be run as root) I've written to create a DMG file which contains the contents of a directory you specify on the command line. #!/bin/bash # DMG Creation Script # Usage: makedmg <imagename> <imagetitle> <imagesize (MB)> <contentdir> # # imagename: The output file name of the image, ie foo.dmg # imagetitle: The title of the DMG File as displayed in OS X # imagesize: The size of the DMG you're creating in MB (Blame Linux for the fixed size limitation!!) # contentdir: The directory containing the content you want the DMG file to contain # # Example: makedmg foo.dmg "Script Test" 50 /home/jon/work/scripts/content # # Author: Jon Cowie # Creation Date: 02/04/2008 if [ ! $# == 4 ]; then echo "Usage: makedmg <imagename> <imagetitle> <imagesize (MB)> <contentdir>" else OUTPUT=$1 TITLE=$2 FILESIZE=$3 CONTENTDIR=$4 USER=`whoami` TMPDIR="/tmp/dmgdir" if [ ${USER} != "root" ]; then echo "makedmg must be run as root!" else echo "Creating DMG File..." dd if=/dev/zero of=${OUTPUT} bs=1M count=$FILESIZE mkfs.hfsplus -v "${TITLE}" ${OUTPUT} echo "Mounting DMG File..." mkdir -p ${TMPDIR} mount -t hfsplus -o loop ${OUTPUT} ${TMPDIR} echo "Copying content to DMG File..." cp -R ${CONTENTDIR}/* ${TMPDIR} echo "Unmounting DMG File..." umount ${TMPDIR} rm -rf ${TMPDIR} echo "All Done!" fi fi
Hope it’s useful!
Tags: dmg, Mac, os x, software
Posted in Coding | 6 Comments »
@media Ajax 2007
By mike | Wednesday, September 5th, 2007
I have the honour and terror of presenting at @media Ajax on home turf this November. It’s a privilege to be speaking alongside the likes of Brendan Eich (creator of Javascript), Douglas Crockford (inventor of JSON), John Resig (JQuery lead) and about a dozen other top dogs.In a lineup like that I clearly can’t talk about nuts and bolts Javascript. Instead I’m taking a slightly unusual tack for me: revelations. Since Ajax came along my job has changed in ways I wouldn’t have predicted. Technically I’m a flavour of designer yet after many years of specialising I’ve found myself having to skill up again.
- To keep a handle on what the rest of the team produce I’ve become a testing fanatic;
- I’ve had to go back and relearn how to program – not to necessarily produce back-end code but to understand what the real implications of my design decisions are;
- I’ve been converted to Agile practices as a means of effective collaboration.
None of these things are traditionally within the remit of ‘design’ but they all feed into producing a successful app. To try and describe these changes and what I’ve done about them I will be presenting But I’m a Bloody Designer! on the first day, straight after the keynote by the Ajaxians.
So, the lineup’s great, it’s in London. @media Ajax: coming soon. Say hello if you decide to come…
Posted in Coding | 1 Comment »
Springy 0.3 released
By jan | Thursday, August 2nd, 2007
No big changes this time, mainly compatibility fixes for JRuby 1.0. It is now also possible to build the project using Maven, for those too afraid to use rake. Documentation and code for springy are available here.I’m also happy to announce that Craig Walls, the author of “Spring in Action”, is going to talk about Springy as part of his “Spring Cleaning: Tips for Managing XML Clutter” talk at this year’s No Fluff Just Stuff series of events as well as the Spring Experience 2007 in Florida.
Posted in Coding | 2 Comments »
Tracing file access on Mac OSX
By Tomasz | Tuesday, July 24th, 2007
I just started working at Trampoline Systems yesterday. Some of you might know my blog already (the one with kitten pics and programming rants). The first thing I did was configuring all the software on my new MacBook. Most of it went all right without any problems, but a few gems didn’t want to install, complaining aboutjni.h missing. jni.h was on the box, so the problem was basically that the gems were looking for it in a wrong place. Problems like that can usually be solved by strace -e trace=file gem install whatever | grep 'jni.h' and a symlink. Not this time, because Macs does have strace. There’s something called ktrace, but it didn’t seem to provide information I needed.
Full strace may be impossible without heave kernel hacking, but it’s not hard to write a simple file access tracing utility. The one I coded uses LD_PRELOAD-like trick to override open and stat functions in libc. These two usually provide most of the usuful information about file access, and if anything more was needed the utility can be easily extended.
The library is very simple:
#include <unistd.h>
#include <stdio.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <dlfcn.h>
#include <errno.h>
#include "errno_names.h"
/* Probably not thread-safe. I have no idea what's thread-interpretation of
dlsym(RTLD_NEXT, ...), even assuming atomic assignment to variables
*/
int (*real_open)(const char *path, int flags, mode_t mode) = 0;
int (*real_stat)(const char *path, struct stat *sb) = 0;
void report_errno(int retval)
{
if(retval == -1 && errno >= 0 && errno <= ELAST) {
fprintf(stderr, " (%s)n", errno_names_table[errno]);
} else {
fprintf(stderr, "n");
}
}
int open(const char *path, int flags, mode_t mode)
{
int retval;
if(real_open == 0)
real_open = dlsym(RTLD_NEXT, "open");
retval = real_open(path, flags, mode);
fprintf(stderr, "open("%s", ...) = %d", path, retval);
report_errno(retval);
return retval;
}
int stat(const char *path, struct stat *sb)
{
int retval;
if(real_stat == 0)
real_stat = dlsym(RTLD_NEXT, "stat");
retval = real_stat(path, sb);
fprintf(stderr, "stat("%s", ...) = %d", path, retval);
report_errno(retval);
return retval;
}
Fist of error names in errno_names.h is generated by Rakefile, which looks like that (stripping the testing-related parts):
desc "Build trace_file_access library"
task :build => "libtfa.dylib"
file "libtfa.dylib" => ["errno_names.h", "libtfa.c"] do
sh "gcc", "-O6", "libtfa.c", "-dynamiclib", "-fPIC", "-o", "libtfa.dylib"
end
file "errno_names.h" do
errnos = []
File.open("/usr/include/sys/errno.h").each{|line|
errnos[$2.to_i] = $1 if line =~ /A#defines+(E[A-Z]+)s+(d+)/
}
File.open("errno_names.h", "w") {|fh|
fh.puts "/*n * This code is automatically generated by the Rakefilen * Do not modify by handn */"
fh.puts "static const char *errno_names_table [] = {"
errnos.each_with_index{|name, i|
fh.puts " "#{name}", /* #{i} */"
}
fh.puts "};"
}
end
Finally trace_file_access command script sets up the tracing, with the results looking like this: #!/usr/bin/env ruby
ENV["DYLD_FORCE_FLAT_NAMESPACE"] = "1"
ENV["DYLD_INSERT_LIBRARIES"] = "./libtfa.dylib"
exec *ARGV
$ ./trace_file_access ruby -e 'require "rational"'
open("/dev/urandom", ...) = 3
stat("/opt/local/lib/ruby/site_ruby/1.8/rational.rb", ...) = -1 (ENOENT)
stat("/opt/local/lib/ruby/site_ruby/1.8/rational.bundle", ...) = -1 (ENOENT)
stat("/opt/local/lib/ruby/site_ruby/1.8/i686-darwin8.9.3/rational.rb", ...) = -1 (ENOENT)
stat("/opt/local/lib/ruby/site_ruby/1.8/i686-darwin8.9.3/rational.bundle", ...) = -1 (ENOENT)
stat("/opt/local/lib/ruby/site_ruby/rational.rb", ...) = -1 (ENOENT)
stat("/opt/local/lib/ruby/site_ruby/rational.bundle", ...) = -1 (ENOENT)
stat("/opt/local/lib/ruby/vendor_ruby/1.8/rational.rb", ...) = -1 (ENOENT)
stat("/opt/local/lib/ruby/vendor_ruby/1.8/rational.bundle", ...) = -1 (ENOENT)
stat("/opt/local/lib/ruby/vendor_ruby/1.8/i686-darwin8.9.3/rational.rb", ...) = -1 (ENOENT)
stat("/opt/local/lib/ruby/vendor_ruby/1.8/i686-darwin8.9.3/rational.bundle", ...) = -1 (ENOENT)
stat("/opt/local/lib/ruby/vendor_ruby/rational.rb", ...) = -1 (ENOENT)
stat("/opt/local/lib/ruby/vendor_ruby/rational.bundle", ...) = -1 (ENOENT)
stat("/opt/local/lib/ruby/1.8/rational.rb", ...) = 0
open("/opt/local/lib/ruby/1.8/rational.rb", ...) = 3
stat("/opt/local/lib/ruby/site_ruby/1.8/rational.rb", ...) = -1 (ENOENT)
stat("/opt/local/lib/ruby/site_ruby/1.8/i686-darwin8.9.3/rational.rb", ...) = -1 (ENOENT)
stat("/opt/local/lib/ruby/site_ruby/rational.rb", ...) = -1 (ENOENT)
stat("/opt/local/lib/ruby/vendor_ruby/1.8/rational.rb", ...) = -1 (ENOENT)
stat("/opt/local/lib/ruby/vendor_ruby/1.8/i686-darwin8.9.3/rational.rb", ...) = -1 (ENOENT)
stat("/opt/local/lib/ruby/vendor_ruby/rational.rb", ...) = -1 (ENOENT)
stat("/opt/local/lib/ruby/1.8/rational.rb", ...) = 0
open("/opt/local/lib/ruby/1.8/rational.rb", ...) = 3
open("/opt/local/lib/ruby/1.8/rational.rb", ...) = 3
open("/opt/local/lib/ruby/1.8/rational.rb", ...) = 3
open("/opt/local/lib/ruby/1.8/rational.rb", ...) = 3
Posted in Coding | 2 Comments »
Java and Rails integration with GoldSpike
By jan | Friday, July 13th, 2007
While trying to create a unified testing framework (shared between Rails and our Java backend code) I came across ActiveRecordJDBC which is an adapter to use JDBC drivers with JRuby on Rails. It works fine, although it can be a bit complicated to get a DRY database.yml configuration. The goal is to get rid of our dbunit/manually crafted database tests on the Java side by using ActiveRecord fixtures. After some research I found out about the Rails integration project (now called GoldSpike), which tries to make it easy to deploy a Rails app on a Java servlet container such as jetty. As far as I know Thoughtworks uses this approach to deploy their new product, Mingle. GoldSpike is under constant development but it is already usable, although a few patches were required. After everything was set up, a simple
$ rake war:standalone:create
Reading user configuration
Assembling web application
Adding Java library commons-pool-1.3
Adding Java library rails-integration-1.1.1
Adding Java library activation-1.1
Adding Java library mysql-connector-java-5.0.5
Adding Java library bcprov-jdk14-124
Adding Java library jruby-complete-1.0
Adding web application
Adding Ruby gem ActiveRecord-JDBC version 0.4
Creating web archive
creates a .war file which can be directly deployed to a container. I’ve never been a big fan of war files, but in the context of JRuby+Rails it makes perfect sense, because Ruby and all the required gems can be packaged up in a single file. No need to worry about missing gems, C bindings or wrong versions of the installed software, all you need to deploy is Java and jetty (which is pretty lightweight).
Not quite sure what the implications of this are, but it seems like a good alternative to mongrel/CRuby deployments. Performance-wise it looks good, too, though we haven’t done any benchmarking. GoldSpike doesn’t have a project homepage yet, but you can find it on the jruby-extras project page. Let’s wait and see what this will mean for the adoption of Rails in corporate environments (Oracle is currently looking for Rails developers to join the Enterprise 2.0 team, for example).
Posted in Coding | No Comments »








