A SPARQLing Benchmarking Adventure

[Image: zebrafish (Brachydanio rerio)]

As you can see from the pile of triple store/RDBMS related posts below, I’ve recently moved out of my comfort zone to explore a new territory: Linked data, SPARQL, and OBDA (Ontology-Based Data Access). Last year, the FishDelish project, which was steered by researchers at Manchester University, created a linked data version of FishBase, a large database containing information about most of the world’s fish species (around 30,000). Access to such a large amount of (nice and real) data offered a good opportunity for further usage, and so we set out to generate a cross-system performance benchmark using the FishBase data and queries. While the resulting paper (which I co-authored with Bijan Parsia, Sandra Alkiviadous, David Workman, Rafael Goncalves, Mark Van Harmelen, and Cristina Garilao) wasn’t nearly as comprehensive as I had wished, I did learn a lot on the way which didn’t make it into the paper. So here are a few thoughts about performance benchmarking of data stores, including a wish list for my “ideal benchmarking framework”.

Performance benchmarking in Java: It’s complicated.

Measuring execution time of Java code in Java code is known to be tricky when you’re moving in sub-second territory. The JVM requires special attention, such as a warm-up phase and repeated measurements to take into account garbage collection. A lot has been written about this topic, so I shall refer you to this excellent post on “Robust Java Benchmarking” by Brent Boyer. On my wish list goes a warm-up phase which runs until the measurements have stabilised (rather than for a fixed number of runs).
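A warm-up that stops once the timings settle could look roughly like the sketch below. It is not taken from BSBM or the MUM-benchmark code: the class name AdaptiveWarmUp, the sliding window, and the use of the coefficient of variation as the “stabilised” criterion are all assumptions made for illustration.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Hypothetical helper, not part of any existing benchmark framework.
final class AdaptiveWarmUp {

    /** Coefficient of variation (standard deviation / mean) of the samples. */
    static double coefficientOfVariation(Iterable<Long> samples) {
        double sum = 0, sumSq = 0;
        int n = 0;
        for (long s : samples) {
            sum += s;
            sumSq += (double) s * s;
            n++;
        }
        double mean = sum / n;
        double variance = Math.max(0, sumSq / n - mean * mean);
        return mean == 0 ? 0 : Math.sqrt(variance) / mean;
    }

    /**
     * Runs the task until the last `window` timings vary by less than
     * `threshold`, or until `maxRuns` is reached. Returns the number of runs.
     */
    static int warmUp(Runnable task, int window, double threshold, int maxRuns) {
        Deque<Long> recent = new ArrayDeque<>();
        for (int run = 1; run <= maxRuns; run++) {
            long start = System.nanoTime();
            task.run();
            recent.addLast(System.nanoTime() - start);
            if (recent.size() > window) {
                recent.removeFirst(); // keep only the most recent timings
            }
            if (recent.size() == window
                    && coefficientOfVariation(recent) < threshold) {
                return run;
            }
        }
        return maxRuns; // never stabilised; the caller may want to flag this
    }

    public static void main(String[] args) {
        // Stand-in workload; a real benchmark would execute the query mix here.
        Runnable task = () -> {
            StringBuilder sb = new StringBuilder();
            for (int i = 0; i < 10_000; i++) {
                sb.append(i);
            }
        };
        int runs = warmUp(task, 10, 0.05, 1_000);
        System.out.println("Warm-up finished after " + runs + " runs");
    }
}
```

The cap on the number of runs matters: on a noisy machine the timings may never drop below the threshold, and the warm-up should still terminate.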

Getting the test data & queries

That’s an interesting one. There seem to be two kinds of SPARQL benchmarks: those that use an existing dataset and fixed queries taken from a real-world application, perhaps with some method of scaling the data (e.g. the DBpedia benchmark), and those that artificially generate test data and queries based on some “realistic” application (e.g. LUBM, BSBM). Either way, we are tied to the data (of varying size) and queries. For our paper (and further, for Sandra’s dissertation), we tried to add another option to this mix: A framework that could turn any kind of existing dataset into a benchmark for multiple platforms.

The framework (we called it MUM-benchmark, Manchester University Multi-platform benchmark) requires three things: A datastore (e.g. a relational DB) with the data, a set of queries, and a query mix. Each query is made up of a) a parameterised query (i.e. a query which contains one or more parameters) and b) a set of queries to query the database and obtain parameter values. In our implementation, the queries are held in a simple XML file – one for each query type (e.g. SPARQL, SQL). If there is an existing application for the data, the parameterised queries can simply be taken from the most frequently executed queries. In the case of FishBase, for example, we reverse-engineered queries to query for a fish species by common name, generate the species page, etc.
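To make the idea of a parameterised query concrete, here is a minimal sketch of filling a template with parameter values. The %name% placeholder syntax, the class name QueryTemplate, and the table and column names are assumptions for illustration; they are not the actual MUM-benchmark XML format or the FishBase schema.

```java
import java.util.Map;
import java.util.regex.Pattern;

// Hypothetical sketch of a parameterised query; placeholder syntax is assumed.
final class QueryTemplate {

    private static final Pattern PLACEHOLDER = Pattern.compile("%\\w+%");

    private final String template;

    QueryTemplate(String template) {
        this.template = template;
    }

    /** Replaces every %name% placeholder with the value bound to "name". */
    String instantiate(Map<String, String> params) {
        String query = template;
        for (Map.Entry<String, String> e : params.entrySet()) {
            query = query.replace("%" + e.getKey() + "%", e.getValue());
        }
        // Fail loudly rather than send a half-filled query to the store.
        if (PLACEHOLDER.matcher(query).find()) {
            throw new IllegalStateException("Unbound parameter in: " + query);
        }
        return query;
    }

    public static void main(String[] args) {
        // Table and column names are made up for this example.
        QueryTemplate sql = new QueryTemplate(
                "SELECT * FROM species WHERE common_name = '%commonName%'");
        System.out.println(sql.instantiate(Map.of("commonName", "zebrafish")));
    }
}
```

In a real setup the parameter values would come from the second set of queries (the ones that fetch valid values from the datastore), and they would need escaping appropriate to the target query language, which this sketch leaves out.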

Additionally, I hacked BSBM to work with various datastores and added a standard SQL connection and an OBDA connection. While we have only tested our framework with the Quest OBDA system (with a FishBase ontology written by Sandra), this should work for all other OBDA systems, too (and if not, it’s fairly straightforward to add another type of connection).

One aspect which we haven’t had the time to implement is scaling the FishBase data by species. Ideally, we want a simple mechanism to specify the number of species we want in our data and get a smaller dataset. If we take this one step further, we could also artificially generate species based on heuristics from the existing data in order to increase the total number of species beyond the existing ones.

To my wish list, I add cross-platform benchmarks, generating a benchmark from existing data, scalable datasets, and easy extension by additional queries.

What to measure?

Query mixes seem to be the thing to go for when benchmarking RDF stores. A query mix is simply an ordered list of (say, 20-25) query executions which emulates “typical” user behaviour for an application (e.g. in the “explore use case” of BSBM: find products for given features, retrieve information about a product, get a review, etc.). This query mix can either be an independent list of queries (i.e. the parameter values for each query are independent of each other) or a sequence, in which the parameter value of a query depends on previous queries. As the latter is obviously a lot more realistic, I shall add it to my wish list.

For the FishDelish benchmark, we were kindly given the server logs for one month’s activity on one of the FishBase servers, from which we generated a query mix. It turned out that on average, only 5 of the 24 queries we had assembled were actually used frequently on FishBase, while the others were hardly seen at all (as in, 4 times out of 30,000 per month). Since it was not possible to include these in the query mix without deviating significantly from reality, we generated another “query mix” which would simply measure each query once. As the MUM-benchmarking framework wouldn’t do sequencing at the time, there was no difference between a realistic query mix and a “measure all queries once” type mix.

Finally, the third approach would be a “randomised weighted” mix based on the frequency of each query in the server logs. The query mix contains the 5 most frequent queries, each instantiated n times, where n is the (hourly, daily) frequency of the query according to the server access logs.
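A “randomised weighted” mix of this kind can be sketched in a few lines. The class name WeightedQueryMix, the query names, and the frequencies below are invented for illustration and are not the real FishBase log figures; the fixed seed is an assumption so that a mix can be reproduced across systems.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Random;

// Hypothetical sketch; not part of BSBM or the MUM-benchmark code.
final class WeightedQueryMix {

    /**
     * Adds each query name n times, where n is its logged frequency,
     * then shuffles the list with a seeded random generator.
     */
    static List<String> build(Map<String, Integer> frequencies, long seed) {
        List<String> mix = new ArrayList<>();
        for (Map.Entry<String, Integer> e : frequencies.entrySet()) {
            for (int i = 0; i < e.getValue(); i++) {
                mix.add(e.getKey());
            }
        }
        Collections.shuffle(mix, new Random(seed));
        return mix;
    }

    public static void main(String[] args) {
        // Made-up query names and hourly frequencies.
        Map<String, Integer> hourly = new LinkedHashMap<>();
        hourly.put("speciesByCommonName", 12);
        hourly.put("speciesPage", 8);
        hourly.put("genusPage", 3);
        System.out.println(build(hourly, 42L));
    }
}
```

Using the same seed for every system under test means each store sees the queries in the same order, so differences in the results are not caused by a lucky or unlucky ordering.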

How to measure!?

Now we’re back to the “robust Java benchmarking” issue. It is clear that we need a warm-up phase until the measurements are stabilised, and repeated runs to obtain a reliable measurement (e.g. to take into account garbage collection which might be triggered at any point and add a significant overhead to the execution time).

In the case of the MUM-benchmark, we generate a query set (i.e. “fill in” parameter values for the parameterised queries), run the query mix 50 times as a warm-up, then run the query mix several hundred times and measure the execution time. This is repeated multiple times with distinct query sets (in order to avoid bias caused by “good” or “bad” query parameter values). As you can see, this method is based on “run the mix x times” rather than “complete as many runs as you can in x minutes (or hours)”. This worked out okay for our FishBase queries, as the run times were reasonably short, but for any measurements with significantly longer (or simply unpredictable) execution times, this is completely impractical. I therefore add “give the option to measure runs per time” (rather than fixed number) to my wish list.
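The “runs per time” option could be sketched as follows. This is an illustration only: the class name TimeBoxedRunner and its method are assumptions, and the empty Runnable stands in for executing a real query mix.

```java
// Hypothetical sketch of a time-boxed measurement loop.
final class TimeBoxedRunner {

    /**
     * Executes the query mix repeatedly until the time budget is used up
     * and returns the number of completed runs. A run that is in progress
     * when the budget expires is allowed to finish, so at least one run
     * always completes.
     */
    static int runFor(Runnable queryMix, long budgetNanos) {
        long start = System.nanoTime();
        int completed = 0;
        do {
            queryMix.run();
            completed++;
        } while (System.nanoTime() - start < budgetNanos);
        return completed;
    }

    public static void main(String[] args) {
        long budgetNanos = 500_000_000L; // half a second, for the demo
        int runs = runFor(() -> { /* execute the query mix here */ }, budgetNanos);
        double perSecond = runs / (budgetNanos / 1e9);
        System.out.println(runs + " runs, roughly " + perSecond + " runs/s");
    }
}
```

Note that a single very slow run can still overshoot the budget; interrupting a query in flight would need support from the datastore connection and is beyond this sketch.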

The results

This was something I found rather pleasant about the BSBM framework. The benchmark conveniently generates an XML results file for each run, with summary metrics of the entire query mix, and metrics for each individual query. As our query mix was run with different parameters, I added the complete query string to the XML output (in order to trace errors, which came in quite handy for one SPARQL query where the parameter value was incorrectly generated). The current hacky solution generates an XML file for each query set, which are then aggregated using another bit of code – eventually the output format should be a little more elegant than dozens of XML files (and maybe spit out a few graphs while we’re at it).

Conclusions

While modifying the BSBM framework I put together the above “wish list” for benchmarking frameworks, as there were quite a few things that made performing the benchmark unnecessarily difficult. So for the next version of the MUM-benchmarking framework, I will take these issues into account. Overall, however, the whole project was extremely interesting – setting up the triple stores, generating the queries, tailoring (read: hacking) BSBM to work across multiple platforms (a MySQL DB, a Virtuoso RDF store, a Quest OBDA system over a MySQL db) and figuring out the query mixes.

Oh. And I learned a lot about fish. The image shows a zebrafish, which was our preferred test fish for the project.

[cc-licensed image by Marrabio2]

JavaScript for Cats

We all know the internet was invented for the sole purpose of sharing funny images and videos of cats, right? So it’s about time our furry friends (and perhaps their humans, too) learn a little bit about coding. JavaScript for Cats introduces the basic concepts of JavaScript with some simple examples and good explanations. And cat jokes, of course.

It’s easy to code along to the tutorial, as it simply uses the Google Chrome browser’s JavaScript console. The site is in its early stages and currently contains only the very basics of JS, but it looks very promising, and there are a few links to other great JS resources at the bottom.

Go to “JavaScript for Cats”

Drink! It’s for charity!*

The Black Lion pub in Salford has been around for over 130 years, and according to their website every famous person in the history of fame has already enjoyed a tipple there. Unfortunately the pub was broken into last night – here’s the email I just received:

Last night the Black Lion was broken into, 3 youths smashed through a triple bolted front door and then smashed up a few shelves before making off with over £1000 worth of spirits and a small safe under the bar.

They did this and then stole the Help for Heroes Official charity pots we have on the bar, which had a hundred odd quid in it from our generous customers – as a small social enterprise this is gutting for us, and watching it on CCTV made us all sick (esp when they ripped the H4H pots from the bar).

Our insurance company said they would not pay out as its not worth it, our excess is over £1000 and our premium would go up, already this month we have had to battle Salford city council on business rates and enterprise, the owners of the building, have put beer up! – this is hard for us… we need your support.

If you are out drinking tonight or this weekend, please pop into the The Black Lion and help boost the morale of the staff and help us build the business back up, we live week to week! To loose £1000 like that could cost jobs :(- what hurts the most is the charity pots and the recklessness of these youths, one year after the riots, please share and support your local pub in an hour of need:

Black Lion, Chapel Street, Salford, M35BZ
http://blacklionsalford.tumblr.com
– please share this and RT where possible –

So, you know what to do, right? Drink! It’s for charity!

* Working title of this post: Drinker, drink faster!

[Picture by Robert Wade]

MySQL workbench on Mac OS “Error 1046 – No database selected”

While playing around with the triple stores and a MySQL db on our Macs, I ran into a little problem when trying to change a column name in one of the tables in the MySQL database. The error message I got from the MySQL workbench was mildly confusing:

Error 1005: Can’t create table ‘families’. (errno: 13) […]

Error 1046: No database selected

Since the query was automatically generated by the MySQL workbench, I presumed it had to be correct. A bit of googling, and this friendly chap’s forum post helped me find the solution to the problem: Access rights. As always. Here’s how you allow the MySQL user to write to the MySQL directory:

Find the group and name of the MySQL user in /etc/group and /etc/passwd respectively:

less /etc/group
less /etc/passwd

On our Mac OS X 10.7 Mac mini with a default MySQL 5.5 install, the user name and group are both _mysql.

Then set the owner and mode of the MySQL install directory:

cd /usr/local
sudo chown -R _mysql:_mysql mysql-5.5.25-osx10.6-x86_64/
sudo chmod -R 755 mysql-5.5.25-osx10.6-x86_64/

This may be obvious, but make sure you chown/chmod the actual MySQL directory, not the symlink in /usr/local which is named mysql.

That’s it – you should be able to modify your database now.

How I went from Manchester to Sicily and back – via Bury

Got me one of them fancy retro picture apps on my phone now, all retro stylee here!

Anyway.

One rainy Saturday morning we were working our way through our adventure time stack of leaflets, flyers and maps which we have accumulated over the past year or so, looking for something to do on this rather miserable day. For a fraction of a second, the thick blanket of clouds opened up just about enough to let through a single ray of sunlight, lighting up the leaflet I was holding in my hand. That very same moment, the church bells next door started to ring their most beautiful song, and an elating, almost euphoric sensation pulsated through my body. When I looked down at the leaflet, which was still lit up by that single ray of light, I knew we had found our destination for the day: Bury Market.*

And it was… well, big. Very big. A paradise for anyone who really, really needs several pairs of slippers. And meat. Lots of meat. In the food bit, there were fewer fancy food stalls with cake (CAKE.), chocolates, deli stuff, the usual, than I had hoped for, and the few fruit and veg stalls weren’t too convincing. Which, of course, did not stop me from buying my way across the various food stalls at the market. But then, just as I was wandering through a remote corner of the market, trying to find something lunch-able, I had the second epiphany of the day. All of a sudden, I could hear a quiet, friendly voice behind me: “Please… eat this. If you eat here, you will be very, very lucky today!”

“Well, I suppose if the food already starts talking to me, it has to be a lucky day” I thought and turned around. Three faces smiled at me, framed by an array of food and little signs. “We only just opened today, you should really eat something we made… it will be your lucky day!” one of the faces said to me. I quickly scanned the food on offer, just to spot something familiar looking: A small, bread crumb covered ball – an “arancina”, a deep fried risotto ball, which I had just discovered on a trip to Rome the week before. As I am unable to say no when offered food, particularly not by friendly faces, I accepted the offer for food and quickly engaged in a little chat while waiting for the “arancina” to finish its bath in the deep fat fryer. Turned out the stall owners of “La Putia” were incredibly friendly Sicilians with a love and a lot of enthusiasm for food, who were more than happy to talk about Sicilian specialities, Italian food in general, ice cream and tiramisu in particular, and which Italian restaurant in Manchester was the best (apparently none is proper Italian despite the Italian chefs and owners, but San Carlo comes close). I walked away with a delicious little crunchy-creamy risotto and spinach ball and a phone number for home made tiramisu, which happily joined the blocks of cheese, whimberry pie and fancy cordial in my bag. A lucky day indeed!

* In case you’re wondering: the tram to Bury was on time, the tram back into Manchester was massively delayed. That’s 50% of my Metrolink journeys this month delayed, good work TfGM! Oh and, by the way, the new black bus stop signs are ridiculously difficult to spot. Who thought “hey, we’ll design some bus stop signs that blend in smoothly with the urban environment” was a good idea?

Fancy some pussy with your bread rolls?

So, I went to my local Co-op the other day to buy lunch items, when this charming young lady greeted me from the bottom shelf of the newspaper stand:

Seriously? I can just about cope with the ubiquitous boob, but this is a little too much for me.* I don’t even mind that whoever was stacking the newspaper shelves at the Co-op this morning put this at convenient eye level for children; I’m more concerned about the fact that someone at the Sunday Sport looked at the picture and said “Yup, this makes a perfectly acceptable cover image for our paper.” – DUDE! Calling yourself “funny” doesn’t mean you’ve got the license to print smut.

On a side note, I tweeted at the Co-op and they replied straight away and said they were going to “look into it”. Well done and thanks!

* Good use of the empty space between her legs, I’m sure Emma Watson will be pleased to see her face crowned by a vagina.

Installing OpenRDF Sesame on a Mac Mini

And now for the third in a row of triple-store installations. This time it’s Sesame, an open source datastore for RDF and relational data. Thankfully, due to the minimal requirements and the pretty good documentation, the installation was quick and much less painful than expected.

Hardware: Apple Mac Mini (running Mac OS X Lion 10.7), out of the box

I mostly followed the instructions given on http://www.openrdf.org/doc/sesame2/users/. They explain stuff quite well, so it was actually rather enjoyable to read. You can also find a diagram of the Sesame components, which is helpful. Study and memorise!

[Diagram: Sesame architecture]

1) Set up environment: Logging

  • Download SLF4J (1.6.6 at time of writing) from http://www.slf4j.org/download.html to get the correct bridge file (slf4j-log4j12-1.6.6.jar) to work with log4j.
  • Set the Java class path to use the log4j bridge jar file by adding the following to ~/.profile:
    CLASSPATH=/Users/fishdelish/fishbench/slf4j-1.6.6/slf4j-log4j12-1.6.6.jar
2) Set up Tomcat server

(Sesame doc mentions 5.5 or 6.0, so I went with 6.0 instead of 7.0 just to be on the safe side)

3) Sesame server / workbench installation

>> Workbench is accessible on http://127.0.0.1:8080/openrdf-workbench

Sesame should be up and running now!

The default data directory on Mac OS X is /Users/fishdelish/Library/Application Support/Aduna/OpenRDF Sesame

4a) Create a repository and import RDF data using Sesame console

Create a new store: either in-memory or native. I chose native due to the relatively small RAM on our machines: “The native store uses on-disk indexes to speed up querying.”

In the console, type:

  • create native. (then fill in id and description)
  • open testfish.
  • load /Users/fishdelish/fishbench/testfish.n3.

To exit the console: use exit. or quit.

4b) Create a repository and import RDF data using the Java API

Or do the same using the Sesame Java API. There is a good explanation of the Java API in section 8.2 on http://www.openrdf.org/doc/sesame2/users/ch08.html – I’m just giving you the rough outlines of the code, without error handling etc.

Create repository:

File dataDir = new File("/path/to/datadir/");
Repository myRepository = new SailRepository(new NativeStore(dataDir));
myRepository.initialize();

Import data:

File file = new File("/path/to/example.rdf");
String baseURI = "http://example.org/example/local";
RepositoryConnection con = myRepository.getConnection();
try {
    con.add(file, baseURI, RDFFormat.RDFXML);
} finally {
    con.close(); // always release the connection, even if the import fails
}

5) SPARQL query time!

Connect to repository using the Java API:

String sesameServer = "http://example.org/sesame2";
String repositoryID = "example-db";
Repository myRepository = new HTTPRepository(sesameServer, repositoryID);
myRepository.initialize();

Then simply query the Repository object, as described in the documentation.

That’s it. As with all instructions, I can’t guarantee that it will work correctly, and I have yet to stress test my setup as well.

Installing Virtuoso Open Source on a Mac Mini

Part 2 of the “Things PhD students do on a Saturday night” series: Having successfully installed 4store on our brand new Mac Mini running OSX 10.7 (Lion), I went on to tackle the next candidate for our triple-store-o-rama: Virtuoso (Open Source Edition).

I mostly followed the instructions on the Virtuoso wiki, which are not quite as nice as the 4store ones, but managed to get me through the installation process without major incidents: http://virtuoso.openlinksw.com/dataspace/dav/wiki/Main/VOSMake

A short and clear overview of the installation process can be found on Kingsley Idehen’s blog.

Here we go:

Hardware: Apple Mac Mini (running Mac OS X Lion 10.7), out of the box

Install dependencies

If you’ve previously installed 4store, some of these might already be installed. You’ll also need fink, which I’ve described in the previous post. Using fink install, install the following libs:

  • autoconf
  • automake
  • libtool
  • flex
  • bison (which will also install gawk)
  • gawk
  • gperf
  • m4
  • make
  • OpenSSL

If one of them won’t install, check with fink list pkgname what the alternative package name is and whether it’s already installed. If it’s already installed, this will be indicated by an “i” in the first column of the results that fink list returns.

Install Virtuoso

1) Download Virtuoso Open Source version:
curl -O -L http://downloads.sourceforge.net/project/virtuoso/virtuoso/6.1.5/virtuoso-opensource-6.1.5.tar.gz
(-L is necessary to ensure curl follows the redirect to the respective mirror on SourceForge; it took me a while to figure that out…)

2) Unpack the tarball:
tar -xvzf virtuoso-opensource-6.1.5.tar.gz

3) Set compiler flags (check out the Make FAQ for a list of settings on other systems)

  • CFLAGS="-O -m64 -mmacosx-version-min=10.7"
  • export CFLAGS

4) Configure and install:

  • ./configure
  • make
  • sudo make install (the instructions say it installs to /usr/local/ by default, the resulting path is /usr/local/virtuoso-opensource)

5) Add the path to the bin directory to the PATH environment variable in ~/.profile:

Open text editor and add:
PATH=$PATH:/usr/local/virtuoso-opensource/bin/

Starting Virtuoso and importing data from a file

1) Add directory which contains data file to virtuoso.ini:
sudo emacs /usr/local/virtuoso-opensource/var/lib/virtuoso/db/virtuoso.ini
>> Add the directory path to the DirsAllowed parameter, e.g. in our case the directory containing /Users/fishdelish/fishbench/tests/testfish.n3

2) Start the Virtuoso server:

  • cd /usr/local/virtuoso-opensource/var/lib/virtuoso/db/
  • sudo virtuoso-t -f (or use sudo virtuoso-t -f & if you want to start it independently from the shell you’re using)
  • (virtuoso-t will read the virtuoso.ini file in this directory)

3) Import data:
(see some information and screenshots here:) http://www.proxml.be/users/paul/weblog/3876f/

Connect to DB to get an SQL prompt:

  • isql <HOST>[:<PORT>] -U username -P password
  • or simply isql 1111 myuser mypassword, this connects to the default port 1111

Import data (from n3 format, otherwise use DB.DBA.RDF_LOAD_RDFXML_MT from RDF/XML)

4) Access via http: 

Shutting down the server
Open SQL prompt and use command SHUTDOWN;

When the server isn’t shut down properly, there might be problems starting up next time. Manually removing virtuoso.lck in the virtuoso/db directory can solve this.