Dumb programming techniques to avoid…

This is somewhat of a post written in frustration while I wait for my computer to finish what’s turned out to be a far more technically challenging problem than I thought!

In Trunk.ly we want to add several new features, but the one I’m working on is “public” tags. The idea here is that you can see every link tagged with a given word, e.g. show me all links tagged with “design”.

The first piece of this is fairly straightforward — loop through all the links we’ve stored, group identical links together, then collect, aggregate and count the tags used across them. Not too bad.

We end up with a URL object collection in which entries look like this (we’re using MongoDB, so this is a BSON / JSON object):

[ { "l" : "http://example.com", "t" : "An example link", "tags" : [ { "tag" : "example", "c" : 200 }, { "tag" : "code", "c" : 103 }, … ] }, {…} ]

That was reasonably trivial; the next challenge was to create lists of related tags, so that when you are looking at the tag ‘code’, the related tags might be something like ‘python’, ‘javascript’, ‘ruby’, etc.

In theory it sounds simple enough. Take a collection of tags from a URL, then for each tag, store it with the related tags. Of course it gets more complicated quickly — for example, we need the count for the related tags because we want to show the strongest matches.

The end goal is something that looks like this:

{ "tag" : "code", "rt" : [ { "tag" : "python", "c" : 324 }, { "tag" : "javascript", "c" : 243 }, {…} ] }

Then we can just look up a tag, find its top related tags, and display them.
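As a rough sketch of that lookup, assuming pymongo and hypothetical names (a "trunkly" database and a "tags.related" collection holding documents shaped like the example above):

# Sketch only: the database and collection names here are assumptions.
from pymongo import MongoClient

db = MongoClient().trunkly

def top_related(tag, limit=10):
    """Fetch a tag's document and return its strongest related tags."""
    doc = db.tags.related.find_one({"tag": tag})
    if doc is None:
        return []
    return sorted(doc["rt"], key=lambda e: e["c"], reverse=True)[:limit]

# e.g. top_related("code") -> [{"tag": "python", "c": 324}, ...]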

Oh boy! This has been a three day nightmare.

The biggest thing to understand is that if I tag something python, ruby, javascript, php, I have to store the following four entries:

php has related tags python, ruby, javascript

ruby has related tags python, php, javascript

python has related tags ruby, php, javascript

javascript has related tags python, php, ruby
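That expansion is simple enough to write down; here is a tiny illustration (the variable names are mine, not Trunk.ly’s):

# Every ordered pair of distinct tags becomes a (tag, related tag) entry.
from itertools import permutations

tags = ["python", "ruby", "javascript", "php"]

related = {}
for tag, other in permutations(tags, 2):
    related.setdefault(tag, []).append(other)

# related["php"] == ["python", "ruby", "javascript"], and so on for the rest.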

Some collections of tags number in the 50s to 100s, and that’s only on one URL — I also need to aggregate across the URLs, so sometimes there could be thousands. My first approach was to update the database directly (sketched below):

  1. Build the related tag collection for the tag.
  2. Look up the tag in the database. If it exists, increment the count for each related keyword.
  3. If it doesn’t exist, write out each related keyword.

Not memory intensive, but oh man, slow! Estimated time to complete ~2 weeks.
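For the record, the first approach looked roughly like the sketch below. This is only a sketch assuming pymongo; the database and collection names are placeholders, not the real Trunk.ly schema.

from pymongo import MongoClient

db = MongoClient().trunkly  # assumed names throughout

def store_related(tags):
    """Steps 1-3 above: for each tag, bump the count of every co-occurring tag."""
    for tag in tags:
        for other in tags:
            if other == tag:
                continue
            # Try to increment an existing related-tag entry...
            result = db.tags.related.update_one(
                {"tag": tag, "rt.tag": other},
                {"$inc": {"rt.$.c": 1}},
            )
            # ...and if it wasn't there, push a new one, creating the tag
            # document itself if needed. Up to two round trips per pair: slow.
            if result.matched_count == 0:
                db.tags.related.update_one(
                    {"tag": tag},
                    {"$push": {"rt": {"tag": other, "c": 1}}},
                    upsert=True,
                )

With up to two round trips per (tag, related tag) pair and thousands of pairs per URL, the two-week estimate isn’t surprising.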

My second attempt: read it all into memory (sketched below). This was much, much quicker.

  1. Grab all the tags from the URLs into an array.
  2. Update an in-memory dictionary, with an embedded dictionary of related tags for each entry, incrementing the counts as we go.
  3. Write the lot out to disk.
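In code, the second attempt boiled down to something like this (again a sketch with assumed names; the real code differed):

from collections import defaultdict, Counter
from pymongo import MongoClient

db = MongoClient().trunkly  # assumed names

# Steps 1 and 2: build tag -> Counter of co-occurring tags, all in memory.
related = defaultdict(Counter)
for url in db.urls.find():
    tags = [t["tag"] for t in url.get("tags", [])]
    for tag in tags:
        for other in tags:
            if other != tag:
                related[tag][other] += 1

# Step 3: write the lot out to disk in the target shape.
for tag, counts in related.items():
    db.tags.related.insert_one({
        "tag": tag,
        "rt": [{"tag": t, "c": c} for t, c in counts.most_common()],
    })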

This was quite fast for the first 50K or so entries into the dictionary; after that it ate all the memory on my PC (8GB) and hung. I noticed it was getting “worse” the further into the DB we went, too; it turns out there are lots more tags in the back end of the database. This update ran for about 18 hours before it had clearly hung, with the PC utilising 100% of RAM and nothing moving.

Anyway, third attempt — and the best to date! Combine the two approaches. Read in batches of URLs, then create the tag collections and cache them to disk. MongoDB has some cool functionality to help with this — I just use an iterator and then do something like this:

db.tags.related['c%s' % batch_num].insert(batch_docs)

I end up with db.tags.related.c1, db.tags.related.c2 etc.
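Put together, the batching step looks something like the following (same assumptions as the earlier sketches; the batch size is made up):

from collections import defaultdict, Counter
from itertools import islice
from pymongo import MongoClient

db = MongoClient().trunkly  # assumed names
BATCH_SIZE = 50000  # hypothetical

cursor = db.urls.find()
batch_num = 0
while True:
    batch = list(islice(cursor, BATCH_SIZE))
    if not batch:
        break
    batch_num += 1
    # Build the related-tag counts for this batch only.
    related = defaultdict(Counter)
    for url in batch:
        tags = [t["tag"] for t in url.get("tags", [])]
        for tag in tags:
            for other in tags:
                if other != tag:
                    related[tag][other] += 1
    # Cache this batch to its own collection: db.tags.related.c1, c2, ...
    if related:
        db.tags.related["c%s" % batch_num].insert_many([
            {"tag": tag, "rt": [{"tag": t, "c": c} for t, c in counts.items()]}
            for tag, counts in related.items()
        ])
    related.clear()  # release memory before the next batch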

The point is that these are already “pre-processed” for each tag. Now I can shrink them down to a single collection: query each cache collection, find the matching tag and its related tags in each one, combine them into a single entry, and write it out.
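The merge step then becomes something like this (same assumed names; working one tag at a time keeps memory flat):

from collections import Counter
from pymongo import MongoClient

db = MongoClient().trunkly  # assumed names

cache_names = [n for n in db.list_collection_names()
               if n.startswith("tags.related.c")]

# Find every tag that appears in any cache.
all_tags = set()
for name in cache_names:
    all_tags.update(doc["tag"] for doc in db[name].find({}, {"tag": 1}))

# For each tag, combine its related-tag counts from every cache and write
# out the final document.
for tag in all_tags:
    counts = Counter()
    for name in cache_names:
        doc = db[name].find_one({"tag": tag})
        if doc:
            for entry in doc["rt"]:
                counts[entry["tag"]] += entry["c"]
    db.tags.related.insert_one({
        "tag": tag,
        "rt": [{"tag": t, "c": c} for t, c in counts.most_common()],
    })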

The first part of the process is taking around an hour to complete, and I expect the second part to take about an hour as well.

What have I learnt so far?

  1. It’s much easier to grow with big data than to retrofit onto it later.
  2. Explicitly release memory — there seems to be a difference in Python between dictionary.clear() and what I was doing, which was dictionary = {} (binding to a new empty dictionary); see the snippet below. Over time the second approach used all the memory on the PC, even when I was doing smaller batches. Using clear(), there are no memory leaks (or no obvious ones anyway).
  3. Working in memory is more efficient, up to a point — Python dictionaries seem to tail off (maybe because of the dictionaries within dictionaries). Trying to do everything in memory just wasn’t working. Perhaps more memory would help, but caching to disk part way through offers a lot of advantages too (not least that I can restart from certain points — blow away the last cache and start again from there).
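To illustrate point 2, here are the two patterns side by side (illustrative only):

related = {"python": {"ruby": 3}}

# What I was doing each batch: rebind the name to a brand-new dict. The old
# dict (and everything hanging off it) only goes away once nothing else
# references it.
related = {}

# What worked better for me: empty the existing dict in place.
related.clear()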