I’m in the process of revisiting and migrating a service I first wrote almost five years ago. One of the backend data sources has changed (drastically!), so I’m refactoring and streamlining along the way, since parsing the new backend takes longer than the old one did. Part of the motivation is the (very rare, but still possible) timeouts of the feed even with the old, faster version.
Background
I was asked to build a drop-in replacement for a legacy application which merged user information from a telecom system and ldap in order to create a number of feeds used by other systems, such as an employee locator. The feeds also fed back into ldap so that the telecom system would be the canonical source for certain pieces of user data, such as room number, name, and phone extension.
The original software was written in BPL (or PL/B?) decades prior, and other tools had grown organically around it. The “owner” of the process retired and I was brought in to replace it.
The first iteration resulted in (mainly due to politics at the time) a jar file which embedded JRuby and a simple Ruby on Rails application. As such, the server could be invoked via `java -jar` and the feeds were pulled via `cron` jobs. Quirky, but it worked. I’ve only had a couple of requests for help with it over that time: the first related to GIGO and the second required a reboot of the server (it had been up a very long time). Since it was originally written there have been vast improvements in Java, JRuby, and Rails. I’m pretty surprised it was that rock solid.
Fast forward several years and there is a new telecom system. The old one allowed ldap queries to access the data. The new one, unfortunately, does not expose the data in this fashion; rather, it requires a cumbersome and potentially brittle process of logging in via mechanize, walking through a few pages to first generate a CSV file, and then downloading said file. Additionally, there are schema differences between the old and new systems.
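To make that concrete, here is a rough sketch of what the scripted download might look like with mechanize. The URL, form field names, and link text are hypothetical placeholders, not the real telecom system’s pages:

```ruby
require 'mechanize'

# A rough sketch only; the URL, form fields, and link text below are
# hypothetical placeholders for the real telecom system's pages.
agent = Mechanize.new

# Step 1: log in
login_page = agent.get('https://telecom.example.com/login')
form = login_page.forms.first
form['username'] = ENV['TELECOM_USER']
form['password'] = ENV['TELECOM_PASS']
dashboard = form.submit

# Step 2: walk to the report page and request the CSV export
report_page = dashboard.link_with(text: 'Reports').click
export_page = report_page.link_with(text: 'Export CSV').click

# Step 3: save the generated file locally
export_page.save('telecom_users.csv')
```

Even wrapped in error handling, something like this breaks as soon as the vendor changes the page flow, which is exactly the brittleness I’m worried about.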
Uncertainty
One of the feeds, used for locating employees’ locations and/or phone numbers, merges information from the two systems. However, there is no shared primary key, so the merger needs to happen based on the values of several fields which may not be in sync between the systems:
- First Name
- Last Name
- Department
- Extension
- Location
Given the constraints when the first system was created, I settled on a relatively naive implementation which is O(n^2). It worked, however, and given that the services were each called once daily, it was more than sufficient. (Premature optimization is the root of all evil!)
All of the telecom records are in one bucket. All of the ldap records are in another.
- The telecom record is compared to each ldap user record in turn, calculating a weight:
```ruby
def match_weight(u)
  weight = 0
  weight += 3 if u.first_name == @first_name.upcase
  weight += 1 if u.last_name == @last_name.upcase
  weight += 1 if u.department == @department.upcase
  weight += 1 if u.extension == @extension.upcase
  weight += 1 if u.location == @location.upcase
  return weight
end
```
- The highest weight and the index of that record are saved, unless a threshold is reached, in which case the search stops early. The assumption is that while it is possible for one or more attributes to change, it is unlikely that all of them will change at the same time, with the first name being the least likely to change.
- The highest rated user record is merged with the telecom record and the matched record is deleted from the pool.
- Processing continues with the next telecom record. (A rough sketch of the whole loop follows below.)
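Putting those steps together, the original loop looks roughly like this. It is a minimal sketch only: `merge!` and the value of `THRESHOLD` are hypothetical stand-ins for the real merge logic and the real early-exit weight; `match_weight` is the method shown above.

```ruby
# Hypothetical early-exit weight: a score this high is treated as a sure match.
THRESHOLD = 6

def merge_feeds(telecom_records, ldap_users)
  telecom_records.each do |record|
    best_weight = 0
    best_index  = nil

    # Compare this telecom record against every remaining ldap record
    ldap_users.each_with_index do |user, index|
      weight = record.match_weight(user)
      if weight > best_weight
        best_weight = weight
        best_index  = index
      end
      break if best_weight >= THRESHOLD   # good enough; stop scanning
    end

    if best_index
      record.merge!(ldap_users[best_index])  # merge the highest-rated user record
      ldap_users.delete_at(best_index)       # remove it from the pool
    end
  end
end
```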
Worst case, n + (n - 1) + (n - 2) + ... + 1 comparisons are needed. The geek in me recognizes that this sum happens to have been made famous by Carl Gauss and can be expressed as n(n + 1)/2. It’s still O(n^2), however.
Refactoring
I think that a somewhat more sophisticated algorithm would provide O(n) behaviour at the expense of memory.
It occurs to me that by pre-calculating hashes of the values I can set up a series of bucket hashes, one for each attribute, and then do a quick merge of the buckets to determine which record is the best match. Additionally, Ruby makes it easy to preserve the extra weighting of the first name, since repeating an array is just multiplication:
```ruby
1.9.3-p551 :001 > [1,2,3] * 3
 => [1, 2, 3, 1, 2, 3, 1, 2, 3]
```
Determining the best match would then look something like this (assuming that User is defined elsewhere and has accessors for the proper attributes):
```ruby
class Merger
  # The attributes used for matching; first_name carries extra weight.
  ATTRIBUTES = %w(first_name last_name department extension location)

  # @buckets is a hash of hashes with a bucket hash for each attribute.
  def fill_buckets(users)
    # Let ruby pre-initialize the buckets; we just worry about populating.
    # This lets us do something like
    #   @buckets["first_name"]["matt"].concat([1, 1, 1])   # id == 1, weight == 3
    # after which @buckets looks like:
    #   {"first_name"=>{"matt"=>[1, 1, 1]}}
    @buckets = Hash.new { |h, k| h[k] = Hash.new { |h1, k1| h1[k1] = Array.new } }
    users.each do |user|
      ATTRIBUTES.each do |attr|
        weight = attr == "first_name" ? 3 : 1
        # repeat the id `weight` times so first-name hits count triple
        @buckets[attr][user.send(attr)].concat([user.id] * weight)
      end
    end
  end

  # match matches a user to the best match from the buckets
  def match(user)
    hits = ATTRIBUTES.map do |attribute|
      @buckets[attribute][user.send(attribute)]
    end
    # hits will look something like this:
    # [[1, 2, 3, 1, 2, 3, 1, 2, 3], [1, 4, 5], [2, 3, 5], [1]]
    hits.flatten!
    # [1, 2, 3, 1, 2, 3, 1, 2, 3, 1, 4, 5, 2, 3, 5, 1]
    hits.sort!
    # [1, 1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3, 4, 5, 5]

    # Once sorted, we can use Enumerable#chunk and Enumerable#sort_by to
    # determine which id has the most hits. It's reversed so we have the
    # element with the greatest number of hits appearing first in our array.
    counts = hits.chunk { |id| id }.sort_by { |_id, run| run.size }.reverse
    # [[1, [1, 1, 1, 1, 1]], [3, [3, 3, 3, 3]], [2, [2, 2, 2, 2]], [5, [5, 5]], [4, [4]]]
    counts.first.first
  end
end
```
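For completeness, a hypothetical driver (assuming `ldap_users` and `telecom_records` are already loaded, and that the returned id can be mapped back to the full ldap record) might look like:

```ruby
merger = Merger.new
merger.fill_buckets(ldap_users)       # one pass over the ldap records

telecom_records.each do |record|      # one pass over the telecom records
  best_id = merger.match(record)
  # ... merge record with the ldap user whose id is best_id ...
end
```

Each record is hashed once on the way in and looked up by value on the way out, which is where the O(n) behaviour comes from.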
Conclusion
As with anything else, there are trade-offs. While this code will run faster, it does so at the cost of a larger memory footprint and somewhat more complexity than the original version. In this case, however, I think it is worth it.
It’s still not 100% perfect: without a shared primary key, there is no way we can always be certain of a match. However, given the dataset, the case in which all of the attributes change at once is so unlikely as to be practically non-existent.
Also, I’ll admit the ulterior motive of having a good thought exercise for Christmas Eve. I learned something new about Ruby’s Enumerable. It’s a good day when I get to learn something new.