Merging data from different sources

This is a third post in what is becoming a series about our work on the data warehouse. We have just handed a new release of our HR data mart to a small group of BI analysts, who will test it before we deploy the live version. This release is significant because for the first time it is bringing together data from the old HR system and from the new “People & Money” system. Nowhere else in the University is providing this joined up view of historic and current data.

Merging data from different sources is often a challenging task, and this development was no different. Although the two source systems are both from Oracle and share similar data models, they also have significant differences, partly because the University’s transformation programme is changing the University HR and Finance structures.

Our data warehouse was designed our data warehouse to handle this sort of situation. The graph structure that I described in my post on Future proofing our core data can mimic the data models from different sources, and this enabled us to import the data from both systems without losing information. Now that the data is loaded, we can present it to the analysts in whichever format is most relevant to their needs. In this release, we are transforming all the data into the data model of the new system, so that analysts can create consistent reports across the two sets of data.

As with any software or data project, the test of the design comes when we actually implement it. There will always be details to work out and this release was no exception. As one example, we had to rewrite some queries to select records from the most appropriate source at any given time. This required some careful business logic and a little extra time, but it was not a major problem.

What turned out to be more of a challenge than we had expected was the synchronisation of the test data itself. We had access to a test environment of the old HR system that was last refreshed in May, and on the People & Money side, we had data from a test environment from several months later. In addition, our testers are comparing the data against the existing reporting system.

These differences meant that the number of records in our test release don’t match those in the current reporting system, and also that some of the records in the warehouse are incomplete. There are also records in People & Money for employees of subsidiary companies, such as Edinburgh University Press, and who don’t have corresponding records in the current live system. This makes the job of the testers harder. It took the integration testers quite a while to get to the bottom of these differences, and we are having to give precise information about these differences to our analyst testers.

This reflects the nature of BI projects, in which the testers are intrinsically interested in the details of the data. By comparison, most software projects are more about the functionality and user experience of the application, and they only need sufficient data to test those aspects of the system. Here, the data used for testing is crucial. In retrospect, we could have saved ourselves some trouble by using data from the production systems instead of older test environments.