Phase One Joins and Data Matching

I’ve just posted a new Greatest Hits article on the ILM forum on how ILM (or the FIM Sync Service) can be used to clean up the mess of existing accounts before you can actually get on to the more interesting tasks of provisioning and updating. With the way FIM codeless sync works, needing an existing attribute to match on and only allowing simple matching rules, it will be more important than ever to start from a position of tidy directories with correctly identified existing accounts. Here’s the article…

Phase One Joins and Data Matching

The tedious truth is that most IdM projects must begin with a phase of data matching and cleaning. Before you can start to automate management of identities, you need a predictable data set around which to base your rules. Many organizations today, whether due to changes in IT personnel, company mergers, variable naming conventions or lack of guidance on handling resignations, have existing user account bases that can only be described as a mess.

This document covers some of the methods you can use with ILM to get through that first project phase. Unfortunately there are no magic bullets here – eye-straining, brain-numbing trawls through long lists of unmatched accounts cannot usually be avoided. What you can do is try to extract quality data, and construct your lists as helpfully as possible, so they can be targeted at the people whose eyeballs and brains are most likely to give the best response.

It must be added that this is a very big topic, and the methods you choose will depend on the data you are faced with, and the aims of your project.

Joins

Really, it’s all about joins.

Once existing accounts are joined to their correct data source (such as matching an HR record to an AD account) you can begin to flow updates.

Once you have a clear idea who does and does not have an account in a target directory, you can begin to make provisioning and deprovisioning decisions.

But be careful – your joins must be reliable! The last thing you want is to kill the credibility of your fledgling IdM project by updating someone’s account with another person’s details or, even worse, deleting their account because you thought they’d left! 
 

 

Your eventual aim is a simple Join Rule

When you first start data matching, you will use a lot of different join rules, and you will make joins manually and with CSV files (more on that below). But keep this in mind:

Any join made manually, or with extra effort, should be considered temporary. 

There are various situations in ILM which can only be reliably rectified by a clear-out and re-import of a connector space.

Always plan for this.

Ideally, when you re-import a connector space, you will have a single, direct join rule which effortlessly re-joins all your objects. And to achieve this we use… 

Breadcrumbing

Once the join is verified you should export a uniquely identifying attribute, such as an employee number, to the target directory.

After that a simple “employeeID = employeeID” type join rule is all you will need. 

 

Sometimes you are faced with an import-only system. For either technical or political reasons you are not able to export the breadcrumb attribute.

There are a couple of things you can do: 

  1.  Import the DN or other identifying attribute from the target directory into an attribute on the Metaverse object. But be warned, you could still lose these joins if you had to repopulate your Metaverse.
  2.  To be on the safe side you should also “save” these joins by exporting an identifying pair somewhere else – for example using a database or text file MA, as pictured below.

 

Understanding Join Rules

There are some key points to note about Join Rules: 

  1. Join rules operate on the CS object. A join rule takes one CS object and attempts to find a match among all Metaverse objects of the correct type, even if they are already joined.
  2. Components within one rule are AND-ed together – they must all match.
  3. Multiple rules are evaluated top down, so put your strongest rules at the top.
  4. While you can offer variations on a CS object’s attribute using an Advanced Join Rule, you have to find an exact match with an attribute already on a Metaverse object. There are no “StartsWith” or “Contains” comparisons.
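
To make those four points concrete, here is a small Python model of the evaluation order. It is only a sketch for reasoning about rule design outside ILM, not how the sync engine is implemented: the function name, the dictionaries standing in for CS and Metaverse objects, and the decision to stop at the first rule that finds anything are all assumptions, and the handling of ambiguous results is not modelled.

```python
def find_candidates(cs_object, metaverse, join_rules):
    """Return (rule_index, matching MV objects) for the first rule that finds a match."""
    for index, rule in enumerate(join_rules):            # rules are tried top down
        matches = [
            mv for mv in metaverse
            if all(                                       # components are AND-ed
                cs_object.get(cs_attr) is not None
                and cs_object.get(cs_attr) == mv.get(mv_attr)   # exact match only
                for cs_attr, mv_attr in rule
            )
        ]
        if matches:
            return index, matches
    return None, []


# Hypothetical sample data
metaverse = [
    {"employeeID": "0012988", "FirstName": "Kathryn", "Lastname": "Bigalow"},
    {"employeeID": "0013001", "FirstName": "Kate", "Lastname": "Bigalow"},
]
join_rules = [
    [("employeeID", "employeeID")],                       # strongest rule first
    [("givenName", "FirstName"), ("sn", "Lastname")],     # weaker fallback
]
cs_object = {"givenName": "Kathryn", "sn": "Bigalow"}     # no breadcrumb yet
rule_index, candidates = find_candidates(cs_object, metaverse, join_rules)
```

With this data the first rule finds nothing because the CS object has no employeeID, so the second rule is tried, and only the Metaverse object whose first name is exactly “Kathryn” qualifies.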

Resolve Join

At the bottom of the Join Rule configuration you will see a check box “Use rules extension to resolve”.

Here you can link to code you write under the ResolveJoinSearch subroutine in the MA extension.

The Resolve rule is used when ILM finds multiple possible matches in the Metaverse (including already-joined objects).

All the possible matches are passed into the collection rgmventry, and your job, in the code, is to check through them, looking for the best possible match.

If your code finds an ideal match you set imventry to the index number of that object in rgmventry, and set the return value of ResolveJoinSearch to true.

The following example only joins if a single, unjoined Metaverse object was found. 

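Public Function ResolveJoinSearch(ByVal joinCriteriaName As String, ByVal csentry As CSEntry, ByVal rgmventry() As MVEntry, ByRef imventry As Integer, ByRef MVObjectType As String) As Boolean Implements IMASynchronization.ResolveJoinSearch
   Select Case joinCriteriaName
      Case "Resolve_NotYetJoined"
         'Join only when there is exactly one candidate and it has
         'no connector in the "AD" MA yet.
         If rgmventry.Length = 1 AndAlso _
            rgmventry(0).ConnectedMAs("AD").Connectors.Count = 0 Then
            imventry = 0   'index of the chosen object in rgmventry
            Return True
         Else
            Return False
         End If
      Case Else
         'Unknown rule name: fail loudly rather than silently not joining
         Throw New EntryPointNotImplementedException()
   End Select
End Function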

Advanced Join Rules

A simple join rule directly matches a connector space attribute to a Metaverse attribute:

  Connector Space        Metaverse
  givenName = Kathryn    FirstName = Kathryn
  sn = Bigalow           Lastname = Bigalow

With an Advanced Join Rule you construct a list of possible values with which to find an exact match in the Metaverse:

  Connector Space        Metaverse
  givenName = Kathryn    FirstName = Kate
  givenName = Kate
  givenName = Kathy
  sn = Bigalow           Lastname = Bigalow

Sometimes you can use code rules to make your list of possible matches (e.g. presenting a phone number in different formats); other times you have to use long look-up lists of possible variations (try genealogy websites for name-variation lists).

The following example uses a lookup file of aliases, where each line has the possible variations on a name.

If a match to the first name is found on the connector space object, the whole line is added to the possible values to search for in the Metaverse. 

Elizabeth,Liz,Beth,Betty
David,Dave,Davey
Jerome,Jérôme
… 

Imports Microsoft.MetadirectoryServices

Public Class MAExtensionObject
   Implements IMASynchronization
   'Each entry is one line of the alias file, e.g. "Elizabeth,Liz,Beth,Betty"
   Dim arrAliases As String()

   Public Sub Initialize() Implements IMASynchronization.Initialize
      'Read the whole alias file once, when the extension is loaded
      arrAliases = System.IO.File.ReadAllLines("C:\aliases.txt", System.Text.Encoding.Default)
   End Sub

   Public Sub MapAttributesForJoin(ByVal FlowRuleName As String, ByVal csentry As CSEntry, ByRef values As ValueCollection) Implements IMASynchronization.MapAttributesForJoin
      Select Case FlowRuleName
         Case "Join_aliases"
            'Nothing to search for if the CS object has no first name
            If Not csentry("givenName").IsPresent Then Exit Sub
            Dim givenName As String = csentry("givenName").Value

            For Each aliasList As String In arrAliases
               Dim names As String() = aliasList.Split(","c)
               'Compare whole names, so that a short name such as "Ann" does
               'not pick up a line just because it occurs inside "Joanne"
               Dim found As Boolean = False
               For Each name As String In names
                  If String.Equals(name.Trim(), givenName, System.StringComparison.OrdinalIgnoreCase) Then found = True
               Next
               If found Then
                  'Offer every variation on the line as a possible match
                  For Each name As String In names
                     values.Add(name.Trim())
                  Next
               End If
            Next
      End Select
   End Sub
End Class

Using CSV Files

A lot of the difficult account matching will probably be done outside ILM.

It is therefore useful to be able to “export” lists of possible matches, and later, “import” the joins from a CSV file.

Be careful when writing to files from extension code – the DLL doesn’t unload for five minutes after the MA run completes, which means you may have to wait for it to finish writing to the file. If you’re in a hurry, recompiling the code will force the DLL to finish writing to the file.

Export Possible Matches to a CSV file

You can use the Resolve rule to export possible matches to a CSV file. The following example resolves the join rule “sn | Direct | lastname”. As a match on the last name alone is too weak for an immediate join, we just write the possible matches to the text file.

Imports Microsoft.MetadirectoryServices

Public Class MAExtensionObject
   Implements IMASynchronization
   Dim fileMatches As System.IO.StreamWriter

   Public Sub Initialize() Implements IMASynchronization.Initialize
      'False = overwrite any file left over from a previous run
      fileMatches = New System.IO.StreamWriter("C:\possible matches.txt", False, System.Text.Encoding.Default)
   End Sub

   Public Sub Terminate() Implements IMASynchronization.Terminate
      fileMatches.Close()
   End Sub

   Public Function ResolveJoinSearch(ByVal joinCriteriaName As String, ByVal csentry As CSEntry, ByVal rgmventry() As MVEntry, ByRef imventry As Integer, ByRef MVObjectType As String) As Boolean Implements IMASynchronization.ResolveJoinSearch
      Select Case joinCriteriaName
         Case "Resolve_Lastname"
            Dim MAName As String = csentry.MA.Name
            Dim cFirstname, mFirstname As String

            If csentry("givenName").IsPresent Then
               cFirstname = csentry("givenName").StringValue
            Else
               cFirstname = "UNKNOWN"
            End If

            For Each mvobject As MVEntry In rgmventry
               'Only report candidates not yet joined to an object in this MA
               If mvobject.ConnectedMAs(MAName).Connectors.Count = 0 Then
                  'The Metaverse first name is read per candidate, inside the loop
                  If mvobject("firstname").IsPresent Then
                     mFirstname = mvobject("firstname").StringValue
                  Else
                     mFirstname = "UNKNOWN"
                  End If

                  fileMatches.WriteLine(csentry("sn").StringValue & ";" _
                     & cFirstname & ";" _
                     & mvobject("lastname").StringValue & ";" _
                     & mFirstname)
               End If
            Next
            'Never join on this rule; it only produces the report
            Return False
         Case Else
            Throw New EntryPointNotImplementedException()
      End Select
   End Function
End Class

You will need to adapt this code, of course. Firstly, you probably want to export a lot more identifying information in your CSV file: department, email address, DN, whatever helps. Next, it can really help to supplement your possible matches with a probability score. This is where you do a series of tests and add points: the more points, the higher the chance of the match. For example:

  • Names similar*        +1
  • Department the same   +1
  • City the same         +1

*Some tips for testing if names are similar:

  • Strip out all spaces, dashes and hyphens then compare;
  • Check if one string is contained in the other (so that “Sally-Anne” gets a point for “Sally”);
  • Use a function which compares string similarity (search “Soundex” and “Levenshtein distance”).
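
If you do the scoring outside ILM, the points system and the name tips might be sketched in Python like this. Treat it as a starting point only: the field names (lastname, department, city), the edit-distance threshold of 1 and the choice to score the last name are assumptions for illustration, and the Levenshtein function is the textbook dynamic-programming version rather than anything ILM provides.

```python
import re


def normalise(name):
    """Lower-case a name and strip spaces and hyphens."""
    return re.sub(r"[\s\-]", "", name).lower()


def levenshtein(a, b):
    """Edit distance between two strings (insert, delete, substitute)."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def names_similar(a, b, max_distance=1):
    """True if the names match after stripping, contain each other, or nearly match."""
    a, b = normalise(a), normalise(b)
    if not a or not b:
        return False
    return a in b or b in a or levenshtein(a, b) <= max_distance


def match_score(cs_record, mv_record):
    """One point each for a similar last name, same department and same city."""
    score = 0
    if names_similar(cs_record.get("lastname", ""), mv_record.get("lastname", "")):
        score += 1
    for field in ("department", "city"):
        value = cs_record.get(field)
        if value and value == mv_record.get(field):
            score += 1
    return score
```

The containment test is generous with very short names, so a score like this is best used to order the list for human review rather than to make joins on its own.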

Join from CSV

You can use an Advanced Join Rule to “import” joins from a CSV file.

Firstly, our CSV file must be constructed like this: 

CS_identifier;MV_identifier 

For example, if we are trying to match an AD account against Metaverse objects imported from the HR system, we populate the CSV with the AD DN and the employeeID: 

CN=Fred Bloggs,OU=User,OU=MyOrg,DC=mydomain,DC=com;0012988 

Next we create the Advanced Join Rule which will look up the csobject’s DN in the text file, but use the employeeID to search the Metaverse. 

Note that you can’t actually use the DN in the join rule – but that’s ok, just use any attribute that definitely exists. Eg., 

sAMAccountName | Rules extension – Join_CSV | employeeID 

And now for the code:

Imports Microsoft.MetadirectoryServices

Public Class MAExtensionObject
   Implements IMASynchronization
   'Each entry is one line of the CSV: CS_identifier;MV_identifier
   Dim joins As String()

   Public Sub Initialize() Implements IMASynchronization.Initialize
      'Read the csv file into an array once, when the extension is loaded
      joins = System.IO.File.ReadAllLines("C:\joins.csv", System.Text.Encoding.Default)
   End Sub

   Public Sub MapAttributesForJoin(ByVal FlowRuleName As String, ByVal csentry As CSEntry, ByRef values As ValueCollection) Implements IMASynchronization.MapAttributesForJoin
      Select Case FlowRuleName
         Case "Join_CSV"
            'If the csentry DN is found in the joins array, then
            'use the paired employeeID to search the Metaverse.
            Dim dn As String = csentry.DN.ToString()
            For Each entry As String In joins
               'Split at the last ";" and compare the whole DN, so that a DN
               'which merely occurs inside a longer line is not matched
               Dim sep As Integer = entry.LastIndexOf(";"c)
               If sep > 0 AndAlso String.Equals(entry.Substring(0, sep).Trim(), dn, System.StringComparison.OrdinalIgnoreCase) Then
                  values.Add(entry.Substring(sep + 1).Trim())
               End If
            Next
      End Select
   End Sub
End Class

Reporting

People will ask you questions like “How many people have you joined in system X but not in Y?”, “How sure are you that the joins are correct?”, “Which department has the most unidentified accounts?” It’s best to be prepared for these sorts of questions.

Querying the Metaverse

Once data is in the Metaverse it is a simple matter to access it for reporting, either by exporting it into a reporting table, or by directly querying the underlying tables (as long as you’re careful to do it when ILM is idle, or else use NOLOCK).

So consider this: During the data cleaning phase, import all identifying attributes into the Metaverse from all sources. 

For example: You’ve made a join between a user in AD and an HR record.

Under normal operations you would consider HR as the master source for the name, and you would only flow it from there.

You wouldn’t bother importing the name attributes from AD – in fact you’re more likely to be overwriting them with export flow rules.

  HR                            Metaverse                     AD
  Lastname = Powells-Brown  ->  lastname = Powells-Brown  ->  sn = Powells-Brown
  Firstname = Joanna        ->  firstname = Joanna        ->  givenName = Joanna

However, during the data matching phase, you’re probably not ready to start overwriting attributes, and the information about current values can be very important in your verification and reporting.

  HR                            Metaverse                        AD
  Lastname = Powells-Brown  ->  HR_lastname = Powells-Brown
  Firstname = Joanna        ->  HR_firstname = Joanna
                                AD_lastname = Powells        <-  sn = Powells
                                AD_firstname = Jo            <-  givenName = Jo
 

Now, if you have a look at the mms_metaverse table in the ILM database, you will see how simple it is to query the progress of your joins, and also to judge on what criteria the joins were made.

Some example queries… 

/* HR person with no join to AD */

select HR_lastname, HR_firstname, HR_employeeid from mms_metaverse
where AD_dn is null 

 /* HR person with join to AD */

select HR_lastname, HR_firstname, HR_employeeid, AD_lastname,AD_firstname,AD_dn from mms_metaverse
where AD_dn is not null 

Caution
As mentioned above, you need to be careful when directly querying the Metaverse tables. If your system is already in production, and you happen to be adding in a new data source, then you may be better off employing a SQL MA to export the data you’re interested in to another table, where you can query it as much as you like without risk of locking errors.

  

Querying the Connector Space

Unfortunately it is not so simple to query the connector space to, for example, report on the state of your disconnected objects.

It’s a great pity that you can’t just save results from the Joins page in the Identity Manager GUI, so your options are: 

SQL query – The CS table holds data differently to the Metaverse tables.

It is possible to query for disconnectors in this way, however you will only be able to retrieve the CN of objects – which may not be sufficient to identify them. 

select cs.rdn from dbo.mms_connectorspace cs
join dbo.mms_management_agent ma
on cs.ma_id = ma.ma_id
left outer join dbo.mms_csmv_link mv
on mv.cs_object_id = cs.object_id
where ma.ma_name = 'My MA'
and mv.mv_object_id is null
and cs.connector_state = 0

CSExport – The command-line utility csexport.exe, found in the <ILM program>\bin folder, will allow you to dump connector space objects to an XML file. 

Report directly from the data source – Once an object in the data source has been correctly identified you will ideally export a unique attribute out to it. It may then be possible to identify the non-joined objects as those which don’t possess this attribute. 
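
As a rough illustration of that last option, the Python sketch below filters an export from the data source for accounts that have not yet received the breadcrumb. The semicolon-delimited layout, the dn and employeeID column names and the sample rows are all assumptions; use whatever your directory export actually produces.

```python
import csv
import io


def unidentified_accounts(rows, breadcrumb="employeeID"):
    """Return rows whose breadcrumb attribute is missing or blank."""
    return [row for row in rows if not (row.get(breadcrumb) or "").strip()]


# Hypothetical export from the target directory: dn;employeeID
export = io.StringIO(
    "dn;employeeID\n"
    "CN=Fred Bloggs,OU=User,OU=MyOrg,DC=mydomain,DC=com;0012988\n"
    "CN=Old Account,OU=User,OU=MyOrg,DC=mydomain,DC=com;\n"
)
rows = list(csv.DictReader(export, delimiter=";"))
missing = unidentified_accounts(rows)
```

Here only the “Old Account” row comes back, because it is the only one without an employeeID.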

Project to a different object type – ILM processes joins before projections, so it is fairly simple to project all non-joined objects to a different Metaverse object type – for example, one called “disconnectors”. This may help in reporting on an overall status direct from the Metaverse tables.

Note however that to make these objects once again available for joins they will first have to be disconnected, which is done from the MVExtension provisioning code.

Advanced Tips and Tricks

Join to Multi-value Attribute

Sometimes you may need to join to a value in a multi-value attribute. An example is searching through all the proxyAddresses for a match against a single email address.

When the multi-valued attribute exists in the connector space, and the single valued attribute is in the Metaverse, this is very easily accomplished. Just use an Advanced Join Rule to break the multi-valued attribute down into the values list used by the join rule. 

Public Sub MapAttributesForJoin(ByVal FlowRuleName As String, ByVal csentry As CSEntry, ByRef values As ValueCollection) Implements IMASynchronization.MapAttributesForJoin
   Select Case FlowRuleName
      Case "Join_proxyAddresses"
         'Offer each value of the multi-valued attribute as a possible match
         For Each address As Value In csentry("proxyAddresses").Values
            values.Add(address.ToString())
         Next
   End Select
End Sub

However, when the multi-valued attribute has already been imported into the Metaverse you will have a problem. While you can join on a multi-valued attribute, you have to join on the whole thing. There is no way that you can match one value out of a multi-valued Metaverse attribute against a single-valued connector space attribute. 

Some possible options: 

  • Import the multi-valued attribute into a series of single-valued attributes in the Metaverse, eg., proxy1, proxy2, proxy3 …
  • Use a different Metaverse object type to do the project and join the other way around.
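
The first option amounts to spreading the values across numbered slots. In ILM this would be done in an import flow rule; the Python sketch below only illustrates the mapping, and the function name, the proxy prefix and the three-slot limit are assumptions.

```python
def spread_values(values, prefix="proxy", slots=3):
    """Map a multi-valued attribute onto numbered single-valued attributes."""
    attributes = {}
    for position, value in enumerate(values[:slots], start=1):
        attributes[f"{prefix}{position}"] = value
    # The second item holds any values that did not fit into the slots
    return attributes, values[slots:]
```

Because the components within one join rule are AND-ed together, each numbered attribute needs its own join rule, and any value beyond the last slot cannot be matched at all, so it is worth reporting the overflow.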

Multiple Possibilities in the Connector Space

All the joining techniques so far have been based around a single connector space object, with one or more possible matches in the Metaverse. But what do you do if you want to work the other way around, where there are multiple possible matches in the connector space for a single Metaverse object, and you want to pick the best one?

Unfortunately this is not straightforward. There is no way to offer a selection of connector space objects in a join rule, as joins always work on a single connector space object at a time. You can judge the merits of the current CS object, but you can’t tell if there’s a better one coming up.

One way around this is to do everything with CSV files.

  1. Use the ResolveJoinSearch to write possible joins to a CSV, but don’t actually join anything,
  2. Do your matching outside ILM, then
  3. Use a CSV with an Advanced Join Rule to make the joins.
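
Step 2 is where you finally get to see all the connector space candidates for one Metaverse object side by side. A minimal Python sketch of that step follows, assuming each possible match has already been given a score and that the output should be in the CS_identifier;MV_identifier layout used by the CSV join rule. The function name, the minimum score and the rule of skipping ties are illustrative choices, not part of ILM.

```python
def best_joins(candidates, minimum_score=2):
    """Pick the best CS object per MV object.

    candidates: iterable of (cs_identifier, mv_identifier, score) tuples.
    Returns lines in the form "cs_identifier;mv_identifier".
    """
    by_mv = {}
    for cs_id, mv_id, score in candidates:
        by_mv.setdefault(mv_id, []).append((score, cs_id))

    joins = []
    for mv_id, scored in by_mv.items():
        scored.sort(reverse=True)                 # highest score first
        top_score, top_cs = scored[0]
        tied = len(scored) > 1 and scored[1][0] == top_score
        # Leave weak or tied matches for a human to decide
        if top_score >= minimum_score and not tied:
            joins.append(f"{top_cs};{mv_id}")
    return joins


# Hypothetical scored matches
candidates = [
    ("CN=Fred Bloggs,OU=User,DC=mydomain,DC=com", "0012988", 3),
    ("CN=F Bloggs,OU=Old,DC=mydomain,DC=com", "0012988", 1),
    ("CN=Jo Powells,OU=User,DC=mydomain,DC=com", "0013001", 2),
    ("CN=Joanna P,OU=User,DC=mydomain,DC=com", "0013001", 2),
]
lines = best_joins(candidates)
```

In this sample only Fred Bloggs gets a join line; the two candidates for 0013001 are tied, so they stay on the by-hand list. The sketch does not check whether one CS object has been picked for two different Metaverse objects, which you would also want to catch before importing the file.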

Too much data

If you have an enormous number of accounts, and different data sources to trawl through, you may be best off doing your data matching outside of ILM, and then just using the CSV join rule above to make the joins. Check out the Fuzzy Lookup transformation in SQL Server Integration Services (SSIS), available in the Enterprise edition of SQL Server, for help here.

Take-Home Thoughts

  • Complex join rules are a means to an end – not the end itself.
  • Breadcrumbing is essential for automation.
  • When matching on weak rules (e.g. surname only), verify the match another way.
  • You can’t do it all in ILM. CSV, Excel and fuzzy lookup algorithms will also help, but an element of by-hand matching is inevitable.
  • Get the matching and breadcrumbing sorted out before you start flowing and provisioning. This will make for a happier project, stakeholders, users and YOU!


About the Author

Carol Wapshere is an ILM MVP and contributor of a few of these documents now. She has always believed that putting the work in up-front will save you lots of headaches in the long run. It’s a self-defense strategy really – there’s nothing worse than having to go over and over (and over) the same ground again and again, especially when you’d already moved on to something new and far more interesting. Keep it neat, and keep it simple, and things should work just fine. 

Thanks to Markus Vilcinskas and Paul Loonen for their help with this document.

8 Replies to “Phase One Joins and Data Matching”

  1. Carol – tonight I’ve been wondering about what constitutes best practice for implementing complex (or even semi-complex) join rules in FIM. What I am trying to do is understand how the “breadcrumbing” idea (Markus calls it a “correlation id”) might work in a FIM model where the join rule has to be a simple match on one or more metaverse attributes. The more I think about it the more I am leaning towards the traditional approach – but how will this work if I still want to implement my flow rules in the FIM portal? The question would probably be this … if I implement a combination of join rules and then implement the flow rules in the FIM portal, including writing back a “breadcrumb”, would that mean that the attribute I specify on the join rules tab should be the match on my breadcrumb? Any direction on this would be much appreciated … clients have been sold on the codeless concept, and I am not keen on muddying the water by implementing rules in 2 places if I can avoid it.

  2. Hi Bob,

    as far as I can tell the FIM Sync rules only support the simplest type of join rule, so all of the gymnastics covered in this post would still have to be done the old ways – hopefully just initially as part of sorting out the legacy data mess until you could get to a point of exporting your breadcrumb, after which FIM Sync rules would be sufficient. On one particularly complex project I’ve been using one FIM server as the “joins server” and a separate one as the “production server” doing the provisioning and updating.

    I just realised something about the Sync Rules the other day – I had thought the match criteria was a kind of “soft join” re-evaluated each time, so if the matching attribute changed in the target directory it could join to a different metaverse object the next time around. But it actually does work exactly the same as a join always has done. I guess this at least means you won’t get any weird situations where multiple metaverse objects try to flow to one cs object – when it’s joined it’s joined, just like it always was.

  3. Thanks Carol – just read your “FIM Newbies” post and there is a consistency developing here in what you say. I was wondering about your first paragraph in this post where I stopped on your words “doing away with the notion of permanent joins”. I guess your reply explains that now – I was sort of hoping that the joins are still permanent, cos that would have changed the playing field for me! 🙂

  4. P.S. what are your thoughts on this post having worked with FIM for a good while now: http://forums.novell.com/novell-product-support-forums/identity-manager/im-engine-drivers/415131-idm-vs-fim-2.html … do you hear much of this lately? What would be your response based on your experience so far? I think some guidelines on the appropriate use of “codeless” sync rules will be needed pretty soon to ensure that performance doesn’t come into question …

  5. Interesting. I might just have to jump into that discussion. I had a bad experience with DirXML years ago and haven’t been near it since, but I’ve been reading up on IDM 4 and I’d like to learn more. I think the Microsoft sales guys are kidding themselves if they think FIM can do anything near what a product like IDM can do – but on the other hand I do believe you have more flexibility with FIM due to the DIY nature of the product. Of course that also means you need someone who understands all this DIY… I think a lot of it depends on the client and the project. Despite them having given me an MVP, I don’t actually think FIM is the choice for every occasion, BUT when I manage to do something like make it run BPOS powershell cmdlets I think to myself “that’s pretty cool!”

  6. In our provisioning processes, I’ve been managing proxyAddresses directly based on values in a database table, so it’s been straightforward. Now our Exchange team wants to reverse that flow, and instead of me setting proxyAddress values, let Exchange do that and flow them back. No problem mapping proxyAddress back from ADMA into the MV objects. However, now they want me to create distribution lists where user A may request that his mail be redirected to user B, so the spec calls for attempting to join the stated address in the datasource to all proxyAddress values in the MV.

    So my question is, do you still think that my best approach is to map the proxyAddress values in the ADMA CS to a set of attributes in the MV so I can join against them, or have you had any better ideas?

    Maybe I’m obtuse, but I’m not getting the “Use a different Metaverse object type to do the project and join the other way around” suggestion at all.

  7. Hi Bill. This post was written very much in mind of the initial sort of gymnastics you do to get accounts matched and cleaned up – I wouldn’t recommend these steps as operational options. With the “join the other way” I was talking about a temporary projection into metaverse objects, just so you could join against CS objects in another connector space (where the multivalue attribute came from), after which you would swiftly export an identifying attribute out to your newly joined objects, and then clear them out of the metaverse.

    With these groups of yours – could you use SQL to generate them and link them up to their members? Complex logic is almost always best done outside the sync service.
