CluedIn Data Modeling 101

This article covers the most important concepts you must understand when modeling data in CluedIn.

I will use pseudocode to illustrate how data is transformed when it flows through the system. Let's start with sample input data:

Raw data

Here, we have two sample records. One is from CRM:

// CRM System. Contacts
{
  "contact_id": 314,
  "contact_name": "CluedIn ApS",
  "email": "support@cluedin.com",
  "phone": "+45 91 96 56 95",
  "address": "Ringsted",
  "branch": "Master Data Management and AI",
  "notes": "The most innovative Azure-native Master Data Management Platform on the market."
}

And another is from a SQL Server table in our warehouse:

// [Warehouse].[dbo].[Companies]
{
  "Id": 217,
  "Name": "CluedIn",
  "CVR": "36548681",
  "Email": "support@cluedin.com",
  "Address": "Havnegade 39, 1058 København K",
  "BusinessType": "Agriculture"
}

CluedIn transforms raw records into objects called Clues. The Clues are then processed by CluedIn and are attached to new or existing Entities. Let's explore how our sample records can be transformed in CluedIn:

Vocabularies and Vocabulary Keys

The Vocabulary Keys are how CluedIn tracks the properties of the raw records. Each Vocabulary Key represents a unique property name in a CluedIn instance.

CluedIn Vocabularies are groups of Vocabulary Keys. A Vocabulary has a prefix, so each Vocabulary Key in a given Vocabulary starts with the same prefix.

In the example below, I create two Vocabularies: one represents the properties of CRM contacts, and the other represents the properties of the Companies table:

# CRM Contact Vocabulary
prefix: "crm.contact"
keys:
    - "crm.contact.contact_id"
    - "crm.contact.contact_name"
    - "crm.contact.email"
    - "crm.contact.phone"
    - "crm.contact.address"
    - "crm.contact.branch"
    - "crm.contact.notes"
---
# Warehouse Company Vocabulary
prefix: "warehouse.company"
keys:
    - "warehouse.company.Id"
    - "warehouse.company.Name"
    - "warehouse.company.CVR"
    - "warehouse.company.Email"
    - "warehouse.company.Address"
    - "warehouse.company.BusinessType"

Now, we can map the properties of the raw records to Vocabulary Keys in CluedIn:

{
  "crm.contact.contact_id": 314,
  "crm.contact.contact_name": "CluedIn ApS",
  "crm.contact.email": "support@cluedin.com",
  "crm.contact.phone": "+45 91 96 56 95",
  "crm.contact.address": "Ringsted",
  "crm.contact.branch": "Master Data Management and AI",
  "crm.contact.notes": "The most innovative Azure-native Master Data Management Platform on the market."
}
{
  "warehouse.company.Id": 217,
  "warehouse.company.Name": "CluedIn",
  "warehouse.company.CVR": "36548681",
  "warehouse.company.Email": "support@cluedin.com",
  "warehouse.company.Address": "Havnegade 39, 1058 København K",
  "warehouse.company.BusinessType": "Agriculture"
}
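The mapping step can also be sketched in Python. This is only an illustration under my own assumptions, not CluedIn's API: the helper name to_vocabulary_keys is hypothetical, and I assume a Vocabulary Key is simply the Vocabulary prefix, a dot, and the raw property name, as in the listings above.

```python
def to_vocabulary_keys(record, prefix):
    """Return a copy of `record` whose property names carry the Vocabulary prefix."""
    return {f"{prefix}.{name}": value for name, value in record.items()}


# A shortened version of the CRM sample record.
crm_contact = {
    "contact_id": 314,
    "contact_name": "CluedIn ApS",
    "email": "support@cluedin.com",
}

mapped = to_vocabulary_keys(crm_contact, "crm.contact")
for key, value in mapped.items():
    print(key, "=", value)
```

The raw record is left untouched; the function returns a new dictionary, which keeps the raw data available for later steps.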

Entity Type

Now that we have transformed our raw records into Clues by mapping the raw property names to Vocabulary Keys, we need to add an Entity Type. The Entity Type is just a string that indicates the business domain of a data record, so that CluedIn does not mix apples with oranges.

Both of our sample records describe companies, so we set the Entity Type to /Company for both of them:

{
  "entityType": "/Company",
  "crm.contact.contact_id": 314,
  "crm.contact.contact_name": "CluedIn ApS",
  "crm.contact.email": "support@cluedin.com",
  "crm.contact.phone": "+45 91 96 56 95",
  "crm.contact.address": "Ringsted",
  "crm.contact.branch": "Master Data Management and AI",
  "crm.contact.notes": "The most innovative Azure-native Master Data Management Platform on the market."
}
{
  "entityType": "/Company",
  "warehouse.company.Id": 217,
  "warehouse.company.Name": "CluedIn",
  "warehouse.company.CVR": "36548681",
  "warehouse.company.Email": "support@cluedin.com",
  "warehouse.company.Address": "Havnegade 39, 1058 København K",
  "warehouse.company.BusinessType": "Agriculture"
}

Entity Codes

Entity Codes are unique identifiers, but unlike the primary keys in relational databases, you can have as many entity codes as you need for a single record.

An Entity Code has three parts:

  • Entity Type
  • Origin
  • Value

So an Entity Code looks like this: /EntityType#Origin:Value. The Entity Type and Origin distinguish identifiers by business domain and by the origin of the data, so CluedIn knows that an apple with ID 7 is not the same as a banana with ID 7.

In our case, we can define the following origins based on where the data comes from:

  • CRM(Contact)
  • Warehouse(Company)

We can also see that both records have email properties. If we want to treat two companies with the same email as the same company, we can create an Entity Code for the email, too. In that case the origin can simply be Email because, unlike the CRM Contact ID and the Warehouse Company ID, an email address does not come from a particular system.

The same approach can be applied to the CVR number, which is unique for each company in Denmark.

Therefore, Entity Codes for CRM Contact will be:

  • /Company#CRM(Contact):314
  • /Company#Email:support@cluedin.com

And for the Warehouse Company:

  • /Company#Warehouse(Company):217
  • /Company#Email:support@cluedin.com
  • /Company#CVR:36548681

Now our Clues have Entity Codes:

{
  "entityType": "/Company",
  "codes": [
    "/Company#CRM(Contact):314",
    "/Company#Email:support@cluedin.com"
  ],
  "crm.contact.contact_id": 314,
  "crm.contact.contact_name": "CluedIn ApS",
  "crm.contact.email": "support@cluedin.com",
  "crm.contact.phone": "+45 91 96 56 95",
  "crm.contact.address": "Ringsted",
  "crm.contact.branch": "Master Data Management and AI",
  "crm.contact.notes": "The most innovative Azure-native Master Data Management Platform on the market."
}
{
  "entityType": "/Company",
  "codes": [
    "/Company#Warehouse(Company):217",
    "/Company#Email:support@cluedin.com",
    "/Company#CVR:36548681"
  ],
  "warehouse.company.Id": 217,
  "warehouse.company.Name": "CluedIn",
  "warehouse.company.CVR": "36548681",
  "warehouse.company.Email": "support@cluedin.com",
  "warehouse.company.Address": "Havnegade 39, 1058 København K",
  "warehouse.company.BusinessType": "Agriculture"
}
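To make the /EntityType#Origin:Value format concrete, here is a small Python sketch. The EntityCode class and parse_entity_code function are hypothetical names of mine, not part of CluedIn, and the parser assumes that the Entity Type contains no "#" and the Origin contains no ":". The /Apple and /Banana codes are made up to mirror the apple and banana analogy.

```python
from typing import NamedTuple


class EntityCode(NamedTuple):
    entity_type: str  # business domain, e.g. "/Company"
    origin: str       # where the identifier comes from, e.g. "CRM(Contact)"
    value: str        # the identifier itself, e.g. "314"

    def __str__(self):
        return f"{self.entity_type}#{self.origin}:{self.value}"


def parse_entity_code(text):
    """Split '/EntityType#Origin:Value' into its three parts."""
    entity_type, rest = text.split("#", 1)
    origin, value = rest.split(":", 1)
    return EntityCode(entity_type, origin, value)


crm_code = EntityCode("/Company", "CRM(Contact)", "314")
print(crm_code)
print(parse_entity_code("/Company#Email:support@cluedin.com"))

# Same origin and value, different Entity Type: the codes are not equal.
print(EntityCode("/Apple", "Shop", "7") == EntityCode("/Banana", "Shop", "7"))
```

Because all three parts take part in the comparison, two identifiers are only considered the same when the business domain, the origin, and the value all match.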

Processing

Let's see what happens when CluedIn processes the first Clue, which represents the CRM record.

CluedIn tries to find an existing Entity that shares at least one Entity Code with the Clue. If it finds such an Entity, the processed Clue becomes a new Data Part of this Entity. Otherwise, CluedIn creates a new Entity, and the Clue becomes its first Data Part. In our case, the latter happens, so we have a new Entity:

// Entity
{
  "goldenRecord": {
    "entityType": "/Company",
    "codes": [
      "/Company#CRM(Contact):314",
      "/Company#Email:support@cluedin.com"
    ],
    "crm.contact.contact_id": 314,
    "crm.contact.contact_name": "CluedIn ApS",
    "crm.contact.email": "support@cluedin.com",
    "crm.contact.phone": "+45 91 96 56 95",
    "crm.contact.address": "Ringsted",
    "crm.contact.branch": "Master Data Management and AI",
    "crm.contact.notes": "The most innovative Azure-native Master Data Management Platform on the market."
  },
  "records": [
    {
      "entityType": "/Company",
      "codes": [
        "/Company#CRM(Contact):314",
        "/Company#Email:support@cluedin.com"
      ],
      "crm.contact.contact_id": 314,
      "crm.contact.contact_name": "CluedIn ApS",
      "crm.contact.email": "support@cluedin.com",
      "crm.contact.phone": "+45 91 96 56 95",
      "crm.contact.address": "Ringsted",
      "crm.contact.branch": "Master Data Management and AI",
      "crm.contact.notes": "The most innovative Azure-native Master Data Management Platform on the market."
    }
  ]
}

You can see that our Clue now sits in the "records" collection. The Entity also has a property called the Golden Record, which is built from all the records added to this Entity.

Now, CluedIn processes the Clue from the Warehouse and finds an entity with the code /Company#Email:support@cluedin.com, so CluedIn merges the Clue with the existing Entity.

Notice that the existing Entity now has two records (data parts), and the golden record is recalculated accordingly:

// Entity
{
  "goldenRecord": {
    "entityType": "/Company",
    "codes": [
      "/Company#CRM(Contact):314",
      "/Company#Email:support@cluedin.com",
      "/Company#Warehouse(Company):217",
      "/Company#CVR:36548681"
    ],
    "crm.contact.contact_id": 314,
    "crm.contact.contact_name": "CluedIn ApS",
    "crm.contact.email": "support@cluedin.com",
    "crm.contact.phone": "+45 91 96 56 95",
    "crm.contact.address": "Ringsted",
    "crm.contact.branch": "Master Data Management and AI",
    "crm.contact.notes": "The most innovative Azure-native Master Data Management Platform on the market.",
    "warehouse.company.Id": 217,
    "warehouse.company.Name": "CluedIn",
    "warehouse.company.CVR": "36548681",
    "warehouse.company.Email": "support@cluedin.com",
    "warehouse.company.Address": "Havnegade 39, 1058 København K",
    "warehouse.company.BusinessType": "Agriculture"
  },
  "records": [
    {
      "entityType": "/Company",
      "codes": [
        "/Company#Warehouse(Company):217",
        "/Company#Email:support@cluedin.com",
        "/Company#CVR:36548681"
      ],
      "warehouse.company.Id": 217,
      "warehouse.company.Name": "CluedIn",
      "warehouse.company.CVR": "36548681",
      "warehouse.company.Email": "support@cluedin.com",
      "warehouse.company.Address": "Havnegade 39, 1058 København K",
      "warehouse.company.BusinessType": "Agriculture"
    },
    {
      "entityType": "/Company",
      "codes": [
        "/Company#CRM(Contact):314",
        "/Company#Email:support@cluedin.com"
      ],
      "crm.contact.contact_id": 314,
      "crm.contact.contact_name": "CluedIn ApS",
      "crm.contact.email": "support@cluedin.com",
      "crm.contact.phone": "+45 91 96 56 95",
      "crm.contact.address": "Ringsted",
      "crm.contact.branch": "Master Data Management and AI",
      "crm.contact.notes": "The most innovative Azure-native Master Data Management Platform on the market."
    }
  ]
}
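The processing logic described in this section can be sketched in Python as follows. This is a simplification under my own assumptions, not how CluedIn is implemented: process_clue and build_golden_record are hypothetical names, the clues are shortened, and the Golden Record is built naively by letting newer values overwrite older ones.

```python
def build_golden_record(records):
    """Naive Golden Record: union of codes, newer values overwrite older ones."""
    golden = {"entityType": records[0]["entityType"], "codes": []}
    for record in reversed(records):  # records are stored newest first
        for code in record["codes"]:
            if code not in golden["codes"]:
                golden["codes"].append(code)
        for key, value in record.items():
            if key not in ("entityType", "codes"):
                golden[key] = value
    return golden


def process_clue(entities, clue):
    """Attach the clue to an entity that shares a code, or create a new entity."""
    clue_codes = set(clue["codes"])
    for entity in entities:
        if clue_codes & set(entity["goldenRecord"]["codes"]):
            entity["records"].insert(0, clue)  # newest Data Part first
            break
    else:  # no existing entity shares a code with the clue
        entity = {"goldenRecord": {}, "records": [clue]}
        entities.append(entity)
    entity["goldenRecord"] = build_golden_record(entity["records"])
    return entity


crm_clue = {
    "entityType": "/Company",
    "codes": ["/Company#CRM(Contact):314", "/Company#Email:support@cluedin.com"],
    "crm.contact.contact_name": "CluedIn ApS",
}
warehouse_clue = {
    "entityType": "/Company",
    "codes": [
        "/Company#Warehouse(Company):217",
        "/Company#Email:support@cluedin.com",
        "/Company#CVR:36548681",
    ],
    "warehouse.company.Name": "CluedIn",
}

entities = []
process_clue(entities, crm_clue)        # creates a new Entity
process_clue(entities, warehouse_clue)  # merges via the shared Email code
print(len(entities), entities[0]["goldenRecord"]["codes"])
```

The second call finds the Entity through the shared Email code, so the list still holds a single Entity, now with two Data Parts.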

Vocabulary Keys Mapping

Let's look at the Golden Record properties (Vocabulary Keys). Many of them are semantically equal, for example crm.contact.contact_name and warehouse.company.Name, or crm.contact.email and warehouse.company.Email.

With the help of vocabulary key mapping, we can improve this situation. When you map one (source) Vocabulary Key to another (target), CluedIn will store the source Vocabulary Key values in the target. Simply speaking, we tell CluedIn: "When we say A, we mean B".

So, let's create a new Vocabulary that will have no connection to any specific data source and will represent the data domain in business terms:

# Company Vocabulary
name: "Company"
prefix: "company"
keys:
    - "company.Name"
    - "company.CVR"
    - "company.Email"
    - "company.Phone"
    - "company.Address"
    - "company.BusinessType"
    - "company.Notes"

Now, we can map semantically equal Vocabulary Keys to the keys in our new "core" Vocabulary:

# CRM Contact Vocabulary
name: "CRM Contact"
prefix: "crm.contact"
keys:
    - "crm.contact.contact_id"
    - "crm.contact.contact_name -> company.Name"
    - "crm.contact.email -> company.Email"
    - "crm.contact.phone -> company.Phone"
    - "crm.contact.address -> company.Address"
    - "crm.contact.branch -> company.BusinessType"
    - "crm.contact.notes -> company.Notes"
---
# Warehouse Company Vocabulary
name: "Warehouse Company"
prefix: "warehouse.company"
keys:
    - "warehouse.company.Id"
    - "warehouse.company.Name -> company.Name"
    - "warehouse.company.CVR -> company.CVR"
    - "warehouse.company.Email -> company.Email"
    - "warehouse.company.Address -> company.Address"
    - "warehouse.company.BusinessType -> company.BusinessType"

In other words, when we say crm.contact.email, we mean company.Email, and when we say warehouse.company.Email, we mean company.Email too. This change affects our sample Entity:

// Entity
{
  "goldenRecord": {
    "entityType": "/Company",
    "codes": [
      "/Company#CRM(Contact):314",
      "/Company#Email:support@cluedin.com",
      "/Company#Warehouse(Company):217",
      "/Company#CVR:36548681"
    ],
    "company.Address": "Havnegade 39, 1058 København K",
    "company.BusinessType": "Agriculture",
    "company.CVR": "36548681",
    "company.Email": "support@cluedin.com",
    "company.Name": "CluedIn",
    "company.Notes": "The most innovative Azure-native Master Data Management Platform on the market.",
    "company.Phone": "+45 91 96 56 95",
    "crm.contact.contact_id": 314,
    "warehouse.company.Id": 217
  },
  "records": [
    {
      "entityType": "/Company",
      "codes": [
        "/Company#Warehouse(Company):217",
        "/Company#Email:support@cluedin.com",
        "/Company#CVR:36548681"
      ],
      "warehouse.company.Id": 217,
      "company.Name": "CluedIn",
      "company.CVR": "36548681",
      "company.Email": "support@cluedin.com",
      "company.Address": "Havnegade 39, 1058 København K",
      "company.BusinessType": "Agriculture"
    },
    {
      "entityType": "/Company",
      "codes": [
        "/Company#CRM(Contact):314",
        "/Company#Email:support@cluedin.com"
      ],
      "crm.contact.contact_id": 314,
      "company.Name": "CluedIn ApS",
      "company.Email": "support@cluedin.com",
      "company.Phone": "+45 91 96 56 95",
      "company.Address": "Ringsted",
      "company.BusinessType": "Master Data Management and AI",
      "company.Notes": "The most innovative Azure-native Master Data Management Platform on the market."
    }
  ]
}

You may notice that we didn't map warehouse.company.Id and crm.contact.contact_id. These properties are specific to their source systems, so there is no need to map them to a common business term.
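Here is a minimal Python sketch of the "when we say A, we mean B" idea. The KEY_MAPPING dictionary and apply_key_mapping function are hypothetical names of mine and only cover a few of the keys; in CluedIn itself the mapping is configured on the Vocabulary Keys rather than written as code.

```python
# Source Vocabulary Key -> target Vocabulary Key (a subset of the mappings).
KEY_MAPPING = {
    "crm.contact.contact_name": "company.Name",
    "crm.contact.email": "company.Email",
    "warehouse.company.Name": "company.Name",
    "warehouse.company.Email": "company.Email",
}


def apply_key_mapping(record, mapping):
    """Rename mapped keys to their target; leave unmapped keys unchanged."""
    return {mapping.get(key, key): value for key, value in record.items()}


crm_record = {
    "crm.contact.contact_id": 314,
    "crm.contact.contact_name": "CluedIn ApS",
    "crm.contact.email": "support@cluedin.com",
}
print(apply_key_mapping(crm_record, KEY_MAPPING))
```

The unmapped crm.contact.contact_id passes through unchanged, which matches what happens to the system-specific ID keys in the Entity.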

Survivorship

Now look at the company.BusinessType Vocabulary Key in the Golden Record: in one Data Part it is "Agriculture", and in the other it is "Master Data Management and AI". "Agriculture" wins even though it is not the best choice.

We can fix this with a Survivorship Rule that prefers data from a particular source or runs more complex logic to pick the best available values:

// Entity
{
  "goldenRecord": {
    "entityType": "/Company",
    "codes": [
      "/Company#CRM(Contact):314",
      "/Company#Email:support@cluedin.com",
      "/Company#Warehouse(Company):217",
      "/Company#CVR:36548681"
    ],
    "company.Address": "Havnegade 39, 1058 København K",
    "company.BusinessType": "Master Data Management and AI",
    "company.CVR": "36548681",
    "company.Email": "support@cluedin.com",
    "company.Name": "CluedIn ApS",
    "company.Notes": "The most innovative Azure-native Master Data Management Platform on the market.",
    "company.Phone": "+45 91 96 56 95",
    "crm.contact.contact_id": 314,
    "warehouse.company.Id": 217
  },
  "records": [
    {
      "entityType": "/Company",
      "codes": [
        "/Company#Warehouse(Company):217",
        "/Company#Email:support@cluedin.com",
        "/Company#CVR:36548681"
      ],
      "warehouse.company.Id": 217,
      "company.Name": "CluedIn",
      "company.CVR": "36548681",
      "company.Email": "support@cluedin.com",
      "company.Address": "Havnegade 39, 1058 København K",
      "company.BusinessType": "Agriculture"
    },
    {
      "entityType": "/Company",
      "codes": [
        "/Company#CRM(Contact):314",
        "/Company#Email:support@cluedin.com"
      ],
      "crm.contact.contact_id": 314,
      "company.Name": "CluedIn ApS",
      "company.Email": "support@cluedin.com",
      "company.Phone": "+45 91 96 56 95",
      "company.Address": "Ringsted",
      "company.BusinessType": "Master Data Management and AI",
      "company.Notes": "The most innovative Azure-native Master Data Management Platform on the market."
    }
  ]
}
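As an illustration, a "prefer a particular source" rule could look like the Python sketch below. This is my own simplification, not CluedIn's rule engine: prefer_origin is a hypothetical name, and I assume a Data Part's source can be recognized by the Origin inside its Entity Codes.

```python
def prefer_origin(records, key, origin):
    """Pick `key` from a record whose codes include `origin`; else the first value found."""
    fallback = None
    for record in records:
        if key not in record:
            continue
        if any(f"#{origin}:" in code for code in record["codes"]):
            return record[key]
        if fallback is None:
            fallback = record[key]
    return fallback


# Shortened Data Parts of the sample Entity.
records = [
    {
        "codes": ["/Company#Warehouse(Company):217"],
        "company.BusinessType": "Agriculture",
    },
    {
        "codes": ["/Company#CRM(Contact):314"],
        "company.BusinessType": "Master Data Management and AI",
    },
]
print(prefer_origin(records, "company.BusinessType", "CRM(Contact)"))
```

If no Data Part comes from the preferred origin, the sketch falls back to the first value it finds, so the Golden Record never loses a property because of the rule.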

Rules

Survivorship Rules decide which Data Part values should survive in the Golden Record.

You can also use Data Part rules to edit Data Parts in place (without creating new Data Parts of the Entity) and Golden Record rules to edit the Golden Record.

Enrichment

Say you have ten names in your notepad, and the public phone book holds millions of phone numbers and addresses. Would you copy the whole phone book into your notepad? Probably not. But you may want to look up those particular names in the phone book and "enrich" the names in your notepad with addresses and phone numbers. This is exactly what Enrichers in CluedIn do.

The Enrichers are integrations that can look up public data sources and enrich your data with additional information.

In the example below, the CVR Enricher looks up the Entity's company.CVR value in the public registry and adds another Data Part to our Entity, so the Entity gets enriched:

// CVR data
{
  "cvr": "36548681",
  "name": "CluedIn ApS",
  "address": "Hagelbjergvej 8 4100 Ringsted",
  "phone": "+45 91 96 56 95",
  "website": "https://cluedin.com"
}

# CVR Vocabulary
name: "CVR"
prefix: "cvr"
keys:
    - "cvr.cvr -> company.CVR"
    - "cvr.name -> company.Name"
    - "cvr.address -> company.Address"
    - "cvr.phone -> company.Phone"
    - "cvr.website -> company.Website"

// Entity
{
  "goldenRecord": {
    "entityType": "/Company",
    "codes": [
      "/Company#CRM(Contact):314",
      "/Company#Email:support@cluedin.com",
      "/Company#Warehouse(Company):217",
      "/Company#CVR:36548681"
    ],
    "company.Address": "Hagelbjergvej 8 4100 Ringsted", // enriched!
    "company.BusinessType": "Master Data Management and AI",
    "company.CVR": "36548681",
    "company.Email": "support@cluedin.com",
    "company.Name": "CluedIn ApS",
    "company.Notes": "The most innovative Azure-native Master Data Management Platform on the market.",
    "company.Phone": "+45 91 96 56 95",
    "company.Website": "https://cluedin.com", // enriched!
    "crm.contact.contact_id": 314,
    "warehouse.company.Id": 217
  },
  "records": [
    {
      "entityType": "/Company",
      "codes": ["/Company#CVR:36548681"],
      "company.CVR": "36548681",
      "company.Name": "CluedIn ApS", // !
      "company.Address": "Hagelbjergvej 8 4100 Ringsted", // !
      "company.Phone": "+45 91 96 56 95", // !
      "company.Website": "https://cluedin.com" // !
    },
    {
      "entityType": "/Company",
      "codes": [
        "/Company#Warehouse(Company):217",
        "/Company#Email:support@cluedin.com",
        "/Company#CVR:36548681"
      ],
      "warehouse.company.Id": 217,
      "company.Name": "CluedIn",
      "company.CVR": "36548681",
      "company.Email": "support@cluedin.com",
      "company.Address": "Havnegade 39, 1058 København K",
      "company.BusinessType": "Agriculture"
    },
    {
      "entityType": "/Company",
      "codes": [
        "/Company#CRM(Contact):314",
        "/Company#Email:support@cluedin.com"
      ],
      "crm.contact.contact_id": 314,
      "company.Name": "CluedIn ApS",
      "company.Email": "support@cluedin.com",
      "company.Phone": "+45 91 96 56 95",
      "company.Address": "Ringsted",
      "company.BusinessType": "Master Data Management and AI",
      "company.Notes": "The most innovative Azure-native Master Data Management Platform on the market."
    }
  ]
}

CluedIn has multiple preinstalled Enrichers. You can also develop and deploy your own.
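To show the shape of the idea, here is a toy Python sketch of an Enricher. It is not the CluedIn Enricher interface: enrich_by_cvr is a hypothetical name, and FAKE_REGISTRY is an in-memory stand-in for the public registry that a real Enricher would query over the network.

```python
# Stand-in for a public registry; a real Enricher would call an external service.
FAKE_REGISTRY = {
    "36548681": {
        "name": "CluedIn ApS",
        "address": "Hagelbjergvej 8 4100 Ringsted",
        "website": "https://cluedin.com",
    }
}


def enrich_by_cvr(entity, registry):
    """Add a Data Part with registry data if the entity's CVR is found."""
    golden = entity["goldenRecord"]
    cvr = golden.get("company.CVR")
    hit = registry.get(cvr)
    if hit is None:
        return None  # nothing to enrich with
    data_part = {
        "entityType": golden["entityType"],
        "codes": [f"{golden['entityType']}#CVR:{cvr}"],
        "company.CVR": cvr,
        "company.Name": hit["name"],
        "company.Address": hit["address"],
        "company.Website": hit["website"],
    }
    entity["records"].insert(0, data_part)  # newest Data Part first
    return data_part


entity = {
    "goldenRecord": {"entityType": "/Company", "company.CVR": "36548681"},
    "records": [],
}
print(enrich_by_cvr(entity, FAKE_REGISTRY))
```

The sketch only adds the new Data Part. Recalculating the Golden Record afterwards is left out to keep the example short.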

Deduplication

Imagine you ingested another record in which the company name is "Clued-In" (obviously wrong!) and the ID is different:

// Entity built from another CRM contact record
{
  "goldenRecord": {
    "entityType": "/Company",
    "codes": ["/Company#CRM(Contact):42"],
    "crm.contact.contact_id": 42,
    "company.Name": "Clued-In"
  },
  "records": [
    {
      "entityType": "/Company",
      "codes": ["/Company#CRM(Contact):42"],
      "crm.contact.contact_id": 42,
      "company.Name": "Clued-In"
    }
  ]
}

This record will not merge automatically with our existing Entity because the two share no Entity Codes. However, you can create a Deduplication Project in CluedIn and specify fuzzy matching criteria, e.g., a similar name.

Once the Deduplication Project is set up, CluedIn can automatically group entities that share similar attributes, indicating they could be duplicates.
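To give a feeling for what fuzzy matching on names means, here is a small Python sketch based on the standard library's difflib. It is not the matching algorithm CluedIn uses: the function names, the normalization step, and the 0.8 threshold are all my own assumptions, and "Contoso" is a made-up unrelated company.

```python
from difflib import SequenceMatcher
from itertools import combinations


def normalize(name):
    """Lowercase the name and drop everything except letters and digits."""
    return "".join(ch for ch in name.lower() if ch.isalnum())


def similarity(a, b):
    """Similarity ratio between 0.0 and 1.0 of the normalized names."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()


def duplicate_candidates(names, threshold=0.8):
    """Return pairs of names whose similarity reaches the threshold."""
    return [
        (a, b)
        for a, b in combinations(names, 2)
        if similarity(a, b) >= threshold
    ]


names = ["CluedIn ApS", "Clued-In", "Contoso"]
print(duplicate_candidates(names))
```

The pairs returned are only candidates. As in a Deduplication Project, someone (or some rule) still decides whether they really are the same company.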

When you merge "Clued-In" into "CluedIn ApS" because they are duplicates of the same company, you can decide what to do with conflicting properties, so the right name still survives:

// Connecting record
{
  "entityType": "/Company",
  "codes": [
    "/Company#CluedIn(mergeEntities):fb3497a3-620b-4772-9f28-99e721fb6d9c"
  ],
  "company.Name": "CluedIn ApS"
}
// Entity
{
  "goldenRecord": {
    "entityType": "/Company",
    "codes": [
      "/Company#CRM(Contact):42",
      "/Company#CRM(Contact):314",
      "/Company#CVR:36548681",
      "/Company#Email:support@cluedin.com",
      "/Company#Warehouse(Company):217"
    ],
    "company.Address": "Hagelbjergvej 8 4100 Ringsted",
    "company.BusinessType": "Master Data Management and AI",
    "company.CVR": "36548681",
    "company.Email": "support@cluedin.com",
    "company.Name": "CluedIn ApS",
    "company.Notes": "The most innovative Azure-native Master Data Management Platform on the market.",
    "company.Phone": "+45 91 96 56 95",
    "company.Website": "https://cluedin.com",
    "crm.contact.contact_id": 314,
    "warehouse.company.Id": 217
  },
  "records": [
    {
      "entityType": "/Company",
      "codes": [
        "/Company#CluedIn(mergeEntities):fb3497a3-620b-4772-9f28-99e721fb6d9c"
      ],
      "company.Name": "CluedIn ApS"
    },
    {
      "entityType": "/Company",
      "codes": ["/Company#CRM(Contact):42"],
      "crm.contact.contact_id": 42,
      "company.Name": "Clued-In"
    },
    {
      "entityType": "/Company",
      "codes": ["/Company#CVR:36548681"],
      "company.CVR": "36548681",
      "company.Name": "CluedIn ApS",
      "company.Address": "Hagelbjergvej 8 4100 Ringsted",
      "company.Phone": "+45 91 96 56 95",
      "company.Website": "https://cluedin.com"
    },
    {
      "entityType": "/Company",
      "codes": [
        "/Company#Warehouse(Company):217",
        "/Company#Email:support@cluedin.com",
        "/Company#CVR:36548681"
      ],
      "warehouse.company.Id": 217,
      "company.Name": "CluedIn",
      "company.CVR": "36548681",
      "company.Email": "support@cluedin.com",
      "company.Address": "Havnegade 39, 1058 København K",
      "company.BusinessType": "Agriculture"
    },
    {
      "entityType": "/Company",
      "codes": [
        "/Company#CRM(Contact):314",
        "/Company#Email:support@cluedin.com"
      ],
      "crm.contact.contact_id": 314,
      "company.Name": "CluedIn ApS",
      "company.Email": "support@cluedin.com",
      "company.Phone": "+45 91 96 56 95",
      "company.Address": "Ringsted",
      "company.BusinessType": "Master Data Management and AI",
      "company.Notes": "The most innovative Azure-native Master Data Management Platform on the market."
    }
  ]
}

Notice that CluedIn creates a connecting Data Part that holds the resolution for the conflicting property when you override the default logic manually. If you don't override it, CluedIn decides the best value for you.

Export

Exporting data from CluedIn is a long, separate topic. But it's important to note that the Entity from our example is now enriched and groups multiple records, so its data quality has improved compared to the raw records we ingested into CluedIn.

When you export this Entity's Golden Record, you also export its Entity Codes, including the identifiers from the source systems:

  • /Company#CRM(Contact):42
  • /Company#CRM(Contact):314
  • /Company#Warehouse(Company):217

So now you can feed this data into your source systems (CRM and Warehouse) and update corresponding records there.
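As a sketch of how a downstream script might use the exported codes, the Python snippet below groups the identifier values by Origin. The function name source_ids is hypothetical, and the code assumes the /EntityType#Origin:Value format described earlier, with no ":" inside the Origin.

```python
def source_ids(codes):
    """Group Entity Code values by Origin, e.g. {'CRM(Contact)': ['42', '314']}."""
    grouped = {}
    for code in codes:
        _, rest = code.split("#", 1)        # drop the Entity Type
        origin, value = rest.split(":", 1)  # separate Origin from Value
        grouped.setdefault(origin, []).append(value)
    return grouped


exported_codes = [
    "/Company#CRM(Contact):42",
    "/Company#CRM(Contact):314",
    "/Company#Warehouse(Company):217",
]
for origin, ids in source_ids(exported_codes).items():
    print(origin, ids)
```

With the IDs grouped this way, it is clear which records to update in the CRM (42 and 314) and which one in the Warehouse (217).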

Conclusion

This article may contain a lot of information to take in at once, but trust me: when you work with CluedIn, it will help you answer most questions related to data modeling.