Map IDs during data import

It’s always best to develop against realistic data. No data is more realistic than production data. If you can legally and ethically import production data to development, you should definitely do so.

We’re developing a new rebate program for pharmaceuticals. Rather than creating a dummy product catalog, we would rather import the actual catalog from production.

The problem is the auto-incrementing IDs. We’re not just pulling in a flat structure; we’re maintaining relationships in a tree structure. When we insert rows into the dev database, these rows get different IDs than the production rows. As a result, the foreign key relationships get broken.

One solution to this problem is to drop all of the data and do an identity insert. Temporarily turn off the auto-incrementing behavior of the primary key so that you can specify exactly what it should be. This brings all of the production IDs into dev, so the foreign key relationships still line up.

The problem with that solution is that you have to drop all of the existing dev data in order to import. We have other structures that are loosely related to the product catalog. Dropping the product catalog would require us to drop some of the tables used by these other systems. All of the test data we loaded for those systems would be lost.

Here’s my solution
Instead of enforcing that the development IDs are equal to the production IDs, we can instead record a map. By joining to this map, we can insert the development IDs into foreign key columns, preserving the relationships from production. Here’s how we do it.

Step 1: Restore a backup of production data into a new database on the development server. You might also use replication for this, if you already have a subscription. Be sure that you get only the subset that you are legally and ethically permitted to use. In our case, the product catalog is what we need, but we do not restore any tables containing patient information.

Step 2: Identify the natural key. The ID that you are mapping is the surrogate key. The natural key is the set of columns that tell you that this is conceptually the same row. In our case, we are using the drug class name and the product group name.

Step 3: Copy rows that do not already exist based on the natural key. Use INSERT … SELECT … WHERE NOT EXISTS for this operation. For example:

INSERT INTO [Dev].dbo.[DrugClass]
           ([Name])
SELECT [Name]
  FROM [Prod].dbo.[DrugClass] s
  WHERE NOT EXISTS (SELECT * FROM [Dev].dbo.[DrugClass] t WHERE t.Name = s.Name)
GO

Step 4: Create a temp table to map surrogate keys. Populate this table by joining the natural keys.

DROP TABLE #DrugClassMap
GO

SELECT s.DrugClassId as SourceDrugClassId, t.DrugClassId as TargetDrugClassId
  INTO #DrugClassMap
  FROM [Prod].dbo.[DrugClass] s
  JOIN [Dev].dbo.[DrugClass] t ON s.Name = t.Name
GO

Step 5: Use the temp table to import related tables. Join to the source table by source ID, and insert by target ID.

INSERT INTO [Dev].dbo.[ProductGroup]
           ([DrugClassId]
           ,[Name])
SELECT m.[TargetDrugClassId]
      ,s.[Name]
  FROM [Prod].dbo.[ProductGroup] s
  JOIN #DrugClassMap m ON m.SourceDrugClassId = s.DrugClassId
  WHERE NOT EXISTS (SELECT * FROM [Dev].dbo.[ProductGroup] t WHERE t.Name = s.Name AND t.DrugClassId = m.TargetDrugClassId)
GO

Now you can run the script anytime you need to import data. You won’t destroy existing data in the process.

Leave a Reply

You must be logged in to post a comment.