Ordering Tables To Preserve Referential Integrity

R Glen Cooper

This article first appeared in www.sqlservercentral.com (20 March 2009).

When merging two databases with the same schema, primary keys may have the same values in both databases, yet represent unrelated records. Typically you would carry over such keys as legacy values, re-create new ones in the target database, and then re-calculate all foreign keys that reference them in related tables. But in what order should these tables be ported to avoid violating referential integrity in the target database?

Consider the following schema for a customer database (Fig. 1):

Fig. 1


If this database is appended to another database having the same schema, all primary keys in the source database must be re-calculated to new values in the target database to avoid clashing with existing values. Furthermore, all foreign keys pointing to them from other tables must also be re-calculated so they continue referencing the same records as before. That's the easy part.

The hard part is figuring out the order these tables should be ported.

In the above example, the Customer table must be ported before the Customer_Address table so that referential integrity won't be violated in the target database. In other words, when porting the latter table, all records from the former table must already be present so that the foreign key in the latter table can be re-calculated using the legacy values from the original table. Otherwise the insertions will fail.

We want to assign a "level" to each table, so that if they're ported by ascending level then no violations of referential integrity will occur.

In the above example, it's easy to see that the following assignment of levels will do that:

Level
Table
Abbreviation
0
Customer
C
0
Employee
E
0
Supplier
S
1
Customer_Address
CA
1
Invoice_Header
IH
1
Inventory
I
1
Supplier_Address
SA
2
Inventory_Order
IO
2
Invoice_Sublet
IS
2
Invoice_Task
IT
2
Invoice_Item
II
2
Related_Part
RP

Here the Customer, Employee and Supplier tables may be ported in any order (so long as they're ported first). Ditto for Customer_Address, Invoice_Header, Inventory, and Supplier_Address (so long as they're ported next). But Invoice_Item must be ported later since it has a foreign key pointing to Invoice_Header.

Furthermore, it's easy to see that this assignment of levels is "optimal" in the sense that each table receives the lowest possible assignment.

The following script computes an optimal assignment of levels that preserves referential integrity.

To explain how it works, it's helpful to use the language of partially-ordered sets (see
http://en.wikipedia.org/wiki/Partially_ordered_set  for an overview of posets).

Suppose that Invoice_Header has a foreign key pointing to Customer.

We can represent this dependency as:

Customer < Invoice_Header

where the "smaller" table Customer must be ported before Invoice_Header.

Suppose also that Invoice_Item has a pointer to Invoice_Header.

Then Invoice_Header < Invoice_Item.

Of course, this now implies that Customer must always be ported before Invoice_Item even though there's no direct link between them.

To formally describe this "transitivity of dependency", recursively define the relationship << as:

A << B if A < B
A << B if A < C and C << B for some C

Prolog programmers will recognize these "axioms" as the classical ancestor relationship. In particular, Customer << Invoice_Item. More generally, << is a partial order expressing what tables must be ported before others, even if there's no pointer between two tables satisfying this relationship.

It's easy to demonstrate that << defines a partial order since closed loops in < (called a preorder) aren't possible for tables with data in them. Note that two tables violating referential integrity can still co-exist providing that no records have been inserted into either of them (and of course, it will always remain that way). This remote possibility is addressed by the script because otherwise they'll cause infinite loops.

To port these tables while respecting their dependencies, it's sufficient to list them in such a way that each table appears "before" those that are "larger" in the partial ordering <<.

That's what the script does.

The so-called Hasse diagram in Fig. 2 displays the above preorder (with arrows representing <). It will show us how to assign the table levels.


Fig. 2

What we first do is add a fictional table F (see Fig. 3) and enough virtual arrows so that every non-fictional table points to at least one other table (possibly fictional). By using this trick we'll avoid special cases. This virtual object is reminiscent of the well-known "point of infinity" in non-Euclidean geometry since its purpose is to simply appear "bigger" than anything else in the database for the purposes of the script.


Fig. 3

Then we select those tables to which no arrow points, assign a level of 0 to them, and remove them from the diagram (along with all arrows connected to or from them). These are the tables that can be ported first, since no other tables depend on them.

For example, we would remove S and its two arrows. But since arrows are represented in the script by rows of an SQL Server table whose columns define the "to" and "from" tables, SA would suddenly disappear if the fictional table weren't present (so SA would never be part of the final answer). That's why we have F.

After this we start all over with whatever remains, but assign a level of 1. Note that SA has been assigned a level of 1 at this point, but it could never be assigned anything lower because an incoming arrow was originally present (from S). In particular, each table will be assigned its own level "just in time". In fact, levels are just the longest path lengths of any table to one of level 0, if you follow the arrows backwards.

We continue this way until no more elements remain.

Then we remove F from the answer.

/*
Script:		Ordering database tables in SQL Server 2005

Purpose:	This script defines a 'level' for each 
		table that's involved in a relationship
		so that porting such tables by ascending level 
		will not violate referential integrity during
		data ports.

Note:		The tables tblPreorder and tblAnswer must
		not exist in the database (otherwise, rename
		them or use table variables).  Ditto for the
		value of @TableName below.  
*/

-- No counting
SET NOCOUNT ON

-- Declarations
DECLARE @Level		INT
DECLARE @NumTables	INT
DECLARE @TableName	VARCHAR(255)

-- Arbitrarily set the name of a fictional table. 
-- This is used to insure that each table is 'less than' 
-- at least one table in the loop below (so we create a 
-- fictional one and later delete it from the answer).  
-- With this in place, the script can avoid awkward, 
-- special cases.
SET @TableName = 'Z'

-- Determine the number of tables.
-- This number is used to bound the maximum value of @Level. 
SET @NumTables = (SELECT COUNT(*) from dbo.sysobjects WHERE OBJECTPROPERTY(id, N'IsUserTable') = 1)

-- Drop preorder and answer tables
IF (SELECT TABLE_NAME FROM information_schema.tables WHERE TABLE_NAME = 'tblPreorder') IS NOT NULL
	DROP TABLE tblPreorder
IF (SELECT TABLE_NAME FROM information_schema.tables WHERE TABLE_NAME = 'tblAnswer') IS NOT NULL
	DROP TABLE tblAnswer

-- SQL Server 2005 version.
-- Create a preorder table using the database 
-- relationships between tables.  By definition,
-- a table A will be 'less than' a table B if 
-- B has a foreign key pointing to A.
SELECT A, B
INTO tblPreorder
FROM
(
SELECT
-- May be multiple relationships between two 
-- tables but we only need one
DISTINCT	
so1.name as B,
so2.name as A
FROM
sys.foreign_key_columns fk INNER JOIN
sys.objects so1 ON fk.parent_object_id = so1.object_id INNER JOIN
sys.objects so2 ON fk.referenced_object_id = so2.object_id INNER JOIN
sys.objects so3 ON fk.constraint_object_id = so3.object_id
) derived
ORDER BY A, B

/*
-- SQL Server 2000 version (kludge). 
-- Create a preorder table using the database
-- relationships between tables.  By definition,
-- a table A will be 'less than' a table B if 
-- B has a foreign key pointing to A.
SELECT A, B
INTO tblPreorder
FROM
(
SELECT
-- May be multiple relationships between two
-- tables but we only need one
DISTINCT 
-- Get name of parent table pointing to foreign table
so2.name as B,
-- Get name of foreign table by deleting
-- name of parent table (along with punctuation)
-- from name of foreign key constraint 
RIGHT(so1.name,LEN(so1.name) - (LEN(so2.name)+4)) as A 
FROM 
sysobjects so1
INNER JOIN
sysobjects so2
ON
so1.parent_obj = so2.id
WHERE
-- Restrict sysobjects to foreign key constraints  
so1.xtype='F' 
AND
-- Make sure name of foreign table represents 
-- a real table because sysobjects will append
-- 1, 2, etc. to foreign key constraint name if 
-- multiple relationships exist between a pair of them 
RIGHT(so1.name,LEN(so1.name) - (LEN(so2.name)+4)) IN (SELECT NAME FROM sysobjects WHERE xtype = 'U')
) derived
ORDER BY A, B
*/

-- Delete rows in preorder table involving pair of tables
-- that are less than each other, including self-joins.
-- Such pairs of tables are unlikely but it can happen if 
-- they're empty (and they'll always remain so).
-- Formally:
-- Delete rows R1 in preorder table 
-- where 
-- R1(A) = R2(B) and R1(B) = R2(A)
-- for some (possibly the same) row R2 
DELETE 
tblPreorder 
FROM
tblPreorder
INNER JOIN 
(
SELECT 
t1.A, 
t1.B 
FROM 
tblPreorder t1 INNER JOIN 
tblPreorder t2 ON 
t1.A = t2.B AND 
t1.B = t2.A
) derived
on 
tblPreorder.A = derived.A
and
tblPreorder.B = derived.B

-- Add the fictional table to the preorder.
-- For every table not less than some other table in the 
-- preorder, add an 'arrow' from it to the fictional table.
-- That way, every non-fictional table is now less than 
-- some other table (possibly fictional).
INSERT INTO 
tblPreorder(A,B)
SELECT DISTINCT 
B AS A, 
@TableName AS B
FROM tblPreorder
WHERE
B NOT IN (
SELECT A FROM tblPreorder
)

-- Create answer table
CREATE TABLE tblAnswer(
[Level]	INT,
A		VARCHAR(256)
)

-- Insert into answer table those tables that aren't 
-- less than any other table in the preorder.
-- Otherwise, they'll get missed in the loop below.
-- Of course, this will just be the fictional table.
-- Choose @NumTables as the level, since the levels of all 
-- other tables will have to be smaller, as a subsequent
-- recursive loop will show.
INSERT INTO tblAnswer(Level,A)
(
SELECT DISTINCT @NumTables AS 'Level', B
FROM tblPreorder
WHERE
B NOT IN (
SELECT A FROM tblPreorder
)
)

-- Recursively add tables in column A of preorder table 
-- to answer table if they don't appear anywhere in column B 
-- of preorder table, while deleting the rows of the preorder
-- table in which they appeared (so we don't do this again). 
-- Note that the fictional table guarantees that those
-- other tables not less than any other won't be prematurely 
-- dropped by the above deletions, preventing them
-- from eventually joining the answer table.
-- Furthermore, each table is added to the answer
-- table as soon as no table is less than it after
-- such deletions (just-in-time sequencing).
-- Increment the table level for each pass,
-- which cannot exceed the number of tables minus 1
-- since we're starting from 0.
SET @Level = 0
WHILE (SELECT COUNT(*) FROM tblPreorder) > 0
	BEGIN

	-- Add tables
	INSERT INTO tblAnswer(Level,A)
	(
	SELECT  @Level AS 'Level',A
	FROM tblPreorder
	WHERE
	A NOT IN (
	SELECT B FROM tblPreorder
	)
	)

	-- Delete them
	DELETE
	FROM tblPreorder
	WHERE
	A NOT IN (
	SELECT B FROM tblPreorder
	)

	-- Increment table level
	SET @Level = @Level + 1

	END

-- Delete fictional table from answer table
DELETE
tblAnswer
WHERE
A = @TableName

-- Display answer
SELECT DISTINCT * FROM tblAnswer

-- Drop tables
DROP TABLE tblPreorder
DROP TABLE tblAnswer
I've used permanent tables to compute the answer (instead of table variables) so readers can verify the computations as the script proceeds (the last two lines will need to be deleted for this). Furthermore, since preorders appear naturally in computing, the script can be used for other purposes where you want to know the level of each object in such relationships (simply replace the snippet that uses SQL Server's sys.foreign_key_columns with one of your own). For the mathematically inclined, this exercise determines the maximum cardinality of ancestor chains for every point in a preorder.

This technique was used to join 17 faculty departments in a major university, where special processing was required on each lookup table (e.g. academic titles) since incompatible values were used before they were merged.