Mathematical Auditing Of Email Broadcasts

R Glen Cooper

An earlier version of this article first appeared in www.sqlservercentral.com (22 May 2009).

Overview

Email broadcasting is nothing more than re-directing emails to multiple addresses, but it's a valuable tool for many companies. However, maintaining broadcast lists can be surprisingly difficult. Customers sometimes receive multiple copies of the same email (for no obvious reason) or none at all. A frequent complaint is that recipients ask to be removed from a mailing list but continue to receive them even after the sender attempts manual corrections. This article will show two ways in which subtle errors might occur, and offers a simple method for cleaning up broadcast tables using a bit of mathematics and a really nice feature of SQL Server 2005.

Although mailing lists are conceptually simple, several people may be independently maintaining them (using private spreadsheets, for example). This can cause many errors. And some of them aren't even evident until after the re-directions have been collated into a master list (e.g. two people independently putting the same re-direction into their own spreadsheets).

To show how this can happen, take a fictitious company called My Company whose email re-directions are the following (after Sales, Marketing, and Admin have submitted their re-directions to their email administrator):

bill@MyCompany.com
->
bill@Microsoft.com
SALES@MyCompany.com
->
promotion@MyCompany
SALES@MyCompany.com
->
ADMIN@MyCompany.com
SALES@MyCompany.com
->
janet@IBM.com
SALES@MyCompany.com
->
accounts
ADMIN@MyCompany.com
->
accounts
ceo@MyCompany.com
->
bill@Microsoft.com
info@MyCompany.com
->
webmaster
onsite@MyCompany.com
->
guest@MyCompany.com
guest@MyCompany.com
->
janet@IBM.com
advertising@MyCompany
->
MARKETING@MyCompany.com
promotion@MyCompany
->
advertising@MyCompany
president@MyCompany.com
->
john@MyCompany.com
MARKETING@MyCompany.com
->
president@MyCompany.com
MARKETING@MyCompany.com
->
SALES@MyCompany.com

Here, any email sent to bill@MyCompany.com would be automatically re-directed to bill@Microsoft.com (parent company). SALES@MyCompany.com is a mailing list whose recipients are promotion@MyCompany, ADMIN@MyCompany.com (another mailing list), janet@IBM.com (customer and guest), and accounts (a user account on the company's email server).

Already there are two things wrong with this broadcast (you did see them, right?):

Email returns to sender:

MARKETING@MyCompany.com
->
SALES@MyCompany.com
->
promotion@MyCompany
->
advertising@MyCompany
->
MARKETING@MyCompany.com

Recipient occasionally receives multiple copies of the same email:

SALES@MyCompany.com
->
accounts
SALES@MyCompany.com
->
ADMIN@MyCompany.com
->
accounts

The scripts below will detect these problems in any email broadcast.

Note that when correcting such problems, re-directions can't simply be deleted if the owners of the re-directed addresses still want them re-directed. Furthermore, eliminating duplicate records on those re-directions, or records with the same target address, won't necessarily fix them.

A simple example demonstrates this:

Suppose we have one broadcast submitted by Arnold:

A --> b
A --> c

and two re-directions submitted individually by Bob and Calvin (which you can't delete on the email server because these people don't want any emails):

b --> A
c --> b

Although A is a valid email address, it's meant to represent a mailing list (eg. Customers@myCompany.com) while b, c are real people (eg. bob@myCompany.com and calvin@myCompany.com).

There are no duplicates in this list, but both circularity and redundancy exist.

We could remove the second re-direction to avoid duplicate targets, but then Calvin would start getting emails.

If we expand all mailing lists on the right hand side, we would get:

A --> b
A --> c
b --> b
b --> c
c --> b

Eliminating self re-directions gets us:

A --> b
A --> c
b --> c
c --> b

Since we can't eliminate the last re-direction submitted individually by Calvin, how do we get rid of the duplicates on the right hand side?

We could eliminate b --> c, but then Bob's now getting emails.

This extreme example shows that some compromise to the original mailing list and re-directions might be required to fix the problems, such as asking Bob to re-direct his emails elsewhere:

A --> c
b --> d
c --> b

Of course, you would need to re-run the scripts to see if any new problems are now introduced.

Although the scripts below will detect these problems, eliminating them requires human intervention (and perhaps consultation with the affected parties). Automating this process would be an interesting problem in its own right.

How the scripts work

For notational simplicity, label the email addresses from 1 to 16:

1
bill@MyCompany
2
ceo@MyCompany.com
3
info@MyCompany.com
4
onsite@MyCompany.com
5
guest@MyCompany.com
6
advertising@MyCompany
7
promotion@MyCompany
8
president@MyCompany.com
9
MARKETING@MyCompany.com
10
SALES@MyCompany.com
11
ADMIN@MyCompany.com
12
bill@Microsoft.com
13
janet@IBM.com
14
webmaster
15
john@MyCompany
16
accounts

Then the 15 re-directions defining the broadcast look like this:


What a difference a picture makes!

Now it's easy to see why 9 (Marketing) eventually re-directs mail back to itself and why 16 (accounts) always gets multiple emails sent to 10 (SALES). For some reason ADMIN (11) wasn't aware that SALES was already re-directing emails to accounts (and other addresses). For many organizations there could be hundreds of re-directions from a single email, which is why manual tinkering is ineffective.

To diagnose these pictures, let's view them as directed graphs where the above problems can be formulated in a natural way. For an overview of graphs, see
en.wikipedia.org/wiki/Graph_theory or use www.newdalesystems.com/resources/resources01/resources01.htm  to build them randomly on your Windows PC.

Recall that a directed graph is any finite set of elements (called vertices), together with a set of directed edges between them. Multiple edges between any pair of vertices are not permitted, as well as edges involving the same vertex. The scripts below will assume these conditions to keeps things simple. In any event, they're easy to test.

By definition, for any directed edge A -> B between vertices A and B, the element A is called the LHS ("left-hand side") and B is called the RHS ("right-hand side") of the edge.

Lists of re-directions (i.e. single broadcast) will be represented by a table whose columns are LHS and RHS. Later we'll see why this edge-oriented approach is a natural choice over its vertex-oriented alternative.

For directed graphs, any finite sequence of directed edges V1 -> V2 -> … -> Vn on the vertices V1, V2, … Vn is called a path (of length n-1), provided that the Vi are distinct (except that V1 may be Vn). A path is circular if V1 = Vn.

The scripts below will test any broadcast for the following conditions:

There are no circular paths

At most one path exists between any two vertices

Showing that a directed graph has no circular paths is fairly straightforward.

First, a simple definition. The out-degree of a vertex V is the number of edges of the form V -> W. That's nothing more than the number of edges leading from V to other vertices. In-degrees are defined similarly.

If the graph has no circular paths, there must be a vertex whose out-degree is 0, because the number of vertices is finite.

Our test will attempt to list the vertices in the following way. Choose a vertex whose out-degree is 0 as the first member of the list. Remove it and all edges into it, and repeat for the remaining graph. This process will eventually choose all of the vertices if, and only if, there is no circular path in the original graph (to prove this use induction on the number of vertices). Note that if you remove a vertex on the RHS of an edge coming from another vertex with no other edges, then that vertex can also be dropped at this stage because the next iteration will drop it anyway. But since we are storing graphs as edges, this will automatically happen (hence the edge-oriented approach).

For the above example, the vertices with out-degree = 0 are first chosen:

12, 13, 14, 15, 16

After removing these vertices and their edges, the vertices on the remaining graph with out-degree = 0 are chosen:

5, 8, 11

After removing these vertices and their edges, the remaining graph still contains the vertices 9, 10, 7, 6 but none have out-degree = 0. So the test fails, and the remaining graph must have a circular path which, in this case, traverses all of the remaining vertices. Note that if there were three more vertices connected in the following way 17 -> 18 -> 19 -> 17 then there would be two circular paths, but neither would traverse all of the remaining vertices. So this algorithm won't produce a circular path - it only tests whether one exists. But to find one, simply choose any vertex that remains and follow its edges until it returns to some vertex already traversed. This will always happen since there are only finitely many vertices, and none of them have out-degree = 0. Note that the vertex you choose may not be part of the circular path that you found this way. In fact, it won't be part of any circular path if its in-degree = 0.

The script for the circularity test is the following:

/*
Script:	Circularity test for email lists 
Platform: SQL Server 2005
Purpose: This script accepts a table of email 
	re-directions and determines if 
	circularity exists.
	Redirections are represented by a table called Edge
	whose columns LHS and RHS represent the left and 
	right-hand sides of a re-direction.
Output: List of re-directions containing all circular paths
Comment: This test will destroy Edge, so a working copy is used
*/

-- No counting
SET NOCOUNT OFF

-- Make working copy of Edge
IF  EXISTS (SELECT * FROM sys.objects WHERE object_id = OBJECT_ID(N'[dbo].[EdgeTemp]') AND type in (N'U'))
	DROP TABLE [dbo].[EdgeTemp]
SELECT LHS, RHS INTO EdgeTemp FROM Edge
CREATE INDEX NDX_EdgeTemp_RHS ON EdgeTemp(RHS)

-- Recursively delete edges whose RHS vertices have out-degree = 0.
-- Note that the RHS of such edges will disappear from Edge.
-- So will the LHS of such edges if they have out-degree = 1.
-- Note that @@ROWCOUNT = 0 at this point.
DELETE	
FROM EdgeTemp
WHERE RHS NOT IN (SELECT DISTINCT LHS FROM EdgeTemp)
-- If @@ROWCOUNT > 0 at this point, then continue
WHILE @@ROWCOUNT > 0
	DELETE
	FROM EdgeTemp
	WHERE RHS NOT IN (SELECT DISTINCT LHS FROM EdgeTemp)

-- Original graph will have no circular paths 
-- if and only if there are no remaining edges
-- in its working copy.
-- All edges will have out-degree > 0 at this point.
SELECT * FROM EdgeTemp

-- Drop working table
DROP TABLE EdgeTemp
This test outputs a list of edges containing all circular paths (which will be empty if there aren't any).

Showing that at most one path exists between any two vertices can be tested in the following way: Copy each edge to a temporary collection. Initially, this collection can be viewed as the LHS and RHS of all paths of length 1. Then match the RHS of each edge of the graph with the LHS of any member in the collection, adding the LHS of the edge and the RHS of the member to the collection. Now we have the LHS and RHS of all paths of length 1 or 2. Repeat this matching (using just the newly-introduced members of length 2) to get the LHS and RHS of all paths of length 1, 2, or 3. Continue this way until no more new edges are available for the next iteration of this recursion. After that, no duplicate records exist in the temporary collection if, and only if, there is at most one path between any two vertices of the original graph.

Now if you're still awake, notice that the second test assumes that there are no circular paths! Otherwise, you'll go into an infinite loop. So the first test must succeed before the second test is even performed.

So let's change the above example by truncating the edge 9 -> 10. Now there are no circular paths and we can perform the second test:

The first iteration produces the LHS and RHS of all paths of length 1 (i.e. all edges):

1 -> 12
2 -> 12
3 -> 14
4 -> 5
5 -> 13
6 -> 9
7 -> 16
8 -> 15
9 -> 8
10 -> 7
10 -> 11
10 -> 13
10 -> 16
11 -> 16

The next iteration produces the LHS and RHS all paths of length 2:

4 -> 13
6 -> 8
7 -> 9
9 -> 15
10 -> 6
10 -> 16

The next iteration produces the LHS and RHS all paths of length 3:

6 -> 15
7 -> 8
10 -> 9

The next iteration produces the LHS and RHS all paths of length 4:

7 -> 15
10 -> 8

The final iteration produces the LHS and RHS all paths of length 5:

10 -> 15

Note that duplicate LHS, RHS combinations produced in a same iteration must represent different paths (use simple induction on the iteration level to prove this). But paths represented by those LHS, RHS combinations in different iterations have different lengths (by construction) so they must be different paths. Therefore, duplicate LHS, RHS combinations always represent different paths between the same pair of vertices. So in our truncated example, there are two paths from 10 to 16 and no other redundancy exists.

We will use SQL Server 2005's elegant Common Table Expressions to seek duplicates while recursively listing the LHS, RHS combinations. This test will output all pairs of vertices with multiple paths between them, and count how many there are. Optionally, you may use the diagnostics option to list the output for each iteration.

The script for the redundancy test is the following:

/*
Script: Redundancy test for email lists 
Platform: SQL Server 2005
Purpose: This script accepts a table of email 
	re-directions and determines if 
	any recipient received multiple emails.
	Redirections are represented by a table called Edge
	whose columns LHS and RHS represent	the left and 
	right-hand sides of a re-direction.
Output: Number of redundant paths between any pair of emails having them			
Note: Assumes no circular paths with re-directions.
	Otherwise, script will go into an infinite loop.
	Use sister script to check this beforehand.
*/

-- No counting
SET NOCOUNT ON

-- Determine redundant paths between any pair of emails, and count them
;WITH Collect AS 
( 
-- Collect the LHS and RHS of all paths of length L = 1 (i.e. all edges).
-- Use i to record iteration for diagnostics.
SELECT i= 1, LHS, RHS
FROM Edge
UNION ALL	-- Must use ALL because duplicates will always represent different paths 
-- Collect the LHS and RHS all paths of length L 
-- by joining re-directions with previously collected paths
-- in the collection of length L - 1 
SELECT i = i + 1, Edge.LHS, Collect.RHS
FROM Edge INNER JOIN Collect
ON Edge.RHS = Collect.LHS
)
-- Output result.
-- Set MAXRECURSION to 0 to allow unlimited recursion.
SELECT LHS, RHS, COUNT(*) AS NumPath FROM Collect GROUP BY LHS,RHS HAVING COUNT(*) > 1 OPTION (MAXRECURSION 100)
-- Diagnostics:
-- SELECT i, LHS, RHS FROM Collect ORDER BY CAST(i AS INT), CAST(LHS AS INT), CAST(RHS AS INT) OPTION (MAXRECURSION 100)
Finally, here's a simple script to produce random graphs for testing:

/*
Script: Produce random directed graph 
Platform: SQL Server 2005
Purpose: This script builds a random directed graph called Edge
*/

-- No counting
SET NOCOUNT ON

-- Declarations
DECLARE @Vertex1	INT
DECLARE @Vertex2	INT

-- Create Edge
IF  EXISTS (SELECT * FROM sys.objects WHERE object_id = OBJECT_ID(N'Edge') AND type in (N'U'))
	DROP TABLE Edge
CREATE TABLE Edge(
	LHS varchar(255),
	RHS varchar(255)
) ON [PRIMARY]

-- Initialise the first vertex
SET @Vertex1 = 0

-- Enumerate all pairs of vertices from 1 to 10, 
-- where the second vertex is less than the first,
-- and create a directed edge with 20% probability
-- on each pair chosen (50% either way for those chosen).
-- Note that higher probabilities will create more edges.
WHILE @Vertex1 < 10	-- Number of vertices to build  
	BEGIN
	SET @Vertex1 = @Vertex1 + 1
	SET @Vertex2 = 1
	WHILE @Vertex2 < @Vertex1
		BEGIN
		-- 20% probability of creating directed edge (i.e. 1 out of 5)
		IF CONVERT(INT, 5*RAND()) = 1	
			-- 50% probability direction goes this way (i.e. 1 out of 2)
			IF CONVERT(INT, 2*RAND()) = 1	
				INSERT INTO Edge(LHS,RHS) VALUES(@Vertex1, @Vertex2)
			ELSE
				-- 50% probability direction goes that way 
				INSERT INTO Edge(LHS,RHS) VALUES(@Vertex2, @Vertex1)					
		SET @Vertex2 = @Vertex2 + 1
		END
	END

-- Index Edge
CREATE INDEX IDX_Edge_RHS ON Edge(RHS)

-- Display graph
SELECT * FROM Edge
After producing a random graph, run the first test for circularity. If it's successful (and only if it's successful), run the second test for redundancy. Otherwise, produce another random graph and start over.

Randomly generated graphs readily produce circular paths, unlike those built manually from real broadcasts, where most emails are directed to external sites (where we know nothing about their re-directions). You can adjust the probabilities to influence how many circular paths should be created. Of course, performance will be an issue with the second test on large non-sparse graphs since the expected number of paths goes up very quickly with the number of edges. Fortunately, real broadcasts are relatively sparse and well-behaved. In practice, however, even one redundancy is one too many since its affect on customers can be quite annoying.

For non-circular graphs, the second test can be altered slightly to detect addresses that won't receive any emails, or if no path exists between any pair of selected vertices.

The above scripts are quite general, and may be applied to any situation where hierarchies exist in the form of directed graphs.

For example, some organizations have emergency fan-out lists so everyone can be quickly located. If a directed graph defines who may communicate with whom, then fan-out lists are nothing more than so-called spanning trees containing each vertex exactly once. The above techniques are useful here, especially since the graph will change whenever personnel go on vacation, or new ones are hired.

An enhanced version of these scripts is used by a mid-size marketing company to verify the integrity of their broadcasts, which change monthly. Users submit re-directions to their email coordinator, who verifies the combined list before sending it to their email administrator. And that saves phone calls and manual debugging.