Monday, March 14, 2005

 

Path Enumeration using Prime Number Products

Managing Hierarchies in Relational Database Environment is always a challenging task. Over the years many models like Adjacency List Model, Nested Set Model, Nested Interval Model, Path Enumeration Model etc. were proposed and implemented. There are also other popular methods using recursion and user defined function. SQL Server 2005 T-SQL Extensions has special operators for managing recursive queries. Joe CELKO has written a book exclusively on Managing Trees and Hierarchies.

The basic Hierarchical operations in a typical Employee-Manager type of relation, include

1. Add a new Employee
2. Show the whole Hierarchy with levels.
3. Show the superiors of a given employee.
4. Show the subordinates of a given employee.
5. Show the immediate subordinates of a given employee.
6. List all the Leaf Level Employees.
7. Change the manager of an Employee
8. Delete an Employee and all his subordiantes.
9. Delete an Employee.

Here I am trying to propose a simple model which works almost like Path Enumeration model. This model uses a Prime Number table and basic mathematical operations to answer the above questions. Also the Model only uses set based solutions and can be implemented as a self managing system with the help of triggers.

This model works on two basic concepts.

1. There is a unique path to every node in a hierarchy from the root node.
2. The Prime Number that can be a divisor of the Product of a set of Prime Numbers, are only the Prime numbers in the set.

That is, Let product of n prime numbers be Px, then, the factors of Px can only be the Prime Numbers that participated in the multiplication, or the sub products.
So the product cannot be divided by any other Prime Number.

For eg.

P1, P2, P3,...Pn is the list of Prime Numbers.

Let P be the Product.

Then P1 * P2 * P3 * ... * Pn = P

Then the factors of P can be either P1, P2, P3,...Pn Or the sub product of P1, P2, P3,...Pn.


Now let's try to answer the above questions.

The following is the table used in the examples.

CREATE TABLE dbo.EmpHierarchy
(EmpId int PRIMARY KEY,
EmpName VARCHAR(15),
ManagerId int,
HKey bigint UNIQUE,
Path bigint)

GO

The purpose of the first three columns are obvious. The column Hkey is a Unique Prime Number assigned to each employee. The column Path Holds the product of the Hkey values of the current node and nodes above the current node.

The script used for creating and Populating the PrimeNumber table is here.

1. Add a new Employee
For inserting a new node (employee), we will get the least available Prime Number and use it as the Hkey of the node. The Path will be the Product of the Hkey and the Path of the parent node. For the root level entry the path will be the Hkey itself.

Here is the script for the stored procedure dbo.InsertNode

Now lets insert some sample data.

EXEC dbo.InsertNode 1,'Albert', NULL
EXEC dbo.InsertNode 2,'Bert', 1
EXEC dbo.InsertNode 3,'Chuck', 1
EXEC dbo.InsertNode 4,'Dona', 3
EXEC dbo.InsertNode 5,'Eddie', 3
EXEC dbo.InsertNode 6,'Fred', 3

SELECT * FROM EmpHierarchy

Now we have this data in the table.

EmpIdEmpNameManagerIdHKeyPath
2Bert136
3Chuck1510
4Dona3770
5Eddie311110
6Fred313130

2. Show the whole Hierarchy with levels.
The level of a node in a hierarchy depends on the number of levels above it. So the level of a given node can be found by counting the levels above it. Thats the technique used in the stored procedure dbo.PrintHierarchy. The SP also accepts an EmployeeId, to display a partial hierarchy.

EXEC dbo.PrintHierarchy

Here is the output
HierarchyLevelEmployeeManager
-1AlbertNULL
--2BertAlbert
--2ChuckAlbert
---3DonaChuck
---3EddieChuck
---3FredChuck

3. Show the superiors of a given employee.
The path of a given employee is the product of the Hkeys of his superiors and his Hkey. As per our rule the path can be divided only by the Prime Numbers participated in it. Based on that finding the superiors logic is failrly easy. You can get the list of the superios of a given employee by

SELECT A.EmpId, A.EmpName
FROM EmpHierarchy A
WHERE [Path of the given Employee] % A.Hkey = 0


I have this Stored Procedure dbo.getSuperiors which accepts an Employee Id and return the list of his superiors.

To get the Superiors of 'Fred', execute

EXEC dbo. getSuperiors 6

Here is the output
EmpIdEmployeeManager
6FredChuck
3ChuckAlbert
1AlbertNULL

4. Show the subordinates of a given employee.
The Hkey of a given employee will be a factor of the Path of all his subordinates and only his subordinates.
So to get the list of the subordinates of an employee, you can run a query like

SELECT A.EmpId, A.EmpName AS Employee
FROM EmpHierarchy A
WHERE A.Path % [HKey of the given Employee] = 0

Here is the stored procedure dbo. getSubordinates for doing the same.

Now to get the list of subordiantes of 'Chuck', execute

EXEC dbo. getSubordinates 3

Here is the output
EmpIdEmployeeManager
3ChuckAlbert
4DonaChuck
5EddieChuck
6FredChuck

5. Show the immediate subordinates of a given employee.
The path of the immediate subordiantes of a given employee will be the product of the path of the manager and the HKey of the subordinate.
So one can find the immediate subordinates of a given employee by writing a query like

SELECT EmpId, EmpName
FROM EmpHierarchy
WHERE Path = HKey * [Path of the Given Employee]

For eg, to get All the immediate subordinates of Albert, the query will be

SELECT EmpId, EmpName
FROM EmpHierarchy
WHERE Path = HKey *2 --Path of Albert

Here is the output

EmpIdEmpName
2Bert
3Chuck

6. List all the Leaf Level Employees.
Leaf level employees are those who dont have any subordinates. So finding them is a trivial task.
The following query gives you the list of leaf level employees.

SELECT EmpId, EmpName
FROM EmpHierarchy A
WHERE NOT Exists(SELECT 1 FROM EmpHierarchy T
WHERE T.Path % A.Hkey = 0
AND T.Path > A.Path)

And here is the output.
EmpIdEmpName
2Bert
4Dona
5Eddie
6Fred

7. Change the manager of an Employee
Updating can be of two types, promotion and demotion. When updating an Employees manager, you need to update the rows of the given employee and all his subordinates Path. Also you need to perform some validation to avoid cyclic reference. Here is the stored Procedure dbo.UpdateNode to update a given node.

So now to change Fred's Manager to Bert from Chuck, you can Execute

EXEC dbo.UpdateNode EmployeeId, NewManagerId

EXEC dbo.UpdateNode 6, 2

After the update, the hierarchy will look like

EXEC dbo.PrintHierarchy
HierarchyLevelEmployeeManager
-1AlbertNULL
--2BertAlbert
--2ChuckAlbert
---3DonaChuck
---3EddieChuck
---3FredBert

Now, lets try a promotion. Eddie directly reports to Albert instead of Chuck

EXEC dbo.UpdateNode 5, 1

This is how the hierarchy will look like now.

EXEC dbo.PrintHierarchy
HierarchyLevelEmployeeManager
-1AlbertNULL
--2BertAlbert
--2ChuckAlbert
--2EddieAlbert
---3FredBert
---3DonaChuck

Finally lets demote chuck one level down the hierarchy. Instead of reporting to Albert, he will now report to Bert

EXEC dbo.UpdateNode 3, 2

EXEC dbo.PrintHierarchy
HierarchyLevelEmployeeManager
-1AlbertNULL
--2BertAlbert
--2EddieAlbert
---3FredBert
---3ChuckBert
----4DonaChuck

8. Delete an Employee and all his subordiantes.
Deleting a subtree is fairly easy. we can use the same logic of finding subordinates and apply that in the where clause.
Here is the stored procedure dbo.deleteSubTree

Now to fire the whole team of Chuck, you can execute

EXEC dbo.deleteSubTree 3 --EmployeeId of chuck

After the deletion, the Hierarchy will look like

EXEC dbo.PrintHierarchy
HierarchyLevelEmployeeManager
-1AlbertNULL
--2BertAlbert
--2EddieAlbert
---3FredBert

9. Delete an Employee.
This can be more tricky, because when you delete an Employee, If he has subordinates, the hierarchy will be broken. So all the subordinates of the deleted employee should report to the Manager of the deleted employee. Here is the Stored Procedure dbo.deleteNode, which takes care of it.

So Now lets try deleting Bert.

EXEC dbo.deleteNode 2

After the deletion, the hierarchy will look like
HierarchyLevelEmployeeManager
-1AlbertNULL
--2EddieAlbert
--2FredAlbert

And thats it :)

These are the pro's and con's of the model from my point of view.

Pros
1. Use only standard SQL constructs.
2. Use only set based solutions.
3. Can be implemented as self manageable using triggers.
4. Gives better performance because it uses Numeric operations instead of string operations.

Cons
1. For larger Hierarchies, you may encounter out of range errors because the Path value grows in an exponential rate.
Still you can replace the bigint with a larger datatype like numeric and replace the % operator with division operator and overcome the issue.

AS I mentioned earlier, this model is still in its evolving face. So as always, your feedback is highly appreciated.

Comments:
I think it will work, and it's an interesting approach. I don't
think it's going to be quite as practical in terms of efficiency,
though, since you're just storing the set of ancestors, and not the
actual path. You have to look up the level of each factor/ancestor to
reconstruct the path.

The positive aspect is that it relies on simple mathematics, even if
it is not typically implemented for such large integers in an RDBMS.

I wonder if you couldn't get the same benefits but do it more
practically by storing a bitmap instead of an integer.
Mathematically, I mean instead of storing the product of Prime(i) over
all ancestors, store the sum of Power(2,i) over all ancestors. Just
as you can recover the primes in a product, you can recover the
i-values in a sum of distinct Power(2,i) values. The only thing you'd
lose is the ability to put a given number into the set more than once
(products of primes can represent any multiset of i-values, where sums
of Power(2,i)s can only represent sets). There might be some
generalization of hierarchy for which storing the multiset might help,
but I can't think of one at the moment.

So basically, I think it's a nice solution that will get people
thinking about this interesting problem, but it's not likely one that
will catch on in practice. I might have missed something, and if I
think of anything else, I'll let you know.

Steve Kass (by mail)
 
Looks like an interesting idea that I haven't seen before. For
smaller hierarchies it seems feasible. Moving subtrees is easier than
under nested sets because you will typically need to touch fewer
nodes.

Scalability is obviously quite limited by the size of the numbers and
by the fact that the arithmetic operations won't optimize well in
queries.

I haven't tried out your code samples yet but I'll take a look at
them later.

David Portas (by mail)
 
You have a very interesting technique!

I'm concerned about two things:

A) Sargability of the WHERE predicates using the % operator -- have you tested this on any large trees? Won't the % operator force a table scan?

B) Extensibility -- what is the largest prime number that can be represented in 32 or 64 bits (INT or BIGINT respectively)? How many prime numbers can you represent, and doesn't that limit the tree size? I'm not sure what the upper bound is here. Have you thought about it?


Thanks again for sending this -- good food for thought!

Adam Machanic (by mail)
 
I see two main limitations in your model (contrary to what you say in your pros/cons section).



* Product of primes: you say that large trees can be supported if you use a large numeric datatype.

The largest numeric that can be represented in SQL Server is numeric(38,0).

Unless the tree is small in general or with a low degree (small number of levels), you will get overflows pretty soon.



* Performance: you say that performance is good because you use arithmetic operations and indexes can be used.

However, since many of the queries are based on manipulation of the filtered column and not the base column itself (e.g., path % @hkey = [some_value]), the filter is not SARGable, and you get table scans.



Nevertheless, the model is very interesting intellectually.

Itzik Ben-Gan (by mail)
 
I have just seen your blog on Path Enumeration and it is pretty cool. I am personally a simplicity guy, using simply recursive calls to find out this kind of thing, but I do see the power in this. It also proves that I am one of the lesser imaginative persons out there, because I would have never dreamed of this, and frankly will probably only every barely understand it :)

If you get a blog site that supports RSS I would certainly subscribe and read your stuff.

Louis Davidson (by mail)
 
My first comment can only be that I really like this approach. I can see Joe Celko starting to curse because he has to start a new edition of Trees and Hierarchies, to include this technique :).

That being said, I also want to add some "cons" to the list of pros and cons.

2. Since almost all searches are based on the modulo operator, there is no way that searches can be sped up using an index. This will limit scalability. (I'm not sure why you write, as fifth pro, that this model makes better use of indexes).

3. Changes "high" in the hierarchy will result in many rows being changed. If these changes occur often, concurrency may suffer. (It's not nearly as bad as the ensted sets model, though!)

4. For larger numbers, it's not very intuitive. For a human operator, it's not easy to see that the number 318495391 is equivalent to 17 * 53 * 151 * 2341.


Another thought: I found that this model is suited for storing multiple hierarchies in one table (e.g. you can add George (boss), Herbert (managed by George) and Isaac (managed by Herbert) and it works okay). What I didn't think through, nor test, is what happens if I set the path for Isaac to Herbert's path * Bert's path (so that Isaac become an employee in two otherwisse independant hierarchies). I'd advise you to run some tests on this scenario - maybe you'll find that this breaks things down, but you might just as well find that this opens endless new possibilities!!


Now it's time to pick some nits (note - I didn't test all your scripts; only a selection. I did read them all, though).

* You haven't tested your code on a database with a case-sensitive collation. I use binary collation and I had to change the case of column and variable names several times.
* In the output after the initial load of the data, the row 1 Albert NULL 2 2 is missing.
* The output of PrintHierarchy is sorted by level only. In the first output, it doesn't matter. But after changing Fred's manager to Bert from Chuck, the ouptu you get is
Hierarchy Level Employee Manager
- 1 Albert NULL
-- 2 Bert Albert
-- 2 Chuck Albert
--- 3 Dona Chuck
--- 3 Eddie Chuck
--- 3 Fred Bert

whereas the output I'd like to get is
Hierarchy Level Employee Manager
- 1 Albert NULL
-- 2 Bert Albert
--- 3 Fred Bert
-- 2 Chuck Albert
--- 3 Dona Chuck
--- 3 Eddie Chuck

I'm not sure if it's easy to adapt dbo.PrintHierarchy to produce this output.
* There's an extraneous space in EXEC dbo. getSubordinates 3
* In the CREATE TABLE script, you've forgotten the NOT NULL requirements.
* You've got a syntax error in the script to create prime numbers: "WHILE @cnt <>" should read "WHILE @cnt <> 1000". Besides, this script is very slow. I'll attach a script that uses Eratosthenes' sieve to find prime numbers much quicker (test results: first 2762 primes [up to 25,000] took 11,346 ms with your script; 750 ms with mine - I didn't wait for the test for 10000 primes to finish)
* Your declarations are not always the same - you use int, bigint and numeric(32,0). Stick to just one type (bigint) to prevent unneeded conversions.
* In the InsertNode procedure, you use INSERT without column list. A definite no-no in production code.
* The UpdateNode procedure won't allow you to appoint a new CEO (i.e. dbo.UpdateNode 4, NULL will fail) I probably should have tested your anti-cyclic-reference check, but I lack the time for that (I guess I'll just trust your and Steve's opinion on this :-)
* I didn't really test it, but I suspect that the DeleteNode procedure might fail when you delete the CEO. Also, one of the comments has a typo: subordiantes instead of subordinates.

That's it. I hope it was useful.

Hugo Kornelis (by mail)
 
wish I had seen this before finishing TREES & HIERARCHIES IN SQL.
The only problem I can see is the size of the primes as the trees get
larger, but we are living in a 64-bit world now. Since the math is
simple integer division and multiplication, the speed is probably
pretty good.

Random thought: if we give each node a prime in a general graph, then a cycle.
 
Could the access path be indexed? Imagine that we don't have this silly
64 bit limit and you are presented with a node encoded with a number

31074182404900437213507500358885679300373460228427
27545720161948823206440518081504556346829671723286
78243791627283803341547107310850191954852900733772
4822783525742386454014691736602477652346609

(RSA-640) and are asked what are the ancestors. Can you answer that in
a reasonable amount of time.

I'm cheating of course, if the primes are small you can factor the
number reasonably fast. Still, what about the access path? For
comparison, in case of nested sets we are talking about index range
scan (at least one way).
 
The novelty of Roji Thomas' approach is that his
method (developed into much greater detail, BTW) uses non-volatile encoding.
 
Roji Thomas' schema looks like materialized path in disguise. Let

ithprime(i)

be a function such that it returns the ith prime number, where the
first prime number is 2. (BTW, there is such a function in Maple). Then
let factor any number N

N = 2^n1 * 3^n2 + ... + ithprime(i)^ni

A sequence of integers (n1, n2, n3,...) in Roji Thomas' encoding case
is a boolean vector. The "divides" relation "|" for numbers is "greater
than" relation "<" defined component-wise. For example,

2 | 2*5*13 <==> (1,0,0,...) < (1,0,1,0,0,1,...)

Like materialized path, one can index strings of 0s and 1s effectively.
 
An intelligent mathematical approach
 
Post a Comment

<< Home

This page is powered by Blogger. Isn't yours?