Storing and Querying Hierarchical Data in MySQL — Uplines & Downlines
UPDATE 27/08/2021: Updated the title to specifically mention MySQL. Other DBMS such as PostgreSQL might have built-in structures to deal with hierarchical data. Also added another alternative to solve this problem which is compatible with older versions of MySQL.
A lot of you might be familiar with the term Multi-level Marketing (MLM) or pyramid schemes. It is quite a controversial topic but instead of talking about it from the marketing or business (or ethics) perspective, I am going to gloss over the technical details of building such system — in particular the implementation of uplines and downlines in relational database systems.
For the uninitiated, MLM is a marketing system such that each user can make sales and also recruit other users who can also make sales. The user who recruits another user(s) is called the parent and the recruitee is called the child. Each child can also recruit other users and thus be a parent themselves. This creates a tree-like structure.
Another characteristic with this downline tree is that for each sale that is made by a user, his/her parents will also get a commission from that sale. This incentivizes each user to recruit as many downline as possible.
The Problem
In this article, I am just going to focus on answering these questions:
- How to find the total number of children a user has?
- How to get a list of all the children a user has?
I will mainly discuss about the ways in which we can solve those problems with the schema we have in the next section, but I am also going to talk briefly about the possible solutions if we can change the schema, or use another schema altogether.
The Schema
For this article, let’s say we have a table for storing users’ data as such
| users |
|---------------------------------------------|
| userID | parentID | rootID | name | phoneNo |
| 1 | - | - | John | +800100 |
| 2 | 1 | 1 | Bob | +800101 |
| 3 | 2 | 1 | Jack | +800102 |
userID, name, and phoneNo are pretty self-explanatory. parentID is the userID of the direct parent of each user. rootID is the greatest-grand-parent of each user. In the case above, user with userID 1 can be called a “root user” since he has no parents and will be the topmost node in a downline tree structure.
Now before we move on to the next section, can you think of some ways we can answer the two main questions presented in the previous section using your RDBMS of choice?
The Solution
Just like any other problem in software development, there are multiple ways to solve this each with their own solutions.
I’m going to explain the solution using MySQL examples (both versions 5.x and 8.x+). The reason I think of this problem is challenging is because I originally had to come up with a solution using MySQL version 5, where it’s harder to solve this problem compared to using MySQL version 8+ or PostgreSQL because of the relative lack of features. I might update this post with a PostgreSQL solution later, but generally each solution can be done in PostgreSQL as well (with slightly different syntaxes).
Common Table Expressions
Note: CTEs are only available on MySQL version 8.0 and newer
Common Table Expressions (CTE) are temporary result sets which we can use to reference in other SQL queries. The great part is that we can create recursive queries with CTEs that reference the result set in the previous recursion. The CTE solution looks like this:
WITH RECURSIVE cte (userID, name, parentID) as (
SELECT userID,
name,
parentID
FROM users
WHERE parentID = X
UNION ALL
SELECT u.userID,
u.name,
u.parentID
FROM users u
INNER JOIN cte
ON u.parentID = cte.userID
)
SELECT * FROM cte;
What’s happening is that we are SELECTing all users with parentID = X (let’s say this is condition 1) and also SELECTing all users with parentID who matches condition 1 (let’s say this is condition 2). We then do a union for both results. Since this CTE is recursive, this query will also find the children of the users who match condition 2 , and also their grandchildren, and so on.
Stored Procedures
Stored procedures are basically functions where we can use procedural statements (such as loops, in PostgreSQL you can even use your favorite programming language) on top of SQL statements. In MySQL there are two variations: CREATE FUNCTION (if you want to have a return value) and CREATE PROCEDURE (if you don’t want to return anything).
With stored procedures, I am going to create a function which returns the list of userIDs we need to display. They are the userIDs of the user whose children we want to find and also the userIDs of his/her children. At the end we will do a SELECT query to find all users whose userID is included in the list of userIDs returned from the function.
DELIMITER $$
CREATE FUNCTION get_downlines(rootID INT, level INT) RETURNS varchar(5000) CHARSET utf8 DETERMINISTIC
BEGIN
DECLARE result VARCHAR(5000);
DECLARE temp VARCHAR(5000);
DECLARE i INTEGER;
SET result = '';
SET temp = CAST(rootID as CHAR);
SET i = 0;
WHILE temp is not null AND i <= level DO
SET result = CONCAT(result,',',temp);
SELECT GROUP_CONCAT(userID) INTO temp FROM users
WHERE FIND_IN_SET(parentID, temp)>0;
SET i = i+1;
END WHILE;
SET result = CONCAT(result,',');
RETURN result;
END;
$$
DELIMITER ;SELECT * FROM users WHERE (SELECT get_downlines(X, MAX_LEVEL)) like CONCAT('%,',CAST(userID as CHAR),',%') AND userID != X;
The logic can be explained as follows:
Let’s say we want to find all children and grandchildren of a user with userID X:
- We want to store the userIDs we found in a string called
result
, where each userID has a leading and trailing “,” character. We also initialize a temporary variabletemp
. This temporary variable is to store the children userIDs in the current iteration.
While the temp
variable is not null (we reach a point where all the users has no children), then do:
- Add the contents of the
temp
variable into ourresult
variable. Since we are using a string we do this by concatenating them. - Find all users whose parentID is included in our
temp
variable. So we’re finding all children of the users in ourtemp
variable. Then replace the contents of ourtemp
variable with the children’s userIDs. - After the loop ends, add a trailing “,” character to our
result
string.
Finally we select all users whose userID is in our result
string, excluding user X itself. If we want to get the total count of children, then we can simply use a SELECT COUNT(*) query instead of SELECT *.
We can also limit the maximum depth of this recursion, as shown in the example above using the level
variable.
The downside of this approach is that we have to use stored procedures and they are arguably hard to debug and maintain (I’m not gonna discuss about that here, maybe I’ll create a separate post on that).
Repeated Self Joins
This is pretty self-explanatory. Basically just keep joining the table n-times where n is the hierarchy depth you want to reach. Below is an example where we search for users that are descendants of userID X, with a search depth limit of 4.
select p4.parentID as parent4ID,
p3.parentID as parent3ID,
p2.parentID as parent2ID,
p1.parentID as parentID,
p1.userID as id,
p1.root_level,
from users p1
left join users p2 on p2.userID = p1.parentID
left join users p3 on p3.userID = p2.parentID
left join users p4 on p4.userID = p3.parentID
where X in (p1.parentID,
p2.parentID,
p3.parentID,
p4.parentID)
order by 4, 3, 2, 1;
You can make the max search depth adjustable in your application code. For example, in Golang you could do something like:
package mainimport (
"fmt"
)func main() {// Set the max depth you want here
maxLevel := 4// Set the userID you want to search here
searchUserID := 1tableString := "p1.parentID"
selectString := "SELECT p1.parentID as parentID, "
orderQuery := " ORDER BY " + fmt.Sprint(maxLevel)
joinQuery := ""for i := 1; i < maxLevel; i++ {
recTable := "p" + fmt.Sprint(i+1)
dirTable := "p" + fmt.Sprint(i)
orderQuery += ", " + fmt.Sprint(maxLevel-i)
selectString = selectString + recTable + ".parentID as parent" + fmt.Sprint(i+1) + ", "
tableString = tableString + ", " + recTable + ".parentID"
joinQuery += " LEFT JOIN users " + recTable + " ON " + recTable + ".userID = " + dirTable + ".parentID "
}
selectQuery := fmt.Sprintf(selectString + "p1.userID")
whereQuery := fmt.Sprintf(" WHERE %d in (%s)", searchUserID, tableString)fmt.Println(selectQuery + joinQuery + whereQuery + orderQuery)
}
And to get the total number of downlines, simply use a COUNT(*) for the query above.
This method has the added benefit of filtering your search results for a specific level only. For example, you might want to find the level 2 children only, which you can do by adding a WHERE p2.parentID = X
. In addition this is also compatible with all versions of MySQL so this is my preferred solution for older versions of MySQL.
Using a Different Table Structure
When I faced this problem, the table structure was already defined like in The Schema section and I had to make do with what I got.
But what if we are yet to decide on a schema or table structure? Or what if we can alter or add new columns?
One solution would be to add a hierarchyList column, where we store the hierarchical structure of each user and his/her parent(s) in a text format, such as:
| userID | parentID | rootID | name | phoneNo | hierarchyList
| 1 | - | - | John | +800100 |
| 2 | 1 | 1 | Bob | +800101 | -1-
| 3 | 2 | 1 | Jack | +800102 | -1-2-
In this example, hierarchyList columns stores the hierarchical structure of each user to their parent until their root parent. To obtain all the children of a user with userID = X, we simply have to find all users which contains the string X in their hierarchyList. I added leading and trailing “-” characters so we don’t find userID “11” or “121” while searching for userID “1” using LIKE statement.
SELECT * FROM users WHERE hierarchyList LIKE '%-X-%';
This is a fulltext search and to improve search times we can add a fulltext index on the column hierarchyList.
The downside of this approach is that every time we update the downline tree structure (such as deleting a user), we also have to update the hierarchyList of all the children of the user we’re updating.
Apparently, there are lots of other ways of storing hierarchical data in SQL. They’re discussed pretty well in this Stackoverflow question I found so if you are just about to design a database for similar purposes, you can check them out and decide which structure is best suited for your needs.
Closing Words
I think this is an interesting problem with quite a lot of interesting solutions that made me look deeper into the features of each RDBMS and also the creative ways people have devised to store a data structure (a tree) which an SQL database is not suited for to store.
Besides considering the pros and cons of each solution, of course we also have to consider the capabilities of the RDBMS we are using. I had to solve this problem in MySQL 5.x and thus CTEs are not an options for me. Stored procedures was also crossed off the list since most of the other engineers agreed that maintaining and debugging a stored procedure is more painful than it’s worth. In the end I used a query with repeated self joins to solve it as that was the solution with the least drawbacks and effort in my context.
References:
https://developpaper.com/implementation-of-sql-recursive-query-in-different-databases/