Increasingly architects and DBAs are spending more time working with aggregates. As computational power increases, so does the clients’ desire to analyze their data in a multitude of ways. Even though we have to ability to track every event that occurs, challenges still exist in presenting that data to users in an efficient method.
One common example of this occurs on systems where the tracking is done on the user-level, as would be the case with a large online store or a massive multiplayer game. Such a system could be creating 100M records per day or more. But the users of the warehouse don’t particularly care what one individual user is doing. They are more interested in what thousands or millions of the users are doing. So aggregate-table using the CUBE and ROLLUP aggregates can make life much easier.
Say there is a tracking table generating 100M records per day:
CREATE TABLE event (
event_date DATE,
event_type_id NUMBER,
event_location_id NUMBER,
user_id NUMBER
);
Maybe with a powerful server, lots of CPUs and partitioning, one can get away with letting analysts query the table EVENT directly. But in all likelihood, an aggregate table is going to be needed that could boost performance of the analysts’ SQL by two orders of magnitude or more. First, create an aggregate table for those queries:
CREATE TABLE event_aggregate (
event_date DATE,
event_type_id NUMBER,
event_location_id NUMBER,
total_users NUMBER
);
Then, simple SQL statements can be used to populate the table:
INSERT INTO event_aggregate
(event_date, event_type_id, event_location_id, total_users)
SELECT event_date, event_type_id, event_location_id, COUNT(user_id)
FROM event
GROUP BY event_date, event_type_id, event_location_id;
In this scenario, it is very easy for the analysts to run queries to find out how their system is being used based on EVENT_DATE, EVENT_TYPE_ID or EVENT_LOCATION_ID. They can even aggregate those variables in any manner they choose. However, suppose the analysts care about total users and distinct users, leading to an aggregate table like:
CREATE TABLE event_aggregate_distincts (
event_date DATE,
event_type_id NUMBER,
event_location_id NUMBER,
distinct_users NUMBER,
total_users NUMBER
);
One could still use a simple GROUP BY in this case, but then the analysts could only look at the individual rows in the EVENT_AGGREGATE_DISTINCTS table. If they aggregated again, the DISTINCT_USERS count could be incorrect. Assume the following records exist in the EVENT table:
INSERT INTO event
(event_date, event_type_id, event_location_id, user_id)
VALUES
(’13-JUL-2010’, 1, 10, 300);
INSERT INTO event
(event_date, event_type_id, event_location_id, user_id)
VALUES
(’13-JUL-2010’, 1, 10, 300);
INSERT INTO event
(event_date, event_type_id, event_location_id, user_id)
VALUES
(’13-JUL-2010’, 1, 10, 400);
INSERT INTO event
(event_date, event_type_id, event_location_id, user_id)
VALUES
(’13-JUL-2010’, 2, 10, 300);
INSERT INTO event
(event_date, event_type_id, event_location_id, user_id)
VALUES
(’13-JUL-2010’, 2, 10, 400);
INSERT INTO event
(event_date, event_type_id, event_location_id, user_id)
VALUES
(’13-JUL-2010’, 2, 50, 300);
Now if the EVENT_AGGREGATE_DISTINCT table is populated with the following SQL:
INSERT INTO event_aggregate_distincts
(event_date, event_type_id, event_location_id,
distinct_users, total_users)
SELECT event_date, event_type_id, event_location_id,
COUNT(DISTINCT user_id), COUNT(user_id)
FROM event
GROUP BY event_date, event_type_id, event_location_id;
Results in the following table data:
SELECT * FROM event_aggregate_distincts;
EVENT_DAT EVENT_TYPE_ID EVENT_LOCATION_ID DISTINCT_USERS TOTAL_USERS
--------- ------------- ----------------- -------------- -----------
13-JUL-10 1 10 2 3
13-JUL-10 2 10 2 2
13-JUL-10 2 50 1 1
But if an analyst want to find out how many distinct users there were for EVENT_LOCATION_ID = 10 or for EVENT_TYPE_ID = 2, there would be no way to get this information from the aggregate tables. In this scenario, one can use the CUBE or ROLLUP functions to preserve this information. The ROLLUP function is a cascading aggregate that determines all of the possibilities of the right-most column, then the second right-most column, etc. all the way through all of the rolled-up columns. The CUBE function does the same thing, but it finds all of the possibilities for all of the cubed columns. Two examples:
INSERT INTO event_aggregate_distincts
(event_date, event_type_id, event_location_id,
distinct_users, total_users)
SELECT event_date, event_type_id, event_location_id,
COUNT(DISTINCT user_id), COUNT(user_id)
FROM event
GROUP BY event_date, ROLLUP(event_type_id, event_location_id)SELECT * FROM event_aggregate_distincts;
EVENT_DAT EVENT_TYPE_ID EVENT_LOCATION_ID DISTINCT_USERS TOTAL_USERS
13-JUL-10 1 10 2 3
13-JUL-10 2 10 2 2
13-JUL-10 2 50 1 1
13-JUL-10 1 2 3
13-JUL-10 2 2 3
13-JUL-10 2 6
With the ROLLUP function, the aggregate was first done for all combinations of EVENT_TYPE_ID and EVENT_LOCATION_ID. Then EVENT_LOCATION_ID was disregarded, and the aggregate was done for all combinations of EVENT_TYPE_ID. Then both were disregarded, and the aggregate was done overall.
INSERT INTO event_aggregate_distincts
(event_date, event_type_id, event_location_id,
distinct_users, total_users)
SELECT event_date, event_type_id, event_location_id,
COUNT(DISTINCT user_id), COUNT(user_id)
FROM event
GROUP BY event_date, CUBE(event_type_id, event_location_id);
SELECT * FROM event_aggregate_distincts;
EVENT_DAT EVENT_TYPE_ID EVENT_LOCATION_ID DISTINCT_USERS TOTAL_USERS
--------- ------------- ----------------- -------------- -----------
13-JUL-10 1 10 2 3
13-JUL-10 2 10 2 2
13-JUL-10 2 50 1 1
13-JUL-10 1 2 3
13-JUL-10 2 2 3
13-JUL-10 10 2 5
13-JUL-10 50 1 1
13-JUL-10 2 6
With the CUBE function, the results are similar to the ROLLUP. However, the aggregate was also done for all combinations of EVENT_LOCATION_ID while disregarding the EVENT_TYPE_ID. So the difference between the two is that ROLLUP does all permutations of the columns from right to left while CUBE does all the possible column permutations. Note that by leaving EVENT_DATE out of the CUBE and ROLLUP functions, the aggregate was just done normally on that column.