How to create a PostgreSQL remove duplicates trigger function
So this is a bit specific, however it is related to some of the work I’m doing on databases and I figured other people could learn from my hours of googling. To start off, let me define the problem so you can get some context.
The problem is that we have internal/external databases or data sources that we are attempting to replicate on our database. The challenge is if you insert a table into your database four times, you end up with a table that has all the data four times. So there are a few ways to approach this problem:
- Replace the entire table every time you update the data. This could work in theory, however it does not play nicely with bigger data sets. Also if your data source ever changes, you loose all the data that it has changed.
- Place all historic data in the database then run a query to insert the last x days and update the data every x days. This works better as it does not loose old data that has changed, however it does not capture new changes to the data after it first being queried.
- Place all historic data in the database then insert the last 30 days and update daily but have a trigger function to remove all duplicates on insert. This allows recent data to be updated while still maintaining old data in the case of a data source change. The problem with this is that it has to query the entire table on every insert command. The bright side is that it works.
- The ideal case would be to use webhooks to have the data source tell the data store whenever an individual piece of data has been added, changed, or deleted. The problem is I haven’t figured out how to work with these yet.
So I’ll be explaining option 3 for now. To start off, how do you even remove duplicates? This query took me a bit to find/figure out, but at it’s core, it is pretty simple.
DELETE FROM job
WHERE key IN
(SELECT key
FROM
(SELECT key,
ROW_NUMBER() OVER( PARTITION BY transactiondate,
departmentnumber,jobcode,activitynumber
ORDER BY key ) AS row_num
FROM job ) t
WHERE t.row_num > 1 );
So this query is running on my table called job where key is the unique ID. The transactiondate, departmentnumber, jobcode, and activity number are columns in the table job.
How this piece of code works is it is saying Delete anything from job where two keys don’t match, but the data from columns matches. It does this by partitioning the table into groups by the column data where it sorts by the key and if all the data matches and the are more than one results with different keys, it removes the higher number key. Pretty neat script if you ask me. Also if you change the DELETE FROM job
to a SELECT * FROM job
then you can see all the duplicates.
Now for the fun part! This query is great, however it’s not going to run on it’s own. Welcome to the world of tigger functions in pgsql. Trigger functions are exactly what they are called: functions that get triggered by something. They are broken into two commands: the function and the trigger. First, the function:
CREATE OR REPLACE FUNCTION remove_dup()
RETURNS trigger AS
$$
BEGIN
DELETE FROM job
WHERE key IN
(SELECT key
FROM
(SELECT key,
ROW_NUMBER() OVER( PARTITION BY transactiondate,
departmentnumber,jobcode,activitynumber
ORDER BY key ) AS row_num
FROM job ) t
WHERE t.row_num > 1 );
RETURN NULL;END;$$
LANGUAGE 'plpgsql';
Yes, I know it’s scary looking but hang with me. See that bit in the middle after BEGIN
and before RETURN NULL;
that is the query from above. The rest is just a wrapper of settings depending on what you are doing with your trigger function. Basically, trigger functions normally are used to make some change to the incoming data or log an update on a table so when you see the table has been changed. You can find plenty of these examples from googling pgsql trigger functions. The important part of this script is that the function returns a tigger as a function that returns null. Basically the query gets triggered by a function and doesn’t say anything back to the trigger.
Now for the trigger:
CREATE TRIGGER job_table_change
AFTER INSERT
ON job
FOR EACH STATEMENT
EXECUTE PROCEDURE remove_dup();
This part is pretty easy to understand. I’m creating a trigger called job_table_change and it gets triggered after an insert on job. As for the FOR EACH STATEMENT
this is an option in trigger functions where you can say for each statement or for each row. In this option, I’m using statement because I want it to run after the entire data set has been entered then remove all duplicates in one pass. The next step after defining when it should run is what it runs which is EXECUTE PROCEDURE remove_dup();
this execute command runs the function we described earlier.
The great part about how this code is set up is that you can replace the remove duplicates function with any function you wish to run after an insert of data has occurred! That opens all kinds of possibilities with your data. Have fun and don’t drop that table!