Today I learned that I can create functions in SQL and reuse them for repetitive tasks while building or developing pipelines. I’ve been working with AWS Redshift for the past few months. Redshift is a powerful and cost-effective cloud data warehousing solution from Amazon. It has its own SQL dialect, which is a variant of PostgreSQL.
What are User-defined functions?
Redshift offers various built-in functions like `sysdate` or `coalesce` which can be executed whenever required. As the name suggests, user-defined functions are functions defined by a user to extend the built-in capabilities of Redshift SQL. The defined functions are stored inside the database and can be used by any user with sufficient permissions and credentials.
Language support for UDFs
UDFs in general can be created using SQL `select` statements. Redshift UDFs can also be created using Python functions. While UDFs written in SQL perform better and are less prone to errors, Python UDF support comes with the advantage of built-in libraries. In addition to the Python standard library, modules like pandas, numpy, scipy and dateutil are available in Redshift's Python runtime.
Redshift also supports AWS Lambda UDFs, which are custom functions defined in AWS Lambda. SQL and Python UDFs sometimes run into limitations; Lambda UDFs can handle these cases since they can be written in any language that AWS Lambda supports, including Go, Java, C#, Node.js etc. So most languages can be integrated into SQL statements in Redshift and reused over and over by all users.
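As a rough sketch, a Lambda UDF is registered in Redshift with a `create external function` statement. The function name, Lambda name, and IAM role ARN below are all placeholders:

```sql
-- Hypothetical example: expose a Lambda function as a Redshift UDF.
-- 'my-multiply-lambda' and the IAM role ARN are placeholders for your own.
CREATE EXTERNAL FUNCTION f_lambda_multiply (INT, INT)
RETURNS INT
VOLATILE
LAMBDA 'my-multiply-lambda'
IAM_ROLE 'arn:aws:iam::123456789012:role/my-redshift-lambda-role';
```

The IAM role must allow Redshift to invoke the Lambda function.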
SQL UDF
A SQL UDF wraps a SQL `select` clause that runs whenever the function is called and returns a scalar. A function is created with a `create function` statement, which has the following parts: input arguments along with their data types, one return data type, and a SQL `select` clause. A weird point to note here is that the arguments cannot be referred to by name inside the UDF's `select` clause. They must be referenced as `$1`, `$2` and so on, based on the order of the arguments in the function definition. The `select` clause should not include any other clauses like `from`, `where`, `limit` etc.; instead, those belong in the SQL statement that calls the UDF.
Structure of SQL UDF
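The general shape looks like this (names are placeholders):

```sql
CREATE [ OR REPLACE ] FUNCTION f_function_name ( argument_type [, ...] )
RETURNS return_type
STABLE | IMMUTABLE | VOLATILE
AS $$
  SELECT ...   -- a single select clause; arguments are referenced as $1, $2, ...
$$ LANGUAGE sql;
```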
Example - SQL UDF
Example 1:
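A minimal sketch, computing a price including sales tax (the function name and semantics are hypothetical):

```sql
-- $1 = price, $2 = tax rate
CREATE OR REPLACE FUNCTION f_sales_tax (float, float)
RETURNS float
IMMUTABLE
AS $$
  SELECT $1 + ($1 * $2)
$$ LANGUAGE sql;
```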
Example 2:
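Another sketch, this time a null-safe name concatenation (hypothetical names):

```sql
-- $1 = first name, $2 = last name; coalesce guards against NULLs
CREATE OR REPLACE FUNCTION f_full_name (varchar, varchar)
RETURNS varchar
IMMUTABLE
AS $$
  SELECT coalesce($1, '') || ' ' || coalesce($2, '')
$$ LANGUAGE sql;
```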
Now the above UDF can be used as following:
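Assuming a scalar UDF like the hypothetical `f_sales_tax(float, float)` and an `orders` table, a calling statement might look like:

```sql
-- from, where, limit etc. live in the calling statement, not inside the UDF
SELECT order_id,
       f_sales_tax(price, 0.0725) AS price_with_tax
FROM orders
WHERE order_date >= '2022-01-01'
LIMIT 10;
```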
Python UDF
Python UDFs are similar to SQL UDFs, but Python code can be used inside the function directly. The language definition for a Python UDF is `plpythonu`. Unlike SQL UDFs, Python UDFs require named arguments, and the function body refers to them by name. Modules like pandas and numpy can also be imported apart from the Python standard library. The input and return data types supported by Python UDFs are SMALLINT, INTEGER, BIGINT, DECIMAL, REAL, DOUBLE PRECISION, BOOLEAN, CHAR, VARCHAR, DATE, and TIMESTAMP. For a mapping of Amazon Redshift data types to Python data types, refer to Python UDF datatypes.
Structure of Python UDF
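The general shape looks like this (names are placeholders):

```sql
CREATE [ OR REPLACE ] FUNCTION f_function_name (arg_name arg_type [, ...])
RETURNS return_type
STABLE | IMMUTABLE | VOLATILE
AS $$
    # plain Python goes here; arguments are referenced by name
    return ...
$$ LANGUAGE plpythonu;
```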
Example - Python UDF
Example 1:
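A minimal sketch that extracts the hostname from a URL (the function name is hypothetical; note Redshift's `plpythonu` runtime is Python 2.7, hence the Python 2 module name):

```sql
CREATE OR REPLACE FUNCTION f_hostname (url varchar)
RETURNS varchar
IMMUTABLE
AS $$
    # Python 2.7: the module is urlparse, not urllib.parse
    from urlparse import urlparse
    if url is None:
        return None
    return urlparse(url).hostname
$$ LANGUAGE plpythonu;
```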
Example 2:
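Another sketch using numpy, which ships with Redshift's Python runtime (function name hypothetical):

```sql
CREATE OR REPLACE FUNCTION f_log1p (x float)
RETURNS float
IMMUTABLE
AS $$
    import numpy as np
    if x is None:
        return None
    # log(1 + x), numerically stable for small x
    return float(np.log1p(x))
$$ LANGUAGE plpythonu;
```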
Now the above UDF can be used as following:
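For instance, assuming the hypothetical `f_hostname` UDF and a `page_views` table:

```sql
SELECT f_hostname(referrer_url) AS referrer_host,
       count(*) AS views
FROM page_views
GROUP BY 1
ORDER BY views DESC
LIMIT 20;
```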
UDF Security and Permissions
- To create a UDF, the user must have usage permission on the `sql` or `plpythonu` language. By default, `USAGE ON LANGUAGE SQL` is granted to PUBLIC. However, you must explicitly grant `USAGE ON LANGUAGE PLPYTHONU` to specific users or groups.
- These permissions can be managed using the standard SQL `grant` and `revoke` statements.
- The user also needs execute permission to run functions created by others. By default, this access is granted to PUBLIC. To restrict usage, revoke the access and then grant the privilege to the particular users or user group.
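The revoke-then-grant pattern above might look like this (the group names and the `f_sales_tax` UDF are hypothetical):

```sql
-- Allow a specific group to create Python UDFs
GRANT USAGE ON LANGUAGE plpythonu TO GROUP etl_developers;

-- Lock down execution of a UDF, then grant it back selectively
REVOKE EXECUTE ON FUNCTION f_sales_tax(float, float) FROM PUBLIC;
GRANT EXECUTE ON FUNCTION f_sales_tax(float, float) TO GROUP analysts;
```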
Note
- AWS recommends naming all UDFs with an “f_” prefix. This is not mandatory, but it is good practice: Redshift reserves the f_ prefix exclusively for UDFs, so prefixing UDF names with f_ ensures the name will not conflict with any other existing or future functions.
- UDFs can also be created in a particular schema by qualifying the function name in the `create function` statement.
- UDFs with the same name can exist in different schemas.
- UDFs with the same name can also exist in the same schema if they differ in the number or data types of their arguments.
- For example, the functions `f_average(float, float, float)`, `f_average(float, float)` and `f_average(decimal, decimal)` can co-exist in the same schema. The Redshift query engine determines which function to call based on the number of arguments provided and their data types.
- A UDF can be dropped using `drop function f_sample_udf(<arguments>)`. The argument list is required precisely because functions can be overloaded.
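Putting the schema and overloading notes together, a sketch might look like this (schema and function names are hypothetical):

```sql
-- Same UDF name in two schemas
CREATE FUNCTION staging.f_double (int) RETURNS int IMMUTABLE
AS $$ SELECT $1 * 2 $$ LANGUAGE sql;

CREATE FUNCTION prod.f_double (int) RETURNS int IMMUTABLE
AS $$ SELECT $1 * 2 $$ LANGUAGE sql;

-- Drop only one of them; the argument list disambiguates overloads
DROP FUNCTION staging.f_double(int);
```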