A User-Defined Function returns more rows

Aveek Das
Aveek Dashttps://datacloudmag.com
Data Engineer, Cloud Data Architect, Thinker, Traveler, Amateur Photographer. Enjoys short walks for hot chocolates. Blogs @ https://datacloudmag.com


Recently, while working with one of my clients, I came across a script in which I had to introduce a scalar User-Defined Function in the SELECT statement. The SELECT used to return around 750K records. However, after introducing the UDF, the row count significantly increased to 826K. There were no other changes made to the script . The UDF was designed to return only a single row for the parameters passed, despite having multiple joins inside the UDF itself.


After a lot of head-scratching for over two days, I finally found where the issue was. There was a DISTINCT in the SELECT statement, which I did overlook. This was the ultimate culprit and caused this huge bump of the record count. The solution is achieved by removing the DISTINCT and adding a GROUP BY followed by a SUM().


To replicate the issue, I will show a sample example of what the issue was and how did it affect the result set. Let’s consider we have an Orders table with records as below.


The query used in the report was a pretty simple one, which just returned the Products that were ordered.

Report Query

Now, there comes a new business requirement, which wants me to display the TotalPrice along with the Products being sold. An important thing to note here is that TotalPrice is a calculated field (Price * Quantity) that is not directly available in the base table.

In order to implement this, I thought of using a user-defined function that would take Price and Quantity as parameters and give us the TotalPrice back. I decided to create the function as follows.

TotalPrice Function

And then, I modified the SELECT query to include this new UDF.

New Select query

However, as you can see from the previous image, after introducing the UDF, the record count increased from 3 to 4. Since this is a fairly simple example, we can easily see the cause of the issue by having a look over the data. But in my case, it was something with around 60 columns and over 750K records, which was quite difficult to understand by just looking at the data.

Updated SELECT query

In order to return only 3 records like before, we have to modify the statement by removing the DISTINCT and introducing a GROUP BY followed by a SUM on the function. Now, we have only 3 records as before and the requirement is also fulfilled.