Window functions are a group of functions that perform calculations across a set of rows related to your current row. They are considered advanced SQL and are often asked about during data science interviews.
They’re also used on the job quite a lot to solve many different types of problems. Let’s summarize the four different types of window functions and cover why and when to use them.
4 Types of Window Functions
1. Regular aggregate functions
o These are aggregates like AVG, MIN/MAX, COUNT, SUM
o You’ll want to use these to aggregate your data and group it by another column like month or year.
2. Ranking functions
o ROW_NUMBER, RANK, DENSE_RANK
o These are functions that help you rank your data. You can either rank the entire dataset or rank within groups such as month or country
o Very useful for creating ranking indexes within groups.
3. Generating statistics
o NTILE is great if you need to generate simple statistics like percentiles, quartiles, and medians
o You can use this for your entire dataset or by the group.
4. Handling time-series data
o Very powerful and popular functions, especially if you need to calculate trends like a month-over-month moving average or growth metrics
o LAG and LEAD are the two functions that allow you to do this.
1. Regular aggregate functions
Regular aggregate functions are functions such as average, count, sum, and minimum/maximum that are applied to columns. Use them as window functions when you want to apply an aggregation to different groups in the dataset, such as months.
This is the same kind of calculation you can perform with an aggregate function in a SELECT statement, but unlike normal aggregate functions, window functions do not group several rows into one output row. Instead, the rows keep their separate identities.
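To make the contrast with GROUP BY concrete, here is a minimal sketch that runs both versions against an in-memory SQLite database from Python. The orders table and its values are made up for illustration, and SQLite is only a stand-in engine here; it needs version 3.25 or later for window functions.

```python
import sqlite3

# Hypothetical toy table: four orders across two months.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (order_id INTEGER, month TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [(1, "2020-01", 10.0), (2, "2020-01", 30.0),
     (3, "2020-02", 20.0), (4, "2020-02", 40.0)],
)

# GROUP BY collapses each month into a single output row.
grouped = conn.execute(
    "SELECT month, AVG(amount) FROM orders GROUP BY month ORDER BY month"
).fetchall()

# The window version keeps every order row and attaches the month's average to it.
windowed = conn.execute(
    """
    SELECT order_id, month, amount,
           AVG(amount) OVER (PARTITION BY month) AS month_avg
    FROM orders
    ORDER BY order_id
    """
).fetchall()

print(len(grouped), "rows from GROUP BY")         # one row per month
print(len(windowed), "rows from the window query")  # one row per order
```

The grouped query returns two rows, while the window query still returns all four orders, each carrying its month’s average.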
Avg() Example:
Now, let’s look at an example of the avg() window function used to answer a data analytics question. You can view the question and write code at the link below: platform.stratascratch.com/coding-question?id=10302&python=
This is a perfect example of using a window function and applying avg() to a month group. Here we’re trying to calculate the average distance per dollar by month. This is hard to do in SQL without a window function. Here we’ve applied the avg() window function in the 3rd column, which holds the month-year average on every row that belongs to that month-year. We can also use this metric to calculate the difference between the month’s average and the value for each request date in the table.
The code to implement the window function would look like this:
SELECT a.request_date,
       a.dist_to_cost,
       AVG(a.dist_to_cost) OVER(PARTITION BY a.request_mnth) AS avg_dist_to_cost
FROM
  (SELECT *,
          to_char(request_date::date, 'YYYY-MM') AS request_mnth,
          (distance_to_travel/monetary_cost) AS dist_to_cost
   FROM uber_request_logs) a
ORDER BY request_date
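The difference from the monthly average mentioned earlier can be sketched the same way. This is an illustration, not the platform’s solution: it runs on SQLite from Python, so strftime stands in for PostgreSQL’s to_char, the four rows are invented, and the diff_from_avg column name is hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE uber_request_logs "
    "(request_date TEXT, distance_to_travel REAL, monetary_cost REAL)"
)
# Invented sample rows; the comments give distance / cost for each request.
conn.executemany(
    "INSERT INTO uber_request_logs VALUES (?, ?, ?)",
    [
        ("2020-01-05", 10.0, 5.0),  # 2.0
        ("2020-01-20", 12.0, 3.0),  # 4.0
        ("2020-02-03", 9.0, 3.0),   # 3.0
        ("2020-02-15", 5.0, 5.0),   # 1.0
    ],
)

query = """
SELECT request_date,
       dist_to_cost,
       AVG(dist_to_cost) OVER (PARTITION BY request_mnth) AS avg_dist_to_cost,
       dist_to_cost - AVG(dist_to_cost) OVER (PARTITION BY request_mnth) AS diff_from_avg
FROM
  (SELECT request_date,
          strftime('%Y-%m', request_date) AS request_mnth,  -- SQLite stand-in for to_char
          distance_to_travel / monetary_cost AS dist_to_cost
   FROM uber_request_logs) a
ORDER BY request_date
"""
rows = conn.execute(query).fetchall()
for row in rows:
    print(row)
```

Because the window function leaves each request on its own row, the row’s value and the month’s average can be subtracted directly in the same SELECT.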
2. Ranking Functions
Ranking functions are an important utility for a data scientist. You’re always ranking and indexing your data to better understand which rows are the best in your dataset. SQL window functions give you three ranking utilities, RANK(), DENSE_RANK(), and ROW_NUMBER(), and which one you use depends on the exact use case. These functions help you order your data, overall or within groups, based on what you want.
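The three functions only differ in how they treat ties, which is easiest to see side by side. The sketch below uses a made-up scores table in an in-memory SQLite database (3.25 or later); the table and column names are illustrative only.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE scores (name TEXT, score INTEGER)")
# Hypothetical data with a tie at 90.
conn.executemany(
    "INSERT INTO scores VALUES (?, ?)",
    [("a", 100), ("b", 90), ("c", 90), ("d", 80)],
)

ranked = conn.execute(
    """
    SELECT name, score,
           ROW_NUMBER() OVER (ORDER BY score DESC, name) AS row_num,   -- always unique
           RANK()       OVER (ORDER BY score DESC)       AS rnk,       -- ties share a rank, then a gap
           DENSE_RANK() OVER (ORDER BY score DESC)       AS dense_rnk  -- ties share a rank, no gap
    FROM scores
    ORDER BY score DESC, name
    """
).fetchall()

for row in ranked:
    print(row)
```

For the tied rows, ROW_NUMBER gives 2 and 3, while RANK and DENSE_RANK both give 2. The next row is 4 under RANK but 3 under DENSE_RANK.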
Rank() Example:
Let’s take a look at one ranking window function example to see how we can rank data within groups using SQL window functions. Follow along interactively with this link: platform.stratascratch.com/coding-question?id=9898&python=
Here we want to find the top salaries by department. We can’t just find the top 3 salaries without a window function because that would give us the top 3 salaries across all departments, so we need to rank the salaries within each department individually. This is done with rank() partitioned by the department. From there it’s really easy to filter for the top 3 in each department.
Here’s the code to output this table. You can copy and paste in the SQL editor in the link above and see the same output.
SELECT department,
salary,
RANK() OVER (PARTITION BY a.department
ORDER BY a.salary DESC) AS rank_id
FROM
(SELECT department, salary
FROM twitter_employee
GROUP BY department, salary
ORDER BY department, salary) a
ORDER BY department,
salary DESC
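The final filtering step is not shown in the query above, so here is one way it might look: wrap the ranking query in another subquery and keep rows where rank_id is 3 or less. This is a sketch on an in-memory SQLite database with invented rows for twitter_employee, not the platform’s reference solution.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE twitter_employee (department TEXT, salary INTEGER)")
# Invented rows, including a duplicate salary in Sales.
conn.executemany(
    "INSERT INTO twitter_employee VALUES (?, ?)",
    [("Sales", 100), ("Sales", 90), ("Sales", 90), ("Sales", 80), ("Sales", 70),
     ("Eng", 200), ("Eng", 150)],
)

top3 = conn.execute(
    """
    SELECT department, salary, rank_id
    FROM
      (SELECT department,
              salary,
              RANK() OVER (PARTITION BY a.department
                           ORDER BY a.salary DESC) AS rank_id
       FROM
         (SELECT department, salary
          FROM twitter_employee
          GROUP BY department, salary) a) b
    WHERE rank_id <= 3          -- window results can only be filtered in an outer query
    ORDER BY department, salary DESC
    """
).fetchall()

for row in top3:
    print(row)
```

The outer query is needed because a window function’s result cannot be referenced in the WHERE clause of the same SELECT that computes it.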
3. NTILE
NTILE is a very useful function for those in data analytics, business analytics, and data science. Often when dealing with statistical data, you need to create robust statistics such as quartiles, quintiles, medians, and deciles in your daily job, and NTILE makes it easy to generate these outputs.
NTILE takes the number of bins as its argument (basically how many buckets you want to split your data into) and then divides your ordered rows into that many bins of as equal a size as possible. You specify how the data is ordered, and you can partition it if you want the bins computed within groups.
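As a small illustration of the binning, the sketch below splits eight made-up values into quartiles with NTILE(4). It assumes an in-memory SQLite database (3.25 or later) and a hypothetical measurements table.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE measurements (value INTEGER)")
# Eight hypothetical values: 10, 20, ..., 80.
conn.executemany(
    "INSERT INTO measurements VALUES (?)",
    [(v,) for v in range(10, 90, 10)],
)

quartiles = conn.execute(
    """
    SELECT value,
           NTILE(4) OVER (ORDER BY value) AS quartile  -- 4 bins, lowest values in bin 1
    FROM measurements
    ORDER BY value
    """
).fetchall()

for row in quartiles:
    print(row)
```

With eight rows and four bins, each bin gets exactly two rows. When the row count does not divide evenly, the earlier bins each receive one extra row.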
NTILE(100) Example:
In this example, we’ll learn how to use NTILE to categorize our data into percentiles. Follow along interactively at the link here: platform.stratascratch.com/coding-question?id=10303&python=
What you’re trying to do here is identify the top 5 percent of claims based on a score an algorithm outputs. But you can’t just order by the score and take the top 5% because you want the top 5% within each state. So one way to do this is to use the NTILE() window function and PARTITION BY state. Then you can apply a filter in the WHERE clause to get the top 5%.
Here’s the code to output the entire table above. You can copy and paste it in the link above.
SELECT policy_num,
state,
claim_cost,
fraud_score,
percentile
FROM
(SELECT *,
NTILE(100) OVER(PARTITION BY state
ORDER BY fraud_score DESC) AS percentile
FROM fraud_score) a
WHERE percentile <=5
4. Handling time-series data
LAG and LEAD are two window functions that are useful for dealing with time-series data. The only difference between LAG and LEAD is whether you want to grab from previous rows or following rows, almost like sampling from previous data or future data.
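A quick way to see the difference is to put both functions next to each other on the same rows. The monthly_sales table below is invented for this sketch, which runs on an in-memory SQLite database (3.25 or later).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE monthly_sales (month TEXT, revenue INTEGER)")
# Hypothetical three-month series.
conn.executemany(
    "INSERT INTO monthly_sales VALUES (?, ?)",
    [("2020-01", 100), ("2020-02", 120), ("2020-03", 90)],
)

shifted = conn.execute(
    """
    SELECT month,
           revenue,
           LAG(revenue)  OVER (ORDER BY month) AS prev_revenue,  -- value from the previous row
           LEAD(revenue) OVER (ORDER BY month) AS next_revenue   -- value from the following row
    FROM monthly_sales
    ORDER BY month
    """
).fetchall()

for row in shifted:
    print(row)
```

The first row has no previous row and the last row has no following row, so those cells come back as NULL (None in Python).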
You can use LAG and LEAD to calculate month-over-month growth or rolling averages. As a data scientist and business analyst, you’re always dealing with time series data and creating those time metrics.
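Both metrics can be sketched with LAG alone. The example below computes month-over-month growth and a two-month rolling average on a made-up monthly_sales table in an in-memory SQLite database; the column names mom_growth_pct and rolling_2m_avg are hypothetical. Longer rolling windows are often written with a window frame clause instead, which this sketch does not cover.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE monthly_sales (month TEXT, revenue INTEGER)")
# Hypothetical three-month series.
conn.executemany(
    "INSERT INTO monthly_sales VALUES (?, ?)",
    [("2020-01", 100), ("2020-02", 120), ("2020-03", 90)],
)

metrics = conn.execute(
    """
    SELECT month,
           revenue,
           (revenue - prev_revenue) * 100.0 / prev_revenue AS mom_growth_pct,
           (revenue + prev_revenue) / 2.0                  AS rolling_2m_avg
    FROM
      (SELECT month,
              revenue,
              LAG(revenue, 1) OVER (ORDER BY month) AS prev_revenue
       FROM monthly_sales) t
    ORDER BY month
    """
).fetchall()

for row in metrics:
    print(row)
```

The first month has no previous value, so both metrics are NULL there. Multiplying by 100.0 and dividing by 2.0 forces floating-point arithmetic, so integer columns are not truncated.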
LAG() Example:
In this example, we want to find the percentage growth year-over-year, which is a very common question that data scientists and business analysts answer daily. The problem statement, data, and SQL editor are at the following link if you want to try to code the solution on your own: platform.stratascratch.com/coding-question?id=9637&python=
What’s hard about this problem is how the data is set up: you need to use the previous row’s value in your metric. But SQL isn’t built to do that by default. SQL is built to calculate anything you want as long as the values are on the same row. So we can use the lag() or lead() window function, which takes a value from a previous or subsequent row and puts it in your current row, which is what this question needs.
And now you can get the code to output the entire table above. You can copy and paste the code into the SQL editor at the link above:
SELECT year,
       current_year_host,
       prev_year_host,
       round(((current_year_host - prev_year_host) / (cast(prev_year_host AS numeric))) * 100) estimated_growth
FROM
  (SELECT year,
          current_year_host,
          LAG(current_year_host, 1) OVER (ORDER BY year) AS prev_year_host
   FROM
     (SELECT extract(year
                     FROM host_since::date) AS year,
             count(id) current_year_host
      FROM airbnb_search_details
      WHERE host_since IS NOT NULL
      GROUP BY extract(year
                       FROM host_since::date)
      ORDER BY year) t1) t2
Window functions are very useful in your daily job as a data scientist and often come up in interviews. These functions make problems that involve rankings and growth calculations much easier to solve than they would be without them.