Find MEDIAN of a column in SQL (Hackerrank SQL: Weather Observation Station 20)
A median is defined as a number separating the higher half of a data set from the lower half. Query the median of the Northern Latitudes (LAT_N) from STATION and round your answer to 4 decimal places.
Question Source: Hackerrank SQL: Weather Observation Station 20
Input Format
The STATION table is described as follows:
where LAT_N is the northern latitude and LONG_W is the western longitude.
We can solve this in two different way,
Approach A:
For MYSQL 8.0 we can use window function ROW_NUMBER()
We first set a row_number for each row based on column lat_n (order by lat_n)
SELECT lat_n, ROW_NUMBER() OVER (ORDER BY lat_n ASC) AS rn
FROM station
The output of the above query looks like below. The query output has mapped each row with a rank number based on lat_n. We will use the query output as a temporary virtual table and do search on that.
In next step, we find MEDIAN row (the row in the middle of sorted data).
Since the above query output can be considered as a sorted temporary table, we do the following search
SELECT
*
FROM
(
SELECT lat_n, ROW_NUMBER() OVER (ORDER BY lat_n) AS rn
FROM station
) AS subq
WHERE
rn = (SELECT CEIL((COUNT(rn)+1)/2) FROM station)
OR
rn = (SELECT FLOOR((COUNT(rn)+1)/2) FROM station)
The above query outputs -
Now, we got the middle row(s).
Finally, we can return the average lat_n value.
SELECT
ROUND(AVG(lat_n), 4) AS median
FROM
(
SELECT lat_n, ROW_NUMBER() OVER (ORDER BY lat_n) AS rn
FROM station
) AS subq
WHERE
rn = (SELECT CEIL((COUNT(rn)+1)/2) FROM station)
OR
rn = (SELECT FLOOR((COUNT(rn)+1)/2) FROM station)
The final output is
Now, Can you think why AVG
was necessary?
Approach B:
In older mysql versions, row_number() isn’t available. However, we can simply generate it by looping through all rows and using a SESSION VARIABLE
SET @rn := -1;
SELECT lat_n AS med, (@rn := @rn + 1) AS rn
FROM station
ORDER BY lat_n
The above query yields the following 0 based ranking
We can now use the output of the query as a searchable temporary table.
SET @rn := -1;
SELECT ROUND(AVG(subq.med), 4) AS median
FROM
(
SELECT lat_n AS med, (@rn := @rn + 1) AS rn
FROM station
ORDER BY lat_n
) AS subq
WHERE
subq.rn IN (FLOOR(@rn / 2), CEIL(@rn / 2));
Can you think how can we modify the query to use 1 based ranking instead of 0 based ranking above?