SQL Mastery Guide - Part 3: Aggregations and GROUP BY
Introduction
Welcome to Part 3 of our SQL Mastery series. In Parts 1 and 2, we covered SELECT fundamentals and all types of JOINs. Now we tackle one of SQL's most powerful features: aggregations.
Aggregations transform rows into meaningful summaries. Instead of seeing 10,000 orders, you see "total revenue by month." Instead of listing every employee, you see "average salary by department."
This part covers Questions 51-75, taking you from basic COUNT to advanced window functions.
Schema Reference
We continue with our e-commerce database, adding some new tables:
```sql
-- Core Tables (from Part 1)
CREATE TABLE users (
    user_id SERIAL PRIMARY KEY,
    email VARCHAR(255) UNIQUE NOT NULL,
    full_name VARCHAR(100),
    created_at TIMESTAMP DEFAULT NOW(),
    country VARCHAR(50),
    is_active BOOLEAN DEFAULT true
);

CREATE TABLE categories (
    category_id SERIAL PRIMARY KEY,
    name VARCHAR(100) NOT NULL,
    parent_category_id INT REFERENCES categories(category_id)
);

CREATE TABLE products (
    product_id SERIAL PRIMARY KEY,
    name VARCHAR(255) NOT NULL,
    price DECIMAL(10,2),
    category_id INT REFERENCES categories(category_id),
    stock_quantity INT DEFAULT 0,
    created_at TIMESTAMP DEFAULT NOW()
);

CREATE TABLE orders (
    order_id SERIAL PRIMARY KEY,
    user_id INT REFERENCES users(user_id),
    order_date TIMESTAMP DEFAULT NOW(),
    status VARCHAR(20) DEFAULT 'pending',
    total_amount DECIMAL(10,2),
    shipping_address TEXT
);

CREATE TABLE order_items (
    item_id SERIAL PRIMARY KEY,
    order_id INT REFERENCES orders(order_id),
    product_id INT REFERENCES products(product_id),
    quantity INT NOT NULL,
    unit_price DECIMAL(10,2) NOT NULL
);

-- New Tables for Aggregation Practice
CREATE TABLE page_views (
    view_id SERIAL PRIMARY KEY,
    user_id INT REFERENCES users(user_id),
    page_url VARCHAR(500),
    viewed_at TIMESTAMP DEFAULT NOW(),
    duration_seconds INT
);

CREATE TABLE reviews (
    review_id SERIAL PRIMARY KEY,
    product_id INT REFERENCES products(product_id),
    user_id INT REFERENCES users(user_id),
    rating INT CHECK (rating BETWEEN 1 AND 5),
    review_text TEXT,
    created_at TIMESTAMP DEFAULT NOW()
);

CREATE TABLE employee_sales (
    sale_id SERIAL PRIMARY KEY,
    employee_id INT,
    sale_date DATE,
    amount DECIMAL(10,2),
    region VARCHAR(50)
);
```
Question 51: Basic COUNT - Counting All Rows
Problem
Count the total number of users in our system.
Solution
```sql
SELECT COUNT(*) AS total_users
FROM users;
```
Explanation
COUNT(*) counts every row in the table, regardless of NULL values. It's the simplest aggregation.
How it works internally:
- Database scans the table (or uses an index)
- Increments counter for each row
- Returns final count
Common Mistakes
Mistake 1: Using COUNT(column) when you want all rows
```sql
-- This might give wrong count if email has NULLs
SELECT COUNT(email) FROM users;

-- This always counts all rows
SELECT COUNT(*) FROM users;
```
Mistake 2: Thinking COUNT(*) is slow
```sql
-- Many people write this thinking it's faster
SELECT COUNT(1) FROM users;

-- Modern databases optimize COUNT(*) the same way
SELECT COUNT(*) FROM users;
```
Performance Considerations
- PostgreSQL: COUNT(*) scans entire table by default (MVCC reasons)
- MySQL InnoDB: Also needs table scan
- MySQL MyISAM: Keeps row count cached (instant result)
For large tables, consider:
```sql
-- Approximate count (much faster)
SELECT reltuples::BIGINT AS approximate_count
FROM pg_class
WHERE relname = 'users';
```
When to Use
- Getting total record counts
- Dashboard statistics
- Validation checks
When NOT to Use
- When you need exact count on billion-row tables (consider approximations)
- When you only need to know if rows exist (use EXISTS instead)
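To illustrate the EXISTS alternative, here is a minimal sketch (the user_id value is a placeholder): EXISTS stops at the first matching row instead of scanning everything to produce an exact count.

```sql
-- Existence check: stops at the first matching row
SELECT EXISTS (
    SELECT 1 FROM orders WHERE user_id = 42
) AS has_orders;
```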
Question 52: COUNT with DISTINCT
Problem
Count how many unique countries our users are from.
Solution
```sql
SELECT COUNT(DISTINCT country) AS unique_countries
FROM users;
```
Explanation
COUNT(DISTINCT column) counts only unique non-NULL values.
Step-by-step:
- Scan country column
- Build hash set of unique values
- Ignore NULLs
- Return set size
Common Mistakes
Mistake 1: Forgetting NULLs aren't counted
```sql
-- If 100 users have country, 10 have NULL
-- This returns count of non-NULL unique countries
SELECT COUNT(DISTINCT country) FROM users;  -- Maybe 25

-- To include NULL as a country
SELECT COUNT(DISTINCT country)
       + CASE WHEN EXISTS (SELECT 1 FROM users WHERE country IS NULL)
              THEN 1 ELSE 0 END
FROM users;
```
Mistake 2: Using DISTINCT on primary keys
```sql
-- Pointless - every user_id is already unique
SELECT COUNT(DISTINCT user_id) FROM users;

-- Same result, more efficient
SELECT COUNT(*) FROM users;
```
Performance Considerations
```sql
-- EXPLAIN ANALYZE this
SELECT COUNT(DISTINCT country) FROM users;

-- If slow, consider:
-- 1. Index on country column
CREATE INDEX idx_users_country ON users(country);

-- 2. Materialized view for dashboards
CREATE MATERIALIZED VIEW country_stats AS
SELECT COUNT(DISTINCT country) AS unique_countries
FROM users;
```
When to Use
- Counting unique visitors, unique products sold, unique categories
- De-duplication counts
Question 53: SUM - Calculating Totals
Problem
Calculate the total revenue from all orders.
Solution
```sql
SELECT SUM(total_amount) AS total_revenue
FROM orders
WHERE status = 'completed';
```
Explanation
SUM(column) adds up all non-NULL values in the column.
Important behaviors:
- NULL values are ignored
- Returns NULL if all values are NULL
- Returns NULL on empty result set
Common Mistakes
Mistake 1: Not handling NULLs
```sql
-- If some orders have NULL total_amount
SELECT SUM(total_amount) FROM orders;
-- NULL amounts are skipped; result is NULL if ALL values are NULL

-- Better when you want 0 instead of NULL:
SELECT SUM(COALESCE(total_amount, 0)) FROM orders;
```
Mistake 2: Integer overflow
```sql
-- In some databases (e.g. SQL Server), if amount is INTEGER
-- and the sum exceeds ~2.1 billion:
SELECT SUM(amount) FROM huge_sales_table;  -- Overflow error!
-- (PostgreSQL promotes SUM of INTEGER to BIGINT automatically)

-- Cast to BIGINT to be safe
SELECT SUM(amount::BIGINT) FROM huge_sales_table;
```
Mistake 3: Precision loss with floats
```sql
-- FLOAT can lose precision
SELECT SUM(price) FROM products;  -- 999.9999999997

-- Use DECIMAL for money
ALTER TABLE products ALTER COLUMN price TYPE DECIMAL(10,2);
```
Performance Considerations
```sql
-- Large table SUM can be slow
EXPLAIN ANALYZE SELECT SUM(total_amount) FROM orders;

-- Covering index helps
CREATE INDEX idx_orders_sum ON orders(status) INCLUDE (total_amount);

-- For real-time dashboards, use materialized views
CREATE MATERIALIZED VIEW revenue_summary AS
SELECT
    DATE_TRUNC('day', order_date) AS day,
    SUM(total_amount) AS daily_revenue
FROM orders
WHERE status = 'completed'
GROUP BY DATE_TRUNC('day', order_date);
```
Question 54: AVG - Calculating Averages
Problem
Find the average order value for completed orders.
Solution
```sql
SELECT AVG(total_amount) AS average_order_value
FROM orders
WHERE status = 'completed';
```
Explanation
AVG(column) = SUM(column) / COUNT(column)
NULL values are excluded from both numerator and denominator.
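The identity above can be checked directly against the same table — a quick sketch; both columns return the same value because both SUM and COUNT(column) skip NULLs:

```sql
-- AVG(x) equals SUM(x) / COUNT(x) over the same non-NULL rows
SELECT
    AVG(total_amount)                        AS avg_builtin,
    SUM(total_amount) / COUNT(total_amount)  AS avg_manual
FROM orders;
```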
Common Mistakes
Mistake 1: Confusing AVG behavior with NULLs
```sql
-- Table: scores(student_id, score)
-- Data: (1, 100), (2, 80), (3, NULL)
SELECT AVG(score) FROM scores;
-- Returns 90, not 60! (180/2, not 180/3)

-- If you want NULL treated as 0:
SELECT AVG(COALESCE(score, 0)) FROM scores;  -- Returns 60 (180/3)
```
Mistake 2: Not rounding for display
```sql
-- Raw average has many decimal places
SELECT AVG(total_amount) FROM orders;  -- 125.7834921...

-- Round appropriately
SELECT ROUND(AVG(total_amount), 2) AS avg_order FROM orders;  -- 125.78
```
Performance Considerations
Same as SUM - uses same scan, benefits from indexes.
When to Use
- Calculating average order value, average session duration
- Computing mean metrics
When NOT to Use
When you need median (for skewed data):
```sql
-- Average can be misleading with outliers
-- Order values: 50, 50, 50, 50, 10000
-- AVG = 2040 (not representative!)

-- Use PERCENTILE_CONT for median
SELECT PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY total_amount) AS median_order
FROM orders;
-- Median = 50 (more representative)
```
Question 55: MIN and MAX
Problem
Find the cheapest and most expensive products.
Solution
```sql
SELECT
    MIN(price) AS cheapest_price,
    MAX(price) AS most_expensive_price
FROM products;
```
Explanation
- MIN(column): Returns smallest value
- MAX(column): Returns largest value
- Both ignore NULLs
Common Mistakes
Mistake 1: Using MIN/MAX on wrong data types
```sql
-- String comparison is alphabetical, not numeric!
SELECT MAX(order_id) FROM orders;  -- If order_id is VARCHAR
-- '9' > '10' alphabetically!
-- Ensure numeric types for numeric comparisons
```
Mistake 2: Trying to get the row with MIN/MAX
```sql
-- This doesn't work as expected
SELECT product_id, MIN(price) FROM products;  -- Error or wrong!

-- Correct approaches:

-- Option 1: Subquery
SELECT * FROM products
WHERE price = (SELECT MIN(price) FROM products);

-- Option 2: ORDER BY with LIMIT
SELECT * FROM products
ORDER BY price ASC
LIMIT 1;

-- Option 3: Window function
SELECT * FROM (
    SELECT *, ROW_NUMBER() OVER (ORDER BY price ASC) AS rn
    FROM products
) sub
WHERE rn = 1;
```
Performance Considerations
```sql
-- MIN/MAX can use B-tree index efficiently
CREATE INDEX idx_products_price ON products(price);

-- With index, MIN/MAX is O(1) - just read first/last leaf
EXPLAIN SELECT MIN(price) FROM products;
-- Shows: Index Only Scan
```
When to Use
- Finding bounds (earliest date, latest login)
- Range queries
- Data validation
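MIN and MAX work on dates and timestamps, not just numbers — a small sketch of the "earliest/latest" use case against the users table:

```sql
-- Earliest and most recent signups
SELECT
    MIN(created_at) AS first_signup,
    MAX(created_at) AS latest_signup
FROM users;
```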
Question 56: GROUP BY Basics
Problem
Count orders per user.
Solution
```sql
SELECT
    user_id,
    COUNT(*) AS order_count
FROM orders
GROUP BY user_id
ORDER BY order_count DESC;
```
Explanation
GROUP BY creates buckets of rows with the same values, then applies aggregations to each bucket.
How it works:
- Scan table
- Hash rows by GROUP BY columns
- Apply aggregation to each group
- Return one row per group
```
Before GROUP BY:
user_id | order_id | total
1       | 101      | 50
1       | 102      | 75
2       | 103      | 30
1       | 104      | 25

After GROUP BY user_id:
user_id | COUNT(*) | SUM(total)
1       | 3        | 150
2       | 1        | 30
```
Common Mistakes
Mistake 1: Selecting non-grouped columns
```sql
-- ERROR in standard SQL!
SELECT user_id, order_date, COUNT(*)  -- order_date not in GROUP BY
FROM orders
GROUP BY user_id;
-- Which order_date should it show? Undefined!

-- Fixed:
SELECT user_id, MAX(order_date) AS latest_order, COUNT(*)
FROM orders
GROUP BY user_id;
```
Mistake 2: Misunderstanding GROUP BY with NULLs
```sql
-- NULLs are grouped together
SELECT country, COUNT(*)
FROM users
GROUP BY country;
-- Returns a row for NULL country with count of NULL users
```
Performance Considerations
```sql
-- GROUP BY can be expensive on large tables
EXPLAIN ANALYZE
SELECT user_id, COUNT(*) FROM orders GROUP BY user_id;

-- Hash Aggregate vs Sort Aggregate
-- Database chooses based on:
-- - work_mem setting
-- - Number of groups
-- - Available indexes

-- Index can help if sorting is chosen
CREATE INDEX idx_orders_user ON orders(user_id);
```
When to Use
- Any time you need per-group statistics
- Reporting and analytics
- Data summarization
Question 57: GROUP BY Multiple Columns
Problem
Calculate total sales per employee per region.
Solution
```sql
SELECT
    employee_id,
    region,
    SUM(amount) AS total_sales,
    COUNT(*) AS sale_count
FROM employee_sales
GROUP BY employee_id, region
ORDER BY employee_id, region;
```
Explanation
With multiple columns, GROUP BY creates groups for each unique combination.
```
employee_id | region | amount
1           | North  | 100
1           | North  | 150
1           | South  | 200
2           | North  | 300

Groups:
(1, North) -> SUM: 250, COUNT: 2
(1, South) -> SUM: 200, COUNT: 1
(2, North) -> SUM: 300, COUNT: 1
```
Common Mistakes
Mistake 1: Wrong granularity
```sql
-- Too many columns = very small groups
SELECT employee_id, region, sale_date, SUM(amount)
FROM employee_sales
GROUP BY employee_id, region, sale_date;
-- Might give one row per sale if dates are unique!

-- Consider date truncation
SELECT employee_id, region, DATE_TRUNC('month', sale_date), SUM(amount)
FROM employee_sales
GROUP BY employee_id, region, DATE_TRUNC('month', sale_date);
```
Performance Considerations
```sql
-- Multi-column GROUP BY benefits from composite index
CREATE INDEX idx_sales_emp_region ON employee_sales(employee_id, region);

-- Order matters! Index on (employee_id, region) helps:
GROUP BY employee_id, region  -- Yes
GROUP BY employee_id          -- Yes
GROUP BY region               -- No (can't skip columns)
```
Question 58: HAVING - Filtering Groups
Problem
Find users who have placed more than 5 orders.
Solution
```sql
SELECT
    user_id,
    COUNT(*) AS order_count
FROM orders
GROUP BY user_id
HAVING COUNT(*) > 5
ORDER BY order_count DESC;
```
Explanation
HAVING filters groups AFTER aggregation. WHERE filters rows BEFORE aggregation.
```
Execution Order:
1. FROM orders
2. WHERE (filter rows)
3. GROUP BY user_id
4. HAVING COUNT(*) > 5 (filter groups)
5. SELECT
6. ORDER BY
```
Common Mistakes
Mistake 1: Using WHERE for aggregate conditions
```sql
-- ERROR!
SELECT user_id, COUNT(*)
FROM orders
WHERE COUNT(*) > 5  -- Can't use aggregate in WHERE
GROUP BY user_id;

-- Correct: use HAVING
SELECT user_id, COUNT(*)
FROM orders
GROUP BY user_id
HAVING COUNT(*) > 5;
```
Mistake 2: Using HAVING when WHERE would work
```sql
-- Inefficient: filters after grouping
SELECT user_id, COUNT(*)
FROM orders
GROUP BY user_id
HAVING user_id > 100;

-- Better: filter before grouping (fewer rows to group)
SELECT user_id, COUNT(*)
FROM orders
WHERE user_id > 100
GROUP BY user_id;
```
Performance Considerations
```sql
-- HAVING is applied after GROUP BY
-- So all rows are still processed
-- If possible, use WHERE to reduce rows BEFORE grouping

-- WHERE filters: O(n) scan, then group smaller set
-- HAVING filters: O(n) scan, group all, then filter

EXPLAIN ANALYZE
SELECT user_id, COUNT(*)
FROM orders
WHERE order_date > '2024-01-01'  -- Reduces rows first
GROUP BY user_id
HAVING COUNT(*) > 5;             -- Filters groups after
```
Question 59: Combining WHERE and HAVING
Problem
Find products with average rating above 4, considering only reviews from 2024.
Solution
```sql
SELECT
    p.product_id,
    p.name,
    AVG(r.rating) AS avg_rating,
    COUNT(r.review_id) AS review_count
FROM products p
JOIN reviews r ON p.product_id = r.product_id
WHERE r.created_at >= '2024-01-01'   -- Filter rows
GROUP BY p.product_id, p.name
HAVING AVG(r.rating) > 4             -- Filter groups
   AND COUNT(r.review_id) >= 5       -- Minimum reviews
ORDER BY avg_rating DESC;
```
Explanation
This query demonstrates the full flow:
- JOIN products and reviews
- WHERE filters to 2024 reviews only
- GROUP BY product
- HAVING filters to high-rated products with enough reviews
- ORDER BY rating
Common Mistakes
Mistake 1: Putting HAVING conditions in WHERE
```sql
-- Wrong order causes errors
SELECT product_id, AVG(rating)
FROM reviews
WHERE AVG(rating) > 4  -- ERROR!
GROUP BY product_id;
```
Mistake 2: Not understanding NULL in HAVING
```sql
-- Products with no reviews have NULL avg
-- HAVING AVG(rating) > 4 excludes them automatically
-- (NULL > 4 is UNKNOWN, not TRUE)
```
Question 60: Grouping by Expressions
Problem
Calculate total revenue by month.
Solution
```sql
SELECT
    DATE_TRUNC('month', order_date) AS month,
    SUM(total_amount) AS monthly_revenue,
    COUNT(*) AS order_count
FROM orders
WHERE status = 'completed'
GROUP BY DATE_TRUNC('month', order_date)
ORDER BY month;
```
Explanation
You can GROUP BY any expression, not just columns. The database evaluates the expression for each row, then groups by result.
Common Mistakes
Mistake 1: Expression mismatch
```sql
-- SELECT and GROUP BY expressions must match exactly
SELECT DATE_TRUNC('month', order_date) AS month, COUNT(*)
FROM orders
GROUP BY month;  -- Some databases don't allow alias reference

-- Safer: repeat expression
SELECT DATE_TRUNC('month', order_date) AS month, COUNT(*)
FROM orders
GROUP BY DATE_TRUNC('month', order_date);

-- Or use position
SELECT DATE_TRUNC('month', order_date) AS month, COUNT(*)
FROM orders
GROUP BY 1;  -- Group by first SELECT expression
```
Performance Considerations
```sql
-- Expression in GROUP BY can't use regular index
-- Computed during scan

-- Create expression index for better performance
CREATE INDEX idx_orders_month ON orders(DATE_TRUNC('month', order_date));

-- Now the GROUP BY can use index
EXPLAIN ANALYZE
SELECT DATE_TRUNC('month', order_date), COUNT(*)
FROM orders
GROUP BY DATE_TRUNC('month', order_date);
```
Question 61: ROLLUP - Subtotals and Grand Totals
Problem
Show sales by region with subtotals and grand total.
Solution
```sql
SELECT
    COALESCE(region, 'TOTAL') AS region,
    SUM(amount) AS total_sales,
    COUNT(*) AS sale_count
FROM employee_sales
GROUP BY ROLLUP(region)
ORDER BY
    CASE WHEN region IS NULL THEN 1 ELSE 0 END,
    region;
```
Explanation
ROLLUP creates grouping sets that include subtotals:
GROUP BY ROLLUP(region) produces:
- Groups for each region
- Grand total (all regions combined)
```
Result:
region | total_sales | sale_count
North  | 50000       | 120
South  | 45000       | 100
West   | 60000       | 150
TOTAL  | 155000      | 370
```
For multiple columns:
```sql
GROUP BY ROLLUP(region, employee_id)
-- Produces:
-- (region, employee_id) - per employee per region
-- (region)              - subtotal per region
-- ()                    - grand total
```
Common Mistakes
Mistake 1: Not distinguishing NULL data from subtotal rows
```sql
-- If region can be NULL in data, how to tell NULL region from subtotal?
-- Use GROUPING() function
SELECT
    region,
    SUM(amount),
    GROUPING(region) AS is_subtotal  -- 1 if subtotal row, 0 if actual data
FROM employee_sales
GROUP BY ROLLUP(region);
```
Performance Considerations
ROLLUP adds extra grouping operations but is more efficient than UNION ALL of separate queries.
Question 62: CUBE - All Combinations
Problem
Analyze sales by both region and quarter with all subtotals.
Solution
```sql
SELECT
    COALESCE(region, 'All Regions') AS region,
    COALESCE(DATE_TRUNC('quarter', sale_date)::TEXT, 'All Quarters') AS quarter,
    SUM(amount) AS total_sales
FROM employee_sales
GROUP BY CUBE(region, DATE_TRUNC('quarter', sale_date))
ORDER BY region, quarter;
```
Explanation
CUBE creates all possible combinations:
GROUP BY CUBE(region, quarter) produces:
- (region, quarter) - per region per quarter
- (region) - per region, all quarters
- (quarter) - per quarter, all regions
- () - grand total
More combinations than ROLLUP but gives complete dimensional analysis.
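The GROUPING() function from Question 61 works with CUBE as well — a sketch (column aliases are illustrative) that labels which dimension each output row has rolled up:

```sql
-- GROUPING(col) = 1 when this row aggregates across that column
SELECT
    region,
    DATE_TRUNC('quarter', sale_date) AS quarter,
    SUM(amount) AS total_sales,
    GROUPING(region) AS region_rolled_up,
    GROUPING(DATE_TRUNC('quarter', sale_date)) AS quarter_rolled_up
FROM employee_sales
GROUP BY CUBE(region, DATE_TRUNC('quarter', sale_date));
```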
When to Use
- OLAP-style reporting
- Creating pivot-table-like summaries
- When you need all possible subtotals
Question 63: GROUPING SETS - Custom Groups
Problem
Get sales totals by region, by employee, and overall (but not region+employee combination).
Solution
```sql
SELECT
    region,
    employee_id,
    SUM(amount) AS total_sales
FROM employee_sales
GROUP BY GROUPING SETS (
    (region),
    (employee_id),
    ()
)
ORDER BY region NULLS LAST, employee_id NULLS LAST;
```
Explanation
GROUPING SETS lets you specify exactly which groupings you want:
- More flexible than ROLLUP/CUBE
- Avoids computing unwanted combinations
Equivalent to:

```sql
SELECT region, NULL, SUM(amount) FROM employee_sales GROUP BY region
UNION ALL
SELECT NULL, employee_id, SUM(amount) FROM employee_sales GROUP BY employee_id
UNION ALL
SELECT NULL, NULL, SUM(amount) FROM employee_sales
```
But more efficient (single scan).
Question 64: Window Functions Introduction - ROW_NUMBER
Problem
Rank users by their total order value.
Solution
```sql
SELECT
    u.user_id,
    u.full_name,
    SUM(o.total_amount) AS total_spent,
    ROW_NUMBER() OVER (ORDER BY SUM(o.total_amount) DESC) AS rank
FROM users u
JOIN orders o ON u.user_id = o.user_id
WHERE o.status = 'completed'
GROUP BY u.user_id, u.full_name
ORDER BY rank;
```
Explanation
Window functions perform calculations across related rows WITHOUT collapsing them into one row like GROUP BY.
ROW_NUMBER() assigns sequential numbers:
```
user_id | total_spent | rank
5       | 10000       | 1
3       | 8500        | 2
8       | 8500        | 3   -- Different rank despite same amount
1       | 7200        | 4
```
Syntax:
```
function() OVER (ORDER BY ...)
```

Common Mistakes
Mistake 1: Confusing with GROUP BY
```sql
-- This is NOT a window function (collapses rows)
SELECT user_id, SUM(total_amount)
FROM orders
GROUP BY user_id;

-- This IS a window function (keeps all rows)
SELECT
    order_id,
    user_id,
    total_amount,
    SUM(total_amount) OVER (PARTITION BY user_id) AS user_total
FROM orders;
```
Mistake 2: Using ROW_NUMBER for ties
```sql
-- ROW_NUMBER gives different ranks for same values
-- Use RANK() or DENSE_RANK() if ties should match
```
Question 65: RANK and DENSE_RANK
Problem
Rank products by sales, handling ties appropriately.
Solution
```sql
WITH product_sales AS (
    SELECT
        p.product_id,
        p.name,
        SUM(oi.quantity) AS units_sold
    FROM products p
    JOIN order_items oi ON p.product_id = oi.product_id
    GROUP BY p.product_id, p.name
)
SELECT
    product_id,
    name,
    units_sold,
    ROW_NUMBER() OVER (ORDER BY units_sold DESC) AS row_num,
    RANK() OVER (ORDER BY units_sold DESC) AS rank,
    DENSE_RANK() OVER (ORDER BY units_sold DESC) AS dense_rank
FROM product_sales;
```
Explanation
Compare the three ranking functions:
```
units_sold | ROW_NUMBER | RANK | DENSE_RANK
100        | 1          | 1    | 1
95         | 2          | 2    | 2
95         | 3          | 2    | 2   -- Tie!
90         | 4          | 4    | 3   -- Key difference
```
- ROW_NUMBER: Always unique, arbitrary order for ties
- RANK: Same rank for ties, gaps after
- DENSE_RANK: Same rank for ties, no gaps
When to Use Which
- ROW_NUMBER: Pagination, unique ordering needed
- RANK: Competition ranking (1st, 2nd, 2nd, 4th)
- DENSE_RANK: When you need sequential values (1st, 2nd, 2nd, 3rd)
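To illustrate the pagination use case for ROW_NUMBER, here is a minimal sketch (page size 20 and page 3 are placeholder values; product_id serves as a tiebreaker so the ordering stays stable between requests):

```sql
-- Page 3 with 20 rows per page: rows 41-60
WITH numbered AS (
    SELECT
        product_id,
        name,
        ROW_NUMBER() OVER (ORDER BY name, product_id) AS rn
    FROM products
)
SELECT product_id, name
FROM numbered
WHERE rn BETWEEN 41 AND 60;
```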
Question 66: PARTITION BY - Groups Within Window
Problem
Rank products within each category by price.
Solution
```sql
SELECT
    category_id,
    product_id,
    name,
    price,
    RANK() OVER (
        PARTITION BY category_id
        ORDER BY price DESC
    ) AS price_rank_in_category
FROM products
WHERE price IS NOT NULL
ORDER BY category_id, price_rank_in_category;
```
Explanation
PARTITION BY divides rows into groups, and the window function restarts for each group.
```
category_id | product | price | rank
1           | Laptop  | 1500  | 1
1           | Tablet  | 800   | 2
1           | Mouse   | 50    | 3
2           | Shirt   | 100   | 1   -- Rank restarts!
2           | Hat     | 30    | 2
```
Without PARTITION BY, ranking would be across ALL products.
Common Mistakes
Mistake 1: Confusing PARTITION BY with GROUP BY
```sql
-- GROUP BY: Collapses rows, one output per group
SELECT category_id, MAX(price)
FROM products
GROUP BY category_id;

-- PARTITION BY: Keeps all rows, computes value for each
SELECT
    category_id,
    product_id,
    price,
    MAX(price) OVER (PARTITION BY category_id) AS max_in_category
FROM products;
```
Question 67: Running Totals with SUM OVER
Problem
Calculate running total of daily sales.
Solution
```sql
SELECT
    sale_date,
    SUM(amount) AS daily_total,
    SUM(SUM(amount)) OVER (ORDER BY sale_date) AS running_total
FROM employee_sales
GROUP BY sale_date
ORDER BY sale_date;
```
Explanation
SUM() OVER (ORDER BY ...) creates a running total:

```
sale_date  | daily_total | running_total
2024-01-01 | 1000        | 1000
2024-01-02 | 1500        | 2500
2024-01-03 | 800         | 3300
2024-01-04 | 2000        | 5300
```
The ORDER BY in OVER defines the running order. Each row's total is the sum of all rows up to and including the current one.
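The same pattern extends to per-group running totals — a sketch that adds PARTITION BY so the total restarts for each employee:

```sql
-- Running total per employee: PARTITION BY resets the sum
-- at each new employee_id
SELECT
    employee_id,
    sale_date,
    SUM(amount) AS daily_total,
    SUM(SUM(amount)) OVER (
        PARTITION BY employee_id
        ORDER BY sale_date
    ) AS employee_running_total
FROM employee_sales
GROUP BY employee_id, sale_date
ORDER BY employee_id, sale_date;
```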
Common Mistakes
Mistake 1: Not understanding default frame
```sql
-- With ORDER BY, default frame is
-- RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
-- This means: sum from start to current row

-- Without ORDER BY, default is all rows in partition
SELECT SUM(amount) OVER () FROM sales;
-- Same value for every row (grand total)
```
Performance Considerations
```sql
-- Running totals are efficient - one pass through data
-- But can still be slow on huge datasets

-- For very large tables, consider pre-computing:
CREATE TABLE daily_running_totals AS
SELECT
    sale_date,
    SUM(amount) AS daily_total,
    SUM(SUM(amount)) OVER (ORDER BY sale_date) AS running_total
FROM employee_sales
GROUP BY sale_date;
```
Question 68: Moving Averages with Window Frames
Problem
Calculate 7-day moving average of daily page views.
Solution
```sql
WITH daily_views AS (
    SELECT
        DATE(viewed_at) AS view_date,
        COUNT(*) AS daily_count
    FROM page_views
    GROUP BY DATE(viewed_at)
)
SELECT
    view_date,
    daily_count,
    ROUND(AVG(daily_count) OVER (
        ORDER BY view_date
        ROWS BETWEEN 6 PRECEDING AND CURRENT ROW
    ), 2) AS moving_avg_7day
FROM daily_views
ORDER BY view_date;
```
Explanation
Window frames define exactly which rows to include in calculation:
```sql
ROWS BETWEEN 6 PRECEDING AND CURRENT ROW
-- Current row + 6 previous = 7 rows total

-- Other frame options:
ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW   -- All previous
ROWS BETWEEN 3 PRECEDING AND 3 FOLLOWING           -- 7-row centered
ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING   -- Current to end
```
Common Mistakes
Mistake 1: ROWS vs RANGE
```sql
-- ROWS: Exact number of rows
ROWS BETWEEN 6 PRECEDING AND CURRENT ROW

-- RANGE: Value-based (can include multiple rows with same value)
RANGE BETWEEN INTERVAL '7 days' PRECEDING AND CURRENT ROW

-- For dates with gaps, RANGE might be more appropriate
```
Mistake 2: Incomplete windows at edges
```sql
-- First 6 rows won't have full 7-day average
-- Handle appropriately:
SELECT
    view_date,
    daily_count,
    CASE
        WHEN COUNT(*) OVER (ORDER BY view_date ROWS 6 PRECEDING) < 7
        THEN NULL  -- Not enough data
        ELSE AVG(daily_count) OVER (ORDER BY view_date ROWS 6 PRECEDING)
    END AS moving_avg_7day
FROM daily_views;
```
Question 69: LAG and LEAD - Accessing Other Rows
Problem
Compare each day's sales to the previous day.
Solution
```sql
WITH daily_sales AS (
    SELECT sale_date, SUM(amount) AS daily_total
    FROM employee_sales
    GROUP BY sale_date
)
SELECT
    sale_date,
    daily_total,
    LAG(daily_total, 1, 0) OVER (ORDER BY sale_date) AS prev_day_total,
    daily_total - LAG(daily_total, 1, 0) OVER (ORDER BY sale_date) AS daily_change,
    ROUND(
        (daily_total - LAG(daily_total, 1) OVER (ORDER BY sale_date)) * 100.0
        / NULLIF(LAG(daily_total, 1) OVER (ORDER BY sale_date), 0)
    , 2) AS percent_change
FROM daily_sales
ORDER BY sale_date;
```
Explanation
- LAG(column, n, default): Access nth row BEFORE current
- LEAD(column, n, default): Access nth row AFTER current
```
sale_date  | daily_total | prev_day | change | % change
2024-01-01 | 1000        | 0        | 1000   | NULL
2024-01-02 | 1500        | 1000     | 500    | 50.00%
2024-01-03 | 1200        | 1500     | -300   | -20.00%
```
Common Mistakes
Mistake 1: Forgetting NULL for first/last rows
```sql
-- LAG returns NULL for first row (no previous)
-- Handle with COALESCE or default value
LAG(value, 1, 0)  -- Third param is default value
```
Mistake 2: Wrong offset
```sql
-- LAG(value, 2) gets row 2 positions back, not 2nd column
SELECT LAG(amount, 2) OVER (ORDER BY date)  -- Amount from 2 days ago
```
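LEAD mirrors LAG in the other direction — a short sketch pairing each day with the following day's total (the last row gets NULL because there is no next day):

```sql
WITH daily_sales AS (
    SELECT sale_date, SUM(amount) AS daily_total
    FROM employee_sales
    GROUP BY sale_date
)
SELECT
    sale_date,
    daily_total,
    LEAD(daily_total, 1) OVER (ORDER BY sale_date) AS next_day_total
FROM daily_sales
ORDER BY sale_date;
```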
Question 70: FIRST_VALUE and LAST_VALUE
Problem
For each order, show the customer's first ever purchase date.
Solution
```sql
SELECT
    o.order_id,
    o.user_id,
    o.order_date,
    o.total_amount,
    FIRST_VALUE(o.order_date) OVER (
        PARTITION BY o.user_id
        ORDER BY o.order_date
    ) AS first_purchase_date,
    o.order_date - FIRST_VALUE(o.order_date) OVER (
        PARTITION BY o.user_id
        ORDER BY o.order_date
    ) AS days_since_first_purchase
FROM orders o
ORDER BY o.user_id, o.order_date;
```
Explanation
- FIRST_VALUE(column): Returns value from first row in window
- LAST_VALUE(column): Returns value from last row in window
Common Mistakes
Mistake 1: LAST_VALUE default frame issue
```sql
-- LAST_VALUE often surprises people:
SELECT
    order_id,
    order_date,
    LAST_VALUE(order_date) OVER (ORDER BY order_date) AS last_date
FROM orders;
-- Returns current row's date, not actual last!

-- Why? Default frame is
-- RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW

-- Fix with explicit frame:
SELECT
    order_id,
    LAST_VALUE(order_date) OVER (
        ORDER BY order_date
        ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
    ) AS actual_last_date
FROM orders;
```
Question 71: NTH_VALUE - Getting Specific Position
Problem
Find the second highest sale for each employee.
Solution
```sql
SELECT DISTINCT
    employee_id,
    NTH_VALUE(amount, 2) OVER (
        PARTITION BY employee_id
        ORDER BY amount DESC
        ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
    ) AS second_highest_sale
FROM employee_sales;
```
Explanation
NTH_VALUE(column, n) returns value at nth position in ordered window.
Common Mistakes
Mistake 1: Same frame issue as LAST_VALUE
```sql
-- Must include full frame to access any position
ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
```
Mistake 2: NULL when fewer than n rows
```sql
-- If employee has only 1 sale, NTH_VALUE(amount, 2) returns NULL
-- Handle appropriately
```
Question 72: NTILE - Creating Buckets
Problem
Divide customers into 4 spending tiers (quartiles).
Solution
```sql
WITH customer_spending AS (
    SELECT
        user_id,
        SUM(total_amount) AS total_spent
    FROM orders
    WHERE status = 'completed'
    GROUP BY user_id
)
SELECT
    user_id,
    total_spent,
    NTILE(4) OVER (ORDER BY total_spent DESC) AS spending_quartile,
    CASE NTILE(4) OVER (ORDER BY total_spent DESC)
        WHEN 1 THEN 'Platinum'
        WHEN 2 THEN 'Gold'
        WHEN 3 THEN 'Silver'
        WHEN 4 THEN 'Bronze'
    END AS tier
FROM customer_spending
ORDER BY total_spent DESC;
```
Explanation
NTILE(n) divides rows into n roughly equal groups:
- NTILE(4): Quartiles (25% each)
- NTILE(10): Deciles (10% each)
- NTILE(100): Percentiles
```
total_spent | ntile | tier
50000       | 1     | Platinum
45000       | 1     | Platinum
35000       | 2     | Gold
30000       | 2     | Gold
...
```
Common Mistakes
Mistake 1: Uneven distribution
```sql
-- 10 rows into 4 buckets = 3, 3, 2, 2 distribution
-- Not exactly 25% each!

-- For exact percentiles, use PERCENT_RANK
SELECT
    value,
    NTILE(4) OVER (ORDER BY value) AS quartile,
    PERCENT_RANK() OVER (ORDER BY value) AS percentile
FROM data;
```
Question 73: PERCENT_RANK and CUME_DIST
Problem
Calculate percentile rank for each product's sales.
Solution
```sql
WITH product_sales AS (
    SELECT
        p.product_id,
        p.name,
        SUM(oi.quantity * oi.unit_price) AS total_revenue
    FROM products p
    JOIN order_items oi ON p.product_id = oi.product_id
    GROUP BY p.product_id, p.name
)
SELECT
    product_id,
    name,
    total_revenue,
    -- Cast to NUMERIC: in PostgreSQL these functions return DOUBLE PRECISION,
    -- and ROUND(double, int) is not defined
    ROUND((PERCENT_RANK() OVER (ORDER BY total_revenue) * 100)::NUMERIC, 2) AS percentile,
    ROUND((CUME_DIST() OVER (ORDER BY total_revenue) * 100)::NUMERIC, 2) AS cumulative_pct
FROM product_sales
ORDER BY total_revenue DESC;
```
Explanation
- PERCENT_RANK(): (rank - 1) / (total_rows - 1)
  - First row = 0, last row = 1
  - "What percentage of rows are below this?"
- CUME_DIST(): rank / total_rows
  - "What percentage of rows are at or below this?"
```
revenue | PERCENT_RANK | CUME_DIST
100     | 0.00         | 0.25   -- 25% at or below
200     | 0.33         | 0.50   -- 50% at or below
300     | 0.67         | 0.75
400     | 1.00         | 1.00
```
When to Use
- Identifying top/bottom performers
- Statistical analysis
- Relative positioning
Question 74: Aggregate Window Functions
Problem
Show each order with the customer's total spending and order count.
Solution
```sql
SELECT
    o.order_id,
    o.user_id,
    o.order_date,
    o.total_amount,
    SUM(o.total_amount) OVER (PARTITION BY o.user_id) AS customer_lifetime_value,
    COUNT(*) OVER (PARTITION BY o.user_id) AS customer_order_count,
    ROUND(
        o.total_amount * 100.0
        / SUM(o.total_amount) OVER (PARTITION BY o.user_id)
    , 2) AS pct_of_customer_total
FROM orders o
WHERE o.status = 'completed'
ORDER BY o.user_id, o.order_date;
```
Explanation
Regular aggregates (SUM, COUNT, AVG, etc.) can be used as window functions:
```
order_id | user_id | amount | customer_total | order_count | pct
1        | 1       | 100    | 450            | 3           | 22.22%
5        | 1       | 150    | 450            | 3           | 33.33%
8        | 1       | 200    | 450            | 3           | 44.44%
2        | 2       | 300    | 500            | 2           | 60.00%
4        | 2       | 200    | 500            | 2           | 40.00%
```
Key difference from GROUP BY:
- GROUP BY: One row per group
- Window: All rows retained, aggregate value repeated
Question 75: Complex Window Expressions
Problem
For each sale, calculate:
- Running total for the employee that month
- Rank within their team
- Difference from team's best performer
Solution
```sql
WITH monthly_sales AS (
    SELECT
        employee_id,
        region,
        DATE_TRUNC('month', sale_date) AS month,
        SUM(amount) AS monthly_total
    FROM employee_sales
    GROUP BY employee_id, region, DATE_TRUNC('month', sale_date)
)
SELECT
    employee_id,
    region,
    month,
    monthly_total,
    -- Running total for employee
    SUM(monthly_total) OVER (
        PARTITION BY employee_id
        ORDER BY month
    ) AS employee_ytd,
    -- Rank within region for that month
    RANK() OVER (
        PARTITION BY region, month
        ORDER BY monthly_total DESC
    ) AS rank_in_region,
    -- Best in region that month
    MAX(monthly_total) OVER (
        PARTITION BY region, month
    ) AS region_best,
    -- Gap from best
    MAX(monthly_total) OVER (
        PARTITION BY region, month
    ) - monthly_total AS gap_from_best,
    -- Percentage of region total
    ROUND(
        monthly_total * 100.0
        / SUM(monthly_total) OVER (PARTITION BY region, month)
    , 2) AS pct_of_region
FROM monthly_sales
ORDER BY month, region, rank_in_region;
```
Explanation
This combines multiple window calculations in one query:
- Different PARTITION BY for different metrics
- Running totals with ORDER BY
- Comparisons to group maximums
- Percentage calculations within groups
Performance Considerations
```sql
-- Multiple window functions can share sorts
-- Database optimizes when PARTITION BY and ORDER BY match
EXPLAIN ANALYZE
SELECT
    id,
    ROW_NUMBER() OVER (PARTITION BY region ORDER BY amount DESC),
    RANK() OVER (PARTITION BY region ORDER BY amount DESC),
    SUM(amount) OVER (PARTITION BY region)
FROM sales;
-- These share same partition/sort, efficient!

-- Different partitions require separate sorts
SELECT
    id,
    ROW_NUMBER() OVER (PARTITION BY region ORDER BY amount),
    ROW_NUMBER() OVER (PARTITION BY employee ORDER BY date)
FROM sales;
-- Two different sorts needed, less efficient
```
Summary: Aggregation Cheat Sheet
Basic Aggregates
| Function | Purpose | NULL Handling |
|---|---|---|
| COUNT(*) | Count all rows | Counts NULLs |
| COUNT(col) | Count non-NULL | Ignores NULLs |
| SUM(col) | Total | Ignores NULLs |
| AVG(col) | Average | Ignores NULLs |
| MIN/MAX | Extremes | Ignores NULLs |
GROUP BY Variants
| Syntax | Result |
|---|---|
| GROUP BY a | Groups by a |
| GROUP BY a, b | Groups by combination |
| GROUP BY ROLLUP(a, b) | Groups + subtotals |
| GROUP BY CUBE(a, b) | All combinations |
| GROUP BY GROUPING SETS(...) | Custom groups |
Window Function Anatomy
```sql
function() OVER (
    PARTITION BY columns  -- Reset for each group
    ORDER BY columns      -- Order within partition
    frame_clause          -- Which rows to include
)
```
Common Window Functions
| Function | Purpose |
|---|---|
| ROW_NUMBER() | Unique sequential numbers |
| RANK() | Ranking with gaps |
| DENSE_RANK() | Ranking without gaps |
| NTILE(n) | Divide into n buckets |
| LAG(col, n) | Previous nth row |
| LEAD(col, n) | Next nth row |
| FIRST_VALUE() | First in window |
| LAST_VALUE() | Last in window |
| PERCENT_RANK() | Percentile (0-1) |
Frame Clauses
```sql
ROWS BETWEEN start AND end

-- start/end options:
-- UNBOUNDED PRECEDING
-- n PRECEDING
-- CURRENT ROW
-- n FOLLOWING
-- UNBOUNDED FOLLOWING
```
What's Next?
In Part 4: Subqueries and CTEs, we'll cover:
- Scalar subqueries
- Correlated subqueries
- EXISTS vs IN performance
- Common Table Expressions (CTEs)
- Recursive CTEs for hierarchies
- Subquery optimization
Keep practicing these aggregation patterns - they're essential for data analysis and reporting!