SQL Mastery Guide - Part 3: Aggregations and GROUP BY

Introduction

Welcome to Part 3 of our SQL Mastery series. In Parts 1 and 2, we covered SELECT fundamentals and all types of JOINs. Now we tackle one of SQL's most powerful features: aggregations.
Aggregations transform rows into meaningful summaries. Instead of seeing 10,000 orders, you see "total revenue by month." Instead of listing every employee, you see "average salary by department."
This part covers Questions 51-75, taking you from basic COUNT to advanced window functions.

Schema Reference

We continue with our e-commerce database, adding some new tables:
```sql
-- Core Tables (from Part 1)
CREATE TABLE users (
    user_id SERIAL PRIMARY KEY,
    email VARCHAR(255) UNIQUE NOT NULL,
    full_name VARCHAR(100),
    created_at TIMESTAMP DEFAULT NOW(),
    country VARCHAR(50),
    is_active BOOLEAN DEFAULT true
);

CREATE TABLE categories (
    category_id SERIAL PRIMARY KEY,
    name VARCHAR(100) NOT NULL,
    parent_category_id INT REFERENCES categories(category_id)
);

CREATE TABLE products (
    product_id SERIAL PRIMARY KEY,
    name VARCHAR(255) NOT NULL,
    price DECIMAL(10,2),
    category_id INT REFERENCES categories(category_id),
    stock_quantity INT DEFAULT 0,
    created_at TIMESTAMP DEFAULT NOW()
);

CREATE TABLE orders (
    order_id SERIAL PRIMARY KEY,
    user_id INT REFERENCES users(user_id),
    order_date TIMESTAMP DEFAULT NOW(),
    status VARCHAR(20) DEFAULT 'pending',
    total_amount DECIMAL(10,2),
    shipping_address TEXT
);

CREATE TABLE order_items (
    item_id SERIAL PRIMARY KEY,
    order_id INT REFERENCES orders(order_id),
    product_id INT REFERENCES products(product_id),
    quantity INT NOT NULL,
    unit_price DECIMAL(10,2) NOT NULL
);

-- New Tables for Aggregation Practice
CREATE TABLE page_views (
    view_id SERIAL PRIMARY KEY,
    user_id INT REFERENCES users(user_id),
    page_url VARCHAR(500),
    viewed_at TIMESTAMP DEFAULT NOW(),
    duration_seconds INT
);

CREATE TABLE reviews (
    review_id SERIAL PRIMARY KEY,
    product_id INT REFERENCES products(product_id),
    user_id INT REFERENCES users(user_id),
    rating INT CHECK (rating BETWEEN 1 AND 5),
    review_text TEXT,
    created_at TIMESTAMP DEFAULT NOW()
);

CREATE TABLE employee_sales (
    sale_id SERIAL PRIMARY KEY,
    employee_id INT,
    sale_date DATE,
    amount DECIMAL(10,2),
    region VARCHAR(50)
);
```

Question 51: Basic COUNT - Counting All Rows

Problem

Count the total number of users in our system.

Solution

```sql
SELECT COUNT(*) AS total_users FROM users;
```

Explanation

COUNT(*) counts every row in the table, regardless of NULL values. It's the simplest aggregation.
How it works internally:
  1. Database scans the table (or uses an index)
  2. Increments counter for each row
  3. Returns final count

Common Mistakes

Mistake 1: Using COUNT(column) when you want all rows
```sql
-- This might give a wrong count if email has NULLs
SELECT COUNT(email) FROM users;

-- This always counts all rows
SELECT COUNT(*) FROM users;
```
Mistake 2: Thinking COUNT(*) is slow
```sql
-- Many people write this thinking it's faster
SELECT COUNT(1) FROM users;

-- Modern databases optimize COUNT(*) the same way
SELECT COUNT(*) FROM users;
```

Performance Considerations

  • PostgreSQL: COUNT(*) scans entire table by default (MVCC reasons)
  • MySQL InnoDB: Also needs table scan
  • MySQL MyISAM: Keeps row count cached (instant result)
For large tables, consider:
```sql
-- Approximate count (much faster)
SELECT reltuples::BIGINT AS approximate_count
FROM pg_class
WHERE relname = 'users';
```

When to Use

  • Getting total record counts
  • Dashboard statistics
  • Validation checks

When NOT to Use

  • When you need exact count on billion-row tables (consider approximations)
  • When you only need to know if rows exist (use EXISTS instead)
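For the existence case, a minimal sketch against our orders table (the user_id value is illustrative):

```sql
-- EXISTS stops at the first matching row instead of scanning everything
SELECT EXISTS (
    SELECT 1 FROM orders WHERE user_id = 42
) AS has_orders;
```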

Question 52: COUNT with DISTINCT

Problem

Count how many unique countries our users are from.

Solution

```sql
SELECT COUNT(DISTINCT country) AS unique_countries FROM users;
```

Explanation

COUNT(DISTINCT column) counts only unique non-NULL values.
Step-by-step:
  1. Scan country column
  2. Build hash set of unique values
  3. Ignore NULLs
  4. Return set size

Common Mistakes

Mistake 1: Forgetting NULLs aren't counted
```sql
-- If 100 users have a country and 10 have NULL,
-- this returns the count of non-NULL unique countries
SELECT COUNT(DISTINCT country) FROM users;  -- Maybe 25

-- To include NULL as a country
SELECT COUNT(DISTINCT country)
     + CASE WHEN EXISTS (SELECT 1 FROM users WHERE country IS NULL)
            THEN 1 ELSE 0 END
FROM users;
```
Mistake 2: Using DISTINCT on primary keys
```sql
-- Pointless - every user_id is already unique
SELECT COUNT(DISTINCT user_id) FROM users;

-- Same result, more efficient
SELECT COUNT(*) FROM users;
```

Performance Considerations

```sql
-- EXPLAIN ANALYZE this
SELECT COUNT(DISTINCT country) FROM users;

-- If slow, consider:
-- 1. Index on the country column
CREATE INDEX idx_users_country ON users(country);

-- 2. Materialized view for dashboards
CREATE MATERIALIZED VIEW country_stats AS
SELECT COUNT(DISTINCT country) AS unique_countries FROM users;
```

When to Use

  • Counting unique visitors, unique products sold, unique categories
  • De-duplication counts

Question 53: SUM - Calculating Totals

Problem

Calculate the total revenue from all orders.

Solution

```sql
SELECT SUM(total_amount) AS total_revenue
FROM orders
WHERE status = 'completed';
```

Explanation

SUM(column) adds up all non-NULL values in the column.
Important behaviors:
  • NULL values are ignored
  • Returns NULL if all values are NULL
  • Returns NULL on empty result set
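The last two behaviors bite in reports: a query over a period with no completed orders shows NULL instead of 0. A common guard (a sketch, not the only option) wraps the whole aggregate in COALESCE:

```sql
-- SUM over zero rows yields NULL; COALESCE turns it into 0
SELECT COALESCE(SUM(total_amount), 0) AS total_revenue
FROM orders
WHERE status = 'completed'
  AND order_date >= '2030-01-01';  -- illustrative empty range
```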

Common Mistakes

Mistake 1: Not handling NULLs
```sql
-- If some orders have NULL total_amount
SELECT SUM(total_amount) FROM orders;  -- NULL amounts are silently skipped

-- Better: treat NULL as 0 explicitly
SELECT SUM(COALESCE(total_amount, 0)) FROM orders;
```
Mistake 2: Integer overflow
```sql
-- In some databases (e.g. SQL Server), SUM over an INTEGER column
-- keeps the INTEGER type and fails once the total exceeds ~2.1 billion
SELECT SUM(amount) FROM huge_sales_table;  -- Overflow error!

-- Cast to BIGINT (PostgreSQL already returns BIGINT for SUM of INT)
SELECT SUM(amount::BIGINT) FROM huge_sales_table;
```
Mistake 3: Precision loss with floats
```sql
-- FLOAT can lose precision
SELECT SUM(price) FROM products;  -- 999.9999999997

-- Use DECIMAL for money
ALTER TABLE products ALTER COLUMN price TYPE DECIMAL(10,2);
```

Performance Considerations

```sql
-- Large table SUM can be slow
EXPLAIN ANALYZE SELECT SUM(total_amount) FROM orders;

-- A covering index helps
CREATE INDEX idx_orders_sum ON orders(status) INCLUDE (total_amount);

-- For real-time dashboards, use materialized views
CREATE MATERIALIZED VIEW revenue_summary AS
SELECT DATE_TRUNC('day', order_date) AS day,
       SUM(total_amount) AS daily_revenue
FROM orders
WHERE status = 'completed'
GROUP BY DATE_TRUNC('day', order_date);
```

Question 54: AVG - Calculating Averages

Problem

Find the average order value for completed orders.

Solution

```sql
SELECT AVG(total_amount) AS average_order_value
FROM orders
WHERE status = 'completed';
```

Explanation

AVG(column) = SUM(column) / COUNT(column)
NULL values are excluded from both numerator and denominator.

Common Mistakes

Mistake 1: Confusing AVG behavior with NULLs
```sql
-- Table: scores(student_id, score)
-- Data: (1, 100), (2, 80), (3, NULL)
SELECT AVG(score) FROM scores;  -- Returns 90, not 60! (180/2, not 180/3)

-- If you want NULL treated as 0:
SELECT AVG(COALESCE(score, 0)) FROM scores;  -- Returns 60 (180/3)
```
Mistake 2: Not rounding for display
```sql
-- The raw average has many decimal places
SELECT AVG(total_amount) FROM orders;  -- 125.7834921...

-- Round appropriately for display
SELECT ROUND(AVG(total_amount), 2) AS avg_order FROM orders;  -- 125.78
```

Performance Considerations

Same as SUM - uses same scan, benefits from indexes.

When to Use

  • Calculating average order value, average session duration
  • Computing mean metrics

When NOT to Use

When you need median (for skewed data):
```sql
-- Average can be misleading with outliers
-- Order values: 50, 50, 50, 50, 10000
-- AVG = 2040 (not representative!)

-- Use PERCENTILE_CONT for the median
SELECT PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY total_amount) AS median_order
FROM orders;
-- Median = 50 (more representative)
```

Question 55: MIN and MAX

Problem

Find the cheapest and most expensive products.

Solution

```sql
SELECT MIN(price) AS cheapest_price,
       MAX(price) AS most_expensive_price
FROM products;
```

Explanation

  • MIN(column): Returns smallest value
  • MAX(column): Returns largest value
  • Both ignore NULLs

Common Mistakes

Mistake 1: Using MIN/MAX on wrong data types
```sql
-- String comparison is alphabetical, not numeric!
SELECT MAX(order_id) FROM orders;  -- If order_id is VARCHAR
-- '9' > '10' alphabetically!
-- Ensure numeric types for numeric comparisons
```
Mistake 2: Trying to get the row with MIN/MAX
```sql
-- This doesn't work as expected
SELECT product_id, MIN(price) FROM products;  -- Error or wrong!

-- Correct approaches:

-- Option 1: Subquery
SELECT * FROM products
WHERE price = (SELECT MIN(price) FROM products);

-- Option 2: ORDER BY with LIMIT
SELECT * FROM products
ORDER BY price ASC
LIMIT 1;

-- Option 3: Window function
SELECT *
FROM (
    SELECT *, ROW_NUMBER() OVER (ORDER BY price ASC) AS rn
    FROM products
) sub
WHERE rn = 1;
```

Performance Considerations

```sql
-- MIN/MAX can use a B-tree index efficiently
CREATE INDEX idx_products_price ON products(price);

-- With an index, MIN/MAX is a single descent of the tree -
-- just read the first/last leaf entry
EXPLAIN SELECT MIN(price) FROM products;
-- Shows: Index Only Scan
```

When to Use

  • Finding bounds (earliest date, latest login)
  • Range queries
  • Data validation

Question 56: GROUP BY Basics

Problem

Count orders per user.

Solution

```sql
SELECT user_id, COUNT(*) AS order_count
FROM orders
GROUP BY user_id
ORDER BY order_count DESC;
```

Explanation

GROUP BY creates buckets of rows with same values, then applies aggregations to each bucket.
How it works:
  1. Scan table
  2. Hash rows by GROUP BY columns
  3. Apply aggregation to each group
  4. Return one row per group
Before GROUP BY:

  user_id | order_id | total
  1       | 101      | 50
  1       | 102      | 75
  2       | 103      | 30
  1       | 104      | 25

After GROUP BY user_id:

  user_id | COUNT(*) | SUM(total)
  1       | 3        | 150
  2       | 1        | 30

Common Mistakes

Mistake 1: Selecting non-grouped columns
```sql
-- ERROR in standard SQL!
SELECT user_id, order_date, COUNT(*)  -- order_date not in GROUP BY
FROM orders
GROUP BY user_id;
-- Which order_date should it show? Undefined!

-- Fixed:
SELECT user_id, MAX(order_date) AS latest_order, COUNT(*)
FROM orders
GROUP BY user_id;
```
Mistake 2: Misunderstanding GROUP BY with NULLs
```sql
-- NULLs are grouped together
SELECT country, COUNT(*) FROM users GROUP BY country;
-- Returns a row for NULL country with the count of NULL users
```

Performance Considerations

```sql
-- GROUP BY can be expensive on large tables
EXPLAIN ANALYZE
SELECT user_id, COUNT(*) FROM orders GROUP BY user_id;

-- Hash Aggregate vs Sort Aggregate
-- The database chooses based on:
--   - work_mem setting
--   - Number of groups
--   - Available indexes

-- An index can help if sorting is chosen
CREATE INDEX idx_orders_user ON orders(user_id);
```

When to Use

  • Any time you need per-group statistics
  • Reporting and analytics
  • Data summarization

Question 57: GROUP BY Multiple Columns

Problem

Calculate total sales per employee per region.

Solution

```sql
SELECT employee_id,
       region,
       SUM(amount) AS total_sales,
       COUNT(*) AS sale_count
FROM employee_sales
GROUP BY employee_id, region
ORDER BY employee_id, region;
```

Explanation

With multiple columns, GROUP BY creates groups for each unique combination.
  employee_id | region | amount
  1           | North  | 100
  1           | North  | 150
  1           | South  | 200
  2           | North  | 300

Groups:

  (1, North) -> SUM: 250, COUNT: 2
  (1, South) -> SUM: 200, COUNT: 1
  (2, North) -> SUM: 300, COUNT: 1

Common Mistakes

Mistake 1: Wrong granularity
```sql
-- Too many columns = very small groups
SELECT employee_id, region, sale_date, SUM(amount)
FROM employee_sales
GROUP BY employee_id, region, sale_date;
-- Might give one row per sale if dates are unique!

-- Consider date truncation
SELECT employee_id, region, DATE_TRUNC('month', sale_date), SUM(amount)
FROM employee_sales
GROUP BY employee_id, region, DATE_TRUNC('month', sale_date);
```

Performance Considerations

```sql
-- Multi-column GROUP BY benefits from a composite index
CREATE INDEX idx_sales_emp_region ON employee_sales(employee_id, region);

-- Order matters! An index on (employee_id, region) helps:
-- GROUP BY employee_id, region  -- Yes
-- GROUP BY employee_id          -- Yes
-- GROUP BY region               -- No (can't skip columns)
```

Question 58: HAVING - Filtering Groups

Problem

Find users who have placed more than 5 orders.

Solution

```sql
SELECT user_id, COUNT(*) AS order_count
FROM orders
GROUP BY user_id
HAVING COUNT(*) > 5
ORDER BY order_count DESC;
```

Explanation

HAVING filters groups AFTER aggregation. WHERE filters rows BEFORE aggregation.
Execution order:
  1. FROM orders
  2. WHERE (filter rows)
  3. GROUP BY user_id
  4. HAVING COUNT(*) > 5 (filter groups)
  5. SELECT
  6. ORDER BY

Common Mistakes

Mistake 1: Using WHERE for aggregate conditions
```sql
-- ERROR!
SELECT user_id, COUNT(*)
FROM orders
WHERE COUNT(*) > 5  -- Can't use an aggregate in WHERE
GROUP BY user_id;

-- Correct: use HAVING
SELECT user_id, COUNT(*)
FROM orders
GROUP BY user_id
HAVING COUNT(*) > 5;
```
Mistake 2: Using HAVING when WHERE would work
```sql
-- Inefficient: filters after grouping
SELECT user_id, COUNT(*)
FROM orders
GROUP BY user_id
HAVING user_id > 100;

-- Better: filter before grouping (fewer rows to group)
SELECT user_id, COUNT(*)
FROM orders
WHERE user_id > 100
GROUP BY user_id;
```

Performance Considerations

```sql
-- HAVING is applied after GROUP BY,
-- so all rows are still processed.
-- If possible, use WHERE to reduce rows BEFORE grouping.

-- WHERE filters:  O(n) scan, then group the smaller set
-- HAVING filters: O(n) scan, group everything, then filter
EXPLAIN ANALYZE
SELECT user_id, COUNT(*)
FROM orders
WHERE order_date > '2024-01-01'  -- Reduces rows first
GROUP BY user_id
HAVING COUNT(*) > 5;             -- Filters groups after
```

Question 59: Combining WHERE and HAVING

Problem

Find products with average rating above 4, considering only reviews from 2024.

Solution

```sql
SELECT p.product_id,
       p.name,
       AVG(r.rating) AS avg_rating,
       COUNT(r.review_id) AS review_count
FROM products p
JOIN reviews r ON p.product_id = r.product_id
WHERE r.created_at >= '2024-01-01'  -- Filter rows
GROUP BY p.product_id, p.name
HAVING AVG(r.rating) > 4            -- Filter groups
   AND COUNT(r.review_id) >= 5      -- Minimum reviews
ORDER BY avg_rating DESC;
```

Explanation

This query demonstrates the full flow:
  1. JOIN products and reviews
  2. WHERE filters to 2024 reviews only
  3. GROUP BY product
  4. HAVING filters to high-rated products with enough reviews
  5. ORDER BY rating

Common Mistakes

Mistake 1: Putting HAVING conditions in WHERE
```sql
-- Wrong order causes errors
SELECT product_id, AVG(rating)
FROM reviews
WHERE AVG(rating) > 4  -- ERROR!
GROUP BY product_id;
```
Mistake 2: Not understanding NULL in HAVING
```sql
-- Products with no reviews have a NULL avg
-- HAVING AVG(rating) > 4 excludes them automatically
-- (NULL > 4 is UNKNOWN, not TRUE)
```

Question 60: Grouping by Expressions

Problem

Calculate total revenue by month.

Solution

```sql
SELECT DATE_TRUNC('month', order_date) AS month,
       SUM(total_amount) AS monthly_revenue,
       COUNT(*) AS order_count
FROM orders
WHERE status = 'completed'
GROUP BY DATE_TRUNC('month', order_date)
ORDER BY month;
```

Explanation

You can GROUP BY any expression, not just columns. The database evaluates the expression for each row, then groups by result.

Common Mistakes

Mistake 1: Expression mismatch
```sql
-- SELECT and GROUP BY expressions must match exactly
SELECT DATE_TRUNC('month', order_date) AS month, COUNT(*)
FROM orders
GROUP BY month;  -- Some databases don't allow alias references here

-- Safer: repeat the expression
SELECT DATE_TRUNC('month', order_date) AS month, COUNT(*)
FROM orders
GROUP BY DATE_TRUNC('month', order_date);

-- Or use the position
SELECT DATE_TRUNC('month', order_date) AS month, COUNT(*)
FROM orders
GROUP BY 1;  -- Group by the first SELECT expression
```

Performance Considerations

```sql
-- An expression in GROUP BY can't use a regular column index;
-- it's computed during the scan.

-- Create an expression index for better performance
CREATE INDEX idx_orders_month ON orders (DATE_TRUNC('month', order_date));

-- Now the GROUP BY can use the index
EXPLAIN ANALYZE
SELECT DATE_TRUNC('month', order_date), COUNT(*)
FROM orders
GROUP BY DATE_TRUNC('month', order_date);
```

Question 61: ROLLUP - Subtotals and Grand Totals

Problem

Show sales by region with subtotals and grand total.

Solution

```sql
SELECT COALESCE(region, 'TOTAL') AS region,
       SUM(amount) AS total_sales,
       COUNT(*) AS sale_count
FROM employee_sales
GROUP BY ROLLUP(region)
ORDER BY GROUPING(region), region;  -- grand-total row sorts last
```

Explanation

ROLLUP creates grouping sets that include subtotals:
  • GROUP BY ROLLUP(region) produces:
    • Groups for each region
    • Grand total (all regions combined)
Result:

  region | total_sales | sale_count
  North  | 50000       | 120
  South  | 45000       | 100
  West   | 60000       | 150
  TOTAL  | 155000      | 370
For multiple columns:
```sql
GROUP BY ROLLUP(region, employee_id)
-- Produces:
-- (region, employee_id) - per employee per region
-- (region)              - subtotal per region
-- ()                    - grand total
```

Common Mistakes

Mistake 1: Not distinguishing NULL data from subtotal rows
```sql
-- If region can be NULL in the data, how to tell a NULL region from a subtotal?
-- Use the GROUPING() function
SELECT region,
       SUM(amount),
       GROUPING(region) AS is_subtotal  -- 1 if subtotal row, 0 if actual data
FROM employee_sales
GROUP BY ROLLUP(region);
```

Performance Considerations

ROLLUP adds extra grouping operations but is more efficient than UNION ALL of separate queries.

Question 62: CUBE - All Combinations

Problem

Analyze sales by both region and quarter with all subtotals.

Solution

```sql
SELECT COALESCE(region, 'All Regions') AS region,
       COALESCE(DATE_TRUNC('quarter', sale_date)::TEXT, 'All Quarters') AS quarter,
       SUM(amount) AS total_sales
FROM employee_sales
GROUP BY CUBE(region, DATE_TRUNC('quarter', sale_date))
ORDER BY region, quarter;
```

Explanation

CUBE creates all possible combinations:
  • GROUP BY CUBE(region, quarter) produces:
    • (region, quarter) - per region per quarter
    • (region) - per region, all quarters
    • (quarter) - per quarter, all regions
    • () - grand total
More combinations than ROLLUP but gives complete dimensional analysis.

When to Use

  • OLAP-style reporting
  • Creating pivot-table-like summaries
  • When you need all possible subtotals

Question 63: GROUPING SETS - Custom Groups

Problem

Get sales totals by region, by employee, and overall (but not region+employee combination).

Solution

```sql
SELECT region,
       employee_id,
       SUM(amount) AS total_sales
FROM employee_sales
GROUP BY GROUPING SETS (
    (region),
    (employee_id),
    ()
)
ORDER BY region NULLS LAST, employee_id NULLS LAST;
```

Explanation

GROUPING SETS lets you specify exactly which groupings you want:
  • More flexible than ROLLUP/CUBE
  • Avoids computing unwanted combinations
Equivalent to:

```sql
SELECT region, NULL, SUM(amount) FROM employee_sales GROUP BY region
UNION ALL
SELECT NULL, employee_id, SUM(amount) FROM employee_sales GROUP BY employee_id
UNION ALL
SELECT NULL, NULL, SUM(amount) FROM employee_sales;
```
But more efficient (single scan).
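In fact, ROLLUP and CUBE are themselves shorthand for particular grouping sets. As a sketch, the ROLLUP query from Question 61 could be spelled out explicitly:

```sql
-- GROUP BY ROLLUP(region) is equivalent to these two grouping sets
SELECT region, SUM(amount) AS total_sales
FROM employee_sales
GROUP BY GROUPING SETS ((region), ());
```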

Question 64: Window Functions Introduction - ROW_NUMBER

Problem

Rank users by their total order value.

Solution

```sql
SELECT u.user_id,
       u.full_name,
       SUM(o.total_amount) AS total_spent,
       ROW_NUMBER() OVER (ORDER BY SUM(o.total_amount) DESC) AS rank
FROM users u
JOIN orders o ON u.user_id = o.user_id
WHERE o.status = 'completed'
GROUP BY u.user_id, u.full_name
ORDER BY rank;
```

Explanation

Window functions perform calculations across related rows WITHOUT collapsing them into one row like GROUP BY.
ROW_NUMBER() assigns sequential numbers:
  user_id | total_spent | rank
  5       | 10000       | 1
  3       | 8500        | 2
  8       | 8500        | 3   -- Different rank despite same amount
  1       | 7200        | 4
Syntax:

```sql
function() OVER (ORDER BY ...)
```

Common Mistakes

Mistake 1: Confusing with GROUP BY
```sql
-- This is NOT a window function (collapses rows)
SELECT user_id, SUM(total_amount)
FROM orders
GROUP BY user_id;

-- This IS a window function (keeps all rows)
SELECT order_id,
       user_id,
       total_amount,
       SUM(total_amount) OVER (PARTITION BY user_id) AS user_total
FROM orders;
```
Mistake 2: Using ROW_NUMBER for ties
```sql
-- ROW_NUMBER gives different ranks to identical values
-- Use RANK() or DENSE_RANK() if ties should share a rank
```

Question 65: RANK and DENSE_RANK

Problem

Rank products by sales, handling ties appropriately.

Solution

```sql
WITH product_sales AS (
    SELECT p.product_id, p.name, SUM(oi.quantity) AS units_sold
    FROM products p
    JOIN order_items oi ON p.product_id = oi.product_id
    GROUP BY p.product_id, p.name
)
SELECT product_id,
       name,
       units_sold,
       ROW_NUMBER() OVER (ORDER BY units_sold DESC) AS row_num,
       RANK() OVER (ORDER BY units_sold DESC) AS rank,
       DENSE_RANK() OVER (ORDER BY units_sold DESC) AS dense_rank
FROM product_sales;
```

Explanation

Compare the three ranking functions:
  units_sold | ROW_NUMBER | RANK | DENSE_RANK
  100        | 1          | 1    | 1
  95         | 2          | 2    | 2
  95         | 3          | 2    | 2   -- Tie!
  90         | 4          | 4    | 3   -- Key difference
  • ROW_NUMBER: Always unique, arbitrary order for ties
  • RANK: Same rank for ties, gaps after
  • DENSE_RANK: Same rank for ties, no gaps

When to Use Which

  • ROW_NUMBER: Pagination, unique ordering needed
  • RANK: Competition ranking (1st, 2nd, 2nd, 4th)
  • DENSE_RANK: When you need sequential values (1st, 2nd, 2nd, 3rd)

Question 66: PARTITION BY - Groups Within Window

Problem

Rank products within each category by price.

Solution

```sql
SELECT category_id,
       product_id,
       name,
       price,
       RANK() OVER (
           PARTITION BY category_id
           ORDER BY price DESC
       ) AS price_rank_in_category
FROM products
WHERE price IS NOT NULL
ORDER BY category_id, price_rank_in_category;
```

Explanation

PARTITION BY divides rows into groups, and window function restarts for each group.
  category_id | product | price | rank
  1           | Laptop  | 1500  | 1
  1           | Tablet  | 800   | 2
  1           | Mouse   | 50    | 3
  2           | Shirt   | 100   | 1   -- Rank restarts!
  2           | Hat     | 30    | 2
Without PARTITION BY, ranking would be across ALL products.

Common Mistakes

Mistake 1: Confusing PARTITION BY with GROUP BY
```sql
-- GROUP BY: collapses rows, one output row per group
SELECT category_id, MAX(price)
FROM products
GROUP BY category_id;

-- PARTITION BY: keeps all rows, computes the value for each
SELECT category_id,
       product_id,
       price,
       MAX(price) OVER (PARTITION BY category_id) AS max_in_category
FROM products;
```
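One pattern worth knowing here: window functions aren't allowed in WHERE, so filtering on a rank needs a subquery or CTE. A sketch of the classic "top N per group" query against our products table:

```sql
-- Top 3 most expensive products in each category.
-- The rank is computed in the inner query because
-- window functions can't appear in WHERE.
SELECT category_id, product_id, name, price
FROM (
    SELECT *,
           RANK() OVER (
               PARTITION BY category_id
               ORDER BY price DESC
           ) AS price_rank
    FROM products
) ranked
WHERE price_rank <= 3;
```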

Question 67: Running Totals with SUM OVER

Problem

Calculate running total of daily sales.

Solution

```sql
SELECT sale_date,
       SUM(amount) AS daily_total,
       SUM(SUM(amount)) OVER (ORDER BY sale_date) AS running_total
FROM employee_sales
GROUP BY sale_date
ORDER BY sale_date;
```

Explanation

SUM() OVER (ORDER BY ...) creates a running total:
  sale_date  | daily_total | running_total
  2024-01-01 | 1000        | 1000
  2024-01-02 | 1500        | 2500
  2024-01-03 | 800         | 3300
  2024-01-04 | 2000        | 5300
The ORDER BY in OVER defines the running order. Each row includes sum of all previous rows.

Common Mistakes

Mistake 1: Not understanding default frame
```sql
-- With ORDER BY, the default frame is
-- RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
-- This means: sum from the start to the current row

-- Without ORDER BY, the default is all rows in the partition
SELECT SUM(amount) OVER () FROM sales;
-- Same value for every row (grand total)
```

Performance Considerations

```sql
-- Running totals are efficient - one pass through the data.
-- But they can still be slow on huge datasets.

-- For very large tables, consider pre-computing:
CREATE TABLE daily_running_totals AS
SELECT sale_date,
       SUM(amount) AS daily_total,
       SUM(SUM(amount)) OVER (ORDER BY sale_date) AS running_total
FROM employee_sales
GROUP BY sale_date;
```

Question 68: Moving Averages with Window Frames

Problem

Calculate 7-day moving average of daily page views.

Solution

```sql
WITH daily_views AS (
    SELECT DATE(viewed_at) AS view_date, COUNT(*) AS daily_count
    FROM page_views
    GROUP BY DATE(viewed_at)
)
SELECT view_date,
       daily_count,
       ROUND(AVG(daily_count) OVER (
           ORDER BY view_date
           ROWS BETWEEN 6 PRECEDING AND CURRENT ROW
       ), 2) AS moving_avg_7day
FROM daily_views
ORDER BY view_date;
```

Explanation

Window frames define exactly which rows to include in calculation:
```sql
ROWS BETWEEN 6 PRECEDING AND CURRENT ROW
-- Current row + 6 previous = 7 rows total

-- Other frame options:
ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW  -- All previous rows
ROWS BETWEEN 3 PRECEDING AND 3 FOLLOWING          -- 7-row centered window
ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING  -- Current row to end
```

Common Mistakes

Mistake 1: ROWS vs RANGE
```sql
-- ROWS: an exact number of rows
ROWS BETWEEN 6 PRECEDING AND CURRENT ROW

-- RANGE: value-based (can include multiple rows with the same value)
RANGE BETWEEN INTERVAL '7 days' PRECEDING AND CURRENT ROW

-- For dates with gaps, RANGE might be more appropriate
```
Mistake 2: Incomplete windows at edges
```sql
-- The first 6 rows won't have a full 7-day average
-- Handle appropriately:
SELECT view_date,
       daily_count,
       CASE
           WHEN COUNT(*) OVER (ORDER BY view_date ROWS 6 PRECEDING) < 7
               THEN NULL  -- Not enough data
           ELSE AVG(daily_count) OVER (ORDER BY view_date ROWS 6 PRECEDING)
       END AS moving_avg_7day
FROM daily_views;
```

Question 69: LAG and LEAD - Accessing Other Rows

Problem

Compare each day's sales to the previous day.

Solution

```sql
WITH daily_sales AS (
    SELECT sale_date, SUM(amount) AS daily_total
    FROM employee_sales
    GROUP BY sale_date
)
SELECT sale_date,
       daily_total,
       LAG(daily_total, 1, 0) OVER (ORDER BY sale_date) AS prev_day_total,
       daily_total - LAG(daily_total, 1, 0) OVER (ORDER BY sale_date) AS daily_change,
       ROUND(
           (daily_total - LAG(daily_total, 1) OVER (ORDER BY sale_date)) * 100.0
           / NULLIF(LAG(daily_total, 1) OVER (ORDER BY sale_date), 0)
       , 2) AS percent_change
FROM daily_sales
ORDER BY sale_date;
```

Explanation

  • LAG(column, n, default): Access nth row BEFORE current
  • LEAD(column, n, default): Access nth row AFTER current
  sale_date  | daily_total | prev_day | change | % change
  2024-01-01 | 1000        | 0        | 1000   | NULL
  2024-01-02 | 1500        | 1000     | 500    | 50.00
  2024-01-03 | 1200        | 1500     | -300   | -20.00
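LEAD works symmetrically in the forward direction. Reusing the daily_sales CTE from the solution above, a sketch that previews the next day's total:

```sql
-- Look one row ahead: tomorrow's total alongside today's
SELECT sale_date,
       daily_total,
       LEAD(daily_total, 1) OVER (ORDER BY sale_date) AS next_day_total
FROM daily_sales  -- the CTE defined in the solution above
ORDER BY sale_date;
```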

Common Mistakes

Mistake 1: Forgetting NULL for first/last rows
```sql
-- LAG returns NULL for the first row (no previous row)
-- Handle with COALESCE or a default value
LAG(value, 1, 0)  -- Third param is the default value
```
Mistake 2: Wrong offset
```sql
-- LAG(value, 2) gets the row 2 positions back, not the 2nd column
SELECT LAG(amount, 2) OVER (ORDER BY date)  -- Amount from 2 days ago
```

Question 70: FIRST_VALUE and LAST_VALUE

Problem

For each order, show the customer's first ever purchase date.

Solution

```sql
SELECT o.order_id,
       o.user_id,
       o.order_date,
       o.total_amount,
       FIRST_VALUE(o.order_date) OVER (
           PARTITION BY o.user_id ORDER BY o.order_date
       ) AS first_purchase_date,
       o.order_date - FIRST_VALUE(o.order_date) OVER (
           PARTITION BY o.user_id ORDER BY o.order_date
       ) AS days_since_first_purchase
FROM orders o
ORDER BY o.user_id, o.order_date;
```

Explanation

  • FIRST_VALUE(column): Returns value from first row in window
  • LAST_VALUE(column): Returns value from last row in window

Common Mistakes

Mistake 1: LAST_VALUE default frame issue
```sql
-- LAST_VALUE often surprises people:
SELECT order_id,
       order_date,
       LAST_VALUE(order_date) OVER (ORDER BY order_date) AS last_date
FROM orders;
-- Returns the current row's date, not the actual last!
-- Why? Default frame is RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW

-- Fix with an explicit frame:
SELECT order_id,
       LAST_VALUE(order_date) OVER (
           ORDER BY order_date
           ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
       ) AS actual_last_date
FROM orders;
```

Question 71: NTH_VALUE - Getting Specific Position

Problem

Find the second highest sale for each employee.

Solution

```sql
SELECT DISTINCT
       employee_id,
       NTH_VALUE(amount, 2) OVER (
           PARTITION BY employee_id
           ORDER BY amount DESC
           ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
       ) AS second_highest_sale
FROM employee_sales;
```

Explanation

NTH_VALUE(column, n) returns value at nth position in ordered window.

Common Mistakes

Mistake 1: Same frame issue as LAST_VALUE
```sql
-- Must include the full frame to access any position
ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
```
Mistake 2: NULL when fewer than n rows
```sql
-- If an employee has only 1 sale, NTH_VALUE(amount, 2) returns NULL
-- Handle appropriately
```
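An alternative sketch uses DENSE_RANK instead, which also treats tied top amounts as a single rank:

```sql
-- Second-highest distinct sale amount per employee
SELECT DISTINCT employee_id, amount AS second_highest_sale
FROM (
    SELECT employee_id,
           amount,
           DENSE_RANK() OVER (
               PARTITION BY employee_id
               ORDER BY amount DESC
           ) AS rnk
    FROM employee_sales
) ranked
WHERE rnk = 2;
```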

Question 72: NTILE - Creating Buckets

Problem

Divide customers into 4 spending tiers (quartiles).

Solution

```sql
WITH customer_spending AS (
    SELECT user_id, SUM(total_amount) AS total_spent
    FROM orders
    WHERE status = 'completed'
    GROUP BY user_id
)
SELECT user_id,
       total_spent,
       NTILE(4) OVER (ORDER BY total_spent DESC) AS spending_quartile,
       CASE NTILE(4) OVER (ORDER BY total_spent DESC)
           WHEN 1 THEN 'Platinum'
           WHEN 2 THEN 'Gold'
           WHEN 3 THEN 'Silver'
           WHEN 4 THEN 'Bronze'
       END AS tier
FROM customer_spending
ORDER BY total_spent DESC;
```

Explanation

NTILE(n) divides rows into n roughly equal groups:
  • NTILE(4): Quartiles (25% each)
  • NTILE(10): Deciles (10% each)
  • NTILE(100): Percentiles
  total_spent | ntile | tier
  50000       | 1     | Platinum
  45000       | 1     | Platinum
  35000       | 2     | Gold
  30000       | 2     | Gold
  ...

Common Mistakes

Mistake 1: Uneven distribution
```sql
-- 10 rows into 4 buckets = 3, 3, 2, 2 distribution
-- Not exactly 25% each!

-- For exact percentiles, use PERCENT_RANK
SELECT value,
       NTILE(4) OVER (ORDER BY value) AS quartile,
       PERCENT_RANK() OVER (ORDER BY value) AS percentile
FROM data;
```

Question 73: PERCENT_RANK and CUME_DIST

Problem

Calculate percentile rank for each product's sales.

Solution

```sql
WITH product_sales AS (
    SELECT p.product_id,
           p.name,
           SUM(oi.quantity * oi.unit_price) AS total_revenue
    FROM products p
    JOIN order_items oi ON p.product_id = oi.product_id
    GROUP BY p.product_id, p.name
)
SELECT product_id,
       name,
       total_revenue,
       -- Cast to NUMERIC: in PostgreSQL, ROUND(..., 2) is not defined
       -- for the DOUBLE PRECISION these functions return
       ROUND((PERCENT_RANK() OVER (ORDER BY total_revenue))::NUMERIC * 100, 2) AS percentile,
       ROUND((CUME_DIST() OVER (ORDER BY total_revenue))::NUMERIC * 100, 2) AS cumulative_pct
FROM product_sales
ORDER BY total_revenue DESC;
```

Explanation

  • PERCENT_RANK(): (rank - 1) / (total_rows - 1)
    • First row = 0, last row = 1
    • "What percentage of rows are below this?"
  • CUME_DIST(): rank / total_rows
    • "What percentage of rows are at or below this?"
  revenue | PERCENT_RANK | CUME_DIST
  100     | 0.00         | 0.25   -- 25% at or below
  200     | 0.33         | 0.50   -- 50% at or below
  300     | 0.67         | 0.75
  400     | 1.00         | 1.00

When to Use

  • Identifying top/bottom performers
  • Statistical analysis
  • Relative positioning

Question 74: Aggregate Window Functions

Problem

Show each order with the customer's total spending and order count.

Solution

```sql
SELECT o.order_id,
       o.user_id,
       o.order_date,
       o.total_amount,
       SUM(o.total_amount) OVER (PARTITION BY o.user_id) AS customer_lifetime_value,
       COUNT(*) OVER (PARTITION BY o.user_id) AS customer_order_count,
       ROUND(
           o.total_amount * 100.0
           / SUM(o.total_amount) OVER (PARTITION BY o.user_id)
       , 2) AS pct_of_customer_total
FROM orders o
WHERE o.status = 'completed'
ORDER BY o.user_id, o.order_date;
```

Explanation

Regular aggregates (SUM, COUNT, AVG, etc.) can be used as window functions:
  order_id | user_id | amount | customer_total | order_count | pct
  1        | 1       | 100    | 450            | 3           | 22.22
  5        | 1       | 150    | 450            | 3           | 33.33
  8        | 1       | 200    | 450            | 3           | 44.44
  2        | 2       | 300    | 500            | 2           | 60.00
  4        | 2       | 200    | 500            | 2           | 40.00
Key difference from GROUP BY:
  • GROUP BY: One row per group
  • Window: All rows retained, aggregate value repeated

Question 75: Complex Window Expressions

Problem

For each sale, calculate:
  1. Running total for the employee that month
  2. Rank within their team
  3. Difference from team's best performer

Solution

```sql
WITH monthly_sales AS (
    SELECT employee_id,
           region,
           DATE_TRUNC('month', sale_date) AS month,
           SUM(amount) AS monthly_total
    FROM employee_sales
    GROUP BY employee_id, region, DATE_TRUNC('month', sale_date)
)
SELECT employee_id,
       region,
       month,
       monthly_total,
       -- Running total for the employee
       SUM(monthly_total) OVER (
           PARTITION BY employee_id ORDER BY month
       ) AS employee_ytd,
       -- Rank within region for that month
       RANK() OVER (
           PARTITION BY region, month ORDER BY monthly_total DESC
       ) AS rank_in_region,
       -- Best in region that month
       MAX(monthly_total) OVER (PARTITION BY region, month) AS region_best,
       -- Gap from best
       MAX(monthly_total) OVER (PARTITION BY region, month) - monthly_total AS gap_from_best,
       -- Percentage of region total
       ROUND(
           monthly_total * 100.0
           / SUM(monthly_total) OVER (PARTITION BY region, month)
       , 2) AS pct_of_region
FROM monthly_sales
ORDER BY month, region, rank_in_region;
```

Explanation

This combines multiple window calculations in one query:
  1. Different PARTITION BY for different metrics
  2. Running totals with ORDER BY
  3. Comparisons to group maximums
  4. Percentage calculations within groups

Performance Considerations

```sql
-- Multiple window functions can share sorts;
-- the database optimizes when PARTITION BY and ORDER BY match
EXPLAIN ANALYZE
SELECT id,
       ROW_NUMBER() OVER (PARTITION BY region ORDER BY amount DESC),
       RANK() OVER (PARTITION BY region ORDER BY amount DESC),
       SUM(amount) OVER (PARTITION BY region)
FROM sales;
-- These share the same partition/sort: efficient!

-- Different partitions require separate sorts
SELECT id,
       ROW_NUMBER() OVER (PARTITION BY region ORDER BY amount),
       ROW_NUMBER() OVER (PARTITION BY employee ORDER BY date)
FROM sales;
-- Two different sorts needed: less efficient
```
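When several functions share one specification, the SQL-standard WINDOW clause (supported by PostgreSQL, among others) lets you name it once; a sketch over the same hypothetical sales table:

```sql
-- Name the shared window once and reference it by name
SELECT id,
       ROW_NUMBER() OVER w AS row_num,
       RANK() OVER w AS rnk,
       SUM(amount) OVER (PARTITION BY region) AS region_total
FROM sales
WINDOW w AS (PARTITION BY region ORDER BY amount DESC);
```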

Summary: Aggregation Cheat Sheet

Basic Aggregates

| Function | Purpose | NULL Handling |
| --- | --- | --- |
| COUNT(*) | Count all rows | Counts rows with NULLs |
| COUNT(col) | Count non-NULL values | Ignores NULLs |
| SUM(col) | Total | Ignores NULLs |
| AVG(col) | Average | Ignores NULLs |
| MIN/MAX | Extremes | Ignores NULLs |

GROUP BY Variants

| Syntax | Result |
| --- | --- |
| GROUP BY a | Groups by a |
| GROUP BY a, b | Groups by combination |
| GROUP BY ROLLUP(a, b) | Groups + subtotals |
| GROUP BY CUBE(a, b) | All combinations |
| GROUP BY GROUPING SETS(...) | Custom groups |

Window Function Anatomy

```sql
function() OVER (
    PARTITION BY columns   -- Reset for each group
    ORDER BY columns       -- Order within partition
    frame_clause           -- Which rows to include
)
```

Common Window Functions

| Function | Purpose |
| --- | --- |
| ROW_NUMBER() | Unique sequential numbers |
| RANK() | Ranking with gaps |
| DENSE_RANK() | Ranking without gaps |
| NTILE(n) | Divide into n buckets |
| LAG(col, n) | Previous nth row |
| LEAD(col, n) | Next nth row |
| FIRST_VALUE() | First in window |
| LAST_VALUE() | Last in window |
| PERCENT_RANK() | Percentile (0-1) |

Frame Clauses

```sql
ROWS BETWEEN start AND end
```

start/end options:
  • UNBOUNDED PRECEDING
  • n PRECEDING
  • CURRENT ROW
  • n FOLLOWING
  • UNBOUNDED FOLLOWING

What's Next?

In Part 4: Subqueries and CTEs, we'll cover:
  • Scalar subqueries
  • Correlated subqueries
  • EXISTS vs IN performance
  • Common Table Expressions (CTEs)
  • Recursive CTEs for hierarchies
  • Subquery optimization
Keep practicing these aggregation patterns - they're essential for data analysis and reporting!