SQL Mastery Guide - Part 3: Aggregations and GROUP BY
Introduction
Welcome to Part 3 of our SQL Mastery series. In Parts 1 and 2, we covered SELECT fundamentals and all types of JOINs. Now we tackle one of SQL's most powerful features: aggregations.
Aggregations transform rows into meaningful summaries. Instead of seeing 10,000 orders, you see "total revenue by month." Instead of listing every employee, you see "average salary by department."
This part covers Questions 51-75, taking you from basic COUNT to advanced window functions.
Schema Reference
We continue with our e-commerce database, adding some new tables:
```sql
-- Core Tables (from Part 1)
CREATE TABLE users (
    user_id SERIAL PRIMARY KEY,
    email VARCHAR(255) UNIQUE NOT NULL,
    full_name VARCHAR(100),
    created_at TIMESTAMP DEFAULT NOW(),
    country VARCHAR(50),
    is_active BOOLEAN DEFAULT true
);

CREATE TABLE categories (
    category_id SERIAL PRIMARY KEY,
    name VARCHAR(100) NOT NULL,
    parent_category_id INT REFERENCES categories(category_id)
);

CREATE TABLE products (
    product_id SERIAL PRIMARY KEY,
    name VARCHAR(255) NOT NULL,
    price DECIMAL(10,2),
    category_id INT REFERENCES categories(category_id),
    stock_quantity INT DEFAULT 0,
    created_at TIMESTAMP DEFAULT NOW()
);

CREATE TABLE orders (
    order_id SERIAL PRIMARY KEY,
    user_id INT REFERENCES users(user_id),
    order_date TIMESTAMP DEFAULT NOW(),
    status VARCHAR(20) DEFAULT 'pending',
    total_amount DECIMAL(10,2),
    shipping_address TEXT
);

CREATE TABLE order_items (
    item_id SERIAL PRIMARY KEY,
    order_id INT REFERENCES orders(order_id),
    product_id INT REFERENCES products(product_id),
    quantity INT NOT NULL,
    unit_price DECIMAL(10,2) NOT NULL
);

-- New Tables for Aggregation Practice
CREATE TABLE page_views (
    view_id SERIAL PRIMARY KEY,
    user_id INT REFERENCES users(user_id),
    page_url VARCHAR(500),
    viewed_at TIMESTAMP DEFAULT NOW(),
    duration_seconds INT
);

CREATE TABLE reviews (
    review_id SERIAL PRIMARY KEY,
    product_id INT REFERENCES products(product_id),
    user_id INT REFERENCES users(user_id),
    rating INT CHECK (rating BETWEEN 1 AND 5),
    review_text TEXT,
    created_at TIMESTAMP DEFAULT NOW()
);

CREATE TABLE employee_sales (
    sale_id SERIAL PRIMARY KEY,
    employee_id INT,
    sale_date DATE,
    amount DECIMAL(10,2),
    region VARCHAR(50)
);
```
Question 51: Basic COUNT - Counting All Rows
Problem
Count the total number of users in our system.
Solution
```sql
SELECT COUNT(*) AS total_users
FROM users;
```
Explanation
COUNT(*) counts every row in the table, regardless of NULL values. It's the simplest aggregation.
How it works internally:
- Database scans the table (or uses an index)
- Increments counter for each row
- Returns final count
Common Mistakes
Mistake 1: Using COUNT(column) when you want all rows
```sql
-- This might give wrong count if email has NULLs
SELECT COUNT(email) FROM users;

-- This always counts all rows
SELECT COUNT(*) FROM users;
```
Mistake 2: Thinking COUNT(*) is slow
```sql
-- Many people write this thinking it's faster
SELECT COUNT(1) FROM users;

-- Modern databases optimize COUNT(*) the same way
SELECT COUNT(*) FROM users;
```
Performance Considerations
- PostgreSQL: COUNT(*) scans entire table by default (MVCC reasons)
- MySQL InnoDB: Also needs table scan
- MySQL MyISAM: Keeps row count cached (instant result)
For large tables, consider:
```sql
-- Approximate count (much faster)
SELECT reltuples::BIGINT AS approximate_count
FROM pg_class
WHERE relname = 'users';
```
When to Use
- Getting total record counts
- Dashboard statistics
- Validation checks
When NOT to Use
- When you need exact count on billion-row tables (consider approximations)
- When you only need to know if rows exist (use EXISTS instead)
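To illustrate the EXISTS alternative, here is a minimal sketch (the user_id value is a placeholder): EXISTS stops at the first matching row instead of scanning everything to produce an exact count.

```sql
-- Existence check: stops at the first matching row
SELECT EXISTS (
    SELECT 1 FROM orders WHERE user_id = 42
) AS has_orders;
```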
Question 52: COUNT with DISTINCT
Problem
Count how many unique countries our users are from.
Solution
```sql
SELECT COUNT(DISTINCT country) AS unique_countries
FROM users;
```
Explanation
COUNT(DISTINCT column) counts only unique non-NULL values.
Step-by-step:
- Scan country column
- Build hash set of unique values
- Ignore NULLs
- Return set size
Common Mistakes
Mistake 1: Forgetting NULLs aren't counted
```sql
-- If 100 users have country, 10 have NULL
-- This returns count of non-NULL unique countries
SELECT COUNT(DISTINCT country) FROM users;  -- Maybe 25

-- To include NULL as a country
SELECT COUNT(DISTINCT country)
       + CASE WHEN EXISTS (SELECT 1 FROM users WHERE country IS NULL)
              THEN 1 ELSE 0 END
FROM users;
```
Mistake 2: Using DISTINCT on primary keys
```sql
-- Pointless - every user_id is already unique
SELECT COUNT(DISTINCT user_id) FROM users;

-- Same result, more efficient
SELECT COUNT(*) FROM users;
```
Performance Considerations
```sql
-- EXPLAIN ANALYZE this
SELECT COUNT(DISTINCT country) FROM users;

-- If slow, consider:
-- 1. Index on country column
CREATE INDEX idx_users_country ON users(country);

-- 2. Materialized view for dashboards
CREATE MATERIALIZED VIEW country_stats AS
SELECT COUNT(DISTINCT country) AS unique_countries
FROM users;
```
When to Use
- Counting unique visitors, unique products sold, unique categories
- De-duplication counts
Question 53: SUM - Calculating Totals
Problem
Calculate the total revenue from all orders.
Solution
```sql
SELECT SUM(total_amount) AS total_revenue
FROM orders
WHERE status = 'completed';
```
Explanation
SUM(column) adds up all non-NULL values in the column.
Important behaviors:
- NULL values are ignored
- Returns NULL if all values are NULL
- Returns NULL on empty result set
Common Mistakes
Mistake 1: Not handling NULLs
```sql
-- If some orders have NULL total_amount
SELECT SUM(total_amount) FROM orders;
-- NULL amounts are skipped; result is NULL if ALL values are NULL

-- Better when you want 0 instead of NULL:
SELECT SUM(COALESCE(total_amount, 0)) FROM orders;
```
Mistake 2: Integer overflow
```sql
-- In some databases (e.g. SQL Server), if amount is INTEGER
-- and the sum exceeds ~2.1 billion:
SELECT SUM(amount) FROM huge_sales_table;  -- Overflow error!
-- (PostgreSQL promotes SUM of INTEGER to BIGINT automatically)

-- Cast to BIGINT to be safe
SELECT SUM(amount::BIGINT) FROM huge_sales_table;
```
Mistake 3: Precision loss with floats
```sql
-- FLOAT can lose precision
SELECT SUM(price) FROM products;  -- 999.9999999997

-- Use DECIMAL for money
ALTER TABLE products ALTER COLUMN price TYPE DECIMAL(10,2);
```
Performance Considerations
```sql
-- Large table SUM can be slow
EXPLAIN ANALYZE SELECT SUM(total_amount) FROM orders;

-- Covering index helps
CREATE INDEX idx_orders_sum ON orders(status) INCLUDE (total_amount);

-- For real-time dashboards, use materialized views
CREATE MATERIALIZED VIEW revenue_summary AS
SELECT
    DATE_TRUNC('day', order_date) AS day,
    SUM(total_amount) AS daily_revenue
FROM orders
WHERE status = 'completed'
GROUP BY DATE_TRUNC('day', order_date);
```
Question 54: AVG - Calculating Averages
Problem
Find the average order value for completed orders.
Solution
```sql
SELECT AVG(total_amount) AS average_order_value
FROM orders
WHERE status = 'completed';
```
Explanation
AVG(column) = SUM(column) / COUNT(column)
NULL values are excluded from both numerator and denominator.
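The identity above can be checked directly against the same table — a quick sketch; both columns return the same value because both SUM and COUNT(column) skip NULLs:

```sql
-- AVG(x) equals SUM(x) / COUNT(x) over the same non-NULL rows
SELECT
    AVG(total_amount)                        AS avg_builtin,
    SUM(total_amount) / COUNT(total_amount)  AS avg_manual
FROM orders;
```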
Common Mistakes
Mistake 1: Confusing AVG behavior with NULLs
```sql
-- Table: scores(student_id, score)
-- Data: (1, 100), (2, 80), (3, NULL)
SELECT AVG(score) FROM scores;
-- Returns 90, not 60! (180/2, not 180/3)

-- If you want NULL treated as 0:
SELECT AVG(COALESCE(score, 0)) FROM scores;  -- Returns 60 (180/3)
```
Mistake 2: Not rounding for display
```sql
-- Raw average has many decimal places
SELECT AVG(total_amount) FROM orders;  -- 125.7834921...

-- Round appropriately
SELECT ROUND(AVG(total_amount), 2) AS avg_order FROM orders;  -- 125.78
```
Performance Considerations
Same as SUM - uses same scan, benefits from indexes.
When to Use
- Calculating average order value, average session duration
- Computing mean metrics
When NOT to Use
When you need median (for skewed data):
```sql
-- Average can be misleading with outliers
-- Order values: 50, 50, 50, 50, 10000
-- AVG = 2040 (not representative!)

-- Use PERCENTILE_CONT for median
SELECT PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY total_amount) AS median_order
FROM orders;
-- Median = 50 (more representative)
```
Question 55: MIN and MAX
Problem
Find the cheapest and most expensive products.
Solution
```sql
SELECT
    MIN(price) AS cheapest_price,
    MAX(price) AS most_expensive_price
FROM products;
```
Explanation
- MIN(column): Returns smallest value
- MAX(column): Returns largest value
- Both ignore NULLs
Common Mistakes
Mistake 1: Using MIN/MAX on wrong data types
```sql
-- String comparison is alphabetical, not numeric!
SELECT MAX(order_id) FROM orders;  -- If order_id is VARCHAR
-- '9' > '10' alphabetically!
-- Ensure numeric types for numeric comparisons
```
Mistake 2: Trying to get the row with MIN/MAX
```sql
-- This doesn't work as expected
SELECT product_id, MIN(price) FROM products;  -- Error or wrong!

-- Correct approaches:

-- Option 1: Subquery
SELECT * FROM products
WHERE price = (SELECT MIN(price) FROM products);

-- Option 2: ORDER BY with LIMIT
SELECT * FROM products
ORDER BY price ASC
LIMIT 1;

-- Option 3: Window function
SELECT * FROM (
    SELECT *, ROW_NUMBER() OVER (ORDER BY price ASC) AS rn
    FROM products
) sub
WHERE rn = 1;
```
Performance Considerations
```sql
-- MIN/MAX can use B-tree index efficiently
CREATE INDEX idx_products_price ON products(price);

-- With index, MIN/MAX is O(1) - just read first/last leaf
EXPLAIN SELECT MIN(price) FROM products;
-- Shows: Index Only Scan
```
When to Use
- Finding bounds (earliest date, latest login)
- Range queries
- Data validation
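MIN and MAX work on dates and timestamps, not just numbers — a small sketch of the "earliest/latest" use case against the users table:

```sql
-- Earliest and most recent signups
SELECT
    MIN(created_at) AS first_signup,
    MAX(created_at) AS latest_signup
FROM users;
```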
Question 56: GROUP BY Basics
Problem
Count orders per user.
Solution
```sql
SELECT
    user_id,
    COUNT(*) AS order_count
FROM orders
GROUP BY user_id
ORDER BY order_count DESC;
```
Explanation
GROUP BY creates buckets of rows with the same values, then applies aggregations to each bucket.
How it works:
- Scan table
- Hash rows by GROUP BY columns
- Apply aggregation to each group
- Return one row per group
```
Before GROUP BY:
user_id | order_id | total
1       | 101      | 50
1       | 102      | 75
2       | 103      | 30
1       | 104      | 25

After GROUP BY user_id:
user_id | COUNT(*) | SUM(total)
1       | 3        | 150
2       | 1        | 30
```
Common Mistakes
Mistake 1: Selecting non-grouped columns
```sql
-- ERROR in standard SQL!
SELECT user_id, order_date, COUNT(*)  -- order_date not in GROUP BY
FROM orders
GROUP BY user_id;
-- Which order_date should it show? Undefined!

-- Fixed:
SELECT user_id, MAX(order_date) AS latest_order, COUNT(*)
FROM orders
GROUP BY user_id;
```
Mistake 2: Misunderstanding GROUP BY with NULLs
```sql
-- NULLs are grouped together
SELECT country, COUNT(*)
FROM users
GROUP BY country;
-- Returns a row for NULL country with count of NULL users
```
Performance Considerations
```sql
-- GROUP BY can be expensive on large tables
EXPLAIN ANALYZE
SELECT user_id, COUNT(*) FROM orders GROUP BY user_id;

-- Hash Aggregate vs Sort Aggregate
-- Database chooses based on:
-- - work_mem setting
-- - Number of groups
-- - Available indexes

-- Index can help if sorting is chosen
CREATE INDEX idx_orders_user ON orders(user_id);
```
When to Use
- Any time you need per-group statistics
- Reporting and analytics
- Data summarization
Question 57: GROUP BY Multiple Columns
Problem
Calculate total sales per employee per region.
Solution
```sql
SELECT
    employee_id,
    region,
    SUM(amount) AS total_sales,
    COUNT(*) AS sale_count
FROM employee_sales
GROUP BY employee_id, region
ORDER BY employee_id, region;
```
Explanation
With multiple columns, GROUP BY creates groups for each unique combination.
```
employee_id | region | amount
1           | North  | 100
1           | North  | 150
1           | South  | 200
2           | North  | 300

Groups:
(1, North) -> SUM: 250, COUNT: 2
(1, South) -> SUM: 200, COUNT: 1
(2, North) -> SUM: 300, COUNT: 1
```
Common Mistakes
Mistake 1: Wrong granularity
```sql
-- Too many columns = very small groups
SELECT employee_id, region, sale_date, SUM(amount)
FROM employee_sales
GROUP BY employee_id, region, sale_date;
-- Might give one row per sale if dates are unique!

-- Consider date truncation
SELECT employee_id, region, DATE_TRUNC('month', sale_date), SUM(amount)
FROM employee_sales
GROUP BY employee_id, region, DATE_TRUNC('month', sale_date);
```
Performance Considerations
```sql
-- Multi-column GROUP BY benefits from composite index
CREATE INDEX idx_sales_emp_region ON employee_sales(employee_id, region);

-- Order matters! Index on (employee_id, region) helps:
GROUP BY employee_id, region  -- Yes
GROUP BY employee_id          -- Yes
GROUP BY region               -- No (can't skip columns)
```
Question 58: HAVING - Filtering Groups
Problem
Find users who have placed more than 5 orders.
Solution
```sql
SELECT
    user_id,
    COUNT(*) AS order_count
FROM orders
GROUP BY user_id
HAVING COUNT(*) > 5
ORDER BY order_count DESC;
```
Explanation
HAVING filters groups AFTER aggregation. WHERE filters rows BEFORE aggregation.
```
Execution Order:
1. FROM orders
2. WHERE (filter rows)
3. GROUP BY user_id
4. HAVING COUNT(*) > 5 (filter groups)
5. SELECT
6. ORDER BY
```
Common Mistakes
Mistake 1: Using WHERE for aggregate conditions
```sql
-- ERROR!
SELECT user_id, COUNT(*)
FROM orders
WHERE COUNT(*) > 5  -- Can't use aggregate in WHERE
GROUP BY user_id;

-- Correct: use HAVING
SELECT user_id, COUNT(*)
FROM orders
GROUP BY user_id
HAVING COUNT(*) > 5;
```
Mistake 2: Using HAVING when WHERE would work
```sql
-- Inefficient: filters after grouping
SELECT user_id, COUNT(*)
FROM orders
GROUP BY user_id
HAVING user_id > 100;

-- Better: filter before grouping (fewer rows to group)
SELECT user_id, COUNT(*)
FROM orders
WHERE user_id > 100
GROUP BY user_id;
```
Performance Considerations
```sql
-- HAVING is applied after GROUP BY
-- So all rows are still processed
-- If possible, use WHERE to reduce rows BEFORE grouping

-- WHERE filters: O(n) scan, then group smaller set
-- HAVING filters: O(n) scan, group all, then filter

EXPLAIN ANALYZE
SELECT user_id, COUNT(*)
FROM orders
WHERE order_date > '2024-01-01'  -- Reduces rows first
GROUP BY user_id
HAVING COUNT(*) > 5;             -- Filters groups after
```
Question 59: Combining WHERE and HAVING
Problem
Find products with average rating above 4, considering only reviews from 2024.
Solution
```sql
SELECT
    p.product_id,
    p.name,
    AVG(r.rating) AS avg_rating,
    COUNT(r.review_id) AS review_count
FROM products p
JOIN reviews r ON p.product_id = r.product_id
WHERE r.created_at >= '2024-01-01'   -- Filter rows
GROUP BY p.product_id, p.name
HAVING AVG(r.rating) > 4             -- Filter groups
   AND COUNT(r.review_id) >= 5       -- Minimum reviews
ORDER BY avg_rating DESC;
```
Explanation
This query demonstrates the full flow:
- JOIN products and reviews
- WHERE filters to 2024 reviews only
- GROUP BY product
- HAVING filters to high-rated products with enough reviews
- ORDER BY rating
Common Mistakes
Mistake 1: Putting HAVING conditions in WHERE
```sql
-- Wrong order causes errors
SELECT product_id, AVG(rating)
FROM reviews
WHERE AVG(rating) > 4  -- ERROR!
GROUP BY product_id;
```
Mistake 2: Not understanding NULL in HAVING
```sql
-- Products with no reviews have NULL avg
-- HAVING AVG(rating) > 4 excludes them automatically
-- (NULL > 4 is UNKNOWN, not TRUE)
```
Question 60: Grouping by Expressions
Problem
Calculate total revenue by month.
Solution
```sql
SELECT
    DATE_TRUNC('month', order_date) AS month,
    SUM(total_amount) AS monthly_revenue,
    COUNT(*) AS order_count
FROM orders
WHERE status = 'completed'
GROUP BY DATE_TRUNC('month', order_date)
ORDER BY month;
```
Explanation
You can GROUP BY any expression, not just columns. The database evaluates the expression for each row, then groups by result.
Common Mistakes
Mistake 1: Expression mismatch
```sql
-- SELECT and GROUP BY expressions must match exactly
SELECT DATE_TRUNC('month', order_date) AS month, COUNT(*)
FROM orders
GROUP BY month;  -- Some databases don't allow alias reference

-- Safer: repeat expression
SELECT DATE_TRUNC('month', order_date) AS month, COUNT(*)
FROM orders
GROUP BY DATE_TRUNC('month', order_date);

-- Or use position
SELECT DATE_TRUNC('month', order_date) AS month, COUNT(*)
FROM orders
GROUP BY 1;  -- Group by first SELECT expression
```
Performance Considerations
```sql
-- Expression in GROUP BY can't use regular index
-- Computed during scan

-- Create expression index for better performance
CREATE INDEX idx_orders_month ON orders(DATE_TRUNC('month', order_date));

-- Now the GROUP BY can use index
EXPLAIN ANALYZE
SELECT DATE_TRUNC('month', order_date), COUNT(*)
FROM orders
GROUP BY DATE_TRUNC('month', order_date);
```
Question 61: ROLLUP - Subtotals and Grand Totals
Problem
Show sales by region with subtotals and grand total.
Solution
```sql
SELECT
    COALESCE(region, 'TOTAL') AS region,
    SUM(amount) AS total_sales,
    COUNT(*) AS sale_count
FROM employee_sales
GROUP BY ROLLUP(region)
ORDER BY
    CASE WHEN region IS NULL THEN 1 ELSE 0 END,
    region;
```
Explanation
ROLLUP creates grouping sets that include subtotals:
GROUP BY ROLLUP(region) produces:
- Groups for each region
- Grand total (all regions combined)
```
Result:
region | total_sales | sale_count
North  | 50000       | 120
South  | 45000       | 100
West   | 60000       | 150
TOTAL  | 155000      | 370
```
For multiple columns:
```sql
GROUP BY ROLLUP(region, employee_id)
-- Produces:
-- (region, employee_id) - per employee per region
-- (region)              - subtotal per region
-- ()                    - grand total
```
Common Mistakes
Mistake 1: Not distinguishing NULL data from subtotal rows
```sql
-- If region can be NULL in data, how to tell NULL region from subtotal?
-- Use GROUPING() function
SELECT
    region,
    SUM(amount),
    GROUPING(region) AS is_subtotal  -- 1 if subtotal row, 0 if actual data
FROM employee_sales
GROUP BY ROLLUP(region);
```
Performance Considerations
ROLLUP adds extra grouping operations but is more efficient than UNION ALL of separate queries.
Question 62: CUBE - All Combinations
Problem
Analyze sales by both region and quarter with all subtotals.
Solution
```sql
SELECT
    COALESCE(region, 'All Regions') AS region,
    COALESCE(DATE_TRUNC('quarter', sale_date)::TEXT, 'All Quarters') AS quarter,
    SUM(amount) AS total_sales
FROM employee_sales
GROUP BY CUBE(region, DATE_TRUNC('quarter', sale_date))
ORDER BY region, quarter;
```
Explanation
CUBE creates all possible combinations:
GROUP BY CUBE(region, quarter) produces:
- (region, quarter) - per region per quarter
- (region) - per region, all quarters
- (quarter) - per quarter, all regions
- () - grand total
More combinations than ROLLUP but gives complete dimensional analysis.
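The GROUPING() function from Question 61 works with CUBE as well — a sketch (column aliases are illustrative) that labels which dimension each output row has rolled up:

```sql
-- GROUPING(col) = 1 when this row aggregates across that column
SELECT
    region,
    DATE_TRUNC('quarter', sale_date) AS quarter,
    SUM(amount) AS total_sales,
    GROUPING(region) AS region_rolled_up,
    GROUPING(DATE_TRUNC('quarter', sale_date)) AS quarter_rolled_up
FROM employee_sales
GROUP BY CUBE(region, DATE_TRUNC('quarter', sale_date));
```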
When to Use
- OLAP-style reporting
- Creating pivot-table-like summaries
- When you need all possible subtotals
Question 63: GROUPING SETS - Custom Groups
Problem
Get sales totals by region, by employee, and overall (but not region+employee combination).
Solution
```sql
SELECT
    region,
    employee_id,
    SUM(amount) AS total_sales
FROM employee_sales
GROUP BY GROUPING SETS (
    (region),
    (employee_id),
    ()
)
ORDER BY region NULLS LAST, employee_id NULLS LAST;
```
Explanation
GROUPING SETS lets you specify exactly which groupings you want:
- More flexible than ROLLUP/CUBE
- Avoids computing unwanted combinations
Equivalent to:

```sql
SELECT region, NULL, SUM(amount) FROM employee_sales GROUP BY region
UNION ALL
SELECT NULL, employee_id, SUM(amount) FROM employee_sales GROUP BY employee_id
UNION ALL
SELECT NULL, NULL, SUM(amount) FROM employee_sales
```
But more efficient (single scan).
Question 64: Window Functions Introduction - ROW_NUMBER
Problem
Rank users by their total order value.
Solution
```sql
SELECT
    u.user_id,
    u.full_name,
    SUM(o.total_amount) AS total_spent,
    ROW_NUMBER() OVER (ORDER BY SUM(o.total_amount) DESC) AS rank
FROM users u
JOIN orders o ON u.user_id = o.user_id
WHERE o.status = 'completed'
GROUP BY u.user_id, u.full_name
ORDER BY rank;
```
Explanation
Window functions perform calculations across related rows WITHOUT collapsing them into one row like GROUP BY.
ROW_NUMBER() assigns sequential numbers:
```
user_id | total_spent | rank
5       | 10000       | 1
3       | 8500        | 2
8       | 8500        | 3   -- Different rank despite same amount
1       | 7200        | 4
```
Syntax:
```
function() OVER (ORDER BY ...)
```

Common Mistakes
Mistake 1: Confusing with GROUP BY
```sql
-- This is NOT a window function (collapses rows)
SELECT user_id, SUM(total_amount)
FROM orders
GROUP BY user_id;

-- This IS a window function (keeps all rows)
SELECT
    order_id,
    user_id,
    total_amount,
    SUM(total_amount) OVER (PARTITION BY user_id) AS user_total
FROM orders;
```
Mistake 2: Using ROW_NUMBER for ties
```sql
-- ROW_NUMBER gives different ranks for same values
-- Use RANK() or DENSE_RANK() if ties should match
```
Question 65: RANK and DENSE_RANK
Problem
Rank products by sales, handling ties appropriately.
Solution
```sql
WITH product_sales AS (
    SELECT
        p.product_id,
        p.name,
        SUM(oi.quantity) AS units_sold
    FROM products p
    JOIN order_items oi ON p.product_id = oi.product_id
    GROUP BY p.product_id, p.name
)
SELECT
    product_id,
    name,
    units_sold,
    ROW_NUMBER() OVER (ORDER BY units_sold DESC) AS row_num,
    RANK() OVER (ORDER BY units_sold DESC) AS rank,
    DENSE_RANK() OVER (ORDER BY units_sold DESC) AS dense_rank
FROM product_sales;
```
Explanation
Compare the three ranking functions:
```
units_sold | ROW_NUMBER | RANK | DENSE_RANK
100        | 1          | 1    | 1
95         | 2          | 2    | 2
95         | 3          | 2    | 2   -- Tie!
90         | 4          | 4    | 3   -- Key difference
```
- ROW_NUMBER: Always unique, arbitrary order for ties
- RANK: Same rank for ties, gaps after
- DENSE_RANK: Same rank for ties, no gaps
When to Use Which
- ROW_NUMBER: Pagination, unique ordering needed
- RANK: Competition ranking (1st, 2nd, 2nd, 4th)
- DENSE_RANK: When you need sequential values (1st, 2nd, 2nd, 3rd)
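To illustrate the pagination use case for ROW_NUMBER, here is a minimal sketch (page size 20 and page 3 are placeholder values; product_id serves as a tiebreaker so the ordering stays stable between requests):

```sql
-- Page 3 with 20 rows per page: rows 41-60
WITH numbered AS (
    SELECT
        product_id,
        name,
        ROW_NUMBER() OVER (ORDER BY name, product_id) AS rn
    FROM products
)
SELECT product_id, name
FROM numbered
WHERE rn BETWEEN 41 AND 60;
```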
Question 66: PARTITION BY - Groups Within Window
Problem
Rank products within each category by price.
Solution
```sql
SELECT
    category_id,
    product_id,
    name,
    price,
    RANK() OVER (
        PARTITION BY category_id
        ORDER BY price DESC
    ) AS price_rank_in_category
FROM products
WHERE price IS NOT NULL
ORDER BY category_id, price_rank_in_category;
```
Explanation
PARTITION BY divides rows into groups, and the window function restarts for each group.
```
category_id | product | price | rank
1           | Laptop  | 1500  | 1
1           | Tablet  | 800   | 2
1           | Mouse   | 50    | 3
2           | Shirt   | 100   | 1   -- Rank restarts!
2           | Hat     | 30    | 2
```
Without PARTITION BY, ranking would be across ALL products.
Common Mistakes
Mistake 1: Confusing PARTITION BY with GROUP BY
```sql
-- GROUP BY: Collapses rows, one output per group
SELECT category_id, MAX(price)
FROM products
GROUP BY category_id;

-- PARTITION BY: Keeps all rows, computes value for each
SELECT
    category_id,
    product_id,
    price,
    MAX(price) OVER (PARTITION BY category_id) AS max_in_category
FROM products;
```
Question 67: Running Totals with SUM OVER
Problem
Calculate running total of daily sales.
Solution
```sql
SELECT
    sale_date,
    SUM(amount) AS daily_total,
    SUM(SUM(amount)) OVER (ORDER BY sale_date) AS running_total
FROM employee_sales
GROUP BY sale_date
ORDER BY sale_date;
```
Explanation
SUM() OVER (ORDER BY ...) creates a running total:

```
sale_date  | daily_total | running_total
2024-01-01 | 1000        | 1000
2024-01-02 | 1500        | 2500
2024-01-03 | 800         | 3300
2024-01-04 | 2000        | 5300
```
The ORDER BY in OVER defines the running order. Each row's total is the sum of all rows up to and including the current one.
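The same pattern extends to per-group running totals — a sketch that adds PARTITION BY so the total restarts for each employee:

```sql
-- Running total per employee: PARTITION BY resets the sum
-- at each new employee_id
SELECT
    employee_id,
    sale_date,
    SUM(amount) AS daily_total,
    SUM(SUM(amount)) OVER (
        PARTITION BY employee_id
        ORDER BY sale_date
    ) AS employee_running_total
FROM employee_sales
GROUP BY employee_id, sale_date
ORDER BY employee_id, sale_date;
```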
Common Mistakes
Mistake 1: Not understanding default frame
```sql
-- With ORDER BY, default frame is
-- RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
-- This means: sum from start to current row

-- Without ORDER BY, default is all rows in partition
SELECT SUM(amount) OVER () FROM sales;
-- Same value for every row (grand total)
```
Performance Considerations
```sql
-- Running totals are efficient - one pass through data
-- But can still be slow on huge datasets

-- For very large tables, consider pre-computing:
CREATE TABLE daily_running_totals AS
SELECT
    sale_date,
    SUM(amount) AS daily_total,
    SUM(SUM(amount)) OVER (ORDER BY sale_date) AS running_total
FROM employee_sales
GROUP BY sale_date;
```
Question 68: Moving Averages with Window Frames
Problem
Calculate 7-day moving average of daily page views.
Solution
```sql
WITH daily_views AS (
    SELECT
        DATE(viewed_at) AS view_date,
        COUNT(*) AS daily_count
    FROM page_views
    GROUP BY DATE(viewed_at)
)
SELECT
    view_date,
    daily_count,
    ROUND(AVG(daily_count) OVER (
        ORDER BY view_date
        ROWS BETWEEN 6 PRECEDING AND CURRENT ROW
    ), 2) AS moving_avg_7day
FROM daily_views
ORDER BY view_date;
```
Explanation
Window frames define exactly which rows to include in calculation:
```sql
ROWS BETWEEN 6 PRECEDING AND CURRENT ROW
-- Current row + 6 previous = 7 rows total

-- Other frame options:
ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW   -- All previous
ROWS BETWEEN 3 PRECEDING AND 3 FOLLOWING           -- 7-row centered
ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING   -- Current to end
```
Common Mistakes
Mistake 1: ROWS vs RANGE
```sql
-- ROWS: Exact number of rows
ROWS BETWEEN 6 PRECEDING AND CURRENT ROW

-- RANGE: Value-based (can include multiple rows with same value)
RANGE BETWEEN INTERVAL '7 days' PRECEDING AND CURRENT ROW

-- For dates with gaps, RANGE might be more appropriate
```
Mistake 2: Incomplete windows at edges
```sql
-- First 6 rows won't have full 7-day average
-- Handle appropriately:
SELECT
    view_date,
    daily_count,
    CASE
        WHEN COUNT(*) OVER (ORDER BY view_date ROWS 6 PRECEDING) < 7
        THEN NULL  -- Not enough data
        ELSE AVG(daily_count) OVER (ORDER BY view_date ROWS 6 PRECEDING)
    END AS moving_avg_7day
FROM daily_views;
```
Question 69: LAG and LEAD - Accessing Other Rows
Problem
Compare each day's sales to the previous day.
Solution
```sql
WITH daily_sales AS (
    SELECT sale_date, SUM(amount) AS daily_total
    FROM employee_sales
    GROUP BY sale_date
)
SELECT
    sale_date,
    daily_total,
    LAG(daily_total, 1, 0) OVER (ORDER BY sale_date) AS prev_day_total,
    daily_total - LAG(daily_total, 1, 0) OVER (ORDER BY sale_date) AS daily_change,
    ROUND(
        (daily_total - LAG(daily_total, 1) OVER (ORDER BY sale_date)) * 100.0
        / NULLIF(LAG(daily_total, 1) OVER (ORDER BY sale_date), 0)
    , 2) AS percent_change
FROM daily_sales
ORDER BY sale_date;
```
Explanation
- LAG(column, n, default): Access nth row BEFORE current
- LEAD(column, n, default): Access nth row AFTER current
```
sale_date  | daily_total | prev_day | change | % change
2024-01-01 | 1000        | 0        | 1000   | NULL
2024-01-02 | 1500        | 1000     | 500    | 50.00%
2024-01-03 | 1200        | 1500     | -300   | -20.00%
```
Common Mistakes
Mistake 1: Forgetting NULL for first/last rows
```sql
-- LAG returns NULL for first row (no previous)
-- Handle with COALESCE or default value
LAG(value, 1, 0)  -- Third param is default value
```
Mistake 2: Wrong offset
```sql
-- LAG(value, 2) gets row 2 positions back, not 2nd column
SELECT LAG(amount, 2) OVER (ORDER BY date)  -- Amount from 2 days ago
```
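LEAD mirrors LAG in the other direction — a short sketch pairing each day with the following day's total (the last row gets NULL because there is no next day):

```sql
WITH daily_sales AS (
    SELECT sale_date, SUM(amount) AS daily_total
    FROM employee_sales
    GROUP BY sale_date
)
SELECT
    sale_date,
    daily_total,
    LEAD(daily_total, 1) OVER (ORDER BY sale_date) AS next_day_total
FROM daily_sales
ORDER BY sale_date;
```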
Question 70: FIRST_VALUE and LAST_VALUE
Problem
For each order, show the customer's first ever purchase date.
Solution
```sql
SELECT
    o.order_id,
    o.user_id,
    o.order_date,
    o.total_amount,
    FIRST_VALUE(o.order_date) OVER (
        PARTITION BY o.user_id
        ORDER BY o.order_date
    ) AS first_purchase_date,
    o.order_date - FIRST_VALUE(o.order_date) OVER (
        PARTITION BY o.user_id
        ORDER BY o.order_date
    ) AS days_since_first_purchase
FROM orders o
ORDER BY o.user_id, o.order_date;
```
Explanation
- FIRST_VALUE(column): Returns value from first row in window
- LAST_VALUE(column): Returns value from last row in window
Common Mistakes
Mistake 1: LAST_VALUE default frame issue
```sql
-- LAST_VALUE often surprises people:
SELECT
    order_id,
    order_date,
    LAST_VALUE(order_date) OVER (ORDER BY order_date) AS last_date
FROM orders;
-- Returns current row's date, not actual last!

-- Why? Default frame is
-- RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW

-- Fix with explicit frame:
SELECT
    order_id,
    LAST_VALUE(order_date) OVER (
        ORDER BY order_date
        ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
    ) AS actual_last_date
FROM orders;
```
Question 71: NTH_VALUE - Getting Specific Position
Problem
Find the second highest sale for each employee.
Solution
```sql
SELECT DISTINCT
    employee_id,
    NTH_VALUE(amount, 2) OVER (
        PARTITION BY employee_id
        ORDER BY amount DESC
        ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
    ) AS second_highest_sale
FROM employee_sales;
```
Explanation
NTH_VALUE(column, n) returns value at nth position in ordered window.
Common Mistakes
Mistake 1: Same frame issue as LAST_VALUE
```sql
-- Must include full frame to access any position
ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
```
Mistake 2: NULL when fewer than n rows
```sql
-- If employee has only 1 sale, NTH_VALUE(amount, 2) returns NULL
-- Handle appropriately
```
Question 72: NTILE - Creating Buckets
Problem
Divide customers into 4 spending tiers (quartiles).
Solution
```sql
WITH customer_spending AS (
    SELECT
        user_id,
        SUM(total_amount) AS total_spent
    FROM orders
    WHERE status = 'completed'
    GROUP BY user_id
)
SELECT
    user_id,
    total_spent,
    NTILE(4) OVER (ORDER BY total_spent DESC) AS spending_quartile,
    CASE NTILE(4) OVER (ORDER BY total_spent DESC)
        WHEN 1 THEN 'Platinum'
        WHEN 2 THEN 'Gold'
        WHEN 3 THEN 'Silver'
        WHEN 4 THEN 'Bronze'
    END AS tier
FROM customer_spending
ORDER BY total_spent DESC;
```
Explanation
NTILE(n) divides rows into n roughly equal groups:
- NTILE(4): Quartiles (25% each)
- NTILE(10): Deciles (10% each)
- NTILE(100): Percentiles
```
total_spent | ntile | tier
50000       | 1     | Platinum
45000       | 1     | Platinum
35000       | 2     | Gold
30000       | 2     | Gold
...
```
Common Mistakes
Mistake 1: Uneven distribution
```sql
-- 10 rows into 4 buckets = 3, 3, 2, 2 distribution
-- Not exactly 25% each!

-- For exact percentiles, use PERCENT_RANK
SELECT
    value,
    NTILE(4) OVER (ORDER BY value) AS quartile,
    PERCENT_RANK() OVER (ORDER BY value) AS percentile
FROM data;
```
Question 73: PERCENT_RANK and CUME_DIST
Problem
Calculate percentile rank for each product's sales.
Solution
```sql
WITH product_sales AS (
    SELECT
        p.product_id,
        p.name,
        SUM(oi.quantity * oi.unit_price) AS total_revenue
    FROM products p
    JOIN order_items oi ON p.product_id = oi.product_id
    GROUP BY p.product_id, p.name
)
SELECT
    product_id,
    name,
    total_revenue,
    -- Cast to NUMERIC: in PostgreSQL these functions return DOUBLE PRECISION,
    -- and ROUND(double, int) is not defined
    ROUND((PERCENT_RANK() OVER (ORDER BY total_revenue) * 100)::NUMERIC, 2) AS percentile,
    ROUND((CUME_DIST() OVER (ORDER BY total_revenue) * 100)::NUMERIC, 2) AS cumulative_pct
FROM product_sales
ORDER BY total_revenue DESC;
```
Explanation
- PERCENT_RANK(): (rank - 1) / (total_rows - 1)
  - First row = 0, last row = 1
  - "What percentage of rows are below this?"
- CUME_DIST(): rank / total_rows
  - "What percentage of rows are at or below this?"
```
revenue | PERCENT_RANK | CUME_DIST
100     | 0.00         | 0.25   -- 25% at or below
200     | 0.33         | 0.50   -- 50% at or below
300     | 0.67         | 0.75
400     | 1.00         | 1.00
```
When to Use
- Identifying top/bottom performers
- Statistical analysis
- Relative positioning
Question 74: Aggregate Window Functions
Problem
Show each order with the customer's total spending and order count.
Solution
```sql
SELECT
    o.order_id,
    o.user_id,
    o.order_date,
    o.total_amount,
    SUM(o.total_amount) OVER (PARTITION BY o.user_id) AS customer_lifetime_value,
    COUNT(*) OVER (PARTITION BY o.user_id) AS customer_order_count,
    ROUND(
        o.total_amount * 100.0
        / SUM(o.total_amount) OVER (PARTITION BY o.user_id)
    , 2) AS pct_of_customer_total
FROM orders o
WHERE o.status = 'completed'
ORDER BY o.user_id, o.order_date;
```
Explanation
Regular aggregates (SUM, COUNT, AVG, etc.) can be used as window functions:
```
order_id | user_id | amount | customer_total | order_count | pct
1        | 1       | 100    | 450            | 3           | 22.22%
5        | 1       | 150    | 450            | 3           | 33.33%
8        | 1       | 200    | 450            | 3           | 44.44%
2        | 2       | 300    | 500            | 2           | 60.00%
4        | 2       | 200    | 500            | 2           | 40.00%
```
Key difference from GROUP BY:
- GROUP BY: One row per group
- Window: All rows retained, aggregate value repeated
Question 75: Complex Window Expressions
Problem
For each sale, calculate:
- Running total for the employee that month
- Rank within their team
- Difference from team's best performer
Solution
```sql
WITH monthly_sales AS (
    SELECT
        employee_id,
        region,
        DATE_TRUNC('month', sale_date) AS month,
        SUM(amount) AS monthly_total
    FROM employee_sales
    GROUP BY employee_id, region, DATE_TRUNC('month', sale_date)
)
SELECT
    employee_id,
    region,
    month,
    monthly_total,
    -- Running total for employee
    SUM(monthly_total) OVER (
        PARTITION BY employee_id
        ORDER BY month
    ) AS employee_ytd,
    -- Rank within region for that month
    RANK() OVER (
        PARTITION BY region, month
        ORDER BY monthly_total DESC
    ) AS rank_in_region,
    -- Best in region that month
    MAX(monthly_total) OVER (
        PARTITION BY region, month
    ) AS region_best,
    -- Gap from best
    MAX(monthly_total) OVER (
        PARTITION BY region, month
    ) - monthly_total AS gap_from_best,
    -- Percentage of region total
    ROUND(
        monthly_total * 100.0
        / SUM(monthly_total) OVER (PARTITION BY region, month)
    , 2) AS pct_of_region
FROM monthly_sales
ORDER BY month, region, rank_in_region;
```
Explanation
This combines multiple window calculations in one query:
- Different PARTITION BY for different metrics
- Running totals with ORDER BY
- Comparisons to group maximums
- Percentage calculations within groups
Performance Considerations
```sql
-- Multiple window functions can share sorts
-- Database optimizes when PARTITION BY and ORDER BY match
EXPLAIN ANALYZE
SELECT
    id,
    ROW_NUMBER() OVER (PARTITION BY region ORDER BY amount DESC),
    RANK() OVER (PARTITION BY region ORDER BY amount DESC),
    SUM(amount) OVER (PARTITION BY region)
FROM sales;
-- These share same partition/sort, efficient!

-- Different partitions require separate sorts
SELECT
    id,
    ROW_NUMBER() OVER (PARTITION BY region ORDER BY amount),
    ROW_NUMBER() OVER (PARTITION BY employee ORDER BY date)
FROM sales;
-- Two different sorts needed, less efficient
```
Summary: Aggregation Cheat Sheet
Basic Aggregates
| Function | Purpose | NULL Handling |
|---|---|---|
| COUNT(*) | Count all rows | Counts NULLs |
| COUNT(col) | Count non-NULL | Ignores NULLs |
| SUM(col) | Total | Ignores NULLs |
| AVG(col) | Average | Ignores NULLs |
| MIN/MAX | Extremes | Ignores NULLs |
GROUP BY Variants
| Syntax | Result |
|---|---|
| GROUP BY a | Groups by a |
| GROUP BY a, b | Groups by combination |
| GROUP BY ROLLUP(a, b) | Groups + subtotals |
| GROUP BY CUBE(a, b) | All combinations |
| GROUP BY GROUPING SETS(...) | Custom groups |
Window Function Anatomy
```sql
function() OVER (
    PARTITION BY columns  -- Reset for each group
    ORDER BY columns      -- Order within partition
    frame_clause          -- Which rows to include
)
```
Common Window Functions
| Function | Purpose |
|---|---|
| ROW_NUMBER() | Unique sequential numbers |
| RANK() | Ranking with gaps |
| DENSE_RANK() | Ranking without gaps |
| NTILE(n) | Divide into n buckets |
| LAG(col, n) | Previous nth row |
| LEAD(col, n) | Next nth row |
| FIRST_VALUE() | First in window |
| LAST_VALUE() | Last in window |
| PERCENT_RANK() | Percentile (0-1) |
Frame Clauses
```sql
ROWS BETWEEN start AND end

-- start/end options:
-- UNBOUNDED PRECEDING
-- n PRECEDING
-- CURRENT ROW
-- n FOLLOWING
-- UNBOUNDED FOLLOWING
```
What's Next?
In Part 4: Subqueries and CTEs, we'll cover:
- Scalar subqueries
- Correlated subqueries
- EXISTS vs IN performance
- Common Table Expressions (CTEs)
- Recursive CTEs for hierarchies
- Subquery optimization
Keep practicing these aggregation patterns - they're essential for data analysis and reporting!