SQL Mastery Guide - Part 4: Subqueries and CTEs
Introduction
Welcome to Part 4 of our SQL Mastery series. In Parts 1-3, we covered SELECT fundamentals, JOINs, and aggregations. Now we explore subqueries and Common Table Expressions (CTEs) - powerful techniques for building complex queries from simpler pieces.
Subqueries let you nest queries inside other queries. CTEs provide named, reusable query blocks. Together, they're essential for solving real-world data problems.
This part covers Questions 76-100, completing our 100-question journey.
Schema Reference
We continue with our e-commerce database:
sqlCREATE TABLE users ( user_id SERIAL PRIMARY KEY, email VARCHAR(255) UNIQUE NOT NULL, full_name VARCHAR(100), created_at TIMESTAMP DEFAULT NOW(), country VARCHAR(50), is_active BOOLEAN DEFAULT true ); CREATE TABLE products ( product_id SERIAL PRIMARY KEY, name VARCHAR(255) NOT NULL, price DECIMAL(10,2), category_id INT, stock_quantity INT DEFAULT 0, created_at TIMESTAMP DEFAULT NOW() ); CREATE TABLE orders ( order_id SERIAL PRIMARY KEY, user_id INT REFERENCES users(user_id), order_date TIMESTAMP DEFAULT NOW(), status VARCHAR(20) DEFAULT 'pending', total_amount DECIMAL(10,2) ); CREATE TABLE order_items ( item_id SERIAL PRIMARY KEY, order_id INT REFERENCES orders(order_id), product_id INT REFERENCES products(product_id), quantity INT NOT NULL, unit_price DECIMAL(10,2) NOT NULL ); CREATE TABLE reviews ( review_id SERIAL PRIMARY KEY, product_id INT REFERENCES products(product_id), user_id INT REFERENCES users(user_id), rating INT CHECK (rating BETWEEN 1 AND 5), created_at TIMESTAMP DEFAULT NOW() ); -- Hierarchical table for CTE examples CREATE TABLE employees ( employee_id SERIAL PRIMARY KEY, name VARCHAR(100), manager_id INT REFERENCES employees(employee_id), department VARCHAR(50), salary DECIMAL(10,2), hire_date DATE ); CREATE TABLE categories ( category_id SERIAL PRIMARY KEY, name VARCHAR(100), parent_category_id INT REFERENCES categories(category_id) );
Question 76: Scalar Subquery in SELECT
Problem
For each product, show its price and how it compares to the average price.
Solution
sqlSELECT product_id, name, price, (SELECT AVG(price) FROM products) AS avg_price, price - (SELECT AVG(price) FROM products) AS diff_from_avg FROM products WHERE price IS NOT NULL ORDER BY diff_from_avg DESC;
Explanation
A scalar subquery returns exactly one value (one row, one column). It can be used wherever a single value is expected.
How it works:
- Inner query executes once
SELECT AVG(price) FROM products - Result (e.g., 75.50) substitutes into outer query
- Outer query runs with that constant value
Common Mistakes
Mistake 1: Subquery returns multiple rows
sql-- ERROR! Subquery returns multiple values SELECT name, (SELECT price FROM products) FROM products; -- "Subquery returns more than one row" -- Fix: Ensure single value with aggregate or LIMIT SELECT name, (SELECT MAX(price) FROM products) FROM products;
Mistake 2: Subquery returns multiple columns
sql-- ERROR! Subquery has multiple columns SELECT name, (SELECT MIN(price), MAX(price) FROM products) FROM products; -- Fix: One subquery per column SELECT name, (SELECT MIN(price) FROM products) AS min_price, (SELECT MAX(price) FROM products) AS max_price FROM products;
Performance Considerations
sql-- Scalar subquery in SELECT executes for EVERY row -- This is inefficient for row-dependent subqueries: SELECT name, (SELECT COUNT(*) FROM order_items WHERE product_id = p.product_id) FROM products p; -- Subquery runs once per product! -- Better: Use JOIN or window function SELECT p.name, COUNT(oi.item_id) AS order_count FROM products p LEFT JOIN order_items oi ON p.product_id = oi.product_id GROUP BY p.product_id, p.name;
Question 77: Subquery in WHERE - Single Value
Problem
Find products priced above average.
Solution
sqlSELECT product_id, name, price FROM products WHERE price > (SELECT AVG(price) FROM products) ORDER BY price DESC;
Explanation
Subquery in WHERE filters rows based on dynamically computed value.
Execution:
- Subquery computes AVG(price) = 75.50
- Outer query becomes
WHERE price > 75.50 - Filter and return results
Common Mistakes
Mistake 1: Comparing to multi-row subquery with operators
sql-- ERROR if subquery returns multiple rows SELECT * FROM products WHERE price > (SELECT price FROM products WHERE category_id = 1); -- Fix: Use ANY/ALL for multi-row SELECT * FROM products WHERE price > ALL (SELECT price FROM products WHERE category_id = 1);
Question 78: IN Subquery
Problem
Find users who have placed at least one order.
Solution
sqlSELECT user_id, email, full_name FROM users WHERE user_id IN (SELECT DISTINCT user_id FROM orders) ORDER BY full_name;
Explanation
IN checks if value exists in the subquery result set.
sql-- Equivalent to: WHERE user_id = 1 OR user_id = 5 OR user_id = 12 ...
Common Mistakes
Mistake 1: IN with NULL values
sql-- If subquery contains NULL, unexpected results SELECT * FROM products WHERE category_id IN (SELECT category_id FROM special_categories); -- If subquery returns (1, 2, NULL), products with category_id = 3 are correctly excluded -- But NOT IN with NULL is problematic (see next question)
Performance Considerations
sql-- IN subquery often converted to semi-join by optimizer EXPLAIN ANALYZE SELECT * FROM users WHERE user_id IN (SELECT user_id FROM orders); -- Shows: Hash Semi Join (efficient) -- Not: Subquery executed for each row -- For large IN lists from subqueries, consider JOIN: SELECT DISTINCT u.* FROM users u JOIN orders o ON u.user_id = o.user_id;
Question 79: NOT IN Subquery - The NULL Trap
Problem
Find users who have never placed an order.
Solution
sql-- Safe version (handles NULLs) SELECT user_id, email, full_name FROM users WHERE user_id NOT IN ( SELECT user_id FROM orders WHERE user_id IS NOT NULL -- Critical! ); -- Or better, use NOT EXISTS (see Question 82)
Explanation
NOT IN has a dangerous NULL trap:
sql-- If orders.user_id contains any NULL: SELECT * FROM users WHERE user_id NOT IN (SELECT user_id FROM orders); -- Returns ZERO rows! Even for users who never ordered. -- Why? -- NOT IN (1, 2, NULL) means: -- user_id != 1 AND user_id != 2 AND user_id != NULL -- user_id != NULL is UNKNOWN, not TRUE -- UNKNOWN AND anything = UNKNOWN -- So WHERE clause is never TRUE
Common Mistakes
Mistake 1: Forgetting NULL in subquery
sql-- Dangerous: Will return nothing if any NULL in subquery SELECT * FROM users WHERE user_id NOT IN (SELECT user_id FROM orders); -- Safe alternatives: -- 1. Filter NULLs WHERE user_id NOT IN (SELECT user_id FROM orders WHERE user_id IS NOT NULL) -- 2. Use NOT EXISTS (recommended) WHERE NOT EXISTS (SELECT 1 FROM orders WHERE orders.user_id = users.user_id) -- 3. Use LEFT JOIN + NULL check SELECT u.* FROM users u LEFT JOIN orders o ON u.user_id = o.user_id WHERE o.order_id IS NULL;
Performance Considerations
sql-- NOT IN can be slow - often becomes anti-join or filter -- NOT EXISTS is usually better optimized -- LEFT JOIN with NULL check often fastest for large tables EXPLAIN ANALYZE SELECT * FROM users WHERE NOT EXISTS (SELECT 1 FROM orders WHERE orders.user_id = users.user_id); -- Shows: Hash Anti Join (efficient)
Question 80: ANY and ALL
Problem
Find products more expensive than any product in category 1.
Solution
sql-- ANY: More expensive than at least one category-1 product SELECT product_id, name, price FROM products WHERE price > ANY ( SELECT price FROM products WHERE category_id = 1 ); -- ALL: More expensive than every category-1 product SELECT product_id, name, price FROM products WHERE price > ALL ( SELECT price FROM products WHERE category_id = 1 );
Explanation
- > ANY (subquery): Greater than at least one value (same as > MIN)
- > ALL (subquery): Greater than every value (same as > MAX)
- = ANY (subquery): Same as IN
sql-- Equivalents: > ANY (1, 5, 10) -- Same as > 1 (the minimum) > ALL (1, 5, 10) -- Same as > 10 (the maximum) < ANY (1, 5, 10) -- Same as < 10 (the maximum) < ALL (1, 5, 10) -- Same as < 1 (the minimum)
Common Mistakes
Mistake 1: Confusing ANY and ALL
sql-- "More expensive than any product" means > MIN (lowest bar) -- "More expensive than all products" means > MAX (highest bar) -- Usually you want ALL for "beat everyone" scenarios
Mistake 2: Empty subquery behavior
sql-- If subquery returns no rows: > ALL (empty) -- TRUE (vacuously true, no values to beat) > ANY (empty) -- FALSE (no values to compare to)
Question 81: EXISTS Subquery
Problem
Find products that have at least one review.
Solution
sqlSELECT p.product_id, p.name FROM products p WHERE EXISTS ( SELECT 1 FROM reviews r WHERE r.product_id = p.product_id );
Explanation
EXISTS returns TRUE if subquery returns at least one row, FALSE otherwise.
Key characteristics:
- Stops at first match (short-circuit)
- Doesn't care about values, only existence
- Usually correlates with outer query
Common Mistakes
Mistake 1: Using SELECT * in EXISTS
sql-- Works but misleading - columns don't matter WHERE EXISTS (SELECT * FROM reviews WHERE ...) -- Better: Use SELECT 1 to be clear WHERE EXISTS (SELECT 1 FROM reviews WHERE ...)
Mistake 2: Non-correlated EXISTS
sql-- This EXISTS returns same result for every row WHERE EXISTS (SELECT 1 FROM reviews WHERE rating > 3) -- Either TRUE for all rows or FALSE for all rows -- Usually you want correlated: WHERE EXISTS (SELECT 1 FROM reviews WHERE reviews.product_id = products.product_id)
Performance Considerations
sql-- EXISTS is optimized to stop at first match -- For checking existence, EXISTS beats COUNT -- Bad: WHERE (SELECT COUNT(*) FROM reviews WHERE product_id = p.product_id) > 0 -- Counts all reviews unnecessarily -- Good: WHERE EXISTS (SELECT 1 FROM reviews WHERE product_id = p.product_id) -- Stops at first review found
Question 82: NOT EXISTS - Finding Missing Relations
Problem
Find users who have never written a review.
Solution
sqlSELECT u.user_id, u.full_name, u.email FROM users u WHERE NOT EXISTS ( SELECT 1 FROM reviews r WHERE r.user_id = u.user_id );
Explanation
NOT EXISTS returns TRUE if subquery returns zero rows.
This is the safest way to find "missing" relationships - no NULL trap like NOT IN.
Common Mistakes
Mistake 1: Using NOT IN instead
sql-- Dangerous if reviews.user_id can be NULL SELECT * FROM users WHERE user_id NOT IN (SELECT user_id FROM reviews); -- Safe SELECT * FROM users u WHERE NOT EXISTS (SELECT 1 FROM reviews r WHERE r.user_id = u.user_id);
Performance Considerations
sql-- NOT EXISTS typically becomes Anti Semi Join EXPLAIN ANALYZE SELECT * FROM users u WHERE NOT EXISTS (SELECT 1 FROM orders o WHERE o.user_id = u.user_id); -- Ensure index on the correlated column CREATE INDEX idx_orders_user ON orders(user_id);
Question 83: Correlated Subquery
Problem
Find each customer's most recent order.
Solution
sqlSELECT o.order_id, o.user_id, o.order_date, o.total_amount FROM orders o WHERE o.order_date = ( SELECT MAX(o2.order_date) FROM orders o2 WHERE o2.user_id = o.user_id );
Explanation
A correlated subquery references columns from the outer query. It executes once per outer row.
Execution:
- For user_id = 1: Find MAX(order_date) where user_id = 1
- Check if current row's date matches
- Repeat for each order row
Common Mistakes
Mistake 1: Thinking it's efficient
sql-- Correlated subqueries execute per row = potentially slow -- For 10,000 orders, subquery runs 10,000 times -- Often better to rewrite as JOIN: SELECT o.* FROM orders o JOIN ( SELECT user_id, MAX(order_date) AS max_date FROM orders GROUP BY user_id ) latest ON o.user_id = latest.user_id AND o.order_date = latest.max_date; -- Or use window function: SELECT * FROM ( SELECT *, ROW_NUMBER() OVER (PARTITION BY user_id ORDER BY order_date DESC) AS rn FROM orders ) sub WHERE rn = 1;
Performance Considerations
sql-- Correlated subqueries need careful indexing -- Ensure index on correlation columns CREATE INDEX idx_orders_user_date ON orders(user_id, order_date); -- Check execution plan EXPLAIN ANALYZE SELECT ...; -- Look for: "SubPlan" or "InitPlan" indicators
Question 84: Subquery in FROM (Derived Table)
Problem
Calculate the percentage of total revenue each category represents.
Solution
sqlSELECT category_revenue.category_id, c.name AS category_name, category_revenue.revenue, total.total_revenue, ROUND(category_revenue.revenue * 100.0 / total.total_revenue, 2) AS pct_of_total FROM ( SELECT p.category_id, SUM(oi.quantity * oi.unit_price) AS revenue FROM products p JOIN order_items oi ON p.product_id = oi.product_id GROUP BY p.category_id ) AS category_revenue CROSS JOIN ( SELECT SUM(quantity * unit_price) AS total_revenue FROM order_items ) AS total JOIN categories c ON category_revenue.category_id = c.category_id ORDER BY revenue DESC;
Explanation
A subquery in FROM creates a derived table (or inline view). It's treated like a regular table.
Use cases:
- Pre-aggregating data before joining
- Creating intermediate results
- Simplifying complex queries
Common Mistakes
Mistake 1: Forgetting alias
sql-- ERROR: Derived table needs alias SELECT * FROM (SELECT product_id FROM products); -- Correct SELECT * FROM (SELECT product_id FROM products) AS derived;
Mistake 2: Selecting columns not in derived table
sql-- ERROR: 'name' not available SELECT name FROM (SELECT product_id FROM products) AS p; -- Derived table only has product_id
Performance Considerations
sql-- Derived tables can be materialized or merged into outer query -- Check EXPLAIN to see what optimizer does -- Tip: For reused derived tables, consider CTE (more readable)
Question 85: Common Table Expression (CTE) Basics
Problem
Find users whose total spending exceeds the average total spending per user.
Solution
sqlWITH user_spending AS ( SELECT user_id, SUM(total_amount) AS total_spent FROM orders WHERE status = 'completed' GROUP BY user_id ), avg_spending AS ( SELECT AVG(total_spent) AS avg_total FROM user_spending ) SELECT us.user_id, u.full_name, us.total_spent, avs.avg_total FROM user_spending us CROSS JOIN avg_spending avs JOIN users u ON us.user_id = u.user_id WHERE us.total_spent > avs.avg_total ORDER BY us.total_spent DESC;
Explanation
CTE (Common Table Expression) is a named temporary result set defined with
WITH.Benefits:
- Improves readability
- Can be referenced multiple times
- Breaks complex queries into logical steps
Syntax:
sqlWITH cte_name AS ( SELECT ... ), another_cte AS ( SELECT ... FROM cte_name -- Can reference previous CTE ) SELECT ... FROM cte_name JOIN another_cte ...;
Common Mistakes
Mistake 1: Trying to use CTE after semicolon
sql-- ERROR: CTE scope ends at semicolon WITH my_cte AS (...); SELECT * FROM my_cte; -- CTE not found! -- Correct: No semicolon before final SELECT WITH my_cte AS (...) SELECT * FROM my_cte;
Mistake 2: Referencing CTE before it's defined
sql-- ERROR: cte2 references cte1 before it exists WITH cte2 AS (SELECT * FROM cte1), cte1 AS (SELECT 1) SELECT * FROM cte2; -- Correct order: WITH cte1 AS (SELECT 1), cte2 AS (SELECT * FROM cte1) SELECT * FROM cte2;
Performance Considerations
sql-- CTE behavior varies by database: -- PostgreSQL < 12: CTEs are optimization barriers (always materialized) -- PostgreSQL >= 12: Optimizer can inline CTEs -- MySQL 8.0+: CTEs supported, optimizer can inline -- SQL Server: Inline by default -- Force materialization when needed: WITH cte AS MATERIALIZED (...) -- PostgreSQL 12+
Question 86: Multiple CTEs
Problem
Create a sales report showing daily, weekly, and monthly summaries.
Solution
sqlWITH daily_sales AS ( SELECT DATE(order_date) AS sale_date, SUM(total_amount) AS daily_total FROM orders WHERE status = 'completed' GROUP BY DATE(order_date) ), weekly_sales AS ( SELECT DATE_TRUNC('week', sale_date) AS week_start, SUM(daily_total) AS weekly_total FROM daily_sales GROUP BY DATE_TRUNC('week', sale_date) ), monthly_sales AS ( SELECT DATE_TRUNC('month', sale_date) AS month_start, SUM(daily_total) AS monthly_total FROM daily_sales GROUP BY DATE_TRUNC('month', sale_date) ) SELECT d.sale_date, d.daily_total, w.weekly_total, m.monthly_total, ROUND(d.daily_total * 100.0 / m.monthly_total, 2) AS pct_of_month FROM daily_sales d JOIN weekly_sales w ON DATE_TRUNC('week', d.sale_date) = w.week_start JOIN monthly_sales m ON DATE_TRUNC('month', d.sale_date) = m.month_start ORDER BY d.sale_date;
Explanation
Multiple CTEs create a pipeline:
daily_sales: Base aggregationweekly_sales: Built from daily_salesmonthly_sales: Also from daily_sales- Final SELECT: Combines all levels
This is much cleaner than nested subqueries.
Question 87: CTE vs Subquery - When to Use Which
Problem
Demonstrate equivalent query with CTE and subquery.
Solution
sql-- Subquery version (nested, harder to read) SELECT p.product_id, p.name, p.price, stats.avg_price, stats.max_price FROM products p CROSS JOIN ( SELECT AVG(price) AS avg_price, MAX(price) AS max_price FROM products ) AS stats WHERE p.price > stats.avg_price; -- CTE version (flat, easier to read) WITH product_stats AS ( SELECT AVG(price) AS avg_price, MAX(price) AS max_price FROM products ) SELECT p.product_id, p.name, p.price, ps.avg_price, ps.max_price FROM products p CROSS JOIN product_stats ps WHERE p.price > ps.avg_price;
Explanation
Use CTE when:
- Query logic has multiple steps
- Same derived table needed multiple times
- Recursion required
- Readability is priority
Use Subquery when:
- Simple, one-time use
- Performance critical (subqueries may optimize better)
- CTE not supported (older databases)
Performance Considerations
sql-- CTE referenced twice vs subquery twice: -- CTE (potentially computed once, stored) WITH expensive_calc AS ( SELECT ... -- complex calculation ) SELECT * FROM expensive_calc UNION ALL SELECT * FROM expensive_calc; -- Subquery (may compute twice) SELECT * FROM (SELECT ... ) AS t1 UNION ALL SELECT * FROM (SELECT ... ) AS t2; -- Modern optimizers often handle both efficiently -- Test with EXPLAIN ANALYZE for your specific case
Question 88: Recursive CTE - Employee Hierarchy
Problem
Build complete management hierarchy for each employee.
Solution
sqlWITH RECURSIVE employee_hierarchy AS ( -- Base case: Top-level managers (no manager) SELECT employee_id, name, manager_id, 1 AS level, name AS hierarchy_path FROM employees WHERE manager_id IS NULL UNION ALL -- Recursive case: Employees with managers SELECT e.employee_id, e.name, e.manager_id, eh.level + 1, eh.hierarchy_path || ' > ' || e.name FROM employees e JOIN employee_hierarchy eh ON e.manager_id = eh.employee_id ) SELECT employee_id, name, level, hierarchy_path FROM employee_hierarchy ORDER BY hierarchy_path;
Explanation
Recursive CTE processes hierarchical or graph data:
- Anchor member: Starting point (employees with no manager)
- Recursive member: References the CTE itself, linked via JOIN
- UNION ALL: Combines anchor and recursive results
- Termination: Stops when recursive member returns no rows
Iteration 1 (anchor): CEO Iteration 2: VP1, VP2 (report to CEO) Iteration 3: Managers (report to VPs) Iteration 4: Staff (report to managers) ... until no more employees found
Common Mistakes
Mistake 1: Infinite recursion
sql-- Missing termination condition WITH RECURSIVE bad AS ( SELECT 1 AS n UNION ALL SELECT n + 1 FROM bad -- Never stops! ) SELECT * FROM bad; -- Add termination: WITH RECURSIVE safe AS ( SELECT 1 AS n UNION ALL SELECT n + 1 FROM safe WHERE n < 100 -- Stops at 100 ) SELECT * FROM safe;
Mistake 2: Using UNION instead of UNION ALL
sql-- UNION removes duplicates, may cause issues WITH RECURSIVE hierarchy AS ( ... UNION -- Might eliminate legitimate duplicate values ... ) -- Usually use UNION ALL for recursive CTEs
Performance Considerations
sql-- Index on the recursive join column CREATE INDEX idx_employees_manager ON employees(manager_id); -- Set recursion limit (PostgreSQL) SET max_recursive_iterations = 100; -- Watch for exponential growth in badly structured data
Question 89: Recursive CTE - Category Tree
Problem
Show all categories with their full path (e.g., "Electronics > Computers > Laptops").
Solution
sqlWITH RECURSIVE category_tree AS ( -- Root categories (no parent) SELECT category_id, name, parent_category_id, name AS full_path, 1 AS depth FROM categories WHERE parent_category_id IS NULL UNION ALL -- Child categories SELECT c.category_id, c.name, c.parent_category_id, ct.full_path || ' > ' || c.name, ct.depth + 1 FROM categories c JOIN category_tree ct ON c.parent_category_id = ct.category_id ) SELECT category_id, name, full_path, depth FROM category_tree ORDER BY full_path;
Explanation
This builds hierarchical paths:
category_id | name | full_path | depth 1 | Electronics| Electronics | 1 2 | Computers | Electronics > Computers | 2 3 | Laptops | Electronics > Computers > Laptops | 3 4 | Desktops | Electronics > Computers > Desktops | 3 5 | Clothing | Clothing | 1
Useful for:
- Breadcrumb navigation
- Sitemap generation
- Hierarchical dropdowns
Question 90: Recursive CTE - Finding All Descendants
Problem
Find all employees who report to a specific manager (at any level).
Solution
sqlWITH RECURSIVE all_reports AS ( -- Direct reports to manager 5 SELECT employee_id, name, manager_id, 1 AS report_level FROM employees WHERE manager_id = 5 UNION ALL -- Reports of reports (recursive) SELECT e.employee_id, e.name, e.manager_id, ar.report_level + 1 FROM employees e JOIN all_reports ar ON e.manager_id = ar.employee_id ) SELECT * FROM all_reports ORDER BY report_level, name;
Explanation
Starting from manager_id = 5:
- Find all direct reports
- Find all their reports
- Continue until no more reports
Useful for:
- Cascade delete checks
- Permission hierarchies
- Organizational reporting
Question 91: Recursive CTE - Generating Series
Problem
Generate a date series for the last 30 days (without generate_series function).
Solution
sqlWITH RECURSIVE date_series AS ( SELECT CURRENT_DATE - INTERVAL '29 days' AS date_val UNION ALL SELECT date_val + INTERVAL '1 day' FROM date_series WHERE date_val < CURRENT_DATE ) SELECT date_val::DATE AS date FROM date_series;
Explanation
This generates dates without relying on built-in functions:
date_val 2024-01-01 2024-01-02 ... 2024-01-30
Use cases:
- Fill gaps in time series data
- Create calendar tables
- Generate test data
Alternative with generate_series (PostgreSQL)
sql-- If available, use built-in function SELECT generate_series( CURRENT_DATE - INTERVAL '29 days', CURRENT_DATE, INTERVAL '1 day' )::DATE AS date;
Question 92: Subquery for Row-by-Row Calculations
Problem
For each product, show units sold and revenue, compared to category average.
Solution
sqlSELECT p.product_id, p.name, p.category_id, COALESCE(SUM(oi.quantity), 0) AS units_sold, COALESCE(SUM(oi.quantity * oi.unit_price), 0) AS revenue, ( SELECT ROUND(AVG(sub_revenue), 2) FROM ( SELECT SUM(oi2.quantity * oi2.unit_price) AS sub_revenue FROM products p2 LEFT JOIN order_items oi2 ON p2.product_id = oi2.product_id WHERE p2.category_id = p.category_id GROUP BY p2.product_id ) category_products ) AS category_avg_revenue FROM products p LEFT JOIN order_items oi ON p.product_id = oi.product_id GROUP BY p.product_id, p.name, p.category_id ORDER BY revenue DESC;
Explanation
The correlated subquery calculates average revenue for products in the same category.
Better approach using CTE:
sqlWITH product_revenue AS ( SELECT p.product_id, p.name, p.category_id, COALESCE(SUM(oi.quantity * oi.unit_price), 0) AS revenue FROM products p LEFT JOIN order_items oi ON p.product_id = oi.product_id GROUP BY p.product_id, p.name, p.category_id ), category_avg AS ( SELECT category_id, ROUND(AVG(revenue), 2) AS avg_revenue FROM product_revenue GROUP BY category_id ) SELECT pr.*, ca.avg_revenue AS category_avg_revenue, pr.revenue - ca.avg_revenue AS diff_from_category_avg FROM product_revenue pr JOIN category_avg ca ON pr.category_id = ca.category_id ORDER BY pr.revenue DESC;
Question 93: Lateral Join (Correlated Subquery in FROM)
Problem
For each user, get their 3 most recent orders.
Solution
sql-- PostgreSQL / MySQL 8+ syntax SELECT u.user_id, u.full_name, recent_orders.order_id, recent_orders.order_date, recent_orders.total_amount FROM users u CROSS JOIN LATERAL ( SELECT order_id, order_date, total_amount FROM orders o WHERE o.user_id = u.user_id ORDER BY order_date DESC LIMIT 3 ) AS recent_orders ORDER BY u.user_id, recent_orders.order_date DESC;
Explanation
LATERAL allows subquery in FROM to reference columns from preceding tables.
Without LATERAL:
sql-- ERROR: Cannot reference u.user_id in subquery SELECT * FROM users u JOIN ( SELECT * FROM orders WHERE user_id = u.user_id -- Error! ) AS o ON true;
With LATERAL:
sql-- Works: Subquery can see u.user_id SELECT * FROM users u CROSS JOIN LATERAL ( SELECT * FROM orders WHERE user_id = u.user_id ) AS o;
Common Mistakes
Mistake 1: Using CROSS JOIN LATERAL with empty results
sql-- CROSS JOIN LATERAL excludes users with no matching orders -- Use LEFT JOIN LATERAL to include them: SELECT u.*, recent.* FROM users u LEFT JOIN LATERAL ( SELECT * FROM orders WHERE user_id = u.user_id LIMIT 3 ) AS recent ON true;
Performance Considerations
sql-- LATERAL is like correlated subquery in FROM clause -- Ensure proper indexes CREATE INDEX idx_orders_user_date ON orders(user_id, order_date DESC); -- Alternative: Window function approach WITH ranked_orders AS ( SELECT o.*, ROW_NUMBER() OVER (PARTITION BY user_id ORDER BY order_date DESC) AS rn FROM orders o ) SELECT u.*, ro.* FROM users u JOIN ranked_orders ro ON u.user_id = ro.user_id AND ro.rn <= 3;
Question 94: Subquery in UPDATE
Problem
Update product prices to match the average price in their category.
Solution
sql-- Set each product's price to its category average UPDATE products p SET price = ( SELECT AVG(p2.price) FROM products p2 WHERE p2.category_id = p.category_id ) WHERE category_id IS NOT NULL; -- Better: Using CTE (PostgreSQL) WITH category_averages AS ( SELECT category_id, AVG(price) AS avg_price FROM products GROUP BY category_id ) UPDATE products p SET price = ca.avg_price FROM category_averages ca WHERE p.category_id = ca.category_id;
Explanation
Subqueries in UPDATE can:
- Set values based on calculations
- Reference the row being updated
- Join with other tables for complex logic
Common Mistakes
Mistake 1: Subquery returns multiple rows
sql-- ERROR if subquery returns multiple values UPDATE products SET price = (SELECT price FROM other_prices); -- Ensure single value per row UPDATE products p SET price = (SELECT op.price FROM other_prices op WHERE op.product_id = p.product_id);
Mistake 2: Not considering NULL
sql-- If subquery returns NULL, price becomes NULL UPDATE products SET price = (SELECT ... ); -- Use COALESCE for safety UPDATE products SET price = COALESCE((SELECT ...), price); -- Keep original if NULL
Question 95: Subquery in DELETE
Problem
Delete orders that have no order items.
Solution
sql-- Using NOT EXISTS (safest) DELETE FROM orders o WHERE NOT EXISTS ( SELECT 1 FROM order_items oi WHERE oi.order_id = o.order_id ); -- Using NOT IN (watch for NULLs) DELETE FROM orders WHERE order_id NOT IN ( SELECT DISTINCT order_id FROM order_items WHERE order_id IS NOT NULL ); -- Using LEFT JOIN (PostgreSQL) DELETE FROM orders o USING ( SELECT o2.order_id FROM orders o2 LEFT JOIN order_items oi ON o2.order_id = oi.order_id WHERE oi.item_id IS NULL ) AS empty_orders WHERE o.order_id = empty_orders.order_id;
Explanation
Similar to SELECT, DELETE can use subqueries to identify rows to remove.
Common Mistakes
Mistake 1: Deleting too much with poorly scoped subquery
sql-- Dangerous: Delete all orders if ANY order has no items DELETE FROM orders WHERE (SELECT COUNT(*) FROM order_items) = 0; -- Correct: Correlate the subquery DELETE FROM orders o WHERE (SELECT COUNT(*) FROM order_items oi WHERE oi.order_id = o.order_id) = 0;
Question 96: Subquery in INSERT
Problem
Insert top 10 customers into a VIP table.
Solution
sql-- Create VIP table CREATE TABLE vip_customers ( user_id INT PRIMARY KEY, full_name VARCHAR(100), total_spent DECIMAL(12,2), added_date TIMESTAMP DEFAULT NOW() ); -- Insert top spenders INSERT INTO vip_customers (user_id, full_name, total_spent) SELECT u.user_id, u.full_name, SUM(o.total_amount) AS total_spent FROM users u JOIN orders o ON u.user_id = o.user_id WHERE o.status = 'completed' GROUP BY u.user_id, u.full_name ORDER BY total_spent DESC LIMIT 10;
Explanation
INSERT ... SELECT inserts results of a query into a table.Use cases:
- Archiving data
- Creating summary tables
- ETL operations
- Populating denormalized tables
Common Mistakes
Mistake 1: Column mismatch
sql-- ERROR: Column count/types don't match INSERT INTO vip_customers (user_id, full_name, total_spent) SELECT user_id, full_name -- Missing third column! FROM users;
Mistake 2: Duplicate key violations
sql-- If VIP already exists, INSERT fails INSERT INTO vip_customers (user_id, ...) SELECT ... -- Might include existing user_id -- Use ON CONFLICT (PostgreSQL) or INSERT IGNORE (MySQL) INSERT INTO vip_customers (user_id, full_name, total_spent) SELECT ... ON CONFLICT (user_id) DO UPDATE SET total_spent = EXCLUDED.total_spent;
Question 97: CTE with INSERT/UPDATE/DELETE
Problem
Archive old orders and delete them in one operation.
Solution
sql-- PostgreSQL: CTE with data modification WITH archived AS ( INSERT INTO orders_archive (order_id, user_id, order_date, total_amount, archived_at) SELECT order_id, user_id, order_date, total_amount, NOW() FROM orders WHERE order_date < CURRENT_DATE - INTERVAL '2 years' RETURNING order_id ) DELETE FROM orders WHERE order_id IN (SELECT order_id FROM archived);
Explanation
PostgreSQL allows data-modifying CTEs:
- CTE inserts old orders into archive, returns IDs
- Main query deletes those orders from original table
- Both happen in same transaction
Common Mistakes
Mistake 1: Assuming all databases support this
sql-- This is PostgreSQL-specific -- MySQL/SQL Server need separate statements: INSERT INTO orders_archive SELECT ... FROM orders WHERE ...; DELETE FROM orders WHERE ...;
Question 98: Complex Business Query - Customer Cohort Analysis
Problem
Analyze customer retention by signup month (cohort analysis).
Solution
sqlWITH customer_cohorts AS ( SELECT user_id, DATE_TRUNC('month', created_at) AS cohort_month FROM users ), customer_activities AS ( SELECT o.user_id, DATE_TRUNC('month', o.order_date) AS activity_month FROM orders o WHERE o.status = 'completed' GROUP BY o.user_id, DATE_TRUNC('month', o.order_date) ), cohort_activity AS ( SELECT cc.cohort_month, ca.activity_month, COUNT(DISTINCT cc.user_id) AS active_users FROM customer_cohorts cc JOIN customer_activities ca ON cc.user_id = ca.user_id WHERE ca.activity_month >= cc.cohort_month GROUP BY cc.cohort_month, ca.activity_month ), cohort_sizes AS ( SELECT cohort_month, COUNT(DISTINCT user_id) AS cohort_size FROM customer_cohorts GROUP BY cohort_month ) SELECT ca.cohort_month, cs.cohort_size, ca.activity_month, EXTRACT(MONTH FROM AGE(ca.activity_month, ca.cohort_month)) AS months_since_signup, ca.active_users, ROUND(ca.active_users * 100.0 / cs.cohort_size, 2) AS retention_pct FROM cohort_activity ca JOIN cohort_sizes cs ON ca.cohort_month = cs.cohort_month ORDER BY ca.cohort_month, ca.activity_month;
Explanation
This creates a retention matrix:
- Cohort: Group of customers by signup month
- Activity: When they placed orders
- Retention: Percentage still active N months later
cohort | size | month_0 | month_1 | month_2 | ... 2024-01 | 100 | 100% | 45% | 32% | 2024-02 | 150 | 100% | 50% | 38% |
Use Cases
- Measuring product stickiness
- Identifying when customers churn
- Comparing cohort quality over time
Question 99: Complex Business Query - Funnel Analysis
Problem
Analyze conversion funnel: page view > add to cart > checkout > purchase.
Solution
sqlWITH page_views AS ( SELECT DISTINCT user_id, DATE(viewed_at) AS event_date FROM page_views WHERE page_url LIKE '/product/%' ), cart_adds AS ( SELECT DISTINCT user_id, DATE(created_at) AS event_date FROM cart_items ), checkouts AS ( SELECT DISTINCT user_id, DATE(created_at) AS event_date FROM checkout_sessions ), purchases AS ( SELECT DISTINCT user_id, DATE(order_date) AS event_date FROM orders WHERE status IN ('completed', 'processing') ), daily_funnel AS ( SELECT pv.event_date, COUNT(DISTINCT pv.user_id) AS viewed, COUNT(DISTINCT ca.user_id) AS added_to_cart, COUNT(DISTINCT co.user_id) AS checked_out, COUNT(DISTINCT pu.user_id) AS purchased FROM page_views pv LEFT JOIN cart_adds ca ON pv.user_id = ca.user_id AND pv.event_date = ca.event_date LEFT JOIN checkouts co ON ca.user_id = co.user_id AND ca.event_date = co.event_date LEFT JOIN purchases pu ON co.user_id = pu.user_id AND co.event_date = pu.event_date GROUP BY pv.event_date ) SELECT event_date, viewed, added_to_cart, ROUND(added_to_cart * 100.0 / NULLIF(viewed, 0), 2) AS view_to_cart_pct, checked_out, ROUND(checked_out * 100.0 / NULLIF(added_to_cart, 0), 2) AS cart_to_checkout_pct, purchased, ROUND(purchased * 100.0 / NULLIF(checked_out, 0), 2) AS checkout_to_purchase_pct, ROUND(purchased * 100.0 / NULLIF(viewed, 0), 2) AS overall_conversion_pct FROM daily_funnel ORDER BY event_date;
Explanation
Funnel analysis tracks user progression through stages:
event_date | viewed | cart | checkout | purchased | conversion 2024-01-15 | 1000 | 300 | 150 | 120 | 12%
Each stage shows:
- Absolute numbers
- Conversion rate from previous stage
- Overall conversion (end-to-end)
Question 100: Complex Business Query - Revenue Attribution
Problem
Attribute revenue to first-touch marketing channel.
Solution
sqlWITH first_touch AS ( -- Find first touchpoint for each user SELECT DISTINCT ON (user_id) user_id, channel, campaign, touch_date FROM marketing_touches ORDER BY user_id, touch_date ASC ), user_revenue AS ( -- Total revenue per user SELECT user_id, SUM(total_amount) AS lifetime_revenue, COUNT(*) AS order_count, MIN(order_date) AS first_order, MAX(order_date) AS last_order FROM orders WHERE status = 'completed' GROUP BY user_id ) SELECT ft.channel, ft.campaign, COUNT(DISTINCT ft.user_id) AS acquired_users, COUNT(DISTINCT ur.user_id) AS converted_users, ROUND(COUNT(DISTINCT ur.user_id) * 100.0 / COUNT(DISTINCT ft.user_id), 2) AS conversion_rate, COALESCE(SUM(ur.lifetime_revenue), 0) AS attributed_revenue, ROUND(COALESCE(AVG(ur.lifetime_revenue), 0), 2) AS avg_customer_value FROM first_touch ft LEFT JOIN user_revenue ur ON ft.user_id = ur.user_id GROUP BY ft.channel, ft.campaign ORDER BY attributed_revenue DESC;
Explanation
First-touch attribution credits revenue to the channel that first introduced the customer:
channel | campaign | users | converted | revenue | avg_value google | brand_2024 | 5000 | 800 | $120,000 | $150 facebook | retargeting | 3000 | 450 | $67,500 | $150 organic | NULL | 10000 | 1200 | $180,000 | $150
Use Cases
- Marketing ROI analysis
- Budget allocation
- Campaign optimization
Summary: Subqueries and CTEs Cheat Sheet
Subquery Types
| Type | Location | Returns | Example |
|---|---|---|---|
| Scalar | SELECT, WHERE | Single value | (SELECT MAX(price) FROM products) |
| Row | WHERE | Single row | WHERE (a, b) = (SELECT ...) |
| Table | FROM | Multiple rows/cols | FROM (SELECT ...) AS t |
| Correlated | Any | Depends on outer | WHERE price > (SELECT AVG(price) WHERE category_id = p.category_id) |
Subquery Operators
| Operator | Meaning |
|---|---|
| IN | Value in set |
| NOT IN | Value not in set (watch NULLs!) |
| EXISTS | Subquery returns rows |
| NOT EXISTS | Subquery returns no rows |
| ANY/SOME | Comparison to any value |
| ALL | Comparison to all values |
CTE Syntax
sqlWITH cte_name AS ( SELECT ... ), cte2 AS ( SELECT ... FROM cte_name ) SELECT ... FROM cte_name JOIN cte2 ...;
Recursive CTE Pattern
sqlWITH RECURSIVE cte AS ( -- Anchor: starting rows SELECT ... WHERE condition UNION ALL -- Recursive: reference CTE SELECT ... FROM table JOIN cte ON ... ) SELECT * FROM cte;
Performance Tips
- EXISTS vs IN: EXISTS often faster, no NULL issues
- NOT EXISTS vs NOT IN: Always prefer NOT EXISTS
- Correlated vs JOIN: JOIN usually faster for large datasets
- CTE materialization: Check if your database materializes or inlines
- LATERAL: Useful but like correlated subquery in FROM
Congratulations!
You've completed all 100 questions in the SQL Mastery Guide!
What you've learned:
- Part 1 (Q1-25): SELECT fundamentals, filtering, sorting
- Part 2 (Q26-50): All types of JOINs
- Part 3 (Q51-75): Aggregations and window functions
- Part 4 (Q76-100): Subqueries and CTEs
Next steps:
- Practice these patterns on real data
- Use EXPLAIN ANALYZE to understand query performance
- Learn your specific database's optimizations
- Build increasingly complex queries combining all techniques
Happy querying!