SQL tutorial

Assistant (initial reply)

Here are 5 SQL questions that often appear in Data Scientist Online Assessments (OAs), especially from companies like Meta, Amazon, Google, and fintech/startups. Each includes a short problem, a sample schema, and a clean, optimized SQL solution with explanations.

1. Find the Top 3 Products by Sales in Each Category

-- Tables
Products(product_id, product_name, category)
Sales(sale_id, product_id, sale_date, revenue)

-- Solution
SELECT category, product_id, total_revenue
FROM (
  SELECT 
    p.category,
    s.product_id,
    SUM(s.revenue) AS total_revenue,
    RANK() OVER (PARTITION BY p.category ORDER BY SUM(s.revenue) DESC) AS rnk
  FROM Sales s
  JOIN Products p ON s.product_id = p.product_id
  GROUP BY p.category, s.product_id
) ranked
WHERE rnk <= 3
ORDER BY category, total_revenue DESC;

Key Concepts: RANK(), partitioning, aggregation.

2. Find Customers Who Never Purchased

-- Tables
Customers(customer_id, name)
Orders(order_id, customer_id, order_date)

-- Solution
SELECT c.customer_id, c.name
FROM Customers c
LEFT JOIN Orders o
  ON c.customer_id = o.customer_id
WHERE o.customer_id IS NULL;

Key Concepts: LEFT JOIN + IS NULL filtering to find missing relations.

3. Running Total of Daily Active Users

-- Tables
UserActivity(user_id, activity_date)

-- Solution
WITH daily AS (
  SELECT activity_date, COUNT(DISTINCT user_id) AS daily_users
  FROM UserActivity
  GROUP BY activity_date
)
SELECT 
  activity_date,
  SUM(daily_users) OVER (ORDER BY activity_date ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS running_total
FROM daily
ORDER BY activity_date;

4. Second Highest Salary

-- Table
Employee(id, name, salary)

-- Solution (ANSI SQL)
SELECT DISTINCT salary
FROM Employee
ORDER BY salary DESC
OFFSET 1 ROW FETCH NEXT 1 ROW ONLY;

-- MySQL alternative
SELECT MAX(salary) AS SecondHighestSalary
FROM Employee
WHERE salary < (SELECT MAX(salary) FROM Employee);

5. Find Users Who Made Purchases on Consecutive Days

-- Table
Purchases(user_id, purchase_date)

-- Solution
SELECT DISTINCT p1.user_id
FROM Purchases p1
JOIN Purchases p2
  ON p1.user_id = p2.user_id
 AND p2.purchase_date = p1.purchase_date + INTERVAL '1 day';

Summary: These problems exercise window functions, joins, aggregation, subqueries, and date arithmetic — all common in data scientist OAs.

User (follow-up)

I've never used sql before but i know python. walk me through this example line by line, explaining thoroughly

SELECT c.customer_id, c.name
FROM Customers c
LEFT JOIN Orders o
  ON c.customer_id = o.customer_id
WHERE o.customer_id IS NULL;

Assistant (detailed walkthrough)

Nice — I’ll walk you through that query line-by-line, show a tiny concrete example so you can see what the database does at each step, and give a Python/pandas equivalent so it feels familiar.

The query we’re explaining

SELECT c.customer_id, c.name
FROM Customers c
LEFT JOIN Orders o
  ON c.customer_id = o.customer_id
WHERE o.customer_id IS NULL;

1) What the tables look like

Customers

| customer_id | name  |
|-------------|-------|
| 1           | Alice |
| 2           | Bob   |
| 3           | Carol |

Orders

| order_id | customer_id | order_date  |
|----------|-------------|-------------|
| 100      | 1           | 2025-10-01  |
| 101      | 1           | 2025-10-02  |
| 102      | 3           | 2025-10-05  |

2) Line-by-line explanation

SELECT c.customer_id, c.name

- SELECT chooses columns to return; c. is an alias for Customers.

FROM Customers c

- FROM names the main (left) table; c is shorthand.

LEFT JOIN Orders o

- LEFT JOIN keeps every row from the left table and attaches matching rows from the right; unmatched right-side columns become NULL.

ON c.customer_id = o.customer_id

- Defines how rows are matched between tables.

WHERE o.customer_id IS NULL;

- Filters the joined result, keeping only rows where the right-side join produced NULL — i.e., customers with no orders.

3) What happens with our example data (step-by-step)

LEFT JOIN result (before WHERE)

| c.customer_id | c.name | o.order_id | o.customer_id | o.order_date |
|---------------|--------|------------|---------------|--------------|
| 1             | Alice  | 100        | 1             | 2025-10-01   |
| 1             | Alice  | 101        | 1             | 2025-10-02   |
| 2             | Bob    | NULL       | NULL          | NULL         |
| 3             | Carol  | 102        | 3             | 2025-10-05   |

After applying WHERE o.customer_id IS NULL

| c.customer_id | c.name |
|---------------|--------|
| 2             | Bob    |

4) Python / pandas equivalent

import pandas as pd

customers = pd.DataFrame({
    'customer_id': [1, 2, 3],
    'name': ['Alice', 'Bob', 'Carol']
})

orders = pd.DataFrame({
    'order_id': [100, 101, 102],
    'customer_id': [1, 1, 3],
    'order_date': ['2025-10-01', '2025-10-02', '2025-10-05']
})

merged = customers.merge(orders, on='customer_id', how='left')
no_orders = merged[merged['order_id'].isna()][['customer_id', 'name']]

print(no_orders)

Output: Bob is the only customer with no orders.

5) Performance notes

Index Orders.customer_id for large tables. Consider NOT EXISTS as an alternative if performance testing shows benefits.

User (request)

convert this whole conversation into formatted html for a website

Assistant (export)

I exported the conversation as a formatted HTML file titled Conversation_to_HTML.html. You can view and edit it in the project files / editor. The file contains styled message blocks, code snippets, and notes suitable for embedding in a website.

If you'd like the HTML adapted for a different design (light theme, print-friendly, or ready-to-publish blog post), tell me which style you prefer and I'll update it.