Supervised Learning Math (for humans, not robots)

Alright… Let’s talk about supervised learning, but in a way that does not feel like a math exam from hell.

I’ll explain the core math ideas, why they matter, and throw in Clojure and Java code as we go.
No scary words. No academic nonsense. Just the stuff that actually matters.

Think of this as:

“Math that makes machines less dumb”


1. What supervised learning really is

At its heart, supervised learning is pattern matching with feedback.

You give the model:

  • Inputs: numbers (features)
  • Outputs: the right answer (labels)

Example:

x = [size, rooms]
y = price

The model:

  • guesses a price
  • compares with the real price
  • feels bad
  • adjusts itself
  • repeats

That’s it. Everything else is details.
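
For example, a toy housing dataset might look like this in Clojure (the numbers are invented, just to show the shape of the data):

(def xs [[50 2] [80 3] [120 4]])  ;; each x is [size rooms]
(def ys [150000 240000 360000])   ;; each y is a price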


2. The core equation (no panic)

Most supervised models boil down to this:

prediction = f(x, weights) + bias

For a simple linear model:

y_hat = w * x + b

Where:

  • x = input
  • w = weight (how important x is)
  • b = bias (base value)
  • y_hat = predicted output

This is just a line. Yes. High school math.
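
Plug in some made-up numbers and there is nowhere for complexity to hide:

(let [w 2.0 x 3.0 b 1.0]
  (+ (* w x) b))
;; => 7.0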


3. Loss function: how wrong am I?

The model needs a way to measure how bad the prediction was.

The most common one:

Mean Squared Error (MSE)

loss = (y_hat - y)^2

Why square?

  • no negative values
  • big mistakes hurt more

Clojure example

(defn mse [y y-hat]
  (let [diff (- y-hat y)]
    (* diff diff)))

(mse 10 8)
;; => 4

Java example

public static double mse(double y, double yHat) {
    double diff = yHat - y;
    return diff * diff;
}

Simple. Brutal. Honest.
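
One nitpick: both snippets score a single example. The "mean" in MSE comes from averaging over the whole dataset. A minimal sketch reusing the Clojure mse above (mean-squared-error is just my name for it):

(defn mean-squared-error [ys y-hats]
  (/ (reduce + (map mse ys y-hats))
     (count ys)))

(mean-squared-error [10 20 30] [8 21 33])
;; => 14/3  (that is, (4 + 1 + 9) / 3)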


4. Optimization: how do we get less wrong?

Now the important part.

We want to change the weights so the loss goes down.

This is where gradient descent comes in.

Fancy name. Simple idea.

Walk downhill until pain stops.


5. Derivative (don’t freak out)

A derivative answers one question:

If I change w a little, what happens to the loss?

For linear regression with MSE, the derivative looks like this:

dL/dw = 2 * x * (y_hat - y)

Translation:

  • if prediction is too high → push weight down
  • if prediction is too low → push weight up
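
If that formula feels like magic, you can sanity-check it numerically: nudge w a tiny bit, see how the loss moves, and compare with 2 * x * (y_hat - y). A rough Clojure sketch (the helper names are mine, and b is pinned at 0 to keep it short):

(defn loss-at [w x y]
  (let [y-hat (* w x)]
    (* (- y-hat y) (- y-hat y))))

(defn numeric-grad [w x y eps]
  (/ (- (loss-at (+ w eps) x y)
        (loss-at (- w eps) x y))
     (* 2 eps)))

(defn analytic-grad [w x y]
  (* 2 x (- (* w x) y)))

(numeric-grad 1.0 3.0 10.0 1e-4)  ;; ≈ -42.0
(analytic-grad 1.0 3.0 10.0)      ;; => -42.0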

6. Updating the weight

The update rule:

w = w - learning_rate * gradient

Where:

  • learning_rate = how fast we learn
  • too big → chaos
  • too small → forever student
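
One concrete update with small made-up numbers (the same ones the train-step example in the next section uses):

(let [w 1.0
      lr 0.01
      gradient -42.0]
  (- w (* lr gradient)))
;; => 1.42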

7. Full tiny training step

Clojure version

(defn predict [x w b]
  (+ (* w x) b))

(defn train-step [x y w b lr]
  (let [y-hat (predict x w b)
        grad-w (* 2 x (- y-hat y))
        new-w (- w (* lr grad-w))]
    new-w))

(train-step 3 10 1 0 0.01)
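;; => 1.42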

Pure functions. No magic. Very Clojure. Very zen.


Java version

public static double predict(double x, double w, double b) {
    return w * x + b;
}

public static double trainStep(double x, double y, double w, double b, double lr) {
    double yHat = predict(x, w, b);
    double gradW = 2 * x * (yHat - y);
    return w - lr * gradW;
}

Yes, this is literally how ML starts.
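
To actually watch learning happen, loop the Clojure train-step over a tiny dataset. A rough sketch (the train helper and the toy data are mine): the data follows y = 3x, so w should crawl toward 3.0.

(defn train [xs ys w b lr epochs]
  ;; one pass over the data per epoch, threading w through every example
  (reduce (fn [w _]
            (reduce (fn [w [x y]] (train-step x y w b lr))
                    w
                    (map vector xs ys)))
          w
          (range epochs)))

(train [1 2 3] [3 6 9] 0.0 0 0.05 50)
;; => very close to 3.0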


8. Multiple features (same idea, more numbers)

If you have many inputs:

x = [x1, x2, x3]
w = [w1, w2, w3]

Prediction becomes:

y_hat = x · w + b

Dot product. Vector stuff. Still chill.

Clojure dot product

(defn dot [a b]
  (reduce + (map * a b)))

(dot [1 2 3] [4 5 6])
;; => 32

Java dot product

public static double dot(double[] a, double[] b) {
    double sum = 0;
    for (int i = 0; i < a.length; i++) {
        sum += a[i] * b[i];
    }
    return sum;
}

Same logic. Different syntax. No drama.
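
Wiring dot into a multi-feature prediction is a one-liner (predict-vec is just my name for it):

(defn predict-vec [xs ws b]
  (+ (dot xs ws) b))

(predict-vec [1 2 3] [4 5 6] 0.5)
;; => 32.5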


9. Classification? Same math, different loss

For classification:

  • output is usually probability
  • you use sigmoid or softmax
  • loss becomes cross-entropy (both sketched below)

But the core loop stays the same:

  1. predict
  2. measure error
  3. compute gradient
  4. update weights
  5. repeat until coffee ends
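
A minimal Clojure sketch of those two new pieces, for the binary case (my own toy versions, not from any library):

(defn sigmoid [z]
  (/ 1.0 (+ 1.0 (Math/exp (- z)))))

(defn cross-entropy [y p]
  ;; y is the true label (0 or 1), p is the predicted probability
  (- (+ (* y (Math/log p))
        (* (- 1 y) (Math/log (- 1 p))))))

(sigmoid 0.0)          ;; => 0.5
(cross-entropy 1 0.9)  ;; ≈ 0.105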

10. Mental model (this is the important part)

Forget formulas for a second.

Supervised learning is:

  • guessing
  • checking how wrong you are
  • adjusting knobs
  • looping like an obsessed engineer

The math just keeps this process stable and efficient.


Final thoughts

If someone says:

“Supervised learning math is too complex”

They are lying or selling a course.

At the core:

  • multiplication
  • subtraction
  • a bit of slope
  • lots of repetition

The magic is not the math. The magic is data + iteration + patience.

And yes, machines learn like stubborn juniors. They just never sleep.
