Supervised Learning Math (for humans, not robots)

Alright… Let’s talk about supervised learning, but in a way that does not feel like a math exam from hell.

I’ll explain the core math ideas, why they matter, and throw in Clojure and Java code as we go.
No scary words. No academic nonsense. Just the stuff that actually matters.

Think of this as:

“Math that makes machines less dumb”


1. What supervised learning really is

At its heart, supervised learning is pattern matching with feedback.

You give the model:

  • Inputs: numbers (features)
  • Outputs: the right answer (labels)

Example:

x = [size, rooms]
y = price

The model:

  • guesses a price
  • compares with the real price
  • feels bad
  • adjusts itself
  • repeats

That’s it. Everything else is details.
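
For example, a toy housing dataset might look like this in Clojure (the numbers are invented, just to show the shape of the data):

(def xs [[50 2] [80 3] [120 4]])  ;; each x is [size rooms]
(def ys [150000 240000 360000])   ;; each y is a price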


2. The core equation (no panic)

Most supervised models boil down to this:

prediction = f(x, weights) + bias

For a simple linear model:

y_hat = w * x + b

Where:

  • x = input
  • w = weight (how important x is)
  • b = bias (base value)
  • y_hat = predicted output

This is just a line. Yes. High school math.
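
Plug in some made-up numbers and there is nowhere for complexity to hide:

(let [w 2.0 x 3.0 b 1.0]
  (+ (* w x) b))
;; => 7.0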


3. Loss function: how wrong am I?

The model needs a way to measure how bad the prediction was.

The most common one:

Mean Squared Error (MSE)

loss = (y_hat - y)^2

Why square?

  • no negative values
  • big mistakes hurt more

Clojure example

(defn mse [y y-hat]
  (let [diff (- y-hat y)]
    (* diff diff)))

(mse 10 8)
;; => 4

Java example

public static double mse(double y, double yHat) {
    double diff = yHat - y;
    return diff * diff;
}

Simple. Brutal. Honest.
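
One nitpick: both snippets score a single example. The "mean" in MSE comes from averaging over the whole dataset. A minimal sketch reusing the Clojure mse above (mean-squared-error is just my name for it):

(defn mean-squared-error [ys y-hats]
  (/ (reduce + (map mse ys y-hats))
     (count ys)))

(mean-squared-error [10 20 30] [8 21 33])
;; => 14/3  (that is, (4 + 1 + 9) / 3)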


4. Optimization: how do we get less wrong?

Now the important part.

We want to change the weights so the loss goes down.

This is where gradient descent comes in.

Fancy name. Simple idea.

Walk downhill until pain stops.


5. Derivative (don’t freak out)

A derivative answers one question:

If I change w a little, what happens to the loss?

For linear regression with MSE, the derivative looks like this:

dL/dw = 2 * x * (y_hat - y)

Translation:

  • if prediction is too high → push weight down
  • if prediction is too low → push weight up
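
If that formula feels like magic, you can sanity-check it numerically: nudge w a tiny bit, see how the loss moves, and compare with 2 * x * (y_hat - y). A rough Clojure sketch (the helper names are mine, and b is pinned at 0 to keep it short):

(defn loss-at [w x y]
  (let [y-hat (* w x)]
    (* (- y-hat y) (- y-hat y))))

(defn numeric-grad [w x y eps]
  (/ (- (loss-at (+ w eps) x y)
        (loss-at (- w eps) x y))
     (* 2 eps)))

(defn analytic-grad [w x y]
  (* 2 x (- (* w x) y)))

(numeric-grad 1.0 3.0 10.0 1e-4)  ;; ≈ -42.0
(analytic-grad 1.0 3.0 10.0)      ;; => -42.0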

6. Updating the weight

The update rule:

w = w - learning_rate * gradient

Where:

  • learning_rate = how fast we learn
  • too big → chaos
  • too small → forever student
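
One concrete update with small made-up numbers (the same ones the train-step example in the next section uses):

(let [w 1.0
      lr 0.01
      gradient -42.0]
  (- w (* lr gradient)))
;; => 1.42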

7. Full tiny training step

Clojure version

(defn predict [x w b]
  (+ (* w x) b))

(defn train-step [x y w b lr]
  (let [y-hat (predict x w b)
        grad-w (* 2 x (- y-hat y))
        new-w (- w (* lr grad-w))]
    new-w))

(train-step 3 10 1 0 0.01)
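;; => 1.42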

Pure functions. No magic. Very Clojure. Very zen.


Java version

public static double predict(double x, double w, double b) {
    return w * x + b;
}

public static double trainStep(double x, double y, double w, double b, double lr) {
    double yHat = predict(x, w, b);
    double gradW = 2 * x * (yHat - y);
    return w - lr * gradW;
}

Yes, this is literally how ML starts.
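
To actually watch learning happen, loop the Clojure train-step over a tiny dataset. A rough sketch (the train helper and the toy data are mine): the data follows y = 3x, so w should crawl toward 3.0.

(defn train [xs ys w b lr epochs]
  ;; one pass over the data per epoch, threading w through every example
  (reduce (fn [w _]
            (reduce (fn [w [x y]] (train-step x y w b lr))
                    w
                    (map vector xs ys)))
          w
          (range epochs)))

(train [1 2 3] [3 6 9] 0.0 0 0.05 50)
;; => very close to 3.0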


8. Multiple features (same idea, more numbers)

If you have many inputs:

x = [x1, x2, x3]
w = [w1, w2, w3]

Prediction becomes:

y_hat = x · w + b

Dot product. Vector stuff. Still chill.

Clojure dot product

(defn dot [a b]
  (reduce + (map * a b)))

(dot [1 2 3] [4 5 6])
;; => 32

Java dot product

public static double dot(double[] a, double[] b) {
    double sum = 0;
    for (int i = 0; i < a.length; i++) {
        sum += a[i] * b[i];
    }
    return sum;
}

Same logic. Different syntax. No drama.
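
Wiring dot into a multi-feature prediction is a one-liner (predict-vec is just my name for it):

(defn predict-vec [xs ws b]
  (+ (dot xs ws) b))

(predict-vec [1 2 3] [4 5 6] 0.5)
;; => 32.5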


9. Classification? Same math, different loss

For classification:

  • output is usually probability
  • you use sigmoid or softmax
  • loss becomes cross-entropy (both sketched below)

But the core loop stays the same:

  1. predict
  2. measure error
  3. compute gradient
  4. update weights
  5. repeat until coffee ends
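
A minimal Clojure sketch of those two new pieces, for the binary case (my own toy versions, not from any library):

(defn sigmoid [z]
  (/ 1.0 (+ 1.0 (Math/exp (- z)))))

(defn cross-entropy [y p]
  ;; y is the true label (0 or 1), p is the predicted probability
  (- (+ (* y (Math/log p))
        (* (- 1 y) (Math/log (- 1 p))))))

(sigmoid 0.0)          ;; => 0.5
(cross-entropy 1 0.9)  ;; ≈ 0.105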

10. Mental model (this is the important part)

Forget formulas for a second.

Supervised learning is:

  • guessing
  • checking how wrong you are
  • adjusting knobs
  • looping like an obsessed engineer

The math just keeps this process stable and efficient.


Final thoughts

If someone says:

“Supervised learning math is too complex”

They are lying or selling a course.

At the core:

  • multiplication
  • subtraction
  • a bit of slope
  • lots of repetition

The magic is not the math. The magic is data + iteration + patience.

And yes, machines learn like stubborn juniors. They just never sleep.
