Supervised Learning Math (for humans, not robots)
Alright… Let’s talk about supervised learning, but in a way that does not feel like a math exam from hell.
I’ll explain the core math ideas, why they matter, and I’ll throw in Clojure and Java code as we go.
No scary words. No academic nonsense. Just the stuff that actually matters.
Think of this as:
“Math that makes machines less dumb”
1. What supervised learning really is
At its heart, supervised learning is pattern matching with feedback.
You give the model:
- Inputs: numbers (features)
- Outputs: the right answer (labels)
Example:
x = [size, rooms]
y = price
The model:
- guesses a price
- compares with the real price
- feels bad
- adjusts itself
- repeats
That’s it. Everything else is details.
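To make that concrete, here is the housing example as plain Clojure data. The numbers are completely made up by me, just to show what “inputs plus labels” looks like:
(def training-data
  [{:x [50 2]  :y 150000}   ;; [size rooms] -> price
   {:x [80 3]  :y 230000}
   {:x [120 4] :y 340000}])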
2. The core equation (no panic)
Most supervised models boil down to this:
prediction = f(x, weights) + bias
For a simple linear model:
y_hat = w * x + b
Where:
- x = input
- w = weight (how important x is)
- b = bias (base value)
- y_hat = predicted output
This is just a line. Yes. High school math.
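Quick sanity check with numbers I just invented: say each unit of size is “worth” w = 3000 and the base price is b = 20000.
;; hypothetical numbers, nothing official
(let [w 3000 b 20000 x 50]
  (+ (* w x) b))
;; => 170000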
3. Loss function: how wrong am I?
The model needs a way to measure how bad the prediction was.
The most common one:
Mean Squared Error (MSE)
loss = (y_hat - y)^2
Why square?
- no negative values
- big mistakes hurt more
Clojure example
(defn mse [y y-hat]
  (let [diff (- y-hat y)]
    (* diff diff)))

(mse 10 8)
;; => 4
Java example
public static double mse(double y, double yHat) {
    double diff = yHat - y;
    return diff * diff;
}
Simple. Brutal. Honest.
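One honest footnote: these functions score a single example, which is really just the squared error. The “mean” in MSE means averaging that over the whole dataset. A minimal sketch (the helper name is mine):
(defn mean-squared-error [ys y-hats]
  (/ (reduce + (map #(let [d (- %2 %1)] (* d d)) ys y-hats))
     (count ys)))

(mean-squared-error [10.0 20.0] [8.0 23.0])
;; => 6.5   ;; (4 + 9) / 2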
4. Optimization: how do we get less wrong?
Now the important part.
We want to change the weights so the loss goes down.
This is where gradient descent comes in.
Fancy name. Simple idea.
Walk downhill until pain stops.
5. Derivative (don’t freak out)
A derivative answers one question:
If I change w a little, what happens to the loss?
For linear regression with MSE, the derivative looks like this:
dL/dw = 2 * x * (y_hat - y)
Translation:
- if prediction is too high → push weight down
- if prediction is too low → push weight up
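If you don’t want to take that formula on faith (healthy), you can check it numerically: nudge w a tiny bit, watch the loss, compare. This is my own little sketch, not part of any library:
(defn loss [x y w b]
  (let [y-hat (+ (* w x) b)
        diff  (- y-hat y)]
    (* diff diff)))

;; numerical derivative: (L(w + h) - L(w - h)) / (2h)
(defn numeric-grad [x y w b h]
  (/ (- (loss x y (+ w h) b)
        (loss x y (- w h) b))
     (* 2 h)))

(numeric-grad 3 10 1 0 1e-4)   ;; => roughly -42.0
(* 2 3 (- 3 10))               ;; the formula 2 * x * (y_hat - y) => -42
Same answer, up to floating point. The slope formula is not lying to you.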
6. Updating the weight
The update rule:
w = w - learning_rate * gradient
Where:
- learning_rate = how fast we learn
- too big → chaos
- too small → forever student
7. Full tiny training step
Clojure version
(defn predict [x w b]
  (+ (* w x) b))

(defn train-step [x y w b lr]
  (let [y-hat  (predict x w b)
        grad-w (* 2 x (- y-hat y))
        new-w  (- w (* lr grad-w))]
    new-w))

(train-step 3 10 1 0 0.01)
Pure functions. No magic. Very Clojure. Very zen.
Java version
public static double predict(double x, double w, double b) {
    return w * x + b;
}

public static double trainStep(double x, double y, double w, double b, double lr) {
    double yHat = predict(x, w, b);
    double gradW = 2 * x * (yHat - y);
    return w - lr * gradW;
}
Yes, this is literally how ML starts.
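And training is just doing that step over and over. Here is a minimal loop sketch (my own addition, reusing train-step from the Clojure block above) that fits one weight to made-up data where the true rule is y = 2x:
(def data [[1 2] [2 4] [3 6] [4 8]])   ;; hypothetical [x y] pairs

(defn train-epoch [w lr]
  (reduce (fn [w [x y]] (train-step x y w 0 lr)) w data))

;; 200 passes over the data with a small learning rate
(reduce (fn [w _] (train-epoch w 0.01)) 0.0 (range 200))
;; => very close to 2.0 (bias stays pinned at 0 in this sketch)
Crank the learning rate up far enough and the same loop diverges instead, which is the “too big → chaos” from earlier.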
8. Multiple features (same idea, more numbers)
If you have many inputs:
x = [x1, x2, x3]
w = [w1, w2, w3]
Prediction becomes:
y_hat = x · w + b
Dot product. Vector stuff. Still chill.
Clojure dot product
(defn dot [a b]
  (reduce + (map * a b)))

(dot [1 2 3] [4 5 6])
;; => 32
Java dot product
public static double dot(double[] a, double[] b) {
    double sum = 0;
    for (int i = 0; i < a.length; i++) {
        sum += a[i] * b[i];
    }
    return sum;
}
Same logic. Different syntax. No drama.
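And prediction with many features is just the dot product plus the bias. A tiny sketch on top of the dot function above (predict-vec is my own name):
(defn predict-vec [xs ws b]
  (+ (dot xs ws) b))

(predict-vec [1 2 3] [4 5 6] 10)
;; => 42   ;; 32 from the dot product, plus a bias of 10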
9. Classification? Same math, different loss
For classification:
- output is usually probability
- you use sigmoid or softmax
- loss becomes cross-entropy (rough sketch at the end of this section)
But the core loop stays the same:
- predict
- measure error
- compute gradient
- update weights
- repeat until coffee ends
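For the curious, here is roughly what those classification pieces look like, sketched in the same toy style (my own definitions, binary case only):
(defn sigmoid [z]
  (/ 1.0 (+ 1.0 (Math/exp (- z)))))

;; binary cross-entropy: y is 0 or 1, p is the predicted probability of class 1
(defn cross-entropy [y p]
  (- (+ (* y (Math/log p))
        (* (- 1 y) (Math/log (- 1 p))))))

(sigmoid 0.0)          ;; => 0.5
(cross-entropy 1 0.9)  ;; => ~0.105  (confident and right: small loss)
(cross-entropy 1 0.1)  ;; => ~2.303  (confident and wrong: big loss)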
10. Mental model (this is the important part)
Forget formulas for a second.
Supervised learning is:
- guessing
- checking how wrong you are
- adjusting knobs
- looping like an obsessed engineer
The math just keeps this process stable and efficient.
Final thoughts
If someone says:
“Supervised learning math is too complex”
They are lying or selling a course.
At the core:
- multiplication
- subtraction
- a bit of slope
- lots of repetition
The magic is not the math. The magic is data + iteration + patience.
And yes, machines learn like stubborn juniors. They just never sleep.