Learning to drive the terminology of AI: What is F Score?
Background: Why is an F score needed in AI applications?
AI and machine learning systems need monitoring for quality, and sometimes quality can be pinned down in a very concrete way. In this article, the model we use for talking about quality is simplified: we have inputs, and we have correct outputs. Our AI model needs to learn the correct answer, and then we can talk about quality.
In the realm of AI and computer applications, the concept of quality can often be ambiguous. Nevertheless, it is crucial to establish straightforward metrics for consistent tracking.
F-score serves as a fundamental metric in statistical methods, learning algorithms, and optimization challenges.
The quality of a learning algorithm's outputs on both seen and unseen data (i.e. its correctness in "the real deal") can be measured by calculating an F-score.
Having quite little prior experience, I wanted to understand deeply what the F score is capable of.
Read on!
How is the F score calculated?
The F score needs:
- inputs
- output = prediction of the AI model
- a known ground truth (correct answer)
Let's recap quickly:
In an AI application, we give the AI model “inputs”.
The model then calculates an output for us.
Internally, we also know the ground truth = the correct output for that particular input.
You will be keeping a few simple counter variables (three of which the F score actually needs). These are placed into a confusion matrix.
Remember: a matrix is just a rectangular storage of numbers. It has its own algebraic rules, but for now we can simply think of a matrix as a storage box.
The F1 score measures the quality of a predictor mechanism in a machine learning application. In other words, it condenses how well the model's positive predictions match the ground truth into a single number.
What's really cool about the F score is that:
- the F score is easy to calculate:
- the calculation uses only 3 numbers
- the calculation is done from data in the confusion matrix
- the confusion matrix just stores counts of cases that are easy to tally
The F score is based on three out of the following 4 components in the confusion matrix:
- true positives
- false positives
- true negatives (are NOT counted!)
- false negatives
The F1 score indicates how accurately, and with what coverage, a method can identify "cases" correctly, and also how often it makes incorrect predictions. If we simply used the fraction of correct classifications alone, we would get an unfair picture of the overall reliability of a method.
So the importance of having a "balanced" score like the F1 score is:
- a low-quality AI method that correctly identifies many cases (high true positives), but also falsely flags a lot of cases (a high FP component = false positives), is given a low F score.
- a high-quality AI method is given a high F score
Why 4 components in the confusion matrix? Isn’t 2 enough??
Note:
if we look at the behavior of the F1 score formula, we find that three of the components are used: true positives, false positives and false negatives. True negatives are not used.
So the F1 score formula uses only 3 of the 4 metrics. Why?
Why doesn't the F1 score use all 4 parameters?
Since we know that the confusion matrix counts 4 types of cases, it seems a bit strange that we do not use all 4 of them in the F1 score.
The reason is that in certain applications, the number of true negatives is very, very large in proportion to the other types of cases. For example, in email spam detection, the number of true negatives can be huge.
So why doesn't the F1 score calculation use TN?
F1 Score Focuses on Positive Class:
- The F1 Score is typically used in binary classification tasks where the focus is on the performance of the positive class (e.g., detecting spam emails, identifying diseases).
- TN measures the correctly classified negatives, which is not directly relevant to the positive class's performance.
Behavior of F1 Score on Imbalanced Datasets:
- In highly imbalanced datasets (e.g., fraud detection), the number of true negatives can dominate the overall performance metrics, skewing accuracy. The F1 Score avoids this by focusing only on the positive predictions.
We do NOT want a slasher AI that just fires off random predictions in the hope of a high one-sided accuracy score (even though it would still be neglecting its own errors, and not improving).
Whereas, if an AI is measured with an F score, such a poorly performing AI would be revealed: it would not get a high F score!
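A small, made-up numeric example shows why. Suppose a spam filter sees 1000 emails, of which only 10 are actually spam, and a lazy model flags 5 emails as spam, only 1 of which really is. Then TP = 1, FP = 4, FN = 9, TN = 986. Plain accuracy is (1 + 986) / 1000 = 98.7%, which looks great. But precision = 1/5 = 0.2, recall = 1/10 = 0.1, and F1 = 2 · 0.2 · 0.1 / (0.2 + 0.1) ≈ 0.13. The F1 score exposes the poor performance on the class we actually care about.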
So let’s get back to my naive idea:
I wanted a fun way of actually seeing, with my own eyes, how the F1 score would evaluate a car driving through a maze in a simulator. Here is the idea:
Let's make a simple 2D car simulator. There is a car (a block) that moves in a maze. It can only go forward, turn left, or turn right.
These elementary choices are called “steps”.
We have a Python program where the car drives through a randomly created maze. The car starts going forward and taking left/right turns, hoping to arrive at a goal (a point in the maze). So basically we are talking about a rather simple thing.
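To make this concrete, here is a minimal sketch of what I mean. The maze layout, the action names and the purely random "policy" are my own placeholder assumptions, and the maze is fixed here rather than randomly generated, just to keep the sketch short:

```python
# A minimal 2D "car in a maze" sketch. The maze layout, action names and the
# random choice of actions are placeholder assumptions for illustration only.
import random

MAZE = [
    "#########",
    "#S..#...#",
    "#.#.#.#.#",
    "#.#...#G#",
    "#########",
]
# Headings as (row, col) deltas: up, right, down, left
HEADINGS = [(-1, 0), (0, 1), (1, 0), (0, -1)]

def find(ch):
    """Locate a character (start 'S' or goal 'G') in the maze."""
    for r, row in enumerate(MAZE):
        if ch in row:
            return (r, row.index(ch))

def step(pos, heading, action):
    """Apply one elementary step: turn in place, or move one cell forward."""
    if action == "left":
        heading = (heading - 1) % 4
    elif action == "right":
        heading = (heading + 1) % 4
    else:  # "forward"
        dr, dc = HEADINGS[heading]
        nxt = (pos[0] + dr, pos[1] + dc)
        if MAZE[nxt[0]][nxt[1]] != "#":   # blocked moves are simply ignored
            pos = nxt
    return pos, heading

pos, heading, goal = find("S"), 1, find("G")
for t in range(200):                      # cap the number of steps
    action = random.choice(["forward", "forward", "left", "right"])
    pos, heading = step(pos, heading, action)
    if pos == goal:
        print(f"Reached the goal in {t + 1} steps")
        break
else:
    print("Did not reach the goal within 200 steps")
```

A purely random driver like this will sometimes stumble onto the goal and often will not, which is exactly why the later questions about learning and about measuring quality come up.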
Questions I then started thinking:
How do we arrange meaningful learning in the car/maze problem?
So how does our simulation car learn? We would need supervision to guide learning:
- first we would need a definition of “correct”
- is "correct" a whole path, or just a single little action by the car, like accelerate, turn left, and so on?
- or do we supervise by having a human evaluate the goodness of each step the car takes, individually?
How could we learn anything?
Learning in AI is about teaching a network (a graph of nodes) an error term (a value) for each prediction. The network learns by adapting (changing) the weights of its nodes. These are changed by calculating a gradient, which gives a suitable direction of change for each neuron. The step size (the learning rate) is an external hyperparameter that we choose. There is no magic silver bullet for choosing a learning rate: too high a learning rate makes the AI model behave unstably, possibly mispredicting and missing the desired goal altogether.
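Here is a bare-bones sketch of that weight-update idea on a single weight, with a made-up loss function. It only illustrates the mechanics of gradient steps and the learning rate, not the maze car itself:

```python
# A bare-bones gradient descent sketch on a single weight.
# The loss function and the two learning rates are made-up values.

def loss(w):
    return (w - 3.0) ** 2          # minimized at w = 3

def gradient(w):
    return 2.0 * (w - 3.0)         # derivative of the loss

for learning_rate in (0.1, 1.1):   # a sane rate vs. a too-high rate
    w = 0.0
    for _ in range(20):
        w -= learning_rate * gradient(w)   # the weight update step
    print(f"learning_rate={learning_rate}: w ends up at {w:.3f}")

# With 0.1 the weight settles near 3. With 1.1 each step overshoots more
# than it corrects, so the weight diverges instead of converging.
```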
Pitfalls of high learning rate in AI
A high learning rate is like an over-speeding car: you have a genuinely good intention of wanting the car to arrive quickly at its destination, but instead you wreak havoc by risking too much!
Back to the F1 score and my idea of a car simulation. How can it learn?
This is a good question. Sometimes it is really important to step back, and get your ideas straight.
Can we make the 2D maze driving car actually “learn”?
I guess so. It is the next part in the series, Hands on with a Simcar.
Let's set some goals for the simcar:
- the car must arrive at the goal (a place in the maze) in the least number of steps
- a real car shall not crash into any obstacles
- the car must keep a safe distance from people (pedestrians), other cars, walls, signs, and so on
- the car must keep its lane
- do not stop in a place that would constitute a hazard
- car must obey legal limitations
- car must judge situations and the safety of all its choices
- obey the lane dimensions: do not unnecessarily wander around
- keep a safe speed
- take turns in a manner that does not cause distractions to other cars
- keep the vehicle's speed within the speed limits set for the road
- overtake another car only when it is safe and economically sensible to do so
We can add some kind of heuristic, memory, or similar to the car's behavior. This gives us an algorithm. The algorithm can be described as simply as: the eternal right-turning rover, which exhaustively follows the maze walls and hopes to get to the goal.
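A minimal sketch of that eternal right-turning rover, reusing the MAZE, HEADINGS and find() helpers from the earlier simulator sketch (the details are again my own assumptions):

```python
# Right-hand wall follower sketch: always try to turn right first, then go
# straight, then left, then turn around. Assumes MAZE, HEADINGS and find()
# from the earlier simulator sketch.

def wall_follower(max_steps=500):
    pos, heading = find("S"), 1
    goal = find("G")
    visited = {pos}                      # every cell the car drives through
    for _ in range(max_steps):
        if pos == goal:
            break
        # Preference order relative to the current heading: right, straight, left, back.
        for turn in (1, 0, -1, 2):
            cand = (heading + turn) % 4
            dr, dc = HEADINGS[cand]
            if MAZE[pos[0] + dr][pos[1] + dc] != "#":
                heading = cand
                pos = (pos[0] + dr, pos[1] + dc)
                visited.add(pos)
                break
    return visited

print(f"The rover drove through {len(wall_follower())} distinct cells")
```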
The algorithm could also be something completely different.
Nevertheless, we can ultimately tally up (count) the results of the car's performance in the maze.
The F-score, or F1 score, in machine learning is one number that compactly describes the recall and precision of decisions.
But we need ground truth!
Ground truth is the "correct" step, choice, or decision. So we could take a known, classical path-optimization algorithm, use it to calculate the ground-truth path, and then arrange to count the 4 components needed for calculating the F1 score, right?
So we would be comparing our naive algorithm with the optimal results.
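One possible way to set this up (this per-cell framing is my own interpretation, with breadth-first search standing in for the "known, classical path-optimization algorithm"): treat the cells of the BFS shortest path as the positive class, and compare them with the cells the naive rover actually drove through.

```python
# Sketch: BFS shortest path as ground truth, then per-cell TP/FP/FN counts.
# Assumes MAZE, find() and wall_follower() from the earlier sketches.
from collections import deque

def bfs_path(start, goal):
    """Cells on one shortest path from start to goal (the ground truth)."""
    prev = {start: None}
    queue = deque([start])
    while queue:
        r, c = queue.popleft()
        if (r, c) == goal:
            break
        for nxt in ((r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1)):
            if MAZE[nxt[0]][nxt[1]] != "#" and nxt not in prev:
                prev[nxt] = (r, c)
                queue.append(nxt)
    path, node = set(), goal
    while node is not None:              # walk back from the goal to the start
        path.add(node)
        node = prev[node]
    return path

optimal_cells = bfs_path(find("S"), find("G"))   # ground-truth "positive" cells
driven_cells = wall_follower()                   # what the naive car actually did

tp = len(driven_cells & optimal_cells)   # on the optimal path and driven
fp = len(driven_cells - optimal_cells)   # driven, but off the optimal path
fn = len(optimal_cells - driven_cells)   # on the optimal path, never driven
f1 = 2 * tp / (2 * tp + fp + fn)
print(f"TP={tp}, FP={fp}, FN={fn}, F1={f1:.2f}")
```

This is only one way to define the positive class; one could just as well compare individual decisions (turn vs. go forward) step by step against the optimal choice.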
There are a few different concepts involved in measuring the "quality" of an AI system:
- F1 score
- recall
- false positives
- false negatives
- true positives
- true negatives
- confusion matrix
The confusion matrix is used to store the counts of these kinds of occurrences (the ones in the list above). So the confusion matrix is the storage, and it is used further down the line to calculate the other concepts, like the F1 score.
Let's get our hands dirty.
The F1 score formula is: F1 = 2 · TP / (2 · TP + FP + FN). Equivalently, F1 is the harmonic mean of precision and recall: F1 = 2 · (precision · recall) / (precision + recall).
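For example, with numbers picked purely for illustration, say a model produces TP = 8, FP = 2 and FN = 4. Then precision = 8 / (8 + 2) = 0.8, recall = 8 / (8 + 4) ≈ 0.67, and F1 = 2 · 0.8 · 0.67 / (0.8 + 0.67) ≈ 0.73. The same result comes directly from 2 · 8 / (2 · 8 + 2 + 4) = 16 / 22 ≈ 0.73.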
What I want to do and give as takeaway concepts:
- understanding the F1 score's role
- the calculation formula of F1 score
- how computationally hard or easy is F1 score calculation in computers?
- results and interpreting the F1 Score
- what do you actually do when you have the F1 score?
- are there standards for what F1 score to aim for?
- can the F1 score be used, as such, as a metric for decisions on either the data-gathering method or the algorithm (model) development in an already existing AI solution?
False positives/negatives, and true positives/negatives
FP = making up a thing
TP = correctly identifying a thing
FN = falsely claiming that something isn't
TN = correctly identifying a negative outcome
Given labeled data, if we are labeling samples (inputs) with our AI, then:
- an FP (false positive) is where the AI says something "is", but in reality it is not
- a false negative is where the AI says something "is not", whereas it really IS. For example, a spam filter letting an actual spam email through produces a false negative.
- a true positive is an accurate (correct) prediction of an "is" case
- a true negative is an accurate (correct) prediction of an "is not" case
How are the 4 components counted?
You need to have the ground truth available. The ground truth is the real result.
An AI model predicts a result, and then you compare the prediction against the ground truth. You can then count these kinds of correct and incorrect occurrences; there are 2 of each:
- 1 FP = the AI model categorized the sample as positive, but the sample (ground truth) is negative
- 1 TP = the AI model correctly categorized the sample as positive, matching the ground truth
- 1 FN = the AI model incorrectly categorized the input as negative (the ground truth is positive)
- 1 TN = the AI model correctly categorized the sample as negative
The F1 score is precision and recall combined into one score (their harmonic mean).
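Here is a minimal counting sketch of exactly that bookkeeping. The two label lists are made up purely for illustration; in a real application they would come from your dataset and your model:

```python
# Tally the four confusion-matrix counts from predictions vs. ground truth,
# then compute precision, recall and F1. The two label lists are made up.
ground_truth = [1, 0, 1, 1, 0, 0, 1, 0]   # 1 = positive, 0 = negative
predictions  = [1, 0, 0, 1, 1, 0, 1, 0]   # what the model said

tp = fp = fn = tn = 0
for truth, pred in zip(ground_truth, predictions):
    if pred == 1 and truth == 1:
        tp += 1        # correctly claimed "is"
    elif pred == 1 and truth == 0:
        fp += 1        # claimed "is", but it is not
    elif pred == 0 and truth == 1:
        fn += 1        # claimed "is not", but it is
    else:
        tn += 1        # correctly claimed "is not"

precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)
print(f"TP={tp} FP={fp} FN={fn} TN={tn}  F1={f1:.2f}")
```

If scikit-learn happens to be available, sklearn.metrics.f1_score(ground_truth, predictions) should give the same number, which is a handy cross-check.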
Critical thinking, links to videos