AI2 Exam Revision Guide

1. Understanding Uncertainty in AI

Uncertainty in AI refers to the fact that outcomes or model parameters are not known with certainty. Three key approaches to handle uncertainty are: Bayesian estimation, Maximum Likelihood Estimation (MLE), and probabilistic models like Logistic Regression. These approaches use probability to quantify uncertainty in different ways:

Bayesian Estimation: Treats model parameters as random variables and updates a prior belief to a posterior distribution after observing data (stats.stackexchange.com). Instead of giving a single best estimate, Bayesian methods produce an entire probability distribution of the parameter $\theta$ given data $D$ (the posterior $p(\theta|D)$). This means we incorporate prior knowledge and obtain a measure of uncertainty in the parameter itself.

Maximum Likelihood Estimation (MLE): Treats model parameters as fixed but unknown constants. We choose the parameter values that maximize the likelihood of observing the given data (stats.stackexchange.com). In other words, MLE finds the parameter that makes the observed data “most probable.” MLE does not use a prior distribution (it can be seen as a special case of Bayesian estimation with a uniform prior).

Logistic Regression: A predictive model that provides probabilistic outputs for classification. It models the probability of a binary outcome using the sigmoid (logistic) function so that predictions range from 0 to 1 (i.e., a probability) (developers.google.com). Logistic regression thus explicitly acknowledges uncertainty in predictions – e.g. “there is a 93% probability of spam” instead of a hard yes/no (developers.google.com). It uses MLE under the hood to fit the model coefficients (weights) to data.

When to use which? Bayesian estimation is useful when you have prior knowledge or need a full distribution over parameters (helpful in small-data regimes or when quantifying confidence in parameters). MLE is often used when you have lots of data and want a single best estimate (it’s simpler and computationally efficient). Logistic regression is chosen for binary classification tasks to get probability outputs (it uses MLE for training but gives probabilistic predictions, which is critical for decision-making under uncertainty).

In practice, understanding these approaches helps in choosing the right tool: for example, you might use Bayesian methods in a medical diagnosis AI (where prior expert knowledge is valuable and you want confidence intervals for predictions), but use MLE/logistic regression for a large-scale spam classifier where data is plentiful and a quick probability estimate is needed.

Key point: All these techniques embed uncertainty into AI models – either by producing a probability distribution (Bayesian), an optimal point estimate from data (MLE), or a probabilistic prediction of an outcome (logistic regression). Embracing uncertainty leads to more robust and informed AI decisions. (Self-written notes: … )

2. Bayesian Estimation and MLE

Bayesian Estimation and Maximum Likelihood Estimation (MLE) are two fundamental parameter estimation philosophies. Below are their definitions, formulas, and usage:

Maximum Likelihood Estimation (MLE):

Definition: MLE finds the parameter value that maximizes the likelihood of the observed data (stats.stackexchange.com). Essentially, we choose $\hat{\theta}_{MLE}$ such that

$$\hat{\theta}_{MLE} = \arg\max_\theta L(\theta) = \arg\max_\theta P(D \mid \theta),$$

where $L(\theta) = P(D|\theta)$ is the likelihood of data $D$ given parameter $\theta$. This yields the parameter that makes the data most probable.

Interpretation: MLE treats the parameter as an unknown constant. By maximizing $P(D|\theta)$, we assume the observed data is representative and “trust” it entirely for estimating $\theta$. No prior belief is used. The result is a single point estimate for $\theta$.

When to Use: Use MLE when you have sufficient data and want a purely data-driven estimate. It’s computationally convenient (often leading to well-known formulas, e.g. the sample mean for Gaussian mean estimation). However, MLE can overfit when data is scarce, since it doesn’t incorporate prior knowledge.

Bayesian Estimation:

Definition: Bayesian estimation treats the parameter $\theta$ as a random variable with a prior distribution $p(\theta)$. After observing data $D$, we update this prior via Bayes’ Theorem to get the posterior distribution $p(\theta|D) = \frac{P(D|\theta)\,p(\theta)}{P(D)}$ (stats.stackexchange.com). Rather than a single value, the result is a posterior distribution over parameters. If a point estimate is needed, one can use the posterior mean or the MAP (maximum a posteriori) estimate (the mode of the posterior).

Interpretation: We “put in” a prior belief and the data (likelihood), and “get out” a distribution for $\theta$ (stats.stackexchange.com). This posterior quantifies our updated uncertainty about $\theta$. For example, if the prior $p(\theta)$ is $\mathrm{Uniform}(0,1)$ and the data are 7 heads in 10 coin flips, the posterior for the bias $\theta$ is $\mathrm{Beta}(7+1,\,3+1) = \mathrm{Beta}(8,4)$ – a distribution with most of its probability mass around 0.66–0.8, while retaining non-zero probability outside that range.

When to Use: Use Bayesian estimation when prior information is available or when quantifying uncertainty in the estimate is important. It is especially helpful with limited data or in critical applications (e.g. scientific experiments, medical AI) where incorporating prior domain knowledge improves estimates. The trade-off is computational complexity: Bayesian methods often require integrating over distributions (which can be complex), whereas MLE just requires an optimization.

Relationship (MLE as a special case of Bayesian): If you use a flat (uniform) prior, the Bayesian posterior maximization yields the MAP estimate, which in that case equals the MLE. Bayesian estimation is thus a generalization: it reduces to MLE when no informative prior is given (stats.stackexchange.com). MLE provides a single best guess, whereas Bayesian provides a distribution of guesses.

Summary:

Formula-wise: MLE solves $\max_\theta P(D|\theta)$; Bayesian updates $p(\theta)$ via $p(\theta|D)\propto P(D|\theta)\,p(\theta)$.

Use-case: MLE for large data or when prior is unknown/unimportant; Bayesian for incorporating priors or needing uncertainty of $\theta$.

Outcome: MLE = a single estimate (e.g., “the best-fit weight is 2.5”); Bayesian = a distribution (e.g., “the weight is around 2.5 ± 0.3 with 95% probability” – a credible interval). The short code sketch below illustrates this contrast on the coin-flip example.
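To make the contrast concrete, here is a minimal Python sketch (my own illustration, not part of the original notes; it assumes SciPy is available) of the coin-flip example: the MLE is just the sample proportion, while the Bayesian answer is the full Beta(8, 4) posterior.

```python
from scipy import stats

heads, flips = 7, 10   # the coin-flip data: 7 heads in 10 flips

# MLE: the theta maximising the binomial likelihood P(D | theta)
# is simply the sample proportion.
theta_mle = heads / flips                     # 0.7

# Bayesian: Uniform(0,1) = Beta(1,1) prior, so the posterior is
# Beta(1 + heads, 1 + tails) = Beta(8, 4).
posterior = stats.beta(1 + heads, 1 + (flips - heads))

print(f"MLE point estimate:    {theta_mle:.3f}")
print(f"Posterior mean:        {posterior.mean():.3f}")       # ~0.667
print(f"95% credible interval: {posterior.interval(0.95)}")   # roughly (0.39, 0.89)
```

The MLE commits to a single number (0.7); the Bayesian posterior reports a whole range of plausible values, shrunk slightly towards the uniform prior.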

(Self-written notes: … )

3. Logistic Regression

Logistic Regression is a classification model that predicts the probability of a binary outcome (e.g. success/failure, spam/ham). It combines a linear model with the sigmoid (logistic) function to output probabilities between 0 and 1 (developers.google.com).

Figure (source: https://developers.google.com/machine-learning/crash-course/logistic-regression/sigmoid-function): the logistic (sigmoid) function maps any real-valued input to the range [0,1]. This S-shaped curve is used in logistic regression to convert a linear prediction (log-odds) into a probability (developers.google.com).

Formula and Sigmoid Function

Logistic Model: The model assumes the log-odds of the positive class is a linear combination of the features. If $y\in\{0,1\}$ is the class (0 = negative, 1 = positive), and $\mathbf{x} = (x_1,\ldots,x_n)$ are the features, then:

$$\log\frac{P(y=1\mid\mathbf{x})}{P(y=0\mid\mathbf{x})} = w_0 + w_1 x_1 + \cdots + w_n x_n.$$

Here $w_0$ is the intercept (bias) and $w_i$ are weights for the features. This can be rewritten as $P(y=1|\mathbf{x}) = \sigma(w_0 + \sum_{i} w_i x_i)$ where $\sigma$ is the sigmoid.

Sigmoid Function: $\displaystyle \sigma(z) = \frac{1}{1 + e^{-z}}$ (developers.google.com). This function takes any real number $z$ (which could be the linear model output) and squashes it into the range $(0,1)$, interpreting it as a probability. For example, $\sigma(0)=0.5$, $\sigma(2)\approx0.88$, $\sigma(-2)\approx0.12$. The sigmoid ensures the model’s output is a valid probability.

The logistic regression model thus outputs $P(y=1|\mathbf{x})$ directly. It is trained using MLE: the weights $w_i$ are chosen to maximize the likelihood of the observed training labels (equivalently, minimize the log-loss). This usually involves an optimization (like gradient ascent/descent) since a closed-form solution isn’t available for logistic regression.
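As an illustration of that training step (a minimal sketch of my own, not taken from the course notes), the following NumPy code fits a one-feature logistic regression by gradient descent on the average log-loss; the toy data, learning rate and epoch count are arbitrary choices for the example.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Toy data: hours studied (x) and pass/fail labels (y).
x = np.array([2, 4, 6, 8, 10, 12, 16, 20], dtype=float)
y = np.array([0, 0, 0, 1,  0,  1,  1,  1], dtype=float)

w0, w1 = 0.0, 0.0            # intercept and weight
lr, epochs = 0.01, 50_000    # learning rate and number of passes

for _ in range(epochs):
    p = sigmoid(w0 + w1 * x)           # predicted P(y=1 | x)
    grad_w0 = np.mean(p - y)           # gradient of the average log-loss w.r.t. w0
    grad_w1 = np.mean((p - y) * x)     # ... and w.r.t. w1
    w0 -= lr * grad_w0
    w1 -= lr * grad_w1

print(f"fitted w0 = {w0:.2f}, w1 = {w1:.2f}")
print(f"P(pass | 15 hours) = {sigmoid(w0 + w1 * 15):.2f}")
```

Maximising the likelihood is equivalent to minimising the log-loss, which is why the update follows the negative gradient of the loss with respect to the weights.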

Interpreting Coefficients and Decision Boundary

Coefficients (Weights): The weight $w_j$ corresponds to feature $x_j$ and influences the odds of the outcome. For a one-unit increase in $x_j$, holding others constant, the log-odds changes by $w_j$. Equivalently, the odds (ratio $P(y=1)/P(y=0)$) get multiplied by $e^{w_j}$. For example, if $w_j = 0.7$, then increasing $x_j$ by 1 multiplies the odds by $e^{0.7}\approx2.01$ (roughly doubling the odds of $y=1$). A positive weight means that feature increases the probability of $y=1$ (odds > 1), while a negative weight means it decreases that probability (odds < 1). If a weight is zero, that feature has no effect on the odds.

Intercept $w_0$: This sets the baseline odds when all $x_i=0$. $w_0$ is the log-odds of $y=1$ with no features.

Decision Boundary: To make a final decision (0 or 1), we often threshold the predicted probability at 0.5 (or another threshold). Using 0.5: predict 1 if $P(y=1|\mathbf{x}) \ge 0.5$, else 0. This corresponds to $\text{log-odds} \ge 0$ or the linear score $w_0 + \sum w_i x_i \ge 0$ as the boundary. In feature space, this decision boundary is a hyperplane (linear separator) – e.g., for two features it’s a line, for three it’s a plane, etc. Points on one side of the boundary are classified as class 1, the other side as class 0.

Why Logistic Regression (strengths): It provides probabilistic outputs, which is valuable for understanding prediction confidence (developers.google.com). For instance, rather than simply saying “spam” or “not spam,” logistic regression might output a 0.93 probability of spam (developers.google.com) – allowing the system or user to handle uncertainty (maybe flag as spam if >0.9, or “unsure” if around 0.5–0.6). Logistic regression is also easily interpretable via its weights (you can see which features contribute positively or negatively to the prediction). It assumes a linear decision boundary in terms of log-odds, which is generally flexible enough for many problems, yet simpler than more complex models like neural networks.

Worked Example: Suppose we build a logistic model to predict whether a student will pass an exam ($y=1$ for pass) based on hours studied ($x$). Let the learned model be $\log\frac{P(\text{pass})}{P(\text{fail})} = -3 + 0.2x$. Here $w_0=-3$, $w_1=0.2$. This means:

At $x=0$ hours, $\log\frac{P(pass)}{P(fail)}=-3$, so $P(pass)=\sigma(-3)\approx0.05$ (only 5% chance to pass with no study).

Each additional hour multiplies the odds of passing by $e^{0.2}\approx1.22$ (22% increase in odds).

If a student studies $x=20$ hours: linear score $= -3+0.2(20)=+1$, so $P(pass)=\sigma(1)\approx0.73$ (73% chance). If $x=15$ hours: score $= -3+3=0$, $P(pass)=0.50$ (roughly even odds). Thus, the decision boundary is at 15 hours (where probability is 0.5). A student must study more than ~15 hours to be more likely than not to pass in this hypothetical model.
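A quick numeric check of the worked example (my own sketch; the coefficients $w_0=-3$, $w_1=0.2$ come from the hypothetical model above):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

w0, w1 = -3.0, 0.2   # intercept and weight from the worked example

for hours in (0, 15, 20):
    print(f"{hours:2d} hours studied -> P(pass) = {sigmoid(w0 + w1 * hours):.2f}")
# 0 hours -> 0.05, 15 hours -> 0.50, 20 hours -> 0.73
```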

(Self-written notes: … )

4. Information Theory and Decision Trees

Decision trees classify data by splitting it based on feature values, and concepts from Information Theory guide how these splits are chosen. The key ideas are entropy, information gain, and mutual information. These measure how “informative” a feature is about the class label, helping the tree decide which feature to split on at each step. We also outline how a tree is constructed using these metrics (often via the ID3 algorithm).

Entropy (Measure of Uncertainty)

Entropy, in an ML context, quantifies the impurity or uncertainty in a set of examples (en.wikipedia.org). If we have a set $S$ of training examples with classes (e.g. positive/negative), the entropy $H(S)$ is:

$$H(S) = -\sum_{c \,\in\, \text{classes}} p(c)\,\log_2 p(c),$$

where $p(c)$ is the proportion of examples in $S$ with class $c$. Entropy is measured in bits.

A pure set (all examples same class) has $H=0$ (no uncertainty – we can predict the class with no error).

A maximally impure set (e.g. 50/50 positive/negative) has $H=1$ bit for binary classes (maximum uncertainty).

Figure: entropy illustrates the purity of a set – left, a “Very Impure” set (mixed classes, high entropy); right, a “Pure” set (all one class, entropy = 0). A decision tree aims to reduce entropy with each split.

Entropy can be thought of as how “mixed” the class labels are. A node in a decision tree with high entropy is “uncertain” (contains a mix of classes), while low entropy means most samples at that node share the same class. Decision trees prefer splits that reduce entropy, making child nodes purer than the parent.
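To make the definition concrete, here is a tiny Python helper (my own sketch, not from the original notes):

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return sum(-(c / n) * math.log2(c / n) for c in Counter(labels).values())

print(entropy(["Yes"] * 9 + ["No"] * 5))   # ~0.94 bits (mixed node)
print(entropy(["Yes"] * 7))                # 0.0 bits  (pure node)
print(entropy(["Yes", "No"]))              # 1.0 bit   (50/50, maximum impurity)
```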

Information Gain and Mutual Information

Information Gain (IG) is the key criterion for choosing splits in ID3 (and related algorithms). It is defined as the reduction in entropy achieved by splitting on a given attribute (en.wikipedia.org). Formally, for a dataset $T$ and an attribute $A$:

$$IG(T,A) = H(T) - H(T \mid A),$$

where $H(T)$ is the entropy of the full set and $H(T|A)$ is the expected entropy after splitting by $A$ (the weighted sum of entropies of each branch/value of $A$) (en.wikipedia.org). A higher information gain means the attribute $A$ more effectively reduces uncertainty in $T$ (en.wikipedia.org). In essence, information gain is equivalent to the mutual information between the feature and the class label – it tells us how much knowing the feature reduces our uncertainty about the class. An attribute that perfectly classifies the data would have an information gain equal to the original entropy (because after the split the entropy becomes 0). Attributes that don’t help will have low or zero information gain.

Mutual Information (I(X;Y)): A general concept from information theory measuring the amount of information one random variable contains about another. It can be defined as $I(X;Y) = H(X) - H(X|Y) = H(Y) - H(Y|X)$. In decision trees, $I(\text{Feature}; \text{Class})$ is exactly the information gain of that feature for predicting the class. So selecting the feature with highest IG is choosing the feature most informative about the class.

Example (Entropy & IG Calculation): Imagine a dataset $S$ of 14 instances about whether to play tennis, with the class distribution: 9 “Yes” and 5 “No”. The entropy is:

H

(

S

)

=

(

9

14

log

2

9

14

+

5

14

log

2

5

14

)

0.94

bits

.

H(S)=−(

14

9

log

2

14

9

+

14

5

log

2

14

5

)≈0.94 bits.

Now, suppose we consider splitting on the attribute Outlook, which has values {Sunny, Overcast, Rainy}. The data splits into subsets $S_{Sunny}$, $S_{Overcast}$, $S_{Rain}$. We compute their entropies (each weighted by its fraction of records): say $H(S|Outlook) \approx 0.693$ bits (a weighted average). Then $IG(S, \text{Outlook}) = 0.94 - 0.693 = 0.247$ bits. Doing the same for the other attributes (Temperature, Humidity, Wind) gives gains of about 0.029, 0.152, and 0.048 bits respectively (these are typical for the PlayTennis dataset). Outlook has the highest information gain (~0.247), so it is chosen for the first split (medium.com, www3.cs.stonybrook.edu). Overcast days are purely “Yes” in the data (all positive); Sunny and Rain need further splitting. This example shows how entropy and information gain drive the decision tree construction.
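The calculation above can be reproduced with a short self-contained sketch (my own illustration; the per-value class counts are the standard PlayTennis ones, also quoted later in this section):

```python
import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    return sum(-(c / n) * math.log2(c / n) for c in Counter(labels).values())

def info_gain(parent_labels, child_label_lists):
    """IG = H(parent) minus the weighted average entropy of the children."""
    n = len(parent_labels)
    remainder = sum(len(child) / n * entropy(child) for child in child_label_lists)
    return entropy(parent_labels) - remainder

root = ["Yes"] * 9 + ["No"] * 5
outlook_children = [
    ["Yes"] * 2 + ["No"] * 3,   # Sunny: 2 Yes, 3 No
    ["Yes"] * 4,                # Overcast: 4 Yes (pure)
    ["Yes"] * 3 + ["No"] * 2,   # Rain: 3 Yes, 2 No
]
print(f"H(S)           = {entropy(root):.3f} bits")                       # ~0.940
print(f"IG(S, Outlook) = {info_gain(root, outlook_children):.3f} bits")   # ~0.247
```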

Building a Decision Tree (ID3 Algorithm)

A simplified outline of the tree construction algorithm (like ID3):

Start with all training data at the root. Compute the entropy of the class at this node. If the node is pure (entropy 0) or no features remain, it becomes a leaf. Otherwise, choose a feature to split.

Select the feature with the highest Information Gain (i.e. the one that best reduces entropy) (www3.cs.stonybrook.edu). This becomes the decision at the current node. Create one branch for each value of that feature.

Split the dataset into subsets according to the feature’s values. Calculate entropy for each subset (child node). Typically, entropy will be lower in these subsets than it was in the parent (since a good feature split reduces uncertainty).

Recurse: for each child node, if it’s not pure, repeat the process by choosing the best feature to split that subset. If a subset is perfectly pure or no remaining features, make it a leaf node with a class prediction (e.g., if 100% of examples in that node say “Yes”, it’s a Yes-leaf).

Continue until all leaves are pure or other stopping criteria are met (like a maximum tree depth, or no information gain possible).
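A compact recursive sketch of this procedure (my own illustration, not the canonical ID3 pseudocode; it assumes each example is a pair of a feature dictionary and a class label, and handles categorical features only):

```python
import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    return sum(-(c / n) * math.log2(c / n) for c in Counter(labels).values())

def info_gain(parent_labels, child_label_lists):
    n = len(parent_labels)
    return entropy(parent_labels) - sum(
        len(child) / n * entropy(child) for child in child_label_lists)

def id3(examples, features):
    """examples: list of (feature_dict, label) pairs; features: list of feature names."""
    labels = [label for _, label in examples]
    if len(set(labels)) == 1 or not features:
        return Counter(labels).most_common(1)[0][0]     # leaf: (majority) class

    def gain(f):
        values = {x[f] for x, _ in examples}
        return info_gain(labels, [[lab for x, lab in examples if x[f] == v]
                                  for v in values])

    best = max(features, key=gain)                      # split on highest information gain
    return {best: {v: id3([(x, lab) for x, lab in examples if x[best] == v],
                          [f for f in features if f != best])
                   for v in {x[best] for x, _ in examples}}}
```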

The result is a tree where each path from root to leaf represents a sequence of decisions (tests on features) that lead to a classification.

Worked Example (ID3 on “PlayTennis”):

We have attributes: Outlook, Temperature, Humidity, Wind, and target PlayTennis (Yes/No). Initially, $H_{\text{root}}=0.94$ bits as calculated. Information gains were: Outlook 0.247, Humidity 0.152, Wind 0.048, Temperature 0.029. So we split on Outlook (highest IG).

Outlook = Overcast: All training examples with Overcast are “Yes” (pure subset). Entropy = 0, this branch becomes a leaf predicting Yes (PlayTennis = Yes).

Outlook = Sunny: This subset has a mix of Yes/No (perhaps 2 Yes, 3 No in the dataset). We compute entropy and then information gain among the remaining features (Temperature, Humidity, Wind) restricted to this subset. It turns out Humidity has the highest IG for the Sunny subset (e.g. $IG\approx0.97$ for Humidity vs. lower for the others, since splitting on Humidity leaves both children pure) (medium.com). So we split the Sunny branch on Humidity.

Sunny & Humidity = High -> most examples were “No” (perhaps 3 No, 0 Yes). This becomes a leaf predicting No.

Sunny & Humidity = Normal -> most examples “Yes” (perhaps 2 Yes, 0 No). Leaf predicting Yes.

Outlook = Rainy: Subset also mixed (perhaps 3 Yes, 2 No). Check remaining features (Temperature, Humidity, Wind); suppose Wind has highest IG for Rainy subset (common result: Windy vs Not Windy influences playing when it’s rainy). Split on Wind:

Rainy & Wind = TRUE -> maybe mostly “No” outcomes (leaf predicts No).

Rainy & Wind = FALSE -> mostly “Yes” (leaf predicts Yes).

Now we have a complete tree. The structure of this learned decision tree would be:

Figure (source: https://www.enjoyalgorithms.com/blog/decision-tree-algorithm-in-ml/): an induced decision tree for the PlayTennis example. The root decision is Outlook (highest info gain) (www3.cs.stonybrook.edu). Overcast leads directly to Yes. Sunny splits on Humidity, and Rainy splits on Wind, as determined by further information gain calculations. The yellow leaves show the final decision (Yes or No).

Using this tree, one can classify new instances by traversing it: e.g., a day with Outlook=Sunny and Humidity=High follows the Sunny branch and then the Humidity=High branch to predict “No”. Each split was chosen because it offered the most information gain, i.e. the greatest reduction in entropy/uncertainty about the play decision. The above example demonstrates how entropy and information gain drive the greedy construction of a decision tree. (Self-written notes: … )

5. Bayesian Networks

A Bayesian Network (Bayes Net) is a graphical model that represents a set of random variables and their conditional dependencies via a directed acyclic graph (DAG) (en.wikipedia.org). It is a powerful framework for reasoning under uncertainty by encoding joint probability distributions in a factored way. Key aspects to understand: the network structure, conditional independence properties, and how to apply Bayes’ theorem for inference.

Structure: Nodes in the graph correspond to random variables, and directed edges represent direct influences (causal or conditional dependencies) (en.wikipedia.org). If there’s an arrow from $X$ to $Y$, we interpret that $X$ directly influences $Y$ (in probabilistic terms, $Y$’s probability distribution depends on the value of $X$). Each node has an associated conditional probability table (CPT) that specifies $P(\text{Node} \mid \text{Parents(Node)})$. The graph must have no directed cycles (DAG).

Conditional Independence: The structure encodes conditional independencies: any two variables that are not connected by a path in the graph are independent given some evidence (more formally, d-separation tells us the independence relationships). As a simple rule, if two nodes have no path between them, they are independent (given nothing) (en.wikipedia.org). Even if there is a path, we can often find a set of conditioning nodes that d-separates (blocks) the path, rendering the variables conditionally independent. The power of a Bayes net is that one can read off many such independence relations from the graph structure (en.wikipedia.org). For example, in a classic network with Burglary -> Alarm <- Earthquake, Burglary and Earthquake are independent of each other (no path between them except through Alarm), but they become dependent when conditioning on the common effect (Alarm) – a phenomenon known as explaining away. Understanding which variables are independent given others is crucial to efficient inference.

Joint Probability via Network: A Bayes net defines a joint probability distribution over all variables that factorizes according to the graph. If the network has variables $X_1,\dots,X_n$, and Parents($X_i$) denotes the set of parent nodes of $X_i$, then the joint distribution is:

$$P(X_1, X_2, \ldots, X_n) = \prod_{i=1}^{n} P(X_i \mid \text{Parents}(X_i)).$$

This formula means we can compute any joint or marginal probability by multiplying the local conditional probabilities from the CPTs. The graph structure drastically reduces the number of parameters needed: we only need each node’s CPT (which involves its parents), instead of a giant table for the full joint.

Example Structure: Consider a simple network for a home alarm system with nodes: Burglary (B), Earthquake (E), Alarm (A), JohnCalls (J), MaryCalls (M). The edges are: $B \to A$, $E \to A$ (a burglary or an earthquake can set off the alarm), and $A \to J$, $A \to M$ (if the alarm sounds, John and Mary may call) (courses.cs.washington.edu). This structure encodes assumptions: $B$ and $E$ are independent causes of the alarm; $J$ and $M$ only respond to the alarm (and are independent of $B$ and $E$ given $A$). The joint probability factors as:

$P(B,E,A,J,M) = P(B) P(E) P(A \mid B,E) P(J \mid A) P(M \mid A)$.

From this, we can answer queries like “If the alarm went off and Mary called (but John didn’t), what’s the probability of a burglary?” by using the network to efficiently compute the posterior $P(B|A, M, \neg J)$.

Applying Bayes’ Theorem (Inference): In a Bayes net, inference means computing posterior probabilities of some variables given evidence on others. The network uses Bayes’ Theorem and the chain rule implicitly through its structure. We typically want to answer queries of the form “Given some observed evidence $E$, what is $P(\text{Query} \mid E)$?” For instance, $P(Burglary=true \mid Alarm=true, Earthquake=false)$. The network’s factorization allows efficient computation using algorithms like variable elimination or belief propagation, rather than summing over an exponential number of joint entries. In our example, upon hearing the alarm, we update the probability of burglary vs. earthquake causing it. The network will apply Bayes’ rule:

$P(Burglary \mid Alarm) \propto P(Alarm \mid Burglary)\,P(Burglary)$, normalized by $P(Alarm)$, while considering earthquake as an alternative cause. Inference Note: When evidence is observed, the Bayes net “propagates” that information. For example, if Alarm = true and Earthquake = false, the probability of Burglary goes up because the alarm still needs an explanation (earthquake was ruled out). This is Bayes’ theorem in action within the network. As another example, if we observe the effect ($A$) and want cause probabilities ($B$ or $E$), the network effectively computes $P(B|A) = \frac{P(A|B)P(B)}{P(A)}$. The big advantage is that we don’t have to enumerate all combinations; the graph guides a more efficient calculation (upgrad.com). Modern AI systems can perform such probabilistic inference quickly even for large networks.

Example Query: Using the alarm network above, suppose: $P(B)=0.001, P(E)=0.002$. The alarm’s CPT might say $P(A|B,E)=0.95$, $P(A|B,\neg E)=0.94$, $P(A|\neg B, E)=0.29$, $P(A|\neg B,\neg E)=0.001$ (alarm can rarely ring accidentally). If we find the alarm ringing ($A$ true) and know there was no earthquake ($E$ false), we can compute the posterior odds of a burglary vs no burglary. By Bayes’ theorem:

$P(B|A,\neg E) = \frac{P(A|B,\neg E)P(B)}{P(A|\neg E)}$. We get $P(A|\neg E) = P(A|B,\neg E)P(B)+P(A|\neg B,\neg E)P(\neg B) = 0.94(0.001) + 0.001(0.999) \approx 0.00194$. So $P(B|A,\neg E) = \frac{0.94 \cdot 0.001}{0.00194} \approx 0.485$. So there’s about a 48.5% chance a burglary caused the alarm given no quake (despite burglary prior being 0.1%). The network efficiently encapsulated these calculations.
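The same numbers can be checked with a few lines of Python (my own sketch; only the CPT entries quoted above are used):

```python
p_b = 0.001                       # prior P(Burglary)
p_a_given_b_noE  = 0.94           # P(Alarm | Burglary, ~Earthquake)
p_a_given_nb_noE = 0.001          # P(Alarm | ~Burglary, ~Earthquake)

# Bayes' theorem with Earthquake fixed to false:
p_a_noE = p_a_given_b_noE * p_b + p_a_given_nb_noE * (1 - p_b)
p_b_given_a_noE = p_a_given_b_noE * p_b / p_a_noE

print(f"P(A | ~E)    = {p_a_noE:.5f}")          # ~0.00194
print(f"P(B | A, ~E) = {p_b_given_a_noE:.3f}")  # ~0.485
```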

Why Bayes Nets are useful: They allow combining prior knowledge and data, handling incomplete data, and performing diagnostic (effect-to-cause) as well as predictive (cause-to-effect) reasoning. They enforce a structured way to represent joint distributions. Moreover, the conditional independence makes reasoning tractable: algorithms leverage the graph structure to avoid brute-force enumeration. By constructing a Bayes net for a domain, you essentially have an expert system that can answer probabilistic queries by applying Bayes’ theorem on the encoded relationships. This is widely used in areas like medical diagnosis, troubleshooting systems, and any AI that needs to reason under uncertainty. (Self-written notes: … )

6. Search Algorithms and Games (Minimax, Alpha-Beta Pruning, Utility Values)

In adversarial games (like chess, tic-tac-toe), AI agents use search algorithms to decide on moves. The classic framework is the Minimax algorithm for optimal play in zero-sum games, often enhanced with Alpha-Beta pruning to increase efficiency. We’ll also discuss utility values, which are the outcomes at terminal states that the algorithm uses to back up decisions.

Minimax Algorithm (Optimal Play in Two-Player Games)

Minimax is a recursive backtracking algorithm to choose moves that maximize an agent’s chance of winning while assuming the opponent plays optimally to minimize that chance (geeksforgeeks.org). In a two-player zero-sum game, one player is the Maximizer (trying to maximize the final utility or score) and the other is the Minimizer (trying to minimize the final utility) (geeksforgeeks.org).

The game can be thought of as a tree: nodes are game states; edges are moves. Terminal nodes are end states with a certain utility (win = +1, loss = -1, draw = 0, or more complex evaluations for games like chess).

Minimax principle: Max assumes Min will always choose moves that are worst for Max. So Max picks the move that maximizes its outcome assuming Min responds to minimize that outcome. Formally, at Max nodes we take the max of the values of successor states; at Min nodes we take the min of successors. This propagation finds the minimax value of each node (the guaranteed outcome with perfect play).

Backed-up values: The utility values of terminal states propagate upward. Every Min node returns the minimum of its children’s values (because the opponent will force the outcome least favorable for Max), and every Max node returns the maximum (Max will choose the most favorable branch) (geeksforgeeks.org). By the root, the chosen move is the one leading to the value that Max can guarantee.

Utility Values: In these games, a utility (payoff) function evaluates terminal states from Max’s perspective. Commonly, +1, 0, -1 for win/draw/loss or some heuristic score if not at a true terminal. The minimax algorithm assumes opponent’s utility is the negative of Max’s (zero-sum). So if Max gets +5, Min’s perspective is -5. Utility values at terminal states are the foundation for decision-making – they are the numbers the algorithm tries to maximize or minimize. All non-terminal states’ values are derived from these terminal utilities.

In Minimax trees, we often label leaf nodes with their utility values (for Max). Then interior nodes get labeled with their minimax value (Max nodes take the max of their children, Min nodes the min) (mustafamisir.github.io). These interior values indicate the eventual outcome assuming optimal play from that point.

Simple Example: Consider a small game tree. Max (root) has two possible moves leading to two branches (nodes A and C). Suppose after a few moves the possible terminal outcomes have known utilities. We can assign values to the leaves:

Root (Max)
├── A (Min)
│   ├── D (Max): leaves 3, 5
│   └── E (Max): leaves 6, 9
└── C (Min)
    ├── F (Max): leaves 1, 2
    └── G (Max): leaves 0, -1

(The leaf numbers are the utility values at the terminal states.)

At the leaves are the utility values for Max (green boxes). Let’s compute minimax values:

Node D (Max) has children worth 3 and 5, so D’s value = $\max(3,5)=5$.

Node E (Max) has children 6 and 9, so E’s value = $\max(6,9)=9$.

Now A is a Min node with children D=5 and E=9, so A’s value = $\min(5,9)=5$ (Min will force the outcome corresponding to 5, not allow 9).

On the right branch: F (Max) yields $\max(1,2)=2$; G (Max) yields $\max(0,-1)=0$.

C (Min) takes $\min(F=2,\,G=0) = 0$.

At the root, Max chooses between A’s branch (value 5) and C’s branch (value 0). Max will choose the move leading to value 5 (since 5 > 0). So the minimax algorithm predicts a utility of 5 for the root, and Max should take the branch through A. This is the optimal outcome assuming the opponent also plays optimally (Min forces the 5, not the 9, in that branch).
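The backed-up values can be verified with a tiny recursive sketch (my own illustration; a tree node is either a terminal utility or a list of child subtrees):

```python
def minimax(node, is_max):
    if isinstance(node, int):                 # terminal state: return its utility
        return node
    values = [minimax(child, not is_max) for child in node]
    return max(values) if is_max else min(values)

tree = [[[3, 5], [6, 9]],    # branch via A (Min) -> D, E (Max nodes)
        [[1, 2], [0, -1]]]   # branch via C (Min) -> F, G (Max nodes)

print(minimax(tree, is_max=True))   # 5 -> Max should take the branch through A
```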

Figure (source: https://www.geeksforgeeks.org/minimax-algorithm-in-game-theory-set-4-alpha-beta-pruning/): example of a minimax game tree evaluation. Leaf nodes show utility values for Max. These propagate up: Max nodes take the max of their children, Min nodes the min. Here, the left branch (via node A) guarantees Max a value of 5, whereas the right branch (via C) would result in 0, so Max will choose the left branch.

Key points: Minimax ensures optimal decisions in perfect-information, zero-sum games. Its drawback is the combinatorial explosion of game trees – it examines every possible move sequence (which is infeasible in games like chess). That’s where pruning comes in.

Alpha-Beta Pruning (Optimizing Minimax)

Alpha-Beta pruning is an optimization that prunes away branches that cannot influence the final decision, thereby reducing the number of nodes evaluated (geeksforgeeks.org). It keeps track of two values, $\alpha$ and $\beta$, which are bounds on the utility that the players can secure, to avoid exploring moves that won’t be chosen.

Alpha ($\alpha$): The best (maximum) value found so far for Max along the path (the lower bound on Max’s achievable value). Max will never let the value fall below $\alpha$.

Beta ($\beta$): The best (minimum) value found so far for Min along the path (the upper bound on Min’s achievable value). Min will never allow the value to go above $\beta$.

During a depth-first search of the game tree, we carry $\alpha$ and $\beta$ values down:

At a Max node, $\alpha$ might be updated: if we find a child with a value higher than current $\alpha$, we raise $\alpha$ (Max has a better guarantee).

At a Min node, $\beta$ might be lowered: if a child has a value lower than current $\beta$, we lower $\beta$.

Pruning condition: If at any point $\alpha \ge \beta$, it means the current node cannot produce a better outcome than an already explored alternative – so we can prune (cut off) this branch (geeksforgeeks.org). Intuitively, if Max has a guarantee of at least $\alpha$ from another branch and we’re exploring a Min branch that has already achieved $\beta$ (an opponent response) lower than $\alpha$, then Min (being adversarial) will never allow Max to get $\alpha$ through this branch – Min would choose a different move to hold Max to $\beta < \alpha$. Thus, there is no need to search further down this branch. Pruning does not affect the final result; it only skips unnecessary evaluation. Alpha-Beta example: Using the previous tree, we illustrate pruning.

Figure (source: https://www.geeksforgeeks.org/minimax-algorithm-in-game-theory-set-4-alpha-beta-pruning/): alpha-beta pruning in action on the same game tree, with the bounds and decisions marked in red.

At node A (Min), after evaluating D’s value = 5, $\beta$ for A becomes 5 (geeksforgeeks.org). Then at E (Max), the first child gives 6, so $\alpha = 6$ at E (geeksforgeeks.org). Now $\alpha \ge \beta$: the Min node A already has an option worth 5, which is lower than anything Max can secure through E ($\ge 6$), so Min would never let the game reach that outcome. Thus, the branch leading to 9 is cut off (pruned) without being evaluated (geeksforgeeks.org). In other words, the node with value 9 is never examined because alpha-beta detected it was fruitless: once one child (6) of E was found, A (the Min parent) knew it had a guaranteed option of 5 elsewhere, so it would not choose E’s branch at all (6 > 5), and the 9 could not possibly affect the result. Similarly, on the right side of the tree, once C (Min) sees that its first child (F) is valued 2 – already worse for Max than the left branch’s guaranteed 5 – the remaining children of C cannot change the root’s decision and can be pruned. In practice, alpha-beta can drastically reduce the number of nodes examined – often allowing a search roughly twice as deep as naive minimax in the same time, by eliminating large parts of the tree (geeksforgeeks.org).

Utility Values revisited: In minimax (with or without pruning), the actual numbers at terminal nodes (utilities) drive the decisions. They come from a heuristic evaluation function if the game is complex (e.g., in chess, when depth-limited, we evaluate non-terminal states with a heuristic score). For simplicity, our examples have used final outcomes with small integer utilities. In any case, the algorithm assumes these utilities accurately reflect desirability. In a zero-sum game, one player’s utility is the negative of the other’s; alpha-beta inherently assumes this as it prunes based on worst-case values for one player versus best-case for the other.

Summary:

Minimax provides the optimal move assuming perfect play. It systematically explores the game tree (which can be huge).

Alpha-Beta uses bounds ($\alpha$, $\beta$) to cut off branches that won’t be chosen, without affecting the result (geeksforgeeks.org). In the best case it reduces complexity from $O(b^d)$ to about $O(b^{d/2})$ (where $b$ is the branching factor and $d$ the depth). A minimal code sketch of the pruning rule follows after this list.

Utility values define what outcomes are preferred. The search algorithm’s goal is to maximize these (for Max). They are given by game rules (win/lose) or heuristics.
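For completeness, here is a minimal alpha-beta sketch (my own illustration, using the same toy tree as the minimax code above); on this tree it skips the leaf with value 9 and the whole G subtree, exactly the kind of cut-off described in the example.

```python
def alphabeta(node, is_max, alpha=float("-inf"), beta=float("inf")):
    if isinstance(node, int):
        return node
    if is_max:
        value = float("-inf")
        for child in node:
            value = max(value, alphabeta(child, False, alpha, beta))
            alpha = max(alpha, value)
            if alpha >= beta:      # the Min node above will never allow this branch
                break              # prune the remaining children
        return value
    else:
        value = float("inf")
        for child in node:
            value = min(value, alphabeta(child, True, alpha, beta))
            beta = min(beta, value)
            if alpha >= beta:      # Max already has a better option elsewhere
                break
        return value

tree = [[[3, 5], [6, 9]], [[1, 2], [0, -1]]]
print(alphabeta(tree, is_max=True))   # 5, the same answer as plain minimax
```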

By understanding minimax, one can craft better heuristics and know why an AI makes a move. And by using alpha-beta, the AI searches much faster, enabling lookahead in more complex games (like examining 6 moves ahead in chess instead of 4, which is a huge difference in performance). Modern game AIs combine these with other enhancements (iterative deepening, transposition tables, etc.), but the foundation remains minimax search and utility optimization.