# [2023 AI/Machine Learning Topics (CS Core)](https://csed.acm.org/wp-content/uploads/2023/03/Version-Beta-v2.pdf)

## Definition and Examples of a Broad Variety of Machine Learning (ML) tasks

### Supervised Learning

[Supervised learning](https://en.wikipedia.org/wiki/Supervised_learning) is essentially [function approximation](https://en.wikipedia.org/wiki/Function_approximation) or [curve fitting](https://en.wikipedia.org/wiki/Curve_fitting) that is generalized to many [data types](https://en.wikipedia.org/wiki/Statistical_data_type) (e.g. [nominal, ordinal, discrete, continuous](https://en.wikipedia.org/wiki/Statistical_data_type#Simple_data_types)) of possibly many dimensions.  Given a set of pairs of inputs with associated outputs (i.e. "feature vectors labeled with labels"), compute a function (i.e. "learn a model") that minimizes expected output prediction error according to a loss function.

#### Classification

[Classification](https://en.wikipedia.org/wiki/Statistical_classification) is supervised learning for discrete outputs.

Examples:
* [Spam filtering](https://en.wikipedia.org/wiki/Naive_Bayes_spam_filtering): Given a training set of message [word frequencies](https://en.wikipedia.org/wiki/Bag-of-words_model) labeled "spam"/"not spam", learn to detect (predict) spam from unseen message word frequencies.
* [Optical character recognition](https://en.wikipedia.org/wiki/Optical_character_recognition#Text_recognition): Given a training set of character images labeled with characters represented, learn to recognize (predict) which character is represented from unseen character images.  A common example is the benchmark [MNIST database](https://en.wikipedia.org/wiki/MNIST_database) of handwritten digit characters.
* [Facial recognition](https://en.wikipedia.org/wiki/Facial_recognition_system): In the last stage of a multistage [deep learning face recognition system](https://arxiv.org/abs/1804.06655), given a set of face representations labeled with IDs, learn to associate unseen, varied instances of face representation to the correct IDs.

#### Regression

[Regression](https://en.wikipedia.org/wiki/Regression_analysis) is supervised learning for continuous outputs.

Examples:
* [Housing Price Prediction](https://www.kaggle.com/c/house-prices-advanced-regression-techniques) (see also [Boston](https://scikit-learn.org/1.0/modules/generated/sklearn.datasets.load_boston.html) and [California](https://scikit-learn.org/1.0/modules/generated/sklearn.datasets.fetch_california_housing.html) datasets): Given a set of house feature vectors (including, for instance, square footage, acreage, number of bedrooms, etc.) labeled with selling prices, predict the selling price of a house given its feature vector.
* TODO - more examples
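
As a minimal sketch of regression as loss minimization, the following fits a line to a single feature by ordinary least squares. The data values and the names `fit_line` and `mean_squared_error` are made up for illustration; real housing data would have many features and would normally be handled with a library.

```python
# Ordinary least-squares fit of a line: price ~ slope * square_feet + intercept.
# All numbers below are hypothetical and chosen only to keep the arithmetic simple.

def fit_line(xs, ys):
    """Return the (slope, intercept) that minimize mean squared error."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

def mean_squared_error(xs, ys, slope, intercept):
    """Average squared difference between predictions and labels (the loss)."""
    return sum((slope * x + intercept - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

square_feet = [1000.0, 1500.0, 2000.0, 2500.0]  # input feature
price_k = [200.0, 290.0, 410.0, 500.0]          # label: price in $1000s (hypothetical)

slope, intercept = fit_line(square_feet, price_k)
predicted_price = slope * 1800.0 + intercept    # prediction for an unseen house
print(slope, intercept, predicted_price)
```

Any other choice of slope or intercept gives a larger mean squared error on this training set, which is the sense in which the fitted line "minimizes prediction error according to a loss function".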

### Reinforcement Learning (RL)

[Reinforcement Learning](https://en.wikipedia.org/wiki/Reinforcement_learning) algorithms learn how to map *states* to *actions* that yield *reward* and *next state* probabilities so as to _maximize expected cumulative future rewards_.  Problems are usually modeled as [Markov Decision Processes (MDPs)](https://en.wikipedia.org/wiki/Markov_decision_process), so algorithms often make use of [Bellman's optimality equations](https://en.wikipedia.org/wiki/Bellman_equation).

Introductory textbook: Sutton, Richard S., Barto, Andrew G.  [Reinforcement Learning: An Introduction. 2nd ed.](http://incompleteideas.net/book/the-book-2nd.html), 2020. ([PDF](http://incompleteideas.net/book/RLbook2020.pdf))

* TODO - examples
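
One small illustration of the Bellman optimality equation is value iteration on a toy MDP. The four-state corridor below, its reward of 1 for reaching the last state, and the discount factor of 0.9 are all assumptions invented for this sketch. Note that value iteration is a planning method that assumes the model (`step`) is known, whereas many RL algorithms must learn from sampled experience.

```python
# Value iteration on a hypothetical 4-state corridor: states 0..3, state 3 is terminal.
GAMMA = 0.9                      # discount factor (assumed)
STATES = [0, 1, 2, 3]
TERMINAL = 3
ACTIONS = ["left", "right"]

def step(state, action):
    """Deterministic model of the environment: return (next_state, reward)."""
    next_state = max(state - 1, 0) if action == "left" else state + 1
    reward = 1.0 if next_state == TERMINAL else 0.0
    return next_state, reward

def action_value(state, action, values):
    """Right-hand side of the Bellman optimality equation for one action."""
    next_state, reward = step(state, action)
    return reward + GAMMA * values[next_state]

def value_iteration(tolerance=1e-9):
    """Repeatedly apply the Bellman optimality update until values stop changing."""
    values = {s: 0.0 for s in STATES}
    while True:
        biggest_change = 0.0
        for s in STATES:
            if s == TERMINAL:
                continue  # no future reward after the terminal state
            best = max(action_value(s, a, values) for a in ACTIONS)
            biggest_change = max(biggest_change, abs(best - values[s]))
            values[s] = best
        if biggest_change < tolerance:
            return values

def greedy_policy(values):
    """Map each non-terminal state to the action with the highest value."""
    return {
        s: max(ACTIONS, key=lambda a: action_value(s, a, values))
        for s in STATES if s != TERMINAL
    }

values = value_iteration()
policy = greedy_policy(values)
print(values)
print(policy)
```

In this corridor the optimal values shrink by a factor of 0.9 for each extra step away from the goal, and the greedy policy moves right in every state.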

### Unsupervised learning

[Unsupervised Learning](https://en.wikipedia.org/wiki/Unsupervised_learning), in contrast to supervised learning, concerns algorithms for learning patterns in _unlabeled_ data, i.e. input feature vectors with _no_ outputs.  Example problems include [clustering](https://en.wikipedia.org/wiki/Cluster_analysis) (e.g. [k-Means Clustering](http://modelai.gettysburg.edu/2016/kmeans/)) and [anomaly detection](https://en.wikipedia.org/wiki/Anomaly_detection).

* [Clustering](https://en.wikipedia.org/wiki/Cluster_analysis) is grouping a set of objects such that objects in the same group (i.e. cluster) are more similar to each other in some sense than to objects of different groups.  
    * In the context of [k-Means Clustering](http://modelai.gettysburg.edu/2016/kmeans/), objects are real-valued vectors, more similar objects have lesser Euclidean distance, and we seek to compute an assignment of each vector to one of $k$ clusters such that the sum of squared distances from each point to its cluster _centroid_ (mean cluster point) is minimized, i.e. a minimal within-cluster sum-of-squares (WCSS). The [archived OnMyPhd website](https://web.archive.org/web/20210614173002/http://www.onmyphd.com/?p=k-means.clustering) describes and visualizes the k-Means Clustering algorithm, which chooses initial centroids and then alternately reassigns points to their closest centroids and recomputes centroids from the current cluster assignments until the iteration converges to a local (but not necessarily global) WCSS minimum.
    * In this [Scikit-Learn comparison of different clustering algorithms on "toy" datasets](https://scikit-learn.org/stable/auto_examples/cluster/plot_cluster_comparison.html), 2D colored scatter plots illustrate strengths and weaknesses of different algorithms for different datasets.

* TODO - more examples
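
The alternating assign/recompute loop of k-Means described above can be sketched in a few lines. The six 2D points and the choice of the first two points as initial centroids are assumptions for illustration; practical implementations choose initial centroids more carefully and usually restart several times.

```python
# A minimal k-Means sketch (alternating assignment and centroid update).

def squared_distance(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))

def mean_point(points):
    """Centroid: coordinate-wise mean of a non-empty list of points."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))

def k_means(points, initial_centroids, max_iterations=100):
    centroids = list(initial_centroids)
    assignment = None
    for _ in range(max_iterations):
        # Assignment step: map each point to the index of its closest centroid.
        new_assignment = [
            min(range(len(centroids)), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        if new_assignment == assignment:
            break  # converged: assignments (and hence centroids) no longer change
        assignment = new_assignment
        # Update step: move each centroid to the mean of its assigned points.
        for c in range(len(centroids)):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:  # keep the old centroid if a cluster becomes empty
                centroids[c] = mean_point(members)
    return centroids, assignment

def wcss(points, centroids, assignment):
    """Within-cluster sum of squares: the quantity k-Means tries to minimize."""
    return sum(squared_distance(p, centroids[a]) for p, a in zip(points, assignment))

points = [(1.0, 1.0), (1.0, 2.0), (2.0, 1.0), (8.0, 8.0), (8.0, 9.0), (9.0, 8.0)]
centroids, assignment = k_means(points, initial_centroids=points[:2])
print(centroids, assignment, wcss(points, centroids, assignment))
```

Even though both initial centroids start inside the same group, the loop separates the two groups after a couple of iterations on this easy dataset. On harder data a poor start can leave the algorithm at a local minimum.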

## Fundamental ML ideas:

* *The NFL (No Free Lunch) Theorem of Machine Learning* - Wolpert and Macready summarize their ["No Free Lunch Theorem"](https://en.wikipedia.org/wiki/No_free_lunch_theorem) as stating that "any two optimization algorithms are equivalent when their performance is averaged across all possible problems".  This is not to say that practical problems do not exhibit structures, regularity, or tendency towards simplicity ([Occam's Razor](https://en.wikipedia.org/wiki/Occam%27s_razor)) for which optimization algorithms (such as ML algorithms) offer performance advantages.  Rather, it says that without nonuniform prior probabilities over target functions to be approximated, we have no reason to believe any one learning algorithm is "better" than another.  See the ["Implications"](https://en.wikipedia.org/wiki/No_free_lunch_theorem#Implications) section of the Wikipedia article to consider the NFL theorem implications in the context of [cross-validation](https://en.wikipedia.org/wiki/Cross-validation_(statistics)).

* *Undecidability of ML* - In Ben-David, S., Hrubeš, P., Moran, S. et al. [Learnability can be undecidable](https://doi.org/10.1038/s42256-018-0002-3). Nat Mach Intell 1, 44–48 (2019), the authors constructed an "Estimating the Maximum Problem" (EMX) learning model, and showed that there is a family of functions whose learnability in EMX is undecidable in standard set theory, i.e. it can be neither proved nor refuted from the standard axioms.  They related learnability to a compression problem which in turn is related to a question on the cardinality of a set, i.e. whether or not it is uncountable.  Given the connection between learnability, compression, and cardinality, and given that many statements regarding cardinalities are undecidable, learnability in some cases is undecidable.  For comparison, in computability theory an [undecidable problem](https://en.wikipedia.org/wiki/Undecidable_problem) is one for which it can be proven that no algorithm exists that can provide a correct "yes"-or-"no" answer.

* *Sources of error in ML* - Three primary sources of error in ML are:
    * *Noise* (a.k.a. "irreducible error", error variance in [Introduction to Statistical Learning, Chapter 2](https://www.statlearning.com/)) in data which cannot be addressed through better choice of model or learning algorithm.
    * *Bias* from the choice of an ML model that is simpler than the true model underlying the data, e.g. error from a linear regression of nonlinear data.
    * *Variance* from the way an ML model may vary in what is learned from one dataset to another, even from the same underlying source of data.

To summarize, our best efforts to learn the best ML models must necessarily make prior assumptions of a nonuniform probability over possible functions, can suffer from undecidability, and, even when we have minimized bias and variance error through the choice of the most appropriate models, may still have irreducible error in the form of noise.  Nevertheless, while theoretical results and sources of error may challenge and limit us, successful applications of ML provide evidence of the practical benefits of ML techniques.  Worst case assumptions, analyses, and proofs can temper but not extinguish the optimism brought by the impressive artifacts of this discipline.


## Simple statistical-based supervised learning such as Naive Bayes, Decision trees
(Focus on how they work without going into mathematical or optimization details; enough to understand and use existing implementations correctly)

### Naive Bayes
[Naive Bayes classification](https://en.wikipedia.org/wiki/Naive_Bayes_classifier) is one of the simplest classification methods in use.  It is based on [Bayes' Theorem](https://en.wikipedia.org/wiki/Bayes%27_theorem) and an assumption of [conditional independence](https://en.wikipedia.org/wiki/Conditional_independence) between every pair of input features given the class.  For each possible class output, we multiply the probability of that class by the conditional probability of each observed input feature value given that class.  We then choose the class for which this product is maximized as our classification.
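
The multiply-and-maximize rule can be sketched for word features as follows. The four training messages are made up, the function names are hypothetical, and two details go beyond the description above and should be read as assumptions of this sketch: add-one (Laplace) smoothing so unseen words do not zero out a product, and summing logarithms instead of multiplying probabilities to avoid numeric underflow.

```python
import math
from collections import Counter, defaultdict

def train(messages):
    """messages: list of (list_of_words, label). Returns the counts the classifier needs."""
    class_counts = Counter(label for _, label in messages)
    word_counts = defaultdict(Counter)   # label -> word -> count
    vocabulary = set()
    for words, label in messages:
        word_counts[label].update(words)
        vocabulary.update(words)
    return class_counts, word_counts, vocabulary

def predict(words, model):
    """Return the label maximizing log P(label) + sum of log P(word | label)."""
    class_counts, word_counts, vocabulary = model
    total_messages = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label, count in class_counts.items():
        score = math.log(count / total_messages)               # class prior
        denominator = sum(word_counts[label].values()) + len(vocabulary)
        for word in words:
            # Add-one smoothing (an assumption of this sketch).
            score += math.log((word_counts[label][word] + 1) / denominator)
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Hypothetical training data.
training_messages = [
    (["win", "money", "now"], "spam"),
    (["free", "money", "now"], "spam"),
    (["meeting", "now"], "not spam"),
    (["lunch", "at", "noon"], "not spam"),
]
model = train(training_messages)
print(predict(["free", "money"], model))
print(predict(["lunch", "meeting"], model))
```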

Example Python use of Naive Bayes classification:
* [Scikit-Learn documentation](https://scikit-learn.org/stable/modules/naive_bayes.html)
* [Datacamp tutorial](https://www.datacamp.com/tutorial/naive-bayes-scikit-learn)

### Decision Trees
In the Number Guessing Game, one person chooses a secret integer in a range (e.g. 1 - 100) and the other player seeks to guess the secret number in as few guesses as possible, with each guess followed by the feedback "higher", "lower", or "correct".  A good player of the game starts with a guess in the middle of possibilities resulting in either a correct guess (rarely), or a reduction in the possible range by half.  In contrast, a poor player could guess the numbers in order ("1", "2", etc.).  What makes a middle guess good is that it maximizes information gained by evenly splitting possibilities.
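
As a quick check of the middle-guess strategy, this sketch counts the guesses it needs for every possible secret in 1 to 100. The function name `guesses_needed` is hypothetical.

```python
def guesses_needed(secret, low=1, high=100):
    """Count guesses used by the 'always guess the middle' strategy."""
    guesses = 0
    while True:
        guess = (low + high) // 2
        guesses += 1
        if guess == secret:
            return guesses
        if guess < secret:
            low = guess + 1    # feedback: "higher"
        else:
            high = guess - 1   # feedback: "lower"

worst_case_middle = max(guesses_needed(secret) for secret in range(1, 101))
print(worst_case_middle)
```

The middle-guess player never needs more than 7 guesses, whereas guessing in order can take up to 100.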

Similarly, in supervised learning, we may construct a [decision tree](https://en.wikipedia.org/wiki/Decision_tree_learning) by asking a question of a single input feature at the root node, dividing cases down possible answer branches, and repeating this process at subsequent child nodes.  Each tree node question is generally selected to best gain information about probable outputs.  Stopping conditions for the recursive growth of the decision tree may vary, but each leaf node of the tree can use its relevant subset of input/output cases to predict an output, either through classification (e.g. most popular of class outputs) or regression (e.g. linear regression on node cases).
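
One common way to "best gain information" at a node is to pick the split with the highest information gain, i.e. the largest reduction in label entropy. The sketch below does this for a single numeric feature with threshold questions of the form "is the value at most t?". The study-hours data and function names are hypothetical, and other split criteria (e.g. Gini impurity) are also widely used.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(values, labels, threshold):
    """Entropy reduction from splitting on 'value <= threshold'."""
    left = [y for x, y in zip(values, labels) if x <= threshold]
    right = [y for x, y in zip(values, labels) if x > threshold]
    if not left or not right:
        return 0.0  # a split that separates nothing gains nothing
    n = len(labels)
    remainder = (len(left) / n) * entropy(left) + (len(right) / n) * entropy(right)
    return entropy(labels) - remainder

def best_threshold(values, labels):
    """Try each observed value (except the largest) as a threshold.

    Assumes the feature takes at least two distinct values.
    """
    candidates = sorted(set(values))[:-1]
    return max(candidates, key=lambda t: information_gain(values, labels, t))

# Hypothetical data: hours studied and exam outcome.
hours = [1, 2, 3, 4, 5, 6]
outcome = ["fail", "fail", "fail", "pass", "pass", "pass"]

threshold = best_threshold(hours, outcome)
print(threshold, information_gain(hours, outcome, threshold))
```

Here the best question is "hours <= 3?", which splits the cases evenly and leaves each branch pure, much like the middle guess in the Number Guessing Game.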

Example Python use of Decision Trees:
* [Scikit-Learn documentation](https://scikit-learn.org/stable/modules/tree.html)
* [Datacamp tutorial](https://www.datacamp.com/tutorial/decision-tree-classification-python)
* [Kaggle tutorial](https://www.kaggle.com/code/dansbecker/your-first-machine-learning-model)

## The overfitting problem and controlling solution complexity (regularization, pruning – intuition only)
### The bias (underfitting) - variance (overfitting) tradeoff

The [Bias-Variance tradeoff](https://en.wikipedia.org/wiki/Bias%E2%80%93variance_tradeoff) concerns the tension between reducing model choice bias, i.e. underfitting the function due to choice of overly simplistic model functions, and avoiding variance, i.e. overfitting the model to the noise (i.e. "irreducible error") by choice of overly complex model function.  An example of too much bias would be a linear regression of data generated from a cubic function with noise.  An example of too much variance would be a very high-degree polynomial regression that seeks a tight fit to the same noisy data.  
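
Bias and variance can be estimated empirically by training the same kind of model on many noisy datasets drawn from one source and looking at its predictions at a fixed query point. The sketch below does this for data generated from a cubic function with noise. As assumptions of this sketch, a "memorizer" that returns the training output of the nearest input stands in for the very high-degree polynomial, and the noise level, query point, and number of trials are arbitrary choices.

```python
import random
import statistics

XS = [-2.0, -1.0, 0.0, 1.0, 2.0]   # fixed training inputs
NOISE_SD = 1.0                      # standard deviation of the label noise (assumed)
X_QUERY = 2.0                       # point at which predictions are compared

def true_function(x):
    return x ** 3

def sample_dataset(rng):
    """Draw one noisy set of training outputs for the fixed inputs."""
    return [true_function(x) + rng.gauss(0.0, NOISE_SD) for x in XS]

def line_prediction(ys, x):
    """Least-squares line: too simple for cubic data (high bias)."""
    n = len(XS)
    mean_x = sum(XS) / n
    mean_y = sum(ys) / n
    slope = (sum((a - mean_x) * (b - mean_y) for a, b in zip(XS, ys))
             / sum((a - mean_x) ** 2 for a in XS))
    return mean_y + slope * (x - mean_x)

def memorizer_prediction(ys, x):
    """Return the noisy training output of the nearest input (fits the noise)."""
    nearest = min(range(len(XS)), key=lambda i: abs(XS[i] - x))
    return ys[nearest]

def bias_squared_and_variance(predict, trials=2000, seed=0):
    rng = random.Random(seed)
    predictions = [predict(sample_dataset(rng), X_QUERY) for _ in range(trials)]
    bias = statistics.mean(predictions) - true_function(X_QUERY)
    return bias ** 2, statistics.pvariance(predictions)

line_bias2, line_var = bias_squared_and_variance(line_prediction)
memo_bias2, memo_var = bias_squared_and_variance(memorizer_prediction)
print("line:      bias^2 =", line_bias2, " variance =", line_var)
print("memorizer: bias^2 =", memo_bias2, " variance =", memo_var)
```

The line's predictions are systematically off (bias) but vary less from dataset to dataset, while the memorizer is right on average at this point but inherits the full variance of the noise.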

Examples:
* [Scipy Lecture Notes](https://scipy-lectures.org/packages/scikit-learn/auto_examples/plot_bias_variance.html)
* [Aqeel Anwar Python polynomial illustration](https://towardsdatascience.com/understanding-the-bias-variance-tradeoff-and-visualizing-it-with-example-and-python-code-7af2681a10a7)

[Regularization](https://en.wikipedia.org/wiki/Regularization_(mathematics)) is a technique for introducing a bias towards learning simpler models by penalizing more complex models via the loss function we seek to minimize when learning the model.  For example, one might introduce penalties for having more non-zero model parameters, keeping models simple and sparse, or for having large model parameters that would make the learned function less smooth.  [Early stopping](https://en.wikipedia.org/wiki/Early_stopping) in learning by terminating model learning before the model overfits (or after there are indications of overfitting and reversion to an earlier iteration model) can be viewed as a form of regularization in time.
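
As a minimal sketch of a penalty on large parameters, consider fitting a one-parameter model $y \approx w x$ by minimizing squared error plus `penalty * w**2` (often called ridge or L2 regularization). For this special case the minimizer has the closed form $w = \sum x y / (\sum x^2 + \text{penalty})$. The data and penalty values below are made up for illustration.

```python
def ridge_weight(xs, ys, penalty):
    """Minimizer of sum((w*x - y)^2) + penalty * w^2 for a single weight w."""
    return sum(x * y for x, y in zip(xs, ys)) / (sum(x * x for x in xs) + penalty)

def penalized_loss(w, xs, ys, penalty):
    """Squared error plus the penalty on the size of the weight."""
    return sum((w * x - y) ** 2 for x, y in zip(xs, ys)) + penalty * w ** 2

xs = [1.0, 2.0, 3.0]
ys = [2.0, 4.0, 6.0]

unregularized = ridge_weight(xs, ys, penalty=0.0)   # plain least squares
regularized = ridge_weight(xs, ys, penalty=14.0)    # weight shrunk toward 0
print(unregularized, regularized)
```

A larger penalty pulls the learned weight toward zero, trading some fit to the training data for a simpler, smoother function.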

[Pruning in neural networks](https://en.wikipedia.org/wiki/Pruning_(artificial_neural_network)) involves the removal of edges (weights) and nodes (neurons) that are least important by some measure in order to move towards a more efficient, less overfitting model.  Decision tree growth stopping conditions (e.g. maximum depth limits, node purity thresholds, etc.) are another means for limiting variance from overfitting.


## Working with Data

### Data preprocessing

Recommended reading:
* [Scikit-Learn's user guide for Preprocessing Data](https://scikit-learn.org/stable/modules/preprocessing.html)

#### Importance and pitfalls of data preprocessing

Data Science work was once jokingly described to me as "90+% data wrangling followed by a dozen lines of Python code and you're done."  While this is hyperbole and oversimplification, students of Machine Learning often take data preprocessing for granted.  Toy problems in texts are small, easily visualized, and successful ... mainly for illustration.  In real-world applications, data is often messy, complicated, and prone to sources of errors beyond noise, bias, and variance such as:

* Data Collection Error - A person filling out a survey may misunderstand a question or supply erroneous information.  A sensor may supply an incorrect reading.  Errors in data can happen at the source.
* Data Entry Error - A person transcribing, transmitting, or transferring data into electronic form may introduce errors, e.g. typos, numeric information supplied in incorrect units, failure to create data entries to given standards, etc.
* Data Storage/Retrieval Error - Data can be corrupted in the storage and retrieval process.
* Missing Values - When data is missing, how one handles such data can introduce related errors.
* Scaling Error - When different features have very different orders of magnitude, some models will exaggerate the relative importance of large-magnitude features.  Normalization/standardization of data that gives insufficient attention to the distribution of such data can result in inappropriate scaling.

Put simply, "garbage in, garbage out" is the phrase that best describes the result of poor data preprocessing.  Careful preparation of data is a prerequisite to successful Machine Learning.

### Handling missing values (imputing, flag-as-missing)
(Implications of imputing vs flag-as-missing)

Missing values in data can happen for a variety of reasons.  Data is unavailable at time of collection.  A person may not supply data for privacy or other concerns.  Data merged from different surveys may supply different subsets of features.  For whatever reasons values may be missing, how one handles such data has implications for the quality of one's learned models.  Interestingly, *there is no simple, agreed-upon recipe for handling missing data*.  

If one has a large number of complete data instances, one might simply drop all instances with missing data.  If a small number of features have many missing values, we may drop such features.  However, the data that is missing can also tell a story.  A person declining to answer a survey question or selecting a "decline to answer" option can be communicating important information.  Silence can be informative, so explicitly creating a "_missing_" categorical value, or a "sentinel" numeric value outside the normal range of values, or a separate feature _flagging_ the prior numeric column as having a missing value, is a way of _embracing_ missing data and letting it remain missing.
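
The "embrace missing data" options can be sketched as follows, with `None` marking a missing entry. The function names, the sentinel value of -1.0, and the sample data are hypothetical. A sentinel only makes sense if it lies outside the feature's valid range.

```python
def flag_missing(values, sentinel=-1.0):
    """Return (filled_values, flags): a sentinel-filled column plus a 0/1 'was missing' feature."""
    filled = [sentinel if v is None else v for v in values]
    flags = [1 if v is None else 0 for v in values]
    return filled, flags

def missing_as_category(values, missing_label="missing"):
    """Treat a missing categorical answer as its own category."""
    return [missing_label if v is None else v for v in values]

ages = [34.0, None, 51.0, None]
filled_ages, age_was_missing = flag_missing(ages)

answers = ["yes", None, "no", None]
encoded_answers = missing_as_category(answers)

print(filled_ages, age_was_missing)
print(encoded_answers)
```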

In a different approach, it is sometimes possible to "fill in the blanks".  This is called _imputation_ of missing values, and it must be practiced with care.  In some of the recommended readings below, it is suggested that one should impute the mean or median for missing values.  However, be forewarned: While often practiced, this is considered statistically invalid, as it effectively reduces the variance of the data.   It is better to build a _predictor_ of missing values from the other data and use such a predictor to impute missing values so as to avoid spiking middle value frequencies and artificially reducing the data's variance.  
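
The variance-reducing effect of mean imputation is easy to see on a small made-up column, again using `None` for missing entries:

```python
import statistics

column = [2.0, None, 4.0, 6.0, None, 8.0]   # hypothetical data with two missing values

observed = [v for v in column if v is not None]
mean_value = statistics.mean(observed)

# Mean imputation: every missing entry gets the same middle value.
imputed = [mean_value if v is None else v for v in column]

variance_observed = statistics.pvariance(observed)
variance_imputed = statistics.pvariance(imputed)
print(variance_observed, variance_imputed)
```

The imputed column has the same mean but a smaller variance (here 5.0 shrinks to about 3.33), because the filled-in values add mass exactly at the center of the distribution.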

That said, there is no perfect methodology.  This is why understanding the data and understanding your intended modeling is of greatest importance.  The first will guide a common sense approach for whether one should prefer to drop, embrace, or impute missing data.  The second will constrain the prior choices based on the flexibility of the modeling method.  For example, some models (e.g. decision-tree-based methods) can handle missing values better than others (e.g. neural networks).

Recommended reading: 

* [Scikit-Learn's user guide for Imputation of Missing Values](https://scikit-learn.org/stable/modules/impute.html)
* VanderPlas, Jake. [Python Data Science Handbook: Essential Tools for Working with Data](https://jakevdp.github.io/PythonDataScienceHandbook/). United States: O'Reilly Media, 2016. [Chapter 3 section on "Handling Missing Data"](https://jakevdp.github.io/PythonDataScienceHandbook/03.04-missing-values.html)
* Leanne and Justin's ["Data Cleaning in Python: the Ultimate Guide (2020)"](https://towardsdatascience.com/data-cleaning-in-python-the-ultimate-guide-2020-c63b88bf0a0d)
* John Sullivan's ["Data Cleaning with Python and Pandas: Detecting Missing Values"](https://towardsdatascience.com/data-cleaning-with-python-and-pandas-detecting-missing-values-3e9c6ebcf78b)
* Matthew Brems's [summary of statistical pros and cons of different missing values methods](https://github.com/matthewbrems/ODSC-missing-data-may-18/blob/master/Analysis%20with%20Missing%20Data.pdf).

### Encoding categorical variables, encoding real-valued data

Recommended reading:
* Introductory definition sections for Wikipedia's [One-Hot](https://en.wikipedia.org/wiki/One-hot) and [Dummy Variable](https://en.wikipedia.org/wiki/Dummy_variable_(statistics)) and [Power Law](https://en.wikipedia.org/wiki/Power_law) entries
* [Scikit-Learn's OneHotEncoder documentation](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html) and [user guide](https://scikit-learn.org/stable/modules/preprocessing.html#preprocessing-categorical-features)
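
For intuition, one-hot encoding can be sketched in plain Python: each category becomes its own 0/1 column, and each row has exactly one 1. The function name and the color data are hypothetical. In practice one would typically use an existing implementation such as the Scikit-Learn encoder linked above.

```python
def one_hot_encode(values):
    """Return (categories, rows), where each row has a single 1 in its category's column."""
    categories = sorted(set(values))
    column_of = {category: i for i, category in enumerate(categories)}
    rows = []
    for value in values:
        row = [0] * len(categories)
        row[column_of[value]] = 1
        rows.append(row)
    return categories, rows

colors = ["red", "green", "blue", "green"]
categories, encoded = one_hot_encode(colors)
print(categories)
print(encoded)
```

Unlike mapping categories to integers such as 0, 1, 2, this encoding does not impose an artificial order or distance between nominal categories.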


### Normalization/standardization

Generally, normalization/standardization is a transformation of the data so that each feature has mean 0 and standard deviation 1, or so that its range becomes [-1, 1].  As a motivating example, consider a $k$-Nearest-Neighbor binary classification model for $k = 7$ for continuous inputs $x_1 \in [0, 10]$ and $x_2 \in [0, 10000]$.  Such a model stores all training data and makes a classification for a new point $(x_1, x_2)$ as follows: Find the 7 points closest (according to Euclidean distance) to $(x_1, x_2)$ and choose the class that the majority of those 7 belong to.  If the data is uniformly distributed in the given ranges, then points will tend to be much farther apart in the $x_2$ feature, making $x_2$ closeness have disproportionate influence on classification.
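
The effect can be seen in a small sketch that uses the feature ranges from the example. To keep it short, it uses a single nearest neighbor ($k = 1$) rather than $k = 7$, and the two training points, the query point, and the function names are made up for illustration.

```python
import math

# Known feature ranges from the example: x1 in [0, 10], x2 in [0, 10000].
FEATURE_RANGES = [(0.0, 10.0), (0.0, 10000.0)]

def rescale(point):
    """Map each feature linearly from its range to [-1, 1]."""
    return tuple(
        2.0 * (value - low) / (high - low) - 1.0
        for value, (low, high) in zip(point, FEATURE_RANGES)
    )

def nearest_label(query, labeled_points, transform=lambda p: p):
    """Label of the training point closest to the query (1-nearest-neighbor)."""
    q = transform(query)
    closest = min(labeled_points, key=lambda item: math.dist(q, transform(item[0])))
    return closest[1]

training = [((1.0, 5000.0), "a"), ((9.0, 5100.0), "b")]
query = (8.5, 4900.0)

raw_neighbor = nearest_label(query, training)              # distances dominated by x2
scaled_neighbor = nearest_label(query, training, rescale)  # both features comparable
print(raw_neighbor, scaled_neighbor)
```

On the raw data the query is matched to point "a" because a difference of 100 in $x_2$ outweighs a difference of 7.5 in $x_1$, even though 7.5 spans three quarters of the $x_1$ range. After rescaling, the query is matched to "b", which is close in both features relative to their ranges.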

Recommended reading:
* [Scikit-Learn's preprocessing example "Compare the effect of different scalers on data with outliers"](https://scikit-learn.org/stable/auto_examples/preprocessing/plot_all_scaling.html)
* [FAQ: Should I normalize/standardize/rescale the data?](http://www.faqs.org/faqs/ai-faq/neural-nets/part2/section-16.html)
    * Two of the most useful ways to standardize inputs are: 
        * Mean 0 and standard deviation 1 
        * Midrange 0 and range 2 (i.e., minimum -1 and maximum 1) 


### Emphasis on real data, not textbook examples

## Representations
### Hypothesis spaces and complexity
### Simple basis feature expansion, such as squaring univariate features
### Learned feature representations

## Machine learning evaluation
### Separation of train, validation, and test sets
### Performance metrics for classifiers
### Estimation of test performance on held-out data
### Tuning the parameters of a machine learning model with a validation set
### Importance of understanding what your model is actually doing, where its pitfalls/shortcomings are, and the implications of its decisions

## Basic neural networks
### Fundamentals of understanding how neural networks work and their training process, without details of the calculations
* [Tensorflow Playground](https://playground.tensorflow.org/)

## Ethics for Machine Learning

Reading: 
* [Lo Piano, S. Ethical principles in machine learning and artificial intelligence: cases from the field and possible ways forward. Humanit Soc Sci Commun 7, 9 (2020).](https://www.nature.com/articles/s41599-020-0501-9)
### Focus on real data, real scenarios, and case studies.
### Dataset/algorithmic/evaluation bias
* Relevant [Model AI Assignments](http://modelai.gettysburg.edu) with associated resources:
  * [Analyzing the COMPAS Recidivism Algorithm by Raechel Walker, Matthew Taylor, Olivia Dias, Zeynep Yalcin, and Dr. Cynthia Breazeal](http://modelai.gettysburg.edu/2023/compas/)
  * [Reflecting on Bias by Chris Brooks](http://modelai.gettysburg.edu/2022/bias/)
  * [Detecting Bias in Language Models by Ameet Soni and Krista Thomason](http://modelai.gettysburg.edu/2020/weat/)
  * [Exploring Unfairness and Bias in Data by Jonathan Chen, Tom Larsen, and Marion Neumann](http://modelai.gettysburg.edu/2020/bias)
  * [A Module on Ethical Thinking about Autonomous Vehicles in an AI Course](http://modelai.gettysburg.edu/2018/ethics/index.html)
