Summary of Lecture 1: Black Box Machine Learning
00:00:07David Rosenberg, from Bloomberg, introduces a foundational machine learning course aimed at training participants to become expert practitioners in data science. The course covers supervised machine learning techniques, initially treating machine learning libraries as black boxes; subsequent weeks delve into the mathematical and computational workings of the algorithms. The emphasis is on practical application rather than theoretical proofs, with a focus on gaining the intuition and insight needed for good practice. The first class explores "black box machine learning": achieving impactful results without a full understanding of the underlying algorithms, provided the infrastructure is sound and the tools are used properly. The session covers the basics of machine learning, how to execute it effectively, potential challenges, and a case study with practical application exercises. The general structure of machine learning is to take input data (X) and produce a desired output (Y), often in the form of predictions.
00:03:36Machine learning involves creating prediction functions that take inputs and produce outputs. Common problem types include binary classification (such as spam detection), multi-class classification (such as medical diagnosis with several possible outcomes), and regression (predicting a number, such as a stock price). Machine learning differs from rule-based approaches like the expert systems of the 1980s: rather than hand-writing rules, it leverages data and algorithms to learn prediction functions that solve the problem.
00:07:11The passage discusses expert systems in black box machine learning, where rules are created based on expert knowledge to make decisions. This process involves a labor-intensive rule-writing iteration cycle that can be fragile and fail to generalize to unforeseen inputs. Expert systems also struggle to handle uncertainty and produce probabilistic predictions, making them less reliable and robust compared to machine learning systems, especially in dynamic environments like financial markets.
00:10:52Rule-based systems that cannot handle unseen situations are problematic, especially in areas like market prediction where each day is unique. Machine learning systems generalize better to unforeseen situations and can be adapted to changing circumstances more easily than rule-based systems. Supervised learning trains the system on pairs of input and output examples, allowing the algorithm to infer the rules on its own. When the environment changes, a machine learning system can be updated by adding new training data, whereas a rule-based system requires manual rule adjustments.
00:14:06Machine learning algorithms are prized for their ability to generalize well, making them popular across applications. A learning algorithm takes in training data and produces a prediction function, which is then deployed to solve the specific problem. Key concepts include classification types (binary, multi-class), prediction functions, training data pairs, and feature extraction, which converts raw inputs into fixed-length arrays of numbers for the algorithm to process. The goal is to create prediction functions that handle diverse input types like text, images, and sound files, even though the algorithms themselves require structured numerical data.
00:17:45In the context of machine learning, input features play a crucial role in the training of algorithms. These features are representations of data that are fed into the prediction function. Good features are essential for the algorithm to understand the patterns and make accurate predictions. The more informative and structured the features are, the easier the job becomes for the machine learning algorithm. Human expertise is often required to design effective features, as different algorithms may handle features differently. Feature vectors are inputs to prediction functions, and designing them effectively can significantly impact the algorithm's performance. An example task could be determining if a string is a valid email address, which can be approached using a machine learning method with carefully chosen feature extractors.
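For the email-address task, a feature extractor along these lines maps a raw string to named numeric features (the specific feature names here are illustrative, not taken from the lecture):

```python
def extract_features(s):
    """Map a raw string to a dict of named numeric features."""
    return {
        "length": len(s),                                # raw string length
        "contains_at": 1 if "@" in s else 0,             # indicator feature
        "ends_with_com": 1 if s.endswith(".com") else 0, # indicator feature
        "num_dots": s.count("."),                        # count feature
    }
```

For example, `extract_features("alice@example.com")` yields `{"length": 17, "contains_at": 1, "ends_with_com": 1, "num_dots": 1}`; downstream, such a dict would be flattened into a fixed-length numeric vector.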
00:21:05The input string is transformed into a vector of numbers for processing, with each element representing a specific feature. The idea of a feature template allows for a more systematic approach in generating features, such as creating one-hot encodings for all possible three-letter suffixes. One-hot encoding is a method where only one feature is active (set to 1) while the rest are set to 0. Categorical variables, which can take on discrete values, can also be encoded using one-hot encoding or dummy encoding. In the case of encoding the five boroughs of New York City, only four features are needed to capture the information without redundancy.
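A minimal sketch of these two encodings (the suffix list is a small illustrative subset, not the full template of all three-letter suffixes):

```python
SUFFIXES = [".com", ".org", ".net"]  # illustrative subset of a suffix template

def suffix_one_hot(s):
    """One-hot encoding from a feature template: at most one indicator is active."""
    return [1 if s.endswith(suf) else 0 for suf in SUFFIXES]

# Dummy coding: 4 indicators suffice for 5 categories;
# all zeros implicitly encodes the fifth borough (Staten Island).
BOROUGHS = ["Manhattan", "Brooklyn", "Queens", "Bronx"]

def borough_dummy(b):
    return [1 if b == name else 0 for name in BOROUGHS]
```

Here `borough_dummy("Queens")` gives `[0, 0, 1, 0]`, while `borough_dummy("Staten Island")` gives `[0, 0, 0, 0]`, which is why only four features are needed.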
00:24:29In machine learning, labeled data refers to pairing a D-dimensional feature vector with its correct output label, creating a labeled example. A prediction function takes a feature vector and outputs a label or prediction response. Feature extraction transforms input into numeric values for the algorithm's use. Evaluation of a prediction is done using a loss function, which measures the difference between the predicted and target outputs. Learning algorithms use labeled data to produce prediction functions, emphasizing the importance of evaluating and improving predictions.
00:28:04The video discusses the concept of loss functions in machine learning, which evaluate the performance of prediction models by measuring the difference between predicted and actual values. Common loss functions include the 0-1 loss for classification tasks and the square loss for regression tasks. The goal is to minimize the loss, indicating good performance of the prediction function. To accurately assess performance, it is important to test the model on new data (a test set) separate from the training data, to avoid bias from evaluating on data the model has already seen.
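The two loss functions mentioned can be sketched directly, together with the average loss over a dataset that a test-set evaluation computes:

```python
def zero_one_loss(y_pred, y_true):
    """0-1 loss for classification: 0 if correct, 1 if wrong."""
    return 0 if y_pred == y_true else 1

def square_loss(y_pred, y_true):
    """Square loss for regression: squared difference from the target."""
    return (y_pred - y_true) ** 2

def average_loss(loss, preds, targets):
    """Average a loss function over paired predictions and targets."""
    return sum(loss(p, t) for p, t in zip(preds, targets)) / len(targets)
```

For instance, `average_loss(zero_one_loss, [1, 0, 1, 1], [1, 1, 1, 0])` is 0.5, i.e., a 50% error rate.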
00:31:11In black box machine learning, the data is divided into a training set for building the model and a test set for evaluating the prediction function. The test set helps determine whether the prediction function is reliable enough to deploy in real-world scenarios. The size of the test set affects the accuracy of the performance assessment: a smaller test set gives a noisier estimate, while a larger test set gives a more precise one but leaves less data for training. With small datasets there is therefore a trade-off between the precision of the evaluation and the quality of the prediction function. The train-test scenario simulates the train-deploy scenario: it tests the prediction function before it is used in practice.
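A minimal sketch of the random split described above (the 20% default is an arbitrary illustrative choice):

```python
import random

def train_test_split(examples, test_fraction=0.2, seed=0):
    """Randomly shuffle labeled examples, then hold out a fraction as the test set."""
    rng = random.Random(seed)
    shuffled = examples[:]          # copy so the caller's list is untouched
    rng.shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]   # (train, test)
```

Shuffling before splitting is what makes the test set a random, representative sample rather than a biased slice of the data.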
00:34:04The key to successful real-world machine learning is ensuring that the performance of your model on test data is a good estimate of its performance in deployment. To achieve this, it is crucial to split your data randomly into training and test sets, as biased selections can lead to inaccurate results. In time series prediction problems, random splitting of days into training and testing may not be suitable as it can result in predicting the past based on the future. In such cases, careful consideration of the train-deploy scenario is necessary to avoid data leakage and ensure accurate model performance.
00:37:45The prediction function is built on historical data for training and testing purposes. To make the train-test split reflect the real-world scenario closely, a point in time T is chosen: data before T becomes the training set and data from T onward becomes the test set. The model is evaluated on the test set before deployment. To improve deployment performance, a new model can then be built on the combined training and test data, since training on more data generally yields a better prediction function.
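The time-based split described above can be sketched as follows (the `(timestamp, x, y)` tuple layout is an assumption for illustration):

```python
def time_split(examples, t):
    """Split (timestamp, x, y) examples at time t: train strictly before t,
    test at t and after. Avoids predicting the past from the future."""
    ordered = sorted(examples, key=lambda e: e[0])
    train = [e for e in ordered if e[0] < t]
    test = [e for e in ordered if e[0] >= t]
    return train, test
```

Unlike a random split, every training example precedes every test example in time, matching the train-deploy scenario.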
00:41:20The speaker discusses the importance of evaluating machine learning models on a fresh test set to ensure unbiased performance estimates, and of keeping the training data and prediction function consistent when introducing new features or changes. Combining the training and test sets for retraining captures variations and characteristics that may affect model performance. Randomization is suggested as a method to achieve this, although it does not always guarantee similar results.
00:44:45The speaker discusses the importance of including more recent data in a test set for better performance evaluation, as the performance may change over time when the model is deployed in the real world. They suggest monitoring the model's performance in the wild and detecting any changes or decreases in performance. The concept of non-stationarity is introduced, particularly covariate shift, where the inputs or features in the model change over time, impacting the model's performance. Monitoring and adapting to these changes is essential for ensuring the model's continued effectiveness.
00:48:43Concept drift refers to the phenomenon where the correct output may change for a particular input over time, even if the input distribution remains the same. For example, a shopper's interest in clothing items may shift from winter coats to short sleeve shirts and shorts after the winter season. To address concept drift, one approach is to include relevant factors like seasonality in the input data. K-fold cross-validation is an alternative to the traditional train-test split, especially useful when working with small labeled datasets to more accurately assess model performance by partitioning the data into multiple subsets for training and testing.
00:52:04In k-fold cross-validation, data is divided into k subsets and the model is tested on data it wasn't trained on, ensuring a reliable performance measure. By averaging performance measures across the folds, we get a more accurate estimate of the model's generalizability. However, it's important to be cautious with small datasets and understand potential issues with k-fold cross-validation. The final step after obtaining the performance estimate is to train the model on all the data to achieve optimal performance.
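The k-fold procedure can be sketched generically; `train_fn` and `eval_fn` are placeholder callables standing in for any model-building procedure and performance measure:

```python
def k_fold_scores(examples, k, train_fn, eval_fn):
    """Generic k-fold cross-validation: each fold is held out exactly once.
    train_fn(train_examples) -> model; eval_fn(model, held_out) -> score."""
    folds = [examples[i::k] for i in range(k)]   # round-robin partition
    scores = []
    for i in range(k):
        held_out = folds[i]
        train = [e for j, fold in enumerate(folds) if j != i for e in fold]
        scores.append(eval_fn(train_fn(train), held_out))
    return sum(scores) / k                       # average performance estimate
```

Every example is used for evaluation exactly once, and each model is always tested on data it was not trained on.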
00:55:19The speaker describes using a black box machine learning algorithm to create a prediction function by repeating the same model building procedure on each of the five folds. The goal is to assess the performance of the model building procedure in a train-and-deploy scenario; the emphasis is on estimating performance rather than tuning parameters. Notably, the model actually deployed is not one of the five models from the assessment, but the result of applying the same model building procedure to the entire dataset. The speaker also briefly mentions forward chaining for time series cross-validation, an approach analogous to traditional train-test cross-validation that respects temporal order.
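A minimal sketch of forward chaining over time-ordered indices, assuming one test point per fold (fold sizes and the minimum training prefix are illustrative choices):

```python
def forward_chaining_folds(n, min_train=2):
    """Forward chaining for time-series CV: each fold trains on an expanding
    prefix of time-ordered indices 0..end-1 and tests on the next point."""
    return [(list(range(end)), [end]) for end in range(min_train, n)]
```

For `n=5` this yields folds `([0,1],[2])`, `([0,1,2],[3])`, `([0,1,2,3],[4])`: the training window always precedes the test point, mirroring deployment on future data.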
00:58:47In cross-validation, you can average performance across folds to estimate overall performance, though each fold trains on less data than the full dataset, which should be kept in mind. Cross-validation takes longer because it builds multiple models, but it is useful for model selection. To compare the performance of different machine learning algorithms, one should carve a validation set out of the training data; it plays the same role as a test set but is used for choosing the best algorithm.
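Validation-set model selection can be sketched generically; the `fit` callables stand in for arbitrary learning algorithms, and scores are assumed higher-is-better:

```python
def select_algorithm(algorithms, train, validation, eval_fn):
    """Fit each candidate learning algorithm on the training set and pick the
    one with the best score on the held-out validation set.
    algorithms: {name: fit}, fit(train) -> model; eval_fn(model, data) -> score."""
    best_name, best_model, best_score = None, None, float("-inf")
    for name, fit in algorithms.items():
        model = fit(train)
        score = eval_fn(model, validation)
        if score > best_score:
            best_name, best_model, best_score = name, model, score
    return best_name, best_model
```

The test set stays untouched during this loop; it is reserved for the final, unbiased estimate of the chosen algorithm's performance.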
01:02:37This passage discusses the use of various modeling algorithms and prediction functions in machine learning. It emphasizes the importance of properly evaluating model performance using techniques such as cross-validation and avoiding data leakage. The text also touches on the significance of avoiding overfitting the test set when refining models.
01:07:15The speaker cautions against examining test set performance too closely, to avoid biasing the model building procedure. The finite class lemma warns that testing too many models on the test set widens the confidence interval, so the apparently best-performing model may simply be lucky. Careful experimental setup is also needed to prevent leakage, where information that would not be available at deployment influences the model; leakage skews the performance evaluation and creates a false sense of model effectiveness.
01:10:38One common issue in machine learning is leakage, where information about the label leaks into the feature set. For example, including sales commission as a feature when trying to predict sales outcomes would be considered leakage. Another example is using star ratings from Yelp reviews as input for sentiment prediction, which would also be leakage. Sample bias is another important concept, referring to differences in data selection between training, testing, and deployment sets. For instance, using phone surveys only from landlines to model voting patterns would introduce bias, as landline users may not be representative of the entire population.
01:13:51The speaker discusses issues with creating a training set using a subset of companies on the stock exchange, highlighting survivorship bias as a concern where only existing companies are selected and potential bankruptcy or removal from the exchange is not accounted for. This can lead to discrepancies between training and deploy scenarios. The concept of overfitting in machine learning is then introduced, with a toy example of regression using a polynomial prediction function to fit data points generated by a sine wave with added noise. Parameters (W values) are adjusted during the learning process to approximate the underlying function.
01:17:13The learning algorithm used here is a polynomial fitting function that takes input data, training labels, and an integer M to determine the order of the polynomial. The function returns an array of parameter estimates (W's) that represent the model. The algorithm can also predict polynomial values using the parameters and degree M. Hyperparameters, like M, are set by the data scientist and can influence the fitting of the model. The choice of hyperparameters can impact the performance of the model, leading to underfitting or overfitting. Testing different hyperparameters on a validation set helps find the best settings for the model.
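A minimal sketch of this setup, assuming NumPy's `polyfit`/`polyval` as the least-squares fitting routine (the lecture's own code is not shown, and the noise level and sample size here are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0, 1, 20)
y = np.sin(2 * np.pi * x) + 0.1 * rng.standard_normal(20)  # sine wave plus noise

def fit_poly(x_train, y_train, m):
    """Learning algorithm: least-squares fit of a degree-m polynomial.
    Returns the array of parameter estimates (the W's)."""
    return np.polyfit(x_train, y_train, m)

def predict_poly(w, x_new):
    """Evaluate the fitted polynomial at new inputs."""
    return np.polyval(w, x_new)

# m is the hyperparameter controlling model complexity: small m underfits,
# large m can overfit the 20 noisy points.
train_mse = {m: float(np.mean((predict_poly(fit_poly(x, y, m), x) - y) ** 2))
             for m in (1, 3, 9)}
```

Training error always shrinks as m grows, since higher-degree polynomials contain the lower-degree ones; that is precisely why training error alone cannot detect overfitting, and a validation set is needed.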
01:20:52In this video, the speaker discusses the importance of ensuring that the data used for model training is representative of the data that will be encountered during deployment or testing. Overfitting occurs when a model is too complex and fits the training data too closely, leading to poor performance on new data. The concept of model complexity and generalization is explained visually as more complex functions being able to fit noisy data better but potentially overfitting. It is emphasized that finding a balance between model complexity and fit to training data is crucial to avoid overfitting. Ways to address overfitting include reducing model complexity (e.g., using a smaller polynomial) and obtaining more training data. Examples are provided to illustrate how model performance can improve with increased data points and reduced complexity. The video hints at further exploration of these concepts in upcoming discussions.
01:24:18The discussion highlights a common concern in machine learning regarding the interpretability of complex models like ninth-order polynomials. While traditional methods like linear regression and decision trees offer intuitive explanations, modern machine learning techniques can handle high parameter numbers efficiently through regularization. The speaker mentions that neural networks, despite having a high parameter-to-data ratio, can perform well due to various optimization techniques, presenting a mystery in their effectiveness.
01:28:39Machine learning algorithms have hyperparameters that control model complexity, regularization techniques like l1 and l2, optimization algorithms, learning rates, types of models, loss functions, and kernel types. These hyperparameters can be adjusted like knobs to tune the model. It's recommended to split data into training, validation, and test sets, especially if there is plenty of data, to avoid overfitting and get accurate performance estimates. Experimenting with hyperparameters and different models is encouraged as long as proper data splitting and rigor are maintained. K-fold cross-validation may not be necessary with sufficient data.
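The three-way-split workflow can be sketched end to end; the 60/20/20 proportions and the `fit`/`score` interfaces are illustrative assumptions:

```python
import random

def tune_and_evaluate(data, hyperparameter_grid, fit, score, seed=0):
    """Three-way split workflow: choose hyperparameters on the validation set,
    then touch the test set only once for a final, unbiased estimate.
    fit(examples, hp) -> model; score(model, examples) -> higher-is-better."""
    rng = random.Random(seed)
    data = data[:]
    rng.shuffle(data)                       # random split into three sets
    n = len(data)
    train = data[: n * 6 // 10]
    val = data[n * 6 // 10 : n * 8 // 10]
    test = data[n * 8 // 10 :]
    best_hp = max(hyperparameter_grid,
                  key=lambda hp: score(fit(train, hp), val))
    final_model = fit(train + val, best_hp)  # refit on train + validation
    return best_hp, score(final_model, test)  # single look at the test set
```

Because every hyperparameter setting is compared on the validation set, the test score of the final model remains an honest estimate of deployment performance.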
01:31:49The process involves working with a large dataset, setting aside fixed-size evaluation and test sets, and iteratively refining the model through feature extraction, selecting machine learning algorithms and hyperparameters, evaluating on validation sets, and retraining the model until satisfactory performance is achieved. The choice of feature extraction methodology and hyperparameter settings can be viewed as an optimization problem that can potentially be automated. There is ongoing research to improve hyperparameter tuning algorithms to optimize model performance. Hyperparameters often require discrete tuning, which is different from the optimization procedures used in the machine learning algorithm itself.
01:35:44In the context of optimizing machine learning models, it is crucial to consider the settings and approaches that can be used to streamline the process. While optimization algorithms play a significant role, human input is still valuable, especially in domains where feature extraction can be easily parameterized. For instance, neural networks excel in extracting features from images by optimizing functions automatically. However, in scenarios like email text analysis, it is challenging to create a smooth and comprehensive parameterized function for all types of feature extractions. This highlights the importance of human intervention in defining features. Utilizing existing machine learning libraries with diverse options can simplify the process, even in the black box setting. Understanding the internal mechanisms of machine learning algorithms can aid in effectively tuning parameters and preparing data for optimal performance.