My Machine Learning Adventure (A Series)
Welcome back! (If you missed it, catch up with the previous post.) My journey to understand machine learning began a year ago, but as someone who has always been curious about the capabilities of AI, I’ve only recently made any real beginner breakthroughs. My new approach is to work with various GPT models and formats to design an interactive machine learning model that learns directly from my inputs. In this blog series, I document my experiences, the challenges I face, and the solutions I discover along the way.
In the first post, I introduced the project and shared my excitement about the concept of active learning. Today, we’ll get our hands dirty by setting up the environment and taking the first steps towards building our interactive machine learning model.
Remember, I’m no machine learning expert. I’m learning just like you!
Initial Environment + Library Installation
Quick hits: I used Jupyter Notebook and JupyterLab, installed through Anaconda Navigator. I won’t cover those tools in depth in this series, but I may highlight them in a future post.
Before building my model, I needed to set up the necessary tools. With the assistance of Machine Learning GPT, I chose the following set of libraries:
- modAL: A modular active learning framework for Python.
- scikit-learn: A popular machine learning library.
- numpy: A library for numerical computations.
- pandas: A powerful data manipulation tool.
- joblib: Used for saving and loading models.
I installed them in JupyterLab by opening a terminal in my project folder environment and entering the following command:
pip install modAL scikit-learn numpy pandas joblib
Installing from the terminal puts the libraries into the environment itself, so the install step doesn’t need to be rerun inside the notebook code.
These libraries provide a solid foundation for the project, allowing efficient data handling and building robust machine learning models.
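Before moving on, a quick sanity check can confirm that everything installed correctly. This is an optional snippet I’m adding here for illustration, not part of my original notebook:
# Optional: confirm each library imports without errors
import modAL
import sklearn
import numpy
import pandas
import joblib
print("All libraries imported successfully")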
Creating and Saving the Initial Dataset
With the libraries installed, it was time to create some data. I used make_classification from scikit-learn to generate a synthetic dataset that simulates a binary classification problem, such as whether a picture is or is not a dog, or whether the sentiment in a piece of text is positive or negative.
This dataset served as the foundation for training and testing the model. Imports required:
import numpy as np
import joblib
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
The next portion of code generates a dataset that simulates a binary classification problem as mentioned earlier.
# Generate a synthetic dataset
X, y = make_classification(n_samples=1000, n_features=20, n_informative=2, n_redundant=10, random_state=42)
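To get a feel for what make_classification produced, here’s a small optional check (my own addition, reusing the X and y from above):
# Optional: inspect the feature matrix and the label balance
print(X.shape)         # (1000, 20) -> 1,000 samples with 20 features each
print(np.bincount(y))  # roughly 500 samples per class (labels 0 and 1)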
By saving the dataset using joblib, I ensured that it could be easily loaded and used later in the project.
# Save the dataset
joblib.dump((X, y), 'X_y_data.pkl')
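For reference, loading the file back later is just as simple. This is a hedged example of how I’d expect to reload it in a later session, using the filename from the dump call above:
# Load the dataset back from disk when needed
X, y = joblib.load('X_y_data.pkl')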
The data was then split into training and pool sets, where the training set would initialize the model, and the pool set would be used for querying.
# Split the dataset
X_train, X_pool, y_train, y_pool = train_test_split(X, y, test_size=0.75, random_state=42)
Let’s break down what the previous block of code does in a way that’s easier to understand, because I struggled with it.
Imagine you have a big list of data (like numbers, images, or text), and you want to use this data to teach a computer to do something, like recognizing cats in pictures or predicting how much your favorite sports team will score in a game. This dataset has X, the “input” data (like pictures of animals), and y, the labels (such as “dog” or “cat”).
You can’t use all your data to teach the computer (train it). You need to keep some data aside to test if the computer has learned correctly.
This code splits your data into two parts: one part for training and one part to hold back for later use (in our case, the pool the learner will query from).
- The Code Breakdown
train_test_split is the function that does the splitting for you.
X_train and y_train are the parts of your data that you’ll use to train the computer.
X_pool and y_pool are the parts of your data that you’ll keep aside for later (like testing).
- Parameters in the Code
X and y: Your whole dataset.
test_size=0.75: This means 75% of your data will be kept aside (in X_pool and y_pool), and 25% will be used for training (in X_train and y_train). You can see the resulting sizes in the quick check right after this list.
random_state=42: This is like setting a “seed” so that if you run this code again, you get the same split every time. It’s useful for consistency.
Initializing the Active Learner Model
Even though I didn’t realize what I’d done at the time, I’d generated a synthetic dataset and split it into training and pool sets. The next step was to initialize the active learner model.
I chose the RandomForestClassifier from scikit-learn as the base estimator for my active learner.
from modAL.models import ActiveLearner
from sklearn.ensemble import RandomForestClassifier
Keeping with our cats/dogs categorizing example, a learner is like a student that learns from these examples. Active learning is a special way of teaching the computer: instead of showing it all the pictures at once, you start by showing it a few, and then it asks for more examples of the things it is unsure about. This way, it learns more efficiently. (There’s a small sketch of that query-and-teach loop at the end of this section.)
# Initialize the learner
learner = ActiveLearner(
    estimator=RandomForestClassifier(),
    X_training=X_train, y_training=y_train
)
I saved the model and data.
# Save the model and the pool data
joblib.dump(learner, 'active_learner_model.pkl')
joblib.dump((X_pool, y_pool), 'X_y_pool.pkl')
The model was initialized with the training data and saved for future use. This step marked the completion of our initial setup, providing a solid base for building the interactive elements of our project.
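To make the “it asks for more examples” idea concrete, here’s a minimal sketch of a single query-and-teach round with modAL. This is my own illustration of where the project is heading, not code from this post’s setup; it assumes the learner and pool created above are still in memory:
# Ask the learner which pool sample it is most unsure about
query_idx, query_instance = learner.query(X_pool)
# In the real app a human would provide the label; here we peek at y_pool
learner.teach(X_pool[query_idx], y_pool[query_idx])
# Remove the queried sample from the pool so it isn't asked about again
X_pool = np.delete(X_pool, query_idx, axis=0)
y_pool = np.delete(y_pool, query_idx, axis=0)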
Challenges and Reflections
One of the initial challenges I faced was ensuring that all libraries were correctly installed and compatible with each other. That was mostly due to my misunderstanding about how “pip” works. I initially had everything in my notebook, but once I realized the libraries should be installed through the terminal/command line, I was able to make progress.
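As an aside, if you’d rather install from inside a notebook, Jupyter also supports a %pip magic that installs into the environment backing the current kernel. I’m mentioning it only as an alternative I learned about afterwards, not what I did here:
# Run once in a notebook cell; installs into the kernel's environment
%pip install modAL scikit-learn numpy pandas joblib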
Once the environment was set up, the process of generating and saving data, as well as model initialization, went smoothly. This part of my adventure was crucial for understanding the basic components needed for the project and building a strong foundation. It took me days to figure out, and along the way I realized the importance of patience and attention to detail. Patience is definitely not a large part of my personality or temperament, but this endeavor showed me that it (and attention to detail) are vitally important. Setting up the environment might seem like a straightforward task, but it’s the backbone of the entire project. Ensuring that everything is correctly installed and configured saves a lot of time and frustration down the line.
In the next post, we’ll dive into building the interactive web application using Flask. This is where the project starts to come to life, allowing us to interact with the model in real-time. Stay tuned!