A Guide to Data Sampling Techniques in Python for Machine Learning
In the realm of machine learning, the process of splitting data into training and test sets is crucial for model evaluation and validation. Various sampling approaches can be employed to achieve this split, each with its own benefits and considerations. Let’s explore how to sample data in Python using different techniques, focusing on the example of splitting a dataset of vehicles evaluated by the EPA.
Sampling Data in Python:
Before delving into model training, it’s essential to prepare the data by separating the dependent variable (response) from the independent variables (predictors). In our example, the CO2 emissions column serves as the dependent variable, while the other columns are independent variables.
Simple Random Sampling:
The train_test_split function from the sklearn.model_selection submodule in scikit-learn enables us to perform simple random sampling to split the data into training and test sets. By default, this function allocates 25% of the original data to the test set; we can adjust that proportion with the test_size argument.
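As a minimal sketch of that default behavior (using a small synthetic DataFrame as a stand-in for the EPA vehicle data), omitting test_size leaves 25% of the rows in the test set:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Small synthetic dataset standing in for the EPA vehicle data
df = pd.DataFrame({'weight': range(100), 'co2': range(100)})
X, Y = df[['weight']], df['co2']

# With no test_size argument, 25% of the rows go to the test set
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, random_state=42)
print(len(X_train), len(X_test))  # 75 25
```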
Stratified Random Sampling:
In contrast to simple random sampling, stratified random sampling preserves the distribution of values in a chosen column between the training and test sets, ensuring that the representation of each category remains consistent across the two datasets. Passing that column to the stratify argument of train_test_split achieves this.
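To see stratification in action, here is a sketch on synthetic data with an imbalanced categorical column (the column name 'Drive' mirrors the EPA example below, but the values are made up):

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Imbalanced categorical column: 80% 'FWD', 20% 'AWD'
df = pd.DataFrame({'Drive': ['FWD'] * 80 + ['AWD'] * 20,
                   'co2': range(100)})

# Stratify on 'Drive' so both splits keep the 80/20 ratio
train, test = train_test_split(df, test_size=0.25,
                               stratify=df['Drive'], random_state=42)
print(test['Drive'].value_counts(normalize=True))
```

With 25 test rows, stratification yields exactly 20 'FWD' and 5 'AWD' rows, matching the 80/20 split of the full dataset.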
Example Code:
Let’s demonstrate the sampling techniques in Python using the dataset of vehicles evaluated by the EPA:
import pandas as pd
from sklearn.model_selection import train_test_split
# Load dataset
vehicles = pd.read_csv('vehicles_dataset.csv')
# Separate dependent variable (Y) and independent variables (X)
response = 'CO2 Emissions'
Y = vehicles[response]
predictors = list(vehicles.columns)
predictors.remove(response)
X = vehicles[predictors]
# Perform simple random sampling
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.25, random_state=42)
# Perform stratified random sampling on the 'Drive' column
# (the deliberately tiny test set makes the effect of stratification easier to see)
X_train_strat, X_test_strat, Y_train_strat, Y_test_strat = train_test_split(
    X, Y, test_size=0.01, stratify=vehicles['Drive'], random_state=42
)
In this code snippet, we load the dataset of vehicles, separate the dependent and independent variables, and then split the data using both simple random sampling and stratified random sampling techniques. By comparing the distributions of values in the test set obtained through these methods, we can observe the impact of sampling strategies on data representation.
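One way to make that comparison concrete is to inspect the category proportions in each test set against the full dataset. This sketch uses a synthetic 'Drive' column rather than the EPA file, so the numbers are illustrative only:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in: 80% 'FWD', 20% 'RWD'
df = pd.DataFrame({'Drive': ['FWD'] * 80 + ['RWD'] * 20, 'co2': range(100)})

# Simple random vs. stratified splits of the same data
_, test_simple = train_test_split(df, test_size=0.25, random_state=0)
_, test_strat = train_test_split(df, test_size=0.25,
                                 stratify=df['Drive'], random_state=0)

print(df['Drive'].value_counts(normalize=True))           # population proportions
print(test_simple['Drive'].value_counts(normalize=True))  # may drift from 80/20
print(test_strat['Drive'].value_counts(normalize=True))   # matches 80/20 exactly
```

The stratified test set reproduces the population proportions exactly, while the simple random test set can over- or under-represent the minority category by chance.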
Data sampling is a fundamental aspect of machine learning workflows, influencing model performance and generalization capabilities. By mastering different sampling techniques and understanding their implications, data scientists can enhance the robustness and reliability of their machine learning models. Experimenting with various sampling approaches on diverse datasets can provide valuable insights into data behavior and model outcomes.