July 30, 2024

What is train_test_split for?

Do you know what train_test_split is? If you want to find out and learn how to use it, don’t miss this post. Broadly speaking, train_test_split is a function included in the scikit-learn library that is used to divide arrays or matrices into random train and test subsets.

Its full syntax is: sklearn.model_selection.train_test_split(*arrays, **options).

What is train_test_split?

The train_test_split function allows you to split a data set into two blocks, one for training a model and one for testing it (train and test). This function is closely related to stratification, a process carried out with respect to a variable, which is normally the target variable.

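If we do not tell train_test_split to stratify, the split is purely random, so train and test usually end up with roughly similar class proportions, but nothing guarantees it. Do not confuse this with the size of the split itself: when no size is given, the function does not make extreme divisions such as 90-10%, but reserves 25% of the data for test (remember that the usual convention is 70-75% for train and 30-25% for test).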

Rather than trust that a random split will turn out balanced, what is usually done is pass the stratify parameter, which tells the function that we want to stratify and with respect to which variable.

That way, the proportion that the target variable has in the full data set is the one we will have in both train and test.
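A minimal sketch of what stratifying does in practice. The toy data (80 samples of class 0 and 20 of class 1) and the random_state value are assumptions made for illustration, not taken from the post:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced data: 80 samples of class 0 and 20 of class 1.
X = np.arange(100).reshape(-1, 1)
y = np.array([0] * 80 + [1] * 20)

# stratify=y asks for the class proportions of y in both subsets.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0
)

train_ratio = y_train.mean()  # fraction of class 1 in train
test_ratio = y_test.mean()    # fraction of class 1 in test
print(train_ratio, test_ratio)
```

Both fractions should match the 20% of class 1 in the original data. Without stratify, they would only be close to it by chance.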

Let’s test train_test_split with an example

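from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
knn = KNeighborsClassifier(n_neighbors=1).fit(X_train, y_train)
plot_decision_boundary(X_test, y_test, 0.05, knn)  # helper not defined in this post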

The first thing we do is import the function. What comes after the import is how to use it:

To understand what is going on, we will explain each of the components of these commands:

test_size and train_size

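We see test_size = 0.3, which corresponds to 30%. Just as there is test_size, there is also train_size, but normally only one of them is set and the other is taken as its complement. If we set both, they cannot add up to more than 1, because one of the two could not be respected. If they add up to less than 1, as with test_size = 0.3 and train_size = 0.3, the remaining samples are simply left out of both subsets.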
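The interaction between the two parameters can be checked with a short sketch. The toy arrays, the random_state value and the 0.8 figure in the last call are assumptions chosen for illustration:

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(100).reshape(-1, 1)  # hypothetical toy data, 100 samples
y = np.arange(100) % 2

# Only test_size given: train_size becomes its complement (70 / 30).
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)
only_test = (len(X_tr), len(X_te))

# Both given, summing to less than 1: the rest is left out of both subsets.
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.3, train_size=0.3, random_state=0
)
both_small = (len(X_tr), len(X_te))

# Both given, summing to more than 1: scikit-learn rejects the request.
try:
    train_test_split(X, y, test_size=0.3, train_size=0.8, random_state=0)
    too_large_rejected = False
except ValueError:
    too_large_rejected = True

print(only_test, both_small, too_large_rejected)
```

In everyday use it is simpler to set only test_size and let the function derive the training share.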

shuffle and random_state

What shuffle does is randomize the order of the samples before splitting, while random_state makes the results repeatable. Each call to train_test_split can be given a random_state that acts as a seed.
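A small sketch of both parameters, assuming a toy array of 20 ordered samples and an arbitrary seed of 42 (neither comes from the post):

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(20).reshape(-1, 1)  # hypothetical toy data
y = np.arange(20)

# Same seed, same split.
_, test_a, _, _ = train_test_split(X, y, test_size=0.3, random_state=42)
_, test_b, _, _ = train_test_split(X, y, test_size=0.3, random_state=42)
same_seed_same_split = np.array_equal(test_a, test_b)

# shuffle=False keeps the original order: the last samples become the test set.
_, test_ordered, _, _ = train_test_split(X, y, test_size=0.3, shuffle=False)
print(same_seed_same_split, test_ordered.ravel())
```

Note that shuffle=False cannot be combined with stratify, since stratification relies on shuffling the samples.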

With this information, you will notice that what we are missing is stratification. With stratification, the call would look like this:

X_train,

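The variables we are declaring must follow a specific order: for each array passed in, the function returns the train part first and then the test part (X_train, X_test, y_train, y_test). In other words, in this case the order of the factors does alter the result.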
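As a hedged illustration of the return order, here is one way a stratified call could be written. The arrays X and y and the random_state value are hypothetical stand-ins, since the post's own data is not shown:

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(40).reshape(20, 2)   # hypothetical features: 20 samples, 2 columns
y = np.array([0, 1] * 10)          # hypothetical balanced target

# For each input array the function returns train first, then test,
# so with (X, y) the order is X_train, X_test, y_train, y_test.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, shuffle=True, stratify=y, random_state=0
)
print(X_train.shape, X_test.shape, y_train.shape, y_test.shape)
```

Swapping two names on the left-hand side raises no error, which is why this mistake is easy to miss: the model would simply be trained on the wrong arrays.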

So what do we do next? The following line already looks more like real training with a real model: knn = KNeighborsClassifier(n_neighbors=1).fit(X_train, y_train).

As we can see, the amount of code required to run this machine learning process, launching a model and trying different things, is minimal.

The result of this train_test_split function is:

This is what we have achieved. This is a real result, because this data has not been seen previously by the model. If we compare this result with the one obtained previously, we will see that it is very different.

If we look closely, we haven’t changed the model at all. The previous result came from the same model trained with all the data. The last one, on the other hand, is the same model, with the same number of parameters, trained only with the training data and tested only with the test data.

Likewise, the result obtained with all the data is a "not real" one, since everything the model is evaluated on has been seen before. This means that no new data has been introduced.
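The gap between a "seen" and an "unseen" evaluation can be reproduced with a short sketch. The make_moons data set, its noise level and the random_state values are assumptions standing in for the post's data, which is not shown:

```python
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Hypothetical noisy two-class data standing in for the post's dataset.
X, y = make_moons(n_samples=300, noise=0.3, random_state=0)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0
)
knn = KNeighborsClassifier(n_neighbors=1).fit(X_train, y_train)

# Scoring on data the model has already seen is optimistic: with one
# neighbor, each training point is its own nearest neighbor.
seen_accuracy = knn.score(X_train, y_train)
# Scoring on held-out data is the honest estimate.
unseen_accuracy = knn.score(X_test, y_test)
print(seen_accuracy, unseen_accuracy)
```

With a single neighbor the accuracy on seen data is perfect by construction, so only the held-out score says anything about how the model behaves on new data.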

Do you want to continue learning about machine learning?

If you liked this topic and want to learn more, we have the best option for you! With the guidance of our Big Data, Artificial Intelligence & Machine Learning Full Stack Bootcamp you can learn everything you need to become a great professional in the IT sector. This is an intensive 8-month training in which you will have the opportunity to continue your training process accompanied by the best professionals. Dare to transform your future and request more information now!
