As part of a project for work, I needed to port some R machine learning code to python. We weren’t trying to achieve 100% identity of results, instead opting to try to implement things “natively” to the language, and…

The rsample package has the

initial_split(data, prop = 3/4, strata = NULL, ...)

function has the strata argument, which allows us to select which variable to stratify with. If this variable is categorical, porting the code to python is easy, as we would use

sklearn.model_selection.train_test_split(*arrays, **options)

If the variable is numerical, it’s harder to figure out what rsample is doing by default. There is, however, a

make_strata(x, nunique = 5, pool = 0.15, depth = 20)

function, which most likely is operating with the same principles.

The help for this states that:

  1. For numeric data, if the number of unique levels is less than nunique, the data are treated as categorical data.

  2. For categorical inputs, the function will find levels of x than occur in the data with percentage less than pool. The values from these groups will be randomly assigned to the remaining strata (as will data points that have missing values in x). (i.e. by default, categories that represent less than 15% of the dataset will be pooled before sampling happens)

  3. For numeric data with more unique values than nunique, the data will be converted to being categorical based on percentiles of the data. The percentile groups will have no more than 20 percent of the data in each group. Again, missing values in x are randomly assigned to groups.

I am assuming this 20% is being specified by the depth parameter.

The easiest way to replicate this in python for numeric data seems to be to create a “dummy” variable, which will contain the quartiles as factors, and then to stratify using that:

pd.qcut(ameshousingClean[‘Sale_Price’], 10, labels=range(10))