A graph-based functional API for building complex scikit-learn pipelines¶
baikal is written in pure Python. It supports Python 3.5 and above.
Note: baikal is still a young project and there might be backward incompatible changes. The next development steps and backwards-incompatible changes are announced and discussed in this issue. Please subscribe to it if you use baikal.
What is baikal?¶
baikal is a graph-based, functional API for building complex machine learning pipelines of objects that implement the scikit-learn API. It is mostly inspired by the excellent Keras API for Deep Learning, and borrows a few concepts from the TensorFlow framework and the (perhaps lesser known) graphkit package.
baikal aims to provide an API that lets you build complex, non-linear machine learning pipelines that look like this:
with code that looks like this:
x1 = Input()
x2 = Input()
y_t = Input()
y1 = ExtraTreesClassifier()(x1, y_t)
y2 = RandomForestClassifier()(x2, y_t)
z = PowerTransformer()(x2)
z = PCA()(z)
y3 = LogisticRegression()(z, y_t)
ensemble_features = Stack()([y1, y2, y3])
y = SVC()(ensemble_features, y_t)
model = Model([x1, x2], y, y_t)
What can I do with it?¶
With baikal you can
build non-linear pipelines effortlessly
handle multiple inputs and outputs
add steps that operate on targets as part of the pipeline
nest pipelines
use prediction probabilities (or any other kind of output) as inputs to other steps in the pipeline
query intermediate outputs, easing debugging
freeze steps that do not require fitting
define and add custom steps easily
plot pipelines
All with boilerplate-free, readable code.
Why baikal?¶
The pipeline above (to the best of the author’s knowledge) cannot be easily built using scikit-learn’s composite estimators API as you encounter some limitations:
It is aimed at linear pipelines
You could add some step parallelism with the ColumnTransformer API, but this is limited to transformer objects.
Classifiers/Regressors can only be used at the end of the pipeline.
This means we cannot use the predicted labels (or their probabilities) as features to other classifiers/regressors.
You could leverage mlxtend’s StackingClassifier and come up with some clever combination of the above composite estimators (Pipelines, ColumnTransformers, StackingClassifiers, etc.), but you might end up with code that feels verbose and hard to follow.
Cannot handle multiple input/multiple output models.
Perhaps you could instead define a big, composite estimator class that integrates each of the pipeline steps through composition. This, however, will most likely require
writing big __init__ methods to control each of the internal steps’ knobs;
being careful with get_params and set_params if you want to use, say, GridSearchCV;
and adding some boilerplate code if you want to access the outputs of intermediate steps for debugging.
By using baikal as shown in the example above, code can be more readable, less verbose and closer to our mental representation of the pipeline. baikal also provides an API to fit, predict with, and query the entire pipeline with single commands.
Installation¶
To install the latest released version from PyPI:
pip install baikal
If you wish to install the latest development version, you can do so with:
pip install git+https://github.com/alegonz/baikal.git@master#egg=baikal
Requirements¶
numpy
User guide¶
Key concepts¶
The baikal API introduces three basic elements:
Step: Steps are the building blocks of the API. Conceptually similar to TensorFlow’s operations and Keras layers, each Step is a unit of computation (e.g. PCA, Logistic Regression) that takes the data from preceding Steps and produces data to be used by other Steps further in the pipeline. Steps are defined by combining the Step mixin class with a base class that implements the scikit-learn API. This is explained in more detail below.
DataPlaceholder: The inputs and outputs of Steps. If Steps are like TensorFlow operations or Keras layers, then DataPlaceholders are akin to tensors. Don’t be misled though: DataPlaceholders are just minimal, lightweight auxiliary objects whose main purpose is to keep track of the input/output connectivity between steps, and to serve as the keys that map the actual input data to their appropriate Step. They are not arrays/tensors, nor do they contain any shape/type information whatsoever.
Model: A Model is a network (more precisely, a directed acyclic graph) of Steps, and it is defined from the input/output specification of the pipeline. Models have fit and predict routines that, together with a graph-based engine, allow the automatic (feed-forward) computation of each of the pipeline steps when fed with data.
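To make the relationship between these three elements more concrete, here is a toy, pure-Python sketch of a feed-forward graph engine. It is not baikal’s implementation: the names ToyPlaceholder, ToyStep and run_graph are hypothetical, and the steps are plain functions rather than scikit-learn estimators.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass(frozen=True)
class ToyPlaceholder:
    """Hypothetical stand-in for a DataPlaceholder: a name only, no data or shape."""
    name: str


@dataclass
class ToyStep:
    """Hypothetical stand-in for a Step: a function plus its input/output wiring."""
    func: Callable
    inputs: List[ToyPlaceholder]
    output: ToyPlaceholder


def run_graph(steps: List[ToyStep], data: Dict[ToyPlaceholder, object],
              output: ToyPlaceholder) -> object:
    """Run every step whose inputs are available until `output` is computed."""
    cache = dict(data)  # placeholders are the keys that map to the actual data
    pending = list(steps)
    while output not in cache:
        ready = [s for s in pending if all(p in cache for p in s.inputs)]
        if not ready:
            raise ValueError("output cannot be computed from the given inputs")
        for step in ready:
            cache[step.output] = step.func(*(cache[p] for p in step.inputs))
            pending.remove(step)
    return cache[output]


# A small non-linear graph: two parallel branches merged by a third step.
x = ToyPlaceholder("x")
a, b, y = ToyPlaceholder("a"), ToyPlaceholder("b"), ToyPlaceholder("y")
steps = [
    ToyStep(lambda v: v + 1, [x], a),
    ToyStep(lambda v: v * 10, [x], b),
    ToyStep(lambda u, v: u + v, [a, b], y),
]
result = run_graph(steps, {x: 2}, y)
```

The placeholders carry no data themselves; they only serve as dictionary keys, which mirrors the role described for DataPlaceholders above.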
Quick-start guide¶
Without further ado, here’s a short example of a simple SVC model built with baikal:
import sklearn.svm
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from baikal import make_step, Input, Model
# 1. Define a step
SVC = make_step(sklearn.svm.SVC)
# 2. Build the model
x = Input()
y_t = Input()
y = SVC(C=1.0, kernel="rbf", gamma=0.5)(x, y_t)
model = Model(x, y, y_t)
# 3. Train the model
dataset = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
    dataset.data, dataset.target, random_state=0
)
model.fit(X_train, y_train)
# 4. Use the model
y_test_pred = model.predict(X_test)
API walkthrough¶
As shown in the short example above, the baikal API consists of four basic steps:
Define the steps
Build the model
Train the model
Use the model
Let’s take a look at each of them in detail. Full examples can be found in the project’s examples folder.
1. Define the steps¶
A step is defined very easily: just feed the provided make_step function with the class you want to make a step from:
import sklearn.linear_model
from baikal import make_step
LogisticRegression = make_step(sklearn.linear_model.LogisticRegression)
You can make a step from any class you like, as long as that class implements the scikit-learn API.
What this function does under the hood is combine the given class with the Step mixin class. The Step mixin, among other things, endows the given class with a __call__ method, making the class callable on the outputs (DataPlaceholder objects) of previous steps. If you prefer to do this manually, you only have to:
Define a class that inherits from both the Step mixin and the class you wish to make a step of (in that order!).
In the class __init__, call super().__init__(...) and pass the appropriate step parameters.
For example, to make a step for sklearn.linear_model.LogisticRegression we do:
import sklearn.linear_model
from baikal import Step
# The order of inheritance is important!
class LogisticRegression(Step, sklearn.linear_model.LogisticRegression):
    def __init__(self, *args, name=None, n_outputs=1, **kwargs):
        super().__init__(*args, name=name, n_outputs=n_outputs, **kwargs)
Other steps are defined similarly (omitted here for brevity).
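As an illustration of the mixin-plus-base-class pattern described above, the following is a minimal sketch of how a factory in the spirit of make_step could be written with Python’s built-in type(). ToyStepMixin, Scaler and make_toy_step are hypothetical names; this is not the library’s actual implementation.

```python
class ToyStepMixin:
    """Hypothetical mixin that adds a name and makes instances callable."""

    def __init__(self, *args, name=None, **kwargs):
        super().__init__(*args, **kwargs)  # forward the rest to the base class
        self.name = name

    def __call__(self, inputs):
        # A real step would return placeholders; here we only record the wiring.
        return "{}({})".format(self.name, inputs)


class Scaler:
    """Stand-in for an estimator class with its own constructor parameters."""

    def __init__(self, factor=1.0):
        self.factor = factor


def make_toy_step(base_class, attr_dict=None):
    """Build a subclass of (ToyStepMixin, base_class), with the mixin first."""
    attrs = dict(attr_dict or {})
    return type(base_class.__name__, (ToyStepMixin, base_class), attrs)


ScalerStep = make_toy_step(Scaler)
step = ScalerStep(factor=2.0, name="scaler_0")
wiring = step("x")
```

Placing the mixin first in the bases means its __init__ runs first and can strip the step-specific arguments before delegating to the base class, which is why the order of inheritance matters.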
baikal can also handle steps with multiple inputs/outputs/targets. The base class may implement a predict/transform method (the compute function) that takes multiple inputs and returns multiple outputs, and a fit method that takes multiple inputs and targets (native scikit-learn classes at present take one input, return one output, and take at most one target). In this case, the input/target arguments are expected to be a list of (typically) array-like objects, and the compute function is expected to return a list of array-like objects. For example, the base class may implement the methods like this:
class SomeClass(BaseEstimator):
    ...
    def predict(self, Xs):
        X1, X2 = Xs
        # use X1, X2 to calculate y1, y2
        return y1, y2
    def fit(self, Xs, ys):
        (X1, X2), (y1, y2) = Xs, ys
        # use X1, X2, y1, y2 to fit the model
        return self
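The schematic class above can be made concrete with a small, runnable variant. The sketch below uses plain Python lists instead of arrays and no scikit-learn base class; PairCenterer is a hypothetical example that only illustrates the list-in, list-out convention.

```python
class PairCenterer:
    """Hypothetical estimator that takes two inputs and returns two outputs."""

    def fit(self, Xs, ys=None):
        X1, X2 = Xs  # inputs arrive as a list
        self.mean1_ = sum(X1) / len(X1)
        self.mean2_ = sum(X2) / len(X2)
        return self

    def transform(self, Xs):
        X1, X2 = Xs
        # outputs are returned as a list as well
        return [[v - self.mean1_ for v in X1], [v - self.mean2_ for v in X2]]


data = [[1.0, 3.0], [10.0, 30.0]]
centerer = PairCenterer().fit(data)
out1, out2 = centerer.transform(data)
```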
2. Build the model¶
Once we have defined the steps, we can make a model as shown below. First, you create the initial step, which serves as the entry point to the model, by calling the Input helper function. This outputs a DataPlaceholder representing one of the inputs to the model. Then, all you have to do is instantiate the steps and call them on the outputs (DataPlaceholders from previous steps) as you deem appropriate. Finally, you instantiate the model with the inputs, outputs and targets (also DataPlaceholders) that specify your pipeline.
This style should feel familiar to users of Keras.
Note that steps that require target data (like ExtraTreesClassifier, RandomForestClassifier, LogisticRegression and SVC) are called with two arguments. These arguments correspond to the inputs (e.g. x1, x2) and targets (e.g. y_t) of the step. These targets are specified to the Model at instantiation via the third argument. baikal pipelines are made of complex, heterogeneous, non-differentiable steps (e.g. a whole RandomForestClassifier, with its own internal learning algorithm), so there is no magic automatic differentiation that backpropagates the target information from the outputs to the appropriate steps. We must therefore specify directly which step needs which targets.
from baikal import Input, Model
from baikal.steps import Stack
# Assume the steps below were already defined
x1 = Input()
x2 = Input()
y_t = Input()
y1 = ExtraTreesClassifier()(x1, y_t)
y2 = RandomForestClassifier()(x2, y_t)
z = PowerTransformer()(x2)
z = PCA()(z)
y3 = LogisticRegression()(z, y_t)
ensemble_features = Stack()([y1, y2, y3])
y = SVC()(ensemble_features, y_t)
model = Model([x1, x2], y, y_t)
You can call the same step on different inputs and targets to reuse the step (similar to the concept of shared layers and nodes in Keras), and specify a different compute_func/trainable configuration on each call. This is achieved via “ports”: each call creates a new port and associates the given configuration to it. You may access the configuration at each port using the get_*_at(port) methods.
(*) Steps are called on and output DataPlaceholders. DataPlaceholders are produced and consumed exclusively by Steps, so you do not need to instantiate these yourself.
3. Train the model¶
Now that we have built a model, we are ready to train it. The model also follows the scikit-learn API, as it has a fit method:
model.fit(X=[X1_train, X2_train], y=y_train)
baikal.Model.fit(self, X, y=None, **fit_params)
Trains the model on the given input and target data.
The model will automatically propagate the data through the pipeline and fit any internal steps that require training.
- Parameters
X – Input data (independent variables). It can be any of:
A single array-like object (in the case of a single input)
A list of array-like objects (in the case of multiple inputs)
A dictionary mapping DataPlaceholders (or their names) to array-like objects. The keys must be among the inputs passed at instantiation.
y – Target data (dependent variables) (optional). It can be any of:
None (in the case all steps are non-trainable and/or unsupervised learning steps)
A single array-like object (in the case of a single target)
A list of array-like objects (in the case of multiple targets)
A dictionary mapping target DataPlaceholders (or their names) to array-like objects. The keys must be among the targets passed at instantiation.
Targets required by steps that were set as non-trainable might be omitted.
fit_params – Parameters passed to the fit method of each model step, where each parameter name has the form <step-name>__<parameter-name>.
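The <step-name>__<parameter-name> convention can be illustrated with a short sketch that groups such keys by step. split_fit_params is a hypothetical helper written for this illustration, and the parameter names in the usage example are made up; how baikal routes the parameters internally may differ.

```python
from collections import defaultdict


def split_fit_params(fit_params):
    """Group '<step-name>__<parameter-name>' keys by step name."""
    per_step = defaultdict(dict)
    for key, value in fit_params.items():
        # Split at the first double underscore only.
        step_name, sep, param_name = key.partition("__")
        if not sep or not step_name or not param_name:
            raise ValueError("expected '<step-name>__<parameter-name>', got %r" % key)
        per_step[step_name][param_name] = value
    return dict(per_step)


grouped = split_fit_params(
    {"classifier__sample_weight": [1, 2, 1], "pca__some_flag": True}
)
```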
4. Use the model¶
To predict with the model, use the predict method and pass it the input data like you would for the fit method. The model will automatically propagate the inputs through all the steps and produce the outputs specified at instantiation.
y_test_pred = model.predict([X1_test, X2_test])
# This also works:
y_test_pred = model.predict({x1: X1_test, x2: X2_test})
baikal.Model.predict(self, X, output_names=None)
Predict by applying the model on the given input data.
- Parameters
X – Input data. It follows the same format as in the fit method.
output_names – Names of required outputs (optional). You can specify any final or intermediate output by passing the name of its associated data placeholder. This is useful for debugging. If not specified, it will return the outputs specified at instantiation.
- Returns
array-like or list of array-like – The computed outputs.
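Both fit and predict accept X as a single array-like, a list, or a dictionary. The sketch below shows one way such input might be normalized into a single mapping. It assumes inputs are identified by name and that a list is treated as several inputs only when the model has more than one input; normalize_inputs is a hypothetical helper, not a baikal function.

```python
def normalize_inputs(X, input_names):
    """Map each input name to its data, whatever form X was given in."""
    if isinstance(X, dict):
        unknown = set(X) - set(input_names)
        if unknown:
            raise ValueError("unknown inputs: %s" % sorted(unknown))
        return dict(X)
    if isinstance(X, (list, tuple)) and len(input_names) > 1:
        if len(X) != len(input_names):
            raise ValueError("expected %d inputs, got %d" % (len(input_names), len(X)))
        return dict(zip(input_names, X))
    return {input_names[0]: X}  # a single array-like for a single input


from_list = normalize_inputs([[1, 2], [3, 4]], ["x1", "x2"])
from_dict = normalize_inputs({"x2": [3, 4]}, ["x1", "x2"])
from_single = normalize_inputs([1, 2, 3], ["x"])
```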
Models are query-able. That is, you can request outputs other than those specified at model instantiation. This allows querying intermediate outputs and eases debugging. For example, to get both the output from PCA and the ExtraTreesClassifier:
outs = model.predict(
    [X1_test, X2_test], output_names=["ExtraTreesClassifier_0:0/0", "PCA_0:0/0"]
)
You don’t need to pass inputs that are not required to compute the queried output. For example, if we just want the output of PowerTransformer:
outs = model.predict({x2: X2_data}, output_names="PowerTransformer_0:0/0")
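Why only x2 is needed in this query can be seen by walking the graph backwards from the requested output. The following is a minimal sketch of that idea on the wiring of the example model from the "Build the model" section; required_inputs and the short output names are hypothetical and do not reflect baikal’s internals or its naming scheme.

```python
def required_inputs(output, producers, model_inputs):
    """Walk the graph backwards from `output` and collect the inputs it needs.

    `producers` maps each computed output name to the names it is computed from.
    """
    needed, stack, seen = set(), [output], set()
    while stack:
        name = stack.pop()
        if name in seen:
            continue
        seen.add(name)
        if name in model_inputs:
            needed.add(name)
        else:
            stack.extend(producers[name])
    return needed


# Prediction-time wiring of the example model (targets are not needed to predict).
producers = {
    "extra_trees": ["x1"],
    "random_forest": ["x2"],
    "power": ["x2"],
    "pca": ["power"],
    "logreg": ["pca"],
    "stack": ["extra_trees", "random_forest", "logreg"],
    "svc": ["stack"],
}
only_x2 = required_inputs("power", producers, {"x1", "x2"})
both = required_inputs("svc", producers, {"x1", "x2"})
```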
Models are also nestable. In fact, Models are steps, too. This allows composing smaller models into bigger ones, like so:
# Assume we have two previously built complex
# classifier models, perhaps loaded from a file.
submodel1 = ...
submodel2 = ...
# Now we make a stacked classifier from both submodels
x = Input()
y_t = Input()
y1 = submodel1(x)
y2 = submodel2(x, y_t)
z = Stack()([y1, y2])
y = SVC()(z, y_t)
bigmodel = Model(x, y, y_t)
Generalizations introduced by the API¶
The baikal API generalizes scikit-learn estimators and pipelines in several ways:
Steps can be combined into non-linear pipelines. That is,
steps may be parallel,
feed-forward connections may exist between non-consecutive steps,
an input of the pipeline is not necessarily taken from the first step,
an output of the pipeline is not necessarily produced from the last step.
Steps can take multiple inputs and produce multiple outputs. This is useful, for example, for defining steps that aggregate, concatenate or split arrays; building models that take multi-modal data (for example, an input for an image and an input for tabular data); and building models with mixed classification/regression outputs.
Steps can lack a fit method. Models allow steps that have no fit method (a.k.a. stateless estimators). At training time, such steps will omit their own training and simply do inference on their inputs to produce the outputs required by successive steps.
Also, the Model graph engine will, for each step, pass only the arguments associated to the inputs and targets that were specified for that step. So, if you (naturally) didn’t specify any targets for an unsupervised step, then that step can safely define a fit method with a fit(X) signature. This avoids having to define methods with a misleading fit(X, y=None) signature if the step either does not require target data or does not require a fit method at all, improving the readability of estimator classes.
In short, this means steps can
omit defining fit for stateless steps,
define fit(X) for unsupervised steps,
define fit(X, y) for supervised and semi-supervised steps.
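The three cases can be sketched with toy classes and a small dispatcher that passes only the arguments wired to each step. Flatten, Center, MeanRegressor and fit_step are hypothetical names for this illustration; the dispatch logic is an assumption about how such an engine could behave, not a copy of baikal’s code.

```python
class Flatten:
    """Stateless: no fit method at all."""

    def transform(self, X):
        return [v for row in X for v in row]


class Center:
    """Unsupervised: fit takes only X."""

    def fit(self, X):
        self.mean_ = sum(X) / len(X)
        return self

    def transform(self, X):
        return [v - self.mean_ for v in X]


class MeanRegressor:
    """Supervised: fit takes X and y."""

    def fit(self, X, y):
        self.mean_ = sum(y) / len(y)
        return self

    def predict(self, X):
        return [self.mean_ for _ in X]


def fit_step(step, X, y=None):
    """Fit `step` with only the arguments that were specified for it."""
    if not hasattr(step, "fit"):
        return step  # stateless step: nothing to train
    if y is None:
        return step.fit(X)  # no targets were wired to this step
    return step.fit(X, y)


flat = fit_step(Flatten(), [[1, 2], [3, 4]])
center = fit_step(Center(), [1.0, 2.0, 3.0])
reg = fit_step(MeanRegressor(), [1.0, 2.0, 3.0], [10.0, 20.0, 30.0])
```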
Steps can specify any function for inference. Canonical scikit-learn estimators typically define either a predict or a transform method as their function for inference, and the Pipeline API only admits these two. More complex models, however, may require estimators that do other kinds of computations such as prediction probabilities, the decision function, or the leaf indices of decision tree predictions. To allow this, the Step API generalizes these as “compute functions” and provides a compute_func argument that can be used to specify predict_proba, decision_function, apply or any other function for inference.
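A minimal sketch of how a compute function might be resolved from such an argument is shown below. resolve_compute_func and ThresholdClassifier are hypothetical, and the "auto" default that falls back to predict and then transform is an assumption of this sketch.

```python
def resolve_compute_func(estimator, compute_func="auto"):
    """Return the callable used for inference."""
    if callable(compute_func):
        return compute_func
    if compute_func == "auto":
        for name in ("predict", "transform"):
            if hasattr(estimator, name):
                return getattr(estimator, name)
        raise ValueError("estimator has neither predict nor transform")
    return getattr(estimator, compute_func)  # e.g. "predict_proba"


class ThresholdClassifier:
    """Toy classifier whose inputs are already probabilities of class 1."""

    def predict_proba(self, X):
        return [[1.0 - v, v] for v in X]

    def predict(self, X):
        return [int(v >= 0.5) for v in X]


clf = ThresholdClassifier()
labels = resolve_compute_func(clf)([0.2, 0.9])
probas = resolve_compute_func(clf, "predict_proba")([0.25, 0.75])
```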
Steps can be frozen. This is done via a trainable boolean flag and allows you to skip steps during training time. This is useful if you have a pre-trained estimator that you would like to reuse in another model without re-training it when training the whole model.
Steps can specify special behavior at training time. Some estimators define special fit_transform or fit_predict methods that do both training and inference in a single swoop. Usually, such methods are meant to leverage implementations that are more efficient than calling fit and predict/transform separately, or are meant for transductive estimators, as such estimators don’t allow separate training and inference regimes. From the perspective of the execution of a pipeline at training time, where training and inference (to produce the outputs required by successor steps) are done for each step in tandem, these methods can be generalized to provide a means to control these stages jointly and define special behaviors. This can be useful, for example, for implementing training protocols such as that of stacked classifiers, where the classifiers in the first stage are trained on the input data, but instead compute out-of-fold predictions for the next stage in the stack. The Step API provides this via a fit_compute_func argument which, if specified, will be used by the graph execution instead of using fit and compute_func separately.
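The difference between the two training-time paths can be sketched as follows. train_step and MeanRegressor are hypothetical, and the leave-one-out fit_predict is only a stand-in for a joint train-and-infer method; the sketch is not baikal’s graph execution code.

```python
def train_step(step, X, y, fit_compute_func=None, compute_func="predict"):
    """Return the training-time output of `step` for its successor steps."""
    if fit_compute_func is not None:
        # Joint path: one method does both training and inference.
        return getattr(step, fit_compute_func)(X, y)
    # Default path: train first, then run the compute function on the same data.
    step.fit(X, y)
    return getattr(step, compute_func)(X)


class MeanRegressor:
    def fit(self, X, y):
        self.mean_ = sum(y) / len(y)
        return self

    def predict(self, X):
        return [self.mean_ for _ in X]

    def fit_predict(self, X, y):
        # Fit on everything, but predict each sample without using its own target.
        self.fit(X, y)
        total, n = sum(y), len(y)
        return [(total - yi) / (n - 1) for yi in y]


X, y = [0, 1, 2], [3.0, 6.0, 9.0]
plain = train_step(MeanRegressor(), X, y)
joint = train_step(MeanRegressor(), X, y, fit_compute_func="fit_predict")
```

In both cases the step ends up fitted on the full data; only the outputs handed to the next steps during training differ.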
Steps can be shared. Steps can be called on different inputs and targets (similar to the concept of shared layers and nodes in Keras), and specify a different behavior (that is, a specific configuration of compute_func, fit_compute_func and trainable) on each call. The mapping between inputs/targets and the behavior is achieved via “ports”: each call creates a new port on the step and associates the given configuration to the inputs/targets the step was called on. The Model graph engine will then use the appropriate configuration on each set of inputs and targets.
Shared steps allow reusing a step and its learned parameters on different inputs. For example, this is particularly useful for reusing learned transformations on targets. It is also useful for reusing steps of stateless estimators to apply the same computation (e.g. casting data types, dropping dimensions) on several inputs.
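The port mechanism can be pictured with a toy class that stores one configuration per call. ToySharedStep and get_config_at are hypothetical names used only for this sketch, and the output naming is invented; baikal’s own accessors and naming scheme are not reproduced here.

```python
class ToySharedStep:
    """Hypothetical step that stores one configuration per call ("port")."""

    def __init__(self, name):
        self.name = name
        self._ports = []

    def __call__(self, inputs, compute_func="transform", trainable=True):
        port = len(self._ports)  # each call opens a new port
        self._ports.append(
            {"inputs": inputs, "compute_func": compute_func, "trainable": trainable}
        )
        return "{}:{}".format(self.name, port)  # made-up output name

    def get_config_at(self, port):
        return self._ports[port]


transformer = ToySharedStep("transformer")
out0 = transformer("y_t")
out1 = transformer("y_p_trans", compute_func="inverse_transform", trainable=False)
```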
Utilities¶
Persisting the model¶
Like native scikit-learn objects, models can be serialized with pickle or joblib without any extra setup:
import joblib
joblib.dump(model, "model.pkl")
model_reloaded = joblib.load("model.pkl")
Keep in mind, however, the security and maintainability limitations of these formats.
scikit-learn wrapper for GridSearchCV¶
Currently, baikal also provides a wrapper utility class that allows models to be used in scikit-learn’s GridSearchCV API. Below there’s a code snippet showing its usage. It follows the style of Keras’ own wrapper.
See Tune a model with GridSearchCV for an example script of this utility.
A future release of baikal plans to include a custom GridSearchCV API, based on the original scikit-learn implementation, that can handle baikal models natively, avoiding a couple of gotchas with the current wrapper implementation (mentioned below).
# 1. Define a function that returns your baikal model
def build_fn():
    x = Input()
    y_t = Input()
    h = PCA(random_state=random_state, name="pca")(x)
    y = LogisticRegression(random_state=random_state, name="classifier")(h, y_t)
    model = Model(x, y, y_t)
    return model
# 2. Define a parameter grid
# - keys have the [step-name]__[parameter-name] format, similar to sklearn Pipelines
# - You can also search over the steps themselves using [step-name] keys
param_grid = [
    {
        "classifier": [LogisticRegression()],
        "classifier__C": [0.01, 0.1, 1],
        "pca__n_components": [1, 2, 3, 4],
    },
    {
        "classifier": [RandomForestClassifier()],
        "classifier__n_estimators": [10, 50, 100],
    },
]
# 3. Instantiate the wrapper
sk_model = SKLearnWrapper(build_fn)
# 4. Use GridSearchCV as usual
gscv_baikal = GridSearchCV(sk_model, param_grid)
gscv_baikal.fit(x_data, y_data)
best_model = gscv_baikal.best_estimator_.model
Currently there are a couple of gotchas:
The cv argument of GridSearchCV will default to KFold if the estimator is a baikal Model, so you have to specify an appropriate splitter directly if you need another splitting scheme.
GridSearchCV cannot handle models with multiple inputs/outputs. A way to work around this is to split the input data and merge the outputs within the model.
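To get a feel for how many candidates a parameter grid like the one in the snippet above expands to, here is a pure-Python sketch using itertools.product. expand_grid is a hypothetical helper, and strings stand in for the estimator objects; scikit-learn performs its own expansion internally.

```python
from itertools import product


def expand_grid(param_grid):
    """Yield one {key: value} dict per candidate, for a list of sub-grids."""
    for sub_grid in param_grid:
        keys = sorted(sub_grid)
        for values in product(*(sub_grid[k] for k in keys)):
            yield dict(zip(keys, values))


# Same shape as the grid above, with strings standing in for estimator objects.
param_grid = [
    {
        "classifier": ["logreg"],
        "classifier__C": [0.01, 0.1, 1],
        "pca__n_components": [1, 2, 3, 4],
    },
    {"classifier": ["forest"], "classifier__n_estimators": [10, 50, 100]},
]
candidates = list(expand_grid(param_grid))
```

Each sub-grid is expanded on its own, so the two dictionaries contribute 12 and 3 candidates respectively.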
Plotting your model¶
The baikal package includes a plot utility.
from baikal.plot import plot_model
plot_model(model, filename="model.png")
In order to use the plot utility, you need to install pydot and graphviz.
For the example above, it produces this:
Examples¶
Stacked classifiers (naive protocol)¶
Similar to the example in the quick-start guide, (naive) stacks of classifiers (or regressors) can be built as shown below. Note that you can specify the function the step should use for computation, in this case compute_func='predict_proba', to use the label probabilities as the features of the meta-classifier.
x = Input()
y_t = Input()
y_p1 = LogisticRegression()(x, y_t, compute_func="predict_proba")
y_p2 = RandomForestClassifier()(x, y_t, compute_func="predict_proba")
# predict_proba returns arrays whose rows sum to one, so one column is redundant and we drop it
drop_first_col = Lambda(lambda array: array[:, 1:])
y_p1 = drop_first_col(y_p1)
y_p2 = drop_first_col(y_p2)
ensemble_features = ColumnStack()([y_p1, y_p2])
y_p = ExtraTreesClassifier()(ensemble_features, y_t)
model = Model(x, y_p, y_t)
import sklearn.datasets
import sklearn.ensemble
import sklearn.linear_model
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from baikal import Input, Model, make_step
from baikal.plot import plot_model
from baikal.steps import ColumnStack, Lambda
# ------- Define steps
LogisticRegression = make_step(sklearn.linear_model.LogisticRegression)
RandomForestClassifier = make_step(sklearn.ensemble.RandomForestClassifier)
ExtraTreesClassifier = make_step(sklearn.ensemble.ExtraTreesClassifier)
# ------- Load dataset
data = sklearn.datasets.load_breast_cancer()
X, y_p = data.data, data.target
X_train, X_test, y_train, y_test = train_test_split(
    X, y_p, test_size=0.2, random_state=0
)
# ------- Build model
x = Input()
y_t = Input()
y_p1 = LogisticRegression(solver="liblinear", random_state=0)(
    x, y_t, compute_func="predict_proba"
)
y_p2 = RandomForestClassifier(random_state=0)(x, y_t, compute_func="predict_proba")
# predict_proba returns arrays whose rows sum to one, so one column is redundant and we drop it
drop_first_col = Lambda(lambda array: array[:, 1:])
y_p1 = drop_first_col(y_p1)
y_p2 = drop_first_col(y_p2)
stacked_features = ColumnStack()([y_p1, y_p2])
y_p = ExtraTreesClassifier(random_state=0)(stacked_features, y_t)
model = Model(x, y_p, y_t)
plot_model(model, filename="stacked_classifiers_naive.png", dpi=96)
# ------- Train model
model.fit(X_train, y_train)
# ------- Evaluate model
y_train_pred = model.predict(X_train)
y_test_pred = model.predict(X_test)
print("F1 score on train data:", f1_score(y_train, y_train_pred))
print("F1 score on test data:", f1_score(y_test, y_test_pred))
Stacked classifiers (standard protocol)¶
In the naive stack above, each classifier in the 1st level will calculate the predictions for the 2nd level using the same data it used for fitting its parameters. This is prone to overfitting as the 2nd level classifier will tend to give more weight to an overfit classifier in the 1st level. To avoid this, the standard protocol recommends that, during fit, the 1st level classifiers are still trained on the original data, but instead they provide out-of-fold (OOF) predictions to the 2nd level classifier. To achieve this special behavior, we leverage the fit_compute_func API: we define a fit_predict method that does the fitting and the OOF predictions, and add it as a method of the 1st level classifiers (LogisticRegression and RandomForestClassifier, in the example) when making the steps. baikal will then detect and use this method during fit.
from sklearn.model_selection import cross_val_predict
def fit_predict(self, X, y):
    self.fit(X, y)
    return cross_val_predict(self, X, y, method="predict_proba")
attr_dict = {"fit_predict": fit_predict}
# 1st level classifiers
LogisticRegression = make_step(sklearn.linear_model.LogisticRegression, attr_dict)
RandomForestClassifier = make_step(sklearn.ensemble.RandomForestClassifier, attr_dict)
# 2nd level classifier
ExtraTreesClassifier = make_step(sklearn.ensemble.ExtraTreesClassifier)
The rest of the stack is built exactly the same as in the naive example.
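To illustrate what out-of-fold predictions are, independently of scikit-learn, here is a pure-Python sketch with a trivial model. out_of_fold_predictions and MeanRegressor are hypothetical, and the interleaved fold assignment is an assumption made to keep the example short; it is not how cross_val_predict splits the data.

```python
def out_of_fold_predictions(make_model, X, y, n_folds=3):
    """Predict every sample with a model that never saw it during fitting."""
    n = len(X)
    oof = [None] * n
    for fold in range(n_folds):
        test_idx = [i for i in range(n) if i % n_folds == fold]
        train_idx = [i for i in range(n) if i % n_folds != fold]
        model = make_model().fit([X[i] for i in train_idx], [y[i] for i in train_idx])
        predictions = model.predict([X[i] for i in test_idx])
        for i, pred in zip(test_idx, predictions):
            oof[i] = pred
    return oof


class MeanRegressor:
    """Toy model that always predicts the mean of its training targets."""

    def fit(self, X, y):
        self.mean_ = sum(y) / len(y)
        return self

    def predict(self, X):
        return [self.mean_ for _ in X]


X = [0, 1, 2, 3, 4, 5]
y = [0.0, 10.0, 20.0, 30.0, 40.0, 50.0]
oof = out_of_fold_predictions(MeanRegressor, X, y, n_folds=3)
```

Each prediction comes from a model trained on the other folds only, which is what keeps the 2nd level classifier from seeing overly optimistic 1st level outputs.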
import sklearn.datasets
import sklearn.ensemble
import sklearn.linear_model
from sklearn.metrics import f1_score
from sklearn.model_selection import cross_val_predict, train_test_split
from baikal import Input, Model, make_step
from baikal.plot import plot_model
from baikal.steps import ColumnStack, Lambda
# ------- Define steps
# During fit, the 1st level classifiers must be trained on the original data, but must
# provide out-of-fold (OOF) predictions to the 2nd level classifier. To achieve this we
# leverage the fit_compute_func API to configure this behavior. In this case we define
# a fit_predict method that does the fitting and the OOF predictions, and add it as a
# method of the 1st level classifiers (LogisticRegression and RandomForestClassifier)
# when making the steps. baikal will then detect and use this method during fit.
def fit_predict(self, X, y):
    self.fit(X, y)
    return cross_val_predict(self, X, y, method="predict_proba")
attr_dict = {"fit_predict": fit_predict}
LogisticRegression = make_step(sklearn.linear_model.LogisticRegression, attr_dict)
RandomForestClassifier = make_step(sklearn.ensemble.RandomForestClassifier, attr_dict)
ExtraTreesClassifier = make_step(sklearn.ensemble.ExtraTreesClassifier)
# ------- Load dataset
data = sklearn.datasets.load_breast_cancer()
X, y_p = data.data, data.target
X_train, X_test, y_train, y_test = train_test_split(
    X, y_p, test_size=0.2, random_state=0
)
# ------- Build model
# The model is built similarly as the naive case. The difference is that during fit
# baikal will detect and use the fit_predict method above.
x = Input()
y_t = Input()
y_p1 = LogisticRegression(solver="liblinear", random_state=0)(
    x, y_t, compute_func="predict_proba"
)
y_p2 = RandomForestClassifier(random_state=0)(x, y_t, compute_func="predict_proba")
# predict_proba returns arrays whose rows sum to one, so one column is redundant and we drop it
drop_first_col = Lambda(lambda array: array[:, 1:])
y_p1 = drop_first_col(y_p1)
y_p2 = drop_first_col(y_p2)
stacked_features = ColumnStack()([y_p1, y_p2])
y_p = ExtraTreesClassifier(random_state=0)(stacked_features, y_t)
model = Model(x, y_p, y_t)
plot_model(model, filename="stacked_classifiers_standard.png", dpi=96)
# ------- Train model
model.fit(X_train, y_train)
# ------- Evaluate model
y_train_pred = model.predict(X_train)
y_test_pred = model.predict(X_test)
print("F1 score on train data:", f1_score(y_train, y_train_pred))
print("F1 score on test data:", f1_score(y_test, y_test_pred))
Classifier chain¶
The API also lends itself to more interesting configurations, such as that of classifier chains. By leveraging the API and Python’s own control flow, a classifier chain model can be built as follows:
x = Input()
y_t = Input()
order = list(range(n_targets))
random.shuffle(order)
squeeze = Lambda(np.squeeze, axis=1)
ys_t = Split(n_targets, axis=1)(y_t)
ys_p = []
for j, k in enumerate(order):
    x_stacked = ColumnStack()([x, *ys_p[:j]])
    ys_t[k] = squeeze(ys_t[k])
    ys_p.append(LogisticRegression()(x_stacked, ys_t[k]))
ys_p = [ys_p[order.index(j)] for j in range(n_targets)]
y_p = ColumnStack()(ys_p)
model = Model(x, y_p, y_t)
Sure, scikit-learn already does have ClassifierChain and RegressorChain classes for this. But with baikal you could, for example, mix classifiers and regressors to predict multilabels that include both categorical and continuous labels.
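The reordering line in the chain snippet above (the list comprehension using order.index) can be checked in isolation. The sketch below uses strings in place of placeholders; n_targets = 5 and the seed are arbitrary values chosen for this illustration.

```python
import random

n_targets = 5
rng = random.Random(87)
order = list(range(n_targets))
rng.shuffle(order)

# The chain produces predictions in `order`; label each with its target index.
ys_p = ["pred_for_target_{}".format(k) for k in order]

# Same reordering expression as in the model above: position j of the result
# holds the prediction that was produced for target j.
restored = [ys_p[order.index(j)] for j in range(n_targets)]
```

Whatever the shuffled order is, the predictions come back in the original target order, so the columns of the stacked output line up with the columns of the targets.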
import numpy as np
import random
import sklearn.linear_model
from sklearn.datasets import fetch_openml
from sklearn.metrics import jaccard_score
from sklearn.model_selection import train_test_split
from baikal import Input, Model, make_step
from baikal.plot import plot_model
from baikal.steps import ColumnStack, Split, Lambda
# ------- Define steps
LogisticRegression = make_step(sklearn.linear_model.LogisticRegression)
# ------- Load a multi-label dataset
# (from https://www.openml.org/d/40597)
X, Y = fetch_openml("yeast", version=4, return_X_y=True)
Y = Y == "TRUE"
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2, random_state=0)
n_targets = Y.shape[1]
random.seed(87)
order = list(range(n_targets))
random.shuffle(order)
# ------- Build model
x = Input()
y_t = Input()
squeeze = Lambda(np.squeeze, axis=1)
ys_t = Split(n_targets, axis=1)(y_t)
ys_p = []
for j, k in enumerate(order):
    x_stacked = ColumnStack()(inputs=[x, *ys_p[:j]])
    ys_t[k] = squeeze(ys_t[k])
    ys_p.append(LogisticRegression(solver="lbfgs")(x_stacked, ys_t[k]))
ys_p = [ys_p[order.index(j)] for j in range(n_targets)]
y_p = ColumnStack()(ys_p)
model = Model(inputs=x, outputs=y_p, targets=y_t)
# This might take a few seconds
plot_model(model, filename="classifier_chain.png", dpi=96)
# ------- Train model
model.fit(X_train, Y_train)
# ------- Evaluate model
Y_train_pred = model.predict(X_train)
Y_test_pred = model.predict(X_test)
print(
    "Jaccard score on train data:",
    jaccard_score(Y_train, Y_train_pred, average="samples"),
)
print(
    "Jaccard score on test data:",
    jaccard_score(Y_test, Y_test_pred, average="samples"),
)
Transformed target¶
You can also call steps on the targets to apply transformations on them. Note that by making the transformer a shared step, you can re-use learned parameters to apply the inverse transform later in the pipeline.
transformer = QuantileTransformer(n_quantiles=300, output_distribution="normal")
x = Input()
y_t = Input()
# QuantileTransformer requires an explicit feature dimension, hence the Lambda step
y_t_trans = Lambda(np.reshape, newshape=(-1, 1))(y_t)
y_t_trans = transformer(y_t_trans)
y_p_trans = RidgeCV()(x, y_t_trans)
y_p = transformer(y_p_trans, compute_func="inverse_transform", trainable=False)
# Note that transformer is a shared step since it was called twice
model = Model(x, y_p, y_t)
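The idea of reusing learned parameters for the inverse transform can be shown with a toy transformer in plain Python. Standardizer is a hypothetical stand-in for QuantileTransformer, and the regressor is left out; the sketch only shows that the second use relies on what the first one learned.

```python
class Standardizer:
    """Toy transformer with learned parameters and an inverse."""

    def fit(self, y):
        self.mean_ = sum(y) / len(y)
        self.scale_ = (sum((v - self.mean_) ** 2 for v in y) / len(y)) ** 0.5
        return self

    def transform(self, y):
        return [(v - self.mean_) / self.scale_ for v in y]

    def inverse_transform(self, y):
        return [v * self.scale_ + self.mean_ for v in y]


y_train = [2.0, 4.0, 6.0]
transformer = Standardizer().fit(y_train)  # first use: trained on the targets
y_trans = transformer.transform(y_train)
# ... a regressor would be trained on y_trans here ...
y_back = transformer.inverse_transform(y_trans)  # second use: no re-training
```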
# Adapted from the scikit-learn example in:
# https://scikit-learn.org/stable/auto_examples/compose/plot_transformed_target.html#sphx-glr-auto-examples-compose-plot-transformed-target-py
import numpy as np
import sklearn.linear_model
import sklearn.preprocessing
# Note: load_boston was removed in scikit-learn 1.2, so this script needs an older release.
from sklearn.datasets import load_boston
from sklearn.metrics import median_absolute_error, r2_score
from sklearn.model_selection import train_test_split
from baikal import make_step, Input, Model
from baikal.plot import plot_model
from baikal.steps import Lambda
# ------- Define steps
RidgeCV = make_step(sklearn.linear_model.RidgeCV)
QuantileTransformer = make_step(sklearn.preprocessing.QuantileTransformer)
# ------- Load dataset
dataset = load_boston()
target = np.array(dataset.feature_names) == "DIS"
X = dataset.data[:, np.logical_not(target)]
y = dataset.data[:, target].squeeze()
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)
# ------- Build model
transformer = QuantileTransformer(n_quantiles=300, output_distribution="normal")
x = Input()
y_t = Input()
# QuantileTransformer requires an explicit feature dimension, hence the Lambda step
y_t_trans = Lambda(np.reshape, newshape=(-1, 1))(y_t)
y_t_trans = transformer(y_t_trans)
y_p_trans = RidgeCV()(x, y_t_trans)
y_p = transformer(y_p_trans, compute_func="inverse_transform", trainable=False)
model = Model(x, y_p, y_t)
plot_model(model, filename="transformed_target.png", dpi=96)
# ------- Train model
model.fit(X_train, y_train)
# ------- Evaluate model
y_pred = model.predict(X_test)
r2 = r2_score(y_test, y_pred)
mae = median_absolute_error(y_test, y_pred)
print("R^2={}\nMAE={}".format(r2, mae))
Tune a model with GridSearchCV¶
Below is an example showing how to use the scikit-learn wrapper to tune the parameters of a baikal model using GridSearchCV.
import sklearn.decomposition
import sklearn.ensemble
import sklearn.linear_model
from sklearn import datasets
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from baikal import Input, Model, make_step
from baikal.sklearn import SKLearnWrapper
LogisticRegression = make_step(sklearn.linear_model.LogisticRegression)
RandomForestClassifier = make_step(sklearn.ensemble.RandomForestClassifier)
PCA = make_step(sklearn.decomposition.PCA)
def build_fn():
    x = Input()
    y_t = Input()
    h = PCA(random_state=random_state, name="pca")(x)
    y_p = LogisticRegression(random_state=random_state, name="classifier")(h, y_t)
    model = Model(x, y_p, y_t)
    return model
iris = datasets.load_iris()
x_data = iris.data
y_data = iris.target
random_state = 123
verbose = 0
# cv will default to KFold if the estimator is a baikal Model
# so we have to pass StratifiedKFold directly
# shuffle=True is required for random_state to have any effect
cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=random_state)
param_grid = [
    {
        "classifier": [
            LogisticRegression(
                random_state=random_state, solver="lbfgs", multi_class="multinomial"
            )
        ],
        "classifier__C": [0.01, 0.1, 1],
        "pca__n_components": [1, 2, 3, 4],
    },
    {
        "classifier": [RandomForestClassifier(random_state=random_state)],
        "classifier__n_estimators": [10, 50, 100],
        "pca__n_components": [1, 2, 3, 4],
    },
]
sk_model = SKLearnWrapper(build_fn)
gscv_baikal = GridSearchCV(
    sk_model,
    param_grid,
    cv=cv,
    scoring="accuracy",
    return_train_score=True,
    verbose=verbose,
)
gscv_baikal.fit(x_data, y_data)
print("Best score:", gscv_baikal.best_score_)
print("Best parameters", gscv_baikal.best_params_)
# The model with the best parameters can be accessed via:
# gscv_baikal.best_estimator_.model
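To make the shape of the search concrete, here is a small standard-library sketch of how a list-of-dicts grid like the one above expands into candidates: each dictionary contributes the Cartesian product of its own value lists, and keys of the form `step__param` address a parameter of a named step. The helpers `expand` and `split_key` are hypothetical names written for this illustration, and strings stand in for the step instances; this is not how scikit-learn or baikal implement it internally.

```python
from itertools import product

# Strings stand in for the step instances used in the example above.
param_grid = [
    {
        "classifier": ["LogisticRegression"],
        "classifier__C": [0.01, 0.1, 1],
        "pca__n_components": [1, 2, 3, 4],
    },
    {
        "classifier": ["RandomForestClassifier"],
        "classifier__n_estimators": [10, 50, 100],
        "pca__n_components": [1, 2, 3, 4],
    },
]


def expand(grid):
    """Yield one candidate dict per point of each sub-grid's Cartesian product."""
    for sub_grid in grid:
        names = list(sub_grid)
        for values in product(*(sub_grid[name] for name in names)):
            yield dict(zip(names, values))


def split_key(key):
    """Split 'step__param' into (step, param); param is None for a whole-step key."""
    step, sep, param = key.partition("__")
    return step, (param if sep else None)


candidates = list(expand(param_grid))
print(len(candidates))
print(candidates[0])
print(split_key("classifier__C"))
print(split_key("classifier"))
```

Each of the two dictionaries yields 1 x 3 x 4 = 12 combinations, so the search above evaluates 24 candidates, each on 3 folds. The key without a double underscore (`"classifier"`) replaces the whole step, which is why the steps must be given names in `build_fn`.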
API Reference¶
This is the class and function reference of baikal.
Core classes
Mixin class to endow scikit-learn classes with Step capabilities.
A Model is a network (more precisely, a directed acyclic graph) of Steps, and it is defined from the input/output specification of the pipeline.
Steps
Step for arbitrary functions.
Step for stacking arrays along the columns.
Step for concatenating arrays.
Step for splitting arrays.
Step for stacking arrays.
Utilities
Creates a step subclass from the given base class.
Wrapper utility class that allows models to be used in scikit-learn’s …
Get global configuration parameters.
Set global configuration parameters.
Known issues¶
Pickle serialization/deserialization in models using CatBoost steps¶
When trying to use a model loaded from a pickle file and that contains CatBoost steps, you might see the following error:
>>> model = joblib.load("model.pkl")
>>> model.predict(data)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/venv/lib/python3.7/site-packages/baikal/_core/model.py", line 470, in predict
X_norm, [], outputs, allow_unused_inputs=True, follow_targets=False
File "/venv/lib/python3.7/site-packages/baikal/_core/model.py", line 191, in _get_required_nodes
required_nodes |= backtrack(output)
File "/venv/lib/python3.7/site-packages/baikal/_core/model.py", line 176, in backtrack
parent_node = output.node
File "/venv/lib/python3.7/site-packages/baikal/_core/data_placeholder.py", line 44, in node
return self.step._nodes[self.port]
AttributeError: 'CatBoostClassifierStep' object has no attribute '_nodes'
This is because CatBoost estimators (CatBoostClassifier, CatBoostRegressor) implement their own __getstate__ and __setstate__ methods and, if they are not overridden appropriately, they won’t include Step-specific attributes in the pickled result. The solution to this problem is to override the __getstate__ and __setstate__ methods to include the missing attributes as follows:
from baikal import Step
from catboost import CatBoostClassifier
class CatBoostClassifierStep(Step, CatBoostClassifier):
    def __init__(self, *args, name=None, n_outputs=1, **kwargs):
        super().__init__(*args, name=name, n_outputs=n_outputs, **kwargs)
    def __getstate__(self):
        # Start from CatBoost's own state, then add the Step-specific attributes
        state = super().__getstate__()
        state["_name"] = self._name
        state["_nodes"] = self._nodes
        state["_n_outputs"] = self._n_outputs
        return state
    def __setstate__(self, state):
        # Restore the Step-specific attributes, then hand the rest to CatBoost
        self._name = state.pop("_name")
        self._nodes = state.pop("_nodes")
        self._n_outputs = state.pop("_n_outputs")
        super().__setstate__(state)
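The mechanism behind this issue can be reproduced with the standard library alone. The sketch below uses hypothetical stand-in classes (`ThirdPartyEstimator` for an estimator with custom state handling, `StepMixin` for a mixin that adds its own attributes); neither is real CatBoost or baikal code. `copy.deepcopy` is used in place of a pickle round trip because it goes through the same `__getstate__`/`__setstate__` hooks.

```python
import copy


class ThirdPartyEstimator:
    """Hypothetical estimator that controls its own serialized state."""

    def __init__(self, depth=3):
        self.depth = depth

    def __getstate__(self):
        # Only knows about its own attributes.
        return {"depth": self.depth}

    def __setstate__(self, state):
        self.depth = state["depth"]


class StepMixin:
    """Hypothetical mixin that adds attributes the estimator is unaware of."""

    def __init__(self, *args, name=None, **kwargs):
        super().__init__(*args, **kwargs)
        self._name = name
        self._nodes = []


class BrokenStep(StepMixin, ThirdPartyEstimator):
    """Inherits the estimator's state hooks, so mixin attributes are dropped."""


class FixedStep(StepMixin, ThirdPartyEstimator):
    def __getstate__(self):
        state = super().__getstate__()
        state["_name"] = self._name
        state["_nodes"] = self._nodes
        return state

    def __setstate__(self, state):
        self._name = state.pop("_name")
        self._nodes = state.pop("_nodes")
        super().__setstate__(state)


broken = copy.deepcopy(BrokenStep(depth=5, name="clf"))
fixed = copy.deepcopy(FixedStep(depth=5, name="clf"))

print(hasattr(broken, "_nodes"))  # the mixin attribute did not survive
print(fixed._name, fixed._nodes, fixed.depth)
```

The restored `BrokenStep` keeps `depth` but has no `_nodes`, which mirrors the AttributeError in the traceback above; the overridden hooks in `FixedStep` carry the extra attributes through.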
Contributing guidelines¶
Bug reports and fixes are always welcome!
Contributions to extend/refactor/improve/document the API are also welcome! baikal is currently a one-man operation, and it could benefit from more minds and hands working on it :)
If you would like to contribute to the project (thank you!), please follow the guidelines below.
Bug reports¶
Check if the bug happens in master. If the bug persists, then
Check the issues page to see if the issue has been reported, solved or closed before. Make sure to remove the is:open qualifier so that closed issues are also visible. If the bug is indeed new, then
Open a new issue and provide a brief explanation of the bug describing the expected and the actual behavior, and add a code sample to reproduce it. Please refer to the template provided when clicking the “New issue” button.
If possible, try to fix it and submit a PR yourself :)
Feature requests¶
Check in the issues page if a similar idea has already been proposed. If it hasn’t, then
Open an issue describing the feature and why it would be useful and important to have. The feature must be accompanied with a code snippet showing how the feature would be used. Please refer to the template provided when clicking the “New issue” button.
Make a case for your proposal and address any questions/comments/suggestions.
If the feature is accepted, you may go ahead and submit a PR.
baikal’s goal is to make building complex machine learning pipelines easier, so a good API feature has (ideally all) the following traits:
makes a task easier,
is of general use,
is intuitive,
is hard to use incorrectly,
makes code more readable.
Submitting a pull request¶
Scope: A PR must address one issue (unless the same solution fixes two or more issues of course) and should be decoupled from any other proposed changes as much as possible. If the PR involves several changes, it might be more appropriate to split it into several PRs, as several PRs are easier to review/understand/backport/revert than one huge PR. Please add a reference to the related issue in the description (e.g. Fixes #123, Implements #456); this will close the issue automatically when the PR is merged.
Tests: Existing tests must pass and no line should be left uncovered. If the PR fixes a bug, it should also add a test covering the case where the bug happens. If the PR introduces a new feature, it should add the appropriate tests confirming the correct functioning of the feature. Remember that the reported coverage is only line and branch coverage. If possible, go the extra mile and devise tests that cover more complex yet important interactions of multiple conditions. For a new API feature, usually the feature use cases can also serve as the test cases, so you might be able to kill two birds with one stone!
Code format: This project adopts the black code format. Make sure to set up the pre-commit hook before committing any changes.
Commits: Commits, like PRs, should be granular and decoupled from each other. Ideally, the PR’s commit history tells a story: the reviewer should be able to easily grasp what changes were made when glancing at the commit history. Please add descriptive commit messages and avoid cryptic messages like Some refactoring or More fixes. When writing a commit message, usually the why is more important than the what (one can check the diff for that), so try to explain the reasons for that change. Remember: the audience of a commit message is another developer in the future (including your future self) that might need to understand the reasons why and the context where the changes happened.
Documentation: Any changes must be accompanied by the appropriate documentation, if applicable. This might include adding or revising the docstrings, updating the user guide, or adding an example.
Changelog: Please update the Changelog appropriately.
License: by submitting a pull request to the project, you’re offering your changes under this project’s license.
Setting up the development environment¶
Clone the project.
From the project root folder run: make setup_dev. This will create a virtualenv and install the package in development mode.
It will also install a pre-commit hook for the black code formatter.
You need Python 3.5 or above.
To run the tests use: make test, or make test-cov to include coverage. The tests include a test for the plot utility, so you need to install graphviz.
Changelog¶
All notable changes to this project will be documented in this file.
The format is based on Keep a Changelog.
0.4.2 - 2020-11-15¶
Fixed¶
Fix a bug where Model.fit would fail for unhashable steps (PR #43).
Changed¶
Deprecate default value of class_name argument in make_step function (PR #45).
0.4.1 - 2020-05-17¶
Fixed¶
Fix bug in repr of Model class (PR #35).
0.4.0 - 2020-05-16¶
Added¶
Add capability to specify a name to the class made by make_step (PR #34).
Fixed¶
Improved and fixed a bug in repr of steps that causes a RuntimeError for scikit-learn 0.23.0. (PR #33).
Fix bug where *args was not included in the constructor of the class made by make_step (and fix related examples in the docs) (PR #34).
0.3.1 - 2020-04-26¶
Fixed¶
Fix bug where get_params would break when the base class did not implement an __init__ method (PR #32).
0.3.0 - 2020-02-23¶
Added¶
Add support for shared steps (PR #19). Now steps can be called several times on different inputs.
This is a backwards-incompatible change. The outputs of the steps now follow the following format: step_name:port/output_number. (Previously it was step_name/output_number.)
Add option to include targets in plot_model (PR #20).
Add new fit_compute_func argument to Step.__call__ that allows to specify custom behavior at fit time (PR #22).
Add documentation built with Sphinx and hosted on baikal.readthedocs.io (PR #29).
Changed¶
Move compute_func (previously function) and trainable args to Step.__call__ (PR #18). Also, the default value is changed from None to "auto". This is a backwards-incompatible change.
Raise RuntimeError chained with the original exception in Model.fit and Model.predict.
Fixed¶
Add clarification that steps must be named in build_fn when using SKLearnWrapper.
Fix bug where the compute function was not being transferred when replacing a step in Model.set_params.
Fix an API inconsistency regarding the handling of the arguments of fit/compute for steps with multiple inputs and targets (PR #21).
Fix several bugs in plot_model (it was largely broken) (PR #20, PR #24).
0.2.0 - 2019-11-16¶
Added¶
This CHANGELOG file.
Introduced new targets API (PR #1).
Steps now take an optional targets argument at call time to specify inputs for target data at fit time.
Correspondingly, Model also takes an additional argument for these targets.
The extra_targets argument in Model.fit was removed.
Step enhancements
Fixed¶
0.1.0 - 2019-06-01¶
Added¶
Everything. This is the first (pre-release) version.
License¶
BSD 3-Clause License
Copyright (c) 2019-2020, Alejandro González Tineo
All rights reserved.
Redistribution and use in source and binary forms, with or without
modification, are permitted provided that the following conditions are met:
1. Redistributions of source code must retain the above copyright notice, this
list of conditions and the following disclaimer.
2. Redistributions in binary form must reproduce the above copyright notice,
this list of conditions and the following disclaimer in the documentation
and/or other materials provided with the distribution.
3. Neither the name of the copyright holder nor the names of its
contributors may be used to endorse or promote products derived from
this software without specific prior written permission.
THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS"
AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE
FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR
SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER
CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY,
OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.