Building scikit-learn transformers

scikit-learn is a great library for doing machine learning in Python, and one of my favorite things about it is its interface: all objects in scikit-learn, whether data transformers or predictors, share the same small set of methods. That makes it easy to drop your own transformers and models into scikit-learn pipelines, but I haven’t seen it documented much.

For a transformer, you have to define the methods .fit(self, X, y=None) and .transform(self, X). scikit-learn also provides a class, TransformerMixin, that doesn’t do much besides add a .fit_transform method (which calls .fit and then .transform), but it’s still nice to inherit from it to document that you intend your code to work well with scikit-learn.

I’m going to make a really dumb transformer: it ignores its input and returns the feature vector [1] for every datum.

from sklearn.base import TransformerMixin

class DumbFeaturizer(TransformerMixin):
    def __init__(self):
        pass

    def fit(self, X, y=None):
        # Nothing to learn from the data, but fit still has to return self.
        return self

    def transform(self, X):
        # Emit the one-element feature vector [1] for every datum.
        return [[1] for _ in X]

Note that .fit returns self: this is standard behavior for .fit methods in scikit-learn, and it’s what makes chaining calls like DumbFeaturizer().fit(X).transform(X) work.
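
Even this dumb transformer already plays nicely with the interface. Here’s a quick check on a couple of made-up strings, using the .fit_transform we inherited from TransformerMixin:

featurizer = DumbFeaturizer()
featurizer.fit_transform(["hello", "world"])
# => [[1], [1]]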

Let’s build a better featurizer. Let’s say I have a lot of text and I want to extract certain data from it. I’m going to build a featurizer that takes a list of functions, calls each function with our text, and returns the results of all functions as a feature vector.

import re

import numpy as np

def longest_run_of_capital_letters_feature(text):
    """Find the longest run of capital letters and return its length."""
    runs = sorted(re.findall(r"[A-Z]+", text), key=len)
    if runs:
        return len(runs[-1])
    else:
        return 0

def percent_character_feature(char):
    """Return a feature function that computes the fraction of the text made
    up of a particular character."""
    def feature_fn(text):
        char_count = text.count(char)
        return char_count / len(text)
    return feature_fn
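
Since percent_character_feature returns a feature function, we can make a feature for any character we care about. A quick example on a made-up string:

period_feature = percent_character_feature(".")
period_feature("Hi. Bye.")
# => 0.25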

class FunctionFeaturizer(TransformerMixin):
    def __init__(self, *featurizers):
        self.featurizers = featurizers

    def fit(self, X, y=None):
        """All SciKit-Learn compatible transformers and classifiers have the
        same interface. `fit` always returns the same object."""
        return self

    def transform(self, X):
        """Given a list of original data, return a list of feature vectors."""
        fvs = []
        for datum in X:
            fv = [f(datum) for f in self.featurizers]
            fvs.append(fv)
        return np.array(fvs)
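
Before we touch real data, here’s a quick check of the featurizer on a couple of made-up strings:

featurizer = FunctionFeaturizer(longest_run_of_capital_letters_feature,
                                percent_character_feature("."))
featurizer.fit_transform(["Hello THERE.", "no caps here"])
# => array([[5.        , 0.08333333],
#           [0.        , 0.        ]])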

Let’s run this on some SMS spam data.
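
I’m assuming sms_data is a list of message strings and sms_results is the matching list of "ham"/"spam" labels. If you want to follow along, the UCI SMS Spam Collection is one dataset in roughly this shape; here’s a loading sketch (the filename and the tab-separated label-then-text format are assumptions about that file):

with open("SMSSpamCollection") as f:  # assumed filename from the UCI dataset
    pairs = [line.rstrip("\n").split("\t", 1) for line in f]
sms_results = [label for label, text in pairs]
sms_data = [text for label, text in pairs]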

from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.tree import DecisionTreeClassifier

sms_featurizer = FunctionFeaturizer(longest_run_of_capital_letters_feature,
                                    percent_character_feature("."))
sms_featurizer.transform(sms_data[:10])

X_train, X_test, y_train, y_test = train_test_split(sms_data, sms_results)

pipe = make_pipeline(sms_featurizer, DecisionTreeClassifier())
pipe.fit(X_train, y_train)
pipe.score(X_test, y_test)
# => 0.91385498923187369

That looks like a pretty good result until you learn that 87% of the SMS messages are ham: a classifier that always predicted ham would score 0.87, so our two features only buy a few points over that baseline. Still, pretty cool that this works!
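
If you want to verify that baseline, scikit-learn’s DummyClassifier can stand in for the always-predict-ham strategy; a minimal sketch reusing the split from above:

from sklearn.dummy import DummyClassifier

baseline = make_pipeline(sms_featurizer, DummyClassifier(strategy="most_frequent"))
baseline.fit(X_train, y_train)
baseline.score(X_test, y_test)
# => about 0.87, the fraction of ham messages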

You can build any sort of transformer you want this way. I thought it’d be useful to build one for debugging: it wraps another transformer and prints a randomly chosen datum before and after the wrapped transformer runs.

import random

class PipelineDebugger(TransformerMixin):
    def __init__(self, transformer):
        self.transformer = transformer

    def fit(self, X, y=None):
        self.transformer.fit(X, y)
        return self

    def transform(self, X):
        print(self.transformer.__class__.__name__)
        # Pick one random datum and show it before and after transformation.
        idx = random.randrange(0, len(X))
        print("Before", "=" * 40)
        print(X[idx])
        X = self.transformer.transform(X)
        print("After ", "=" * 40)
        print(X[idx])
        return X

pipe = make_pipeline(PipelineDebugger(sms_featurizer), DecisionTreeClassifier())
pipe.fit(X_train, y_train)

# FunctionFeaturizer
# Before ========================================
# LOL .. *grins* .. I'm not babe, but thanks for thinking of me!
# After  ========================================
# [ 3.          0.06451613]
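
Because PipelineDebugger is itself a transformer, you can wrap every step of a longer pipeline and watch the data change shape at each stage. A sketch; StandardScaler is just an arbitrary second step for illustration:

from sklearn.preprocessing import StandardScaler

pipe = make_pipeline(
    PipelineDebugger(sms_featurizer),
    PipelineDebugger(StandardScaler()),
    DecisionTreeClassifier(),
)
pipe.fit(X_train, y_train)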