scikit-learn is a great library for doing machine learning in Python, and one of my favorite things about it is its consistent interface. All objects in scikit-learn, whether data transformers or predictors, share a similar interface, which makes it easy to plug in your own transformers or models. I haven’t seen this documented much, though.
For transformers, you have to define the methods .fit(self, X, y=None) and .transform(self, X). There is a class, TransformerMixin, that doesn’t do much besides add a .fit_transform method that calls .fit and then .transform, but it’s still nice to inherit from it in order to document that you intend your code to work well with scikit-learn.
I’m going to make a really dumb transformer: it takes any data and returns a feature vector of [1] for each datum.
```python
from sklearn.base import TransformerMixin


class DumbFeaturizer(TransformerMixin):
    def __init__(self):
        pass

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        # Every datum gets the same one-element feature vector.
        return [[1] for _ in X]
```
Note that .fit returns self: this is standard behavior for .fit methods in scikit-learn.
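Because .fit returns self and TransformerMixin supplies .fit_transform, this dumb featurizer already behaves like any other scikit-learn transformer. A quick check with some throwaway strings (hypothetical data, just for illustration):

```python
featurizer = DumbFeaturizer()

# .fit_transform comes free from TransformerMixin.
featurizer.fit_transform(["an SMS", "another SMS"])
# => [[1], [1]]

# Chaining works because .fit returns the featurizer itself.
featurizer.fit(["an SMS"]).transform(["an SMS"])
# => [[1]]
```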
Let’s build a better featurizer. Say I have a lot of text and I want to extract certain features from it. I’m going to build a featurizer that takes a list of functions, calls each function with our text, and returns the results of all the functions as a feature vector.
```python
import re

import numpy as np
from sklearn.base import TransformerMixin


def longest_run_of_capital_letters_feature(text):
    """Find the longest run of capital letters and return its length."""
    runs = sorted(re.findall(r"[A-Z]+", text), key=len)
    if runs:
        return len(runs[-1])
    else:
        return 0


def percent_character_feature(char):
    """Return percentage of text that is a particular char compared to
    total text length."""
    def feature_fn(text):
        count = text.count(char)
        return count / len(text)
    return feature_fn


class FunctionFeaturizer(TransformerMixin):
    def __init__(self, *featurizers):
        self.featurizers = featurizers

    def fit(self, X, y=None):
        """All scikit-learn compatible transformers and classifiers have the
        same interface. `fit` always returns the same object."""
        return self

    def transform(self, X):
        """Given a list of original data, return a list of feature vectors."""
        fvs = []
        for datum in X:
            fv = [f(datum) for f in self.featurizers]
            fvs.append(fv)
        return np.array(fvs)
```
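Before wiring these into a pipeline, here’s a quick spot check of the feature functions on their own (the sample strings are made up):

```python
longest_run_of_capital_letters_feature("Call NOW to claim your PRIZE")
# => 5, since "PRIZE" is the longest run of capital letters

percent_character_feature(".")("one. two. three.")
# => 0.1875, since 3 of the 16 characters are periods
```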
Let’s run this on some SMS spam data.
```python
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.tree import DecisionTreeClassifier

sms_featurizer = FunctionFeaturizer(
    longest_run_of_capital_letters_feature,
    percent_character_feature("."),
)
sms_featurizer.transform(sms_data[:10])

X_train, X_test, y_train, y_test = train_test_split(sms_data, sms_results)
pipe = make_pipeline(sms_featurizer, DecisionTreeClassifier())
pipe.fit(X_train, y_train)
pipe.score(X_test, y_test)
# => 0.91385498923187369
```
You might think that was a pretty good result, at least if you didn’t know that 87% of the SMS messages are ham: a classifier that always guesses ham would already score 0.87. Anyway, pretty cool that this works!
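To make that baseline concrete, scikit-learn’s DummyClassifier can score the always-guess-the-majority-class strategy. A quick sketch, assuming the same train/test split as above:

```python
from sklearn.dummy import DummyClassifier

# Always predicts the most frequent class ("ham"), so its score is just
# the proportion of ham messages in the test set, roughly 0.87.
baseline = DummyClassifier(strategy="most_frequent")
baseline.fit(X_train, y_train)
baseline.score(X_test, y_test)
```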
You can build any sort of transformer you want this way. I thought it’d be a good idea to build one for use in debugging: it wraps another transformer and shows us a sample datum before and after that transformer runs.
```python
import random


class PipelineDebugger(TransformerMixin):
    def __init__(self, transformer):
        self.transformer = transformer

    def fit(self, X, y=None):
        self.transformer.fit(X, y)
        return self

    def transform(self, X):
        # Print the wrapped transformer's name, then show one randomly
        # chosen datum before and after the transformation.
        print(self.transformer.__class__.__name__)
        idx = random.randrange(0, len(X))
        print("Before", "=" * 40)
        print(X[idx])
        X = self.transformer.transform(X)
        print("After ", "=" * 40)
        print(X[idx])
        return X


pipe = make_pipeline(PipelineDebugger(sms_featurizer), DecisionTreeClassifier())
pipe.fit(X_train, y_train)
# FunctionFeaturizer
# Before ========================================
# LOL .. *grins* .. I'm not babe, but thanks for thinking of me!
# After  ========================================
# [ 3.          0.06451613]
```
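Since PipelineDebugger is itself a transformer, you can wrap every transformer stage in a pipeline and watch a datum flow through each step. A sketch of the idea; StandardScaler is just a stand-in second stage here:

```python
from sklearn.preprocessing import StandardScaler

# Each wrapped stage prints a sample before and after it runs, so you
# can follow one datum all the way through the pipeline.
pipe = make_pipeline(
    PipelineDebugger(sms_featurizer),
    PipelineDebugger(StandardScaler()),
    DecisionTreeClassifier(),
)
pipe.fit(X_train, y_train)
```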