As some of you know, I have been working on a machine learning library for .NET called numl. The main purpose of the library is to abstract away some of the mundane issues surrounding setting up the learning problem in the first place. The math in machine learning can also seem daunting (and some of it is), so the library lets you either dig into the math or trust that these things are implemented and run correctly.

To facilitate this type of abstraction, I came to realize that the best way to bridge this gap was to use a construct that most developers have already used or understood: classes. The learning problem, as I understood it, was taking a set of things and trying to learn a way to predict a particular aspect of these things. The best approach, therefore, was to provide an easy way to mark up these things (or classes) so that setting up the learning problem becomes simple and efficient.

I settled on an attribute-based system that looks something like this:

public class Iris
{
    [Feature]
    public decimal SepalLength { get; set; }

    [Feature]
    public decimal SepalWidth { get; set; }

    [Feature]
    public decimal PetalLength { get; set; }

    [Feature]
    public decimal PetalWidth { get; set; }

    [StringLabel]
    public string Class { get; set; }
}

This approach is intended to allow for quick and easy feature selection in a straightforward and intuitive manner.

Now for the topic at hand. Because the attributes are simply a bridge between the common class structure and the actual machine learning algorithms, I needed to provide the actual bridging mechanism in the library. I finally landed on calling this mechanism a Descriptor. I thought the name was fairly indicative of the job it handles: describing the learning problem in terms of features and the corresponding label (in the case of a supervised problem). In essence, it describes the machine learning problem to the mathematical side of the algorithms. Its second responsibility is to describe the outcome of the algorithms back to the original structure in terms that it can understand. The Descriptor therefore becomes the literal bridge: it describes the problem to the algorithms in terms of matrices and vectors, and it projects the result of a prediction (which is computed in terms of vectors) back onto the original object in which the problem was described.

I think a concrete example would make more sense:

var data = Iris.Load();

var description = Descriptor.Create<Iris>();

var generator = new DecisionTreeGenerator(50);
var model = generator.Generate(description, data);

// should be Iris-Setosa
Iris iris = new Iris
{
    PetalWidth = 0.5m,
    PetalLength = 2.3m,
    SepalLength = 2.1m,
    SepalWidth = 2.1m
};

iris = model.Predict<Iris>(iris);

Here is the general workflow for running a supervised learning problem:

  1. Create a descriptor (line 3)
  2. Create a generator to build a model (lines 5-6)
  3. Use the model for prediction (lines 9-17)

In this case we are loading a set of data into a collection of Iris objects. The Descriptor in this example participates in every phase of the learning and prediction process. The first area of participation is obvious: a concrete descriptor is created in line 3 based on the Iris type. This uses simple reflection to find all of the corresponding attributes and then adds the features and the label to the descriptor. The other two areas where the descriptor participates are not as obvious from the code.
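To make the reflection step concrete, here is a minimal sketch of the general idea. It is not numl's implementation: the attribute types (SketchFeatureAttribute, SketchLabelAttribute), the Flower class, and the AttributeScanner helper are all hypothetical names invented for this illustration, and only standard System.Reflection calls are used.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Reflection;

// Hypothetical marker attributes for this sketch; not numl's own types.
[AttributeUsage(AttributeTargets.Property)]
public sealed class SketchFeatureAttribute : Attribute { }

[AttributeUsage(AttributeTargets.Property)]
public sealed class SketchLabelAttribute : Attribute { }

public class Flower
{
    [SketchFeature] public decimal SepalLength { get; set; }
    [SketchFeature] public decimal SepalWidth { get; set; }
    [SketchFeature] public decimal PetalLength { get; set; }
    [SketchFeature] public decimal PetalWidth { get; set; }
    [SketchLabel] public string Class { get; set; }
}

public static class AttributeScanner
{
    // Names of all public instance properties marked as features.
    public static List<string> FeatureNames(Type type) =>
        type.GetProperties(BindingFlags.Public | BindingFlags.Instance)
            .Where(p => Attribute.IsDefined(p, typeof(SketchFeatureAttribute)))
            .Select(p => p.Name)
            .ToList();

    // Name of the single property marked as the label, or null if there is none.
    public static string LabelName(Type type) =>
        type.GetProperties(BindingFlags.Public | BindingFlags.Instance)
            .Where(p => Attribute.IsDefined(p, typeof(SketchLabelAttribute)))
            .Select(p => p.Name)
            .SingleOrDefault();
}

public static class ScanDemo
{
    public static void Main()
    {
        // Lists the four feature names, then the label name.
        Console.WriteLine(string.Join(", ", AttributeScanner.FeatureNames(typeof(Flower))));
        Console.WriteLine(AttributeScanner.LabelName(typeof(Flower)));
    }
}
```

The point of the sketch is only that a type plus a handful of marker attributes is enough information to list the features and the label without any further configuration.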

When the generator builds a model it relies heavily on the descriptor to create the numerical representation of the data set. In fact, if there is no descriptor then the generator throws an exception.
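As a rough illustration of what "numerical representation" can mean here, the sketch below turns a collection of objects into a feature matrix and an encoded label vector. This is an assumption about the general technique, not a description of numl's internals: the Sample class, the MatrixSketch helper, the choice to encode string labels as indices into a sorted list, and the two data rows are all made up for the example.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public class Sample
{
    public decimal PetalLength { get; set; }
    public decimal PetalWidth { get; set; }
    public string Class { get; set; }
}

public static class MatrixSketch
{
    // One row per example, one column per named feature property.
    public static double[][] ToMatrix<T>(IEnumerable<T> items, string[] features)
    {
        var props = features.Select(name =>
            typeof(T).GetProperty(name)
                ?? throw new ArgumentException("Unknown property: " + name)).ToArray();

        return items
            .Select(item => props.Select(p => Convert.ToDouble(p.GetValue(item))).ToArray())
            .ToArray();
    }

    // Encodes string labels as indices into a sorted list of the distinct values.
    public static (double[] Codes, List<string> Classes) EncodeLabels<T>(
        IEnumerable<T> items, string label)
    {
        var prop = typeof(T).GetProperty(label)
            ?? throw new ArgumentException("Unknown property: " + label);

        var values = items.Select(i => (string)prop.GetValue(i)).ToList();
        var classes = values.Distinct().OrderBy(v => v, StringComparer.Ordinal).ToList();
        var codes = values.Select(v => (double)classes.IndexOf(v)).ToArray();
        return (codes, classes);
    }
}

public static class MatrixDemo
{
    public static void Main()
    {
        // Illustrative rows only; not taken from a real data file.
        var data = new List<Sample>
        {
            new Sample { PetalLength = 1.4m, PetalWidth = 0.2m, Class = "Iris-Setosa" },
            new Sample { PetalLength = 4.7m, PetalWidth = 1.4m, Class = "Iris-Versicolor" }
        };

        double[][] x = MatrixSketch.ToMatrix(data, new[] { "PetalLength", "PetalWidth" });
        var (y, classes) = MatrixSketch.EncodeLabels(data, "Class");

        Console.WriteLine(x.Length + " rows, " + x[0].Length + " columns");
        Console.WriteLine(string.Join(", ", y) + " -> " + string.Join(", ", classes));
    }
}
```

Keeping the list of class names alongside the numeric codes matters because the same mapping is needed later to turn a numeric prediction back into a string.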

Once the model is created and ready for prediction, the descriptor is once again used to convert the object to a vector representation and then to fill in the appropriate property with the model's prediction.
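The round trip described here (object to vector, vector to prediction, prediction back onto the object) can be sketched as follows. Everything in it is hypothetical: the Blossom class, the PredictSketch helper, and especially the "model", which is a hard-coded threshold standing in for a trained one. It only illustrates the shape of the conversion, not how numl does it.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public class Blossom
{
    public decimal PetalLength { get; set; }
    public decimal PetalWidth { get; set; }
    public string Class { get; set; }
}

public static class PredictSketch
{
    // Converts the item to a vector, asks the model for a class index,
    // and writes the matching class name into the label property.
    public static T Predict<T>(T item, string[] features, string label,
        IReadOnlyList<string> classes, Func<double[], int> model) where T : class
    {
        var type = typeof(T);

        double[] x = features
            .Select(name => Convert.ToDouble(type.GetProperty(name).GetValue(item)))
            .ToArray();

        int classIndex = model(x);
        type.GetProperty(label).SetValue(item, classes[classIndex]);
        return item;
    }
}

public static class PredictDemo
{
    public static void Main()
    {
        var classes = new[] { "Iris-Setosa", "Iris-Versicolor" };

        // Made-up rule standing in for a trained model: short petals map to class 0.
        Func<double[], int> toyModel = x => x[0] < 2.5 ? 0 : 1;

        var flower = new Blossom { PetalLength = 2.3m, PetalWidth = 0.5m };
        flower = PredictSketch.Predict(
            flower, new[] { "PetalLength", "PetalWidth" }, "Class", classes, toyModel);

        Console.WriteLine(flower.Class); // prints "Iris-Setosa"
    }
}
```

Note that the caller gets back the same kind of object it passed in, with only the label property filled in, which mirrors the shape of the model.Predict<Iris>(iris) call in the listing above.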

An Alternate Approach

Creating descriptors automatically from marked-up classes proved to be a useful abstraction in most cases. As I continued to test the library, I noticed that an alternate pattern emerged around data that required late binding. Data structures such as collections of dictionary objects, DataTables, and even dynamic objects like ExpandoObject leave no class to mark up. I also noticed that in these cases creating a descriptor proved to be cumbersome (well, at least I felt the experience could be improved). So I decided to add a fluent interface to the Descriptor in order to describe objects that would never be marked up with attributes but would still participate in the learning process:

var d = Descriptor.New()
                  .With("SepalLength").As(typeof(decimal))
                  .With("SepalWidth").As(typeof(double))
                  .With("PetalLength").As(typeof(decimal))
                  .With("PetalWidth").As(typeof(int))
                  .Learn("Class").As(typeof(string));

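This approach yields a descriptor with the same four features and the same label as the one created above from the marked-up class. Note that the Iris class declares every feature as decimal, so to mirror it exactly each feature would be declared with typeof(decimal).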
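For readers curious how a With(...).As(...) chain like this can be built, here is a minimal sketch of one common way to structure a fluent builder. It is not numl's code: SketchDescriptor, SketchProperty, Pending, and the ToVector helper for dictionary rows are hypothetical names, and the design (an intermediate object returned by With/Learn whose As call hands control back to the descriptor) is just one reasonable choice.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public sealed class SketchProperty
{
    public string Name { get; }
    public Type Type { get; internal set; }

    public SketchProperty(string name)
    {
        Name = name;
        Type = typeof(object); // placeholder until As(...) is called
    }
}

public sealed class SketchDescriptor
{
    public List<SketchProperty> Features { get; } = new List<SketchProperty>();
    public SketchProperty Label { get; private set; }

    public static SketchDescriptor New() => new SketchDescriptor();

    // Adds a feature and returns a step object that expects As(...).
    public Pending With(string name)
    {
        var property = new SketchProperty(name);
        Features.Add(property);
        return new Pending(this, property);
    }

    // Sets the label and returns a step object that expects As(...).
    public Pending Learn(string name)
    {
        Label = new SketchProperty(name);
        return new Pending(this, Label);
    }

    // Late-bound conversion: reads the feature values out of a dictionary row.
    public double[] ToVector(IDictionary<string, object> row) =>
        Features.Select(f => Convert.ToDouble(row[f.Name])).ToArray();

    public sealed class Pending
    {
        private readonly SketchDescriptor _owner;
        private readonly SketchProperty _property;

        internal Pending(SketchDescriptor owner, SketchProperty property)
        {
            _owner = owner;
            _property = property;
        }

        // Records the type and hands control back to the descriptor.
        public SketchDescriptor As(Type type)
        {
            _property.Type = type;
            return _owner;
        }
    }
}

public static class FluentDemo
{
    public static void Main()
    {
        var d = SketchDescriptor.New()
            .With("PetalLength").As(typeof(decimal))
            .With("PetalWidth").As(typeof(decimal))
            .Learn("Class").As(typeof(string));

        // No Iris class is involved; the row is just a dictionary.
        var row = new Dictionary<string, object>
        {
            ["PetalLength"] = 2.3m,
            ["PetalWidth"] = 0.5m
        };

        Console.WriteLine(d.Features.Count + " features, label: " + d.Label.Name);
        Console.WriteLine(string.Join(", ", d.ToVector(row)));
    }
}
```

Because each As call returns the descriptor itself, the chain can keep alternating between naming a property and typing it, which is what gives the fluent syntax its readable shape.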

I am interested in your thoughts on this approach and look forward to your suggestions and corrections.