Descriptors in numl

As some of you know, I have been working on a machine learning library for .NET called numl. The main purpose of the library is to abstract away some of the mundane work involved in setting up the learning problem in the first place. Additionally, the math in machine learning can seem a bit daunting (some of it is indeed daunting), so the library allows you either to dig into the math or to trust that these things are implemented and run correctly.

In order to facilitate this type of abstraction, I came to realize that the best way to bridge this gap was to use a construction most developers have already used and understood: classes. The learning problem, as I understood it, is to take a set of things and learn a way to predict a particular aspect of those things. The best approach, therefore, was to provide an easy way to mark up these things (i.e., classes) in order to produce an efficient technique for setting up the learning problem.

I settled on an attribute-based system that looks something like this:

public class Iris
{
        [Feature]
        public decimal SepalLength { get; set; }

        [Feature]
        public decimal SepalWidth { get; set; }

        [Feature]
        public decimal PetalLength { get; set; }

        [Feature]
        public decimal PetalWidth { get; set; }

        [StringLabel]
        public string Class { get; set; }
}

This approach is intended to make feature selection quick, straightforward, and intuitive.

Now for the topic at hand. Since this is simply a bridge between the common class structure and the actual machine learning algorithms, I needed to provide the actual bridging mechanism into the library. I finally landed on calling this mechanism a Descriptor. I thought the name was fairly indicative of exactly what job it handles: describing the learning problem in terms of features and the corresponding label (in the case of a supervised problem). In essence, it describes the machine learning problem to the mathematical side of the algorithms. Its dual responsibility is to describe the outcome of the algorithms back to the original structure in terms that it can understand. The Descriptor therefore becomes the literal bridge: it describes the problem to the algorithms in terms of matrices and vectors, and it projects the result of a prediction (which happens in terms of vectors) back onto the original object in which the problem was described.

I think a concrete example would make more sense:

var data = Iris.Load();

var description = Descriptor.Create<Iris>();

var generator = new DecisionTreeGenerator(50);
var model = generator.Generate(description, data);

// should be Iris-Setosa
Iris iris = new Iris
{
    PetalWidth = 0.5m,
    PetalLength = 2.3m,
    SepalLength = 2.1m,
    SepalWidth = 2.1m
};

iris = model.Predict<Iris>(iris);

Here is the general workflow for running a supervised learning problem:

  1. Create a descriptor (Descriptor.Create<Iris>())
  2. Create a generator and use it to build a model (DecisionTreeGenerator, Generate)
  3. Use the model for prediction (Predict<Iris>)

In this case we are loading a bunch of data into a collection of Iris objects. The Descriptor in this example participates in every phase of the learning and prediction process. The first area of participation is obvious: a concrete descriptor is created from the Iris type via Descriptor.Create<Iris>(). This uses simple reflection to find all of the corresponding attributes and subsequently adds the features and the label to the descriptor. The other two areas where the descriptor participates are not as obvious from the code.
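
To make the reflection step a bit more tangible, here is a quick sketch of inspecting what was discovered. The exact member names on the descriptor are illustrative here (I am assuming it exposes its discovered features and label, each with a name and a type), so treat this as a sketch rather than a verbatim API sample:

var description = Descriptor.Create<Iris>();

// Illustrative only: dump what reflection discovered.
foreach (var feature in description.Features)
    Console.WriteLine("{0} : {1}", feature.Name, feature.Type);
// SepalLength : System.Decimal
// SepalWidth  : System.Decimal
// PetalLength : System.Decimal
// PetalWidth  : System.Decimal

Console.WriteLine("Label -> {0}", description.Label.Name);
// Label -> Class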

When the generator builds a model it relies heavily on the descriptor to create the numerical representation of the data set. In fact, if there is no descriptor then the generator throws an exception.
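
To illustrate the shape of that numerical representation (plain arrays here rather than the library's internal types, and with illustrative values), the Iris collection effectively becomes a matrix of feature values plus a vector of encoded labels:

// One row per Iris: SepalLength, SepalWidth, PetalLength, PetalWidth
double[][] X =
{
    new[] { 5.1, 3.5, 1.4, 0.2 },
    new[] { 7.0, 3.2, 4.7, 1.4 },
    new[] { 6.3, 3.3, 6.0, 2.5 }
};

// One entry per row: the string label encoded as a number
// (0 = Iris-Setosa, 1 = Iris-Versicolor, 2 = Iris-Virginica)
double[] y = { 0, 1, 2 };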

Once the model is created and ready for prediction, the descriptor is once again used to convert the object to a vector representation and then to fill in the appropriate property with the model's prediction.
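
Conceptually, the prediction round trip looks something like this (again, a sketch of the idea rather than the actual internal calls):

// 1. The descriptor projects the incoming Iris onto a numeric vector.
double[] x = { 2.1, 2.1, 2.3, 0.5 };   // SepalLength, SepalWidth, PetalLength, PetalWidth

// 2. The model works purely on that vector and returns a numeric answer.
double encoded = 0;                     // stands in for "Iris-Setosa"

// 3. The descriptor writes the decoded answer back onto the labeled property.
iris.Class = "Iris-Setosa";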

An Alternate Approach

Creating descriptors automatically from marked-up classes proved to be a useful abstraction in most cases. As I continued to test the library, however, I noticed that an alternate pattern emerged around data that required late binding. Data structures such as collections of dictionary objects, DataTables, and even dynamic objects like ExpandoObject preclude marking up a class where none exists. I also noticed that in this case creating a descriptor proved to be cumbersome (or at least I felt the experience could be improved). So I decided to add a fluent interface to the Descriptor in order to describe objects that will never be marked up with attributes but should still participate in the learning process:

var d = Descriptor.New()
                  .With("SepalLength").As(typeof(decimal))
                  .With("SepalWidth").As(typeof(double))
                  .With("PetalLength").As(typeof(decimal))
                  .With("PetalWidth").As(typeof(int))
                  .Learn("Class").As(typeof(string));

This approach yields exactly the same descriptor as the one created above from the marked-up class.
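
To give a feel for the late-binding case, here is a rough sketch of feeding dictionary-shaped rows through the same kind of generator used earlier. The dictionary keys simply need to line up with the names declared in the fluent descriptor; treat the exact Generate overload and data values as assumptions for illustration rather than a verbatim sample:

// Rows of loosely typed data; no Iris class anywhere in sight.
var rows = new List<Dictionary<string, object>>
{
    new Dictionary<string, object>
    {
        { "SepalLength", 5.1m }, { "SepalWidth", 3.5m },
        { "PetalLength", 1.4m }, { "PetalWidth", 0.2m },
        { "Class", "Iris-Setosa" }
    },
    // ... more rows loaded from a file, a DataTable, etc.
};

var generator = new DecisionTreeGenerator(50);
var model = generator.Generate(d, rows);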

I am interested in your thoughts behind this approach and look forward to your suggestions and corrections.

11 Comments

  1. Russ Cam April 27, 2013 @ 12:04 pm

    Just came across your library after enrolling on the Machine Learning course on coursera (https://class.coursera.org/ml-003/class/index).

    Thank you for the effort that you have put in to building an ML library for .NET; looking forward to trying it out on some real-world data shortly!

  2. Nelson Silva October 19, 2013 @ 2:43 pm

    I've just found your page...
    NICE work! TKS!

  3. Petr October 24, 2013 @ 4:03 am

    Hello, thank you very much for creating an open-source .NET machine learning library. I really like its API and the clear code of its internals. But numl obviously lacks documentation and usage examples.

    I am pretty new to machine learning and have read all the information on numl that you provided on the numl site and your blog. But I've failed to use it for my scenario. It would be much appreciated if you could give some advice.

    I need to implement automatic classification of documents. Each doc has [StringFeature] Title, [StringFeature] Text and [StringLabel] Topic. I have a data set for learning - several hundred docs with the Topic set manually. So I tried to do the following:
    var descriptor = Descriptor.Create();
    var generator = new NaiveBayesGenerator(2);
    generator.Descriptor = descriptor;
    var model = Learner.Learn(dataForGenerator, 1, 10, generator);

    But it fails at the end of the Learn() method because the accuracy vector passed to MaxIndex() contains NaN values. The same happened when I tried the DecisionTree and Perceptron model generators.

    Could you please advise on what I am doing wrong?

    • Seth Juarez October 28, 2013 @ 5:35 pm

      Shoot me an email here: me AT sethjuarez DOT com so we can correspond there. I want to help!

    • Seth Juarez January 16, 2014 @ 9:44 am

      Hi Petr!
      Shoot me an email - would love to help!

  4. Tagir January 22, 2014 @ 11:41 am

    Hi Seth,
    Great article, I enjoyed it; the interface seems really clean.

    Just one idea, I was wondering if Descriptor's fluent API can be modified to accept Expressions? That would make it more concise and allow for compile time validation. As a result the following call

    Descriptor.New()
    .With("SepalLength").As(typeof(decimal))
    .With("SepalWidth").As(typeof(double))

    would become


    Descriptor.For<Iris>()
    .With(i => i.SepalLength)
    .With(i => i.SepalWidth)

    I can try to do it myself if you accept contributions.

    Cheers,
    Tagir.

  5. Seth Juarez January 22, 2014 @ 2:01 pm

    I think it is a fantastic idea! As I see it there are three levels to descriptors:

    • Strong Descriptors: features/label burned into the classes through attributes. Makes it easy to declare things but it is highly coupled to the data type.
    • Weak Descriptors: features/label declared dynamically (with strings, as I have it). A bit more difficult to declare but completely agnostic to the data type (so long as it has the right property/type pairs). Currently this works with IEnumerables, DataTables, Dictionaries, and even anything of type dynamic (like ExpandoObject)
    • Medium? Descriptors: I think this is where your suggestion falls. Not directly coupled to the type itself (as in Strong Descriptors) but still dependent on its shape

    Overall I think anything to decrease friction along the surface of interaction with the lib is a fantastic idea.

    -Seth

    ---------------------------EDIT
    Just made the change (it is building now).

  6. Michael Hansen August 17, 2014 @ 11:30 pm

    Came across your Numl library because it was mentioned on Reddit (which linked to http://stephenhaunts.com/2014/08/15/machine-learning-with-numl/)

    Looks very interesting and well built, but unfortunately it is released as GPLv2 - which makes it impossible for me (and many others) to use.

    I would love it if you would consider another license model - for example BSD, Apache, MIT or similar - or LGPL as a minimum. This would open your library up for broader use.

    Whether or not I actually will / can use it will depend on further investigation - but as it is GPL it is unfortunately a total no-go for me :( even though I would love to dig deeper into using it.

    • Seth Juarez February 16, 2015 @ 12:05 pm

      Hi Michael:
      I have recently made the change to an MIT license. Hope this helps!
      -Seth

  7. Jon March 5, 2015 @ 2:16 pm

    First, I think this library is excellent since it simplifies ML for us beginners. I'm trying to build an application that will run a neural network on an arbitrary set of data. I do not have instances of a well-defined class (which I could tag with attributes). Instead, I have an M by K matrix: M samples, K features. I also have a vector of size M which holds the labels. It seems difficult to use the descriptor above if K isn't a compile-time constant.

  8. Janus Knudsen December 1, 2015 @ 11:48 am

    Hello Seth
    I've been looking at numl with great interest over the last week; you've done an impressive job.
    I especially like your Channel 9 work: both the machine learning content and, to my big surprise, your interview with one of the coolest JS dudes out there, the one and only Rob Eisenberg, whom I really like. I have been on the Aurelia boat since the early sandbox days, I guess; amazing framework.

    I have one question for you, and I believe you are the brightest of them all to answer it; can you feel the pressure? :)

    What is your opinion on TensorFlow from Google?
