Machine learning systems are built from both code and data. It’s easy to reuse the code but hard to reuse the data, so building AI mostly means doing annotation. This is good, because the examples are how you program the behaviour – the learner itself is really just a compiler. What’s not good is the current technology for creating the examples. That’s why we’re pleased to introduce Prodigy, a downloadable tool for radically efficient machine teaching.
We’ve been working on Prodigy since we first launched Explosion AI last year, alongside our open-source NLP library spaCy and our consulting projects (it’s been a busy year!). During that time, spaCy has grown into the most popular library of its type, giving us a lot of insight into what’s driving success and failure for language understanding technologies. Most of those insights have been used to make spaCy better: AI DevOps was hard, so we made sure models could be installed via pip. Large models made CI tricky, so the new models are less than 1/10th the size.
Prodigy addresses the big remaining problem: annotation and training. The typical approach to annotation forces projects into an uncomfortable waterfall process. The experiments can’t begin until the first batch of annotations is complete, but the annotation team can’t start until they receive the annotation manuals. To produce the annotation manuals, you need to know what statistical models will be required for the features you’re trying to build. Machine learning is an inherently uncertain technology, but the waterfall annotation process relies on accurate upfront planning. The net result is a lot of wasted effort.
Prodigy solves this problem by letting data scientists conduct their own annotations, for rapid prototyping. Ideas can be tested faster than the first planning meeting could even be scheduled. We also expect Prodigy to reduce costs for larger projects, but it’s the increased agility we’re most excited about. Data science projects are said to have uneven returns, like start-ups: a minority of projects are very successful, recouping costs for a larger number of failures. If so, the most important problem is to find more winners. Prodigy helps you do that, because you get to try things much faster.
Most annotation tools avoid making any suggestions to the user, to avoid biasing the annotations. Prodigy takes the opposite approach: ask the user as little as possible, and try to guess the rest. Prodigy puts the model in the loop, so that it can actively participate in the training process and learn as you go. The model uses what it already knows to figure out what to ask you next. As you answer the questions, the model is updated, which in turn influences the examples it asks you about next. In order to take full advantage of this strategy, Prodigy is provided as a Python library and command line utility, with a flexible web application. There’s a thin, optional hosted component to make it easy to share annotation queues, but the tool itself is entirely under your control.
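To make the model-in-the-loop idea concrete, here is a minimal sketch in plain Python. It is not Prodigy's code or API: the `TinyTextModel` class, the `annotate` function and the simulated annotator are all hypothetical names made up for illustration, and the selection rule (ask about the example whose score is closest to 0.5, often called uncertainty sampling) is an assumption about one reasonable way to decide what to ask next.

```python
import math


class TinyTextModel:
    """Toy online logistic regression over bag-of-words features (illustrative only)."""

    def __init__(self, learning_rate=0.5):
        self.weights = {}
        self.learning_rate = learning_rate

    def predict(self, text):
        # Probability that the text should be accepted, given the current weights.
        z = sum(self.weights.get(word, 0.0) for word in text.lower().split())
        return 1.0 / (1.0 + math.exp(-z))

    def update(self, text, accept):
        # One gradient step on the log loss for a single accept/reject answer.
        error = (1.0 if accept else 0.0) - self.predict(text)
        for word in text.lower().split():
            self.weights[word] = self.weights.get(word, 0.0) + self.learning_rate * error


def annotate(model, pool, ask, n_questions):
    """Ask about the most uncertain example, update the model, and repeat."""
    pool = list(pool)
    asked = []
    for _ in range(min(n_questions, len(pool))):
        # The model uses what it already knows to choose the next question.
        text = min(pool, key=lambda t: abs(model.predict(t) - 0.5))
        pool.remove(text)
        answer = ask(text)
        # Each answer updates the model, which changes what gets asked next.
        model.update(text, answer)
        asked.append((text, answer))
    return asked


# Hypothetical data and a simulated annotator who accepts refund requests.
pool = [
    "refund my order",
    "great product",
    "where is my refund",
    "love the product",
    "refund please",
    "nice product thanks",
]
model = TinyTextModel()
asked = annotate(model, pool, ask=lambda text: "refund" in text, n_questions=4)
```

In a real session the `ask` callback would be a person clicking accept or reject rather than a lambda. The loop is the same either way: score, ask, update.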
For more details, see the Prodigy website.