Learning from Heuristics

Summary

I present a weak-supervision paradigm called "data programming", which uses maximum likelihood estimation to produce soft labels from heuristics. These soft labels can then be used to train other models, without true labels being required at any stage. I've included a simple example from first principles to show that the method works. The original authors maintain a fully featured package called Snorkel, which provides sophisticated data programming and related features.

Introduction

There's a nice paper from 2016 called "Data Programming: Creating Large Training Sets, Quickly", in which the authors lay out a simple paradigm for training binary classification models from a set of heuristic "labeling functions". Given some data \(x\), a labeling function \(f_i(x) \in \{-1,0,1\}\) abstains (0) at an unknown rate \(\beta_i\), and otherwise correctly matches the true (but unknown) label (-1 or 1) at a rate \(\alpha_i\). In plain language, a labeling function outputs labels for some data samples – but not necessarily all – and those labels might be wrong. The objective is to use a set of labeling functions to guess the true labels, and then to train a model on those guesses.

To get this off the ground, we must assume that our labeling functions are correlated with the true labels, even though we do not know how often they apply or how often they are correct. Most importantly, and conversely, this implies that the true labels are correlated with the labeling functions, which is what allows us to estimate them from the labeling functions.

We have to get a little less handwavy at this point. Conceptually, we need to start by estimating \(P(\Omega(x) \mid Y=y)\), where \(\Omega(x) \in \{-1,0,1\}^m\) are our \(m\) labeling functions applied to data \(x \in X\), and \(y \in Y\) are the true but unknown labels. Once we've done that, we can use the conditional probability formula to calculate the soft labels we need to train some other model directly on \(X\).
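To make that conditional-probability step concrete, here is a minimal sketch in Python. It takes the accuracies \(\alpha_i\) and abstain rates \(\beta_i\) as given (the paper estimates them by maximum likelihood), assumes the labeling functions are conditionally independent given \(Y\) with a symmetric class prior, and applies Bayes' rule. The toy labeling functions and parameter values are invented for illustration:

```python
import numpy as np

def labeling_matrix(lfs, X):
    """Apply each labeling function to each sample; entries are in {-1, 0, 1}."""
    return np.array([[lf(x) for lf in lfs] for x in X])

def soft_labels(L, alpha, beta, prior=0.5):
    """P(Y = 1 | Omega(x)) for each row of L, assuming independent labeling
    functions with accuracy alpha[i] and abstain rate beta[i]."""
    def likelihood(y):
        # Per-(sample, function) factors of P(Omega(x) | Y = y):
        # a function abstains w.p. beta, otherwise votes and is
        # correct w.p. alpha, incorrect w.p. 1 - alpha.
        agree = (L == y) * (1 - beta) * alpha
        disagree = (L == -y) * (1 - beta) * (1 - alpha)
        abstain = (L == 0) * beta
        return (agree + disagree + abstain).prod(axis=1)

    p_pos = prior * likelihood(1)
    p_neg = (1 - prior) * likelihood(-1)
    return p_pos / (p_pos + p_neg)

# Toy task: label numbers as positive (1) or negative (-1).
lfs = [
    lambda x: 1 if x > 0 else -1,                     # always votes
    lambda x: 1 if x > 1 else (-1 if x < -1 else 0),  # abstains near zero
]
X = [-2.0, -0.5, 0.5, 2.0]
L = labeling_matrix(lfs, X)
probs = soft_labels(L, alpha=np.array([0.8, 0.9]), beta=np.array([0.0, 0.4]))
```

Training a downstream model then just means using `probs` as soft targets, e.g. by minimizing the expected loss under them.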
Step 1: Estimating the likelihood function

We...
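To give a flavor of this estimation step, here is a hedged sketch: under the same independence assumption and a symmetric prior, the unknown \(\alpha_i\) and \(\beta_i\) can be fit by maximizing the marginal log-likelihood of the observed label matrix, with the true labels marginalized out. I use SciPy's bounded L-BFGS-B optimizer for convenience (not the paper's approach) and constrain \(\alpha_i > 0.5\) to break the label-flipping symmetry; the synthetic data at the end is invented for illustration:

```python
import numpy as np
from scipy.optimize import minimize

def neg_log_marginal_likelihood(params, L, prior=0.5):
    """-sum_x log P(Omega(x)), marginalizing over the unknown label Y."""
    m = L.shape[1]
    alpha, beta = params[:m], params[m:]

    def likelihood(y):
        agree = (L == y) * (1 - beta) * alpha
        disagree = (L == -y) * (1 - beta) * (1 - alpha)
        abstain = (L == 0) * beta
        return (agree + disagree + abstain).prod(axis=1)

    marginal = prior * likelihood(1) + (1 - prior) * likelihood(-1)
    return -np.log(marginal + 1e-12).sum()

def fit_alpha_beta(L):
    """Maximum likelihood estimates of the accuracies and abstain rates."""
    m = L.shape[1]
    x0 = np.full(2 * m, 0.6)                          # neutral starting point
    bounds = [(0.51, 0.99)] * m + [(0.01, 0.99)] * m  # alpha > 0.5 assumed
    res = minimize(neg_log_marginal_likelihood, x0, args=(L,), bounds=bounds)
    return res.x[:m], res.x[m:]

# Sanity check on synthetic data with known parameters.
rng = np.random.default_rng(0)
true_alpha = np.array([0.85, 0.75, 0.9])
true_beta = np.array([0.3, 0.5, 0.4])
y = rng.choice([-1, 1], size=5000)
votes = np.where(rng.random((5000, 3)) < true_alpha, y[:, None], -y[:, None])
L = np.where(rng.random((5000, 3)) < true_beta, 0, votes)
alpha_hat, beta_hat = fit_alpha_beta(L)
```

Note that no true labels enter the fit: only the label matrix `L` is observed, which is the whole point of the paradigm.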
