The End of the Train-Test Split

https://news.ycombinator.com/rss
Summary

2015: Loosely based on a true story. You are a machine learning engineer at Facebook in Menlo Park. Your task: build the best butt classification model, which decides if there is an exposed butt in an image. The content policy team in D.C. has written country-specific censorship rules based on cultural tolerance for gluteal cleft, or butt crack, for the uninitiated. Germany: 0% cleft. Zimbabwe: 30% cleft. Cupertino: 0%. Montana: 20%.

A PM on your team writes data labeling guidelines for a business process outsourcing firm (BPO), and each example in your dataset is triple-reviewed by the firm's outsourced team to ensure consistency. You skim the labels, which seem reasonable.

    import torch
    import pandas as pd
    from torch.utils.data import DataLoader, TensorDataset
    from sklearn.model_selection import train_test_split

    # Load the BPO-labeled dataset and separate features from labels.
    df = pd.read_csv("gluteal_cleft_labels.csv")
    X = df.drop("label", axis=1).values
    y = df["label"].values

    # Hold out 20% of the examples as a test set, with a fixed seed.
    x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

You decide to train a CNN: it'll be perfect for this edge detection task. Two months later, you've cracked it. Your model goes live, great success, 92% precision, 98% recall. You never once had to talk to the policy team in D.C.

2023: The butt model has been in production for 8 years. Another email: Policy has heard about LLMs and thinks it's time to build a more "context-aware" model. They would like the model to understand whether there is sexually suggestive posing, sexual context, or artistic intent in the image. You receive a 10-page policy doc. The PM cleans it up a bit and sends it to the BPO. The data is triple-reviewed, you skim the labels, and they seem fine.

You make an LLM decision tree, one LLM call per policy section, and aggregate the results. Two months pass. You are stuck at 85% precision and recall, no matter how much prompt engineering and model tuning you do. You try going back to a CNN, training it on the labels. It scores 83%.
Your data science s...
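
For readers who want to see the shape of the 2023 setup, here is a minimal sketch of "one call per policy section, then aggregate", followed by the precision and recall calculation against the labels. Everything in it is an assumption for illustration: the section names, the keyword stubs standing in for LLM calls, the "flag if any section fires" aggregation rule, and the four toy examples. None of it is the author's actual pipeline.

```python
from typing import Callable, Dict, List, Sequence, Tuple

# A section classifier answers "does this item violate this policy section?".
# In the story this would be an LLM call on an image; here it is a
# hypothetical stub that looks at a text description instead.
SectionClassifier = Callable[[str], bool]

# Hypothetical policy sections, each backed by a keyword stub.
SECTIONS: Dict[str, SectionClassifier] = {
    "suggestive_posing": lambda text: "posing" in text,
    "sexual_context": lambda text: "sexual" in text,
}


def classify(item: str, sections: Dict[str, SectionClassifier]) -> Tuple[bool, List[str]]:
    """Run one classifier per section; flag the item if any section fires.

    The any-of rule is an assumed aggregation, chosen only for simplicity.
    """
    fired = [name for name, check in sections.items() if check(item)]
    return bool(fired), fired


def precision_recall(y_true: Sequence[bool], y_pred: Sequence[bool]) -> Tuple[float, float]:
    """Compute precision and recall for the positive (violating) class."""
    tp = sum(t and p for t, p in zip(y_true, y_pred))
    fp = sum((not t) and p for t, p in zip(y_true, y_pred))
    fn = sum(t and (not p) for t, p in zip(y_true, y_pred))
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Toy held-out examples with made-up labels (True = violates policy).
items = [
    "beach photo",
    "suggestive posing on bed",
    "sexual context caption",
    "statue posing in museum",
]
labels = [False, True, True, False]

preds = [classify(item, SECTIONS)[0] for item in items]
precision, recall = precision_recall(labels, preds)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

In this toy run the museum statue trips the posing stub even though its label says it is fine, which drags precision down while recall stays perfect. That is the kind of disagreement between per-section rules and labels that no amount of tuning on a fixed test split will resolve.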

First seen: 2025-12-04 18:13

Last seen: 2025-12-04 18:13