An LLM Query Understanding Service

Summary

We should be cheating at search with LLMs. Indeed, I'm teaching a whole course on this in July. With an LLM we can implement in days what previously took months. We can take apart a query like "brown leather sofa" into the important dimensions of intent: "color: brown, material: leather, category: couches", and so on. With this power, all search becomes structured search. Even better, we can do all of this without calling out to OpenAI/Gemini/…. We can use simple LLMs running in our own infrastructure, making things faster and cheaper. I'm going to show you how. Let's get started. Follow along in this repo.

The service - wrapping an open source LLM

We'll start by deploying a FastAPI app that calls an LLM. The code below is just a dummy "hello world" app talking to an LLM: we send a chat message over JSON, the LLM comes up with a response, and we send it back. Here's the basic service:

```python
from fastapi import FastAPI, Request
from fastapi.responses import JSONResponse
from llm_query_understand.llm import LargeLanguageModel
from time import perf_counter

app = FastAPI()
llm = LargeLanguageModel()


@app.post("/chat")
async def chat(request: Request):
    body = await request.json()
    prompt = body.get("msg")
    response = llm.generate(prompt, max_length=100)
    resp = {
        "response": response,
        "prompt": prompt
    }
    return JSONResponse(content=resp)
```

And calling a light LLM (Qwen2-7B) via PyTorch:

```python
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch
import os

# Prefer CUDA, then Apple Silicon (MPS), then fall back to CPU
DEVICE = torch.device(
    "cuda" if torch.cuda.is_available()
    else "mps" if torch.backends.mps.is_available()
    else "cpu"
)


class LargeLanguageModel:
    def __init__(self, device=DEVICE, model="Qwen/Qwen2-7B"):
        self.device = device
        self.tokenizer = AutoTokenizer.from_pretrained(model)
        self.model = AutoModelForCausalLM.from_pretrained(
            model,
            torch_dtype=torch.float16,
            device_map="auto"
        ).to(self.device)

    def generate(self, prompt: str, max_length: int = 100):
        # Tokenize the prompt and move the tensors to the target device
        inputs = self.tokenizer(prompt, return_tensors="pt")
        inputs = {k: v.to(self.device) for k, v in inputs.items()}
        # The source excerpt cuts off here; the usual transformers pattern is
        # to generate token ids and decode them back into text
        outputs = self.model.generate(**inputs, max_length=max_length)
        return self.tokenizer.decode(outputs[0], skip_special_tokens=True)
```
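
To sanity check the service, you can POST a chat message to /chat and read back the generated response. A minimal sketch, assuming the app is served locally with uvicorn on port 8000; the module path, port, and example prompt below are illustrative, not from the article:

```python
# Smoke test for the /chat endpoint.
# Assumes the FastAPI app is running locally, e.g.:
#   uvicorn llm_query_understand.main:app --port 8000   (module path is a guess)
import requests

resp = requests.post(
    "http://localhost:8000/chat",
    json={"msg": "What color sofas go well with a rustic living room?"},
    timeout=120,  # the first call can be slow while the model loads and warms up
)
print(resp.json()["response"])
```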
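The excerpt stops before the query understanding prompt itself, but the intro states the goal: decompose a query like "brown leather sofa" into structured fields such as color, material, and category. Here is a minimal sketch of that idea on top of the LargeLanguageModel class above; the prompt wording, the parse_query helper, and the JSON post-processing are assumptions for illustration, not the article's code:

```python
import json

# Hypothetical prompt: ask the model to decompose a furniture query into
# structured fields and return JSON only.
PARSE_PROMPT = """Decompose the furniture search query into JSON with keys
"color", "material", and "category". Use null for anything not mentioned.
Return only JSON.

Query: {query}
JSON:"""


def parse_query(llm, query: str) -> dict:
    raw = llm.generate(PARSE_PROMPT.format(query=query), max_length=200)
    # Causal LMs echo the prompt, so keep only the text after the last "JSON:" marker
    candidate = raw.split("JSON:")[-1].strip()
    return json.loads(candidate)


# Example (expected shape, per the intro):
# parse_query(llm, "brown leather sofa")
# -> {"color": "brown", "material": "leather", "category": "couches"}
```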
