Chonky_mmbert_small_multilingual_v1

Chonky is a transformer model that intelligently segments text into meaningful semantic chunks. It can be used in RAG systems.

🆕 Now multilingual!

Model Description

The model processes text and divides it into semantically coherent segments. These chunks can then be fed into embedding-based retrieval systems or language models as part of a RAG pipeline.

⚠️ This model was fine-tuned on sequences of length 1024 (by default, mmBERT supports sequence lengths up to 8192).

How to use

I've made a small Python library for this model: chonky

Here is the usage:

    from chonky import ParagraphSplitter

    # On the first run this downloads the transformer model.
    splitter = ParagraphSplitter(
        model_id="mirth/chonky_mmbert_small_multilingual_1",
        device="cpu"
    )

    text = (
        "Before college the two main things I worked on, outside of school, were writing and programming. "
        "I didn't write essays. I wrote what beginning writers were supposed to write then, and probably still are: short stories. "
        "My stories were awful. They had hardly any plot, just characters with strong feelings, which I imagined made them deep. "
        "The first programs I tried writing were on the IBM 1401 that our school district used for what was then called 'data processing.' "
        "This was in 9th grade, so I was 13 or 14. The school district's 1401 happened to be in the basement of our junior high school, "
        "and my friend Rich Draves and I got permission to use it. It was like a mini Bond villain's lair down there, with all these alien-looking machines — "
        "CPU, disk drives, printer, card reader — sitting up on a raised floor under bright fluorescent lights."
    )

    # Print each semantic chunk, followed by a separator line.
    for chunk in splitter(text):
        print(chunk)
        print("--")

Sample Output:

    Before college the two main things I worked on, outside of school, were writing and programming. I didn't write essays. I wrote what beginning writers were supposed to write then, and probably still are: short stories. My stories were awful. They had ha...
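To illustrate how the chunks might feed an embedding-based retrieval step in a RAG pipeline, here is a minimal sketch. It is not part of the chonky library. The `embed`, `cosine`, and `retrieve` helpers are hypothetical names, and the bag-of-words vector is only a toy stand-in for a real embedding model. The `chunks` list is hard-coded from sentences in the example text above so the snippet runs without downloading the model. In practice it would come from `list(splitter(text))`.

```python
import math
import re
from collections import Counter


def embed(text):
    """Toy stand-in for an embedding model: a bag-of-words term-count vector."""
    return Counter(re.findall(r"\w+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(count * b[term] for term, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query, chunks, top_k=1):
    """Return the top_k chunks most similar to the query."""
    query_vec = embed(query)
    ranked = sorted(
        chunks,
        key=lambda chunk: cosine(query_vec, embed(chunk)),
        reverse=True,
    )
    return ranked[:top_k]


# Hypothetical chunks, standing in for list(splitter(text)).
chunks = [
    "I didn't write essays. I wrote what beginning writers were supposed "
    "to write then, and probably still are: short stories. My stories were awful.",
    "The first programs I tried writing were on the IBM 1401 that our school "
    "district used for what was then called 'data processing.'",
]

# The best-matching chunk would then be passed to a language model as context.
for hit in retrieve("Which computer were the first programs written on?", chunks):
    print(hit)
```

In a real pipeline, `embed` would be replaced by calls to an actual embedding model and the ranking by a vector index. The chunking step stays the same.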