The @hf.co community is awesome. Real work that moves everyone forward: huggingface.co/blog/rteb
Apache Lucene 10.3.0 is released! 40% faster lexical search is absolutely crazy for a project that has been doing lexical search for a quarter of a century lucene.apache.org/core/corenew...
Storing floating point values as a big 'ole JSON blob is silly, so we stopped doing that. Great stuff from Jim on making vector search in Elasticsearch substantially cheaper! www.elastic.co/search-labs/...
Next in the series on building a search engine from scratch: we focus on hybrid retrieval with @benwtrent.bsky.social of Elastic.
How do you add filtering to a vector search index?
I'll code. He'll yell at me.
maven.com/p/430592/hyb...
Sounds like fun!
It's time to redo benchmarks! #Lucene 10.2 was just released, with
- huge speedups to non-scoring boolean queries, range queries and filtered vector search,
- better merging defaults for faster search,
- much faster merging of vectors
And more...
lucene.apache.org/core/corenew...
Lucene will now intelligently merge HNSW graphs: elastic.co/search-labs/... Now indexing and merging are much cheaper, reducing the compute required and improving indexing throughput.
Indexing and merging times are getting better for #Apache #Lucene vector search. Lucene has a read-only segment architecture. One of the drawbacks of this approach is throwing away previously completed work when merging HNSW graphs. Well, this got better :)
Read more about it here: elastic.co/search-labs/...
And yes, my child did the header artwork. I much prefer it to yet another piece of AI generated guff. Though, the "acorn" that the "squirrel" is holding got cropped out.
With this new algorithm, we have seen 3-5x fewer vector operations to achieve the same recall on filter percentages that previously performed horribly.
We have implemented a variation of ACORN-1. arxiv.org/abs/2403.04871 The key idea is expanding your HNSW neighborhood search, and only scoring candidates that match your filter criteria.
Filtered vector search is crazy important. So we made HNSW filtered search in Apache Lucene better. At similar recall, it can be 3-5x faster!
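To make the idea in the filtered-search posts above concrete, here is a minimal sketch, assuming a single-layer neighbor graph rather than a full HNSW hierarchy. It is not Lucene's implementation: the function name, the `ef` parameter, and the fixed one-extra-hop expansion are all illustrative assumptions. Nodes that fail the filter are never scored, but their neighbors are still explored, so the search can walk through filtered-out regions without paying for distance computations there.

```python
import heapq
import math


def filtered_graph_search(graph, vectors, query, passes_filter, entry, k, ef=4):
    """Best-first search that scores only filter-matching nodes.

    Hypothetical helper, not a Lucene API. Returns a tuple of
    (top-k matching node ids, nearest first; number of vectors scored).
    """
    visited = {entry}
    candidates = []  # min-heap of (distance, node) still to expand
    best = []        # max-heap (negated distance) of the ef best matches so far
    scored = 0

    def expand(node):
        # Collect unvisited matching neighbors. A neighbor that fails the
        # filter is skipped unscored and its own neighbors are checked instead.
        found = []
        for nb in graph[node]:
            if nb in visited:
                continue
            visited.add(nb)
            if passes_filter(nb):
                found.append(nb)
                continue
            for nb2 in graph[nb]:
                if nb2 not in visited and passes_filter(nb2):
                    visited.add(nb2)
                    found.append(nb2)
        return found

    def score(node):
        nonlocal scored
        scored += 1
        d = math.dist(vectors[node], query)
        heapq.heappush(candidates, (d, node))
        heapq.heappush(best, (-d, node))
        if len(best) > ef:
            heapq.heappop(best)  # drop the current worst match

    for seed in ([entry] if passes_filter(entry) else expand(entry)):
        score(seed)
    while candidates:
        d, node = heapq.heappop(candidates)
        if len(best) >= ef and d > -best[0][0]:
            break  # nearest unexpanded candidate is worse than the worst kept match
        for nb in expand(node):
            score(nb)
    top = sorted((-neg_d, node) for neg_d, node in best)[:k]
    return [node for _, node in top], scored
```

On a toy 10-node chain where only even-numbered nodes pass the filter, a query near node 6 is answered by scoring just the five even nodes; the odd nodes are traversed but never compared against the query.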
"elasticsearch: 15 years of indexing it all, finding what matters": www.elastic.co/search-labs/...
we turned it into a proper blog post with shay :)
I really enjoyed this talk by @elasticmark.bsky.social. He is back at finding crazy & interesting ways to explore data (I guess he never stopped). Clustering with binary vectors & vector search with Elasticsearch www.youtube.com/watch?v=sJU_...
This also shows the beauty of OpenSource software. Out of nowhere Leo (github.com/aoli-al) comes to save the day, finding and helping fix tricky concurrency bugs in Apache Lucene.
Fray is honestly pretty easy to use, provides deterministic play back of concurrency failures, and automatically detects any concurrency failures through sequential execution of threads: github.com/cmu-pasta/fray
It's wonderful to see practical & important programming work. Debugging concurrent programs is incredibly difficult. Here is a bug found in Apache Lucene by the CMU Pasta Lab using their new Fray testing framework: www.elastic.co/search-labs/...
The number of improvements in Lucene here is crazy. Pretty much every count and boolean query gets a nice boost and some of the count improvements are hilarious.
It's so cool to see #Apache #Lucene going strong after about a quarter of a century 🤯. 2025 is gonna be a fun year for Lucene. www.elastic.co/search-labs/...
Early termination for vector search can be more than just "gathering K candidates". My colleague Tommaso gives a small overview of basic early termination strategies for vector index search. www.elastic.co/search-labs/...
My team wrote a new backing algorithm for our BBQ indices, called Optimized Scalar Quantization. Here is a high level overview of its implementation in Elasticsearch (and soon Apache Lucene). www.elastic.co/search-labs/... For the math nerds, skip to Tom's blog: www.elastic.co/search-labs/...
Lucene has been evaluating disjunctive queries by loading (windows of) postings into a bit set and or-ing these bit sets for 20+ years. It started using the same approach for conjunctive queries a few days ago. benchmarks.mikemccandless.com/CountAndHigh... (annotation HS)
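As a rough illustration of the windowed bit set approach described in the post above, here is a minimal sketch, not Lucene's code. It assumes each clause's postings are a sorted list of doc IDs, uses a Python integer as the bit set, and picks an arbitrary window size; `WINDOW`, `window_bits`, and `count_matches` are hypothetical names. Disjunctions OR the per-window bit sets together, conjunctions AND them.

```python
from bisect import bisect_left

WINDOW = 64  # illustrative window size, not Lucene's actual value


def window_bits(postings, start, end):
    """Load the doc IDs in [start, end) from one sorted postings list into a bit set."""
    bits = 0
    lo = bisect_left(postings, start)
    hi = bisect_left(postings, end)
    for doc in postings[lo:hi]:
        bits |= 1 << (doc - start)
    return bits


def count_matches(clauses, max_doc, conjunction=False):
    """Count matching docs window by window.

    clauses: non-empty list of sorted postings lists, one per query clause.
    conjunction=False ORs the bit sets (any clause matches);
    conjunction=True ANDs them (all clauses must match).
    """
    total = 0
    for start in range(0, max_doc, WINDOW):
        end = min(start + WINDOW, max_doc)
        acc = window_bits(clauses[0], start, end)
        for postings in clauses[1:]:
            bits = window_bits(postings, start, end)
            acc = (acc & bits) if conjunction else (acc | bits)
        total += bin(acc).count("1")  # population count of the window
    return total
```

For example, with postings `[1, 5, 70, 130]` and `[5, 64, 70, 200]` over 256 docs, the disjunction counts 6 matching docs and the conjunction counts 2.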
Something a little different from my typical blogs. This line of code in Apache Lucene took me 3 days to write. For fixing bugs, it's about the journey, not necessarily the destination. www.elastic.co/search-labs/... (the cover art was provided by one of my kids :))
Our Better Binary Quantization (BBQ) index in Elasticsearch has a new backing algorithm. Better(er) recall & query speed for vector search. It's a natural evolution of our scalar quantization. Shipping soon. It's pretty neat: www.elastic.co/search-labs/...
Elasticsearch just got more powerful. Now, semantic, hybrid, and vector retrieval with custom rules for pinning and bubbling results to the top! Now you have multi-phased, hybrid retrieval in combination with business rules :D www.elastic.co/search-labs/...
It was so much fun talking #Elasticsearch with Steve Mayzak on “You Know, For Search”. I could nerd out for hours, but we kept it down to just 1 hour (maybe even that is too long....). Give it a listen, if nothing else, for Steve's dulcet tones: open.spotify.com/episode/7HLH...
Be prepared to learn more about semantic rerankers than you ever thought you needed to know. Another awesome analysis from my colleagues at Elasticsearch www.elastic.co/search-labs/...
More magic from chef Chris Hegarty. How better binary quantization vector ops are accelerated with Java SIMD in Elasticsearch vector search www.elastic.co/search-labs/...
I cannot adequately express how proud I am of the #Elasticsearch team for delivering this. It is a humungous engineering achievement and the result of (metaphorical) blood, sweat, and (maybe real ;) ) tears. go.es.io/3CVo82X
We have seen this idea played out nicely with Tantivy and Apache Lucene. Benchmarking between each other and lovingly borrowing ideas between the projects.