Ben Trent (@benwtrent) — bluesky.baby

Introducing RTEB: A New Standard for Retrieval Evaluation We’re on a journey to advance and democratize artificial intelligence through open source and open science.

The @hf.co community is awesome. Real work that moves everyone forward: huggingface.co/blog/rteb

01.10.2025 16:22 👍 3 🔁 1 💬 0 📌 0

Lucene™ Core News Apache Lucene is a high-performance, full-featured search engine library written entirely in Java. It is a technology suitable for...

Apache Lucene 10.3.0 is released! 40% faster lexical search is absolutely crazy for a project that has been doing lexical search for a quarter of a century lucene.apache.org/core/corenew...

19.09.2025 20:20 👍 4 🔁 0 💬 0 📌 0

Elasticsearch vector search: Excluding vectors from source - Elasticsearch Labs Elasticsearch now excludes vectors from source by default, saving space and improving performance while keeping vectors accessible when needed.

Storing floating point values as a big 'ole JSON blob is silly, so we stopped doing that. Great stuff from Jim on making vector search in Elasticsearch substantially cheaper! www.elastic.co/search-labs/...

27.08.2025 13:42 👍 3 🔁 0 💬 0 📌 0

Hybrid search live coded from scratch RAG systems all use vector databases. HNSW (Hierarchical Navigable Small Worlds) is the most common algorithm. If you want to build RAG, you should appreciate how this algorithm works (Missed previous...

Next in the series of building a search engine from scratch - we focus on hybrid retrieval with @benwtrent.bsky.social of Elastic.

How do you add filtering to a vector search index?

I'll code. He'll yell at me.

maven.com/p/430592/hyb...

21.05.2025 13:15 👍 3 🔁 1 💬 0 📌 0

Sounds like fun!

15.05.2025 23:34 👍 1 🔁 0 💬 1 📌 0

Lucene™ Core News Apache Lucene is a high-performance, full-featured search engine library written entirely in Java. It is a technology suitable for...

It's time to redo benchmarks! #Lucene 10.2 was just released, with
- huge speedups to non-scoring boolean queries, range queries and filtered vector search,
- better merging defaults for faster search,
- much faster merging of vectors
And more...
lucene.apache.org/core/corenew...

12.04.2025 06:27 👍 6 🔁 1 💬 1 📌 0

Lucene will now intelligently merge HNSW graphs: elastic.co/search-labs/... Indexing and merging are now much cheaper, reducing the compute required and improving indexing throughput.

08.04.2025 12:57 👍 0 🔁 0 💬 0 📌 0

Indexing and merging times are getting better for #Apache #Lucene vector search. Lucene has a read-only segment architecture. One of the drawbacks of this approach is throwing away previously completed work when merging HNSW graphs. Well, this got better :)

08.04.2025 12:57 👍 2 🔁 1 💬 1 📌 0

Filtered HNSW & kNN search: Making searches faster - Elasticsearch Labs Explore the improvements we have made for HNSW vector search in Apache Lucene through our ACORN-1 algorithm implementation.

Read more about it here: elastic.co/search-labs/...

And yes, my child did the header artwork. I much prefer it to yet another piece of AI generated guff. Though, the "acorn" that the "squirrel" is holding got cropped out. 🙈

28.02.2025 15:39 👍 0 🔁 0 💬 0 📌 0

With this new algorithm, we have seen 3-5x fewer vector operations to achieve the same recall on filter percentages that previously performed horribly.

28.02.2025 15:39 👍 0 🔁 0 💬 1 📌 0

We have implemented a variation of ACORN-1 (arxiv.org/abs/2403.04871). The key idea is to expand your HNSW neighborhood search and only score candidates that match your filter criteria.

28.02.2025 15:39 👍 0 🔁 0 💬 1 📌 0
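To make the idea in the post above concrete, here is a toy sketch of a filtered graph search. It is not Lucene's implementation and not the full ACORN-1 algorithm: the function names, the `ef` parameter, and the one-dimensional demo graph are all assumptions made for illustration. It only shows the two ingredients the post names: when a neighbor fails the filter, the search looks through it to that neighbor's own neighbors (the expanded neighborhood), and distances are computed only for nodes that pass the filter.

```python
import heapq


def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def matching_neighbors(graph, node, allowed):
    """Yield filter-matching nodes one hop away, plus matching nodes reached
    by looking through non-matching neighbors (a two-hop expansion)."""
    for nbr in graph[node]:
        if nbr in allowed:
            yield nbr
        else:
            for nbr2 in graph[nbr]:
                if nbr2 in allowed:
                    yield nbr2


def filtered_search(graph, vectors, query, allowed, entry, k, ef=8):
    """Best-first search that scores only nodes in `allowed`.

    Returns (top-k node ids, number of vectors scored). Hypothetical helper,
    not a Lucene API.
    """
    visited = set()
    scored = 0
    frontier = []  # min-heap of (distance, node) still to expand
    best = []      # max-heap (negated distance) of the best `ef` nodes so far

    def visit(node):
        nonlocal scored
        if node in visited:
            return
        visited.add(node)
        d = squared_distance(vectors[node], query)
        scored += 1
        if len(best) < ef or d < -best[0][0]:
            heapq.heappush(frontier, (d, node))
            heapq.heappush(best, (-d, node))
            if len(best) > ef:
                heapq.heappop(best)

    # Assumption: if the entry point fails the filter, start from the
    # matching nodes in its expanded neighborhood instead of scoring it.
    if entry in allowed:
        visit(entry)
    else:
        for n in matching_neighbors(graph, entry, allowed):
            visit(n)

    while frontier:
        d, node = heapq.heappop(frontier)
        if len(best) >= ef and d > -best[0][0]:
            break  # nothing left in the frontier can improve the results
        for n in matching_neighbors(graph, node, allowed):
            visit(n)

    top = sorted((-neg_d, n) for neg_d, n in best)[:k]
    return [n for _, n in top], scored


# Demo: ten points on a line, each linked to its left and right neighbor.
vectors = [[float(i)] for i in range(10)]
graph = {i: [j for j in (i - 1, i + 1) if 0 <= j < 10] for i in range(10)}
allowed = {0, 2, 4, 6, 8}  # the filter: only even ids match
ids, scored = filtered_search(graph, vectors, [6.2], allowed, entry=0, k=2)
```

In the demo every direct neighbor of a matching node fails the filter, so a search restricted to matching one-hop neighbors would stall at the entry point. The two-hop expansion keeps the matching nodes reachable, and the `scored` counter confirms that no distance is ever computed for a filtered-out node.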

Filtered vector search is crazy important. So we made HNSW filtered search in Apache Lucene better. At similar recall, it can be 3-5x faster!

28.02.2025 15:39 👍 5 🔁 1 💬 1 📌 0

Elasticsearch history: 15 years of indexing and searching - Elasticsearch Labs Elasticsearch just turned 15-years-old! Take a look back at the last 15 years of indexing and searching, and turn to the next 15 years of relevance.

"elasticsearch: 15 years of indexing it all, finding what matters": www.elastic.co/search-labs/...
we turned it into a proper blog post with shay :)

13.02.2025 23:54 👍 3 🔁 1 💬 0 📌 0

Binary Vectors & Fuzzy Facets: Clustering Results in a Browser Using Binary Vectors YouTube video by Official Elastic Community

I really enjoyed this talk by @elasticmark.bsky.social. He is back at finding crazy & interesting ways to explore data (I guess he never stopped). Clustering with binary vectors & vector search with Elasticsearch www.youtube.com/watch?v=sJU_...

13.02.2025 21:57 👍 2 🔁 0 💬 0 📌 1

aoli-al - Overview aoli-al has 119 repositories available. Follow their code on GitHub.

This also shows the beauty of open source software. Out of nowhere Leo (github.com/aoli-al) comes to save the day, finding and helping fix tricky concurrency bugs in Apache Lucene.

07.02.2025 15:59 👍 0 🔁 0 💬 0 📌 0

GitHub - cmu-pasta/fray: A controlled concurrency testing framework for the JVM A controlled concurrency testing framework for the JVM - cmu-pasta/fray

Fray is honestly pretty easy to use, provides deterministic play back of concurrency failures, and automatically detects any concurrency failures through sequential execution of threads: github.com/cmu-pasta/fray

07.02.2025 15:59 👍 0 🔁 0 💬 1 📌 0

Concurrency bugs in Lucene: How to fix optimistic concurrency failures - Elasticsearch Labs Thanks to Fray, a deterministic concurrency testing framework from CMU’s PASTA Lab, we tracked down a tricky Lucene bug and squashed it

It's wonderful to see practical & important programming work. Debugging concurrent programs is incredibly difficult. Here is a bug found in Apache Lucene by the CMU Pasta Lab using their new Fray testing framework: www.elastic.co/search-labs/...

07.02.2025 15:59 👍 2 🔁 1 💬 1 📌 0

The number of improvements in Lucene here is crazy. Pretty much every count and boolean query gets a nice boost and some of the count improvements are hilarious 🚀🚀🚀.

15.01.2025 18:28 👍 5 🔁 0 💬 1 📌 0

Lucene Wrapped 2024 - Elasticsearch Labs 2024 has been another major year for Apache Lucene. In this blog, we’ll explore the key highlights.

It's so cool to see #Apache #Lucene going strong after about a quarter of a century 🤯. 2025 is gonna be a fun year for Lucene. www.elastic.co/search-labs/...

10.01.2025 13:32 👍 3 🔁 0 💬 0 📌 0

Early termination in HNSW for faster approximate KNN search - Elasticsearch Labs Learn how HNSW can be made faster for KNN search, using smart early termination strategies.

Early termination for vector search can be more than just "gathering K candidates". My colleague Tommaso gives a small overview of basic early termination strategies for vector index search. www.elastic.co/search-labs/...

07.01.2025 15:19 👍 3 🔁 0 💬 0 📌 0

Optimized Scalar Quantization: Even Better Binary Quantization - Elasticsearch Labs Here we explain optimized scalar quantization in Elasticsearch and how we used it to improve Better Binary Quantization (BBQ).

My team wrote a new backing algorithm for our BBQ indices, called Optimized Scalar Quantization. Here is a high level overview of its implementation in Elasticsearch (and soon Apache Lucene): www.elastic.co/search-labs/... For the math nerds, skip to Tom's blog: www.elastic.co/search-labs/...

06.01.2025 18:13 👍 2 🔁 1 💬 0 📌 0

Lucene CountAndHighHigh queries/sec

Lucene has been evaluating disjunctive queries by loading (windows of) postings into a bit set and or-ing these bit sets for 20+ years. It started using the same approach for conjunctive queries a few days ago. benchmarks.mikemccandless.com/CountAndHigh... (annotation HS)

21.12.2024 16:37 👍 2 🔁 1 💬 1 📌 0
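As a rough illustration of the approach described in the post above, the sketch below counts matches for a disjunction or a conjunction by turning each window of doc IDs into a bit set and combining the bit sets with OR or AND. This is a simplified sketch, not Lucene's code: the window size, the function names, and the use of Python integers as bit sets are all assumptions.

```python
from bisect import bisect_left

WINDOW = 64  # hypothetical window size, in doc IDs


def window_bits(postings, start, size=WINDOW):
    """Return a bit set (as an int) of the doc IDs in sorted `postings`
    that fall inside the window [start, start + size)."""
    lo = bisect_left(postings, start)
    hi = bisect_left(postings, start + size)
    bits = 0
    for doc in postings[lo:hi]:
        bits |= 1 << (doc - start)
    return bits


def count_matches(posting_lists, max_doc, conjunction=False, size=WINDOW):
    """Count docs matching any clause (OR) or every clause (AND),
    processing the doc ID space one window at a time."""
    if not posting_lists:
        return 0
    total = 0
    for start in range(0, max_doc, size):
        acc = window_bits(posting_lists[0], start, size)
        for postings in posting_lists[1:]:
            bits = window_bits(postings, start, size)
            acc = acc & bits if conjunction else acc | bits
        total += bin(acc).count("1")  # population count of the window
    return total


# Two sorted posting lists (doc IDs) for two hypothetical terms.
a = [1, 5, 70, 130]
b = [5, 64, 70, 200]
or_count = count_matches([a, b], max_doc=256)
and_count = count_matches([a, b], max_doc=256, conjunction=True)
```

The appeal of the window is that combining clauses becomes a handful of bitwise operations over a small, cache-friendly block, and counting reduces to a population count, regardless of how many clauses contribute to that window.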

Lucene bug adventures: Fixing a corrupted index exception - Elasticsearch Labs Sometimes, a single line of code takes days to write. Here, we get a glimpse of an engineer's pain and debugging over multiple days to fix a potential Apache Lucene index corruption.

Something a little different from my typical blogs. This line of code in Apache Lucene took me 3 days to write. For fixing bugs, it's about the journey, not necessarily the destination. www.elastic.co/search-labs/... (the cover art was provided by one of my kids :))

27.12.2024 17:16 👍 4 🔁 1 💬 1 📌 0

Understanding optimized scalar quantization - Elasticsearch Labs In this post we explain a new form of scalar quantization we've developed at Elastic that achieves state-of-the-art accuracy for binary quantization

Our Better Binary Quantization (BBQ) index in Elasticsearch has a new backing algorithm. Better(er) recall & query speed for vector search. It's a natural evolution of our scalar quantization. Shipping soon. It's pretty neat: www.elastic.co/search-labs/...

20.12.2024 16:14 👍 3 🔁 1 💬 0 📌 0

Ensuring business rules work seamlessly with semantic search - Elasticsearch Labs Harness the power of query rules combined with semantic search and rerankers.

Elasticsearch just got more powerful. Now, semantic, hybrid, and vector retrieval with custom rules for pinning and bubbling results to the top! Now you have multi-phased, hybrid retrieval in combination with business rules :D www.elastic.co/search-labs/...

19.12.2024 15:36 👍 1 🔁 0 💬 0 📌 0

Quantization: The Important Bits You know, for search, an Elastic podcast · Episode

It was so much fun talking #Elasticsearch with Steve Mayzak on “You Know, For Search”. I could nerd out for hours, but we kept it down to just 1 hour (maybe even that is too long....). Give it a listen, if nothing else, for Steve's dulcet tones: open.spotify.com/episode/7HLH...

11.12.2024 13:07 👍 2 🔁 3 💬 0 📌 0

Exploring depth in a 'retrieve-and-rerank' pipeline - Elasticsearch Labs Select an optimal re-ranking depth for your model and dataset.

Be prepared to learn more about semantic rerankers than you ever thought you needed to know. Another awesome analysis from my colleagues at Elasticsearch www.elastic.co/search-labs/...

05.12.2024 16:49 👍 7 🔁 0 💬 0 📌 0

Smokin' fast BBQ with hardware accelerated SIMD instructions - Search Labs How we optimized vector comparisons in BBQ with hardware accelerated SIMD (Single Instruction Multiple Data) instructions.

More magic from chef Chris Hegarty. How better binary quantization vector ops are accelerated with Java SIMD in Elasticsearch vector search www.elastic.co/search-labs/...

04.12.2024 19:16 👍 4 🔁 0 💬 0 📌 0

Do less with serverless: Elastic Cloud Serverless — Now GA Elastic Cloud Serverless is the easiest way to start and scale your capabilities in search, observability and security. Built on a reimagined Elasticsearch architecture, it ensures low-latency queryin...

I cannot adequately express how proud I am of the #Elasticsearch team for delivering this. It is a humungous engineering achievement and the results of (metaphorical) blood, sweat, and (maybe real ;) ) tears. go.es.io/3CVo82X

02.12.2024 15:08 👍 2 🔁 0 💬 0 📌 0

We have seen this idea played out nicely with Tantivy and Apache Lucene. Benchmarking between each other and lovingly borrowing ideas between the projects.

26.11.2024 21:32 👍 3 🔁 2 💬 0 📌 0
