Vector search using only Parquet and DataFusion – Xiangpeng’s blog
Just tune your existing stack
"You need a vector database" is the default advice, but that choice adds complexity: new infrastructure, new formats, and data coordination overhead. What if tuning Parquet page sizes and adding lightweight footer metadata gave you vector search without any of that? blog.xiangpeng.systems/posts/vector...
20.02.2026 23:05
👍 0
🔁 0
💬 0
📌 0
Fifty Releases of Pointblank: a Year of Building Data Quality Tooling - Posit
The Pointblank Python library's journey has resulted in a premier, data-agnostic toolkit for tabular data validation.
Most data bugs aren't algorithm failures, they're assumption violations upstream. Pointblank makes validation a first-class citizen in R/Python workflows instead of an afterthought scattered in assert statements.
13.02.2026 15:47
👍 2
🔁 1
💬 0
📌 0
The General Inquirer in the time of LLMs: a BERTopic tutorial – CSS@IPP — Tutorials and resources
A. Morin
Here's what they don't tell you about topic modeling: preprocessing determines everything. Corpus homogeneity, document length, and context windows matter more than algorithm choice. BERTopic won't fix bad inputs. No model will.
06.02.2026 15:47
👍 6
🔁 2
💬 0
📌 0
Introduction to PostgreSQL Indexes
Who’s this for; Basics; How data is stored in disk; How indexes speedup access to data; Costs associated with indexes; Disk Space; Write operations; Query planner; Memory usage; Types of Indexes; Btree; Hash…
Indexes speed up reads but slow writes. The real insight: indexes only help when you're returning <15-20% of rows. Beyond that, sequential scans win. Space and memory costs matter too. B-tree indexes often exceed table size.
05.02.2026 14:03
👍 1
🔁 0
💬 0
📌 0
Behold the power of the spectrum!
Eigenvalues as neurons: represent nonlinear models as the k-th eigenvalue of a learned symmetric matrix pencil. Explore monotonicity/convexity properties and train simple spectral models.
What if a model’s prediction was literally an eigenvalue? This post kicks off a thoughtful series exploring spectrum-based models as a middle ground between linear models and neural nets, with interpretability and robustness baked in. A fun, slightly off-the-beaten-path ML read.
05.02.2026 03:37
👍 1
🔁 0
💬 0
📌 0
Data to Art
A curated gallery celebrating data visualization as art. Discover innovative artworks from international artists who transform data into emotionally compelling visual experiences.
Data To Art is a new curated gallery transforming datasets into visual storytelling. Latest addition: Alisa Singer's Environmental Graphiti turns climate science into vibrant data-driven art. The line between science and abstraction is thinner than you think.
04.02.2026 01:13
👍 2
🔁 0
💬 0
📌 0
Databases in 2025: A Year in Review
The world tried to kill Andy off but he had to stay alive to talk about what happened with databases in 2025.
If you're wondering where database tech is really headed in 2026, Andy Pavlo's annual review cuts through the noise. Sharding, Postgres evolution, and hot takes you won't find in vendor blogs.
09.01.2026 02:01
👍 1
🔁 0
💬 0
📌 0
Top Python libraries of 2025
Explore our 11th annual Top Python Libraries roundup, featuring two curated Top 10 lists for General Use and AI / ML / Data tools that matter today.
MarkItDown (Microsoft) converts a variety of file types to Markdown optimized for LLMs. Handles structure, not just text. Pairs well with Kreuzberg for extraction (supports 50+ file formats!). If you're building RAG systems, these belong in your stack.
06.01.2026 01:37
👍 3
🔁 0
💬 0
📌 0
jax-js: an ML library for the web
JAX in pure JavaScript, as a flexible machine learning library and compiler.
jax-js compiles NumPy-style array code into WebAssembly and WebGPU kernels that run entirely client-side. No server, no dependencies, just JAX's programming model in your browser. This changes what's possible for interactive ML demos.
03.01.2026 15:34
👍 0
🔁 0
💬 0
📌 0
Soccer Analytics 2025 Review – Jan Van Haaren
Collection of the soccer analytics content that I liked the most in 2025!
Jan Van Haaren's 2025 soccer analytics review is a solid reference if you're doing applied ML in sports or want to see how domain experts handle sequential data. Covers spatio-temporal models, graph methods, Bayesian forecasting, and tracking data metrics. janvanhaaren.be/posts/soccer...
02.01.2026 20:57
👍 1
🔁 0
💬 0
📌 0
Benchmark Studies
It is impossible to disentangle technical innovation from technical debt
ML progress isn't driven by elegant theory. It's benchmarks, leaderboards, and engineering culture. In this post, Ben Recht explores why empirical testing beats clean math in practice and why that tension defines the field.
20.12.2025 16:02
👍 1
🔁 0
💬 0
📌 1
True Stories from the (Data) Battlefield – Part 1: Communicating About Data
A blog about data science, statistics, and data analysis with open-source software.
The paradox: show a complex graph and execs check their phones. Show a simple one and they demand endless breakdowns. The solution isn't more or less data. It's recentering discussions on the actual decision at hand. methodmatters.github.io/true-stories...
19.12.2025 15:47
👍 0
🔁 0
💬 0
📌 0
Useful patterns for building HTML tools
I’ve started using the term HTML tools to refer to HTML applications that I’ve been building which combine HTML, JavaScript, and CSS in a single file and use them to …
Your browser can run Python (Pyodide), execute OCR on PDFs, crop videos, and call LLM APIs—all without uploading anything to a server. The localStorage + CORS pattern makes surprisingly powerful tools possible with zero backend infrastructure.
18.12.2025 15:37
👍 0
🔁 0
💬 0
📌 0
Broken Chart: discover 9 visualization alternatives
Researcher in climate science at MBG-CSIC
Tired of charts that hide the story? Density plots + percentile intervals reveal what averages can't: full variability, extremes, historical context. Nine examples comparing Spain's temperature data show how geometry choice changes what readers understand.
17.12.2025 18:04
👍 1
🔁 0
💬 0
📌 0
Fisher arbitrarily chose p<0.05 a century ago and we've just... kept it. The problem: calling it "arbitrary" only works if you can suggest something less arbitrary. No one has, so here we are circling p-values close to 0.05 like it means something.
vilgot-huhn.github.io/mywebsite/po...
16.12.2025 13:13
👍 0
🔁 0
💬 0
📌 0
Haskell IS a Great Language for Data Science
I’ve been learning Haskell for a few years now and I am really liking a lot of the features, not least the strong typing and functional approach. I thought it was lacking some of the things I missed…
Haskell for data science? dataHaskell adds dataframes, NSE-style column operations, and compiler optimizations that turn chained operations into single-pass computations. Immutability + strong types + functional composition might be the combo we've been missing. jcarroll.com.au/2025/12/05/h...
15.12.2025 19:48
👍 6
🔁 0
💬 0
📌 0
A modern guide to SQL JOINs
There are many SQL JOINs guides and tutorials, but this one takes a different approach. We try to avoid misleading wording and imagery, and we structure the material in a different way. The goal of…
Learning SQL is like learning a foreign language: you need to read more variations than you'll actually write. Learn disciplined canonical syntax for your own queries, but understand the messy dialects others use.
05.12.2025 02:01
👍 1
🔁 0
💬 1
📌 0
Make Things, Tell People
On side projects and finding work
Side projects still open more doors than traditional applications in data roles. The trick: build small, interesting things that signal skills and make your work discoverable. Especially relevant given the current job market.
01.12.2025 13:37
👍 8
🔁 3
💬 0
📌 1
Lessons learned in starting a central data team
Learn how the MTA succeeded in setting up a central data team and a general purpose, cloud-based platform for data analytics.
Hot take: Your analysts doing "some basic data engineering" is killing your analytics function. The MTA hired 5 dedicated data engineers and it unlocked everything else. Stop asking data scientists to maintain pipelines. www.mta.info/article/less...
29.11.2025 03:47
👍 7
🔁 3
💬 0
📌 0
Python Rgonomics: User-defined functions in polars | Emily Riederer
Polars provides a consistent API for conducting transformations against a DataFrame. But what do you do when you need to apply a user-defined function beyond the native API? This post surveys the…
Sometimes the best polars pattern is knowing when to exit the DataFrame. partition_by() splits data into a dict of frames, letting you process with list comprehensions. Cleaner than forcing everything through map_groups() when further wrangling isn't needed.
24.11.2025 13:37
👍 1
🔁 0
💬 0
📌 0
How I Use Every Claude Code Feature
A brain dump of all the ways I've been using Claude Code.
Most devs treat AI coding agents like infinite context machines. Reality: a 200k token window fills fast. The /compact feature is a trap. Better approach: /clear + document state in markdown, then resume. Treat context like disk space. You need a cleanup strategy.
23.11.2025 16:15
👍 1
🔁 0
💬 1
📌 0
A Visual Introduction to Dimensionality Reduction with Isomap
"To deal with hyper-planes in a 14-dimensional space, visualize a 3D space and say 'fourteen' to yourself very loudly. Everyone does it." - Geoffrey Hinton
Most modern dimensionality reduction (t-SNE, UMAP, Isomap) shares a pattern: represent data as a graph capturing local similarity, then embed to preserve that structure. It's graphs all the way down.
22.11.2025 03:47
👍 1
🔁 0
💬 0
📌 0
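The graph-first pattern in the Isomap post above can be sketched in plain Python. This is only the graph stage, under stated assumptions: a k-nearest-neighbour graph with Euclidean edge weights, then all-pairs shortest paths via Floyd-Warshall. The function names and the half-circle toy data are hypothetical. Isomap would then embed these graph distances with classical MDS, a step omitted here.

```python
import math


def knn_graph(points, k):
    """Undirected graph: each point is linked to its k nearest neighbours."""
    n = len(points)
    graph = [[math.inf] * n for _ in range(n)]
    for i in range(n):
        graph[i][i] = 0.0
        nearest = sorted(
            (math.dist(points[i], points[j]), j) for j in range(n) if j != i
        )
        for d, j in nearest[:k]:
            graph[i][j] = d
            graph[j][i] = d
    return graph


def geodesic_distances(graph):
    """All-pairs shortest paths (Floyd-Warshall) over the neighbour graph."""
    n = len(graph)
    dist = [row[:] for row in graph]
    for mid in range(n):
        for i in range(n):
            for j in range(n):
                via = dist[i][mid] + dist[mid][j]
                if via < dist[i][j]:
                    dist[i][j] = via
    return dist


# Toy data: 11 points along a half circle, a curved 1-D manifold in 2-D.
points = [(math.cos(i * math.pi / 10), math.sin(i * math.pi / 10)) for i in range(11)]

geo = geodesic_distances(knn_graph(points, k=2))
straight = math.dist(points[0], points[-1])  # cuts across the half circle
along_curve = geo[0][-1]                     # follows the neighbour graph

print(f"straight line: {straight:.3f}, along the graph: {along_curve:.3f}")
```

The graph distance between the two endpoints is longer than the straight-line distance because it follows the curve, which is the local structure an embedding step would try to preserve.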
Torching the Modern-Day Library of Alexandria
“Somewhere at Google there is a database containing 25 million books and nobody is allowed to read them.”
I curated some readings for class on "data tensions" and the list felt worth sharing. Come on a tour of datasets, books, the web, and AI with me...
We'll start with this piece on the Google Books project: the hopes, dreams, disasters, and aftermath of building a public library on the internet.
1/n
14.11.2025 16:39
👍 77
🔁 26
💬 6
📌 2
The Case Against pgvector | Alex Jacobs
What happens when you try to run pgvector in production and discover all the things the blog posts conveniently forgot to mention
Everyone's rushing to pgvector for "simple" vector search in Postgres. This reality check shows what actually happens at scale: indexing nightmares and performance walls. Simple isn't always sustainable in production.
14.11.2025 02:01
👍 0
🔁 0
💬 0
📌 0
What we lose when we surrender care to algorithms | Eric Reinhart
A dangerous faith in AI is sweeping American healthcare – with consequences for the basis of society itself
When healthcare becomes algorithmic, what gets optimized out? This Guardian essay asks the hard question about AI spreading through diagnostics and therapy: are we trading care quality for efficiency without realizing the cost?
13.11.2025 04:16
👍 1
🔁 0
💬 0
📌 0
Most marketplaces have SKUs. Etsy has 100M+ unique items with no standard attributes. How do you build filters when one listing is a "porcelain sculpture that looks like a t-shirt" and dimensions live in random photo text? www.etsy.com/codeascraft/...
06.11.2025 03:37
👍 2
🔁 0
💬 0
📌 0