Helpful advice. Thanks, Claude.
I've posted my latest recap of the world of databases: www.cs.cmu.edu/~pavlo/blog/...
All the hot topics from the last year:
• More Postgres action!
• MCP for everyone!
• MongoDB gets litigious with FerretDB!
• File formats!
• Market movements!
• The richest person in the history of the world!
Is there anyone in my network with DuckDB skills who could review a PR that runs a Python script to compare the performance of DataFusion and DuckDB for some simple SQL queries?
github.com/apache/dataf...
There is a new Comet issue to discuss the future of Iceberg support and whether we should focus on using the iceberg-rust or Java implementation of Iceberg. Please add your thoughts if this is something that you care about!
github.com/apache/dataf...
On behalf of the DataFusion PMC, I'm excited to announce the release of version 0.11.0 of the Comet accelerator for Apache Spark!
datafusion.apache.org/blog/2025/10...
It's steak night tonight and our dog is patiently waiting for her share.
I like the name "RAD stack" for this.
Check out the latest release of the Comet accelerator for Apache Spark
datafusion.apache.org/blog/2025/09...
Introducing Iron Vector: native, columnar, vectorized, high-performance accelerator for Apache Flink SQL and Table API built on top of Rust, Arrow and DataFusion.
Reduce your Flink compute cost by up to 2x or handle 2x more data with the same infrastructure.
We received reports of a phishing campaign targeting crates.io users. Do not click on links asking to authenticate to protect your account. More information: blog.rust-lang.org/2025/09/12/c...
Thanks to @clflushopt.bsky.social, make massive TPCH datasets with tpchgen-cli 2.0:
SF1000 (1TB raw, 220GB in @ApacheParquet) in less than 10 mins (6m45s) on an aging laptop
Try it now:
pip install tpchgen-cli
tpchgen-cli --scale-factor 1000 --parts 100 --format=parquet
github.com/clflushopt/t...
I've been helping our analytics team integrate our DataFusion-based query engine for Postgres into EDB Postgres Distributed and finally here's an end-to-end demo.
You get HA Postgres plus seamless replication and DataFusion-based queries. This query turned out 6x faster than PG.
How my day is going
We now have a roadmap section in the Comet contributor guide, in case anyone was wondering what we are focusing on lately and what features will be arriving in future releases.
datafusion.apache.org/comet/contri...
Cassandra Team at Apple is searching for a fresh grad / person early in their career to join our ranks in SF/Bay Area!
Come work on super interesting problems with a world-class team. Help us build a better Cassandra!
Ping me if you're interested!
jobs.apple.com/en-us/detail...
It took me a really long time to understand the flow of execution between JVM and native code during query execution in Comet. I wish I had thought about adding a tracing capability earlier.
github.com/apache/dataf...
We're pleased to announce that Apache DataFusion in Python 46.0.0 is released! Since the last announcement post we've had a lot of great features and new contributors. Please check out the blog post with details.
datafusion.apache.org/blog/2025/03...
#DataFusion #Python #DataFrame #PyData #Apache
We have a position open in the Spark team at Apple, in our Cupertino, CA office. The role would include working on Apache DataFusion Comet.
jobs.apple.com/en-us/detail...
We have TPC-H benchmarks for single node with a small scale factor in the contributors guide. We only benchmark against Spark though and not against Spark RAPIDS.
datafusion.apache.org/comet/contri...
Here's the blog post announcing Comet 0.7.0
datafusion.apache.org/blog/2025/03...
I hate to say it, but "it depends". I'd recommend running your own benchmarks for your specific workloads. Performance will also vary greatly by environment (number of CPUs vs GPUs, different GPU types, and so on).
DataFusion Comet 0.7.0 is now available in Maven. We'll be publishing a blog post next week with all the details.
The repo has been updated with the latest benchmark results. For single-executor TPC-H @ 100 GB, we now see a 2.2x speedup over Spark (up from 2x in 0.6.0).
github.com/apache/dataf...
One month on, and I have zero regrets about quitting Facebook & Instagram.
I have replaced the scrolling time with listening to podcasts.
I now stay in touch with family overseas via email and photo sharing, and I use Snapchat for sharing photos with immediate family, privately. Works great.
Chris Riccomini (@chris.blue) shares his thoughts on Open Source foundations: Apache, CNCF, Commonhaus. He also explains why Commonhaus is a better fit for SlateDB
cnr.sh/posts/compar...
Comet 0.6.0 has been released. This is a smaller release than usual now that we have moved to an approximately monthly release cadence to match core DataFusion.
datafusion.apache.org/blog/2025/02...
Ballista 43.0.0 has been released, and now provides seamless integration with DataFusion.
datafusion.apache.org/blog/2025/02...
Check out this excellent presentation from @robtandy.bsky.social on his work with the DataFusion Ray project from last week's DataFusion community meetup.
It is a great overview of how to build a distributed system on top of DataFusion.
www.youtube.com/watch?v=ceTo...
This Week in DataFusion Comet (Jan 26):
github.com/apache/dataf...
Is this using Arrow and/or DataFusion? If so, our Discord is probably a good place to ask.
datafusion.apache.org/contributor-...
I've finally decided to quit using Facebook. My feed is overwhelmed with nonsense content that I am not interested in and cannot seem to block.
It is a real shame, though, because it was a good way to stay connected with family.
Is there a viable alternative? What are others using instead?