S2 is incredibly cool, and now you can run it yourself!
I built a tool called LinkedOut to solve the "links in the air" problem during my talks. It's a serverless data lake built on the @cloudflare.social Data Platform.
Ingest: Pipelines
Store: R2 + Apache Iceberg
Security: Access
Real-time analytics with zero egress fees. Full build video below.
It's so dumb and also so good, just a big dude doing good in the world
That's correct, the r2 sql project predated the arroyo acquisition, but we'll be converging over time. Also honored to be mentioned in the same post as the possible dbt acquisition! Ours was... a bit smaller.
Another brand new feature is the R2 data catalog: blog.cloudflare.com/cloudflare-d...
Build something with Pipelines and R2 SQL. I suggest receiving OpenTelemetry data and then surfacing that in a web app (logs should be fairly straightforward), but there are tons of uses for this.
It's early, but I'm excited about the direction that the Cloudflare Data Platform is taking. Trying to set up similar pipelines on other clouds would typically be $$$ and take tons of expertise: managing Kafka and multiple services for ingestion, compaction, etc. blog.cloudflare.com/cloudflare-d...
The news is finally out! Cloudflare has a Data Platform! We're starting with serverless streaming pipelines (powered by arroyo), a managed Iceberg Catalog, and a new distributed SQL engine built on top of DataFusion
Sequin (sequinstream.com) is doing this, but focused on Postgres
Oh wow, you're totally right. Saw the repo but didn't actually look inside.
Firebolt at least is source-available: github.com/firebolt-db/...
Reminder: San Francisco @ApacheDataFusio meetup tomorrow: lu.ma/uuxd443e
Cloudflare is at Snowflake Summit in San Francisco this week!
Swing by our booth 2605 to chat about the new Cloudflare R2 Data Catalog and how it can make your data management and analytics easier!
I want that! We have completely separate code for object store and local filesystem, even though the latter is really only used for testing and dev.
Absolutely! Parquet and iceberg support are coming, and we'll consider Ducklake support if it starts getting traction.
Next Monday after the Snowflake Summit keynote! Hang out on our beautiful roof with other cool data folks, and hear some great speakers from LanceDB, @mooncakelabs.bsky.social, Eventual, Marimo, Bobsled, and @cloudflare-dev.bsky.social!
lu.ma/dbq1hfij
Better CI is a tough business... Earthly couldn't make it work despite building a great product earthly.dev/blog/shuttin...
Ok, y'all. This took me several weeks and a ton of help from @frankmcsherry.bsky.social and @lalithsuresh.bsky.social. I dug into timely dataflow, differential dataflow, and DBSP to get you up to speed on IVM engines and materialized views. Enjoy!
I'm only a week into life at @cloudflare-dev.bsky.social but already amazed by how much of Cloudflare is built _on_ Cloudflare. I'd never have guessed you could get so far with just workers + durable objects!
Arroyo is joining @cloudflare.social! We're bringing Arroyo to the Developer Platform as a serverless stream processing system, and will also remain open-source and self-hostable. www.arroyo.dev/blog/arroyo-...
Couple of big announcements from @cloudflare.social today for folk in #dataBS:
* Acquisition of Arroyo, launch of Pipelines for streaming ingestion: blog.cloudflare.com/cloudflare-a...
* Launch of R2 Data Catalog, a managed Apache Iceberg catalog for R2: blog.cloudflare.com/r2-data-cata...
Arroyo 0.14.0 is now available, including new lookup joins, support for nested updating aggregates, struct types, new syntax, and a bunch of improvements and fixes: www.arroyo.dev/blog/arroyo-...
I know by month 2 we're all inured to this stuff, but this is a beyond crazy mix of incompetence and illegality www.theatlantic.com/politics/arc...
SCO didn't really turn evil, they were bought by Caldera which rebranded to SCO
With checkpoints slatedb is basically a streaming state backend in a box. Wish this had already existed when we started arroyo!
Amazing!
Arroyo is sitting at 3,999 stars... who's going to put us over the top github.com/ArroyoSystem...
I'll use a Python Jupyter notebook with DuckDB. You can convert results to a pandas dataframe then plot with matplotlib. ChatGPT is very good at writing the gluey Python bits.
You'd think that the key to being a fast streaming engine is like clever join algorithms, but it's mostly just being really good at JSON. Arroyo uses Arrow and the arrow-rs JSON decoder along with some streaming extensions. I think it's pretty cool, so I wrote up a long explanation of how it works
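To make the "being really good at JSON" point concrete, here is a toy sketch of the basic job: turning row-oriented, newline-delimited JSON into one list per column. This is not how Arroyo or the arrow-rs decoder is implemented; it is a pure-Python illustration, and `decode_batch` and its `schema` argument are hypothetical names.

```python
import json


def decode_batch(lines, schema):
    """Decode newline-delimited JSON records into one Python list per column.

    `schema` maps each column name to a callable used to coerce values
    (for example str or float). Missing or null fields become None.
    """
    columns = {name: [] for name in schema}
    for line in lines:
        if not line.strip():
            continue  # skip blank lines between records
        record = json.loads(line)
        for name, cast in schema.items():
            value = record.get(name)
            columns[name].append(None if value is None else cast(value))
    return columns


if __name__ == "__main__":
    raw = [
        '{"user": "a", "latency_ms": 12}',
        '{"user": "b"}',
    ]
    print(decode_batch(raw, {"user": str, "latency_ms": float}))
```

A real engine avoids building a per-record dictionary like `json.loads` does here, which is a large part of why JSON decoding ends up mattering so much for throughput.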
It combines a bunch of great services and tools to provide sub-minute-latency querying at a very low cost, including
* Redpanda serverless (Log storage)
* S3 (Object storage)
* Arroyo
* DuckDB
It went so well it felt worth documenting the process for other folks
Our team at Arroyo recently needed to rebuild our (very ad-hoc) analytics infra to account for our growth. We spent some time working out the best way to set up a near-real-time data lake today, and ended up with a pretty sweet approach we're calling the LOAD stack: www.arroyo.dev/blog/buildin...