Bijil Subhash

@bijilsubhash

Data Engineer, Recovering Academic, and Entrepreneur | bijilsubhash.io | Sydney, Australia

102
Followers
232
Following
74
Posts
28.11.2024
Joined
Latest posts by Bijil Subhash @bijilsubhash

Databricks vs Fabric feels a lot like Pied Piper vs Nucleus. Fans of the Silicon Valley show will get the reference :)

#databs #databricks #fabric

08.03.2025 22:03 👍 2 🔁 0 💬 0 📌 0

(3/3) As a data engineer, I could have spent my time learning one of the shiny tools out there or relied on Copilot to write code. I chose to learn what was happening under the hood in Python, mainly in the interest of building a stronger foundation, which I believe should be the priority.

18.02.2025 16:00 👍 1 🔁 0 💬 0 📌 0

(2/3) I am no better, and I am guilty of this myself, mainly from my own ignorance of writing Python idiomatically. This is why I spent the last 6 months re-learning Python: consuming multiple books, attending advanced lessons on specific topics, and writing a #data ingestion package in Python.

18.02.2025 16:00 👍 1 🔁 0 💬 1 📌 0

(1/3) Among programming languages, I consider #Python to be a relatively easy-to-learn language, opening doors for many to start coding without formal training. However, this also results in some poorly written, unmaintainable, and non-extensible code: the infamous spaghetti code.

18.02.2025 16:00 👍 1 🔁 0 💬 1 📌 0
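
The spaghetti-code point in the thread above is easiest to see side by side. A minimal, hypothetical sketch (names and data are made up, not from any real project): the same aggregation written with manual index bookkeeping, then idiomatically with the stdlib.

```python
from collections import defaultdict

# Non-idiomatic: manual index bookkeeping and key-existence checks.
def totals_spaghetti(orders):
    result = {}
    for i in range(len(orders)):
        customer = orders[i]["customer"]
        if customer not in result:
            result[customer] = 0
        result[customer] = result[customer] + orders[i]["amount"]
    return result

# Idiomatic: iterate directly and let defaultdict carry the bookkeeping.
def totals_idiomatic(orders):
    result = defaultdict(int)
    for order in orders:
        result[order["customer"]] += order["amount"]
    return dict(result)

orders = [
    {"customer": "a", "amount": 10},
    {"customer": "b", "amount": 5},
    {"customer": "a", "amount": 3},
]
print(totals_idiomatic(orders))  # {'a': 13, 'b': 5}
```

Both functions return the same result; the second one is just easier to read, maintain, and extend.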

Just finished watching the webinar introducing SDF by the dbt team. After seeing SDF in action, I have to admit that I am really looking forward to the future of the dbt engine. I was wondering when dbt was going to bring notable changes to the developer experience, and this might be it.

#databs

01.02.2025 16:00 👍 3 🔁 1 💬 0 📌 0

I was speaking with someone who went all in on promoting duckdb to their clients. I did not get a chance to ask what exactly they are doing with duckdb, but I am curious to understand how duckdb is utilised in modern data pipelines.

21.01.2025 16:00 👍 1 🔁 0 💬 0 📌 0

(4/4) It is crazy to think that this news came out just 2 weeks into the new year. I cannot wait to see what dbt and others have in store for the next 12 months. I am excited to see how the integration between SDF and dbt will be rolled out. Maybe we will have some updates at Coalesce 2025!!!

18.01.2025 16:00 👍 0 🔁 0 💬 0 📌 0

(3/4) I have always found dbt's incredibly slow compile times deeply frustrating, though that is of no concern to business users. But considering that the early adoption of dbt is strongly rooted in the technical community, the acquisition of SDF to improve the developer experience is well timed.

18.01.2025 16:00 👍 0 🔁 0 💬 1 📌 0

(2/4) I have used dbt quite extensively, enjoy its utility when it comes to data transformations, and acknowledge that it is not going anywhere. However, I do feel that dbt Core has not had any significant upgrades in a while, especially when it comes to improving the developer experience.

18.01.2025 16:00 👍 0 🔁 0 💬 1 📌 0

(1/4) SDF acquisition by dbt

If you work in data, you have probably come across a version of this headline this past week. A small disclaimer: I have not used SDF, nor do I have a solid understanding of the tech that sits behind it, so take what I say with a grain of salt.

18.01.2025 16:00 👍 0 🔁 0 💬 1 📌 0

(3/3) I am sure this will change as we iterate through the next generation of language models. However, if you are someone who is just starting out in the world of #data, I recommend reducing your reliance on code assistants and instead spending some time understanding the basics of how SQL works.

16.01.2025 16:00 👍 0 🔁 0 💬 1 📌 0

(2/3) But if you have had to work with complex #SQL statements such as first-touch attribution, streak detection, or conversion calculations, with highly specific business logic and questionable data quality, you will probably understand the sentiment that today's LLMs are not powerful enough.

16.01.2025 16:00 👍 0 🔁 0 💬 1 📌 0

(1/3) Maybe an unpopular opinion: SQL is a powerful language and, despite what anyone says, it is unlikely to be replaced by an LLM, at least not with the models we have today. LLMs are powerful and can be leveraged to generate ideas or as a tool to unblock you when you are stuck.

#databs

16.01.2025 16:00 👍 2 🔁 0 💬 1 📌 0

The thing that allows you to link data from the physical world in a format that a machine can understand coherently.

16.01.2025 03:47 👍 1 🔁 0 💬 0 📌 0

What did they buy before?

15.01.2025 07:54 👍 0 🔁 0 💬 1 📌 0

SDF was on my list of things to try out. I'll just wait till they integrate it into dbt now, I guess :)

14.01.2025 20:45 👍 1 🔁 0 💬 1 📌 0

(3/3) Laktory also supports managing ETL pipelines, much like you would with dbt but with Spark and/or SQL. What I really like about Laktory is its ability to modularize Databricks assets, which is a big win when it comes to the long-term maintainability of your #data platform.

14.01.2025 16:00 👍 0 🔁 0 💬 0 📌 0

(2/3) I am not affiliated with Laktory, but if you work with Databricks and want a break from wrestling with Terraform/Pulumi, check out Laktory. It is an absolute game changer, and I was able to go from zero to managing multiple workspaces with a couple of YAML files.

14.01.2025 16:00 👍 0 🔁 0 💬 1 📌 0

(1/3) Continuing from my previous thread on infrastructure as code for managing #Databricks: I recently had the pleasure of working with an open source tool called Laktory, an abstraction that sits on top of Terraform/Pulumi to manage your Databricks workflows using YAML.

#databs

14.01.2025 16:00 👍 2 🔁 1 💬 1 📌 0

A default approach that I take when it comes to data modelling. It works because OBT is optimized for modern vectorized data warehouses, while the underlying data is modelled using established best practices from Kimball.

11.01.2025 20:38 👍 1 🔁 0 💬 0 📌 0
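
The Kimball-underneath, OBT-on-top approach can be sketched in a few lines of SQL. A toy example against in-memory SQLite (all table and column names are made up): a small star schema is kept as the modelled layer, and a one-big-table view of it is materialized for consumption.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Kimball-style star schema: one fact table, one dimension.
CREATE TABLE dim_customer (customer_key INTEGER PRIMARY KEY, name TEXT, region TEXT);
CREATE TABLE fct_sales (sale_id INTEGER, customer_key INTEGER, amount REAL);
INSERT INTO dim_customer VALUES (1, 'Acme', 'APAC'), (2, 'Globex', 'EMEA');
INSERT INTO fct_sales VALUES (10, 1, 99.0), (11, 2, 45.5), (12, 1, 12.5);

-- OBT layer: denormalize the star into one wide table,
-- which vectorized warehouses and BI tools scan cheaply.
CREATE TABLE obt_sales AS
SELECT f.sale_id, f.amount, c.name AS customer_name, c.region
FROM fct_sales f
JOIN dim_customer c USING (customer_key);
""")

rows = conn.execute(
    "SELECT region, SUM(amount) FROM obt_sales GROUP BY region ORDER BY region"
).fetchall()
print(rows)  # [('APAC', 111.5), ('EMEA', 45.5)]
```

Analysts query `obt_sales` directly with no joins, while conformed dimensions like `dim_customer` stay the single source of truth underneath.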

This is the approach I default to. Also, I always thought that this was the only way OBT was used. I guess not.

11.01.2025 20:34 👍 2 🔁 0 💬 2 📌 0

To an extent, I agree. Though I still prefer to use documentation alongside AI assistance to verify some of the output.

11.01.2025 20:23 👍 1 🔁 0 💬 0 📌 0

I doubt there is one course that covers them all. You can always pick which part of DE you want to learn and drill down on that first: ingestion, then transformation, orchestration, and so on.

11.01.2025 03:42 👍 0 🔁 0 💬 0 📌 0

Senior data engineer!! That's tricky; almost all of the content out there is generally catered towards early-stage DEs. But I have heard Zach Wilson's boot camp is pretty good for seniors.

Disclaimer - I have not attended the boot camp, so take my advice with a grain of salt.

10.01.2025 05:16 👍 0 🔁 0 💬 1 📌 0

(2/2) I am nowhere near mastering it, but I am happy that I was able to use it recently to build out a data platform on #Databricks, and it was not as challenging as I imagined. Key takeaway: being comfortable with IaC can dramatically improve the efficiency and reliability of your #data pipelines.

10.01.2025 01:44 👍 2 🔁 0 💬 0 📌 0

(1/2) Infrastructure as code (IaC) is ubiquitous in the data space. That being said, I have stayed away from doing any IaC work for as long as I can remember, mainly due to its aura of being difficult and also because I could pass the ball to the platform team.

#datasky #databs

10.01.2025 01:44 👍 2 🔁 0 💬 2 📌 0

(4/4) Some of the #saas products are amazing. In fact, I continue to use them where relevant. However, it is important to do your due diligence when picking your tech stack so that you do not miss out on something better. Your future self will be grateful!

07.01.2025 16:00 👍 0 🔁 0 💬 0 📌 0

(3/4) One of the key benefits of using dlthub is that you can run it anywhere with a #python runtime, e.g. Airflow, Dagster, serverless, etc. There is also the added flexibility to bring unique functionality that is specific to your data into the ingestion framework.

07.01.2025 16:00 👍 1 🔁 0 💬 1 📌 0

(2/4) This last year, I came across a python library called #dlthub, which I now use regularly for data ingestion. What I like about dlthub is how seamlessly it integrates with widely used sources and destinations, and how it lets you build production-ready data pipelines in just a few lines.

07.01.2025 16:00 👍 0 🔁 0 💬 1 📌 0
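
The "few lines" claim above rests on dlt building its sources out of plain Python generators. The sketch below is a stdlib-only illustration of that generator pattern, not dlt's actual API; the function names and data are hypothetical.

```python
# A "resource" is just a generator that yields records; paging,
# auth, or custom cleaning logic can live inside it.
def users_resource(pages):
    for page in pages:
        for record in page:
            yield record

# A toy stand-in for the pipeline: consume the resource and load
# records into a destination (here, a plain list).
def run_pipeline(resource, destination):
    for record in resource:
        destination.append({**record, "_loaded": True})

pages = [[{"id": 1}, {"id": 2}], [{"id": 3}]]
warehouse = []
run_pipeline(users_resource(pages), warehouse)
print(len(warehouse))  # 3
```

Because the resource is an ordinary generator, it runs anywhere with a Python runtime, which is exactly the portability argument made in the thread.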

(1/4) What do you use for #data ingestion?

It's true that there is no shortage of tools when it comes to data ingestion. But before you open your wallet to one of the many options out there, it might be worth doing thorough due diligence based on your current and future needs.

#databs

07.01.2025 16:00 👍 0 🔁 0 💬 1 📌 0