Week 7 #DataEngineering Zoomcamp 🏎️
Streamed 4.4M records via #Redpanda & #PySpark on my James-T-850. Speed is nothing without logic!
Results: 📍
📏 Dist: 9506
🏙️ Zone: 74
⏳ Session: 31m
💰 Peak Tip: 10-16 18:00
Progress: github.com/CodingJhames...
#Streaming #Python #BigData #DataTalksClub
Module 6 complete: Batch Processing with Spark by
@datatalks.bsky.social
Spark correctness depends on details like partitioning and timestamp handling, not just on writing transformations
#DataEngineering #Spark #PySpark #BatchProcessing
Data Engineering Week 5: Done! 🏁
Pivoted to #AWS from GCP. ☁️
Ran #PySpark on a t3.micro (1GB RAM) using a 4GB Swapfile. Processed NYC Taxi data smoothly without crashes. 🧠
Adaptability > Tools. 🦾
Code: github.com/CodingJhames...
#DataEngineering #Spark #AWS #OpenSource
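For reference, a swapfile like the one described can be set up in a few commands — a sketch with hypothetical size and path; adjust to your instance (and note swap is far slower than RAM, so this is a survival tactic, not a tuning one):

```shell
# Create a 4 GB swapfile so Spark survives memory spikes on a 1 GB t3.micro.
sudo fallocate -l 4G /swapfile
sudo chmod 600 /swapfile     # swap files must not be world-readable
sudo mkswap /swapfile        # format it as swap space
sudo swapon /swapfile        # enable it for the running system
```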
How to Write PySpark Code That Prevents Broadcast Joins from Blowing Up Executors: defensive patterns to control broadcast size, memory pressure, and executor stability in PySpark jobs. Continue read...
#data-science #big-data #machine-learning #python #pyspark
Last week, we nailed an epic internal meetup: "From Pandas to PySpark: Thinking Spark-native" 🐼✨
More than code tweaks: a full mindset shift toward scalable PySpark workflows, with pro tips for distributed speed-ups!
These sessions level up our data game. 🚀
#DataEngineering #PySpark
Hello data scientists,
Here is a published article: LinkedIn Data Scientist PySpark (Hard Level) Interview Problem, solved in detailed steps.
#DataScientists #PySpark #DataEngineers #Data #Medium #Articles #DataEngineering
medium.com/meanlifestud...
Hello Data Engineers,
Here is a published article on an Amazon Data Engineer PySpark (Medium Level) interview question, with a full solution.
#dataengineers #pyspark #dataengineering #dataanalytics #bigdata #medium #articles
medium.com/meanlifestud...
📌 3 examples of how PySpark UDF performance improves significantly with #Arrow
⚙️ Arrow integration in #Spark 3.5 removes the costly serialization round-trip between the JVM and #Python, optimizing UDFs
➡️ blog.damavis.com/como-optimiz...
#BigData #PySpark
⚙️ UDFs let you customize the operations performed on your data
🐍 They are defined in #Python and applied to the columns of a DataFrame
⚠️ Using them can cause a performance hit if they are not properly optimized
➡️ blog.damavis.com/como-optimiz...
#PySpark #Arrow
🚀 New Lab Replay: Using Delta Tables in Apache Spark (Microsoft Fabric)
🎥 Watch the full session:
👉 www.youtube.com/live/gT21FS8...
#MicrosoftFabric #DeltaTables #ApacheSpark #DeltaLake #DP600 #DP700 #Lakehouse #DataEngineering #BigData #ACID #TimeTravel #SparkSQL #PySpark #MicrosoftLearn
🔥 New Lab Replay: Analyze Data with Apache Spark in Microsoft Fabric
🎥 Watch the full lab session:
👉 www.youtube.com/live/lsv2Oi8...
#MicrosoftFabric #ApacheSpark #SparkAnalytics #DP600 #DP700 #Lakehouse #PySpark #DeltaTables #BigData #DataEngineering #Analytics #FabricCommunity
La Experimental #14 is now available
🌐 #web trends
💻 Managing #Git hooks
🧑🏻‍💻 #TUI design with #GoLang
🐍 #Python without the GIL
💾 #PySpark SQL guide
🤖 Local #AI agent
🐧 #Linux security guide
🌩️ #SelfHosted monitoring
💼 #Tech job-market report from #manfred
Link: open.substack.com/pub/laexperi...
Building a Modern Data Platform to Track Kenya's Food Prices: A Data Engineering Case Study. Food price volatility has always been a sensitive issue across Kenya. From urban households in Nairo...
#spark #pyspark #grafana #dataengineering
Why isn't #Rust replacing #scala and #pyspark as the main functional language in #spark? Is there an alternative to #spark that is built in #rust?
What is the default engine used in Fabric Notebooks?
The default is PySpark, the Python API running on top of the Apache Spark engine.
#MicrosoftFabric #FabricNotebooks #PySpark #ApacheSpark #BigData #DataEngineering #PowerBI #DataPlatform #OneLake #FabricCommunity #DP700 #SparkEngine #DataProcessing
What languages can be used in Fabric Notebooks?
Microsoft Fabric Notebooks support:
🔹 PySpark
🔹 Spark (Scala)
🔹 SparkSQL
🔹 SparkR (R)
🔹 HTML
#MicrosoftFabric #FabricNotebooks #PySpark #SparkSQL #SparkR #Scala #BigData #DataEngineering #DataScience #OneLake #FabricCommunity #DataPlatform #DP700
📣 Missed the community meetup from July 17th, with Jared Kuehn and Ronen Ariely?
🚀 Dive into #PySpark in #Microsoft #Fabric with Jared Kuehn - a powerhouse speaker and veteran data engineer - as he demystifies how to work with PySpark in #MicrosoftFabric.
youtu.be/Y4Uxnj0CAeA?...
What are Fabric Notebooks best suited for?
They’re ideal for:
🔹 Handling large external datasets
🔹 Performing complex data transformations
🔹 Running custom code in languages like PySpark, SQL, or Scala
#MicrosoftFabric #FabricNotebooks #PySpark #BigData #DataTransformation #DataEngineering #PowerBI
📈 Monitor your metrics with #Spark and #Prometheus
1️⃣ Prerequisites
2️⃣ #Pyspark
3️⃣ JMX Exporter: what it is and how to configure it
4️⃣ Running Spark
5️⃣ Configuring Prometheus
➡️ blog.damavis.com/integracion-...
#ApacheSpark #BigData #DataEngineering
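The wiring behind steps 3️⃣–5️⃣ can be sketched roughly like this — paths, port, and file names are all hypothetical placeholders, not the blog's actual config:

```shell
# Attach the JMX Prometheus exporter agent to the Spark driver JVM so its
# JMX metrics are exposed as an HTTP endpoint (here on port 9091).
spark-submit \
  --conf "spark.driver.extraJavaOptions=-javaagent:/opt/jmx_prometheus_javaagent.jar=9091:/opt/spark-jmx-config.yaml" \
  job.py

# Then point Prometheus at that endpoint (snippet for prometheus.yml):
#   scrape_configs:
#     - job_name: spark
#       static_configs:
#         - targets: ["localhost:9091"]
```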
Unlock the power of #PySpark in #Microsoft #Fabric with Jared Kuehn!
Learn #Spark management, #Python tips, and boost #performance in this live event 🚀
🗓 July 17, 12 PM EDT
🎤 Hosted by Ronen Ariely @pitoach.bsky.social
👉 www.meetup.com/cloud-data-d...
#MicrosoftFabric #DataEngineering
🚀 Starting a new series: #PySpark + #AI
What happens when distributed computing meets intelligent automation?
I'm documenting hands-on work integrating PySpark with ML & LLMs (LangChain, Azure, etc).
Let's bridge Big Data + Smart Logic.
#DataScience #MLOps #LLM #BigData
PySpark: Read CSV like a pro
df = spark.read.csv("data.csv", header=True, inferSchema=True)
df.show(3)
✅ Auto schema
✅ Header as columns
✅ Ready to transform
Small win, big impact.
#PySpark #DataEngineer #BigData #xavierdatatech
🚀 Working with #PySpark in the cloud — juggling multiple #DataFrames in parallel.
🔍 Combining filter(), select(), and join() efficiently is teaching me how to optimize both loading and exploration on large datasets.
#BigData #Databricks #DataEngineering #ApacheSpark
🚀 Unlocking Big Data Potential with PySpark!
Key Features:
🔹 Spark SQL
🔹 Spark MLlib
🔹 Spark Streaming
🔹 DataFrame API
#PySpark #BigData #DataScience #ApacheSpark #MachineLearning #DataEngineering #XavierDataTech
Top 8 Data Visualization Libraries
#Python
#PySpark #SQL #BigData #Databricks #BusinessIntelligence #DataEngineering #PowerBI #DataAnalytics #SparkSQL #XavierDataTech
🚀 Working with PySpark SQL? Here's a quick and powerful example!
You can query DataFrames using SQL syntax in Spark — great for teams coming from SQL backgrounds.
#PySpark #BigData #SparkSQL #DataEngineering #ETL #ApacheSpark #SQL #DataScience #XavierDataTech
Supported chart types: scatter, line, bar, area, pie, histogram, box, and KDE — optimized for Spark performance with smart sampling.
#PySpark #BigData #AI #DataVisualization #Spark40 #DataScience #MLOps #XavierDataTech #Databricks
databricks.com/blog/pyspark-n…
🚀 PySpark in the Cloud:
💾 DataFrames · Delta Lake · Databricks
📊 Power BI Export · Semantic Layer via LangChain
🔁 Real-world pipelines, hands-on.
🔗 linkedin.com/in/xavier-mareca
#PySpark #BigData #DataEngineering #PowerBI #LangChain #Azure #AI