Anomaly Detection using Sigma Rules (Part 1): Leveraging Spark SQL Streaming
Sigma rules are used to detect anomalies in cyber security logs. We use Spark structured streaming to evaluate Sigma rules at scale.
Anomaly Detection using Sigma Rules (Part 2): Spark Stream-Stream Join
A class of Sigma rules detects temporal correlations. We evaluate the scalability of Spark's stateful symmetric stream-stream join to...
Anomaly Detection using Sigma Rules (Part 3): Temporal Correlation Using Bloom Filters
Can a tailor-made stateful mapping function based on Bloom filters outperform Spark's generic stream-stream join?
Hands-On Introduction to Delta Lake with (py)Spark
Concepts, theory, and functionalities of this modern data storage framework
Anomaly Detection using Sigma Rules (Part 4): Flux Capacitor Design
We implement a Spark structured streaming stateful mapping function to handle temporal proximity correlations in cyber security logs
Creating a Data Pipeline with Spark, Google Cloud Storage and BigQuery
On-premises and cloud working together to deliver a data product
Anomaly Detection using Sigma Rules (Part 5): Flux Capacitor Optimization
To boost performance, we implement a forgetful Bloom filter and a custom Spark state store provider
Anomaly Detection Using Sigma Rules: Build Your Own Spark Streaming Detections
Easily deploy Sigma rules in Spark streaming pipelines: a future-proof solution supporting the upcoming Sigma 2 specification
Optimizing Output File Size in Apache Spark
A Comprehensive Guide to Managing Partition, Repartition, and Coalesce Operations
Parallelising Python on Spark: Options for concurrency with Pandas
In my previous role, I spent some time working on an internal project to predict future disk storage space usage for our Managed Services customers across thousands of disks. Each disk is subject to its own usage patterns...
1.5 Years of Spark Knowledge in 8 Tips
My learnings from Databricks customer engagements
Unleashing the Power of SQL Analytical Window Functions: A Deep Dive into Fusing IPv4 Blocks
How to summarize a geolocation table by merging contiguous IPv4 network blocks
End-to-End Data Engineering System on Real Data with Kafka, Spark, Airflow, Postgres, and Docker
Building a practical data pipeline with Kafka, Spark, Airflow, Postgres, and Docker
Performant IPv4 Range Spark Joins
A practical guide to optimizing non-equi joins in Spark
4 Examples to Take Your PySpark Skills to the Next Level
Get comfortable with large-scale data processing using PySpark
Building a Semantic Book Search: Scale an Embedding Pipeline with Apache Spark and AWS EMR…
Using OpenAI's CLIP model to support natural-language search on a collection of 70k book covers
Delta Lake – Type widening
What is type widening and why does it matter?
Apache Hadoop and Apache Spark for Big Data Analysis
A complete guide to big data analysis using Apache Hadoop (HDFS) and the PySpark library in Python, applied to game reviews on the Steam gaming...
Feature Engineering for Time-Series Using PySpark on Databricks
Discover the potential of PySpark for time-series data: ingest, extract, and visualize data, accompanied by practical implementation code
Orchestrating a Dynamic Time-series Pipeline in Azure
Explore how to build, trigger, and parameterize a time-series data pipeline with ADF and Databricks, accompanied by a step-by-step tutorial