Hey community! Here are patterns for combining Snowflake (storage & compute) with Databricks (Spark-based engineering & ML). Example: Databricks processes raw events, writes Delta/Parquet to cloud storage, and Snowflake ingests via External Tables or Snowpipe.
Intro & Patterns
- Pattern A: Databricks performs ETL/ML → writes Delta/Parquet → Snowflake reads via External Tables or Snowpipe.
- Pattern B: Snowflake stores curated data + Snowpark transformations; Databricks reads via the Snowflake Connector or Delta Sharing (see the connector sketch below).
Reference: Snowflake Documentation
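For Pattern B, here is a minimal sketch of reading a curated Snowflake table from a Databricks notebook with the Spark Snowflake connector. The account URL, credentials, warehouse, database, schema, and table name are placeholders, and the secret scope is assumed to already exist.

# Pattern B sketch: Databricks reads a curated table from Snowflake.
# All connection values are placeholders; keep real credentials in a secret scope.
sf_options = {
    "sfUrl": "myaccount.snowflakecomputing.com",
    "sfUser": "etl_user",
    "sfPassword": dbutils.secrets.get("snowflake", "password"),
    "sfDatabase": "ANALYTICS",
    "sfSchema": "CURATED",
    "sfWarehouse": "TRANSFORM_WH",
}

events = (spark.read
          .format("snowflake")
          .options(**sf_options)
          .option("dbtable", "CURATED_EVENTS")
          .load())
events.show(5)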
Sample Pipeline (Step-by-Step)
1) Databricks processes raw data
# Read raw JSON events from object storage and drop records without an event type.
df = spark.read.json("s3a://raw-events/")
clean = df.filter(df.event_type.isNotNull())

# Write the curated result as a Delta table for downstream consumers.
clean.write.format("delta").mode("overwrite") \
    .save("s3a://curated/events_clean/")
2) Snowflake ingests curated data
CREATE OR REPLACE EXTERNAL TABLE curated_events (
  event_type STRING AS (VALUE:event_type::STRING),
  event_ts   TIMESTAMP AS (VALUE:event_ts::TIMESTAMP),
  user_id    STRING AS (VALUE:user_id::STRING)
)
WITH LOCATION = @my_stage/events_clean/
FILE_FORMAT = (TYPE = PARQUET);
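Note that external table metadata does not pick up new files automatically unless AUTO_REFRESH is configured with cloud event notifications. A minimal sketch of a manual refresh using the snowflake-connector-python package; the connection parameters are placeholders:

import snowflake.connector

# Placeholder credentials; use your existing secrets management in practice.
conn = snowflake.connector.connect(
    account="myaccount", user="etl_user", password="***",
    warehouse="TRANSFORM_WH", database="ANALYTICS", schema="CURATED",
)
# Register newly landed files with the external table's metadata.
conn.cursor().execute("ALTER EXTERNAL TABLE curated_events REFRESH")
conn.close()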
3) Optional: downstream analytics using Snowpark
from snowflake.snowpark import Session
from snowflake.snowpark.functions import count

# `config` is a dict of connection parameters (account, user, auth, warehouse, database, schema).
session = Session.builder.configs(config).create()

# Aggregate event counts by type from the external table defined in step 2.
df = session.table("curated_events")
summary = df.group_by("event_type").agg(count("*").alias("total"))
summary.show()
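If the summary should be persisted rather than just displayed, Snowpark can write it back to Snowflake; the table name below is just an example:

# Persist the aggregate for BI consumers; overwrite on each run.
summary.write.mode("overwrite").save_as_table("event_type_summary")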
Operational Tips
- Pick a single canonical storage layer, either Snowflake-managed tables or Delta on object storage, rather than maintaining duplicate copies of the same data in both.
- Use Delta Sharing or Snowflake Secure Data Sharing for cross-platform distribution (see the sketch after this list).
- Monitor egress costs when moving data across platforms.
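For the Delta Sharing route, here is a minimal consumer-side sketch with the delta-sharing Python client; the profile file and the share/schema/table coordinates are placeholders issued by the data provider:

import delta_sharing

# "<profile-file>#<share>.<schema>.<table>" identifies the shared table.
table_url = "config.share#my_share.curated.events_clean"
events_pdf = delta_sharing.load_as_pandas(table_url)
print(events_pdf.head())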
Next Steps
- Build a GitOps-style repo for managing Databricks notebooks and Snowpark code.
- Implement monitoring using Databricks Jobs Alerts and Snowflake Resource Monitors.
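On the Snowflake side, a resource monitor can alert on or cap warehouse credit usage. A hedged sketch via the Python connector; the quota, monitor name, and warehouse are illustrative, and the statements require a role allowed to manage resource monitors:

import snowflake.connector

# Placeholder connection; managing resource monitors typically needs ACCOUNTADMIN.
conn = snowflake.connector.connect(account="myaccount", user="admin_user",
                                    password="***", role="ACCOUNTADMIN")
cur = conn.cursor()
# Notify at 90% of a 100-credit monthly quota, suspend at 100%.
cur.execute(
    "CREATE OR REPLACE RESOURCE MONITOR pipeline_monitor "
    "WITH CREDIT_QUOTA = 100 FREQUENCY = MONTHLY START_TIMESTAMP = IMMEDIATELY "
    "TRIGGERS ON 90 PERCENT DO NOTIFY ON 100 PERCENT DO SUSPEND"
)
cur.execute("ALTER WAREHOUSE TRANSFORM_WH SET RESOURCE_MONITOR = pipeline_monitor")
conn.close()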