SightSpeak AI Blog


Integrating Snowflake and Databricks (patterns & a sample pipeline)

Hey community! Here are patterns for combining Snowflake (SQL warehousing and analytics) with Databricks (Spark-based data engineering and ML). Example: Databricks processes raw events, writes Delta/Parquet files to cloud storage, and Snowflake ingests them via External Tables or Snowpipe.

Intro & Patterns

  • Pattern A: Databricks performs ETL/ML → writes Delta/Parquet → Snowflake reads via External Tables or Snowpipe.
  • Pattern B: Snowflake stores curated data plus Snowpark transformations; Databricks reads it via the Snowflake Spark connector or Delta Sharing (see the connector sketch after this list).
    Reference: Snowflake Documentation
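
For Pattern B, here is a minimal sketch of Databricks reading a curated Snowflake table through the Snowflake Spark connector. The connection values, warehouse, and table name are placeholders, not part of the pipeline below.


# Read a Snowflake table from Databricks via the Snowflake Spark connector.
# All connection values are placeholders; assumes a running Spark session.
sf_options = {
    "sfURL": "<account>.snowflakecomputing.com",
    "sfUser": "<user>",
    "sfPassword": "<password>",
    "sfDatabase": "ANALYTICS",
    "sfSchema": "PUBLIC",
    "sfWarehouse": "COMPUTE_WH",
}

curated = (
    spark.read.format("snowflake")   # use "net.snowflake.spark.snowflake" outside Databricks
    .options(**sf_options)
    .option("dbtable", "CURATED_EVENTS")
    .load()
)
curated.show(5)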

Sample Pipeline (Step-by-Step)

1) Databricks processes raw data


# Read raw JSON events from cloud storage (bucket paths are illustrative)
df = spark.read.json("s3a://raw-events/")
# Keep only events with a non-null event_type
clean = df.filter(df.event_type.isNotNull())
# Write the curated result as a Delta table for downstream consumers
clean.write.format("delta").mode("overwrite") \
     .save("s3a://curated/events_clean/")

2) Snowflake ingests curated data


-- External table column definitions are expressions over the VALUE variant column
CREATE OR REPLACE EXTERNAL TABLE curated_events (
  event_type STRING    AS (value:event_type::STRING),
  event_ts   TIMESTAMP AS (value:event_ts::TIMESTAMP),
  user_id    STRING    AS (value:user_id::STRING)
)
WITH LOCATION = @my_stage/events_clean/
FILE_FORMAT = (TYPE = PARQUET);
-- If step 1 wrote a Delta table rather than plain Parquet, add TABLE_FORMAT = DELTA
-- so Snowflake reads only the files in the current Delta snapshot.
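
Pattern A's other ingestion path is Snowpipe. The sketch below copies the same staged Parquet files into a managed table with an auto-ingest pipe; the table, pipe, and stage names are illustrative, and AUTO_INGEST assumes cloud storage event notifications are configured for the stage.


-- Snowpipe alternative to the external table above (names are illustrative)
CREATE TABLE IF NOT EXISTS curated_events_managed (
  event_type STRING,
  event_ts   TIMESTAMP,
  user_id    STRING
);

CREATE OR REPLACE PIPE curated_events_pipe
  AUTO_INGEST = TRUE
AS
  COPY INTO curated_events_managed
  FROM (
    SELECT $1:event_type::STRING,
           $1:event_ts::TIMESTAMP,
           $1:user_id::STRING
    FROM @my_stage/events_clean/
  )
  FILE_FORMAT = (TYPE = PARQUET);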

3) Optional: downstream analytics using Snowpark


from snowflake.snowpark import Session
from snowflake.snowpark.functions import count

# Placeholder connection parameters; replace with your account details
config = {"account": "<account>", "user": "<user>", "password": "<password>",
          "warehouse": "COMPUTE_WH", "database": "ANALYTICS", "schema": "PUBLIC"}

session = Session.builder.configs(config).create()

# Count events per event_type, with the aggregation pushed down to Snowflake
df = session.table("curated_events")
summary = df.group_by("event_type").agg(count("*").alias("total"))
summary.show()

Operational Tips

  • Choose a single canonical storage layer (Snowflake tables or Delta on object storage) as the system of record; maintaining both in parallel invites drift and duplicate storage costs.
  • Use Delta Sharing or Snowflake Secure Data Sharing for cross-platform distribution (a Delta Sharing read sketch follows this list).
  • Monitor egress costs when moving data across platforms.
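
As a hedged illustration of the Delta Sharing route mentioned above, this snippet uses the open-source delta-sharing Python client to read a shared table; the profile file path and the share/schema/table coordinates are assumptions.


import delta_sharing

# Profile file supplied by the data provider (placeholder path)
profile = "/dbfs/FileStore/config.share"

# "<profile>#<share>.<schema>.<table>" -- coordinates are illustrative
table_url = profile + "#events_share.curated.events_clean"

# Load the shared table into pandas (use load_as_spark on a Databricks cluster)
events = delta_sharing.load_as_pandas(table_url)
print(events.head())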

Next Steps

  • Build a GitOps-style repo for managing Databricks notebooks and Snowpark code.
  • Implement monitoring using Databricks Jobs alerts and Snowflake Resource Monitors (a Resource Monitor sketch follows this list).
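
On the Snowflake side, a Resource Monitor can cap credit spend for the warehouse backing this pipeline. The quota, thresholds, and warehouse name below are illustrative, not recommendations.


-- Notify at 80% of the monthly credit quota, suspend the warehouse at 100%
CREATE OR REPLACE RESOURCE MONITOR pipeline_monitor
  WITH CREDIT_QUOTA = 100
       FREQUENCY = MONTHLY
       START_TIMESTAMP = IMMEDIATELY
  TRIGGERS
    ON 80 PERCENT DO NOTIFY
    ON 100 PERCENT DO SUSPEND;

-- Attach the monitor to the warehouse used by ingestion and Snowpark jobs
ALTER WAREHOUSE COMPUTE_WH SET RESOURCE_MONITOR = pipeline_monitor;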

Published: 5 days ago

By: puja.kumari