Hey community! Here’s a walkthrough on running PySpark in Databricks notebooks, connecting your local IDE using Databricks Connect, and executing SQL using the Databricks SQL connector.
Intro
Databricks provides an optimized Apache Spark compute engine with collaborative notebooks and production-ready pipelines. Python is the most widely used language on the platform, supporting ETL, ML, and analytics workloads.
Learn more:
Databricks Documentation
Prerequisites
- A Databricks workspace, a running cluster, and a personal access token.
- pip install databricks-sql-connector
- pip install databricks-connect (optional, for running code from a local IDE against remote compute)
1) Databricks Notebooks — fastest path to productivity
# Read a CSV uploaded to DBFS; Spark reads use the dbfs:/ scheme rather than the /dbfs local mount
df = (spark.read
      .option("header", "true")
      .csv("dbfs:/FileStore/tables/sample.csv"))
display(df.limit(10))

from pyspark.sql.functions import col

# Count rows per country, largest groups first
agg = (df.groupBy("country")
       .count()
       .orderBy(col("count").desc()))
display(agg)
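If you want to persist the aggregate instead of only displaying it, you can save it as a Delta table. A minimal sketch, assuming a schema named my_schema already exists and noting that the table name is a placeholder:

# Write the aggregated counts out as a Delta table (target name is hypothetical)
agg.write.format("delta").mode("overwrite").saveAsTable("my_schema.country_counts")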
Learn more:
Databricks Quick Start
2) Databricks Connect — local IDE, remote compute
from pyspark.sql import SparkSession

# Legacy Databricks Connect picks up the cluster details set via `databricks-connect configure`
spark = SparkSession.builder.getOrCreate()
# The read and count below execute on the remote cluster, not your laptop
df = spark.read.csv("s3a://my-bucket/sample.csv", header=True)
print(df.count())
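If you are on the newer, Spark Connect-based Databricks Connect (Databricks Runtime 13 and above), the session is built through DatabricksSession instead. A minimal sketch with placeholder hostname, token, and cluster ID; you can also let it read a Databricks CLI configuration profile:

from databricks.connect import DatabricksSession

# Placeholders only: swap in your workspace URL, token, and cluster ID
spark = DatabricksSession.builder.remote(
    host="https://adb-xxxxx.azuredatabricks.net",
    token="<personal-access-token>",
    cluster_id="0123-456789-abcde",
).getOrCreate()

The rest of the example (reading the CSV and counting rows) stays the same.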
Documentation:
Databricks Connect
3) Databricks SQL Connector — programmatic SQL from Python
import os
from databricks import sql

# Keep the personal access token out of source code, e.g. in an environment variable
with sql.connect(
    server_hostname="adb-xxxxx.azuredatabricks.net",
    http_path="sql/protocolv1/o/123456789/0123-456789-abcde",
    access_token=os.getenv("DATABRICKS_TOKEN"),
) as conn:
    with conn.cursor() as cur:
        cur.execute("SELECT COUNT(*) FROM my_schema.my_table")
        print(cur.fetchone())
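The cursor also handles full result sets, not just single values. A small sketch along the same lines, assuming my_table has a country column and the token sits in an environment variable:

import os
from databricks import sql

with sql.connect(
    server_hostname="adb-xxxxx.azuredatabricks.net",
    http_path="sql/protocolv1/o/123456789/0123-456789-abcde",
    access_token=os.getenv("DATABRICKS_TOKEN"),
) as conn:
    with conn.cursor() as cur:
        # fetchall() returns every row of the result
        cur.execute("SELECT country, COUNT(*) AS n FROM my_schema.my_table GROUP BY country")
        for row in cur.fetchall():
            print(row)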
Documentation:
Databricks SQL Connector
Best practices
- Match Databricks Connect client and cluster runtime versions.
- Use Unity Catalog for governance.
- Prefer Delta Lake for ACID transactions and scalable, performant storage.
Next steps
- Build an ETL pipeline that reads from S3, transforms the data with PySpark, and writes to Delta Lake (a short sketch follows after this list).
- Use Databricks Jobs to schedule production workflows.
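As a concrete starting point for the first item, here is a minimal end-to-end sketch. The S3 path, schema, and table name are hypothetical, and it assumes the cluster has credentials for the bucket:

from pyspark.sql import functions as F

# Extract: read raw CSV files from S3 (path is a placeholder)
raw = spark.read.csv("s3a://my-bucket/raw/events/", header=True, inferSchema=True)

# Transform: drop duplicate rows and stamp each record with its ingestion time
cleaned = raw.dropDuplicates().withColumn("ingested_at", F.current_timestamp())

# Load: append to a Delta table (target name is a placeholder)
cleaned.write.format("delta").mode("append").saveAsTable("my_schema.events")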
References