
Python on Databricks: notebooks, Databricks Connect & SQL connector

Hey community! Here’s a walkthrough of running PySpark in Databricks notebooks, connecting your local IDE with Databricks Connect, and executing SQL through the Databricks SQL connector.

Intro
Databricks provides an optimized Apache Spark compute engine with collaborative notebooks and production-ready pipelines. Python is the most widely used language on the platform, supporting ETL, ML, and analytics workloads.

Prerequisites

  • A Databricks workspace and a running cluster.
  • pip install databricks-sql-connector (for the SQL connector example).
  • pip install databricks-connect (optional, for remote execution; pin the client version to your cluster’s Databricks Runtime, e.g. 14.3 for DBR 14.3). A quick environment check follows this list.
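
Before diving in, here’s a small sanity check that the packages import and that credentials are in place. DATABRICKS_HOST and DATABRICKS_TOKEN are the standard environment variables Databricks tooling reads, and the SQL connector example later assumes DATABRICKS_TOKEN is set:

import os

# Confirm databricks-sql-connector is importable.
from databricks import sql  # noqa: F401

# Confirm the credentials the later examples rely on are present.
for var in ("DATABRICKS_HOST", "DATABRICKS_TOKEN"):
    assert os.getenv(var), f"{var} is not set"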

1) Databricks Notebooks — fastest path to productivity


# Read a CSV from DBFS. spark.read expects the dbfs:/ scheme here;
# the /dbfs/... mount path is only for local (non-Spark) file APIs.
df = spark.read.option("header", "true") \
               .csv("dbfs:/FileStore/tables/sample.csv")
display(df.limit(10))

# Count rows per country and sort descending by count.
from pyspark.sql.functions import col
agg = df.groupBy("country") \
        .count() \
        .orderBy(col("count").desc())
display(agg)
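
To persist the aggregate, here’s a minimal sketch that writes it out as a managed Delta table; the table name is a placeholder, so pick a schema you can write to:

# Save the per-country counts as a Delta table for downstream use.
# "my_schema.country_counts" is a hypothetical name.
agg.write.format("delta") \
   .mode("overwrite") \
   .saveAsTable("my_schema.country_counts")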

Learn more: Databricks Quick Start

2) Databricks Connect — local IDE, remote compute


from databricks.connect import DatabricksSession

# DatabricksSession is the Databricks Connect entry point; it picks up
# the workspace URL, token, and cluster ID from the DATABRICKS_*
# environment variables or your ~/.databrickscfg profile.
spark = DatabricksSession.builder.getOrCreate()
df = spark.read.csv("s3a://my-bucket/sample.csv", header=True)
print(df.count())
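
If you’d rather not rely on environment variables, the builder also accepts connection details explicitly. A sketch with placeholder values:

from databricks.connect import DatabricksSession

# Host, token, and cluster ID below are placeholders; substitute your
# workspace URL, a personal access token, and a running cluster's ID.
spark = DatabricksSession.builder.remote(
    host="https://adb-xxxxx.azuredatabricks.net",
    token="dapi-example-token",
    cluster_id="0123-456789-abcde",
).getOrCreate()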

Documentation: Databricks Connect

3) Databricks SQL Connector — programmatic SQL from Python


import os
from databricks import sql

# The hostname and HTTP path are placeholders taken from your cluster's
# JDBC/ODBC settings; read the access token from an environment variable
# rather than hard-coding it in source.
with sql.connect(
    server_hostname="adb-xxxxx.azuredatabricks.net",
    http_path="sql/protocolv1/o/123456789/0123-456789-abcde",
    access_token=os.environ["DATABRICKS_TOKEN"],
) as conn:
    with conn.cursor() as cur:
        cur.execute("SELECT COUNT(*) FROM my_schema.my_table")
        print(cur.fetchone())
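
The connector also supports parameterized queries, which avoids building SQL strings by hand. A minimal sketch, continuing inside the cursor block above (the table, column, and threshold are placeholders; connector 3.x supports :name markers natively, while older versions use %(name)s pyformat):

        # Bind values as named parameters instead of interpolating strings.
        cur.execute(
            "SELECT country, cnt FROM my_schema.country_counts "
            "WHERE cnt > :min_count",
            {"min_count": 100},
        )
        for row in cur.fetchall():
            print(row)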

Documentation: Databricks SQL Connector

Best practices

  • Match Databricks Connect client and cluster runtime versions.
  • Use Unity Catalog for governance of data access (a quick sketch follows this list).
  • Prefer Delta Lake for ACID transactions and scalable, performant storage.
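
As a quick illustration of Unity Catalog governance, a minimal sketch run from a notebook; the three-level table name and the group are hypothetical:

# Grant read access on a Unity Catalog table to a workspace group.
# "main.my_schema.country_counts" and "analysts" are placeholder names.
spark.sql("GRANT SELECT ON TABLE main.my_schema.country_counts TO `analysts`")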

Next steps

  • Build an ETL pipeline that reads from S3, transforms with PySpark, and writes to Delta Lake (a sketch follows this list).
  • Use Databricks Jobs to schedule production workflows.
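
As a starting point, here’s a hedged sketch of that pipeline. The bucket, paths, column names, and target table are placeholders, and it assumes the cluster already has S3 credentials configured:

from pyspark.sql.functions import col, upper

# Extract: read raw CSV from S3 (placeholder bucket and path).
raw = spark.read.csv("s3a://my-bucket/raw/orders.csv", header=True)

# Transform: normalize country codes and drop rows missing the key.
clean = (raw.withColumn("country", upper(col("country")))
            .dropna(subset=["order_id"]))

# Load: append to a Delta table; schedule this script with Databricks Jobs.
clean.write.format("delta").mode("append").saveAsTable("my_schema.orders_clean")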



By: puja.kumari