<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0" xmlns:media="http://search.yahoo.com/mrss/"><channel><title><![CDATA[Nightly Closures]]></title><description><![CDATA[var headline = (function () { alert('Good night'); })();]]></description><link>http://104.236.78.148/</link><generator>Ghost 0.11</generator><lastBuildDate>Fri, 27 Mar 2026 22:05:49 GMT</lastBuildDate><atom:link href="http://104.236.78.148/rss/" rel="self" type="application/rss+xml"/><ttl>60</ttl><item><title><![CDATA[This one simple trick lets you stream with dbt and Databricks]]></title><description><![CDATA[<p>Are you using <a href="https://www.databricks.com/">Databricks</a> and want to use <a href="https://docs.getdbt.com/">dbt</a>, but you're not sure how to do it with all of your big data? Like, dbt can't seem to stream, and you need to stream because anything else is too expensive or too much work or too flaky? </p>

<p>I've used this</p>]]></description><link>http://104.236.78.148/2025/11/19/this-one-simple-trick-lets-you-stream-with-dbt-and-databricks/</link><guid isPermaLink="false">a3e473cf-9d65-499a-b072-7380dd416b44</guid><dc:creator><![CDATA[Kyle Valade]]></dc:creator><pubDate>Wed, 19 Nov 2025 20:27:01 GMT</pubDate><content:encoded><![CDATA[<p>Are you using <a href="https://www.databricks.com/">Databricks</a> and want to use <a href="https://docs.getdbt.com/">dbt</a>, but you're not sure how to do it with all of your big data? Like, dbt can't seem to stream, and you need to stream because anything else is too expensive or too much work or too flaky? </p>

<p>I've used this pattern and seen it used across several teams of data people reading through petabytes-ish of data.</p>

<p>dbt doesn't natively support streaming, unless you use Databricks streaming tables or materialized views, which I don't really like to do because they have always failed me eventually. They start out great, but as soon as something happens that you want to debug, or you need to rename your table or something, then you're SOL. So I don't use them - I use plain old streaming. (No offense to any Databricks folks)</p>

<p>Some people might not realize that when your dbt python notebook code gets run, it's really just uploaded as a notebook. The default dbt function (<code>model()</code>) takes the result and feeds it into a table.</p>

<p>So, since it's just a notebook, you <em>can</em> stream...it's just a matter of getting dbt not to overwrite your existing table's data, plus handling checkpointing, failure alerts, timeouts, etc. And that turns out to be pretty simple. </p>

<p>So the one simple trick is....<del>hey, have you considered supplements?</del></p>

<p>Create an empty dataframe with the schema you want, and have <code>model()</code> return that empty dataframe so that NOTHING is appended to your table. </p>

<p>Use a <a href="https://docs.getdbt.com/reference/resource-configs/databricks-configs#python-submission-methods">workflow_job</a> because it breaks you out of a dbt sandbox and lets you use everything from Databricks jobs. Otherwise they jam you into these one-time runs. And because, shameless plug, I wrote most of it, though as an act of frustration, and with a bunch of help from Databricks (shoutout Ben).</p>

<p>Is that just a wall of text? I hope not. I'll add fun memes one day. Until then, check out this wall of code.</p>

<p>This is a simple example. It streams from a log file table and appends new entries to a new table. Substitute log files for telemetry or whatever else you might be interested in. I'm going to be honest - I haven't run this exact code, so let me know if anything is wrong.</p>

<pre><code># COMMAND ----------

# MAGIC %pip install Pillow

# Just an example for pip installing...

dbutils.library.restartPython()

# COMMAND ----------

import pyspark.sql.types as T  
import pyspark.sql.functions as F


def get_or_create_checkpoint_location(dbt):  
    """Create checkpoint location for streaming query."""
    create_volume_query = f"CREATE VOLUME IF NOT EXISTS {dbt.this.database}.{dbt.this.schema}.checkpoints"
    print("Create volume query", create_volume_query)
    spark.sql(create_volume_query)
    return f"/Volumes/{dbt.this.database}/{dbt.this.schema}/checkpoints/{dbt.this.identifier}"

def run_stream(dbt):  
    """Run streaming query to process new log files."""
    checkpoint_location = get_or_create_checkpoint_location(dbt)
    output_location = str(dbt.this)

    log_files_df = (spark.readStream
      .format('delta')
      .option('ignoreChanges', 'true')
      .option('ignoreDeletes', 'true')
      .table('my_catalog.schema.log_files')
    )

    write_stream = (
      log_files_df.writeStream.format("delta")
      .outputMode('append')
      .option("checkpointLocation", checkpoint_location)
      .option("mergeSchema", "true")
      .trigger(availableNow=True)
      .start(output_location)
    )
    # Block until the availableNow batch finishes, so the job doesn't
    # report success before the stream has actually processed anything
    write_stream.awaitTermination()

def model(dbt, session):  
    """Main entry point for dbt model."""
    dbt.config(
        materialized='incremental',
        submission_method='workflow_job'
    )

    # Define output schema
    output_schema = T.StructType([
        T.StructField('log_entry_id', T.StringType(), False),
        T.StructField('log_file_id', T.IntegerType(), False),
    ])

    # [] instead of emptyRDD(): simpler, and works where sparkContext isn't available
    df = spark.createDataFrame([], schema=output_schema)

    if not dbt.is_incremental:
        # Create table if it doesn't exist
        df.write.saveAsTable(str(dbt.this), mode='append')

    # Run streaming query
    run_stream(dbt)

    return df
</code></pre>

<p>The YAML is going to be something like:</p>

<pre><code>version: 2

models:  
  - name: int_device_logs

    config:
#      cluster_id: afsfs-1232819-dsfbkjbs1  # Use this when developing
      python_job_config:
        timeout_seconds: 3600
        email_notifications: { on_failure: ["me@example.com"] }
        max_retries: 2

        name: my_workflow_name

        # Override settings for your model's dbt task. For instance, you can
        # change the task key
        additional_task_settings: { "task_key": "my_dbt_task" }

        # Define tasks to run before/after the model
        # This example assumes you have already uploaded a notebook to /my_notebook_path to perform optimize and vacuum
        post_hook_tasks:
          [
            {
              "depends_on": [{ "task_key": "my_dbt_task" }],
              "task_key": "OPTIMIZE_AND_VACUUM",
              "notebook_task":
                { "notebook_path": "/my_notebook_path", "source": "WORKSPACE" },
            },
          ]

        # Simplified structure, rather than having to specify permission separately for each user
        grants:
          view: [{ "group_name": "marketing-team" }]
          run: [{ "user_name": "other_user@example.com" }]
          manage: []
      job_cluster_config:
        spark_version: "16.4.x-scala2.12"
        node_type_id: "c2-standard-4"
        runtime_engine: "STANDARD"
        data_security_mode: "SINGLE_USER"
        single_user_name: "aghodsi@databricks.com"
        autoscale: { "min_workers": 1, "max_workers": 1 }
</code></pre>

<p>You can override pretty much anything in a job here.</p>

<p>If you want to pip install a package in your notebook, use <code># MAGIC %pip install Pillow</code> or whatever. If you go with the regular <code>%pip install my-package</code>, the arbitrarily vindictive dbt python parser is going to get you.</p>

<p>There are some fine people at dbt, but my feelings on the company are for another post - especially now that they are safely with fivetran instead of (insolvently?) independent.</p>]]></content:encoded></item><item><title><![CDATA[Automating tedium with the ServiceNow API]]></title><description><![CDATA[<p>I should have named this blog Enterprise <expletive>. But big companies are going to use ServiceNow whether or not HR likes handling onboarding tickets in the side-project you wrote in beautiful, idiomatic Go. But that doesn't mean that <em>you</em> personally need to log in every time a ticket comes your way.</expletive></p>]]></description><link>http://104.236.78.148/2024/01/12/automating-tedium-with-the-servicenow-api/</link><guid isPermaLink="false">ef46c63b-1e30-4238-9442-eff99c369796</guid><dc:creator><![CDATA[Kyle Valade]]></dc:creator><pubDate>Fri, 12 Jan 2024 17:25:07 GMT</pubDate><content:encoded><![CDATA[<p>I should have named this blog Enterprise <expletive>. But big companies are going to use ServiceNow whether or not HR likes handling onboarding tickets in the side-project you wrote in beautiful, idiomatic Go. But that doesn't mean that <em>you</em> personally need to log in every time a ticket comes your way.</expletive></p>

<p>Say you want to automate whatever happens when someone files a ServiceNow ticket to, say, get access to your pristine database - surely there's an API to facilitate that? Yes, there is. The API doesn't seem to be super well-documented, or if it is, that documentation is buried under layers of other ServiceNow jargon.</p>

<p>The API is organized kind of like a relational database, so it's very flexible, maybe to a fault. But it's not so bad once you get into it. There are a lot of undocumented attributes that can be accessed by dot-walking your way through real or imagined properties.</p>

<p>First, there is a "REST API Explorer" that you'll need an admin to grant you access to. It would be at a URL like this <a href="https://your-company.service-now.com/now/nav/ui/classic/params/target/%24restapi.do">https://your-company.service-now.com/now/nav/ui/classic/params/target/%24restapi.do</a>. You'll also want an API user + password...unfortunately there doesn't seem to be much in terms of permissions. An API user is able to do pretty much anything.</p>

<p>In the REST Explorer, you can explore all of the tables and make some sample requests. Paired with the Task List UI (where you can search for all tasks in ServiceNow), you can build the search params you want, then right-click the breadcrumbs and copy the params to use in the <code>sysparm_query</code> arg below.</p>

<p>My flow:</p>

<ul>
<li>Get all open tickets
<ul><li><code>/sc_task</code> to get the Catalog tasks (ie - SCTASK00123)</li>
<li>Can filter by category item - for example <code>?sysparm_query=assignment_group.name=Data Team Requests</code></li>
<li>Using the <code>sysparm_fields</code> param, make sure to also pull the <code>request_item.sys_id</code> because we'll need that below</li></ul></li>
<li>For each ticket, we need to populate the values for any custom created fields
<ul><li>This is where the real dot-walking comes in. Hit <code>/sc_item_option_mtom</code> with <code>sysparm_query=request_item.sys_id=&lt;request item id&gt;</code> and select the following fields:</li></ul></li>
</ul>

<pre><code>"sys_id",
"sc_item_option.value",
"sc_item_option.sys_id",
"sc_item_option.item_option_new.sys_id",
"sc_item_option.item_option_new.sys_name",
"sc_item_option.item_option_new.question_text",
"sc_item_option.item_option_new.question.value"
</code></pre>

<ul>
<li>For each of <em>those</em> fields, if it's a multiple choice question, the API will only return a sys_id as the response. That means you need to hit another endpoint to get the actual value. My implementation is pretty quick and dirty, so I'm just doing a nested loop, but I'm sure you could pre-populate these values if you cared.
<ul><li>Hit the <code>/question_choice</code> endpoint with the <code>sysparm_query=sys_id=&lt;that sys_id&gt;</code></li>
<li>Can combine multiple ids with an <code>OR</code> like <code>sysparm_query=sys_id=123ORsys_id=456</code></li></ul></li>
<li>PATCH the ticket to close with a comment</li>
</ul>

<pre><code>PATCH /sc_task/&lt;sys_id&gt;  
{
    "state": "3",
    "comments": "G'day"
}
</code></pre>

<p>Trying to specifically add a close note with <code>close_notes</code> didn't work but I figure it's close enough.</p>
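<p>Stitched together, the flow above might look roughly like this. Treat it as an untested sketch: the instance URL, credentials, group name, field list, and state value are all placeholders for whatever your instance uses.</p>

```python
# A rough sketch of the ticket-automation flow above, using requests.
# Everything here is hypothetical: instance URL, credentials, group name.
import requests

BASE = "https://your-company.service-now.com/api/now/table"
AUTH = ("api_user", "api_password")  # basic-auth API user


def or_ids(field, ids):
    """Combine multiple ids with OR, per the syntax above."""
    return "OR".join(f"{field}={i}" for i in ids)


def get_open_tasks():
    """Step 1: open catalog tasks for the team, plus the request item id."""
    params = {
        "sysparm_query": "assignment_group.name=Data Team Requests",
        "sysparm_fields": "sys_id,number,request_item.sys_id",
    }
    resp = requests.get(f"{BASE}/sc_task", auth=AUTH, params=params)
    resp.raise_for_status()
    return resp.json()["result"]


def get_custom_fields(request_item_id):
    """Step 2: dot-walk the m2m table for the custom field values."""
    params = {
        "sysparm_query": f"request_item.sys_id={request_item_id}",
        "sysparm_fields": ",".join([
            "sc_item_option.value",
            "sc_item_option.item_option_new.question_text",
        ]),
    }
    resp = requests.get(f"{BASE}/sc_item_option_mtom", auth=AUTH, params=params)
    resp.raise_for_status()
    return resp.json()["result"]


def close_task(task_sys_id, comment):
    """Final step: PATCH the task closed with a comment."""
    resp = requests.patch(
        f"{BASE}/sc_task/{task_sys_id}",
        auth=AUTH,
        json={"state": "3", "comments": comment},
    )
    resp.raise_for_status()
```

<p>The loop over tickets and the sys_id lookup against <code>/question_choice</code> work the same way as <code>get_custom_fields</code> - swap the table name and query.</p>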

<p>There - not so bad! There have surely been worse APIs in the history of the universe.</p>]]></content:encoded></item><item><title><![CDATA[Eternities of Gitlab CI pipeline trial + error...only run with changes]]></title><description><![CDATA[<p>I've spent many mind-numbing hours trying to get this Gitlab CI pipeline to work the way I want it to work. I'm not sure if I'm doing something wrong, or if GitlabCI is doing something wrong, but I'm becoming more and more convinced that it's GitlabCI.</p>

<p>Essentially, we have many</p>]]></description><link>http://104.236.78.148/2023/04/03/eternities-of-gitlab-ci-pipeline-changes-only-change-when-theres-a-change-in-a-certain-dir/</link><guid isPermaLink="false">c30ede3c-af90-4a0a-bea8-e971d1426536</guid><dc:creator><![CDATA[Kyle Valade]]></dc:creator><pubDate>Mon, 03 Apr 2023 23:22:30 GMT</pubDate><content:encoded><![CDATA[<p>I've spent many mind-numbing hours trying to get this Gitlab CI pipeline to work the way I want it to work. I'm not sure if I'm doing something wrong, or if GitlabCI is doing something wrong, but I'm becoming more and more convinced that it's GitlabCI.</p>

<p>Essentially, we have many different projects in our repo, separated by dirs. If nothing in a project changes, I don't want it to go through the CI process because it's a waste of time.</p>

<p>Luckily, GitlabCI solves for this...right...? I mean, yeah, you can define variables so they should work everywhere like you'd expect, right?</p>

<p>I started with something like this. Imagine like 10 of the <code>one-project--plan</code> jobs, all for different projects. I wanted to break out the <code>only:changes</code> part so that I don't have to repeat it everywhere.</p>

<pre><code>one-project--plan:  
  stage: terraform-plan
  extends:
    - .install-cli-and-assume-role
    - .terraform-plan-only
  variables:
    TF_DIR: path/to/my/project
    WORKSPACE_NAME: my-tf-project-workspace-one

.terraform-plan-only:
  script:
    - cd "${TF_DIR}"
    - terraform init
    - terraform plan
  except:
    - master
  only:
    changes:
      - '${TF_DIR}/**/*'
</code></pre>

<p>Great - that makes sense, right? Sure, but it doesn't work. Hardcode the path, though, and it works. Ok, that's weird, but at least we have a baseline to work from.</p>

<p>A <a href="https://gitlab.com/gitlab-org/gitlab/-/issues/8177#note_113091110">little digging</a> shows that <code>only</code> is compiled at a different time than the rest of it, so you can't use variables in <code>only/except</code>. The recommended approach is to switch to using <code>rules</code>. Rules are fine. Not as nice-looking as only/except, and the logic is OR for some reason, but whatever.</p>

<p>Next iteration: </p>

<pre><code>.terraform-plan-only:
  script:
    - cd "${TF_DIR}"
    - terraform init
    - terraform plan
  rules:
    - if: '$CI_PIPELINE_SOURCE == "merge_request_event"'
      changes:
        - $TF_DIR/**/*
</code></pre>

<p>Imagine ~100 commits where I mess with various ways to quote that variable.</p>

<p><a href="https://docs.gitlab.com/ee/ci/jobs/job_control.html#variables-in-ruleschanges">This link in the docs</a> shows something disconcertingly similar:</p>

<pre><code>docker build:  
  variables:
    DOCKERFILES_DIR: 'path/to/files'
  script: docker build -t my-image:$CI_COMMIT_REF_SLUG .
  rules:
    - changes:
        - $DOCKERFILES_DIR/*
</code></pre>

<p>There are also threads like <a href="https://gitlab.com/gitlab-org/gitlab/-/issues/338312">this one</a> in the Gitlab repo showing that something like this is supposed to work.</p>

<p>But it doesn't work.</p>

<p>I'm guessing it doesn't work because it's an inherited configuration entry and because <code>rules:changes</code> seems to be a sort of special case for Gitlab CI. Part of which means that it can't use variables that are defined on a job, only more global variables. But who knows. I'm just going to repeat that segment for every different project, which is...fine. Just fine.</p>

<h5 id="tldr">TL;DR</h5>

<p>This works:</p>

<pre><code>variables:  
  MY_PROJECT: path/to/project

.terraform-plan-only:
  script:
    - cd "${TF_DIR}"
    - terraform init
    - terraform plan
  variables:
    TF_DIR: $MY_PROJECT
    WORKSPACE_NAME: my-project
  rules:
    - if: '$CI_PIPELINE_SOURCE == "merge_request_event"'
      changes:
        - $MY_PROJECT/**/*
</code></pre>

<p>but this <strong>doesn't</strong> work:</p>

<pre><code>variables:  
  MY_PROJECT: path/to/project

.terraform-plan-only:
  script:
    - cd "${TF_DIR}"
    - terraform init
    - terraform plan
  variables:
    TF_DIR: $MY_PROJECT
    WORKSPACE_NAME: my-project
  rules:
    - if: '$CI_PIPELINE_SOURCE == "merge_request_event"'
      changes:
        - $TF_DIR/**/*
</code></pre>

<p>I hope you ran across this when there's still some of your day to get back.</p>]]></content:encoded></item><item><title><![CDATA[Jest - mock a single function in a module]]></title><description><![CDATA[<p>Though the <a href="https://jestjs.io/docs/jest-object#jestrequireactualmodulename">requireActual</a> route looked promising, it only worked when I called the mocked function directly in my test, and not when used like in real life.</p>

<p><a href="https://stackoverflow.com/questions/45111198/how-to-mock-functions-in-the-same-module-using-jest">This answer</a> on StackOverflow ended up helping, as did <a href="https://github.com/facebook/jest/issues/936#issuecomment-545080082">this discussion</a> on GitHub.</p>

<p>It's definitely not beautiful. Almost makes me miss python mocks.</p>]]></description><link>http://104.236.78.148/2022/12/14/jest-mock-a-single-function-in-a-module/</link><guid isPermaLink="false">67ac46b4-d879-4345-87d2-e2b61a25a483</guid><dc:creator><![CDATA[Kyle Valade]]></dc:creator><pubDate>Wed, 14 Dec 2022 23:48:07 GMT</pubDate><content:encoded><![CDATA[<p>Though the <a href="https://jestjs.io/docs/jest-object#jestrequireactualmodulename">requireActual</a> route looked promising, it only worked when I called the mocked function directly in my test, and not when used like in real life.</p>

<p><a href="https://stackoverflow.com/questions/45111198/how-to-mock-functions-in-the-same-module-using-jest">This answer</a> on StackOverflow ended up helping, as did <a href="https://github.com/facebook/jest/issues/936#issuecomment-545080082">this discussion</a> on GitHub.</p>

<p>It's definitely not beautiful. Almost makes me miss python mocks.</p>

<p>Here's an example cribbed from that SO:</p>

<pre><code>// myFunctions.ts
// Import the module into itself so that calls go through the module
// object - that's the indirection jest.spyOn can patch
import * as thisModule from './myFunctions';

export function world() {
    return 'world';
}

export function hello() {
    // calling world() directly here would bypass the mock
    return `hello ${thisModule.world()}`;
}
</code></pre>

<pre><code>// myFunctions.spec.ts
import * as theModule from './myFunctions';

describe('function mock test', () =&gt; {  
  it('should change that function output', async () =&gt; {
    jest.spyOn(theModule, 'world').mockReturnValue('NOT WORLD!')
    ...
  });
});
</code></pre>]]></content:encoded></item><item><title><![CDATA[MLOps, CI, and development flows using Databricks]]></title><description><![CDATA[<p>I wrote this about a year or two ago while I was still working a lot with Databricks.  It was right around the time where "MLOps" and "ML Engineering" were starting to get kind of hip and posting then probably would have been a better idea 🤪. </p>

<p>I was very frustrated</p>]]></description><link>http://104.236.78.148/2021/02/04/mlops-development-flows-using-databricks/</link><guid isPermaLink="false">fedbbc2e-2677-433c-844c-74497de0c9fa</guid><dc:creator><![CDATA[Kyle Valade]]></dc:creator><pubDate>Thu, 04 Feb 2021 16:08:10 GMT</pubDate><content:encoded><![CDATA[<p>I wrote this about a year or two ago while I was still working a lot with Databricks. It was right around the time when "MLOps" and "ML Engineering" were starting to get kind of hip and posting then probably would have been a better idea 🤪. </p>

<p>I was very frustrated by the machine learning workflow vs. what I am used to while developing web apps - it honestly felt like stepping back in time about 10-15 years. Like, you know, when you're learning to program and before you know about source control or automated testing. Except much slower because there is a ton of data that the ML training has to slog through. And the data science code that I've seen, while genius in its way, hasn't been written with maintainability in mind. Anyways, the good folks at Databricks pointed me to some resources and talked me through what they often saw, and I also rolled up my own sleeves. Hopefully this is still relevant for you.</p>

<hr>

<p>Working with notebooks is a lot different than your standard dev environment, and the notebooks themselves make it easy to get lazy and have a bunch of spaghetti code lying around. Plus, you have to figure out a whole different way to work, integrate with source control, integrate with others, and run your automated tests.</p>

<p>We’ve gone through a couple of iterations for organizing projects, and have talked to Databricks about how they structure their own. This is what we’ve come up with and what we’ve been using for our ML project. We're pretty happy with it so far - it’s not much different than a lot of Git-based development workflows + it has scaled well beyond one developer + across multiple different environments.</p>

<p>This is intended as a framework for future machine learning projects. No doubt there is room for improvement, so feel free to experiment and report back.</p>

<h3 id="workflow">Workflow</h3>

<p>In terms of a workflow, I’m a fan of something similar to <a href="https://guides.github.com/introduction/flow/">GitHub Flow</a> (or there are a million variants).</p>

<p>The flow goes roughly like:</p>

<ul>
<li>Open a branch per change</li>
<li>Commit your code into that branch</li>
<li>Open a pull request in GitHub</li>
<li>Go through some code review</li>
<li>Make changes (or not)</li>
<li>Merge</li>
<li>Deploy</li>
<li>Repeat</li>
</ul>

<h3 id="structure">Structure</h3>

<p>Here is our ideal directory structure for the project.</p>

<pre><code>|ProjectName/  -- the project root
----| azure-pipelines.yml
----| requirements.txt
----| README.md
----| .gitignore
----| scripts/
----| tests/
----| forecasting
--------| __init__.py
--------| conf/
--------| ModelTraining.py
--------| set_environment.py
</code></pre>

<p><img src="https://lh6.googleusercontent.com/lAR8j4-pDzg1yNARe_QVjTuLIkF1VfH6Rixo1ELOeb6Wn2fexcqIKPgoj-8bp6QFGaPbP_-BMsML-kkDAfYQVVsxxyldCUn6JHfZKJxfJxnZqhq1Ik1T9_gwvaDkeJEJb5PvJ-Ei" alt="img"></p>

<p>A lot of these are explained in the Databricks section, so I’ll just go over the rest of them here:</p>

<ul>
<li><strong>azure-pipelines.yml</strong> - this is the config for our CI system (Azure Pipelines)</li>
<li><strong>requirements.txt</strong> is your standard python file for listing your project’s requirements. It’ll be used by your package manager.</li>
<li><strong>README.md</strong> - standard git repo readme. .md is a markdown file</li>
<li><strong>.gitignore</strong> - all of the files you don’t want to store in source control</li>
<li><strong>tests</strong> - the root for all of your tests. There are other places to put this, and it’s mostly a matter of preference. 
<ul><li>The way everything else is set up, if you put it under <code>forecasting</code>, then you'll send it up to Databricks all the time, so I kept it outside.</li></ul></li>
<li><strong>scripts</strong> - if you have any helpful scripts for your project (like deploying your code to Databricks (see script in appendix)), this is the place for those</li>
<li><strong>forecasting</strong> - the top level python module. It contains all of the notebooks. Name it something that will make sense when you're importing it in your code.</li>
</ul>

<p>There's some room for debate here...for this project, we kept the machine learning library separate from this code, since we think it'll have some broader use. But if you don't think that's the case, you might want to rename <code>forecasting</code> to be something like <code>forecasting_notebooks</code>, and then your machine learning library source would be a separate dir under root. </p>

<h2 id="databricks">Databricks</h2>

<h3 id="workflow">Workflow</h3>

<p>Databricks is great, but we’ve found it’s better to pull a lot of your code out into regular python packages and import them into your notebook. That’s better because:</p>

<ul>
<li>You can write + run automated tests locally</li>
<li>Refactoring is much easier in a real IDE</li>
<li>We had a lot of trouble pickling classes that were defined in notebooks - so you’ll <em>need</em> to do this for a lot of machine learning</li>
<li>Fiddling with the Databricks-git integration is a bit of a pain</li>
<li>It’s easier to treat your Databricks environment as disposable + see Git as the source of truth. 
<ul><li>Not sure if the Databricks code is out of date? Just overwrite it with your local copy</li></ul></li>
</ul>

<p>This is how we import our package at the top of our databricks notebook:</p>

<pre><code>dbutils.library.install("dbfs:/libraries/StructuralChangeDetectionModel-0.4811.2-py3.7.egg")  
dbutils.library.restartPython()  
</code></pre>

<h3 id="databricksdirectorystructure">Databricks directory structure</h3>

<p>Looking at the directory structure from within Databricks, it looks like this. It mostly mirrors the structure of your project on GitHub except for the autogenerated CI directory, and where your local code is written.</p>

<p><img src="https://lh3.googleusercontent.com/JXbbR7ZCayP2viHJyThWUb8BLiIHlC48S1H0POqkKq-A3QePZMfZGafK65emOzDJlQOkp54vfRLt-AIIhaGhO_sW7mZGS83_HVHUnQcF-5omAjEsI2Mt_3JNQifE9e4UMdzCnQUs" alt="img"></p>

<pre><code>|ProjectName/ - should be accessible by all people in the project - ie, not in your personal folder
----| master/ - the production copy of your project
--------| conf/
------------| production_config.py
------------| ci_config.py
------------| kyle_config.py
------------| joe_config.py
------------| ci_set_environment.py
--------| set_environment.py
--------| Model training notebook.py
----| CI/ - where your CI script will store all of its builds - these are autogenerated
--------| 49-master/  - the build number + the name of the branch being built, for example
--------| 37-BI-5613/ 
----| libraries/
----| BI-4810/ - not part of source control; feature/dev/bugfix branch
----| BI-5613/ - not part of source control; feature/dev/bugfix branch
</code></pre>

<p>To explain…</p>

<ul>
<li><strong>ProjectName</strong> - Your Databricks project root. Obvs name this something that makes sense for your project. 
<ul><li>Contains all of your project files - nothing for your project should be outside of this (in Databricks)</li></ul></li>
<li><strong>master</strong> - contains the project files that should be used for production. Mirrors the directory in GitHub where your code lives</li>
<li><strong>conf</strong>  - contains files that define the constants that you’ll be using for your project on a per-environment (and/or per-developer) basis. Things are split up so that it’s easy to switch between environments.
<ul><li>The files in here shouldn't contain a bunch of util functions or anything - those can be defined in a real notebook or in your library. This should pretty much just be constants - especially the ones that change between prod and dev</li></ul></li>
<li><strong>production_config</strong> - the base config. I prefer when everything inherits from production because you can see exactly what changed and there’s not as much duplication. But other people like doing a base_config and then branching off from there. Either way, this will contain all of the constants you need for your prod environment.</li>
<li><strong>ci_set_environment</strong> - this is just used by the CI script - it sets the environment for the CI job</li>
<li><strong>ci_config / kyle_config / joe_config</strong> - Environment-specific settings. Ie - the CI environment vs Kyle’s dev environment vs Joe’s dev environment. You can set the delta files to your own sandbox, instead of overwriting prod, etc.</li>
<li><strong>set_environment</strong> - this is a one-line notebook that specifies which conf file to use</li>
<li><strong>BI-4810 + BI-5613 + etc…</strong> - These aren't stored in source control. When you send your local code up to Databricks, it would write everything into these directories (see the script at the very bottom of this post). These are just examples of feature branch/bug fix folders - feel free to name them whatever you want. I personally like using the issue number.
<ul><li>It’s basically the same as master except some of the files will be different, because you’re adding a feature or fixing bugs.</li>
<li>Another option is to use separate folders for each of your developers (ie. a KyleDev + JoeDev folder). This is actually what Databricks suggested, but I don't really like the option because... 
<ul><li>I found that it doesn’t work super well if multiple people are working on the same issue. For instance, if you have a data engineer + a data scientist. </li>
<li>Having separate folders per branch mirrors GitHub flow closer, which is nice. </li>
<li>On long-lived branches, you’re more likely to diverge from master and fall into bad habits.</li></ul></li></ul></li>
<li><strong>CI</strong> - the folders in here will be automatically created by your CI script, if you have that set up. It comes in handy if you’re trying to fix an issue that caused integration tests to fail.</li>
<li><strong>libraries</strong> - contains dependencies that need to be added to your cluster - should contain the versions used for master. 
<ul><li>If there are any changes - if you’re upgrading one of the libraries and it’s troublesome, for instance - then add a libraries subdirectory to your feature branch folder so that you don’t crash production.</li></ul></li>
</ul>

<h4 id="set_environment">set_environment</h4>

<p>Here it is in its entirety</p>

<p><code>%run ./conf/ci_config</code></p>

<p>And replace <code>ci_config</code> with whichever config you want to be using.</p>

<h4 id="set_environmentvsconf">set_environment vs conf</h4>

<p>The reason that we need a <code>set_environment</code> file is so that it's easy to switch between environments (prod and dev) without having to edit a real notebook. It decouples the environment from the code.</p>

<p>For example, what if we didn't have a <code>set_environment</code> file, and just specified whichever config file at the top of our Model Training notebook? Then every time we made a change, we'd have to remember to change our conf file reference first, and then change it back to prod before we commit the file to source control. And chances are that we would eventually use our prod settings in dev, or our dev settings in prod - meaning that we could overwrite an important delta file, or not realize that our "prod" run has actually just been a "dev" run for a few weeks.</p>

<p>Or how would you dynamically set the environment? With CI, for example. You'd have to do some string parsing and remove the 3rd command from the top of the notebook or whatever, and it would be kind of messy and brittle. With the <code>set_environment</code> file, CI just needs to dump a single line into a new file for everything to work.</p>
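<p>As a sketch of that CI step (the filename and config name here are just made up for illustration), the script only has to write one line:</p>

```python
# Hypothetical sketch of the CI step that generates the environment file:
# a single %run line pointing at the CI config. Path and config name are
# assumptions, not from a real pipeline.
def write_set_environment(path, config_name):
    """Write a one-line set_environment notebook selecting a conf file."""
    with open(path, "w") as f:
        f.write(f"%run ./conf/{config_name}\n")

# CI would do something like:
write_set_environment("ci_set_environment.py", "ci_config")
```

<p>No string parsing of real notebooks, nothing brittle - CI just stamps out the single line and every notebook that runs <code>set_environment</code> picks up the CI conf.</p>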

<h4 id="secrets">Secrets</h4>

<p>Secrets (any sort of key or password) should be stored in <a href="https://docs.azuredatabricks.net/user-guide/secrets/secret-scopes.html#create-an-azure-key-vault-backed-secret-scope">Azure Key Vault</a> or <a href="https://docs.azuredatabricks.net/user-guide/secrets/secret-scopes.html#create-a-databricks-backed-secret-scope">Databricks secrets</a>. </p>

<p>Secrets should never be put directly in a Databricks notebook unless you clear the revision history afterwards. Otherwise, anyone with access to the notebook will be able to go back and find them.</p>

<h3 id="securityconnectingtodatasources">Security! Connecting to data sources</h3>

<p>One important point is that we don't want to mount the data lake as a drive due to security concerns. If we do that, then anyone with access to the Databricks instance can see everything in the data lake. By providing the oauth tokens, we can control access better.</p>
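<p>As a sketch of what that looks like in practice (assuming Azure Data Lake Gen1 and the OAuth settings from that doc - the IDs and secret scope names below are placeholders), you set the tokens in the Spark conf instead of mounting:</p>

```python
# Hypothetical sketch of OAuth access to Azure Data Lake (Gen1-era config
# keys). YOUR_SERVICE_PRINCIPAL_APP_ID / YOUR_TENANT_ID and the secret
# scope/key names are placeholders. `spark` and `dbutils` are predefined
# inside a Databricks notebook; this won't run outside of one.
spark.conf.set("dfs.adls.oauth2.access.token.provider.type", "ClientCredential")
spark.conf.set("dfs.adls.oauth2.client.id", "YOUR_SERVICE_PRINCIPAL_APP_ID")
spark.conf.set("dfs.adls.oauth2.credential",
               dbutils.secrets.get(scope="keyvault-scope", key="sp-credential"))
spark.conf.set("dfs.adls.oauth2.refresh.url",
               "https://login.microsoftonline.com/YOUR_TENANT_ID/oauth2/token")

# Now reads are authorized per service principal, not per mount
df = spark.read.parquet("adl://yourlake.azuredatalakestore.net/some/path/")
```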

<p>See the <a href="https://docs.databricks.com/spark/latest/data-sources/azure/azure-datalake.html">Databricks docs</a> for more info.</p>

<h2 id="continuousintegrationci">Continuous Integration (CI)</h2>

<p>This is still fairly new, but it has already saved a whole bunch of time. It also takes a real mental weight off, knowing that what's in source control is a deployable version of your code.</p>

<p>The idea is that every time you push to GitHub, the job runs the model training notebook. That notebook runs on only a small subset of the data so that it finishes quickly, which means you'll have to make your data source a little dynamic (i.e., put it in your environment conf). The notebook can have some <code>assert</code> statements to make sure that things pass some baseline expectation, and then CI will tell you whether it passed or failed. </p>

<p>So it's on you to define any asserts or tests inside your notebooks; but either way, if an exception happens for any other reason, the run will fail and GitHub will show it.</p>
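<p>The asserts don't need to be fancy - even a couple of baseline sanity checks catch a surprising amount. A made-up example (the counts and metrics here are placeholders for whatever your notebook actually computes):</p>

```python
# Hypothetical final cell of the Model Training notebook: if any assert
# raises, the notebook run fails and CI reports the failure in GitHub.
row_count = 1250        # stand-in for something like training_df.count()
mean_abs_error = 7.3    # stand-in for a metric from a held-out sample

assert row_count > 1000, "suspiciously little training data"
assert mean_abs_error < 10.0, "model error worse than baseline expectation"
print("baseline checks passed")
```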

<p>So you'll want to set something up in Azure Pipelines and configure your azure-pipelines.yml file (see the Git project structure):</p>

<pre><code class="language-python">import base64  
import os  
import time

from databricks_api import DatabricksAPI  
from databricks_cli.workspace.api import WorkspaceApi  
from databricks_cli.sdk.api_client import ApiClient


BASE_PATH = os.path.dirname(os.path.dirname(__file__))

# Pre-defined Pipelines variables
# See https://docs.microsoft.com/en-us/azure/devops/pipelines/build/variables?view=azure-devops&amp;tabs=yaml
PIPELINES_BUILD_ID = os.environ['PIPELINES_BUILD_ID']  
GIT_BRANCH_NAME = os.environ['GIT_BRANCH']

NOTEBOOK_DIR = '/Projects/ProjectName/CI/{}-{}'.format(  
    PIPELINES_BUILD_ID, GIT_BRANCH_NAME)

api_key = os.environ['DATABRICKS_API_KEY']  
headers = {  
    'Authorization': 'Bearer {}'.format(api_key)
}

databricks = DatabricksAPI(  
    host='https://eastus2.azuredatabricks.net',
    token=api_key
)


def send_code_to_workspace():  
    client = ApiClient(
        host='https://eastus2.azuredatabricks.net',
        token=api_key
    )
    workspace_api = WorkspaceApi(client)
    workspace_api.import_workspace_dir(
        source_path=BASE_PATH,
        target_path=NOTEBOOK_DIR,
        overwrite=True,
        exclude_hidden_files=True
    )


def send_config_to_workspace():  
    env_file_path = os.path.join(BASE_PATH, 'conf', 'ci_set_environment.py')
    with open(env_file_path, 'rb') as env_file:
        set_env_base64 = base64.b64encode(env_file.read()).decode('ascii')

    # Create set_environment file
    databricks.workspace.import_workspace(
        path='{}/set_environment'.format(NOTEBOOK_DIR),
        content=set_env_base64,
        language='PYTHON',
        overwrite=True
    )


def wait_for_complete(run_id):  
    stopped_states = ['TERMINATED', 'SKIPPED', 'INTERNAL_ERROR']

    sleep_time_seconds = 60 * 2
    run_info = {}

    status = None
    while status is None or status not in stopped_states:
        if status is not None:
            time.sleep(sleep_time_seconds)

        run_info = databricks.jobs.get_run(run_id=run_id)
        status = run_info['state']['life_cycle_state']

    return run_info


def execute_on_databricks():  
    send_code_to_workspace()
    send_config_to_workspace()

    job_name = 'CI-{}-{}'.format(PIPELINES_BUILD_ID, GIT_BRANCH_NAME)
    cluster_info = {
        'spark_version': '5.3.x-scala2.11',
        'autoscale': {
            'min_workers': 2,
            'max_workers': 13
        },
        'custom_tags': {
            'cluster-purpose': 'CI-testing',
        },
        'node_type_id': 'Standard_F8s_v2',
        'driver_node_type_id': 'Standard_F8s_v2',
        'spark_env_vars': {
            "PYSPARK_PYTHON": "/databricks/python3/bin/python3",
        }
    }

    job_response = databricks.jobs.submit_run(
        run_name=job_name,
        new_cluster=cluster_info,
        libraries=[
            {'pypi': {'package': 'pandas'}},
            {'pypi': {'package': 'sentry-sdk'}},
        ],
        notebook_task={
            'notebook_path': '{}/ModelTraining'.format(NOTEBOOK_DIR),
        }
    )

    run_id = job_response['run_id']

    run_info = wait_for_complete(run_id)
    result_state = run_info['state']['result_state']
    if result_state != 'SUCCESS':
        print(run_info)
        raise Exception("Databricks run not successful - status {}".format(result_state))

    print('job id', run_id)


print(GIT_BRANCH_NAME, PIPELINES_BUILD_ID)

execute_on_databricks()  
</code></pre>

<p>That example script basically orchestrates the whole run: sending the right version of the code to Databricks, creating a cluster, executing the pipeline (along with the assert tests), and waiting for the results.</p>

<p>In order for this to work, you'll need to have a subset of data available in your Databricks CI conf settings, because the job needs to take like 25 minutes or less, or it will just fail. You want fast feedback at this point anyway. Eventually we'll write something that can do a more complete run on the data.</p>

<h2 id="appendix">Appendix</h2>

<h3 id="helpfulreads">Helpful reads</h3>

<ul>
<li><a href="https://thedataguy.blog/ci-cd-with-databricks-and-azure-devops/">CI/CD with Databricks and Azure Devops</a></li>
<li><a href="https://databricks.com/blog/2017/10/30/continuous-integration-continuous-delivery-databricks.html">CI/CD with Databricks</a></li>
</ul>

<h3 id="code">Code</h3>

<h4 id="azurepipelinesyml">azure-pipelines.yml</h4>

<p>This is what we use for our project:</p>

<pre><code># Python package
# Create and test a Python package on multiple Python versions.
# Add steps that analyze code, save the dist with the build record, publish to a PyPI-compatible index, and more:
# https://docs.microsoft.com/azure/devops/pipelines/languages/python

pr:  
  autoCancel: true
  paths:
    exclude:
    - README.md

jobs:

- job: 'Test'
  pool:
    vmImage: 'Ubuntu-16.04'
  strategy:
    matrix:
      Python36:
        python.version: '3.6'
    maxParallel: 4

  steps:
  - task: UsePythonVersion@0
    inputs:
      versionSpec: '$(python.version)'
      architecture: 'x64'

  - script: python -m pip install --upgrade pip &amp;&amp; pip install -r requirements.txt
    displayName: 'Install dependencies'

  - script: python -m unittest
    displayName: 'unittest'

  - script: python ./forecasting/ci/run_in_databricks.py
    displayName: 'Run in Databricks'
    env:
      GIT_BRANCH: $(Build.SourceBranchName)
      PIPELINES_BUILD_ID: $(Build.BuildId)
      DATABRICKS_API_KEY: $(DATABRICKS_API_KEY)

  - task: PublishTestResults@2
    inputs:
      testResultsFiles: '**/test-results.xml'
      testRunTitle: 'Python $(python.version)'
    condition: succeededOrFailed()
</code></pre>

<h4 id="sendyournotebookstodatabricks">Send your notebooks to Databricks</h4>

<p>Run this file like <code>./send-to-databricks.ps1 BI-4811</code>. It was written for Windows, but a shell script should be similar.</p>

<pre><code># Send your current code to a dir in databricks. It will overwrite any conflicting files.


$SCRIPTPATH=$PSScriptRoot
$PROJECT_ROOT = (get-item $SCRIPTPATH).parent.FullName
$DATABRICKS_BASE_DIR="/Projects/ProjectName"
$BRANCH_NAME=$Args[0]
$DATABRICKS_DEST_DIR="$DATABRICKS_BASE_DIR/$BRANCH_NAME"
$LOCAL_CODE_DIR="$PROJECT_ROOT/forecasting/"

if ($BRANCH_NAME -eq $Null) {  
    throw "No git branch/directory name"
}

echo "checking out $BRANCH_NAME into Databricks $DATABRICKS_DEST_DIR"

databricks workspace import_dir --exclude-hidden-files --overwrite $LOCAL_CODE_DIR $DATABRICKS_DEST_DIR  
databricks workspace rm $DATABRICKS_DEST_DIR/set_environment

echo "Complete"  
</code></pre>

<h4 id="deployyourlibrarycodetodatabricks">Deploy your library code to Databricks</h4>

<p>You'll need to set up databricks-cli for this bash script to work. And be mindful, because this will overwrite everything.</p>

<pre><code class="language-sh">SCRIPTPATH="$( cd "$(dirname "$0")" ; pwd -P )"

. $SCRIPTPATH/env/Scripts/activate
python setup.py bdist_egg  
databricks --profile MYPROFILE fs cp -r --overwrite ./dist dbfs:/libraries  
deactivate  
</code></pre>]]></content:encoded></item><item><title><![CDATA[In Soviet Russia, ad finds you]]></title><description><![CDATA[<p>This is a post I wrote for work to explain the basics and foundations of ad tracking. It was meant for a fairly nontechnical audience, so there are probably some oversimplifications. There are also possibly some mistakes because I'm just looking at ads from the outside - I've never really</p>]]></description><link>http://104.236.78.148/2020/02/28/in-soviet-russia-ad-finds-you/</link><guid isPermaLink="false">c4254480-68e6-4189-9648-ad13598a23c2</guid><dc:creator><![CDATA[Kyle Valade]]></dc:creator><pubDate>Fri, 28 Feb 2020 13:10:11 GMT</pubDate><media:content url="http://104.236.78.148/content/images/2020/02/Chinatown-032.jpg" medium="image"/><content:encoded><![CDATA[<img src="http://104.236.78.148/content/images/2020/02/Chinatown-032.jpg" alt="In Soviet Russia, ad finds you"><p>This is a post I wrote for work to explain the basics and foundations of ad tracking. It was meant for a fairly nontechnical audience, so there are probably some oversimplifications. There are also possibly some mistakes because I'm just looking at ads from the outside - I've never really dealt with them professionally. Just, like, every day on the internet.</p>

<hr>

<p>So there you are, hanging out, talking about your favorite cleaning product. Next thing you know, you see an ad for Tide on Instagram. What happened? Is your phone listening to you? Or is there something deeper and more complex going on behind everything? Hopefully this article will shed some light, though maybe not on the conspiracy you thought. I'll try to keep it not very technical.</p>

<p><img src="http://104.236.78.148/content/images/2020/02/nosy-fellows.jpg" alt="In Soviet Russia, ad finds you"></p>

<h3 id="cookies">Cookies</h3>

<p>BUT FIRST: cookies! You need to know about cookies. Remember back in the 90's when you thought they sounded cute? They're the source of everything, but they’re just tiny bits of text that are attached to a website - little files, really. They're pretty helpful - they keep us logged in, they save things that we add to a cart, they save our Dark Mode preferences, and our darkest secrets.</p>
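<p>Under the hood, a cookie really is just text - a few name=value pairs plus some attributes. Here's a toy example (the cookie is invented, not from any real site) using Python's standard library to parse one:</p>

```python
from http.cookies import SimpleCookie

# Parse a made-up Set-Cookie style string: just names, values,
# and attributes like Path and HttpOnly.
cookie = SimpleCookie()
cookie.load('session_id=abc123; Path=/; HttpOnly')

print(cookie['session_id'].value)    # abc123
print(cookie['session_id']['path'])  # /
```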

<p>You can see them if you open up your Chrome dev tools and go to the Application Tab like below. This is just a sample of the cookies we load on carhartt.com.</p>

<p><img src="http://104.236.78.148/content/images/2020/02/cookies.png" alt="In Soviet Russia, ad finds you"></p>

<p>Looks pretty boring, right? Well, look at the Domain column there...see anything funny? Facebook? Yahoo? doubleclick.net? channeladvisor.com? Those aren't Carhartt at all. They are, in fact, <strong>third-party cookies</strong>. How did they get added to carhartt.com?</p>

<h3 id="httprequests">HTTP requests!</h3>

<p>Sorry, did I say this wouldn't get technical? You also need to know about HTTP requests. You simply must. HTTP is what the internet is built on. It's how you view webpages and look at good dogs all day. Whenever you go to a website, your browser asks that website for the page's code (that's called an HTTP request), and the web server has a record of all of the requests. That record can look something like this:</p>

<p><img src="http://104.236.78.148/content/images/2020/02/access-log.png" alt="In Soviet Russia, ad finds you"></p>

<p>It's also important to note that everything else that a website pulls in goes through the same process. When you think of everything else that might be pulled in as part of a website, think: images, videos, code to make the page fancy and good-looking, code for analytics, etc. These are called "resources" in the biz. So up there, you might see requests for images (there's a line for GET /apache_pb.gif).</p>

<p>One of the cool parts of the internet is that people are able to kind of borrow other people's images. When they do that, a request is sent out to the site that "owns" the image ("hosts" would be the technical word). When you load a page where a picture is being pulled in like that, what you don't see is that the "owner" of the picture gets a tiny bit of information about you. For instance, if you're on carhartt.com, and we're loading an image from Scene7, then Scene7 will know exactly where that image is going. Or if I'm clicking on a page from Google, the site will know that, too. It's called the "Referer" [sic]. Along with that information, there are also some cookies that are being passed back and forth.</p>

<p>Here's what Chrome is sending to a website after I google "What are my http headers" and click on one of the links:</p>

<p><img src="http://104.236.78.148/content/images/2020/02/http-headers.png" alt="In Soviet Russia, ad finds you"></p>

<p>The referer is telling that website that I came from Google. The User-Agent is also interesting. You can check it out yourself <a href="https://www.whatismybrowser.com/detect/what-http-headers-is-my-browser-sending">here</a>. Note that these headers are easily spoofed, so it's not necessarily the best way of collecting information.</p>

<h3 id="bringingitalltogether">Bringing it all together</h3>

<p>Probably not super exciting so far, but this is where everything finally starts to tie together. Cookies and HTTP are important to know about because most websites load things from all over, which is why privacy has become such a big concern. Remember that sample of cookies I showed above? Carhartt.com is loading things from all of those sites and more. And when you load a resource, any cookies for that site are sent along with the rest of the information.</p>

<p>Let’s see how that applies to Facebook and Google. Not that they are the only examples - adtech is an area with a whole lot of different businesses involved. But they are two that everyone knows, and that probably have the most complex operations.</p>

<h4 id="facebook">Facebook</h4>

<p>Let's say <a href="https://www.pewresearch.org/fact-tank/2019/05/16/facts-about-americans-and-facebook/">69% of Americans have a Facebook account</a>. Let's say you logged into Facebook last night and were cruising your news feed. Imagine that "like" buttons are still popular (they don't need "like" buttons to get this info anymore). And also imagine that Carhartt has a "like" button on our product pages.</p>

<p>That "like" button is really just a snippet of code that Facebook tells the website people to copy-paste into the page source. In that code, there is an image that's loaded from facebook.com. Say you're looking at our classic Chore Coat, and that there is a "like" button underneath. Just by going to that page, Facebook knows that you're considering purchasing that chore coat. And they know that it might be ill-fitting what with your philosophy and computer science degree and that she never did love you after all...just for example... </p>

<p>How do they know all that? Well - remember cookies? You've been to Facebook.com, so there is a cookie set for that site. The cookie is essentially the key to your profile - it’s how you can close your tab and then come back to it later without having to sign in again. Facebook.com has a lot of your personal information -  like a list of your <a href="https://open.spotify.com/track/55gISxV37mffOW2DbSskT3">fav</a> <a href="https://open.spotify.com/track/0XKWxKy4DqjlvWGiNJoIKO">post</a> <a href="https://open.spotify.com/track/7GDURAuWFIfwK3lXzPRRFp">rock</a> <a href="https://open.spotify.com/track/1HfJV18PHF2UQqh4TuySBJ">bands</a> from 2005. You are loading their image, so you are sending them the referer (the Carhartt Chore coat page) and your profile (via the cookie). Now Facebook has a network of sites that you've visited and products that you've viewed and can tie that right to your identity.</p>

<p>Now imagine that you also have a mobile device. Imagine it's smart and that you visit websites with that device. Including Facebook. Now Facebook can tie different devices to your profile and can probably infer which are for work. And they definitely know your location (google <a href="http://letmegooglethat.com/?q=device+graphing">device graphing</a>).</p>

<p>Now imagine that you have friends. Imagine that those friends, too, have mobile devices. Imagine that they also visit Facebook and give them their location without necessarily realizing it. Imagine that you hang out with those friends. Well, Facebook already knew that - they have your friend network and everyone's locations.</p>

<h4 id="googleotheradtechcompanies">Google + other adtech companies</h4>

<p>My guess is that all ad networks would kind of work the same, but with less data than Google or Facebook (so they can argue that their ads are better and sell them for more $$).</p>

<p>When a website joins an ad network, let's say Google's specifically, they tell you to put a script in your site. A script can run any code it wants once it's loaded, so it's hard to say exactly what is in that script. But at a minimum, you'll be sharing the same information you're giving Facebook through those "like" buttons. Google is able to build up a network of sites that you've visited, so they know what to show you, or can guess, or just show something generic or profitable. They also know what you click on, and probably what your mouse hovers over, or if you've stopped scrolling so you can read the ad.</p>

<p>Not to get conspiratorial, but...</p>

<p>Remember Google+? My guess would be that they created that social network so they could tie ads directly to your profile. And it kind of worked. I'm always signed into Google when I'm searching, so they can tie everything right back to me.</p>

<p>Remember Chrome? Boy is it a popular browser. And they made it super easy to stay signed into Google and sync between devices.</p>

<p><a href="https://www.theverge.com/2018/8/13/17684660/google-turn-off-location-history-data">Do you ever use Google Maps or Waze?</a></p>

<p><a href="https://www.theverge.com/2017/11/21/16684818/google-location-tracking-cell-tower-data-android-os-firebase-privacy">Have an Android phone?</a></p>

<p>Anyways.</p>

<h4 id="beyondcookies">Beyond cookies</h4>

<p>I mentioned above that Facebook no longer needs “Like” buttons to track your browsing habits. Now that Facebook has proved itself to be a very valuable advertising tool, people will happily let them take whatever information they want (see: <a href="https://developers.facebook.com/docs/facebook-pixel/implementation/">Facebook pixel</a>). Facebook isn’t the only company that does this, of course, they are just a stand-in for any other company that likes data. Google and Amazon definitely do it. I’m sure Adobe does. Even <a href="https://www.youtube.com/watch?v=2RPerSEvP4Y">educated fleas</a> do it. Any third-party script that someone adds to their page can take whatever information they want from the user. </p>

<p>What is a script? A script in this context is a chunk of code that people add to their site, and it runs whenever the page is loaded. Take Google Analytics - they give you a snippet of code, which in turn loads some of their code, and gives you the user journeys on your site. There are even some websites that use their users to mine cryptocurrency (not fantastic for battery life). That code is often minified, which effectively obfuscates it, so again, it’s hard to say exactly what each script does. You mostly have to trust the organization whose code you’re pulling in.</p>

<p>Now that all of these websites have scripts from Facebook and Google added willy-nilly, and since those are two of the major online advertisers (and gatekeepers for your online identity), there is less of a need for third-party cookies. Which is convenient, because the world is actually moving away from third-party cookies as a means of tracking (Firefox and Safari block them by default + Chrome will be doing it soon).</p>

<p>I don't really want to get into mobile apps here, but assume that they are, in some ways, even more permissive than scripts. Apple won't let apps do any sneaky cryptocurrency mining, but the apps are often given access to your location, photos, contacts, etc.</p>

<p><img src="http://104.236.78.148/content/images/2020/02/its-chinatown.jpg" alt="In Soviet Russia, ad finds you"></p>

<h3 id="targetedadvertising">Targeted advertising</h3>

<p>The companies have all of your data - now let’s look at what they do with it:</p>

<p><img src="http://104.236.78.148/content/images/2020/02/pasted-image-0.png" alt="In Soviet Russia, ad finds you"></p>

<p>Do you see that, or have you unconsciously blocked it?</p>

<p>For all of the data they collect, what you see is still just some banner ad. Maybe it’s for a brand that’s a bit more relevant. There’s that cliche: “The greatest minds of our time are thinking about how to make people click ads”, which...yeah, probably. The relatively new Amazon ad team started bringing in $1 billion in like a year. That’s because with all of the knowledge that they have of consumers, these companies can charge a lot of money - <a href="https://techcrunch.com/2019/01/20/dont-be-creepy/">up to 500% more</a> - for targeted ads vs. untargeted ads. That is, ads that show you something tailored to your personal profile vs. ads that show you something less bespoke. That is, ads that track you vs. ads that don’t. There <a href="https://techcrunch.com/2019/05/31/targeted-ads-offer-little-extra-value-for-online-publishers-study-suggests/">are</a> <a href="https://techcrunch.com/2019/01/20/dont-be-creepy/">reports</a> that targeted advertising isn’t worth the extra money, so we’ll see how that shakes out, but people generally seem to think they are. Contextual ads are one of the alternatives.</p>

<p>There are other applications of your data, too, like training facial recognition algorithms based off of all of the labelled pictures that have been posted to Facebook and Instagram or creating recommendation engines (think Netflix or Amazon's suggestions).</p>

<h3 id="privacy">Privacy</h3>

<p>The reason that a lot of sites join ad networks or bring in third-party scripts from adtech companies is simple - they want to make money. They want each person that reads their blog to give them a fraction of a penny so maybe they can dream of paying their bills with the proceeds.</p>

<p>When businesses do it, it’s typically because they want to sell more of their product. Large parts of Facebook, etc. are dedicated to proving to their advertisers that the ad spend of those companies is generating large returns (how honest they are is <a href="https://thecorrespondent.com/100/the-new-dot-com-bubble-is-here-its-called-online-advertising">up for debate</a>). So companies put the scripts on their site.</p>

<p>User privacy usually isn’t part of the equation, or if it is, well, they have to drive traffic, convert users, and sell things. I mean, imagine if a company didn’t do that - it would be crazy - just completely irresponsible. Who knows - they might go out of business because of it. And, what, big tech knows that the user is looking at well-made, hardworking apparel? That’s hardly the end of the world.</p>

<p>And, as a web developer myself, that stuff is hard to avoid adding. You need to know your traffic and the users on your site, and there isn’t time to build something new. So if Google has a tool I can add that’s free or cheap, and that has been tested by essentially the entire internet, I’m probably going to go with that. Repeat that process for other foundational tools.</p>

<p>The problem is that the sites with those tools add up, creating a huge network, with the effect that big tech knows about <a href="https://www.theverge.com/2014/4/4/5581884/how-advertising-cookies-let-observers-follow-you-across-the-web">90% of the sites</a> you visit, and can track you everywhere you go (via phone location data). How much of what you think about is reflected in search terms or online research? Messages to friends? Photos and online posts? That's a whole lot of your physical, mental, and emotional state that is being tracked by these companies. And oftentimes sold, directly or indirectly.</p>

<p>How can you guard against that? <a href="https://www.theverge.com/2018/6/7/17434522/online-privacy-tools-guide-chrome-windows">Here is a guide</a>. Or, off the top of my head, you can block third-party cookies, for one. Turn off location sharing in your apps + exit them when you aren’t using them. And/or opt for their websites instead. Firefox also has better privacy defaults than Chrome. Delete Facebook? Or you can use Tor and never sign into social networks. Eh - better just use that guide.</p>

<p>There are also some interesting ideas where you generate thousands of online profiles with your name and a bunch of different interests so that the real you is obfuscated.</p>

<h3 id="conclusion">Conclusion</h3>

<p>So that's how, once you go to a company's website, you start seeing their logo on top of half the internet. It began with cookies and HTTP and has since evolved. I hope that answered all of your- what's that? The Instagram ad for Tide? Oh, no, Facebook was definitely listening to you.</p>

<h4 id="etc">Etc.</h4>

<ul>
<li>See this <a href="https://gimletmedia.com/shows/reply-all/z3hlwr">Reply All</a> episode - "Is Facebook Spying on You?"</li>
<li>The Verge - <a href="https://www.theverge.com/2018/6/7/17434522/online-privacy-tools-guide-chrome-windows">Guide to protecting your data</a></li>
<li><a href="https://blog.mozilla.org/firefox/online-advertising-strategies">Mozilla article</a> about online tracking strategies and how to protect yourself</li>
<li><a href="https://vicki.substack.com/p/one-very-bad-apple">Great article</a> from Normcore Tech about where Apple's pro-privacy stance hits reality</li>
<li>The Verge - <a href="https://www.theverge.com/2018/6/7/17434522/online-privacy-tools-guide-chrome-windows">"How to increase your privacy online"</a></li>
<li>The Verge - <a href="https://www.theverge.com/2019/3/5/18252397/facebook-android-apps-sending-data-user-privacy-developer-tools-violation">“Some major Android apps are still sending data directly to Facebook: Even when you’re not logged in or don’t have a Facebook account”</a></li>
<li>The Verge - <a href="https://www.theverge.com/2014/4/4/5581884/how-advertising-cookies-let-observers-follow-you-across-the-web">"How advertising cookies let observers follow you across the web"</a></li>
<li><a href="https://en.wikipedia.org/wiki/HTTP_referer">HTTP referer</a></li>
</ul>

<p>Yeah, pretty heavy on The Verge...</p>]]></content:encoded></item><item><title><![CDATA[ODBC and DB2 - problem saving a large chunk of text - CWBNL0107]]></title><description><![CDATA[<p>I'm using PyODBC to connect to DB2 and was seeing some problems saving certain rows of data. The column the error message mentioned was a CLOB type, and the error happened when the column had data larger than, say 10,000 characters. The error message reads thusly:</p>

<pre><code>(pyodbc.DataError) ('22018',</code></pre>]]></description><link>http://104.236.78.148/2020/01/30/odbc-and-db2-problem-saving-a-large-chunk-of-text-cwbnl0107/</link><guid isPermaLink="false">ea2e4dff-4d92-47da-90fb-dd7e7d3264ae</guid><dc:creator><![CDATA[Kyle Valade]]></dc:creator><pubDate>Thu, 30 Jan 2020 15:17:07 GMT</pubDate><content:encoded><![CDATA[<p>I'm using PyODBC to connect to DB2 and was seeing some problems saving certain rows of data. The column the error message mentioned was a CLOB type, and the error happened when the column had data larger than, say 10,000 characters. The error message reads thusly:</p>

<pre><code>(pyodbc.DataError) ('22018', '[22018] [IBM][System i Access ODBC Driver]Column 6: CWBNL0107 - Converted 9739 bytes, 4869 errors found beginning at offset 0 (scp=1202 tcp=37 siso=1 pad=0 sl=9739 tl=19478) (30200) (SQLPutData); [22018] [IBM][System i Access ODBC Driver]Error in assignment. (30019)')
</code></pre>

<p>Googling the error, some people had mentioned the charset, so I tried tweaking that a little bit. Didn't work. Plus, some records were being saved, so it didn't make sense.</p>

<p>Then <a href="https://www.ibm.com/support/pages/client-access-odbc-driver-truncates-character-fields-contain-null">IBM themselves</a> recommended turning on the Allow unsupported character option (in this case, through the <code>AllowUnsupportedChar</code>/<code>ALLOWUNSCHAR</code> arg in the connection string, though that part isn't really documented, like pretty much everything related to db2). I tried that, and I didn't get an error. But it mangled like half the data in a large row. </p>

<p>Parentheticals aside - not a good solution. I prefer the error, thank you very much. And I still suspected that the problem had to do with the size of the data.</p>

<p>Then I found <a href="http://www.sqlthing.com/HowardsODBCiSeriesFAQ.htm">this site</a> which seems to be documenting all of the db2 connection string args, and it is a magical wonderland. I did a ctrl + f and searched for "length" and boom:</p>

<blockquote>
  <p>The MAXFIELDLEN keyword, (can also be specified as MaxFieldLength), controls how much LOB (large object) data is sent in a result set. The value indicates the size threshold in kilobytes and the default value is 15360 and in V5R2 the maximum value allowed is 2097152, (2MB). If a LOB is larger than this value, you will have to use subsequent calls to retrieve the rest of the LOB data in your application.</p>
</blockquote>

<p>Now I think there's something wrong with the math there, because 2097152KB certainly isn't 2MB (it's 2GB). The default was also far from 15360KB in my system. Anyway, I set <code>MAXFIELDLEN=2056</code> in my connection string (<a href="http://nightlyclosures.com/2020/01/22/connect-to-db2-from-python-with-sqlalchemy/">see here</a> for my post about the connection string) and everything worked like magico. I hope it works like magico for you, too.</p>]]></content:encoded></item><item><title><![CDATA[Connect to DB2 from python with SQLAlchemy]]></title><description><![CDATA[<p>This is kind of a sister post to my <a href="http://nightlyclosures.com/2020/01/16/access-db2-from-databricks/">Databricks-specific post</a> from the other day.</p>

<p>It's amazing how much time you can spend searching through docs, fiddling with connection strings, and trying different engines because some <code>&lt;cough&gt;</code> IBM <code>&lt;/cough&gt;</code> don't seem to be working very well</p>]]></description><link>http://104.236.78.148/2020/01/22/connect-to-db2-from-python-with-sqlalchemy/</link><guid isPermaLink="false">dddd2f3b-0d9e-46a1-9726-a3b2d898430b</guid><dc:creator><![CDATA[Kyle Valade]]></dc:creator><pubDate>Wed, 22 Jan 2020 14:45:19 GMT</pubDate><content:encoded><![CDATA[<p>This is kind of a sister post to my <a href="http://nightlyclosures.com/2020/01/16/access-db2-from-databricks/">Databricks-specific post</a> from the other day.</p>

<p>It's amazing how much time you can spend searching through docs, fiddling with connection strings, and trying different engines because some <code>&lt;cough&gt;</code> IBM <code>&lt;/cough&gt;</code> drivers don't seem to work very well, or the docs aren't quite up to date, or whatever. </p>

<p>Maybe there are other people that need to use DB2 for whatever godawful reason. Maybe those people want to start using Python or Airflow or something. Maybe those people are just me six months from now. Here is what got everything working.</p>

<p>TL;DR: use pyodbc with <code>ibm_db_sa</code>. The connection string should look like this:</p>

<p><code>'ibm_db_sa+pyodbc400://{username}:{password}@{host}:{port}/{database};currentSchema={schema}'</code></p>

<p>Now for the long form answer...</p>

<pre><code># requirements.txt

ibm_db  
ibm_db_sa  
pyodbc  
SQLAlchemy  
</code></pre>

<pre><code class="language-python"># database_engine.py

from contextlib import contextmanager

from sqlalchemy import create_engine  
from sqlalchemy.orm import sessionmaker


def create_database_engine():  
    connection_string = 'ibm_db_sa+pyodbc400://{username}:{password}@{host}:{port}/{database};currentSchema={schema}'.format(
        username='',
        password='',
        host='',
        port='',
        database='',
        schema=''
    )
    return create_engine(connection_string)


engine = create_database_engine()


def create_session():  
    Session = sessionmaker(bind=engine)
    return Session()


@contextmanager
def session_scope():  
    """Provide a transactional scope around a series of operations."""
    session = create_session()
    try:
        yield session
        session.commit()
    except Exception:
        session.rollback()
        raise
    finally:
        session.close()
</code></pre>

<p>Here is the model:  </p>

<pre><code class="language-python"># models.py

from sqlalchemy import Column, Integer, String, DateTime, Text, MetaData  
from sqlalchemy.ext.declarative import declarative_base

from database_engine import engine


metadata = MetaData(schema='Restaurant')  
Base = declarative_base(bind=engine, metadata=metadata)


class Transaction(Base):  
    __tablename__ = 'Transaction'

    id = Column('ID', String(100), primary_key=True)
    store_id = Column('STORE_ID', Integer)
    created_time = Column('CREATED_TIME', DateTime)
    transaction_json = Column('TXN_JSON', Text)
</code></pre>

<p>And a sample use:</p>

<pre><code class="language-python">from database_engine import session_scope  
from models import Transaction


if __name__ == '__main__':  
    with session_scope() as session:
        results = session.query(Transaction).limit(10)
        for result in results:
            print(result.id)
</code></pre>]]></content:encoded></item><item><title><![CDATA[Access DB2 From Databricks]]></title><description><![CDATA[<p>This took me a good few hours to figure out. So hopefully it will help you and my future self.</p>

<ul>
<li>install <code>com.ibm.db2.jcc:db2jcc:db2jcc4</code> on your cluster from maven</li>
<li>Get your license file dir (this is a whole process in itself)</li>
<li>From your license info, copy the</li></ul>]]></description><link>http://104.236.78.148/2020/01/16/access-db2-from-databricks/</link><guid isPermaLink="false">f4856d04-b14b-4df7-9445-a18556f1d01e</guid><dc:creator><![CDATA[Kyle Valade]]></dc:creator><pubDate>Thu, 16 Jan 2020 20:55:14 GMT</pubDate><content:encoded><![CDATA[<p>This took me a good few hours to figure out. So hopefully it will help you and my future self.</p>

<ul>
<li>install <code>com.ibm.db2.jcc:db2jcc:db2jcc4</code> on your cluster from maven</li>
<li>Get your license file dir (this is a whole process in itself)</li>
<li>From your license info, copy the jar file (mine is like <code>db2jcc*.jar</code>) up to databricks using databricks-cli. 
<ul><li>I copied them to a tmp dir and then moved them to <code>/dbfs/FileStore/jars/maven/com/ibm/db2/jcc/license</code> from a notebook, but that might not be necessary</li>
<li>You might also have to copy the <code>.lic</code> files into the same dir, but, again, I haven't validated that.</li></ul></li>
<li>install that jar on your cluster as a library</li>
<li>restart your cluster</li>
</ul>

<p>Then you can run this (python) code:</p>

<pre><code>connection_string = 'jdbc:db2://{host}:{port}/{database}:currentSchema={schema};database={database};user={username};password={password};'.format(  
    host=host, 
    port=port,
    schema=default_schema,
    database=database, 
    username=username, 
    password=password
)

# spark.read returns a DataFrame, not an RDD
df = spark.read.format("jdbc") \
    .option('url', connection_string) \
    .option('driver', 'com.ibm.db2.jcc.DB2Driver') \
    .option('dbtable', 'my_table') \
    .load()

display(df)
</code></pre>

<p>Hurrah!</p>]]></content:encoded></item><item><title><![CDATA[7x performance improvement for my dead slow SQL Server + Django pyodbc queries]]></title><description><![CDATA[<p>In <a href="http://104.236.78.148/2019/10/16/working-with-unmanaged-sql-server-in-django-pt-ii/">one of my Django projects</a>, I'm connecting to a SQL server database, and I'm doing this with django-pyodbc-azure. I noticed that my query performance was incredibly slow in a lot of places. </p>

<p>For a simple query where I was just selecting 50 rows, it would be taking like 11</p>]]></description><link>http://104.236.78.148/2019/12/06/fixing-my-dead-slow-sql-server-pyodbc-query-performance-2/</link><guid isPermaLink="false">c04dd8c9-020f-47e2-8012-80ff8e1884ef</guid><dc:creator><![CDATA[Kyle Valade]]></dc:creator><pubDate>Fri, 06 Dec 2019 19:14:37 GMT</pubDate><content:encoded><![CDATA[<p>In <a href="http://104.236.78.148/2019/10/16/working-with-unmanaged-sql-server-in-django-pt-ii/">one of my Django projects</a>, I'm connecting to a SQL server database, and I'm doing this with django-pyodbc-azure. I noticed that my query performance was incredibly slow in a lot of places. </p>

<p>For a simple query where I was just selecting 50 rows, it would be taking like 11 seconds to complete. That's after making sure all of the relevant columns were indexed. At first I thought that maybe the Django Rest Framework performance was a <em>lot</em> worse than I remembered. Pagination, maybe? But digging in, it became clear that it was just the query execution.</p>

<p>Alright, so here's an example query:</p>

<pre><code>SELECT TOP 50 * FROM Customer WHERE email LIKE '%you@gmail.com%'  
</code></pre>

<p>The email itself was actually parameterized in Django/pyodbc, so it would be sent more like the following pseudocode:</p>

<pre><code>params = ('you@gmail.com',)  
query = "SELECT TOP 50 * FROM Customer WHERE email LIKE '%' + %s + '%'"  
rows = cursor.execute(query, params)  
</code></pre>

<p>It was strange because I'd try hardcoding the parameters in the query so it would be more like the top one and the performance was fine. Thus it had to be something with the parameters.</p>

<p>And it was! A Google query led me to some suspicious-looking unicode talk in the pyodbc wiki. My database is encoded as something like latin1, while pyodbc sends everything as unicode by default. Meaning that, db-side, the latin1 columns were being implicitly cast to unicode to match the parameters, which kills the performance of the indexes. <a href="https://github.com/mkleehammer/pyodbc/wiki/Unicode">This article</a> pretty much gives the fix. Essentially:</p>

<blockquote>
  <p>Check your SQL Server collation using:</p>
  
  <p><code>select serverproperty('collation')</code></p>
  
  <p>If it is something like "SQL_Latin1_General_CP1_CI_AS" and you want str results, you may try:</p>
</blockquote>

<pre><code>cnxn.setdecoding(pyodbc.SQL_CHAR, encoding='latin1', to=str)  
cnxn.setencoding(str, encoding='latin1')  
</code></pre>

<p>And that was it, really. But how to apply that to django-pyodbc-azure? I ended up overriding the db engine. Django expects you to put this as <code>base.py</code> in its own package.</p>

<pre><code># utils.db_engine.base

from datetime import datetime

import pyodbc  
from sql_server.pyodbc.base import DatabaseWrapper as PyodbcDatabaseWrapper


# Inspired by https://github.com/michiya/django-pyodbc-azure/issues/160

class DatabaseWrapper(PyodbcDatabaseWrapper):

    def get_new_connection(self, conn_params):
        """
        Need to do this because pyodbc sends query parameters as unicode by default,
        whereas our server is latin1.

        Refs:
        [1]: https://github.com/mkleehammer/pyodbc/wiki/Unicode
        [2]: https://github.com/mkleehammer/pyodbc/issues/376
        [3]: https://github.com/mkleehammer/pyodbc/issues/112
        """
        connection = super().get_new_connection(conn_params)
        connection.setdecoding(pyodbc.SQL_CHAR, encoding='latin1')
        connection.setencoding(encoding='latin1')

        return connection
</code></pre>

<p>And then change your settings</p>

<pre><code># settings.py

DATABASES = {  
    'default': {
        'ENGINE': 'utils.db_engine',
        ...
    }
}
</code></pre>

<p>And that was pretty good. That brought lots of queries down to milliseconds. But a few of them were still disturbingly high - particularly the ones without many results. Some of those still took about 3 seconds.</p>

<p>Can you guess what the issue is?</p>

<p>Let's look at the query again: </p>

<pre><code>SELECT TOP 50 * FROM Customer WHERE email LIKE '%you@gmail.com%'  
</code></pre>

<p>The leading wildcard essentially turns the query into a full-text search, which also kills the index: it turns the seek into a scan, which isn't nearly as good. There aren't a lot of great options for this in SQL Server. You can create a full-text index on all of the fields where you'll be doing these types of searches. In my case, I decided that it's probably fine to just do a startswith query and drop the leading wildcard. In Django Rest Framework, this was pretty easy. In my viewset, I just added a caret (<code>^</code>) to the beginning of my <code>search_fields</code>.</p>

<pre><code>class CustomerViewSet(viewsets.ReadOnlyModelViewSet):

    search_fields = ['^email', '^first_name', '^last_name']
</code></pre>
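<p>For the curious, here's roughly how DRF's <code>SearchFilter</code> maps those prefixes to ORM lookups. This is a simplified sketch of its <code>construct_search</code>, not the real implementation:</p>

```python
# Simplified sketch of rest_framework.filters.SearchFilter's prefix handling.
# '^' is the one that saves us: istartswith has no leading wildcard, so the
# index can still seek.
PREFIX_LOOKUPS = {
    '^': 'istartswith',
    '=': 'iexact',
    '@': 'search',
    '$': 'iregex',
}

def construct_search(field_name):
    lookup = PREFIX_LOOKUPS.get(field_name[0])
    if lookup:
        return '{}__{}'.format(field_name[1:], lookup)
    # default: icontains, i.e. the slow LIKE '%...%' from before
    return '{}__icontains'.format(field_name)

print(construct_search('^email'))  # email__istartswith
print(construct_search('email'))   # email__icontains
```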

<p>And then I was happy.</p>]]></content:encoded></item><item><title><![CDATA[Django in Azure Web Apps - too many redirects]]></title><description><![CDATA[<p>I've been testing out Azure Web App services lately so I can avoid all of the reverse proxy/server management mumbo jumbo. </p>

<p>When I was deploying my Django app, I hit this issue. I'd try to go to the site in Chrome, but got a "Too many redirects" error. I</p>]]></description><link>http://104.236.78.148/2019/10/25/django-in-azure-web-apps-too-many-redirects/</link><guid isPermaLink="false">71637d68-bba5-405f-a911-781fe12653a8</guid><dc:creator><![CDATA[Kyle Valade]]></dc:creator><pubDate>Fri, 25 Oct 2019 18:10:35 GMT</pubDate><content:encoded><![CDATA[<p>I've been testing out Azure Web App services lately so I can avoid all of the reverse proxy/server management mumbo jumbo. </p>

<p>When I was deploying my Django app, I hit this issue. I'd try to go to the site in Chrome, but got a "Too many redirects" error. I tried turning off HTTPS redirection in the Azure portal, but to no avail. Nothing in my Sentry logs. The other logs all looked fine, too.</p>

<p>Finally, after like 2 hours of re-looking at my logs, googling, and blaming gunicorn, I looked back at my django settings. Turns out I had done some premature securing of the website and had added <code>SECURE_SSL_REDIRECT = True</code>. So, of course, that was the issue. As soon as I switched that to <code>False</code> and redeployed, the issue disappeared.</p>
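<p>In settings terms the fix is a one-liner. Here's a sketch; the env-var toggle and its name are my own convention, not anything Azure provides:</p>

```python
# settings.py (sketch). Azure's front end already handles the HTTPS redirect,
# so Django shouldn't redirect again. DJANGO_SSL_REDIRECT is a made-up env
# toggle for deployments where Django *should* do the redirect itself.
import os

SECURE_SSL_REDIRECT = os.environ.get('DJANGO_SSL_REDIRECT', '') == '1'
```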

<p>There you have it. Make sure that <code>SECURE_SSL_REDIRECT = False</code>. Azure can do all that for you, anyways.</p>]]></content:encoded></item><item><title><![CDATA[Working with SQL Server in Django pt. II]]></title><description><![CDATA[<p>This is a follow-up and brutal takedown of a post I wrote about two years ago. <a href="http://nightlyclosures.com/2018/01/08/working-with-mssql-in-django/">Go here for part I</a>.</p>

<p>I'm starting to use Django again lately and am integrating with our data warehouse, contrary to my own apparent recommendation against that from a couple of years ago. None</p>]]></description><link>http://104.236.78.148/2019/10/16/working-with-unmanaged-sql-server-in-django-pt-ii/</link><guid isPermaLink="false">1c463a02-3b2f-4557-ab7a-95b7190dd656</guid><dc:creator><![CDATA[Kyle Valade]]></dc:creator><pubDate>Wed, 16 Oct 2019 14:39:24 GMT</pubDate><content:encoded><![CDATA[<p>This is a follow-up and brutal takedown of a post I wrote about two years ago. <a href="http://nightlyclosures.com/2018/01/08/working-with-mssql-in-django/">Go here for part I</a>.</p>

<p>I'm starting to use Django again lately and am integrating with our data warehouse, contrary to my own apparent recommendation against that from a couple of years ago. None of this is in production yet, mind you. Just like before. But this time, development is going a lot smoother.</p>

<h2 id="lastepisode">Last episode...</h2>

<p>What were the issues we were facing before?</p>

<ul>
<li>We have two databases - the app database and an external one, which is our data warehouse. The data warehouse is what we'll be focusing on here.</li>
<li>our data warehouse is SQL Server, which isn't supported by Django out of the box
<ul><li>Using SQL server itself isn't too bad because there's a decent package to help with that - <a href="https://github.com/michiya/django-pyodbc-azure">django-pyodbc-azure</a> - though it's a little behind. </li></ul></li>
<li>and the database uses schemas, which Django abhors.
<ul><li>This takes some code</li></ul></li>
<li>and the database is unmanaged, which isn't fun for testing and local development
<ul><li>More code to get around this</li></ul></li>
</ul>

<h2 id="thejourney">The journey</h2>

<h3 id="twodatabases">Two databases</h3>

<p>That one's pretty easy. See <a href="https://docs.djangoproject.com/en/2.2/topics/db/multi-db/">Django's docs</a>. The main thing here is that you have to create a Database Routing class, which is just a class with a few methods. It would extend an interface if this were any other language, but in this case it doesn't even have to do that.</p>

<pre><code># database_router.py

from django.conf import settings


class DatabaseRouter:

    def db_for_read(self, model, **hints):
        return getattr(model, 'database', 'default')

    def db_for_write(self, model, **hints):
        return getattr(model, 'database', 'default')

    def allow_relation(self, obj1, obj2, **hints):
        return True

    def allow_migrate(self, db, app_label, model_name=None, **hints):
        if db == 'default':
            return True

        return settings.ALLOW_WAREHOUSE_DB_MIGRATION
</code></pre>

<p>And add your router to your settings. We'll look at your db settings next.</p>

<pre><code># settings.py

DATABASE_ROUTERS = ['utils.database_routers.DatabaseRouter']  
</code></pre>

<p>The only trick here was to determine which database each model maps to, as I couldn't find an attribute to use by default. So I created one. Initially, I just used a static attribute in each model, like so:</p>

<pre><code># models.py

class MyModel(models.Model):  
    database = 'data_warehouse'

    ...
</code></pre>

<p>That's a bit of a pain to remember to do and gets fixed up later on, but the router still needs that attribute.</p>
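<p>To see the <code>getattr</code> fallback in action, here's a toy run with stand-in model classes (plain classes work fine, since the router never touches Django itself for this part):</p>

```python
# Stand-in classes demonstrating the router's getattr fallback.
class DatabaseRouter:
    def db_for_read(self, model, **hints):
        return getattr(model, 'database', 'default')

    def db_for_write(self, model, **hints):
        return getattr(model, 'database', 'default')


class AppModel:            # no `database` attribute -> falls back to 'default'
    pass


class WarehouseModel:      # opted in to the warehouse database
    database = 'data_warehouse'


router = DatabaseRouter()
print(router.db_for_read(AppModel))        # default
print(router.db_for_read(WarehouseModel))  # data_warehouse
```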

<h3 id="djangosqlserver">Django + SQL server</h3>

<p>Again, <a href="https://github.com/michiya/django-pyodbc-azure">django-pyodbc-azure</a>.</p>

<p><code>pip install django-pyodbc-azure</code></p>

<pre><code># settings.py

DATABASES = {  
    'default': {
        'ENGINE': 'django.db.backends.sqlite3',
        'NAME': os.path.join(BASE_DIR, 'db.sqlite3'),
    },
    'data_warehouse': {
        'ENGINE': 'sql_server.pyodbc',
        'NAME': 'DataWarehouse',
        'HOST': '192.168.1.1',
        'PORT': '1433',
        'USER': config.get('database', 'username'),
        'PASSWORD': config.get('database', 'password'),
        'AUTOCOMMIT': False,  # Just cuz I treat this as read-only
        'OPTIONS': {
            'driver': 'ODBC Driver 13 for SQL Server',
            'host_is_server': True,
        }
    }
}
</code></pre>

<h3 id="creatingamodel">Creating a model</h3>

<p>So now you're about ready to map out some models to your data warehouse, right?</p>

<pre><code># models.py

class Customer(models.Model):

    database = 'data_warehouse'

    id = models.AutoField(primary_key=True, db_column='Id')
    name = models.CharField(max_length=128, db_column='FullName')

    class Meta:
        managed = False
        db_table = ...?
</code></pre>

<p>Hint: remember to use <code>AutoField</code> for the primary key (if it's your standard auto-incrementing pk) - otherwise when you do <code>Customer.objects.create()</code>, the id won't auto-populate and you'll waste two hours trying to get your tests to work.</p>

<p>Anyways...by golly, that table is in a schema, how do we specify the schema we want?</p>

<p>That's an ugly bit, but manageable. It involves a bit of SQL injection. But...the good kind...?</p>

<p><code>db_table = '&lt;schema name&gt;].[&lt;table name&gt;'</code></p>

<p>so our model's Meta class would be...</p>

<pre><code>class Customer(models.Model):

    class Meta:
        managed = False
        db_table = 'B2C].[Customer'
</code></pre>
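<p>If you're wondering why that works: the SQL Server backend quotes <code>db_table</code> by wrapping it in square brackets, so the injected <code>].[</code> splits the quoted name into a schema part and a table part. A quick sketch of the quoting (my simplified stand-in for the backend's <code>quote_name</code>):</p>

```python
# Simplified stand-in for the SQL Server backend's identifier quoting.
def quote_name(name):
    return '[{}]'.format(name)

# The injected '].[' turns one bracketed name into a two-part name.
print(quote_name('B2C].[Customer'))  # [B2C].[Customer]
print(quote_name('Plain'))           # [Plain]
```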

<p>See? Not too bad. Not <em>fun</em>. But not too bad.</p>

<h3 id="regroup">Regroup!</h3>

<p>Cool, so now we can read from our database. And that's fine, if we have a pre-populated database to read from. But what if we want to run some tests? </p>

<p>Well...the database is unmanaged, so migrations won't be applied to our test db. Fine, so turn off migrations and have django do the equivalent of a sync db. Cool, but our models are still unmanaged so we'll run into issues.</p>

<p>What else? Oh yeah, there's still that ugly stuff we have to do for each data warehouse model. And there are a lot of such models.</p>

<h3 id="testing">Testing</h3>

<p>This ended up being fine. Just some code.</p>

<p>First, we'll disable migrations. Turns out it's not as easy as <code>SOUTH_DISABLE_MIGRATIONS = True</code> anymore, but more dynamic, I guess. So, in your settings or wherever:</p>

<pre><code># settings.py

class DisableMigrations(object):

    def __contains__(self, item):
        return True

    def __getitem__(self, item):
        return None


MIGRATION_MODULES = DisableMigrations()  
</code></pre>

<p>Easy? Easy.</p>

<p>To tell Django that you want your model to be managed during tests, you just need to write a custom TestRunner that goes through all of your models, checks their meta, and changes the <code>managed</code> attribute if necessary. I really just copied/pasted this code from <a href="https://gist.github.com/NotSqrt/5f3c76cd15e40ef62d09">here</a> or <a href="https://gist.github.com/raprasad/f292f94657728de45d1614a741928308">here</a></p>

<pre><code># settings.py

class UnManagedModelTestRunner(DiscoverRunner):  
    '''
    Test runner that automatically makes all unmanaged models in your Django
    project managed for the duration of the test run.
    Many thanks to the Caktus Group: http://bit.ly/1N8TcHW
    '''

    def setup_test_environment(self, *args, **kwargs):
        from django.apps import apps
        self.unmanaged_models = [m for m in apps.get_models() if not m._meta.managed]

        for m in self.unmanaged_models:
            m._meta.managed = True

        super(UnManagedModelTestRunner, self).setup_test_environment(*args, **kwargs)

    def teardown_test_environment(self, *args, **kwargs):
        super(UnManagedModelTestRunner, self).teardown_test_environment(*args, **kwargs)
        for m in self.unmanaged_models:
            m._meta.managed = False


TEST_RUNNER = 'my_project.settings.UnManagedModelTestRunner'  
</code></pre>

<p>Cool, so now you can run your tests and save your test objects to the database. The rest is really just icing.</p>

<h3 id="abstracttheboilerplate">Abstract the boilerplate</h3>

<p>This part was kind of fun because I had never gotten to mess around with meta classes before. Essentially, I wanted to add a <code>schema</code> attribute to Meta, and then have the base class do the sql injection for me. Then that garbage isn't littered all around the code. Also not having to specify the database would be nice.</p>

<p>I ended up creating a <code>DataWarehouseModel</code> that extends <code>model.Model</code> and takes care of all that for me.</p>

<pre><code>from django.db.models.base import ModelBase  
from django.db import models  
from django.conf import settings


DEFAULT_DB_FORMAT = '{schema}__{table}'

# Ugly string injection hack so that we can access the table under the schema
# See: http://kozelj.org/django-1-6-mssql-and-schemas/
SQL_DB_FORMAT = '{schema}].[{table}'


class DataWarehouseMeta(ModelBase):

    def __new__(typ, name, bases, attrs, **kwargs):
        super_new = super().__new__

        # Also ensure initialization is only performed for subclasses of Model
        # (excluding Model class itself).
        parents = [b for b in bases if isinstance(b, DataWarehouseMeta)]
        if not parents:
            return super_new(typ, name, bases, attrs)

        meta = attrs.get('Meta', None)
        if not meta:
            meta = super_new(typ, name, bases, attrs, **kwargs).Meta

        # ignore abstract models
        is_abstract = getattr(meta, 'abstract', False)
        if is_abstract:
            return super_new(typ, name, bases, attrs, **kwargs)

        # Ensure table is unmanaged unless explicitly set
        is_managed = getattr(meta, 'managed', False)
        meta.managed = is_managed

        # SQL injection garbage
        meta.db_table = typ.format_db_table(bases, meta)

        # Delete my custom attributes so the Meta validation will let the server run
        del meta.warehouse_schema
        del meta.warehouse_table

        attrs['Meta'] = meta
        return super().__new__(typ, name, bases, attrs, **kwargs)

    @classmethod
    def format_db_table(cls, bases, meta):
        table_format = DEFAULT_DB_FORMAT

        model_database = bases[0].database
        db_settings = settings.DATABASES.get(model_database)
        engine = db_settings['ENGINE']

        if engine == 'sql_server.pyodbc':
            table_format = SQL_DB_FORMAT

        return table_format.format(
            schema=meta.warehouse_schema,
            table=meta.warehouse_table
        )


class DataWarehouseModel(models.Model, metaclass=DataWarehouseMeta):

    database = 'data_warehouse'

    class Meta:
        abstract = True
</code></pre>

<p>And that class is used like so:</p>

<pre><code># models.py

class Customer(DataWarehouseModel):

    id = models.IntegerField(...)

    class Meta:
        warehouse_schema = 'B2C'
        warehouse_table = 'Customer'
</code></pre>

<p>And that's it.</p>

<p>What <code>DataWarehouseMeta</code> is doing is:</p>

<ul>
<li>setting <code>managed = False</code> by default</li>
<li>generating <code>db_table</code> from our custom attributes (<code>warehouse_schema</code> + <code>warehouse_table</code>)</li>
<li>setting <code>db_table</code> to what will be passed to the super class - Django's Meta implementation</li>
<li>deleting the custom attributes, since Django makes sure there aren't any extra attributes lying around its Meta class</li>
</ul>

<p>This <a href="https://stackoverflow.com/questions/725913/dynamic-meta-attributes-for-django-models/727956#727956">StackOverflow answer</a> was a lot of help.</p>

<h2 id="summary">Summary</h2>

<p>So that's it! Nothing a little elbow grease can't fix. I now officially endorse using Django with SQL Server - even complex ones with schemas. The caveat is that we are relying on django-pyodbc-azure, which is a couple of versions behind.</p>

<p>If I get a basic example repo going on GitHub, I'll link it here later.</p>]]></content:encoded></item><item><title><![CDATA[Calculate grouped YTD totals for previous years in Pandas]]></title><description><![CDATA[<p>I want to calculate the YTD total for the last couple years for every customer + product combination that has been sold. I'm new to Pandas and actually spent kind of a long time on this problem, but the solution turned out to be pretty simple. Assume today is August 1,</p>]]></description><link>http://104.236.78.148/2019/08/01/calculate-grouped-ytd-totals-for-previous-years-in-pandas/</link><guid isPermaLink="false">29d2fc1d-3569-4826-a7ca-d426a90fa1d7</guid><dc:creator><![CDATA[Kyle Valade]]></dc:creator><pubDate>Thu, 01 Aug 2019 13:28:08 GMT</pubDate><content:encoded><![CDATA[<p>I want to calculate the YTD total for the last couple years for every customer + product combination that has been sold. I'm new to Pandas and actually spent kind of a long time on this problem, but the solution turned out to be pretty simple. Assume today is August 1, 2019.</p>

<p>I have data that looks like...</p>

<pre><code>|   | OrderDate  | Customer | Product | OrderAmount |
|---|------------|----------|---------|-------------|
| 1 | 2018-02-10 | 1        | 10      | 10.00       |
| 2 | 2018-05-11 | 2        | 11      | 5.00        |
| 3 | 2018-09-10 | 1        | 10      | 10.00       |  # Don't include in YTD!
</code></pre>

<p>At the end, I want a dataframe that looks something like this:</p>

<pre><code>|   | Customer | Product | 2018_total | 2017_total |
|---|----------|---------|------------|------------|
| 1 | 1        | 10      | 10.00      | 0          |
| 2 | 2        | 11      | 5.00       | 0          |
</code></pre>

<p>And it has to be performant because there's a lot of data. So <code>iterrows</code> is out, as is <code>groupby().apply()</code>, because that thing is ungodly slow (it was taking real <em>seconds</em> per group).</p>

<p>What I ended up doing was creating a year column (I cheated and got it from the DB), copying the columns that I wanted to index into new columns (probably cuz I'm a noob), and then just doing a <code>df.query().groupby().sum()</code> into a new column.</p>

<p>Now obviously you don't <em>need</em> a year - you could just do a <code>x &lt; y &lt; z</code>, but the year helped for other things, so it's staying, dammit.</p>

<p>So now our dataset looks like...</p>

<pre><code>| Index(Customer/Product) | OrderDate  | Customer | Product | OrderAmount | Year |
|-------------------------|------------|----------|---------|-------------|------|
| 1/10                    | 2018-02-10 | 1        | 10      | 10.00       | 2018 |
| 2/11                    | 2018-05-11 | 2        | 11      | 5.00        | 2018 |
| 1/10                    | 2018-09-10 | 1        | 10      | 10.00       | 2018 |
</code></pre>

<p>The below code shows how to do it all...</p>

<pre><code>df['CustomerKeyIndex'] = df['CustomerKey']  
df['ProductKeyIndex'] = df['ProductKey']  
df = df.set_index(['CustomerKeyIndex', 'ProductKeyIndex'])

query = 'Year == 2018 and OrderDate &lt;= "2018-08-01"'  
df['2018_YTD'] = df.query(query) \  
    .groupby(['CustomerKey', 'ProductKey'])['OrderAmount'] \
    .sum()

df = df[~df.index.duplicated(keep='first')]  # To get only a single Customer/Product combo  
</code></pre>

<p>Repeat for any other years you're looking for.</p>
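<p>That repeat step wraps neatly in a loop. Here's a self-contained sketch on a toy frame shaped like the sample data above (column names as in the post; the August 1 cutoff hardcoded):</p>

```python
import pandas as pd

# Toy frame shaped like the post's order data
df = pd.DataFrame({
    'CustomerKey': [1, 2, 1],
    'ProductKey': [10, 11, 10],
    'OrderDate': pd.to_datetime(['2018-02-10', '2018-05-11', '2018-09-10']),
    'OrderAmount': [10.00, 5.00, 10.00],
})
df['Year'] = df['OrderDate'].dt.year

# Same index trick as above: copy the keys so they survive as columns
df['CustomerKeyIndex'] = df['CustomerKey']
df['ProductKeyIndex'] = df['ProductKey']
df = df.set_index(['CustomerKeyIndex', 'ProductKeyIndex'])

# One YTD column per year, same query/groupby/sum pattern as above
for year in (2017, 2018):
    query = 'Year == {y} and OrderDate <= "{y}-08-01"'.format(y=year)
    df['{}_YTD'.format(year)] = (
        df.query(query)
          .groupby(['CustomerKey', 'ProductKey'])['OrderAmount']
          .sum()
    )

# One row per Customer/Product combo; years with no orders become 0
totals = df[~df.index.duplicated(keep='first')].copy()
for col in ('2017_YTD', '2018_YTD'):
    totals[col] = totals[col].fillna(0)
```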

<p>And that actually takes just a few seconds across a few million rows. I'm sure there are other ways of doing it (e.g. time series lags across one year), but they seemed a bit more complicated and this was quick enough and fairly straightforward.</p>]]></content:encoded></item><item><title><![CDATA[Import a directory into Databricks using the Workspace API in Python]]></title><description><![CDATA[<p>Another fairly easy thing that I couldn't find in <a href="https://docs.azuredatabricks.net/api/latest/examples.html#import-a-notebook-or-directory">the docs</a>. I wanted to be able to upload a directory into my Databricks Workspace from my CI server so I could test the current branch.</p>

<p>Luckily enough, the <a href="https://github.com/databricks/databricks-cli">databricks-cli</a> library was written in Python, so we can just use that.</p>]]></description><link>http://104.236.78.148/2019/06/07/import-a-directory-into-databricks-using-the-workspace-api/</link><guid isPermaLink="false">4c655811-03c8-4ea2-a212-1ec56d750c65</guid><dc:creator><![CDATA[Kyle Valade]]></dc:creator><pubDate>Fri, 07 Jun 2019 17:52:59 GMT</pubDate><content:encoded><![CDATA[<p>Another fairly easy thing that I couldn't find in <a href="https://docs.azuredatabricks.net/api/latest/examples.html#import-a-notebook-or-directory">the docs</a>. I wanted to be able to upload a directory into my Databricks Workspace from my CI server so I could test the current branch.</p>

<p>Luckily enough, the <a href="https://github.com/databricks/databricks-cli">databricks-cli</a> library was written in Python, so we can just use that. But first you'll need to generate a token for yourself to use in the API. Of course, you need to follow <a href="https://docs.databricks.com/api/latest/authentication.html">the instructions</a> to be able to use the API in the first place, but from there it's pretty straightforward.</p>

<pre><code class="language-python">from databricks_cli.workspace.api import WorkspaceApi  
from databricks_cli.sdk.api_client import ApiClient


client = ApiClient(  
    host='https://your.databricks-url.net',
    token=api_key
)
workspace_api = WorkspaceApi(client)  
workspace_api.import_workspace_dir(  
    source_path=base_path,
    target_path="/Users/user@example.com/MyFolder",
    overwrite=True,
    exclude_hidden_files=True
)
</code></pre>]]></content:encoded></item><item><title><![CDATA[Read from SQL Server with Python/Pyspark in Databricks]]></title><description><![CDATA[<p>This is actually really easy, but not something spelled out explicitly in the <a href="https://docs.databricks.com/spark/latest/data-sources/sql-databases-azure.html">Databricks docs</a>, though it is mentioned in the <a href="https://spark.apache.org/docs/latest/sql-data-sources-jdbc.html">Spark docs</a>. Alas, SQL server always seems like it's a special case, so I tend to discount things unless they mention SQL server explicitly. Not this time!</p>

<p>I'm guessing</p>]]></description><link>http://104.236.78.148/2019/06/05/read-from-sql-server-with-python-in-databricks-2/</link><guid isPermaLink="false">583d1e98-c116-418a-bfca-f901ba4af840</guid><dc:creator><![CDATA[Kyle Valade]]></dc:creator><pubDate>Wed, 05 Jun 2019 13:39:35 GMT</pubDate><content:encoded><![CDATA[<p>This is actually really easy, but not something spelled out explicitly in the <a href="https://docs.databricks.com/spark/latest/data-sources/sql-databases-azure.html">Databricks docs</a>, though it is mentioned in the <a href="https://spark.apache.org/docs/latest/sql-data-sources-jdbc.html">Spark docs</a>. Alas, SQL server always seems like it's a special case, so I tend to discount things unless they mention SQL server explicitly. Not this time!</p>

<p>I'm guessing it's about as easy outside of Databricks if you're just running Pyspark. You'll just need to follow <a href="https://docs.azuredatabricks.net/spark/latest/data-sources/sql-databases-azure.html#azure-db">these docs</a> and install the proper library - <code>com.microsoft.azure:azure-sqldb-spark</code>.</p>

<p>But if you're using Databricks already, you don't even have to do that. Admittedly the performance isn't great, but that could be due to a thousand other factors that I have not yet looked into.</p>

<pre><code class="language-python">jdbc_url = "jdbc:sqlserver://{host}:{port};database={database};user={user};password={password};UseNTLMv2=true".format(  
  host=host, port=port, database=database, 
  user=dbutils.secrets.get('database', 'username'), 
  password=dbutils.secrets.get('database', 'password'))

df = (spark.read  
  .format("jdbc")
  .option("url", jdbc_url)
  .option("dbTable", "MyTable")
  .load()
)

display(df)  
</code></pre>]]></content:encoded></item></channel></rss>