Calculate grouped YTD totals for previous years in Pandas

I want to calculate the YTD total for the last couple of years for every customer + product combination that has been sold. Here, YTD for a previous year means the total from January 1 of that year through the same calendar date as today. I'm new to Pandas and actually spent kind of a long time on this problem, but the solution turned out to be pretty simple. Assume today is August 1, 2019.

I have data that looks like...

|   | OrderDate  | Customer | Product | OrderAmount |
|---|------------|----------|---------|-------------|
| 1 | 2018-02-10 | 1        | 10      | 10.00       |
| 2 | 2018-05-11 | 2        | 11      | 5.00        |
| 3 | 2018-09-10 | 1        | 10      | 10.00       |

Row 3 is dated after August 1, so it must not be included in the 2018 YTD total.

At the end, I want a dataframe that looks something like this:

|   | Customer | Product | 2018_total | 2017_total |
|---|----------|---------|------------|------------|
| 1 | 1        | 10      | 10.00      | 0          |
| 2 | 2        | 11      | 5.00       | 0          |

And it has to be performant because there's a lot of data. So iterrows is out, as is groupby().apply(), because that thing is ungodly slow (it was taking real seconds per group).

What I ended up doing was creating a year column (I cheated and got it from the DB), copying the columns that I wanted to index on into new columns (probably because I'm a noob), and then just doing a df.query().groupby().sum() into a new column.

Now obviously you don't need a year column. You could just do a range check on the date (start <= OrderDate <= end), but the year helped for other things, so it's staying, dammit.
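If the year can't come from the DB, it can be derived in Pandas instead. This is a minimal sketch, assuming OrderDate parses cleanly as a date. The sample rows mirror the table above, and the variable name in_2018_ytd is just for illustration.

```python
import pandas as pd

# Sample rows mirroring the table above (illustrative data only).
df = pd.DataFrame({
    "OrderDate": ["2018-02-10", "2018-05-11", "2018-09-10"],
    "Customer": [1, 2, 1],
    "Product": [10, 11, 10],
    "OrderAmount": [10.00, 5.00, 10.00],
})

# Parse the dates, then pull the year out with the .dt accessor.
df["OrderDate"] = pd.to_datetime(df["OrderDate"])
df["Year"] = df["OrderDate"].dt.year

# The year-free alternative: a plain range check on the date
# (both ends are inclusive by default).
in_2018_ytd = df["OrderDate"].between(
    pd.Timestamp("2018-01-01"), pd.Timestamp("2018-08-01")
)
```

The boolean mask can be used directly as df[in_2018_ytd] in place of a query string.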

So now our dataset looks like...

| Index(Customer/Product) | OrderDate  | Customer | Product | OrderAmount | Year |
|-------------------------|------------|----------|---------|-------------|------|
| 1/10                    | 2018-02-10 | 1        | 10      | 10.00       | 2018 |
| 2/11                    | 2018-05-11 | 2        | 11      | 5.00        | 2018 |
| 1/10                    | 2018-09-10 | 1        | 10      | 10.00       | 2018 |

The code below shows how to do it all:

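df['CustomerKeyIndex'] = df['CustomerKey']  # Copies, so the originals stay as regular columns
df['ProductKeyIndex'] = df['ProductKey']
df = df.set_index(['CustomerKeyIndex', 'ProductKeyIndex'])

query = 'Year == 2018 and OrderDate <= "2018-08-01"'  # 2018 orders through August 1
df['2018_YTD'] = (  # Grouped sums align back onto df via the Customer/Product index
    df.query(query)
    .groupby(['CustomerKey', 'ProductKey'])['OrderAmount']
    .sum()
)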

df = df[~df.index.duplicated(keep='first')]  # To get only a single Customer/Product combo

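Repeat the query and groupby step for any other years you're looking for. Do that before the de-duplication line, since dropping rows first would throw away orders that the later sums need.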
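As a sketch, the repeat can be written as a loop over the years. This is not the exact code from above: it uses the simplified Customer/Product column names and sample rows from the tables, groups on the index levels instead of copied columns, and assumes the same August 1 cutoff for every year. The names ytd and totals are made up for the example.

```python
import pandas as pd

# Sample rows mirroring the tables above (illustrative data only).
df = pd.DataFrame({
    "OrderDate": pd.to_datetime(["2018-02-10", "2018-05-11", "2018-09-10"]),
    "Customer": [1, 2, 1],
    "Product": [10, 11, 10],
    "OrderAmount": [10.00, 5.00, 10.00],
})
df["Year"] = df["OrderDate"].dt.year
df = df.set_index(["Customer", "Product"])

for year in (2018, 2017):
    # Same shape of query string as before, built per year.
    query = f'Year == {year} and OrderDate <= "{year}-08-01"'
    ytd = (
        df.query(query)
        .groupby(level=["Customer", "Product"])["OrderAmount"]
        .sum()
    )
    # Assignment aligns on the (Customer, Product) index. Combos with no
    # orders in the window come back as NaN, so fill those with 0.
    df[f"{year}_YTD"] = ytd
    df[f"{year}_YTD"] = df[f"{year}_YTD"].fillna(0)

# Only now drop the repeated Customer/Product rows and keep the totals.
totals = df.loc[~df.index.duplicated(keep="first"), ["2018_YTD", "2017_YTD"]]
```

Calling totals.reset_index() turns Customer and Product back into ordinary columns, which gives the shape of the target table.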

And that actually takes just a few seconds across a few million rows. I'm sure there are other ways of doing it (e.g. time series lags across one year), but they seemed a bit more complicated and this was quick enough and fairly straightforward.
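For comparison, here is one alternative, sketched as a different technique rather than the method described above: keep only the rows that fall on or before the cutoff within their own year, group by customer, product and year in a single pass, and unstack the year into columns. The sample rows, the hard-coded cutoff, and the list of years are assumptions for illustration.

```python
import pandas as pd

# Sample rows mirroring the tables above (illustrative data only).
df = pd.DataFrame({
    "OrderDate": pd.to_datetime(["2018-02-10", "2018-05-11", "2018-09-10"]),
    "Customer": [1, 2, 1],
    "Product": [10, 11, 10],
    "OrderAmount": [10.00, 5.00, 10.00],
})
df["Year"] = df["OrderDate"].dt.year

cutoff_month, cutoff_day = 8, 1  # "today" is August 1

month = df["OrderDate"].dt.month
day = df["OrderDate"].dt.day
# True when the order falls on or before the cutoff within its own year.
is_ytd = (month < cutoff_month) | ((month == cutoff_month) & (day <= cutoff_day))

totals = (
    df[is_ytd]
    .groupby(["Customer", "Product", "Year"])["OrderAmount"]
    .sum()
    .unstack("Year", fill_value=0)                # one column per year
    .reindex(columns=[2018, 2017], fill_value=0)  # add years with no orders
    .add_suffix("_YTD")
    .reset_index()
)
```

This handles every year in one groupby, at the cost of a slightly less obvious filter than the per-year query string.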
