Calculate grouped YTD totals for previous years in Pandas
I want to calculate the YTD total for the last couple years for every customer + product combination that has been sold. I'm new to Pandas and actually spent kind of a long time on this problem, but the solution turned out to be pretty simple. Assume today is August 1, 2019.
I have data that looks like...
| | OrderDate | Customer | Product | OrderAmount |
|---|------------|----------|---------|-------------|
| 1 | 2018-02-10 | 1 | 10 | 10.00 |
| 2 | 2018-05-11 | 2 | 11 | 5.00 |
| 3 | 2018-09-10 | 1 | 10 | 10.00 |

(Row 3 is dated after August 1, so it shouldn't be included in the YTD total.)
At the end, I want a dataframe that looks something like this:
| | Customer | Product | 2018_total | 2017_total |
|---|----------|---------|------------|------------|
| 1 | 1 | 10 | 10.00 | 0 |
| 2 | 2 | 11 | 5.00 | 0 |
And it has to be performant because there's a lot of data. So `iterrows` is out, as is `groupby().apply()`, because that thing is ungodly slow (it was taking real seconds per group).
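To make the difference concrete, here is a minimal sketch on made-up rows that mirror the sample table above. The built-in `sum()` aggregation handles all groups in one vectorized operation, while `apply()` calls a Python function once per group. Both give the same totals here; the timing difference only shows up at scale.

```python
import pandas as pd

# Sample rows mirroring the table above (illustrative only).
df = pd.DataFrame({
    "OrderDate": pd.to_datetime(["2018-02-10", "2018-05-11", "2018-09-10"]),
    "Customer": [1, 2, 1],
    "Product": [10, 11, 10],
    "OrderAmount": [10.0, 5.0, 10.0],
})

grouped = df.groupby(["Customer", "Product"])["OrderAmount"]

# Built-in aggregation: computed for all groups at once in optimized code.
fast = grouped.sum()

# apply(): runs the Python lambda once per group, which adds up quickly
# when there are many Customer/Product combinations.
slow = grouped.apply(lambda amounts: amounts.sum())
```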
What I ended up doing was creating a year column (I cheated and got it from the DB), copying the columns that I wanted to index into new columns (probably cuz I'm a noob), and then just doing a `df.query().groupby().sum()` into a new column.
Now obviously you don't need a year - you could just do an `x < y < z` range check on the order date, but the year helped for other things, so it's staying, dammit.
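If the year doesn't come from the database, both options can be sketched in pandas itself. This assumes `OrderDate` is (or can be converted to) a datetime column; the sample rows are made up to match the table above.

```python
import pandas as pd

# Sample rows mirroring the table above (illustrative only).
df = pd.DataFrame({
    "OrderDate": pd.to_datetime(["2018-02-10", "2018-05-11", "2018-09-10"]),
    "Customer": [1, 2, 1],
    "Product": [10, 11, 10],
    "OrderAmount": [10.0, 5.0, 10.0],
})

# Option 1: derive the Year column from the order date.
df["Year"] = df["OrderDate"].dt.year

# Option 2: skip the Year column and filter on a date range instead.
# between() is inclusive on both ends by default.
start = pd.Timestamp("2018-01-01")
cutoff = pd.Timestamp("2018-08-01")
ytd_2018 = df[df["OrderDate"].between(start, cutoff)]
```

Here `ytd_2018` keeps the February and May orders and drops the September one.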
So now our dataset looks like...
| Index(Customer/Product) | OrderDate | Customer | Product | OrderAmount | Year |
|-------------------------|------------|----------|---------|-------------|------|
| 1/10 | 2018-02-10 | 1 | 10 | 10.00 | 2018 |
| 2/11 | 2018-05-11 | 2 | 11 | 5.00 | 2018 |
| 1/10 | 2018-09-10 | 1 | 10 | 10.00 | 2018 |
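The code below shows how to do it all. It uses the column names CustomerKey and ProductKey, which correspond to Customer and Product in the tables above, and it names the result column 2018_YTD rather than 2018_total.

    # Copy the key columns so they can serve as the index while the
    # originals stay available as regular columns for groupby().
    df['CustomerKeyIndex'] = df['CustomerKey']
    df['ProductKeyIndex'] = df['ProductKey']
    df = df.set_index(['CustomerKeyIndex', 'ProductKeyIndex'])

    # Sum the 2018 orders up to the cutoff date per Customer/Product combo.
    # The assignment aligns the grouped totals back onto df's index.
    query = 'Year == 2018 and OrderDate <= "2018-08-01"'
    df['2018_YTD'] = df.query(query) \
        .groupby(['CustomerKey', 'ProductKey'])['OrderAmount'] \
        .sum()

    df = df[~df.index.duplicated(keep='first')]  # To get only a single Customer/Product combo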
Repeat for any other years you're looking for.
And that actually takes just a few seconds across a few million rows. I'm sure there are other ways of doing it (e.g. time series lags across one year), but they seemed a bit more complicated, and this was quick enough and fairly straightforward.
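One of those other ways, sketched here as an untested-at-scale variant rather than what I actually ran: instead of repeating the query per year, compute a cutoff date for each year, filter once, and pivot the years into columns with `groupby().sum().unstack()`. The `years` list, the `cutoffs` dict, and the sample rows are assumptions for illustration.

```python
import pandas as pd

# Sample rows mirroring the tables above (illustrative only).
df = pd.DataFrame({
    "OrderDate": pd.to_datetime(["2018-02-10", "2018-05-11", "2018-09-10"]),
    "CustomerKey": [1, 2, 1],
    "ProductKey": [10, 11, 10],
    "OrderAmount": [10.0, 5.0, 10.0],
})
df["Year"] = df["OrderDate"].dt.year

today = pd.Timestamp("2019-08-01")  # "today" as assumed in this post
years = [2018, 2017]
keys = ["CustomerKey", "ProductKey"]

# One cutoff per year: the same month and day as today.
# (A Feb 29 "today" would need special handling.)
cutoffs = {year: today.replace(year=year) for year in years}

# Rows from years not listed in `years` map to NaT and compare as False.
in_ytd = df["OrderDate"] <= df["Year"].map(cutoffs)

ytd = (
    df[in_ytd]
    .groupby(keys + ["Year"])["OrderAmount"]
    .sum()
    .unstack("Year", fill_value=0.0)         # one column per year
    .reindex(columns=years, fill_value=0.0)  # add years with no sales
)
ytd.columns = [f"{year}_YTD" for year in years]
ytd = ytd.reset_index()
```

On the sample rows this gives one row per Customer/Product combo with `2018_YTD` of 10.0 and 5.0 and `2017_YTD` of 0.0. Unlike the approach above, combos with no orders inside any YTD window are dropped instead of kept with empty totals.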