The Case for Python in Excel

Author:Murphy  |  View: 20879  |  Time: 2025-03-22 21:55:18
Image by Author – Created using DALL-E 3

My first reaction to Microsoft's announcement of Python in Excel (PiE) last year was a positive one. Excel has a dominant presence in the enterprise and Python continues to enjoy popularity as a programming language for data scientists. Bringing these two tools together with the support of both Microsoft and its partner Anaconda is bound to result in something good!

As full disclosure, I am a former Microsoft employee. I am also a devotee of Anaconda's Python ecosystem services and resources. Admittedly, I viewed the PiE announcement through a biased lens.

Upon further reflection, however, I started to question the value of making Python available in Excel. What, if anything, does it bring to the vast majority of happy Excel users? And why would a Python data scientist want to integrate with Excel? Is PiE just a solution looking for a problem to solve?

As Matt Houseley I think rightly points out in his post, PiE can streamline the collaboration process between data scientists and Excel users. PiE can also lead to greatly enhanced data insights for those Excel users. For the Python data scientist, PiE can reduce the time and effort they must spend collaborating with and responding to end users. The streamlined collaboration with better data insights value proposition is, I think, the winning one for PiE.

In this article, I share the results of trying PiE using data that I have used in other research projects. My goal was to gauge the ease of use and effectiveness of PiE using real-life data. I then discuss some potential use cases for PiE and provide a short list of additional features that are needed to make PiE useful in more complex scenarios.

All charts, screen shots, images and code are by the author unless otherwise indicated.

The Test Scenario

The City of New York makes its 311 citizen service request data freely available to the public in order to promote transparency into the workings of the city. The data includes millions of records covering hundreds of different types of service requests. For my test, I wanted to focus on a set of service requests related to living conditions. Putting myself into the persona of a city official responsible for deploying resources for handling living conditions complaints, my goal was to see if I could forecast the number of new requests that the city would receive each day up to 14 days in the future.

Instructions for installing and using Python in Excel can be found here. My Excel workbook for this project, PiE_Test.xlsx, along with additional supporting material can be downloaded from this site.

Downloading the 311 data from the NYC site is straightforward, though it may take a few attempts given the enormous amount of 311 data. Getting the data into the format I needed to support my scenario took a few steps. In its current preview form, PiE can only work with data that is already present in an Excel worksheet. It cannot access any external data.

Excel has a limit of 1,048,576 records per worksheet. The NYC data I downloaded included all services requests up to mid-October, 2023. That amounts to over 34,000,000 records. I had to find a way to get a representative sample of the records that fit within Excel's 1M+ record worksheet limit.

I used the built-in Excel Power Query tool to accomplish the task. The query editor is quite flexible and allowed me to repeatedly skip a given number of records and then take a small number of contiguous records from the downloaded file to include in my excel sheet. While not a true random sample, the end result was a sheet that contained 1M records of a wide variety request types from early July, 2023 through mid-October, 2023. Power Query also allowed me make data type changes, such as changing the type of the date created field from text to an Excel date type.

Using a cross-reference table that I constructed in a separate worksheet, I assigned a "living conditions" high-level category to the subset of 311 request types that I felt a person would think are related to their living conditions. I used a standard Excel VLOOPUP() function to assign the category to the individual request records.

The request types that I selected for "living conditions" include:

<script src="https://gist.github.com/jbraun22/87484d989dbf8b87e1df2c1a04f72c57.js"></script>

A total of 280,313 service request records were assigned to the "living conditions" category. I constructed a simple Excel pivot table over these request records to get the total number of living conditions requests opened by date from early July to mid-October. That produced a sheet of 108 records from July 6, 2023 through October 21, 2023.

Here is an Excel plot of the data:

311 Living Conditions Requests by Date

There was an observation for every date in my selected date range, so I did not need to impute any missing values. As you can see above, there was a spike in requests on October 29. That was due to severe storms that hit New York that day, causing flooding. I treated this as an outlier and used Excel to substitute the mean of October 28th's and October 30th's request counts for the original request count value for October 29.

As an experiment, I compared the results from one of Excel's built-in time-series forecasting models with the results produced using the statsmodels library that is included in the PiE python environment. Would there be a noticeable difference in predictive power?

Starting with Excel, I created a forecast using the FORECAST.ETS() function. This function uses a exponential smoothing algorithm that includes adjustments for seasonal data. As the above plot shows, our data displays a strong weekly seasonal pattern so statistical algorithms that take seasonality into account are likely to be better predictors.

You can view the data and FORECAST.ETS() function I used in the counts_by_date sheet of the PiE_Test.xlsx workbook in the repository for this article. Here is an Excel graph of the results of the FORECAST.ETS() forecast:

14 day prediction of Living Condition service request from Excel FORECAST.ETS() function

Not a bad result. The predicted values follow the seasonal pattern. The mean absolute percent error of the predictions is 10.67%.

Next, I used Python in Excel to generate the same predictions, this time using the SARIMAX model from the STATSMODELS library. SARIMAX is an auto-regressive moving-average statistical algorithm that factors in seasonal patterns. STATSMODELS is one of the Python libraries currently supported by PiE.

To use PiE, you place your Python code in special Python-enabled cells. For readability and ease of testing and debugging, you can break code across multiple cells. They will be executed in a left-to-right then top down order. See the the PiE worksheet in the PiE_Test.xlsx workbook to see the Python cells I created along with the SARIMAX prediction results and the Matplotlib -generated graph.

PiE does not execute any Python code locally on your system. All code is sent to and executed in a Python environment in the Azure cloud. With round-trip network latency, and depending on the size of your data and complexity of Python operations, it can take a few minutes for your Python code to execute.

Python code in cell A4 shown in the Excel edit box

I ran SARIMAX parameter optimization tests on my data in a separate Jupyter notebook (also in the repository) and used the optimal parameters found by those tests in the Python cell for my SARIMAX model. While I could have also included the parameter test code in the workbook, it would have introduced extra recompute time that I did not want to endure.

Here is the forecast generated by SARIMAX:

14 day prediction of Living Conditions request from SARIMAX model

This forecast has the same pattern as Excel's FORECAST.ETS() forecast, following the pattern of actual values well. The mean absolute percentage error of SARIMAX's predictions is 11.00%, a bit higher than FORECAST.ETS()'s MAPE of 10.67%.

An advantage of SARIMAX is that it allows the addition of exogenous features to the model (hence the "X" at the end of SARIMAX). You cannot do that with FORECAST.ETS() or Excel's built-in TREND() linear regression time series prediction model. I would expect that adding data like historical daily temperature and precipitation forecasts as exogenous features could improve the accuracy of SARIMAX, achieving better results that FORECAST.ETS(). Given sklearn is also a library available in PiE, machine learning models such as gradient boosting and random forests could also be tried to see if they produce more accurate predictions. It is these libraries among others in the python ecosystem that would allow data scientists to provide enhanced data insights to Excel end users.

Potential Use Cases

Overall, I found PiE easy to use. My test convinced me that PiE provides value by extending the capabilities of Excel, enabling Excel users to tackle more sophisticated Data Science tasks in a way that does not require them to be proficient in Python.

Just a few use cases for PiE that come to mind include:

  • Creating enhanced models for regression, classification, and time-series forecasts that go beyond the models that can be created in native Excel;
  • Text analysis of string data in Excel using the Gensim NLP library (already included in PiE);
  • Improving data pre-processing; for example, imputing missing values in unbalanced data;
  • Enhanced random sampling of data, included weighted samples of unbalanced data;
  • More nuanced identification of outliers;
  • Improved interactivity in visualizations, beyond Excel's dropdowns and tooltips; and,
  • Improved geographic plotting, for example plotting geocoded points in addition to just plotting geographic areas.

Needed Features

As mentioned earlier, I believe the most compelling business scenario for PiE is helping Python-fluent data scientists provide enhanced insights to an Excel-savvy user base. By incorporating Python routines into Excel workbooks, the data scientist can give Excel users self-service access to a broad range of more complex features and models.

But, to make PiE usable by data scientists under this scenario, more features must be added and some limitations removed. My top three list of feature requests are as follows:

Jeff's Desired Feature Enhancements

Conclusion

I started this article toward the end of 2023. Though there was positive buzz after the announcement of PiE and its preview availability late in August of 2023, it became clear to me as I worked on the article that interest was waning. Little was being written, and activity on the PiE support forums can best be described as sleepy. I lost interest.

Still, every so often, I came back to Python in Excel, thinking again about its potential usefulness and business viability. This is the type of solution that can take awhile to reach critical mass in the market, but my sense is it would be a "sticky" solution once organizations start to adopt it. So, in honor of Medium's Draft Day 2024, I revisited and polished up the article and pushed Publish. It felt good.

Tags: Data Science Draft Day 2024 Programming Python In Excel Time Series Analysis

Comment