Stop Using Spark for Your Small Data - Why Azure Functions is the Right Tool for the Job

As a data analyst, my job is to get data from A to B, cleaned and ready for use. A common workflow for my team involves users uploading Excel files to a OneDrive folder. A Power Automate flow then syncs these files daily to a container in our Azure Storage Account.

From there, my responsibility begins:

1. Read the new Excel file from Blob Storage using Python.
2. Process the data (clean, transform, apply business logic).
3. Write the final data to an Azure SQL Database.
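For a sense of what those three steps look like in plain Python, here is a minimal sketch using pandas, the Azure Blob Storage SDK, and SQLAlchemy. The connection strings, container, file, and table names are placeholders, and the cleaning step stands in for the real business logic.

```python
import io

import pandas as pd
from azure.storage.blob import BlobServiceClient
from sqlalchemy import create_engine

# 1. Read the new Excel file from Blob Storage (placeholder names throughout).
blob_service = BlobServiceClient.from_connection_string("<storage-connection-string>")
blob_client = blob_service.get_blob_client(container="uploads", blob="daily_report.xlsx")
raw_bytes = blob_client.download_blob().readall()
df = pd.read_excel(io.BytesIO(raw_bytes))

# 2. Clean and transform (stand-in for the real business logic).
df.columns = [c.strip().lower().replace(" ", "_") for c in df.columns]
df = df.dropna(how="all")

# 3. Write the final data to an Azure SQL Database via SQLAlchemy + pyodbc.
engine = create_engine(
    "mssql+pyodbc://<user>:<password>@<server>.database.windows.net/<database>"
    "?driver=ODBC+Driver+18+for+SQL+Server"
)
df.to_sql("daily_report", engine, if_exists="append", index=False)
```

For a 10MB file, this is a few seconds of work, which is exactly why a full Spark cluster is overkill.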

I needed this to run on two triggers: a time schedule (e.g., every morning at 7 AM) and an event-driven trigger (i.e., as soon as a new file lands in the container).

My first thought was to use the "big data" tools I'd heard of: Azure Databricks or Azure Synapse Analytics.

The "Big Tool" Trap

On the surface, Databricks and Synapse are perfect:

- They let me write Python in a Notebook, which I'm very comfortable with.
- They have easy-to-use trigger and monitoring tools.

I set up a proof-of-concept, and it worked. But I quickly realized a problem. My Excel files are 10MB, not 10TB.

Using a full Spark cluster (which is what both Databricks and Synapse Notebooks run on) was like using a sledgehammer to crack a nut. I was paying for a powerful, multi-node cluster (which took 5-10 minutes to "cold start") just to run a Python script that finished in 30 seconds. The cost was going to be far too high for such a simple task.

The "Right Tool": Azure Functions

After some research, I found the perfect tool for small-to-medium data tasks: Azure Functions. Azure Functions, when used on a "Consumption Plan," is a true "serverless" service. This means:

- It's cheap: You get a generous free grant every month, and after that, you pay _only_ for the seconds your code is actually running. For my task, the cost is practically $0.
- It's fast: It starts in seconds (or less), not minutes.
- It's perfect for triggers: It has built-in triggers for exactly my needs (Timer and Blob Storage).
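To make the trigger story concrete, here is a minimal sketch using the Azure Functions Python v2 programming model. The schedule, container name, and connection setting are assumptions for illustration, and the processing calls are placeholders for the pandas logic shown earlier.

```python
import logging

import azure.functions as func

app = func.FunctionApp()

# Timer trigger: NCRONTAB expression (second minute hour day month day-of-week),
# here 7:00 AM every day.
@app.timer_trigger(schedule="0 0 7 * * *", arg_name="timer")
def scheduled_run(timer: func.TimerRequest) -> None:
    logging.info("Scheduled run started")
    # ...call the Excel-to-SQL processing here...

# Blob trigger: fires as soon as a new file lands in the "uploads" container.
# "AzureWebJobsStorage" is the app setting holding the storage connection string.
@app.blob_trigger(arg_name="blob", path="uploads/{name}", connection="AzureWebJobsStorage")
def on_new_file(blob: func.InputStream) -> None:
    logging.info("New file uploaded: %s (%s bytes)", blob.name, blob.length)
    # ...process blob.read() and load it into Azure SQL here...
```

Both functions live in one small project, deploy together, and bill only for the seconds they actually run.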

The (Small) Learning Curve

The one trade-off is that it's _slightly_ more complex than a notebook. You can't just write and run your code in a web browser. The modern, recommended workflow is to use Visual Studio Code (VS Code) to develop your code locally and then "deploy" (push) it to the cloud.

This "local development" workflow is a best practice. It means you have a copy of your code, can use source control (like Git), and can test everything on your machine before it goes live.

More Than Just Timers

My needs were simple, but Azure Functions has triggers for almost anything. The most popular ones include:

- Timer Trigger: Runs on a schedule (e.g., the NCRONTAB expression 0 0 7 * * 1 for 7 AM every Monday).
- Blob Trigger: Runs when a new file is uploaded to a storage container.
- HTTP Trigger: Runs when it receives a web request (creating a simple API); a minimal sketch follows below.
- Queue Trigger: Runs when a new message is added to a storage queue.

You can see the full list in the official Microsoft Azure Functions Triggers and Bindings documentation.
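As an example of the HTTP trigger, here is a minimal sketch in the same Python v2 model; the route name and response are made up for illustration.

```python
import azure.functions as func

app = func.FunctionApp()

# A simple GET endpoint: /api/hello?name=Anna
@app.route(route="hello", auth_level=func.AuthLevel.ANONYMOUS)
def hello(req: func.HttpRequest) -> func.HttpResponse:
    name = req.params.get("name", "world")
    return func.HttpResponse(f"Hello, {name}!")
```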

Conclusion

Databricks and Synapse are amazing, powerful tools, but they are not the answer for everything. For our team's daily Excel processing, using them was costing us time and money.

By investing a little time to learn the VS Code + Azure Functions workflow, we built a solution that is faster, more efficient, and costs a fraction of the price. Don't pay for a Spark cluster when all you need is a 30-second Python script.

2025-11-18
