Building Durable Workflows From Scratch in JavaScript


Durable workflows - workflows that checkpoint their state to automatically recover from failure - enable developers to build reliable code faster and dramatically reduce the severity of production incidents. Most workflow systems require you to set up a bunch of infrastructure, but that’s not necessary! In this talk, we’ll show you how to build durable workflows in pure JavaScript, as a library any application can import.

This talk was presented at JSNation US 2025.

FAQ

What are durable workflows in JavaScript?
Durable workflows are processes that regularly checkpoint the state of a program, allowing it to recover from failures by restoring from the last completed step, similar to save points in video games.

Why is building reliable systems so challenging?
Because any step in a complex application can fail due to process crashes, resource limits, timeouts, external API failures, or AI brittleness, especially at large scale.

How do durable workflows help recover from failures?
By checkpointing each step of a workflow. If a failure occurs, the system can use these checkpoints to restore the workflow from its last successful step, avoiding the need to restart from the beginning.

Why implement durable workflows directly in a Node.js application?
Doing so lets you manage workflows within your app without relying on heavy external orchestration systems, reducing latency and complexity.

What does checkpointing involve in durable workflows?
Checkpointing saves the state of each workflow step in a database, allowing the system to recover from failures by loading the workflow's last successful state.
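As a rough sketch of that mechanic, the helper below checks a checkpoint store before executing a step (the Map-backed store and the `runStep` name are hypothetical stand-ins; a real library would write to a database table keyed by workflow and step ID):

```js
// Hypothetical in-memory checkpoint store; a real library would persist
// this in a database so checkpoints survive a process crash.
const checkpoints = new Map();

// Run a step at most once per workflow: if a checkpoint for this step
// already exists, return the saved output instead of re-executing.
async function runStep(workflowId, stepId, fn) {
  const key = `${workflowId}:${stepId}`;
  if (checkpoints.has(key)) {
    return checkpoints.get(key); // restore from the last successful run
  }
  const output = await fn();    // execute the step
  checkpoints.set(key, output); // checkpoint the result before moving on
  return output;
}
```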

How do durable workflows prevent issues like data duplication?
By ensuring that each step is checkpointed once it completes before the workflow proceeds, so in the event of a failure, completed steps are not repeated.

What are the two main methods in the durable workflow library?
'Register workflow', which registers a function as a workflow, and 'run step', which executes a function as a checkpointed step within that workflow.
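Here is one way that two-method surface might fit together, building on the `runStep` sketch above (a sketch only: `registerWorkflow`, the `ctx` object, and the money-transfer steps are illustrative names, not the library's actual API):

```js
import { randomUUID } from 'node:crypto';

const checkpoints = new Map(); // stand-in for a durable database table

// Wrap a function as a workflow: each invocation gets a workflow ID and
// a step runner that checkpoints every step's output.
function registerWorkflow(fn) {
  return async function run(input, workflowId = randomUUID()) {
    const ctx = {
      async runStep(stepId, step) {
        const key = `${workflowId}:${stepId}`;
        if (checkpoints.has(key)) return checkpoints.get(key);
        const output = await step();
        checkpoints.set(key, output);
        return output;
      },
    };
    return fn(ctx, input);
  };
}

// Example: a money-transfer workflow. If the process crashes after the
// debit step, a retry resumes at the credit step instead of debiting twice.
const transfer = registerWorkflow(async (ctx, { from, to, amount }) => {
  await ctx.runStep('debit', async () => console.log(`debit ${from}: ${amount}`));
  await ctx.runStep('credit', async () => console.log(`credit ${to}: ${amount}`));
  await ctx.runStep('confirm', async () => console.log('confirmation sent'));
});

await transfer({ from: 'alice', to: 'bob', amount: 100 });
```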

How does the library handle workflow recovery?
It identifies pending workflows that didn't finish, re-runs them from the last completed step using their checkpointed data, and continues execution from there.
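Continuing the same sketch, a recovery pass might look roughly like this (the `pending` map and `recoverPendingWorkflows` helper are hypothetical; a real library would keep this bookkeeping in the database alongside the checkpoints):

```js
// Record workflows when they start and clear the record when they finish,
// so unfinished ones can be found after a crash. In-memory here for the
// sketch; a real library persists this table in the database.
const pending = new Map(); // workflowId -> { run, input }

async function startWorkflow(run, input, workflowId) {
  pending.set(workflowId, { run, input });
  const result = await run(input, workflowId);
  pending.delete(workflowId); // cleared only on successful completion
  return result;
}

// On startup, re-run every workflow that never finished. Replay is safe
// because runStep returns checkpointed outputs instead of re-executing.
async function recoverPendingWorkflows() {
  for (const [workflowId, { run, input }] of pending) {
    await startWorkflow(run, input, workflowId);
  }
}
```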

Why use a lightweight durable workflow library?
It lets developers integrate reliable workflow management directly into their applications without complex external infrastructure, keeping execution fast and efficient.

Can durable workflows be used beyond business processes?
Yes. They work for a variety of applications, including CI pipelines, data pipelines, and AI agents, providing reliable execution across different domains.

Peter Kraft
18 min
20 Nov, 2025
Video Summary and Transcription
Peter discusses building durable workflows in pure JavaScript, highlighting the challenges of creating reliable systems, especially in complex applications like money transfers. Failures can occur for many reasons, including process crashes, resource exhaustion, timeouts, and API failures. Durable workflows checkpoint state for recovery, and have traditionally relied on heavyweight external orchestration systems. The JavaScript workflow library is designed for simplicity and durability, enabling checkpointing directly on app servers. Its implementation covers checkpointing for safety, recovery, and maintaining workflow integrity.

1. Building Durable Workflows in JavaScript

Short description:

Peter discusses building durable workflows in pure JavaScript, highlighting the challenges of creating reliable systems, especially in complex applications like money transfers. Failures can occur for many reasons, including process crashes, resource exhaustion, timeouts, and API failures. Restarting failed processes can lead to data corruption or wasted resources, necessitating complex recovery code.

Hey, I'm Peter, and today I want to talk to you about how you can build durable workflows from scratch in pure JavaScript. All of you are developers, and you know how hard it is to build truly reliable systems. It's hard to make your applications reliable because in a complex application, just about any step can break anywhere. If you're building something like a money transfer, for example, you have to worry about your application breaking when you start transferring, while you're in the middle of transferring, and at the end, when you're trying to send a confirmation. And things can break for just about any reason.

Applications can break and fail because their process crashes, because they run out of resources, or because they hit a bug. They can fail because of timeouts, because someone takes too long to respond. They can fail because an external API they use breaks down, because it's rate-limited, because it transiently fails, because it has an outage. And, of course, failures are an even bigger worry if you're running at large scale, where there are simply more things that can fail, or if you're using AI, which a lot of folks are these days, because AI is inherently brittle, and the model providers we rely on frequently have issues and outages that can break AI applications.

If something fails, the easy thing to do is to retry from the beginning, to restart whatever failed. But often you can't do that. If you restart a business process like a money transfer from the beginning, you risk corruption or duplication, where you transfer money twice or double-book a reservation. If you restart a really big task, you risk wasting compute resources or being incredibly slow. So often, instead of just restarting something that fails, you have to implement complicated recovery code that figures out exactly what failed and tries to remediate it directly. And that sort of thing is hard to write.

2. Implementing Lightweight Durable Workflows

Short description:

Durable workflows checkpoint program state so that failed programs can recover, akin to save points in video games. They are useful for many applications, but most existing systems are complex to integrate into JavaScript apps. The traditional durable workflow architecture relies on external orchestration: heavyweight systems, large dependencies, and latency overhead.

So one new tool that can really help with this is durable workflows. The idea behind durable workflows is that you regularly checkpoint the state of your program, so that if something fails, you can use those checkpoints to recover your program from its last completed step. You can think of workflows and their checkpoints as working a lot like save points in a video game: when you're playing a game, you save regularly, so that if you die, you can reload from the last save. And when you're running a durable workflow, you're checkpointing every step, so that if your program fails, you can reload, resume, and recover from the last completed step.
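To make the save-point analogy concrete, here is a toy replay under the same assumptions as the sketches above (all names are illustrative):

```js
// Toy replay demonstration; the `saved` map stands in for durable storage.
const saved = new Map();

async function step(runId, name, fn) {
  const key = `${runId}:${name}`;
  if (saved.has(key)) return saved.get(key); // reload from the "save point"
  const output = await fn();
  saved.set(key, output);
  return output;
}

async function workflow(runId, crash) {
  await step(runId, 'one', async () => console.log('step one executes'));
  if (crash) throw new Error('simulated crash'); // the process "dies" here
  await step(runId, 'two', async () => console.log('step two executes'));
}

// The first attempt crashes after step one is checkpointed...
await workflow('run-42', true).catch(() => {});
// ...and the retry reloads step one from its checkpoint (it does not log
// a second time) and then completes step two.
await workflow('run-42', false);
```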

These sorts of workflows are really useful for all sorts of applications: for important business processes, for CI pipelines, for data pipelines, and nowadays also for agents. There are plenty of workflow systems out there, but most of them are big and complicated, and they're not easy to integrate into a JavaScript application. The classic architecture for durable workflows is what I like to call external orchestration. The idea is that you have a central orchestrator service that orchestrates your workflows, and if you want to run a workflow, you run it not on your own JavaScript application but on that service.

The way these systems work is that your servers send a request to the workflow service to start a workflow. The workflow service dispatches the first step in the workflow to a worker. The worker executes the step. The service dispatches the second step, and so on, until every step has been dispatched and the workflow is done. These are big, heavyweight systems, and they work, but they require you to take on large dependencies, add new services, and re-architect your application. They can also have high latency overhead from step dispatch: you might be adding tens or hundreds of milliseconds to the latency of each of your steps.
