<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0" xmlns:itunes="http://www.itunes.com/dtds/podcast-1.0.dtd" xmlns:googleplay="http://www.google.com/schemas/play-podcasts/1.0"><channel><title><![CDATA[Ammar Husain]]></title><description><![CDATA[Ammar Husain]]></description><link>https://ammarhusain.substack.com</link><image><url>https://substackcdn.com/image/fetch/$s_!bYzc!,w_256,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4a7d1f59-df58-40cd-8f9a-5c9ae88ce181_231x207.jpeg</url><title>Ammar Husain</title><link>https://ammarhusain.substack.com</link></image><generator>Substack</generator><lastBuildDate>Wed, 08 Apr 2026 22:22:18 GMT</lastBuildDate><atom:link href="https://ammarhusain.substack.com/feed" rel="self" type="application/rss+xml"/><copyright><![CDATA[Ammar Husain]]></copyright><language><![CDATA[en]]></language><webMaster><![CDATA[ammarhusain@substack.com]]></webMaster><itunes:owner><itunes:email><![CDATA[ammarhusain@substack.com]]></itunes:email><itunes:name><![CDATA[Ammar Husain]]></itunes:name></itunes:owner><itunes:author><![CDATA[Ammar Husain]]></itunes:author><googleplay:owner><![CDATA[ammarhusain@substack.com]]></googleplay:owner><googleplay:email><![CDATA[ammarhusain@substack.com]]></googleplay:email><googleplay:author><![CDATA[Ammar Husain]]></googleplay:author><itunes:block><![CDATA[Yes]]></itunes:block><item><title><![CDATA[Beyond Caching: Content Delivery Networks]]></title><description><![CDATA[The CDN Puzzle: Solving Latency, Load, and Global Delivery]]></description><link>https://ammarhusain.substack.com/p/beyond-caching-content-delivery-networks</link><guid isPermaLink="false">https://ammarhusain.substack.com/p/beyond-caching-content-delivery-networks</guid><dc:creator><![CDATA[Ammar Husain]]></dc:creator><pubDate>Fri, 20 
Mar 2026 12:41:17 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!V_-6!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3bd3c14f-d90b-426f-8b33-801a283ac3f4_1536x1024.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<h3>Introduction</h3><p>Consider a user in Australia browsing their social media feed to catch up with friends in <em>Europe </em>and <em>America</em>. The media shared by friends takes considerable time to load despite the user having a reasonably fast internet connection&#8202;&#8212;&#8202;while the same content loads instantly for those browsing from within <em>Europe</em>.</p><p>Consider another user in <em>America </em>trying to watch a live concert in <em>Europe </em>on their device. The broadcast is interrupted briefly but frequently. However, for the <em>European </em>audience, the broadcast is seamless.</p><p>In both cases, users far from where the content is hosted face delays in accessing it over the internet due to <em>increased round-trip time and additional network hops</em>. This happens despite users having reasonably fast internet connections and providers having servers with enough capacity to serve traffic and withstand spikes.</p><p>To provide a fair user experience, content providers need to ensure the geographical disadvantage is blunted by serving content locally. 
This is the core problem that <strong>Content Delivery Networks (CDNs)</strong><em><strong> </strong></em>solve&#8202;&#8212;&#8202;<em>bringing content closer to the user by caching and serving it from geographically distributed edge servers</em>.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!V_-6!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3bd3c14f-d90b-426f-8b33-801a283ac3f4_1536x1024.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!V_-6!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3bd3c14f-d90b-426f-8b33-801a283ac3f4_1536x1024.png 424w, https://substackcdn.com/image/fetch/$s_!V_-6!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3bd3c14f-d90b-426f-8b33-801a283ac3f4_1536x1024.png 848w, https://substackcdn.com/image/fetch/$s_!V_-6!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3bd3c14f-d90b-426f-8b33-801a283ac3f4_1536x1024.png 1272w, https://substackcdn.com/image/fetch/$s_!V_-6!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3bd3c14f-d90b-426f-8b33-801a283ac3f4_1536x1024.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!V_-6!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3bd3c14f-d90b-426f-8b33-801a283ac3f4_1536x1024.png" width="1456" height="971" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/3bd3c14f-d90b-426f-8b33-801a283ac3f4_1536x1024.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:971,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:2923717,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://ammarhusain.substack.com/i/191576421?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3bd3c14f-d90b-426f-8b33-801a283ac3f4_1536x1024.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!V_-6!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3bd3c14f-d90b-426f-8b33-801a283ac3f4_1536x1024.png 424w, https://substackcdn.com/image/fetch/$s_!V_-6!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3bd3c14f-d90b-426f-8b33-801a283ac3f4_1536x1024.png 848w, https://substackcdn.com/image/fetch/$s_!V_-6!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3bd3c14f-d90b-426f-8b33-801a283ac3f4_1536x1024.png 1272w, https://substackcdn.com/image/fetch/$s_!V_-6!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3bd3c14f-d90b-426f-8b33-801a283ac3f4_1536x1024.png 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" 
width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><h3>Content Delivery Network (CDN)</h3><h4>Definition</h4><p>Formally, a <strong><a href="https://en.wikipedia.org/wiki/Content_delivery_network">Content Delivery Network (CDN)</a></strong> is a geographically distributed network of <a href="https://en.wikipedia.org/wiki/Proxy_server">proxy servers</a> and corresponding <a href="https://en.wikipedia.org/wiki/Data_center">data centers</a>.</p><p>The primary purpose of a <strong>CDN </strong>is to <em>provide content at high speed</em>. 
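</p><p>As a rough back-of-the-envelope sketch, the speed advantage can be modeled as an expected latency: a cache hit is served from the nearby edge, while a miss pays an extra trip to the origin. All round-trip numbers below are hypothetical, chosen purely for illustration.</p>

```python
# Toy expected-latency model for fetching one object. All round-trip times
# (RTTs) are hypothetical, chosen only to illustrate the effect of serving
# content from a nearby edge cache instead of a distant origin.

def expected_latency_ms(hit_ratio: float, edge_rtt_ms: float, origin_rtt_ms: float) -> float:
    """Cache hit: served from the edge. Cache miss: the edge must first
    fetch the object from the origin, so the request pays both RTTs."""
    hit = hit_ratio * edge_rtt_ms
    miss = (1.0 - hit_ratio) * (edge_rtt_ms + origin_rtt_ms)
    return hit + miss

# Without a CDN, every request pays the full round trip to a distant origin.
no_cdn = expected_latency_ms(hit_ratio=0.0, edge_rtt_ms=0.0, origin_rtt_ms=250.0)
# With a nearby PoP (20 ms away) and a 95% cache-hit ratio.
with_cdn = expected_latency_ms(hit_ratio=0.95, edge_rtt_ms=20.0, origin_rtt_ms=250.0)
# no_cdn == 250.0 ms, with_cdn == 32.5 ms
```

<p>Under these assumed numbers, a 95% cache-hit ratio cuts expected latency from 250&#8201;ms to roughly 32.5&#8201;ms; real figures depend on PoP placement, cache-hit ratios, and workload.</p><p>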
Thus, it shouldn&#8217;t be considered a replacement for a <em>web host</em> but a service that helps a traditional web host overcome various limitations.</p><h4>Core Components</h4><p>A typical CDN consists of the following core components &#8594;</p><ul><li><p><em>Origin Server</em> &#8594; The primary server where the original, authoritative version of content resides. This server is the <em>web host</em>. Without a CDN, every user request would hit this server directly, regardless of their location.</p></li><li><p><em>Edge Servers</em> &#8594; CDN <em>cache </em>servers deployed at the &#8220;<em>edge</em>&#8221; of the network, physically closer to end users. They store <em>cached copies</em> of content. When a user requests a resource, the nearest edge server serves it, drastically reducing <em>round-trip time (RTT)</em>.</p></li><li><p><em>Point of Presence (PoP)</em> &#8594; A <em>PoP </em>is a physical data center location housing a <em>cluster of edge servers</em>. Major <em>CDN </em>providers operate hundreds of <em>PoPs </em>worldwide. Each <em>PoP </em>serves users in its geographic vicinity&#8202;&#8212;&#8202;think of them as <em>regional caches</em> of content.</p></li><li><p><em>Internet Exchange Points (IXPs) &#8594;</em> These are physical locations where different networks (<em>ISPs, CDNs, cloud providers</em>) interconnect and exchange traffic. 
CDNs strategically colocate at IXPs to peer directly with ISPs, minimizing network hops and improving delivery speed.</p></li></ul><h4>Traffic Management</h4><p>A typical CDN uses the following mechanisms to manage traffic &#8594;</p><ul><li><p><em>Global Server Load Balancing (GSLB) </em>&#8594; A DNS-based mechanism that intelligently routes user requests to the optimal PoP based on factors like geographic proximity, server health, network congestion, and current load.</p></li><li><p><em>Selector (Request Routing)</em> &#8594; The decision logic&#8202;&#8212;&#8202;often working alongside GSLB&#8202;&#8212;&#8202;that determines which edge server within a PoP handles a specific request, factoring in content availability, server capacity, and session affinity.</p></li></ul><h4>Key Concepts</h4><ul><li><p><em>Offloading &#8594; </em>The percentage of requests served directly by edge servers without going back to the origin. A high cache-hit ratio (<em>e.g., 95%</em>) means significant offloading&#8202;&#8212;&#8202;<em>reducing origin bandwidth, compute costs, and the risk of origin overload</em>.</p></li><li><p><em>Footprint </em>&#8594; Refers to the CDN&#8217;s global reach&#8202;&#8212;&#8202;<em>the total number and distribution of PoPs, edge servers, and network capacity</em>. 
A larger footprint means better coverage, lower latency for diverse user bases, and greater resilience against regional failures.</p></li></ul><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!2iU1!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F43a4d549-019b-4d00-9180-39f29b88efcb_1024x1536.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!2iU1!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F43a4d549-019b-4d00-9180-39f29b88efcb_1024x1536.png 424w, https://substackcdn.com/image/fetch/$s_!2iU1!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F43a4d549-019b-4d00-9180-39f29b88efcb_1024x1536.png 848w, https://substackcdn.com/image/fetch/$s_!2iU1!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F43a4d549-019b-4d00-9180-39f29b88efcb_1024x1536.png 1272w, https://substackcdn.com/image/fetch/$s_!2iU1!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F43a4d549-019b-4d00-9180-39f29b88efcb_1024x1536.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!2iU1!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F43a4d549-019b-4d00-9180-39f29b88efcb_1024x1536.png" width="1024" height="1536" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/43a4d549-019b-4d00-9180-39f29b88efcb_1024x1536.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1536,&quot;width&quot;:1024,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:1225667,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://ammarhusain.substack.com/i/191576421?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F43a4d549-019b-4d00-9180-39f29b88efcb_1024x1536.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!2iU1!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F43a4d549-019b-4d00-9180-39f29b88efcb_1024x1536.png 424w, https://substackcdn.com/image/fetch/$s_!2iU1!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F43a4d549-019b-4d00-9180-39f29b88efcb_1024x1536.png 848w, https://substackcdn.com/image/fetch/$s_!2iU1!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F43a4d549-019b-4d00-9180-39f29b88efcb_1024x1536.png 1272w, https://substackcdn.com/image/fetch/$s_!2iU1!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F43a4d549-019b-4d00-9180-39f29b88efcb_1024x1536.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" 
height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>In summary &#8594; Users hit a nearby edge server at a PoP (<em>often at an IXP</em>), routed there by GSLB/selectors, offloading traffic from your origin&#8202;&#8212;&#8202;all enabled by the CDN&#8217;s global footprint.</p><h4>Types of CDNs</h4><p>CDNs can be classified based on the networking techniques they use to route and deliver content &#8594;</p><ol><li><p><em>Anycast-Based CDN</em> &#8594; Uses <a href="https://en.wikipedia.org/wiki/Border_Gateway_Protocol">Border Gateway Protocol</a> <em>Anycast </em>routing, where the same IP address is announced from multiple geographically distributed PoPs. When a user sends a request, BGP routing directs the packet to the nearest (<em>in terms of network hops/latency</em>) server advertising that IP. This approach is simple, offers fast failover, and is resilient to DDoS attacks (<em>traffic is naturally distributed</em>); it is used by <em>Cloudflare and Google Cloud CDN</em>.</p></li><li><p><em>DNS-Based CDN</em> &#8594; Uses DNS resolution to direct users to the optimal edge server. When a user resolves the domain, the CDN&#8217;s authoritative DNS server returns the IP of the closest or least-loaded edge server based on the user&#8217;s location (<em>via the resolver&#8217;s IP or EDNS Client Subnet</em>). This provides fine-grained control over routing decisions (<em>it can factor in server load, geography, and health</em>). However, it suffers from DNS caching/TTL delays, and routing is based on the DNS resolver&#8217;s location, not always the end user&#8217;s. Used by Akamai and Amazon CloudFront.</p></li><li><p><em>Unicast-Based CDN</em> &#8594; Uses a unique IP address for each edge server, and traffic is directed via <em>DNS or application-layer logic</em>. 
The CDN&#8217;s control plane decides which specific server IP to hand back for a given request. Although this provides full control over which server handles which request, it requires more complex routing logic at the application/DNS layer.</p></li><li><p><em>Multicast-Based CDN</em> &#8594; Uses IP Multicast to deliver the same content to multiple recipients simultaneously. A single stream is sent and replicated at network routers to reach all subscribers&#8202;&#8212;&#8202;avoiding duplicate copies. This is extremely efficient for live streaming/broadcast scenarios. However, due to limited multicast support across the public internet, it is mostly used within managed/private networks (<em>IPTV, enterprise</em>).</p></li><li><p><em>Peer-to-Peer (P2P) Hybrid CDN</em> &#8594; Combines traditional CDN edge servers with P2P networking among end users. Users who have already downloaded content share chunks with nearby peers, <em>reducing load on origin/edge servers</em>. This scales massively for popular content and reduces bandwidth costs. However, it depends heavily on peer availability, so latency can vary. Moreover, it raises potential security/privacy concerns.</p></li><li><p><em>Application-Layer (Overlay) CDN</em> &#8594; Builds a logical overlay network on top of the existing internet infrastructure, using application-layer routing. Edge servers communicate with each other through an optimized overlay topology (<em>not relying on default BGP paths</em>). Requests are routed through intermediate CDN nodes for optimal performance. It can route around congestion, packet loss, and suboptimal BGP paths, though with added complexity. 
This also requires a sophisticated control plane.</p></li></ol><h4>Benefits &amp; Use Cases</h4><p>CDNs provide the following benefits &#8594;</p><ul><li><p><em>Reduced Latency</em> &#8594; CDNs cache content on edge servers geographically closer to users, drastically reducing round-trip time.</p></li><li><p><em>High Availability &amp; Redundancy &#8594; </em>Traffic is distributed across multiple servers, so if one node fails, others handle requests seamlessly.</p></li><li><p><em>Scalability </em>&#8594; CDNs absorb traffic spikes (<em>e.g., flash sales, viral content</em>) without overloading the origin server.</p></li><li><p><em>Bandwidth Cost Savings</em> &#8594; Caching reduces the number of requests hitting the origin, lowering bandwidth and infrastructure costs.</p></li><li><p><em>Security </em>&#8594; Many CDNs offer DDoS mitigation, WAF (<em>Web Application Firewall</em>), and TLS termination at the edge.</p></li><li><p><em>Improved SEO &#8594; </em>Faster page loads positively impact search engine rankings.</p></li></ul><p>Common use cases for CDNs &#8594;</p><ul><li><p><em>Static asset delivery</em> &#8594; Images, CSS, JavaScript, fonts, and videos (<em>e.g. 
social media sites</em>).</p></li><li><p><em>Video/audio streaming &#8594; </em>Low-latency media delivery at scale (<em>e.g., Netflix, YouTube</em>).</p></li><li><p><em>Software distribution</em> &#8594; Serving binaries, patches, and updates (e.g., OS updates, game downloads).</p></li><li><p><em>API acceleration</em> &#8594; Caching API responses for read-heavy workloads.</p></li><li><p><em>E-commerce </em>&#8594; Handling global traffic with consistent performance during peak events.</p></li></ul><p>In short, CDNs are essential for any application that serves content to a geographically distributed audience and needs fast, reliable delivery.</p><p>However, it&#8217;s imperative to know <em>When Not to Use a CDN &#8594;</em></p><ul><li><p><em>Highly dynamic/personalized content &#8594; </em>User-specific dashboards, real-time data feeds, or authenticated API responses gain minimal caching benefit.</p></li><li><p><em>Real-time applications &#8594;</em> WebSocket connections, live gaming, or chat systems require persistent connections that are poorly suited to CDN architecture.</p></li><li><p><em>Geographically concentrated users</em> &#8594; If your audience is near the origin server, a CDN adds unnecessary intermediary hops.</p></li><li><p><em>Sensitive/regulated data &#8594; </em>Distributing confidential or compliance-bound content (e.g., healthcare, financial) across third-party edge servers raises security and legal concerns.</p></li><li><p><em>Small-scale projects &#8594; </em>The operational complexity and cost outweigh performance gains for low-traffic applications.</p></li></ul><h4>Limitations</h4><p>CDNs cache content at edge servers, but cache invalidation is complex&#8202;&#8212;&#8202;stale content can persist after updates. 
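</p><p>The staleness window can be illustrated with a toy TTL-based edge cache (the URL, versions, and timings below are made up). Until its TTL expires, the edge keeps serving whatever copy it cached, regardless of what the origin now holds:</p>

```python
# Minimal sketch of why stale content can persist: a TTL-based edge cache
# serves its cached copy until the TTL expires, even after the origin has
# been updated. All names, versions, and timestamps are illustrative.

class EdgeCache:
    def __init__(self, ttl_s: float):
        self.ttl_s = ttl_s
        self.store = {}  # url -> (cached_at, body)

    def get(self, url: str, now: float, origin: dict) -> str:
        entry = self.store.get(url)
        if entry is not None and now - entry[0] < self.ttl_s:
            return entry[1]              # cache hit (possibly stale)
        body = origin[url]               # cache miss: fetch from origin
        self.store[url] = (now, body)
        return body

origin = {"/logo.png": "v1"}
cache = EdgeCache(ttl_s=60.0)
first = cache.get("/logo.png", now=0.0, origin=origin)   # "v1" is cached
origin["/logo.png"] = "v2"                               # origin is updated
stale = cache.get("/logo.png", now=30.0, origin=origin)  # still serves "v1"
fresh = cache.get("/logo.png", now=61.0, origin=origin)  # TTL expired: "v2"
```

<p>Purge APIs and shorter TTLs narrow this window, at the cost of more origin traffic and operational complexity.</p><p>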
They add cost overhead (<em>bandwidth fees, per-request charges</em>) that may not justify the benefit for low-traffic sites.</p><p>CDNs offer limited control over edge server behavior and can introduce debugging complexity when issues arise across distributed nodes. They also create an origin dependency&#8202;&#8212;&#8202;if your origin server fails, the CDN can only serve cached content until it expires.</p><p>Additionally, latency for cache misses can actually be higher than for direct origin requests due to extra routing hops.</p><div><hr></div><h4>Conclusion</h4><p><em>Content Delivery Networks</em> have become a cornerstone of modern web architecture, ensuring that applications deliver <em>fast, reliable, and secure experiences to users regardless of geography</em>.</p><p>By caching content closer to end users, intelligently routing traffic, and providing resilience against spikes and failures, CDNs address the fundamental challenges of latency and scalability on the internet.</p><p>While they offer significant benefits&#8202;&#8212;&#8202;<em>ranging from performance gains to cost savings and security enhancements&#8202;</em>&#8212;&#8202;CDNs are not a one-size-fits-all solution. 
Their limitations, such as <em>cache invalidation complexity and added operational overhead</em>, must be carefully weighed against project needs.</p><p>For software engineers and architects, understanding when and how to leverage CDNs is critical to building systems that <em>balance efficiency, reliability, and cost-effectiveness in a globally connected world</em>.</p><div><hr></div><h4>References and Further Reading</h4><ul><li><p><a href="https://en.wikipedia.org/wiki/Content_delivery_network">CDN (Wikipedia)</a></p></li><li><p><a href="https://www.cloudflare.com/learning/cdn/what-is-a-cdn/">Cloudflare&#8202;&#8212;&#8202;What is a CDN</a></p></li><li><p><a href="https://www.akamai.com/glossary/what-is-a-cdn">Akamai&#8202;&#8212;&#8202;What is a CDN</a></p></li><li><p><a href="https://blog.leaseweb.com/2025/01/15/cdns-in-action-real-world-success-stories/">CDN Success Stories</a></p></li><li><p><a href="https://blog.blazingcdn.com/en-us/how-a-multi-cdn-strategy-saved-us-from-costly-downtime-1">Case Study&#8202;&#8212;&#8202;Multi CDN</a></p></li></ul>]]></content:encoded></item><item><title><![CDATA[Consensus in Distributed Systems: Understanding the Raft Algorithm]]></title><description><![CDATA[Raft: Why Simplicity Wins in Distributed Consensus]]></description><link>https://ammarhusain.substack.com/p/consensus-in-distributed-systems</link><guid isPermaLink="false">https://ammarhusain.substack.com/p/consensus-in-distributed-systems</guid><dc:creator><![CDATA[Ammar Husain]]></dc:creator><pubDate>Sat, 14 Feb 2026 10:46:19 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!VG2v!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F798ce270-1cea-434f-b566-3d252328a144_1024x1536.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<h3>Introduction</h3><p>Consider a group of friends planning a weekend outing. 
To make the trip successful, they need consensus on the <em>location, schedule, and budget</em>. Typically, one person is chosen as the leader&#8202;&#8212;&#8202;responsible for decisions, tracking expenses, and keeping everyone informed, including any new members who join later. If the leader steps down, the group elects another to maintain <em>continuity</em>.</p><p>In distributed computing, clusters of servers face a similar challenge&#8202;&#8212;&#8202;they must agree on shared state and decisions. This is achieved through <strong><a href="https://en.wikipedia.org/wiki/Consensus_%28computer_science%29">Consensus Protocols</a></strong>. Among the most well-known are <strong>Viewstamped Replication (VSR), Zookeeper Atomic Broadcast (<a href="https://en.wikipedia.org/wiki/Apache_ZooKeeper">ZAB</a>)</strong>, <strong><a href="https://en.wikipedia.org/wiki/Paxos_%28computer_science%29">Paxos</a></strong>, and <strong><a href="https://en.wikipedia.org/wiki/Raft_%28algorithm%29">Raft</a></strong>. In this article, we will explore Raft&#8202;&#8212;&#8202;<em>designed to be more understandable while ensuring reliability in distributed systems</em>.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!VG2v!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F798ce270-1cea-434f-b566-3d252328a144_1024x1536.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!VG2v!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F798ce270-1cea-434f-b566-3d252328a144_1024x1536.png 424w, https://substackcdn.com/image/fetch/$s_!VG2v!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F798ce270-1cea-434f-b566-3d252328a144_1024x1536.png 848w, https://substackcdn.com/image/fetch/$s_!VG2v!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F798ce270-1cea-434f-b566-3d252328a144_1024x1536.png 1272w, https://substackcdn.com/image/fetch/$s_!VG2v!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F798ce270-1cea-434f-b566-3d252328a144_1024x1536.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!VG2v!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F798ce270-1cea-434f-b566-3d252328a144_1024x1536.png" 
width="1024" height="1536" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/798ce270-1cea-434f-b566-3d252328a144_1024x1536.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1536,&quot;width&quot;:1024,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:2057383,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://ammarhusain.substack.com/i/187940075?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F798ce270-1cea-434f-b566-3d252328a144_1024x1536.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!VG2v!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F798ce270-1cea-434f-b566-3d252328a144_1024x1536.png 424w, https://substackcdn.com/image/fetch/$s_!VG2v!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F798ce270-1cea-434f-b566-3d252328a144_1024x1536.png 848w, https://substackcdn.com/image/fetch/$s_!VG2v!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F798ce270-1cea-434f-b566-3d252328a144_1024x1536.png 1272w, https://substackcdn.com/image/fetch/$s_!VG2v!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F798ce270-1cea-434f-b566-3d252328a144_1024x1536.png 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container 
restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><div><hr></div><h3>Consensus in Distributed Computing</h3><p><em>Consensus </em>in its simplest form refers to <em>a general agreement</em>. In the weekend outing analogy, it refers to all friends agreeing to a location. It&#8217;s quite likely that several options are considered before the group eventually agrees on a particular location.</p><p>In distributed computing too, one or more nodes may propose <em>values</em>. Of all these values, one needs to be agreed upon by all the nodes. 
It&#8217;s up to the consensus algorithm to <em>decide </em>upon one of these values and propagate the decision to all the nodes.</p><p>Formally, a consensus algorithm <em>must </em>satisfy the following properties &#8594;</p><ol><li><p><em>Uniform agreement &#8594; </em>All the nodes <em>agree </em>upon the same value&#8202;&#8212;&#8202;even if a node itself <em>proposed a different value </em>initially.</p></li><li><p><em>Integrity &#8594; </em>Once a value is <em>agreed </em>upon by a node, it shouldn&#8217;t change.</p></li><li><p><em>Validity &#8594; </em>If a node <em>agrees </em>to a value, that value must have been <em>proposed </em>by at least one of the nodes.</p></li><li><p><em>Termination &#8594; </em>Eventually every participating node agrees upon a <em>value.</em></p></li></ol><p><em>Uniform agreement</em> and <em>integrity </em>form the core idea of consensus&#8202;&#8212;&#8202;everyone agrees on the <em>same value</em>, and once decided, it&#8217;s final.</p><p>The <em>validity </em>property eliminates trivial behavior wherein a node agrees to a value irrespective of what has been proposed.</p><p>The <em>termination</em> property ensures fault tolerance. If one or more nodes fail, the cluster should progress and eventually agree upon a value. This also eliminates the possibility of a <em>dictator</em> node which takes all decisions and jeopardizes the whole cluster in case it fails.</p><p>Of course, if <em>all </em>the nodes fail, the algorithm can&#8217;t proceed. There is a limit to the number of failures an algorithm can tolerate. An algorithm that can correctly guarantee consensus amongst <em>n</em> nodes of which at most <em>t</em> fail is said to be <em>t-resilient</em>.</p><p>In essence, the termination property is a <em>liveness </em>guarantee while the remaining three are <em>safety </em>guarantees.</p><h3>Raft</h3><p>Raft is commonly expanded as <em><strong>Reliable, Replicated, Redundant, And Fault-Tolerant</strong></em>, reflecting its design principles in distributed systems. 
It ensures <em>reliability </em>by maintaining consistent logs, <em>replication </em>across nodes for durability, <em>redundancy</em> to avoid single points of failure, and <em>fault tolerance </em>to continue operating despite crashes or network issues. Together, these qualities make Raft a robust consensus algorithm for distributed computing.</p><h4>Explanation</h4><p><em>Raft </em>utilizes a <em>leader</em>-based approach to achieve consensus. In a <em>Raft </em>cluster a node is either a <em>leader </em>or a <em>follower</em>. A node could also be a <em>candidate </em>for a brief duration when a leader is unavailable, i.e. when <a href="https://en.wikipedia.org/wiki/Leader_election">leader election</a> is underway.</p><p>The cluster has <em>one and only one elected leader</em>, which is fully responsible for managing log replication on the other nodes of the cluster. This means the leader alone decides on the placement of new log entries and on the flow of data between itself and the other nodes, without consulting them. A <em>leader </em>leads until it fails or disconnects, in which case the remaining nodes elect a new leader.</p><p>The consensus problem in Raft is thus fundamentally broken into two independent sub-problems: <em>Leader Election</em> and <em>Log Replication</em>.</p><p><strong>Leader Election</strong></p><p>Leader election in Raft occurs when the current leader fails or during initialization. Each election begins a new <em>term</em>, a time period in which a leader must be chosen. A node becomes a candidate if it doesn&#8217;t receive heartbeats from a leader within the election timeout. It then increments the term, votes for itself, and requests votes from others. Nodes vote once per term, on a first-come-first-served basis. A candidate wins if it secures a majority; otherwise, a new term begins with a fresh election. 
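The candidate transition described above can be sketched in a few lines of Java. This is a minimal illustration under stated assumptions, not a production implementation: the names (RaftNode, grantVote, and so on) are hypothetical, and a real Raft node exchanges RequestVote RPCs over the network and persists its term and vote to stable storage.

```java
import java.util.concurrent.ThreadLocalRandom;

// Minimal sketch of Raft's candidate transition (illustrative only).
class RaftNode {
    enum Role { FOLLOWER, CANDIDATE, LEADER }

    Role role = Role.FOLLOWER;
    int currentTerm = 0;
    Integer votedFor = null; // node id voted for in currentTerm, null if none
    final int id;

    RaftNode(int id) { this.id = id; }

    // Called when no heartbeat arrives within the election timeout.
    void onElectionTimeout() {
        currentTerm++;           // start a new term
        role = Role.CANDIDATE;
        votedFor = id;           // vote for itself
    }

    // First-come-first-served: grant at most one vote per term.
    boolean grantVote(int candidateId, int candidateTerm) {
        if (candidateTerm < currentTerm) return false;
        if (candidateTerm > currentTerm) { currentTerm = candidateTerm; votedFor = null; }
        if (votedFor == null || votedFor == candidateId) {
            votedFor = candidateId;
            return true;
        }
        return false;
    }

    // A candidate wins if it secures a strict majority of the cluster.
    void onVotesCounted(int votes, int clusterSize) {
        if (role == Role.CANDIDATE && votes > clusterSize / 2) role = Role.LEADER;
    }

    // Randomized timeout staggers candidates to reduce split votes.
    static long electionTimeoutMillis() {
        return ThreadLocalRandom.current().nextLong(150, 300);
    }
}
```

The one-vote-per-term rule in grantVote is what makes electing two leaders in the same term impossible: two candidates cannot both collect majorities from the same set of voters.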
Randomized timeouts reduce split votes by staggering candidate starts, ensuring quicker resolution and stable leadership through heartbeat messages.</p><blockquote><p>Raft is not <a href="https://en.wikipedia.org/wiki/Byzantine_fault">Byzantine fault tolerant</a>; the nodes trust the elected leader, and the algorithm assumes all participants are trustworthy.</p></blockquote><p><strong>Log Replication</strong></p><p>The leader manages client requests and ensures consistency across the cluster. Each request is appended to the leader&#8217;s log and sent to followers. If followers are unavailable, the leader retries until replication succeeds.</p><p>Once a majority of followers confirm replication, the entry is committed, applied to the leader&#8217;s own state, and considered durable. This also commits prior entries, which followers then apply to their own state, maintaining log consistency across the cluster.</p><p>If a leader crashes, inconsistencies may arise if some entries were not fully replicated. A new leader resolves this by reconciling logs. It identifies the last matching entry with each follower, deletes conflicting entries in their logs, and replaces them with its own. 
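The reconciliation step just described (find the last matching entry, delete conflicts, append the leader's entries) can be sketched as follows. The entry and method names are illustrative assumptions; a real implementation converges on the match point incrementally through term-indexed AppendEntries RPCs rather than comparing whole logs.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch of Raft's log reconciliation after a leader change.
// An entry is identified by its term and command; two logs match up to
// the last index where both position and term agree.
class LogReconciliation {
    record Entry(int term, String command) {}

    // Index of the last entry where the follower's log matches the leader's (-1 if none).
    static int lastMatchingIndex(List<Entry> leader, List<Entry> follower) {
        int i = 0;
        while (i < leader.size() && i < follower.size()
                && leader.get(i).equals(follower.get(i))) {
            i++;
        }
        return i - 1;
    }

    // Delete conflicting follower entries and replace them with the leader's.
    static List<Entry> reconcile(List<Entry> leader, List<Entry> follower) {
        int match = lastMatchingIndex(leader, follower);
        List<Entry> result = new ArrayList<>(follower.subList(0, match + 1));
        result.addAll(leader.subList(match + 1, leader.size()));
        return result;
    }
}
```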
This ensures consistency even after failures.</p><h4>Additional Considerations</h4><p>The Raft algorithm includes the following additional considerations that make it a robust consensus algorithm for distributed computing.</p><p><strong>Safety Guarantees</strong></p><p>Raft ensures the following safety guarantees &#8594;</p><ul><li><p><em>Election safety &#8594;</em> at most one leader can be elected in a given term.</p></li><li><p><em>Leader append-only &#8594;</em> a leader can only append new entries to its logs (<em>it can neither overwrite nor delete entries</em>).</p></li><li><p><em>Log matching &#8594; </em>if two logs contain an entry with the same index and term, then the logs are identical in all entries up through the given index.</p></li><li><p><em>Leader completeness &#8594;</em> if a log entry is committed in a given term then it will be present in the logs of the leaders of that and all later terms.</p></li><li><p><em>State safety &#8594;</em> if a node has applied a particular log entry to its state, then no other node may apply a different command for the same log index.</p></li></ul><p><strong>Cluster Membership Changes</strong></p><p>Raft handles cluster membership changes using <em><strong>joint consensus</strong></em>, a transitional phase where both <em>old </em>and <em>new </em>configurations overlap.</p><p>During this phase, log entries must be committed to both sets, leaders can come from either, and elections require majorities from <em>both</em>. 
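The dual-majority rule of joint consensus can be sketched as a simple check. The names below (JointConsensus, hasMajority) are hypothetical helpers for illustration, not from any Raft library.

```java
import java.util.Set;

// Sketch of the joint-consensus rule: while old and new configurations
// overlap, a commit or election needs a majority of the OLD configuration
// AND a majority of the NEW configuration.
class JointConsensus {
    // Strict majority of a single configuration.
    static boolean hasMajority(Set<String> voters, Set<String> config) {
        long votes = voters.stream().filter(config::contains).count();
        return votes > config.size() / 2;
    }

    // Decision rule during the transitional (joint) phase.
    static boolean jointMajority(Set<String> voters, Set<String> oldCfg, Set<String> newCfg) {
        return hasMajority(voters, oldCfg) && hasMajority(voters, newCfg);
    }
}
```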
Once the new configuration is replicated to a majority of its nodes, the system fully transitions.</p><p>Raft also addresses three challenges &#8594;</p><ol><li><p>New nodes without logs are excluded from majorities until they catch up.</p></li><li><p>Leaders not in the new configuration step down to followers.</p></li><li><p>Nodes on the old configuration that still recognize a leader ignore disruptive vote requests.</p></li></ol><p><strong>Log Compaction</strong></p><p>Log compaction in Raft works by nodes taking snapshots of committed log entries, storing them with the <em>last index and term</em>. Leaders send these snapshots to lagging nodes, which then discard their log entirely or truncate it up to the snapshot&#8217;s latest entry. This also ensures <em>durability </em>in Raft.</p><h4>Limitations of Raft</h4><p>Raft has its own limitations, trading off <em>scalability and flexibility </em>as compared to other consensus algorithms<em>.</em></p><ol><li><p><em>Leader Bottleneck</em> &#8594; Raft relies heavily on a single leader to coordinate log replication. If the leader fails, the system pauses until a new leader is elected, which can slow progress.</p></li><li><p><em>Scaling </em>&#8594; Raft doesn&#8217;t scale well to very large clusters&#8202;&#8212;&#8202;leader elections and log replication become slower and riskier as the number of nodes grows.</p></li><li><p><em>Network partitions</em> &#8594; these can cause temporary unavailability, since Raft prioritizes consistency over availability. An edge case exists where the elected leader is repeatedly forced to resign and leadership switches between nodes continuously, forcing the whole cluster to halt.</p></li></ol><div><hr></div><h4>Real World Production Usage of Raft</h4><ul><li><p><strong><a href="https://en.wikipedia.org/wiki/Etcd">Etcd</a></strong> uses Raft to manage a highly-available replicated log&#8202;&#8212;&#8202;utilized primarily in Kubernetes clusters for configuration management.</p></li><li><p><strong><a href="https://en.wikipedia.org/wiki/Neo4j">Neo4j</a> </strong>uses Raft to ensure consistency and safety.</p></li><li><p><strong>Apache Kafka Raft (KRaft)</strong> uses Raft for metadata management. In recent versions, <a href="https://medium.com/@husain.ammar/breaking-free-from-zookeeper-why-kafkas-kraft-mode-matters-90819c3c3e57">KRaft replaced Apache Zookeeper</a> in Kafka.</p></li><li><p><strong><a href="https://en.wikipedia.org/wiki/Camunda">Camunda</a> </strong>uses the Raft consensus algorithm for data replication.</p></li></ul><div><hr></div><h3>Raft vs Paxos</h3><p>Raft was introduced to make consensus easier to understand and implement compared to Paxos. While Paxos is theoretically robust, it&#8217;s notoriously complex, making it hard for engineers to build reliable systems from it. Raft simplifies the process by breaking consensus into clear steps&#8202;&#8212;<em>&#8202;leader election, log replication, and safety&#8202;</em>&#8212;&#8202;without sacrificing correctness. 
This clarity makes Raft more approachable for real-world distributed systems.</p><h4><strong>When to Choose</strong></h4><ul><li><p><strong>Raft &#8594;</strong> Useful when building new distributed systems where clarity, maintainability, and developer adoption matter (<em>e.g., databases, coordination services</em>).</p></li><li><p><strong>Paxos &#8594;</strong> Useful in academic or highly specialized systems where <em>theoretical rigor</em> is prioritized over ease of implementation.</p></li></ul><p>In practice, Raft is usually the better choice for modern engineering teams because it balances correctness with simplicity.</p><div><hr></div><h4><strong>Future Trends in Consensus</strong></h4><p>Future consensus algorithms are moving beyond leader-based models like Raft and Paxos. A key trend is <strong><a href="https://gramoli.github.io/2026/02/13/leaderless-consensus.html">leaderless consensus</a></strong>, where no single node coordinates decisions. Instead, all nodes collaborate equally, reducing the risk of a single point of failure. This makes systems more resilient and fair, especially in global networks where reliability is critical. For example, in blockchain or distributed databases, leaderless designs help ensure trust and consistency without relying on one &#8220;<em>boss</em>&#8221; node.</p><p>Another trend is <strong>scalability-focused consensus</strong>, which aims to cut down communication overhead. As systems grow to thousands of nodes, traditional methods struggle with efficiency. New protocols are exploring ways to minimize message exchanges while still guaranteeing agreement.</p><p><strong>Hybrid approaches</strong> are also being explored, combining leaderless designs with probabilistic or quorum-based methods. 
These balance speed and fault tolerance, making them suitable for high-performance applications.</p><p>Finally, <strong>energy-efficient consensus</strong> is gaining attention, especially in blockchain, where <em>proof-of-work</em> is costly. Future algorithms will likely emphasize greener, lightweight mechanisms.</p><p>Consensus is evolving toward <em>fairness, scalability, and sustainability</em>&#8202;&#8212;&#8202;ensuring distributed systems can <em>handle global scale without sacrificing reliability</em>.</p><div><hr></div><h3>Conclusion</h3><p>Raft simplifies the complex world of distributed consensus by breaking it into clear steps&#8202;&#8212;&#8202;<em>leader election, log replication, and safety guarantees</em>. While engineers may not encounter Raft every day, understanding it is essential when making architectural or design decisions for systems that demand reliability and consistency.</p><p>Raft ensures that clusters agree on shared state even in the face of failures, though it comes with <em>trade&#8209;offs like leader bottlenecks and limited scalability</em>.</p><p>Its adoption in tools such as <em>etcd, Kafka, and Neo4j</em> shows its practical importance. 
Compared to Paxos, Raft is easier to grasp and implement, making it a strong foundation for modern distributed systems.</p><p>As consensus evolves toward leaderless and scalable designs, Raft remains a critical concept every architect should be aware of when shaping resilient, fault&#8209;tolerant solutions.</p><div><hr></div><h4>References and Further Read</h4><ul><li><p><a href="https://en.wikipedia.org/wiki/Consensus_%28computer_science%29">Consensus</a></p></li><li><p><a href="https://en.wikipedia.org/wiki/Raft_%28algorithm%29">Raft Algorithm</a></p></li><li><p><a href="https://raft.github.io/">Raft (GitHub)</a></p></li><li><p><a href="https://dataintensive.net/">Designing Data-Intensive Applications</a></p></li><li><p><a href="https://martinfowler.com/books/patterns-distributed.html">Patterns of Distributed Systems</a></p></li></ul><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://ammarhusain.substack.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">Thanks for reading! 
Subscribe for free to receive new posts and support my work.</p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div>]]></content:encoded></item><item><title><![CDATA[From Chaos to Order: Understanding Structured Concurrency in Java]]></title><description><![CDATA[Java&#8217;s Path to Concurrent Efficiency]]></description><link>https://ammarhusain.substack.com/p/from-chaos-to-order-understanding</link><guid isPermaLink="false">https://ammarhusain.substack.com/p/from-chaos-to-order-understanding</guid><dc:creator><![CDATA[Ammar Husain]]></dc:creator><pubDate>Mon, 26 Jan 2026 06:46:18 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!DzuX!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3ce8091d-5e48-4729-b910-9fa8bc8eea3b_1024x1024.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<h2><strong>Introduction</strong></h2><p>Typically, complexity in programming is managed by breaking down tasks into sub-tasks. These sub-tasks can then be executed concurrently.</p><p>Since Java 5, the <code>ExecutorService</code> API has helped programmers execute these sub-tasks concurrently. However, given the nature of concurrent execution, each sub-task can fail or succeed independently, with no implicit communication between them. A failure in one sub-task does not automatically cancel the other sub-tasks<em>. </em>Although an attempt can be made to manage these cancellations manually via external handling, it is quite tricky to get right, especially when a large number of sub-tasks is involved. This could potentially result in <em>loose threads</em> (alternatively known as <em>thread leaks). 
</em>Although <em>Virtual Threads </em>is a cost-effective way to dedicate a thread to each (sub)task, managing the results from<em> </em>them still remains a challenge.</p><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://ammarhusain.substack.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">Thanks for reading! Subscribe for free to receive new posts and support my work.</p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!DzuX!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3ce8091d-5e48-4729-b910-9fa8bc8eea3b_1024x1024.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!DzuX!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3ce8091d-5e48-4729-b910-9fa8bc8eea3b_1024x1024.png 424w, https://substackcdn.com/image/fetch/$s_!DzuX!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3ce8091d-5e48-4729-b910-9fa8bc8eea3b_1024x1024.png 848w, 
https://substackcdn.com/image/fetch/$s_!DzuX!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3ce8091d-5e48-4729-b910-9fa8bc8eea3b_1024x1024.png 1272w, https://substackcdn.com/image/fetch/$s_!DzuX!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3ce8091d-5e48-4729-b910-9fa8bc8eea3b_1024x1024.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!DzuX!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3ce8091d-5e48-4729-b910-9fa8bc8eea3b_1024x1024.png" width="1024" height="1024" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/3ce8091d-5e48-4729-b910-9fa8bc8eea3b_1024x1024.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1024,&quot;width&quot;:1024,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:306228,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://ammarhusain.substack.com/i/185812920?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3ce8091d-5e48-4729-b910-9fa8bc8eea3b_1024x1024.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!DzuX!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3ce8091d-5e48-4729-b910-9fa8bc8eea3b_1024x1024.png 424w, 
https://substackcdn.com/image/fetch/$s_!DzuX!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3ce8091d-5e48-4729-b910-9fa8bc8eea3b_1024x1024.png 848w, https://substackcdn.com/image/fetch/$s_!DzuX!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3ce8091d-5e48-4729-b910-9fa8bc8eea3b_1024x1024.png 1272w, https://substackcdn.com/image/fetch/$s_!DzuX!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3ce8091d-5e48-4729-b910-9fa8bc8eea3b_1024x1024.png 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" 
y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>An executor service allows one thread to create it and another thread to submit tasks to it. The threads that perform the tasks bear no relation to either of those threads. Moreover, a completely different thread can await the result of execution of these sub-tasks via a reference to their <code>Future</code> objects &#8212; provided immediately upon task submission to the executor service. Thus, effectively, a sub-task does not have to return to the task that submitted it. It could possibly return to multiple threads, or to none.</p><p>Also, the relation between a task &amp; its sub-tasks is only logical and not visible in the code structure. There is no enforcement or tracking of the task-to-sub-task relationship at runtime either.</p><p><em><strong>Structured Concurrency</strong></em> in Java aims to solve all of the above challenges by &#8594;</p><ul><li><p>reliably automating the cancellation of sub-tasks, avoiding thread leaks and delays.</p></li><li><p>ensuring a (sub)task returns only to the thread which submitted it.</p></li><li><p>enforcing a structured relation between a task and its sub-tasks &#8212; which could be nested as well.</p></li></ul><h2><strong>Unstructured Concurrency</strong></h2><p>Let&#8217;s understand the current situation of unstructured concurrency in Java. Consider a function <code>handle()</code> which fetches user information and an associated order.</p><pre><code>Response handle() throws ExecutionException, InterruptedException {
    Future&lt;String&gt;  user  = esvc.submit(() -&gt; findUser());
    Future&lt;Integer&gt; order = esvc.submit(() -&gt; fetchOrder());
    String theUser  = user.get();   // Join findUser
    int    theOrder = order.get();  // Join fetchOrder
    return new Response(theUser, theOrder);
}</code></pre><p>At first glance the code seems simple and does what it intends to do. However, on closer inspection we can identify multiple issues &#8594;</p><ul><li><p>If <code>findUser()</code> throws an exception then the thread running <code>fetchOrder()</code> leaks, as the latter has no knowledge of the former&#8217;s execution status.</p></li><li><p>If the thread running <code>handle()</code> is interrupted then both the threads running <code>findUser()</code><em> </em>and <code>fetchOrder()</code> leak, as these threads have no knowledge of the thread which spawned them.</p></li><li><p>If <code>findUser()</code> takes too long and meanwhile<em> </em><code>fetchOrder()</code><em> </em>fails, the failure would only be identified when <code>order.get()</code><em> </em>is invoked.</p></li><li><p>Although the code is structured as a task related to its sub-tasks, this relation is merely logical and is neither explicitly described at compile time nor enforced at runtime.</p></li></ul><p>The first three situations arise due to the lack of any automated mechanism for cancelling the other threads. This potentially leads to wasted resources (<em>as threads continue to execute</em>), cancellation delays, and at worst these leaked threads may interfere with other threads. Although we may attempt to handle the cancellation manually, not only is it tricky to get right, it complicates the overall program and creates more room for errors.</p><p>The fourth situation arises due to the lack of a formal syntax which binds the threads into a parent-child relationship for their task-to-sub-task hierarchy. 
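A sketch of what the manual-cancellation workaround mentioned above entails; the executor and task bodies here are illustrative stand-ins (one sub-task simply fails), not the original example's findUser()/fetchOrder().

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Future;

// Manually cancelling the sibling task when one fails: the error-prone
// bookkeeping that structured concurrency removes.
class ManualCancellation {
    static String handle(ExecutorService esvc) throws Exception {
        Future<String> user = esvc.submit(() -> "alice");
        Callable<Integer> failingOrder =
                () -> { throw new IllegalStateException("order failed"); };
        Future<Integer> order = esvc.submit(failingOrder);
        try {
            String u = user.get();
            int o = order.get();    // throws ExecutionException here
            return u + ":" + o;
        } catch (ExecutionException e) {
            // We must remember to cancel every sibling ourselves,
            // or their threads keep running (a leak).
            user.cancel(true);
            order.cancel(true);
            throw e;
        }
    }
}
```

Every new sub-task added to such a method means another Future to track and cancel on every failure path, which is exactly why this approach scales poorly.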
The <code>Future</code> objects provided are unhelpful here as well.</p><p>All these limitations are due to the <em>unstructured </em>nature of concurrency via <code>ExecutorService</code> and <code>Future</code>, which lack an automated way of cancelling sub-tasks or tracking task-to-sub-task relationships.</p><h2><strong>Structured Concurrency</strong></h2><p>Structured concurrency is built on the principle that</p><blockquote><p><em>If a task splits into concurrent sub-tasks then they all return to the same place, namely the task&#8217;s code block.</em></p></blockquote><p>In structured concurrency, the task awaits the sub-tasks&#8217; results and monitors them for failures. The APIs define well-defined entry and exit points for the flow of execution through a block of code, and enforce a strict nesting of the lifetimes of operations in a way that mirrors their syntactic nesting in the code.</p><p>Structured concurrency has a natural synergy with virtual threads. A new virtual thread can be dedicated to every task, and when a task fans out by submitting sub-tasks for concurrent execution, it can dedicate a new virtual thread to each sub-task too. Moreover, the task-to-sub-task relationship forms a tree, with each virtual thread carrying a reference to its unique parent.</p><p>While virtual threads deliver an abundance of threads, structured concurrency can correctly and robustly coordinate them. This also enables observability tools to display threads as they are understood by the developer.</p><h3><strong>StructuredTaskScope</strong></h3><p>In the Structured Concurrency API, <code>StructuredTaskScope</code> is the principal class.</p><p>The earlier example of the <code>handle()</code> function rewritten with <code>StructuredTaskScope</code> is shown below &#8594;</p><pre><code>Response handle() throws InterruptedException {
    try (var scope = StructuredTaskScope.open()) {
        Subtask&lt;String&gt; user = scope.fork(() -&gt; findUser());
        Subtask&lt;Integer&gt; order = scope.fork(() -&gt; fetchOrder());
        scope.join();   // Join subtasks, propagating exceptions
        // Here both subtasks have succeeded, so compose their results
        return new Response(user.get(), order.get());
    }
}</code></pre><p>With the use of these APIs we achieve all of the below, which addresses the shortcomings of unstructured concurrency discussed so far.</p><ul><li><p>On failure of either <code>findUser()</code><em> </em>or <code>fetchOrder()</code>, the other is automatically cancelled if it has not completed yet.</p></li><li><p>If the thread running <code>handle()</code> is interrupted before or during the invocation of <code>join()</code>, both sub-tasks, viz. <code>findUser()</code> and <code>fetchOrder()</code>, are cancelled if not completed yet.</p></li><li><p>The task structure and code mirror each other with abundant clarity. The <code>scope</code> is the task (<em>parent</em>) while each <code>fork</code> creates a sub-task (<em>child</em>). The task waits for sub-tasks to complete or be cancelled and decides whether to succeed or fail, with no overhead of lifecycle management.</p></li><li><p>An additional benefit earned via the task-to-sub-task hierarchy is a major improvement in observability. The call stack or thread dump clearly displays the relationship of <code>handle()</code><em> </em>to <code>findUser()</code> and <code>fetchOrder()</code>, which can easily be understood by the developer.</p></li></ul><p>With the automatic cancellation/cleanup achieved via the default <em>completion policy</em> of the zero-parameter <code>open</code> factory method, thread leaks are avoided altogether. The default completion policy is &#8212; <em>fail the whole scope if any sub-task fails</em>. 
This policy can be customized or replaced with another suitable policy via <code>StructuredTaskScope.Joiner</code> &#8212; refer to the <em><strong>Completion Policies</strong></em> section later in this article for more details.</p><p>Before proceeding, let&#8217;s review a few important characteristics of <code>StructuredTaskScope</code> that a developer should be aware of &#8594;</p><ol><li><p>As of <em><strong>JDK 26</strong></em> this is still a preview feature and thus disabled by default.</p></li><li><p>The thread that creates the scope is its <em>owner</em>.</p></li><li><p>With every invocation of <code>fork(&#8230;)</code> a new virtual thread is started by default &#8212; for the execution of the respective sub-task.</p></li><li><p>A sub-task can create its own nested <code>StructuredTaskScope</code> to fork its own sub-tasks, thus creating a hierarchy. Once the scope is <em>closed</em>, all of its sub-tasks are guaranteed to be terminated, ensuring no threads are leaked.</p></li><li><p>If <code>fork()</code> is invoked by a thread that is neither the owner nor part of the scope hierarchy, a <code>WrongThreadException</code> is thrown.</p></li><li><p>Calling <code>join()</code> within a scope is <em>mandatory</em>. If a scope&#8217;s block exits before joining then the scope will wait for all sub-tasks to terminate and then throw an exception.</p></li><li><p>When <code>join()</code> completes successfully, each of the sub-tasks has either completed successfully, failed, or been cancelled because the scope was closed.</p></li><li><p><code>StructuredTaskScope</code> enforces structure and order upon concurrent operations. 
Thus, using a scope outside of a <code>try-with-resources</code> block and returning without calling <code>close()</code>, or without maintaining the proper nesting of <code>close()</code> calls, may cause the scope&#8217;s methods to throw a <code>StructureViolationException</code>.</p></li><li><p>To allow for efficient cancellation, sub-tasks must be coded so that they finish as soon as possible when <em>interrupted</em>. Sub-tasks that do not respond to interrupts, e.g. because they block on methods that are not interruptible, may delay the closing of a scope <em>indefinitely</em>. The <code>close</code> method always waits for threads executing sub-tasks to finish, even if the scope is cancelled.</p></li><li><p>Sub-tasks forked in a scope inherit <code>ScopedValue</code> bindings. If a scope&#8217;s owner reads a value from a bound <code>ScopedValue</code> then each sub-task will read the same value. More details about scoped values can be read <a href="https://medium.com/@husain.ammar/scoped-values-revolutionizing-java-context-management-8d9b5e3e0b2e">here</a>.</p></li></ol><h3><strong>Completion Policies</strong></h3><p>To avoid unnecessary processing during concurrent sub-task execution, it is common to use <em>short-circuiting patterns. </em>In structured concurrency, completion policies &#8212; via an <em>out-of-the-box </em>or<em> custom </em><code>StructuredTaskScope.Joiner</code> &#8212; can be tailored to specific requirements.</p><p>For common use cases, the following factory methods are provided &#8594;</p><ul><li><p><code>allSuccessfulOrThrow()</code> &#8594; Use it when results from all sub-tasks are of the <em>same type</em> and are required for overall processing. In case any sub-task fails, it cancels the scope &amp; remaining sub-task(s) and causes <code>join</code> to throw.</p></li><li><p><code>anySuccessfulOrThrow()</code> &#8594; Use it when the result of any single sub-task suffices for overall processing. 
As soon as the result of the first successful sub-task is available, <code>join</code> returns with that value. This policy causes <code>join</code> to throw <em>only</em> if all sub-tasks fail.</p></li><li><p><code>awaitAllSuccessfulOrThrow()</code> &#8594; The <em>default</em> policy employed by <code>StructuredTaskScope.open</code>. Use it when the results from all sub-tasks are required for overall processing. It is suited to cases where the sub-task results are of different types, in contrast to <code>allSuccessfulOrThrow()</code>, where the sub-task results are of the same type. Here too, in case any sub-task fails, it cancels the scope and the remaining sub-task(s) and causes <code>join</code> to throw.</p></li><li><p><code>awaitAll()</code> &#8594; This <code>Joiner</code> is useful for cases where sub-tasks make use of <em>side-effects</em> rather than returning results or failing with exceptions. It can also be used for <em>fan-in</em> scenarios where sub-tasks are forked to handle incoming connections and the number of sub-tasks is unbounded. <em>Note</em> &#8212; this <code>Joiner</code> does <em>not</em> cancel the scope if a sub-task <em>fails</em>.</p></li></ul><p>The earlier <code>handle</code> example already shows the implicit usage of <code>awaitAllSuccessfulOrThrow()</code> as the <em>default completion policy</em> via <code>open</code>. The example below exhibits the usage of <code>anySuccessfulOrThrow()</code>, which returns the result of the first successful sub-task. A real-world use case would be to query multiple servers and return the very first response from any of them (<em>tail chopping</em>).</p><pre><code>&lt;T&gt; T race(Collection&lt;Callable&lt;T&gt;&gt; tasks) throws InterruptedException {
    try (var scope = StructuredTaskScope.open(Joiner.&lt;T&gt;anySuccessfulOrThrow())) {
        tasks.forEach(scope::fork);
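        // join() returns the result of the first sub-task to succeed, as per
        // anySuccessfulOrThrow(); on scope exit the remaining sub-tasks are
        // cancelled.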
        return scope.join();
    }
}</code></pre><p>Note that as soon as one sub-task succeeds, the scope is automatically cancelled, cancelling any unfinished sub-tasks. The scope fails only if all of the sub-tasks fail.</p><p>In the example below, when all sub-tasks complete successfully, a stream of the sub-task results is returned, as per the <code>allSuccessfulOrThrow</code> policy in use.</p><pre><code>&lt;T&gt; List&lt;T&gt; runConcurrently(Collection&lt;Callable&lt;T&gt;&gt; tasks) throws InterruptedException {
    try (var scope = StructuredTaskScope.open(Joiner.&lt;T&gt;allSuccessfulOrThrow())) {
        tasks.forEach(scope::fork);
        return scope.join().map(Subtask::get).toList();
    }
}</code></pre><p>Here, if one or more sub-tasks fail, <code>join()</code> throws a <code>FailedException</code> with the exception from one of the failed sub-tasks as its cause. The <code>join()</code> method returns a stream of the completed sub-tasks rather than <code>null</code>, making this joiner suited to cases where all sub-tasks return a result of the same type and where the <code>Subtask</code> objects returned by the <code>fork</code> method are ignored.</p><p><strong>Custom Completion Policies</strong></p><p>The <code>Joiner</code> interface can be implemented directly in order to support <em>custom completion policies</em>. It has two type parameters: <code>T</code> for the result type of the sub-tasks executed in the scope, and <code>R</code> for the result type of the <code>join()</code> method. The key methods to implement are <code>onFork</code>, <code>onComplete</code> and <code>result</code>.</p><p>In the custom <code>Joiner</code> class below, the results of sub-tasks that complete successfully are collected, ignoring sub-tasks that fail. The <code>onComplete</code> method may be invoked by several threads concurrently, and so must be thread-safe. The <code>result</code> method returns a stream of the collected results.</p><pre><code>class CollectingJoiner&lt;T&gt; implements Joiner&lt;T, Stream&lt;T&gt;&gt; {
    // Thread-safe queue: onComplete may run concurrently on several threads.
    private final Queue&lt;T&gt; results = new ConcurrentLinkedQueue&lt;&gt;();
    public boolean onComplete(Subtask&lt;? extends T&gt; subtask) {
        if (subtask.state() == Subtask.State.SUCCESS) {
            results.add(subtask.get());
        }
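        // Per the Joiner contract, returning true would cancel the scope;
        // returning false lets the remaining sub-tasks run to completion.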
        return false;
    }
    public Stream&lt;T&gt; result() {
        return results.stream();
    }
}</code></pre><p>This custom policy can then be used as below &#8594;</p><pre><code>&lt;T&gt; List&lt;T&gt; allSuccessful(List&lt;Callable&lt;T&gt;&gt; tasks) throws InterruptedException {
    try (var scope = StructuredTaskScope.open(new CollectingJoiner&lt;T&gt;())) {
        tasks.forEach(scope::fork);
        return scope.join().toList();
    }
}</code></pre><h3><strong>Exception Handling</strong></h3><p>When the scope is considered to have failed, the <code>join()</code> method throws a <code>FailedException</code>, which wraps the underlying cause from the failed sub-task.</p><p>In the <code>handle()</code> example, it may be useful to add a <code>catch</code> block to the <code>try-with-resources</code> statement in order to handle exceptions after the scope is closed, as shown below &#8594;</p><pre><code>try (var scope = StructuredTaskScope.open()) {
   ...
} catch (StructuredTaskScope.FailedException e) {
   Throwable cause = e.getCause();
   switch (cause) {
       case IOException ioe -&gt; ..
       default -&gt; ..
   }
}</code></pre><h3><strong>Configurations</strong></h3><p>The <code>StructuredTaskScope</code> API has an additional overloaded <code>open</code> method, which accepts a <code>Joiner</code> together with a function that produces a <a href="https://download.java.net/java/early_access/loom/docs/api/java.base/java/util/concurrent/StructuredTaskScope.Config.html">configuration object</a>. This can be used to set the scope&#8217;s name for monitoring and management purposes, set the scope&#8217;s timeout, and set the <a href="https://docs.oracle.com/en/java/javase/24/docs/api/java.base/java/util/concurrent/ThreadFactory.html">thread factory</a> to be used by the scope&#8217;s <code>fork</code> methods to create threads.</p><p>Below is a revised version of the <code>runConcurrently</code> method which sets a thread factory and a timeout &#8594;</p><pre><code>&lt;T&gt; List&lt;T&gt; runConcurrently(Collection&lt;Callable&lt;T&gt;&gt; tasks,
                            ThreadFactory factory,
                            Duration timeout)
    throws InterruptedException
{
    try (var scope = StructuredTaskScope.open(Joiner.&lt;T&gt;allSuccessfulOrThrow(),
                                              cf -&gt; cf.withThreadFactory(factory)
                                                      .withTimeout(timeout))) {
        tasks.forEach(scope::fork);
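        // Sub-task threads are now created by the supplied factory; if the
        // timeout elapses before join() completes, join() throws TimeoutException.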
        return scope.join().map(Subtask::get).toList();
    }
}</code></pre><p>If the timeout expires before or while waiting in the <code>join()</code> method, the scope is cancelled, which cancels all incomplete sub-tasks, and <code>join()</code> throws a <code>TimeoutException</code>.</p><h2><strong>Conclusion</strong></h2><p><em><strong>Structured Concurrency</strong></em> in Java represents a significant evolution in concurrent programming, addressing many of the shortcomings found in <em>traditional unstructured concurrency</em> models.</p><p>Key advantages of structured concurrency in Java include &#8212;</p><ul><li><p><strong>Automated Cancellation &#8212;</strong> Sub-tasks are automatically cancelled upon failure, reducing resource wastage and eliminating the complexity of manual cancellation handling.</p></li><li><p><strong>Clear Task Hierarchy &#8212;</strong> The hierarchical task structure enhances code readability, maintainability, and observability, making it easier to debug and monitor concurrent operations.</p></li><li><p><strong>Improved Error Handling &#8212;</strong> Centralized error handling through structured concurrency ensures predictable and robust behavior in the presence of exceptions.</p></li><li><p><strong>Enhanced Observability &#8212; </strong>The clear parent-child relationships displayed in thread dumps and call stacks aid developers in understanding and managing concurrent tasks.</p></li><li><p><strong>Virtual Threads &#8212;</strong> The synergy with virtual threads allows for efficient and scalable concurrent programming, making it possible to handle a large number of concurrent tasks without the overhead of traditional threads.</p></li></ul><p>By adopting structured concurrency, Java developers can write more efficient, reliable, and maintainable concurrent code, ultimately leading to better software quality and improved developer productivity.</p><h3><strong>References and Further Reads</strong></h3><ul><li><p><a 
href="https://en.wikipedia.org/wiki/Structured_concurrency">Structured Concurrency</a></p></li><li><p><a href="https://docs.oracle.com/en/java/javase/25/core/structured-concurrency.html">Oracle Docs</a></p></li><li><p><a href="https://openjdk.org/jeps/525">JEP 525</a></p></li></ul><h2><strong>Structured Concurrency in Various Programming Languages</strong></h2><p>While Structured Concurrency is still being previewed in Java, it is already available in various programming languages. Here is a sneak peek at a few of these languages.</p><p><strong>Kotlin</strong></p><ul><li><p><strong>Coroutines &#8212;</strong> Kotlin offers coroutines, which are a lightweight concurrency model that allows for asynchronous programming without blocking threads. Coroutines provide structured concurrency through scopes, ensuring that tasks are properly managed and cancelled when necessary.</p></li><li><p><strong>Structured Concurrency &#8212; </strong>Kotlin&#8217;s structured concurrency is built into its coroutine framework, making it easy to write concurrent code that is both efficient and easy to understand.</p></li></ul><p><strong>Go</strong></p><ul><li><p><strong>Goroutines &#8212;</strong> Go uses goroutines, which are lightweight threads managed by the Go runtime. Goroutines can be easily created and managed, allowing for high concurrency.</p></li><li><p><strong>Channels &#8212;</strong> Go provides channels for communication between goroutines, enabling structured concurrency by ensuring that tasks can communicate and synchronize effectively.</p></li></ul><p><strong>Python</strong></p><ul><li><p><strong>Asyncio &#8212;</strong> Python&#8217;s asyncio library provides support for asynchronous programming using coroutines. 
Asyncio allows for structured concurrency through the use of tasks and event loops, ensuring that tasks are properly managed and cancelled when necessary.</p></li><li><p><strong>Task Groups &#8212;</strong> Python&#8217;s asyncio library includes task groups, which provide a way to manage and cancel groups of tasks together, ensuring that tasks are properly coordinated.</p></li></ul><p><strong>C#</strong></p><ul><li><p><strong>Async/Await &#8212;</strong> C# provides support for asynchronous programming using the async and await keywords. This allows for structured concurrency by ensuring that tasks are properly managed and cancelled when necessary.</p></li><li><p><strong>Task Parallel Library (TPL) &#8212; </strong>The TPL provides support for parallel programming in C#, allowing for the creation and management of tasks in a structured manner.</p></li></ul>]]></content:encoded></item></channel></rss>