Parallel Computing has been an area of active research interest and application for decades. Nowadays, Parallel Execution on EVM is emerging as a new paradigm in the blockchain industry.
On October 17th, 2022, NodeReal delivered a feature release, Parallel EVM 2.0, which is based on the latest BNB Chain release v1.1.16. You may refer to the detailed release note at the node-real repo.
It has been a while since The Implementation of Parallel EVM 1.0. As we noted in that blog, the Parallel EVM implementation is planned to be accomplished in several phases:
- 1.0 is the foundation where the architecture and workflow will be set up.
- 2.0 is the performance-enhanced version.
- 3.0 will apply parallel to miner mode.
- 4.0+ has not been determined yet.
Parallel execution is believed to greatly improve blockchain performance, and many high-performance Layer 1 blockchains, such as Solana, Avalanche, Aptos, and Sui, claim to support it. However, these blockchains made tradeoffs at the very beginning, sacrificing some user experience to make parallel execution much easier to support. The inconvenience mainly comes from requiring a predeclared access list for each contract interface, which removes dynamic access ability.
Ethereum and BNB Chain have no such requirement, which is why there has been no big progress on parallel execution for them yet.
As we implemented parallel execution on BNB Chain, we encountered many challenges, especially in making it stable and efficient. In this blog, we will walk you through the architecture design and the latest results.
The BSC Network Traffic
Before digging into the detailed implementation, we would like to do a brief sharing on how the project is carried out.
Parallel EVM has been desired for a long time. There was tremendous traffic on the BSC network around November 2021, as shown in the following diagram, which finally made us put Parallel EVM on the agenda.
Milestone Phase 1.0
Phase 1.0 was kicked off in the middle of December 2021; it took us around 3 months to design, implement, tune, and test.
Milestone Phase 2.0
Phase 2.0 was kicked off at the start of April 2022 and it took us around 2 months.
Milestone Phase 3.0
We planned to support parallel validator mode in phase 3.0, but it has not started yet. The current implementation is very complicated. Rather than adding more code, it would be better to take a break and rethink it.
This part is a bit technical and needs some engineering background, but it is crucial to understanding the Parallel EVM implementation. If you are not interested in the details, you may skip it.
In short, Parallel EVM improves the performance of block processing by executing transactions concurrently, as simplified and illustrated in the following diagram.
To be more specific, there will be two major components: Dispatcher & Slot.
A configured number of slots (threads) will be created on process startup. Each slot receives transaction requests from the dispatcher and executes transactions concurrently. Transactions dispatched to the same slot are executed in order.
The dispatcher dispatches transactions dynamically and efficiently, mainly aiming for a low conflict rate and a balanced workload.
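The dispatcher/slot split above can be sketched roughly as follows. This is a toy illustration in Go, not the actual NodeReal code; the `Tx` type, the sender-based routing rule, and `hashString` are hypothetical stand-ins for the real dispatch policy:

```go
package main

import (
	"fmt"
	"sync"
)

// Tx is a hypothetical transaction: only the sender matters here,
// because transactions from the same account are likely to conflict
// and should be routed to the same slot.
type Tx struct {
	ID     int
	Sender string
}

// hashString is a toy hash; the real dispatch policy is smarter.
func hashString(s string) int {
	h := 0
	for _, c := range s {
		h = h*31 + int(c)
	}
	return h
}

func main() {
	const numSlots = 4
	slots := make([]chan Tx, numSlots)
	var wg sync.WaitGroup

	// Each slot is a worker goroutine; transactions sent to the same
	// slot channel are executed in order relative to each other.
	for i := 0; i < numSlots; i++ {
		slots[i] = make(chan Tx, 16)
		wg.Add(1)
		go func(id int, in <-chan Tx) {
			defer wg.Done()
			for tx := range in {
				// EVM execution would happen here.
				fmt.Printf("slot %d executed tx %d\n", id, tx.ID)
			}
		}(i, slots[i])
	}

	// The dispatcher routes potentially conflicting transactions
	// (same sender) to the same slot.
	txs := []Tx{{1, "alice"}, {2, "bob"}, {3, "alice"}, {4, "carol"}}
	for _, tx := range txs {
		slots[hashString(tx.Sender)%numSlots] <- tx
	}
	for _, ch := range slots {
		close(ch)
	}
	wg.Wait()
}
```

Because transactions 1 and 3 share a sender, they land in the same slot and run in order; unrelated transactions can proceed in parallel on other slots.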
The following diagram shows the general workflow. If you are interested and want to go deeper, it would be better to read the code directly.
Here are the major components of Parallel EVM. The fundamental components, marked in gray, were implemented in phase 1.0. In phase 2.0, we added new components such as Streaming Pipeline, Unconfirmed Access, Memory Pool, and Balance Makeup, and we also redesigned components like the dispatcher and conflict detector to improve performance.
Transaction execution is split into 2 stages: the Execution Stage and the Finalize Stage.
- Execution Stage: a stage of pure EVM execution. The validity of the execution result is not determined until finalization; a transaction could be executed several times to get a valid result.
- Finalize Stage: once the result of the execution stage is confirmed to be valid, the finalize stage commits the execution result and flushes all state changes into the base StateDB.
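The execute-then-finalize loop can be modeled as: execute optimistically, redo if the result turns out to be stale, and commit only once it is valid. A minimal sketch, with hypothetical names (`Result`, `isValid`) standing in for the real state machine:

```go
package main

import "fmt"

// Result is a hypothetical execution result carrying the version of
// the world state it was executed against.
type Result struct {
	TxID     int
	StateVer int
}

// execute stands in for pure EVM execution against a state version.
func execute(txID, stateVer int) Result {
	return Result{TxID: txID, StateVer: stateVer}
}

// isValid stands in for conflict detection: a result is valid only
// if it was produced against the latest committed state.
func isValid(r Result, latestVer int) bool {
	return r.StateVer == latestVer
}

func main() {
	latestVer := 0
	for txID := 1; txID <= 3; txID++ {
		// Execution stage: may run several times until the result
		// is valid against the latest committed state.
		r := execute(txID, latestVer-1) // simulate a stale first run
		for !isValid(r, latestVer) {
			r = execute(txID, latestVer) // redo with the latest state
		}
		// Finalize stage: commit the result and advance the state.
		latestVer++
		fmt.Printf("tx %d finalized at state version %d\n", r.TxID, latestVer)
	}
}
```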
The dispatcher is responsible for transaction preparation, dispatch, and result merge. The dispatch policy can impact execution performance: we try to dispatch potentially conflicting transactions to the same slot to reduce the conflict rate.
In Dispatcher 2.0, we actually removed the dispatch action, so the dispatch channel IPC is no longer needed. Dispatcher 2.0 has 2 parts: static dispatch and dynamic dispatch.
- Static dispatch is done at the beginning of the block process. It is responsible for making sure potentially conflicting transactions are dispatched to the same slot, and it tries its best to balance the workload between slots.
- Dynamic dispatch works at runtime. There is a stolen mode: when a slot finishes its statically dispatched tasks, it can steal a transaction from another busy slot.
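The stolen mode is essentially work stealing. A minimal sketch of the idea, under the assumption (ours, not the source's) that an idle slot takes from the tail of the longest remaining queue; the real implementation is lock-based and per-slot:

```go
package main

import "fmt"

// steal pops a transaction from the tail of the busiest queue when a
// slot runs out of statically dispatched work (toy model).
func steal(queues [][]int, idle int) (int, bool) {
	busiest, max := -1, 0
	for i, q := range queues {
		if i != idle && len(q) > max {
			busiest, max = i, len(q)
		}
	}
	if busiest < 0 {
		return 0, false // every other slot is drained too
	}
	q := queues[busiest]
	tx := q[len(q)-1]
	queues[busiest] = q[:len(q)-1]
	return tx, true
}

func main() {
	// Slot 0 finished its static work; slots 1 and 2 are still busy.
	queues := [][]int{{}, {5, 6}, {7, 8, 9}}
	if tx, ok := steal(queues, 0); ok {
		fmt.Printf("slot 0 stole tx %d from the busiest slot\n", tx)
	}
}
```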
Universal Unconfirmed State Access
It is a bit complicated. With unconfirmed state access, there is a priority order for accessing a StateObject, in descending priority: Self Dirty ➝ Unconfirmed Dirty ➝ Base StateDB (dirty, pending, snapshot, and trie tree).
- Self Dirty: the state has been updated by the transaction itself.
- Unconfirmed Dirty: the state has been updated by other transactions that have not been committed yet, so their state may not be valid.
- Base StateDB: the state of the base StateDB is always valid since only committed results will be flushed to base StateDB.
In a word, the principle is to do our best to get the most reliable state info.
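The descending lookup order can be sketched as a simple fallthrough. This is an illustration only; the three maps stand in for the real StateObject layers, and the key/value types are simplified:

```go
package main

import "fmt"

type State = map[string]int

// lookup resolves a key with the descending priority described above:
// self dirty -> unconfirmed dirty -> base StateDB.
func lookup(key string, selfDirty, unconfirmed, base State) (int, string) {
	if v, ok := selfDirty[key]; ok {
		return v, "self dirty"
	}
	if v, ok := unconfirmed[key]; ok {
		return v, "unconfirmed dirty"
	}
	return base[key], "base StateDB"
}

func main() {
	selfDirty := State{"balance": 30}                       // written by this tx
	unconfirmed := State{"balance": 20, "nonce": 2}         // written by uncommitted txs
	base := State{"balance": 10, "nonce": 1, "code": 42}    // always valid

	for _, key := range []string{"balance", "nonce", "code"} {
		v, src := lookup(key, selfDirty, unconfirmed, base)
		fmt.Printf("%s = %d (from %s)\n", key, v, src)
	}
}
```

Here `balance` resolves from the transaction's own dirty set, `nonce` from an unconfirmed peer, and `code` falls through to the base StateDB.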
For both BNB Chain and Ethereum, transactions are executed sequentially and depend on previous results, since they use a shared world state.
The conflict detector is used to check whether the parallel execution result is valid. If it is not valid, the transaction will be rescheduled as a redo.
In phase 1.0, we used a DirtyRead policy for conflict detection: if any of the state (balance, nonce, code, KV storage...) read by the transaction was changed (dirty), we marked it as conflicted. These conflicted transactions were scheduled for a redo with the latest world state.
In Parallel 2.0, we do conflict detection based on reads. We no longer care what has been changed; the only thing we care about is whether what we read is correct. We keep a detailed read record and compare it with the base StateDB on conflict detection. It is more straightforward and accurate.
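The read-based policy boils down to: record every value read during execution, then re-check each one against the base StateDB before committing. A simplified sketch (flat string keys stand in for the real address/slot layout):

```go
package main

import "fmt"

// ReadRecord logs every value a transaction read during execution.
type ReadRecord map[string]int

// hasConflict re-checks each recorded read against the base StateDB:
// if anything we read no longer matches, the optimistic result is
// invalid and the transaction must be redone.
func hasConflict(reads ReadRecord, base map[string]int) bool {
	for key, readVal := range reads {
		if base[key] != readVal {
			return true
		}
	}
	return false
}

func main() {
	base := map[string]int{"alice.balance": 100, "bob.nonce": 7}
	reads := ReadRecord{"alice.balance": 100, "bob.nonce": 7}
	fmt.Println("conflict before commit:", hasConflict(reads, base)) // no conflict yet

	// Another transaction's commit changes a value we read.
	base["alice.balance"] = 90
	fmt.Println("conflict after commit:", hasConflict(reads, base)) // now conflicted
}
```

Note that under this policy a write by another transaction to state we never read is harmless, which is exactly why it is more accurate than the DirtyRead policy above.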
To make conflict detection more efficient, we introduced two new concepts:
- Detect In Advance: when the majority of transactions have been executed at least once, a dedicated routine will do conflict detection in advance. It is configurable and more efficient.
- Parallel KV Conflict Detection: conflict checking is CPU-consuming, especially the storage check, since we have to go through all the read addresses, and each address could have lots of KV read records. It is one of the bottlenecks right now, so we do KV conflict detection concurrently to speed it up.
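One way to parallelize the storage check is to verify each address's KV read records in its own goroutine. This is our illustrative layout, not necessarily how the real code partitions the work:

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// checkAddressesConcurrently verifies each address's KV read records
// in a separate goroutine; the base maps are only read, so the
// concurrent access is safe.
func checkAddressesConcurrently(reads, base map[string]map[string]int) bool {
	var conflicted atomic.Bool
	var wg sync.WaitGroup
	for addr, kvs := range reads {
		wg.Add(1)
		go func(addr string, kvs map[string]int) {
			defer wg.Done()
			for k, v := range kvs {
				if base[addr][k] != v {
					conflicted.Store(true) // any mismatch invalidates the tx
					return
				}
			}
		}(addr, kvs)
	}
	wg.Wait()
	return conflicted.Load()
}

func main() {
	base := map[string]map[string]int{
		"contractA": {"slot0": 1, "slot1": 2},
		"contractB": {"slot0": 9},
	}
	reads := map[string]map[string]int{
		"contractA": {"slot0": 1},
		"contractB": {"slot0": 9},
	}
	fmt.Println("conflict:", checkAddressesConcurrently(reads, base))
}
```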
The merge component merges the confirmed result into the base StateDB. It is executed within the dispatcher routine since it needs to access the base StateDB. To stay concurrency-safe, we limit access to the base StateDB to the dispatcher only; the parallel execution slots cannot access the base StateDB directly.
In phase 1.0, we allocated a new StateDB for each transaction, a snapshot copy of the base StateDB. According to the memory analysis, each copy operation could take ~62KB of memory. This memory is allocated for each transaction and discarded when the transaction is finished. These short-lived objects trigger GC frequently, which has a big impact on performance.
To avoid allocating objects so frequently, we use a memory pool to manage these objects. They are recycled asynchronously while the block is committing.
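In Go, this recycling pattern maps naturally onto `sync.Pool`. A minimal sketch, assuming a simplified `slotDB` type in place of the real per-transaction StateDB:

```go
package main

import (
	"fmt"
	"sync"
)

// slotDB stands in for the per-transaction StateDB copy (~62KB each
// in the memory analysis above); pooling it avoids creating a fresh
// short-lived object, and a GC cycle, for every transaction.
type slotDB struct {
	dirty map[string]int
}

var dbPool = sync.Pool{
	New: func() any { return &slotDB{dirty: make(map[string]int)} },
}

func acquire() *slotDB { return dbPool.Get().(*slotDB) }

func release(db *slotDB) {
	// Reset before recycling so the next transaction starts clean.
	for k := range db.dirty {
		delete(db.dirty, k)
	}
	dbPool.Put(db)
}

func main() {
	db := acquire()
	db.dirty["balance"] = 42
	release(db)

	db2 := acquire() // likely the recycled object, already reset
	fmt.Println("dirty entries after recycle:", len(db2.dirty))
}
```

Whether the pool returns a recycled object or a fresh one, the caller always sees a clean map, which is the property that makes recycling safe.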
Parallel 1.0 uses DeepCopy to copy state objects, which is CPU- and memory-consuming when the storage contains lots of KV elements. We replaced it with LightCopy in phase 2.0 to avoid redundant memory copies of StateObject. With LightCopy, we do not copy the storage at all. In fact, this is a must if we use Unconfirmed State Access: since storage could be accessed from different unconfirmed StateDBs, we cannot simply copy all the KV elements of a single state object.
The lifecycle of transaction execution helps in better understanding how these components work together.
In general, it could be divided into 3 stages:
- Pre-Stage: do some preparation, e.g. decide which thread will execute the transaction and set up a dedicated StateDB for it.
- RT-Stage: runtime stage, the transaction is executed in this stage, and it will try to access the world state.
- Post-Stage: after execution, it will wait to do conflict detection, redo if conflict and commit if valid.
Transaction execution is a bit different between phase 1.0 and phase 2.0 since we introduced more components. The main differences are:
- There is no dispatch on transaction execution since it has been done in advance.
- The creation of StateDB is moved from the dispatcher thread to the parallel thread.
- CopyOnWrite is replaced by UnconfirmedAccess and LightCopy.
- Conflict Detect and Finalize have been moved to the dispatcher thread.
It would be more straightforward to show the whole picture in a pipeline view:
Take a concurrency of 8 as an example for a brief view.
In phase 1.0, it is a simple pipeline; there could be many wait periods (orange background) to synchronize the parallel execution state.
In phase 2.0, we introduced the Streaming Pipeline, which greatly eliminates the waiting periods and makes the pipeline more efficient.
We can reveal more detail here on the streaming pipeline.
Operations of Conflict Detect, Finalize, and Merge are all done within the dispatch thread. Each parallel execution slot has a shadow slot, a backup slot doing exactly the same job as the primary slot. The shadow slot is used to make sure a redo can be scheduled ASAP.
When the execution stage is completed, the slot doesn't need to wait for its previous transaction's final result. It can queue its result and move on to execute the next transaction in the pending queue.
That is why it is called a streaming pipeline: transactions can be executed continuously, with no more waiting.
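The queue-and-move-on behavior can be sketched with a buffered channel: the execution side pushes results without blocking, while the dispatcher drains them for conflict detection, finalize, and merge. A toy model, not the actual pipeline code:

```go
package main

import "fmt"

// drain consumes queued execution results in order on the dispatcher
// side (conflict detect + finalize + merge would happen per item).
func drain(results <-chan int) []int {
	var finalized []int
	for txID := range results {
		finalized = append(finalized, txID)
	}
	return finalized
}

func main() {
	// A buffered channel lets the execution slot queue its result and
	// immediately pick up the next pending transaction, instead of
	// blocking until the previous transaction's final verdict.
	results := make(chan int, 8)
	for txID := 1; txID <= 4; txID++ {
		results <- txID // queue the optimistic result, move on
	}
	close(results)
	fmt.Println("finalized order:", drain(results))
}
```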
We set up several devices to run the compatibility test of full sync; almost all the blocks on the BNB Chain testnet and mainnet are covered.
We set up 2 instances to test the performance benefit, with a parallel number of 8 and pipe commit enabled.
The 2 instances use the same hardware configuration: 16 cores, 64GB memory, and a 7TB SSD.
The test ran for ~50 hours. The total block processing (execution, validation, commit) cost is reduced by ~20% to ~50%; the benefit varies for different block patterns.
Have A Try
We have rebased the code to the latest BNB Chain release v1.1.16, you may have a try if you are interested.
For detailed usage, you may refer to: https://github.com/node-real/bsc/releases/tag/v1.1.16+Parallel-2.0
Please be aware that:
- Only full sync mode is supported; validator mode is not supported right now.
- Only BNB Chain is supported right now. Generally speaking, it could be fine to run Ethereum too, but it has not been fully tested and could break.
NodeReal is a one-stop blockchain infrastructure and service provider that embraces the high-speed blockchain era and empowers developers by “Make your Web3 Real”. We provide scalable, reliable, and efficient blockchain solutions for everyone, aiming to support the adoption, growth, and long-term success of the Web3 ecosystem.
Join Our Community
Join our community to learn more about NodeReal and stay up to date!