Q4 2022 Stability and Performance Improvements

Owned by Matt Nelson (Deactivated). Last updated: Nov 04, 2022.

Stability

Some solutions have already been implemented.

- Sync stalling
  - Rules about what counts as our best peer post-merge are not implemented
  - We keep retrying the same peer over and over; shuffling the peers would help (see the peer-selection sketch at the end of this section)
- Invalid block errors
  - The consensus layer is on a fork with bad data
  - A storage exception in Besu causes us to report the block as invalid to the consensus layer, which then puts us on the wrong fork (potential fix by Justin; GH issue)
  - Another Besu internal error we don't yet know about could potentially also cause invalid blocks
    - A potential solution has been identified
- Worldstate root mismatch
  - Bonsai and snapshots
  - Solution: confirmed working for many cases; needs more testing and handling of any corner cases
- Issues around peering
  - Sometimes a restart is needed to find new peers during sync
  - Potentially because of the lack of peer evaluation during and after sync
  - Solution: ??
- Losing many peers
  - Because threads were blocked, we lost many peers
  - Vert.x, for example, uses a different approach to threading
  - Solution: ??
- Issues with user experience
  - Difficulty in reliably communicating with new/inexperienced users
  - Docs & lack of education
  - More complicated setup post-merge
  - Solution: write up "What to expect from staking at home" and an FAQ for the Besu docs
- Out-of-memory errors
  - Documentation on what kind of memory configuration is needed
  - No mechanism to detect memory leaks
  - Potential mitigation: make deploying Besu easier by providing default configs
  - Solution: ??
- Users don't know how far along syncing is
  - Insufficient logging & bad log UX
  - Solution/plan: ??
- Users hesitant to update or restart Besu with the latest version due to the impression that it is unstable
- Issues with RPC calls
  - Incompatibilities with the RPC spec / not-same-as-Geth behaviour causing crashes
  - Does not meet Chainlink's and other orgs' needs for RPC calls (accuracy, speed)
  - Do we implement all of the RPC interfaces that Geth does? E.g. Logger, Trace (all the methods)
  - Solution: ??
  - Some specific RPC calls (trace/debug) take a long time or OOM
    - Lack of testing of large RPC calls
    - Might need to understand the root cause better
    - Solution: ?
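For the sync-stalling item above, one concrete direction is to stop pinning retries to a single "best" peer and instead pick randomly among comparably good candidates. The sketch below is a minimal illustration under assumed types: the Peer record, its chainHeight/reputation fields, and the topN cut-off are hypothetical stand-ins, not Besu's EthPeer or peer-selection API.

```java
import java.util.Comparator;
import java.util.List;
import java.util.Optional;
import java.util.Random;
import java.util.stream.Collectors;

// Minimal sketch: pick a download peer at random from the top candidates
// instead of always retrying the single highest-scored peer.
// The Peer record and its fields are hypothetical, not Besu's EthPeer API.
public final class ShuffledPeerSelector {

  public record Peer(String id, long chainHeight, int reputation) {}

  private static final Random RANDOM = new Random();

  /** Returns a random peer from the topN best candidates, if any are available. */
  public static Optional<Peer> selectPeer(final List<Peer> connectedPeers, final int topN) {
    final List<Peer> candidates =
        connectedPeers.stream()
            // Prefer peers that are ahead of us and have a decent reputation.
            .sorted(
                Comparator.comparingLong(Peer::chainHeight)
                    .thenComparingInt(Peer::reputation)
                    .reversed())
            .limit(topN)
            .collect(Collectors.toList());
    if (candidates.isEmpty()) {
      return Optional.empty();
    }
    // Shuffling within the top candidates avoids hammering one stalled peer.
    return Optional.of(candidates.get(RANDOM.nextInt(candidates.size())));
  }
}
```

In Besu itself the equivalent logic would live in the sync/peer-task layer and use real reputation and chain-height data; the point is only that retry loops should rotate among viable peers rather than hammer one stalled peer.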
Performance

Staking Performance

- Poor execution performance leading to missed attestations
  - More investigation is ongoing, and some user stories are being created
- Poor block production
  - As we tweak the tx pool to build the best block for the user with valuable/DoS-resistant transactions, we need to ensure there is no performance hit to the client - block building uses a lot of CPU because we repeat block production until the CL asks for the block
  - Late blocks could also cause import challenges and would force a restart of block building
  - Snapshots could help with concurrency in this case
  - EIP-4844 could alter this process and also requires good performance

I/O and Disk Performance

- Besu has problems with slow I/O/disks → Besu generates a lot of I/O
- We are not using the flat DB during block processing, so we have to gather a lot of data from disk
- Need caching in more areas - read/write caching (see the read-cache sketch at the end of this section)
- Do less work: persist less to disk, persist trie logs but not the worldstate (Amez / Karim)
- The first hotspot in Besu is reading data from RocksDB via RocksDB.get. This is mainly because we have to fetch most of the worldstate nodes from the Patricia Merkle Trie
- Need to identify more areas where I/O contention is commonplace

Trace Performance

- Poor performance when tracing blocks / transactions
  - Not sure why we are slow
- Besu would often crash when tracing a full block
  - OOM errors
- A short timeout can cause issues
- Is the DB tuned for tracing? (3786)
- Will need good performance for any rollup use cases
- Solution?: Instead of replaying the traces for every user request, save the trace result in a separate database or module, rather than only saving the block and the worldstate for each block
- Solution?: Separate tracing into a different microservice
- Solution?: Separate the query part of Besu into a completely separate process. Queries that ask Besu questions about the chain do not need to slow down the main flow

Sync Performance

- Poor syncing performance (still the case with 22.10.0?)
- Do we need to verify the proof-of-work blocks on Mainnet?
- Useless conversion from bytes to RLP and back during sync
- Sometimes we get stuck for a while during a snap sync (needs investigation)
- Full sync / Forest performance (also snap sync, for the block-downloading part)
  - Persisted worldstate changes may be able to help with full sync on Bonsai
  - For Forest we need to determine where the performance improvements are - some of the recent Bonsai improvements can be tweaked to suit the Forest use case (unknown though)

EVM Performance - pending Amez availability and IMAPP testing

- We need analysis that tells us whether the gas cost of each operation corresponds to the algorithmic complexity of Besu's implementation. Bonsai might affect the algorithmic complexity of some operations
- IMAPP testing - we need an overall analysis (Matt has connected with this team for a profile)
- We do not have a profile of Besu's EVM performance (work with Danno?) - see the microbenchmark sketch at the end of this section
- SLOAD, SSTORE - slowest relative to gas cost?
- EVM performance improvements often appear without context, and the broader team is unsure how the optimizations are created. Is there a standard playbook of optimizations we run through, or are there EVM-specific performance observations we are reacting to?
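On the I/O side, the RocksDB.get hotspot noted above is the kind of read path where a small in-process read-through cache for trie nodes can cut repeated disk lookups during block processing. The sketch below is a minimal illustration only, assuming a generic byte[]-keyed lookup function; NodeReadCache and its API are hypothetical and not Besu's KeyValueStorage interface.

```java
import java.nio.ByteBuffer;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Optional;
import java.util.function.Function;

/**
 * Minimal read-through LRU cache in front of a slow key/value lookup
 * (e.g. RocksDB reads of Patricia Merkle Trie nodes). This is a sketch,
 * not Besu's KeyValueStorage interface: the backing lookup is supplied
 * as a plain function.
 */
public final class NodeReadCache {

  private final Map<ByteBuffer, byte[]> cache;
  private final Function<byte[], Optional<byte[]>> backingLookup;

  public NodeReadCache(
      final int maxEntries, final Function<byte[], Optional<byte[]>> backingLookup) {
    this.backingLookup = backingLookup;
    // Access-ordered LinkedHashMap evicting the least recently used entry.
    this.cache =
        new LinkedHashMap<>(16, 0.75f, true) {
          @Override
          protected boolean removeEldestEntry(final Map.Entry<ByteBuffer, byte[]> eldest) {
            return size() > maxEntries;
          }
        };
  }

  /** Returns the value from cache if present, otherwise reads through to storage. */
  public synchronized Optional<byte[]> get(final byte[] key) {
    final ByteBuffer wrapped = ByteBuffer.wrap(key); // byte[] has no value-based equals
    final byte[] cached = cache.get(wrapped);
    if (cached != null) {
      return Optional.of(cached);
    }
    final Optional<byte[]> loaded = backingLookup.apply(key);
    loaded.ifPresent(value -> cache.put(wrapped, value));
    return loaded;
  }
}
```

A production cache would need bounds tuned to heap size, metrics, and a concurrent implementation rather than a synchronized LinkedHashMap; the sketch only shows the read-through pattern that reduces RocksDB.get calls.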
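For the EVM-profiling gap above, a nightly JMH microbenchmark suite is one way to get per-operation numbers and catch regressions early. The skeleton below shows the harness shape only: the in-memory storage map is a trivial stand-in for real worldstate access, and the class and method names are hypothetical, not Besu's EVM benchmarks.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.TimeUnit;

import org.openjdk.jmh.annotations.Benchmark;
import org.openjdk.jmh.annotations.BenchmarkMode;
import org.openjdk.jmh.annotations.Mode;
import org.openjdk.jmh.annotations.OutputTimeUnit;
import org.openjdk.jmh.annotations.Scope;
import org.openjdk.jmh.annotations.Setup;
import org.openjdk.jmh.annotations.State;

/**
 * Skeleton of a JMH microbenchmark for opcode-style hot paths.
 * The in-memory "storage" map is a stand-in for real worldstate access,
 * not Besu's EVM; the point is the harness shape, which could run nightly
 * and flag regressions in ns/op.
 */
@State(Scope.Thread)
@BenchmarkMode(Mode.AverageTime)
@OutputTimeUnit(TimeUnit.NANOSECONDS)
public class OpcodeCostBenchmark {

  private final Map<Long, Long> storage = new HashMap<>();
  private long key;

  @Setup
  public void setUp() {
    for (long i = 0; i < 100_000; i++) {
      storage.put(i, i * 31);
    }
    key = 42;
  }

  @Benchmark
  public Long sloadLike() {
    // Analogous to SLOAD: read a slot from "storage". Returning the value
    // prevents dead-code elimination.
    return storage.get(key);
  }

  @Benchmark
  public Long sstoreLike() {
    // Analogous to SSTORE: write a slot to "storage".
    return storage.put(key, key + 1);
  }
}
```

Run with the standard JMH runner (jmh-core plus its annotation processor on the classpath); comparing ns/op across releases would make opcode-level regressions such as SLOAD/SSTORE slowdowns visible without a full profiling session.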
Do we know how we can solve these problems?

- Add automated nightly/CI tests to detect regressions as soon as possible
- Introduce more modularity
- Add a tracing solution at the Java level (improve observability)
- Use torrents for downloading blocks during the initial (archive) sync?
- Separate the query part of Besu into a completely separate process. Queries that ask Besu questions about the chain do not need to slow down the main flow

Process Improvements (Q1?)

Performance Testing

- Lack of performance testing, especially on RPC methods
- How do we get alerted and become aware that there is an actual performance regression?
- Automated performance testing of each release - nothing at the moment (see the latency-check sketch at the end of this page)
- Hive tests report how long the calls take, but only under small load
- Can we separate some RPC methods into separate microservices?

Slow Release / Testing Process (CPU, test bounding, process)

- Manual release process with a lot of wasted time waiting for builds that could be avoided
  - Waiting for multiple full builds to complete just because you are merging a PR or changing the version number - that doesn't need a full build
  - With code changes, a full build should be required
- Make it easier/faster to run all the tests locally, or avoid the need to create a draft PR to run tests remotely
- Support for many features makes the tests slow (ETC tests, Quorum tests, etc.) even when they are not needed for certain modifications

Issues with the process

- Regressions discovered late
- Need a more comprehensive testing strategy across contributors
  - Solution: ??
- Establish a better process for responding to the problems we discover
  - Good case study: the Sepolia issue over the weekend of Oct 29/30
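For the performance-testing gaps above (especially around RPC methods), even a crude latency-budget check in a nightly CI job would flag gross regressions between releases. The sketch below is an assumed setup, not an existing Besu tool: it expects a node already running at http://localhost:8545, uses eth_blockNumber as a stand-in for the heavier trace/debug calls, and the 500 ms budget is an arbitrary placeholder.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.time.Duration;

/**
 * Crude JSON-RPC latency budget check, suitable for a nightly CI job.
 * Assumes a node at http://localhost:8545; the 500 ms budget and the use of
 * eth_blockNumber are placeholders for the heavier trace/debug calls.
 */
public final class RpcLatencyCheck {

  private static final String RPC_URL = "http://localhost:8545";
  private static final Duration BUDGET = Duration.ofMillis(500);

  public static void main(final String[] args) throws Exception {
    final HttpClient client = HttpClient.newHttpClient();
    final HttpRequest request =
        HttpRequest.newBuilder(URI.create(RPC_URL))
            .header("Content-Type", "application/json")
            .POST(
                HttpRequest.BodyPublishers.ofString(
                    "{\"jsonrpc\":\"2.0\",\"method\":\"eth_blockNumber\",\"params\":[],\"id\":1}"))
            .build();

    final long start = System.nanoTime();
    final HttpResponse<String> response =
        client.send(request, HttpResponse.BodyHandlers.ofString());
    final Duration elapsed = Duration.ofNanos(System.nanoTime() - start);

    System.out.printf(
        "eth_blockNumber -> HTTP %d in %d ms%n", response.statusCode(), elapsed.toMillis());

    // Fail the CI job if the call exceeded the budget or did not succeed.
    if (response.statusCode() != 200 || elapsed.compareTo(BUDGET) > 0) {
      System.exit(1);
    }
  }
}
```

A real harness would exercise the slow trace/debug methods against a fixed reference dataset and record timings over time, but the pass/fail gating idea is the same.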