Every time a Catalyst is restarted, the content server processes a snapshot containing all the active entities for each server. It reads all the entities in these snapshots and deploys the ones that it does not already have locally. This process takes up to about half an hour. The mechanism is defined in ADR-52; please read that document for more context. This document discusses and proposes solutions to improve this situation.
Every time a Catalyst starts, it takes up to half an hour to be ready. This process is necessary to synchronize with the rest of the Catalysts, but it brings many complexities:
With this work, we aim to eliminate, or reduce as much as possible, the re-processing of entities that were already processed. We estimate that this will reduce the bootstrapping time to less than 5 minutes.
Every 6 hours a Catalyst runs a job that re-creates a snapshot file containing all its active entity hashes and the timestamp of the latest entity included. When a Catalyst starts, it asks each of the other nodes for this snapshot file and processes all of them. A snapshot is at most ~6 hours outdated. Then, using the timestamp of the last entity included in the snapshot, the Catalyst asks for the newer active entities through the expensive /pointer-changes endpoint. Before the snapshots mechanism, /pointer-changes was used to synchronize the whole Catalyst, and there were performance issues. Snapshots are a huge improvement for starting a Catalyst from scratch, and they work relatively well for a synchronization that is not from scratch (i.e. a normal restart). But we want to keep improving by solving the problem that each restart needs to process the whole snapshot to be up to date, even when the downtime is very short.
Leverage the snapshots mechanism by partitioning the snapshots by time range. The snapshot structure and how it is processed will remain the same. A Catalyst will serve the content of all its active entities in multiple snapshots, partitioned by entity timestamp. Once a snapshot has been processed, the Catalyst remembers this and does not process it again.
Instead of generating a complete snapshot every 6 hours, a Catalyst will generate snapshots as time goes by, only for the newest entities. Let's say the snapshot units are daily, weekly, monthly and yearly. The Catalyst will create daily snapshots until a week has passed, then it will replace the 7 daily snapshots with a weekly one and continue generating daily snapshots again. The same logic applies to weekly, monthly and yearly snapshots.
All the Catalysts will have the same initial timestamp t_initial, from which they will generate the same time ranges for the snapshots: [t_initial, t_1], [t_1, t_2], ..., [t_(n-1), t_n].
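The roll-up described above can be sketched as a greedy partition of the time elapsed since the shared initial timestamp. This is a minimal sketch, not the Catalyst implementation: the function name `divideTimeRange` and the unit lengths (a month taken as 4 weeks, a year as 12 such months) are assumptions, chosen only so that each unit is a whole multiple of the previous one.

```typescript
type TimeRange = { initTimestamp: number; endTimestamp: number }

// Illustrative unit lengths in milliseconds (assumptions, not calendar-accurate).
const DAY = 24 * 60 * 60 * 1000
const WEEK = 7 * DAY
const MONTH = 4 * WEEK
const YEAR = 12 * MONTH
const UNITS_LARGEST_FIRST = [YEAR, MONTH, WEEK, DAY]

// Splits the time between tInitial and now into the largest complete units
// that fit, oldest first. Time that does not yet fill a whole day is left
// out until that day completes.
function divideTimeRange(tInitial: number, now: number): TimeRange[] {
  const ranges: TimeRange[] = []
  let cursor = tInitial
  while (cursor + DAY <= now) {
    // DAY always fits inside this loop, so the fallback is never used.
    const unit = UNITS_LARGEST_FIRST.find((u) => cursor + u <= now) ?? DAY
    ranges.push({ initTimestamp: cursor, endTimestamp: cursor + unit })
    cursor += unit
  }
  return ranges
}
```

Because the result depends only on `tInitial` and the current time, two Catalysts calling it on the same day produce the same ranges. For example, nine and a half days after `tInitial` it yields one weekly range followed by two daily ones.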
Given a time range [t_1, t_2], a snapshot would contain "all the entities with entityTimestamp within [t_1, t_2] that are active at the time of generation". The information associated with an entity in a snapshot is constant across all Catalysts: entityId, entityType, scene, pointers, authChain and entityTimestamp. This is really useful because it implies eventual convergence of the snapshot in the different Catalysts. It is eventual because it depends on the generation time of the snapshot.
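The selection rule above could look roughly like the following. This is a sketch under stated assumptions: `StoredEntity`, the `overwrittenBy` marker used to decide whether an entity is still active, and the function name are hypothetical, and only some of the listed fields are carried. The text writes closed ranges; treating the end of the range as exclusive is an assumption made here so that an entity on a boundary lands in exactly one snapshot.

```typescript
type TimeRange = { initTimestamp: number; endTimestamp: number }

// A subset of the per-entity data that is constant across Catalysts.
type SnapshotEntity = {
  entityId: string
  entityType: string
  pointers: string[]
  entityTimestamp: number
}

// `overwrittenBy` is a hypothetical marker: it is set when a newer deployment
// replaced the entity, which means the entity is no longer active.
type StoredEntity = SnapshotEntity & { overwrittenBy?: string }

function selectSnapshotEntities(all: StoredEntity[], range: TimeRange): SnapshotEntity[] {
  return all
    .filter((e) => e.overwrittenBy === undefined) // active at generation time
    .filter(
      (e) =>
        e.entityTimestamp >= range.initTimestamp &&
        e.entityTimestamp < range.endTimestamp // end treated as exclusive (assumption)
    )
    .map(({ entityId, entityType, pointers, entityTimestamp }) => ({
      entityId,
      entityType,
      pointers,
      entityTimestamp
    }))
    // A stable order makes the output identical for identical inputs.
    .sort(
      (a, b) =>
        a.entityTimestamp - b.entityTimestamp || a.entityId.localeCompare(b.entityId)
    )
}
```

Since the output depends only on data that is the same everywhere, two Catalysts holding the same active entities for a range produce the same list, which is the convergence property described above.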
A Catalyst asks the other Catalysts in the network for their snapshots, then processes each one by deploying all the entities within it and saves the snapshot hash in the database. The next time the Catalyst sees that hash, it won't process the same snapshot again. Note: if the deployment of an entity in a snapshot fails, it is persisted as a failed deployment in the database and retried in the future, so the snapshot can still be considered processed.
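The process-once behaviour, including the note about failed deployments, can be sketched as follows. The names `SyncState`, `DeployFn` and `processSnapshot` are hypothetical, and the two sets stand in for the database tables; the real storage layer is not described in this document.

```typescript
// Deploys one entity; rejects if the deployment fails (hypothetical signature).
type DeployFn = (entityId: string) => Promise<void>

// In-memory stand-ins for the database (assumption for this sketch).
type SyncState = {
  processedSnapshotHashes: Set<string>
  failedDeployments: Set<string>
}

// Returns true if the snapshot was processed now, false if it was skipped.
async function processSnapshot(
  hash: string,
  entityIds: string[],
  deploy: DeployFn,
  state: SyncState
): Promise<boolean> {
  if (state.processedSnapshotHashes.has(hash)) return false
  for (const entityId of entityIds) {
    try {
      await deploy(entityId)
    } catch {
      // Persisted for a later retry, so it does not block the snapshot.
      state.failedDeployments.add(entityId)
    }
  }
  // Recorded even when some deployments failed, as the note above allows.
  state.processedSnapshotHashes.add(hash)
  return true
}
```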
This way we avoid reprocessing snapshots but two complexities arise:
To address this situation, it is proposed that the Catalyst serving the active entities provides:
a. The hash of the snapshot.
b. A list of the hashes that it replaces.
A Catalyst processing the snapshots would then process a snapshot only if it has not already processed its hash or one of the hashes in the list of replaced ones.
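As a sketch, the skip rule can be written as a small predicate. It follows the wording above literally: a snapshot is skipped when its own hash, or any one of the hashes it replaces, has already been processed. A stricter reading that requires all replaced hashes to have been processed would use `every` instead of `some`; which of the two the Catalyst actually applies is not settled by this document. The function name and the `processed` set are hypothetical.

```typescript
// Only the fields needed for the decision are repeated here.
type SnapshotMetadata = {
  hash: string
  replacedSnapshotHashes?: string[]
}

function shouldProcessSnapshot(
  snapshot: SnapshotMetadata,
  processed: ReadonlySet<string>
): boolean {
  // Same snapshot already seen.
  if (processed.has(snapshot.hash)) return false
  const replaced = snapshot.replacedSnapshotHashes ?? []
  // Literal reading of the rule: one processed replaced hash is enough to skip.
  return !replaced.some((hash) => processed.has(hash))
}
```

With `processed` containing only a daily hash `d1`, a weekly snapshot that lists `d1` among its replaced hashes is skipped, while a new daily snapshot with an unseen hash is processed.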
The objects returned by the /snapshots endpoint will change to:
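export type SnapshotMetadata = {
  // Hash of the snapshot
  hash: string
  // Range of entity timestamps covered by this snapshot
  timeRange: {
    initTimestamp: number
    endTimestamp: number
  }
  numberOfEntities: number
  // Hashes of the snapshots that this one replaces, if any
  replacedSnapshotHashes?: string[]
}

// fetch() resolves to a Response, so the JSON body has to be parsed explicitly
const response = await fetch("/snapshots")
const snapshots: SnapshotMetadata[] = await response.json()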
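Since this metadata comes from other Catalysts, a consumer may want to check its shape before trusting it. The type guard below is a sketch of one way to do that; `isSnapshotMetadata` and `isRecord` are hypothetical helpers, not part of the proposal.

```typescript
type SnapshotMetadata = {
  hash: string
  timeRange: { initTimestamp: number; endTimestamp: number }
  numberOfEntities: number
  replacedSnapshotHashes?: string[]
}

const isRecord = (v: unknown): v is Record<string, unknown> =>
  typeof v === "object" && v !== null

// Returns true only when every required field has the expected type.
function isSnapshotMetadata(value: unknown): value is SnapshotMetadata {
  if (!isRecord(value)) return false
  const range = value["timeRange"]
  const replaced = value["replacedSnapshotHashes"]
  return (
    typeof value["hash"] === "string" &&
    isRecord(range) &&
    typeof range["initTimestamp"] === "number" &&
    typeof range["endTimestamp"] === "number" &&
    typeof value["numberOfEntities"] === "number" &&
    // The list of replaced hashes is optional.
    (replaced === undefined ||
      (Array.isArray(replaced) && replaced.every((h) => typeof h === "string")))
  )
}
```

A parsed response can then be narrowed with `payload.filter(isSnapshotMetadata)` once `payload` is known to be an array.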
There will be important performance improvements in the Catalyst bootstrap when a synced Catalyst is restarted:
/snapshots?
In the first version of the Fast Bootstrapping code, the big snapshot will still be generated and served in the /snapshot endpoint. If a Catalyst fails to read the snapshots in /snapshots, it will fall back to the old /snapshot endpoint.
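The fallback described above might be sketched like this. The `FetchJson` parameter is a hypothetical injected helper that resolves to the parsed body of a path on the remote Catalyst and rejects on any failure; the `SnapshotSource` type and the function name are also assumptions made for the example.

```typescript
// Hypothetical helper: resolves to the parsed JSON body, rejects on failure.
type FetchJson = (path: string) => Promise<unknown>

type SnapshotSource =
  | { kind: "partitioned"; snapshots: unknown } // from the new /snapshots
  | { kind: "legacy"; snapshot: unknown } // from the old /snapshot

async function fetchSnapshotsWithFallback(fetchJson: FetchJson): Promise<SnapshotSource> {
  try {
    return { kind: "partitioned", snapshots: await fetchJson("/snapshots") }
  } catch {
    // The remote Catalyst could not serve /snapshots: use the big snapshot.
    return { kind: "legacy", snapshot: await fetchJson("/snapshot") }
  }
}
```

Tagging the result with `kind` lets the caller choose the matching processing path without inspecting the payload.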