visit
Feel free to contribute on 💚
The IndirectPages always form a trie-structure. The trie itself might consist of several levels, whereas a level is added on demand if the current level can’t address the stored data anymore. The tries either index revisions or user-defined, AVL tree indexes (underneath a PathPage, a CASPage or a NamePage — depending on the kind of index).
The whole tree-structure is a huge persistent tree.
A persistent data structure is defined on Wikipedia as follows:In computing, a persistent data structure is a data structure that always preserves the previous version of itself when it is modified. Such data structures are effectively immutable, as their operations do not (visibly) update the structure in-place, but instead, always yield a new updated structure. The term was introduced in Driscoll, Sarnak, Sleator, and Tarjans’ 1986 article.”
SirixDB shares unchanged structures between revisions.
The following picture shows the page structure of two revisions after a transaction commit of the second revision:SirixDB stores an in-memory transaction intent log, which is only serialized on memory-pressure. On transaction-commit, SirixDB applies this log during a postorder-traversal. The UberPage, which is the main entry point into the resource-storage, is written last.
The database pages in SirixDB, in general, are of variable length and don’t have to be padded to fit into a predefined block-storage size. RecordPages store the application data. Currently, a native layer for XML, as well as JSON documents, exists. Fine-granular, fast random reads due to byte-addressable non-volatile memory allows the versioning of these RecordPages as well.
SirixDB implements several versioning strategies best known from backup systems for copy-on-write operations of record-pages. Namely, it either copiesMarc Kramis came up with the idea of a novel sliding snapshot algorithm, which balances read/write-performance to circumvent any write-peaks.
SirixDB does not need a write-ahead log. During a transaction-commit the UberPage is swapped in an atomic operation. Thus, the data is always consistent without the need of a WAL, which is basically a second durable storage and completely unnecessary due to SirixDBs design.
SirixDB bridges the gap between in-memory and disk-oriented database systems. Instead of a heavyweight buffer and concurrency manager, it uses a lightweight buffer manager and stores direct pointers to in-memory data structures, during a fetch of a page from persistent storage. If the buffer manager has to evict a page, it sets the in-memory pointer of the parent page to null.
Next, we’ll describe some of the most important design goals, while engineering SirixDB.Minimize Storage Overhead
SirixDB shares unchanged data pages as well as records between revisions, depending on a chosen versioning algorithm during the initial bootstrapping of a resources. SirixDB aims to balance read and writer performance in its default configuration. It uses a huge persistent (in the functional programming language sense), durable index structure for the main document storage as well as versioned secondary indexesConcurrent
SirixDB contains very few locks and aims to be as suitable for multithreaded systems as possibleAsynchronous
Operations can happen independently; each transaction is bound to a specific revision and only one read/write-transaction on a resource is permitted concurrently to N read-only-transactionsVersioning/Revision history
SirixDB stores a revision history of every resource in the database without imposing extra overhead. It uses a huge persistent, durable page-tree for indexing revisions and dataData integrity
SirixDB, like ZFS, stores full checksums of the pages in the parent pages. That means that almost all data corruption can be detected upon reading in the future, we aim to partition and replicate databases in the futureCopy-on-write semantics
Similarly to the file systems Btrfs and ZFS, SirixDB uses CoW semantics, meaning that SirixDB never overwrites data. Instead, database-page fragments are copied/written to a new locationPer revision and per page versioning
SirixDB does not only version on a per revision, but also on a per page-base. Thus, whenever we change a potentially small fraction
of records in a data-page, it does not have to copy the whole page and write it to a new location on a disk or flash drive. Instead, we can specify one of several versioning strategies known from backup systems or a novel sliding snapshot algorithm during the creation of a database resource. The versioning-type we specify is used by SirixDB to version data-pages
Guaranteed atomicity and consistency (without a WAL)
The system will never enter an inconsistent state (unless there is hardware failure), meaning that unexpected power-off won’t ever damage the system. This is accomplished without the overhead of a write-ahead-log (WAL)Log-structured and making the most out of byte-addressable NVM friendly
SirixDB batches writes and syncs everything sequentially to persistent storage during commits. It never overwrites committed data and only clusters revisions. Database page fragments are of variable size and may be scattered. Thus, SirixDB benefits from fast random access and byte-addressable NVM devicesConclusionAs writes are slower than reads on NVM and to save storage space for its evolutionary store, RecordPages are versioned. NVM eliminates the need to store all unchanged records of a page, even if a user only modifies a small number of records or even a single record. Thus, RecordPages are stored in fragments and scattered on persistent storage. Their size is variable, and they might be scattered on persistent storage.
The storage device underlying SirixDB should support fine granular byte-addressability or small block-sizes and fast random reads with possibly slower writes. Pages are never modified in-place. Thus “traditional” SSDs are a good fit as well.SirixDB batches writes and syncs these to persistent storage during a commit. It doesn’t need a WAL due to the atomic swapping of an UberPage, which is the main entry point into the storage of a resource. A resource is the equivalent to a table in relational database systems (currently either XML or JSON, but graphs or tables might be added in a future release).
My highest goal is to stabilize the storage engine, add some stuff needed for the front-end (for instance a JSON diff-format) as well as some JSON layer transactional primitives to copy subtrees into another resource.
I’m also working on a web together with another enthusiastic software engineer . We currently use Nuxt.js, TypeScript (some JavaScript) and I probably want to use D3.js in the future. However, I’m no front-end engineer and I’m just learning, Nuxt.js (Vue.js), TypeScript and struggle a lot of times. Any help would be awesome.
I’m envisioning a front-end with which we’re able to analyze the differences between revisions, for interacting with the storage and for executing time-travel queries to analyze for instance products or do audits.I have plenty of ideas to move forward, as for instance storing the log in Apache Kafka for horizontal scaling / sharding, adding cost-based query optimizations, building special non-blocking HTTP-Clients, adding Web-Sockets, a proper Kotlin DSL…Contributions are more than welcome 👩💻👨💻
Have fun and happy holidays
Johannes