From the serverless computing perspective, however, direct application of this metaphor has limited value. Not only are individual data centers no longer visible, but the whole concept of Availability Zones disappears as well. The minimal unit we could reason about is One Region of One Account of One Vendor, as illustrated below:
What is a serverless cloud computer ALU?
ALU stands for Arithmetic Logic Unit; in traditional computers it performs all basic arithmetic and Boolean logic calculations. Do we have something similar in our serverless cloud computer? In a sense, we do. As with any analogy, it’s good to know where to stop, but we could treat cloud functions (e.g., AWS Lambda) as a sort of ALU device with certain constraints. Currently, any AWS serverless cloud computer in the Ireland Region has 3000 such logical units with 3GB local cache (some people would still call it RAM), 500MB non-volatile memory (aka local disk, split into two halves), 15 minutes of hard context switch and approximately 1.5 hours of warmed cache lifespan. These logical devices run “micro-code” written in a variety of mainstream programming languages, such as Python, JavaScript, etc. Whether one is going to use this capacity fully or partially is another question. The Serverless Cloud Computer pricing model is such that you pay only for what you use.
Unlike traditional ALUs, there are multiple ways to activate the micro-code of such serverless cloud logical devices and to control whether they will perform a pure calculation or also have some side effects. An ALU is just an analogy in the serverless world, but within limits it appears to be a useful one.
What is a serverless cloud computer CPU?
If we have 3K serverless cloud ALUs, do we also have CPUs to control them, and do we really need such devices? The answer is that there are such devices, which are useful in a wide range of scenarios, but they are purely optional. Cloud orchestration services, such as AWS Step Functions, could play such a role, with internal Parallel Flows functioning similarly to individual cores. An AWS Ireland Region “Cloud CPU” could be occupied for up to 1 year with a maximum of 25K events. How many such serverless cloud CPUs could we have? We can get 1,300 immediately, and then add another 300 every second. As with cloud functions, we will pay only for what we are using.
What is a serverless cloud computer memory?
OK, we have ALUs (with some cache and NVM) and we have optional CPUs to orchestrate them. Next, do we have an analogy for RAM and disk storage? Yes, we do, but we might opt to stop speaking about an artificial separation between volatile and non-volatile memory. Modern CPUs make this separation meaningless anyhow. It’s better just to talk about memory. Serverless cloud computers have different types of memory, each with its own volume/latency ratio and access patterns. For example, AWS S3 provides support for Key/Value or Heap memory services with virtually unlimited volume and relatively high latency, while DynamoDB provides semantically similar Key/Value and Heap Memory services with medium volume and latency. On the other hand, AWS Athena provides high volume, high latency tabular (SQL) memory services, while AWS Serverless Aurora provides the same tabular (SQL) memory services with medium volume and latency.
Interestingly, some serverless cloud “memory” services, such as DynamoDB, are directly accessible from Step Functions (aka serverless cloud CPUs), while others are only accessible through cloud functions (serverless cloud ALUs). For now, Step Functions have a 32K limit of internal cache memory and, as such, are suitable only for direct programming of control flows rather than voluminous data flows. Whether such a limit is a showstopper or a pragmatic trade-off choice is a subject for a separate discussion.
A complete analysis of available services, which would include Serverless Cassandra, Cloud Directory and Timestream, is beyond the scope of this memo.
What are serverless cloud computer peripherals?
Thus, we have serverless cloud computer ALUs, CPUs and Memory (all metaphorical, of course). Do we have something similar to peripherals in traditional computers? Yes, we do have something similar to ports, which connect our serverless computer to the external world. As with traditional ports, each one supports different protocols and has different price/performance characteristics. For example, AWS API Gateway supports REST and WebSockets protocols, while AWS AppSync supports GraphQL, and AWS ALB supports plain HTTP(S).
A full analysis of available services, which would include CloudFront CDN, IoT Gateway, Kinesis and AMQP, is beyond the scope of this memo.
What is a serverless cloud computer Bus?
So, we have metaphoric ALUs, CPUs, Memory and Ports for our serverless computer. Do we have something similar to a bus, and do we need one? The answer is yes: we have several types, and sometimes they are necessary. For example, AWS SQS provides a high-speed, medium-volume Push service, while AWS SNS provides a high-speed, medium-volume Pub/Sub notification service, and AWS Kinesis provides a high-speed, high-volume Push service.
What else does the serverless cloud computer have?
Unlike traditional computers, quite a few more batteries are included: a data flow unit (aka AWS Glue), a machine learning unit (SageMaker endpoint), access control (AWS IAM), telemetry (AWS CloudWatch), packaging (AWS CloudFormation), user management (AWS Cognito), encryption (AWS KMS), component repository (AWS Serverless Application Repository), and a slew of fully-managed AI services such as AWS Comprehend, Rekognition, Textract, and others.
Complete specification of the Serverless Cloud Computer “hardware” is illustrated below:
Following the useful tradition established by Dutch computing pioneer E.W. Dijkstra, we will treat the Serverless Cloud Computer metaphorical “hardware” specification outlined above as the bottom layer of a “necklace string of pearls”: namely, higher-level, domain-specific virtual machines stacked on top of lower-level infrastructure virtual machines, as illustrated below:
What is a serverless cloud OS file system?
As we argued above, the whole concept of the file system is probably outdated, and for application code development we’d better start talking about cloud versions of traditional data structures such as lists, vectors, sets, hash tables, etc. All these data structures could be efficiently mapped onto the different serverless cloud memory services mentioned above.
However, unless we are going to rewrite all available software, which would be impractical, we will sometimes still need to talk about files, for example, Python modules, Linux Shared Objects and Executables. Using local disk storage of cloud functions has to be treated as a special case, mainly for cold start optimization reasons. The ideal solution would use a Linux mount that maps, depending on the price/performance ratio, directly onto S3, DynamoDB, Serverless Cassandra or even Serverless Aurora.
Unfortunately, that’s not possible today since the FUSE mount requires the Lambda container to run in privileged mode, which is not allowed for security reasons. Another possibility is to develop a cloud version of the module importer for each run-time environment: Python, JavaScript, JVM. While this requires some extra work and is less friendly towards legacy code, the cloud importer allows some optimizations not available to the traditional disk-based one.
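The cloud importer idea can be illustrated with a minimal sketch built on Python's standard `importlib` hooks. It is an assumption-laden toy, not the importer described in this article: `FAKE_BUCKET`, `CloudFinder` and the module name `cloud_greetings` are hypothetical, and a plain dictionary stands in for a real object store such as S3.

```python
import importlib.abc
import importlib.util
import sys

# Hypothetical stand-in for a cloud object store (e.g. an S3 bucket):
# keys are object names, values are Python source bytes.
FAKE_BUCKET = {
    "cloud_greetings.py": b"def hello(name):\n    return f'Hello, {name}!'\n",
}


class CloudFinder(importlib.abc.MetaPathFinder, importlib.abc.Loader):
    """Finds and loads top-level modules whose source lives in `store`."""

    def __init__(self, store):
        self._store = store

    def find_spec(self, fullname, path=None, target=None):
        key = f"{fullname}.py"
        if key not in self._store:
            return None  # not ours: let the remaining finders try
        return importlib.util.spec_from_loader(
            fullname, self, origin=f"cloud://{key}"
        )

    def create_module(self, spec):
        return None  # fall back to the default module object

    def exec_module(self, module):
        # Fetch the source from the store and run it in the module namespace.
        source = self._store[f"{module.__name__}.py"]
        code = compile(source, module.__spec__.origin, "exec")
        exec(code, module.__dict__)


# Appended last, so modules available on disk still win.
sys.meta_path.append(CloudFinder(FAKE_BUCKET))

import cloud_greetings  # resolved by CloudFinder, not from disk

print(cloud_greetings.hello("cloud"))
```

Because the finder owns the whole lookup, it is also the natural place for the optimizations mentioned above, such as caching or prefetching, which a disk-based importer cannot apply.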
See the first in our trilogy of articles, which describes the construction of such an importer at BlackSwan Technologies.
Similar logic applies to Linux Shared Objects and Executables. Ideally, ELF files should be directly loaded from the cloud memory source. That, in turn, would require modifications in the function, something hard to expect in the near future. One possible work-around would be to download Shared Library and Executable files from the cloud source to the /tmp folder first. That would bring us back to the 250MB disk space limitation for all Shared Libraries including Python extensions.
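The /tmp work-around can be sketched as follows. This is only an illustration under stated assumptions: `stage_binaries` is a hypothetical helper, the `blobs` dictionary stands in for objects already fetched from a cloud store, and a temporary directory plays the role of /tmp.

```python
import os
import tempfile

TMP_BUDGET_BYTES = 250 * 1024 * 1024  # the 250MB limit discussed above


def stage_binaries(blobs, target_dir, budget=TMP_BUDGET_BYTES):
    """Write each blob into target_dir, refusing to exceed the byte budget.

    `blobs` maps file names to bytes; returns the local paths written.
    Nothing is written if the total size is over budget.
    """
    total = sum(len(data) for data in blobs.values())
    if total > budget:
        raise ValueError(f"{total} bytes exceed the {budget}-byte budget")
    paths = []
    for name, data in blobs.items():
        path = os.path.join(target_dir, name)
        with open(path, "wb") as fh:
            fh.write(data)
        paths.append(path)
    return paths


with tempfile.TemporaryDirectory() as tmp:
    # The payload is a placeholder, not a real ELF file.
    staged = stage_binaries({"libdemo.so": b"\x7fELF-placeholder"}, tmp)
    staged_sizes = [os.path.getsize(p) for p in staged]
    # A real shared object could now be loaded with ctypes.CDLL(staged[0]).
```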
Another option is to imitate a RAM disk, which would double memory consumption, although it would be drawn from the larger 3GB budget. As with the cloud importer, some non-trivial optimizations to speed up the download of binary files are possible here.
What is a serverless cloud OS process?
Now, we step into uncharted territory. A clear analog to the Linux process is yet to be defined. A running Step Functions State Machine (even though we have to stop calling them State Machines, which they are not) is a good candidate, but what about individual Lambda Functions triggered by some external event? Shall we treat them as interrupt handlers in traditional Operating Systems? That might not be such a bad idea, but only time will tell.
What is a serverless cloud OS installation package?
The answer seems obvious: it’s a CloudFormation Stack on AWS or a similar solution on another cloud platform. In the serverless world, CloudFormation Stacks do not run: serverless applications have no daemon processes, and nothing is running unless explicitly triggered by some external event. In this discussion, we exclude Fargate containers, which do run.
Therefore, launching a CloudFormation Stack just means installing a copy of a Serverless Application. Although it will reserve some resources, it will not consume them until some real workload starts running. Well, almost: storage capacity will still be consumed even in passive mode, but this is no different from disk space occupied by an application that has never been started.
What is a serverless cloud OS interprocess communication?
This is another blurry area that requires further elaboration. Traditional Operating Systems, like Linux, have two standard and one semi-standard interprocess communication mechanisms. Shared memory and pipes, named or ephemeral, are the two standard mechanisms, dating back 50 years to Unix. TCP/IP is a kind of semi-standard IPC and is mostly devoted to larger scale middleware arrangements.
What is a serverless cloud OS Shared Memory?
All serverless cloud memory services mentioned above are basically shareable. We still need to properly utilize the mutual exclusion and transaction scoping mechanisms available for each one of them. supply an interesting source of inspiration.
What are serverless cloud OS pipes?
Unfortunately, we do not have serverless cloud OS pipes. More accurately, we do not have good ones. The serverless cloud bus services listed above do a decent job, but for a very limited set of scenarios. To use a biological metaphor, they are good for central veins and arteries, but not for capillaries. For now, it’s impractical to create a separate SQS queue for each flow: it takes too long to create, and it does not scale well for a large number of flows. If we decide to fan out some processing to a queue, it’s not trivial to figure out when all messages belonging to a particular flow have been processed. Using serverless cloud shared memory facilities, it should be possible, in principle, to develop good, lightweight, economical pipes. This is a direction for additional research.
What is serverless cloud networking?
Some interesting R&D activities are taking place currently in this area (see references).
Optimal Concurrency Structure
Within a typical Serverless Cloud Computer, such as AWS, one could identify the following distinct levels of concurrency:
This discussion of the optimal concurrency structure reveals another important aspect: currently available tools for specifying AWS Step Functions, Lambda Functions and CloudFormation Stacks are at a despairingly low level of abstraction, like a kind of machine code. Calling these long and ugly JSONs and YAMLs human readable would be funny if it were not so sad.
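To make the "machine code" comparison concrete, here is a minimal sketch that generates a Step Functions definition (Amazon States Language JSON) from a three-line description. `compile_sequence` is a hypothetical helper, the ARNs are placeholders with a made-up account id, and only a linear chain of Task states is handled.

```python
import json


def compile_sequence(steps):
    """Turn an ordered list of (state_name, lambda_arn) pairs into a
    state machine definition in Amazon States Language."""
    if not steps:
        raise ValueError("at least one step is required")
    states = {}
    for index, (name, arn) in enumerate(steps):
        state = {"Type": "Task", "Resource": arn}
        if index + 1 < len(steps):
            state["Next"] = steps[index + 1][0]  # chain to the following step
        else:
            state["End"] = True  # the last step terminates the flow
        states[name] = state
    return {"StartAt": steps[0][0], "States": states}


# Placeholder ARNs; the account id and function names are made up.
definition = compile_sequence([
    ("Extract", "arn:aws:lambda:eu-west-1:123456789012:function:extract"),
    ("Transform", "arn:aws:lambda:eu-west-1:123456789012:function:transform"),
    ("Load", "arn:aws:lambda:eu-west-1:123456789012:function:load"),
])
print(json.dumps(definition, indent=2))
```

Even in this toy case the generated JSON is several times longer than the description it was produced from.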
There is no reason why their internal structure could not be treated as a target platform for some high-level compiler. It could be done, and it should be done.
Optimal Packaging
Sticking with the 250MB code size limit of AWS Lambda does not make much sense. Today, due to this limitation, many ML inference processes have to opt for less convenient container packaging, even though the available 3GB of RAM would be more than enough for performing the task. There is no practical reason why Python modules, for example, could not be imported directly from S3.
Python’s importlib allows this in principle. The same logic applies to Linux Shared Objects. While a proper solution would require a deep intervention into AWS Firecracker (not beyond reach in the future, but less practical in the near term), a close approximation based on an additional 250MB of /tmp space is possible today.
But now, we face another problem. Cloud import of, say, Python (the same logic applies to JavaScript, Java and .NET), as well as Linux Shared Objects, would increase so-called cold start latency. For many applications, it won’t constitute an issue and overall productivity gains (given no need to package zip files anymore) would easily outweigh another couple of seconds of delay (free of charge, by the way). For some other applications, that might be a problem. That leads us to yet another optimization challenge: to find an optimal combination of imported modules and shared objects to be placed into an AWS Lambda package (directly or via AWS Lambda Layers) based on a suitable ML model and collected operational statistics.
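As a rough illustration of this packaging trade-off, the sketch below uses a simple greedy heuristic rather than the ML model mentioned above: it bundles the modules that save the most cold-start time per megabyte until the size budget is exhausted. The function `choose_bundled`, the module names and all statistics are hypothetical.

```python
def choose_bundled(modules, budget_mb):
    """Pick modules to bundle, favoring high cold-start savings per MB.

    `modules` maps a module name to a tuple of
    (size_mb, imports_per_day, cloud_import_seconds); sizes must be positive.
    """
    def benefit_per_mb(item):
        _, (size_mb, imports_per_day, seconds) = item
        return imports_per_day * seconds / size_mb

    bundled, used = [], 0.0
    ranked = sorted(modules.items(), key=benefit_per_mb, reverse=True)
    for name, (size_mb, _, _) in ranked:
        if used + size_mb <= budget_mb:  # skip anything that no longer fits
            bundled.append(name)
            used += size_mb
    return bundled


# Invented operational statistics, for illustration only.
stats = {
    "heavy_ml": (80.0, 1000, 1.5),
    "dataframes": (120.0, 200, 2.0),
    "science": (150.0, 50, 2.5),
    "http_client": (5.0, 900, 0.2),
}
print(choose_bundled(stats, budget_mb=250))
```

Modules left out of the bundle would be fetched through the cloud import path at cold start; rerunning the selection as the statistics change is what makes the optimization dynamic.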
As with optimal concurrency structure, this is a task for a high-level compiler. We shall treat every case when software engineers are engaged in manual activities that obviously could be improved through automation as a waste of valuable time.
Also, notice an emerging pattern here. While traditional operating systems and compilers provide some forms of static optimization, the new serverless cloud world requires an optimization process to be dynamic, constantly repeated, and based on collected operational statistics, as illustrated below:
Portable “Hardware” Abstraction Layer
As usually happens with operating systems, both optimization problems outlined above require some form of abstraction insulating core algorithms from technical details of each specific cloud platform. Indeed, 90% of the cloud Python import system depends on the Python module system rather than on how Cloud Storage of AWS vs GCP works. The same logic applies to Linux Shared Objects and concurrency structure. Of course, the same “hardware” abstraction would be useful as a productivity tool for writing portable application code, but here we still have some way to go until reaching the Framework Layer.
The project, code-named CAIOS (which stands for Cloud AI Operating System, to highlight a deep connection with managed AI capabilities), is currently conducted by BST LABS as an internal open source project within the parent company, BlackSwan Technologies:
Shutting down the economy for a pandemic has never happened.
…
If your business model today looks the same as it did at the beginning of the month, you’re in denial.
Indeed, for the software industry, the party may be over, if not right now, then in the foreseeable future. We will no longer be able to command a premium for developing half-realized services poorly matched to real users’ needs, then delivering them late and over-budget. Because of pandemic concerns, the need for automation will go up. In short order, the tolerance for inflated operational and development costs, poor quality and security, and late delivery will diminish, then disappear.
Ironically, our circa March 2020 business models date almost 50 years back. In a seminal work, E.W. Dijkstra made the following comment:
Nowadays one often encounters an opinion that … programming had been an overpaid profession … perhaps the programmers … have not done so good a job as they should have done. Society is getting dissatisfied with programmers and their products.
The year was 1972.
Today, we software developers still earn an order of magnitude more than school teachers. But are we doing a more important job, or at least are we doing our job well enough to earn our keep?
Alas, the answer is probably not, and here is why: most of the engineers employed in the software development industry are still busy with what Simon Wardley called yak shaving: moving software components from one place to another, configuring and re-configuring infrastructure, and in their remaining time writing pieces of code that marginally move the ball forward for a point solution.
As an industry, and as professionals, we are caught largely unprepared for what is going to happen: society needs real automation solutions now. Nobody is interested anymore in our justifications for the status quo. If we continue shaving yaks, justifying that by technological limitations, how will we continue earning more than many other professionals? Are we able to formulate business problem solutions in a concise and easy-to-prove-correctness form and to leave the rest to tools to perform an automatic conversion into the correct sequence of zeros and ones?
The time for radical revision of our software development habits is NOW.
While initially the CAIOS project started out of an intellectual curiosity about what would happen if we start treating the cloud as a super computer, it is now readjusting to the new reality. It will focus on delivering practically applicable solutions, enabling a dramatic reduction of operational and development costs and ironclad code security. These solutions were needed yesterday, and we can no longer afford to wait until tomorrow.
More detailed information describing available solutions and future plans will follow. Stay tuned.