TSTN-028: The past, present and future of the Vera Rubin Observatory Control System Middleware

  • Tiago Ribeiro,
  • Andy Clements,
  • Dave Mills,
  • Michael Reuter and
  • Russell Owen

Latest Revision: 2022-08-05

Abstract

After researching alternatives to ADLink-OpenSpliceDDS (and DDS) we conclude that Kafka provides the best alternative for the Vera Rubin Observatory Control System (Rubin-OCS) Middleware. The middleware is the backbone of the Rubin-OCS and is fundamental for stable operation of the observatory. The highly distributed nature of the Rubin-OCS places tight constraints on the middleware in terms of latency, availability and reliability. Here we gather information to answer common questions regarding technology choices, describe the in-house work done to obtain a stable system, highlight our concerns with the current Data Distribution Service (DDS) technology, and discuss its potential impact on the near future (commissioning) and beyond (operations).

Introduction

“Middleware” is the term used to describe software applications used for communication in software systems. These technologies became popular with the advent of distributed systems, which arose as a solution to the problem of parallel computation. As software and hardware systems became more complex, it became impractical to develop and execute them in a single process and, eventually, in a single node. With that, software evolved from monolithic applications, where a single program executes in a single process or node, to distributed applications, where the system is divided into a number of smaller applications, each running in its own process or node. For these applications to work together coherently they must be able to communicate with each other, thus giving origin to middleware technologies.

How a distributed system is broken down into smaller pieces is heavily dependent upon the problem. Some systems are only broken down into a small number of components, each still in charge of a large context; others are broken down into many small applications that are in charge of only small, simple tasks. The latter approach has gained substantial popularity recently and is commonly referred to as microservices. These systems are behind many of the popular large services in use today, such as Google and Amazon.

The architecture of distributed systems can take many shapes and forms. For instance, some systems are designed to emulate monolithic applications. The application is composed of a number of smaller applications, but there is a hierarchical organization, with components at the top in charge of communicating with and operating components at the bottom. The advantage of these systems is that they are easier to understand and to maintain. Since each component is isolated from the rest of the system, and only communicates with components above and below it in the hierarchical chain, adding new components is relatively easy and has minimal impact on the system. The disadvantage is that the system is more vulnerable to outages: if a component at the top of the hierarchical chain becomes unavailable, the components below it also become unavailable.

More modern distributed systems have been favoring less hierarchical approaches, following the principles of `reactive systems`_. In these systems each component is designed as an independent entity that reacts to input data, be it from other components or external data services. Since these systems are designed with separation of concerns in mind (i.e. each component must be able to act independently), reactive systems are usually extremely resilient. If one component becomes unavailable, the others are expected to continue to operate, taking precautions to deal with the missing agent. At the same time, it can become quite burdensome to update and grow these systems, as their complexity increases exponentially with the number of different components.

In any case, the middleware plays a crucial part in any kind of distributed system, acting as the glue that binds the system together.

The Vera Rubin Observatory Control System (Rubin-OCS) is designed following the principles of a distributed, `reactive <reactive systems>`_ architecture. The system is composed of a number of independent components that work together to execute cohesive operations.

The middleware is encapsulated with a layer of abstraction known as the Service Abstraction Layer (SAL), which uses the ADLink-OpenSpliceDDS implementation of the Data Distribution Service (DDS) message passing system.

With this high-level overview of the system in mind, we will now focus on SAL and the different aspects of the current middleware technology (ADLink-OpenSpliceDDS).

The Past

Probably the most important thing for us to consider when speaking about the past is trying to understand why DDS was selected in the first place, and then why the ADLink-OpenSpliceDDS implementation was adopted.

To put it in perspective, the first commit to the current SAL code repository dates back to August 2014, some 8 years before the time of this writing and almost a whole year before the construction first stone was laid. For comparison, the first stable release of the Apache Kafka message system dates back to January 2011, whereas version 1.0 of the DDS standard dates back to December 2004 (it is hard to pinpoint the initial stable release for any of the DDS implementations since the older libraries are mostly closed-source proprietary code).

At the time the middleware technology was selected, DDS was a mature standard. The technology defines a powerful real-time message system protocol with high inter-operability between platforms and programming languages. In fact, DDS has most of the important features we recognize as crucial for the Rubin-OCS, including:

  • Real-time message transfer capabilities.

    DDS is a broker-less messaging system with small overhead. As such, it is capable of high-throughput and low-latency data transfer. For example, in our internal benchmarks, DDS reached transfer rates above 16 kHz with millisecond latency. This is certainly way beyond the requirements of our system, which are mostly constrained by the throughput of M1M3 (REQ?) and the throughput/latency required for tracking.

    It is worth noting that the Rubin-OCS requirements are not clear with respect to latency constraints. Requirements LTS-TCS-PTG-0008 and LTS-TCS-PTG-0001 in LTS-583 only specify lead times of 50-70ms with a standard deviation of 3ms for tracking demands. At best, these requirements can only be used to determine upper limits on latency, e.g. demands must be delivered with sufficient time to account for the lead times.

    It is safe to say that delivering the demands with 3/4 of the lead time still remaining should give the mount sufficient time to process them, which translates to latencies of around 10-20ms.

  • Durability service.

    In messaging systems, “durability” refers to the capability of a system to store published data and serve it to components that join the system afterwards. This service is crucial for a distributed system like the Rubin-OCS as it guarantees that components coming online at any time are able to determine the state of the system by accessing previously published information.

    A surprising number of systems do not provide any type of durability service, especially those that are deemed “real-time”.

    This mostly boils down to the fact that most real-time capable systems are broker-less (like DDS), whereas, in order to provide a durability service, a system must have some kind of broker that can store published messages and distribute them when needed.

    DDS provides a rather elegant solution to this problem. Basically, each independent node can be configured to act as a broker for the durability service. One of those nodes is elected as the “master” node, which is in charge of actually distributing the data. If the master node falls over, another node is elected to take its place.

    As we have demonstrated in our efforts to stabilize the system (TSTN-023), this can have a huge impact on system performance and adds considerable complexity to configuring the system.

  • The Quality of Service (QoS) dictates how messages are delivered under different network scenarios.

    DDS has an extremely rich QoS system with many configuration parameters. While this might sound like a desirable feature at first glance, it has some serious implications. To begin with, a large number of configuration parameters also means higher complexity, which makes it harder to predict the system behavior under unexpected conditions. We have encountered many problems that were traced to unexpected behavior caused by QoS settings.
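To make the latency arithmetic above concrete, the following sketch (illustrative only; the function name is ours, not project code) computes the latency budget implied by the LTS-583 lead times and the 3/4-of-lead-time delivery rule:

```python
def latency_budget_ms(lead_time_ms: float, fraction_remaining: float = 0.75) -> float:
    """Maximum tolerable middleware latency, in ms, if demands must
    arrive with `fraction_remaining` of the lead time still left."""
    return lead_time_ms * (1.0 - fraction_remaining)


# LTS-583 specifies lead times between 50 and 70 ms for tracking demands.
for lead in (50.0, 70.0):
    print(f"lead time {lead:.0f} ms -> latency budget {latency_budget_ms(lead):.1f} ms")
```

For the 50-70ms lead times this yields budgets of 12.5-17.5ms, consistent with the 10-20ms latency figure quoted above.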
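The durability mechanism described above can also be illustrated with a minimal sketch (plain Python, not DDS code): a broker that retains the last message per topic and replays it to late joiners, which is essentially the guarantee the durability service provides:

```python
class LastValueBroker:
    """Minimal broker retaining the latest message per topic, so that
    subscribers joining late can still recover the current system state."""

    def __init__(self):
        self._latest = {}       # topic -> last published message
        self._subscribers = {}  # topic -> list of callbacks

    def publish(self, topic, message):
        self._latest[topic] = message
        for callback in self._subscribers.get(topic, []):
            callback(message)

    def subscribe(self, topic, callback):
        # Replay the retained message first: this is the durability service.
        if topic in self._latest:
            callback(self._latest[topic])
        self._subscribers.setdefault(topic, []).append(callback)


broker = LastValueBroker()
broker.publish("mtmount.summaryState", "ENABLED")

received = []
broker.subscribe("mtmount.summaryState", received.append)  # late joiner
print(received)  # the late joiner sees the last published state
```

The topic name is hypothetical. In DDS this role is played by the elected “master” node described above; in a brokered system such as Kafka the retention is handled by the broker itself.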

In addition to the features in DDS, it is worth mentioning that it was also already in use by other projects under the NOAO/CTIO umbrella, including the SOAR and the 4m Blanco telescopes on Cerro Pachon and Tololo respectively (see, for instance, the 4M TCSAPP Interfaces Quick Reference).

The combined in-house expertise and powerful set of features, made DDS a perfect middleware technology candidate for the Vera Rubin Observatory at the time. It is, therefore, no surprise that it was selected.

It is worth mentioning that the software engineers at the time did anticipate the potential for future updates. This led to the development of abstraction levels to isolate the middleware technology from the higher level system components, which is the idea behind SAL.

The initial version of SAL used the RTI-Connext implementation of DDS. Unfortunately, the parent company (RTI) does not provide a public license for their software. This alone adds substantial overhead to the development and deployment cycle, especially given the distributed (and mostly public) nature of the Rubin Observatory efforts. In addition to the cost of purchasing licenses, we are also required to distribute the licensed code to team members and external collaborators/vendors. Furthermore, we must also make sure collaborators are not publicizing the software/license, which could have potential legal repercussions for the project.

Alternatively, the ADLink-OpenSpliceDDS implementation shows comparable benchmarks to those of RTI-Connext, with the benefit of providing a public version of the library. The public version is (usually) one major release behind the professional edition and excludes some important features we ended up requiring for the production environment. Even though the public version is not suitable for a production environment, it is certainly suitable for day-to-day development and testing, especially since inter-operability is guaranteed by the DDS standards.

Given the advantages of ADLink-OpenSpliceDDS over the RTI-Connext implementation, we decided to switch early on in the project. The transition required a low level of effort and had no impact on the higher-level software, which is expected of a well designed API.

The Present

At the present state of the project, we have been routinely deploying and testing a stable system comprising the majority of the components that are part of the Rubin-OCS at the summit (i.e. the production environment), the NCSA Test Stand (decommissioned in February 2022) and the Tucson Test Stand.

Achieving this stage of the project was not without its challenges related to DDS and, more specifically, with the ADLink-OpenSpliceDDS implementation. In fact, it took our team a good part of a year to be able to obtain a stable system. Most of our findings are summarized in TSTN-023.

However, even after all these efforts we still encounter DDS-related issues. As we mentioned above, some of them are a result of the choice of configuration settings, which are quite extensive in DDS. Others are related to network outages (momentary or not), and/or fluctuations in the network traffic and how they are handled by the ADLink-OpenSpliceDDS library.

A more serious and worrisome category of issues are related to errors encountered in the ADLink-OpenSpliceDDS software stack, in particular:

  • It is common to encounter segmentation faults, one of the most serious types of software errors and among the hardest to investigate.

  • It is very expensive and time consuming to evaluate new releases and track down problems far enough to provide reasonable bug reports. It then usually takes ADLink a long time to reproduce and fix the problem and, if the fix appears in the next release, there are often new bugs.

  • Most or all of the stable versions we have used are a result of applying patches provided by ADLink to older releases, rather than using a new unpatched release.

  • In at least one case a patch we still use was withdrawn by the company, with no reasonable alternative.

  • We have encountered crashes on the daemon used to handle the DDS traffic, which requires restarting all components running on that particular node.

  • There are issues with the daemon that prevent us from using a more robust configuration that would be more resilient to network outages.

In general, we believe the project is not receiving an appropriate return on investment with ADLink-OpenSpliceDDS.

Furthermore, ADLink has recently announced that the public version of OpenSpliceDDS is no longer going to be supported. Their previous policy was to keep the community/public library one major version behind the licensed edition; since the announcement, however, it has fallen two major versions behind. If ADLink continues to maintain the commercial version, the public version will continue to lag farther behind, until it likely becomes impossible to use a mix of the two (the free version for development, the commercial version for deployment). However, we suspect ADLink will not continue to update/support the commercial version for long. In their announcement, they made it clear that users of their commercial library should migrate to the new and upcoming Cyclone DDS library, whereas users of the community/public edition are left with no recourse.

Altogether this situation is extremely worrisome, especially as it suggests ADLink-OpenSpliceDDS might be heading towards its end of life, risking our ability to maintain the software over the life of the survey. It is worth noting that this would violate a couple of our system requirements, more specifically, requirements OCS-REQ-0006 and OCS-REQ-0022 [4], which concern support for the expected lifetime of the project (i.e. the 10-year survey operations).

The Future

Anticipating the need to replace OpenSpliceDDS with some other middleware technology in the future, our team has been studying possible alternatives. We focused most of our efforts on protocols that support the so-called publish-subscribe model, which is the one used by DDS, though we explored other alternatives as well. The details of our study are outside the scope of this document; however, we have categorized our findings as follows:

  • Alternative DDS implementations.

    ADLink-OpenSpliceDDS is one of many implementations of the DDS standard. Notably, RTI-Connext, which was initially used in SAL, is still a viable option worth exploring. We scheduled a meeting with an engineer and a commercial representative from RTI to discuss the several questions we had about their system, both technical and licensing. Unfortunately, not much has changed since we replaced RTI-Connext with ADLink-OpenSpliceDDS, and the issues we had in the past are still relevant. It is also worth noting that their Python support is still a concern (see below).

  • Lack of durability service.

    As we mentioned previously, a good fraction of message passing systems lack support for a durability service, especially those that are deemed “real-time” systems, which, in general, opt for a broker-less architecture. Some examples of message systems that fall into this category are ZeroMQ and nanomessage. Both of these solutions are advertised as broker-less with “real-time” capabilities. ZeroMQ is known for its simplicity and ease of use, whereas nanomessage was adopted as the message system for GMT.

  • Python libraries and support for asyncio.

    With Python being a popular language, one would expect to find broad support for it in the majority of message passing systems. The reality, though, is that most systems provide Python support only through non-native C bindings. This is, for instance, the case with the ADLink-OpenSpliceDDS library we currently use. It is also extremely rare to find message systems with native support for Python asyncio, which is heavily used in salobj.

  • Real-time capabilities.

    Although what constitutes a real-time message passing system is not well defined, it is generally accepted that such systems must have latencies in the range of 6-20 milliseconds or better [3]. The vast majority of message passing systems claim to be capable of real-time data transport. However, because the definition of real-time is somewhat loose, it is not straightforward to verify or challenge those claims. Ultimately, these need to be put into context for a particular system and verified. For our particular case, we should be able to meet the tracking requirements with latency around 10-20ms.

    Any system we choose must first be capable of achieving these levels of latency under the conditions imposed by our system, regardless of their claims.

  • Alternative architectures.

    There are some existing frameworks, both in industry and adopted by different observatories, that could in principle provide a viable alternative to DDS as a middleware, though they implement different architectures. Probably the best example of frameworks in this category is TANGO which, in turn, is designed on top of the CORBA middleware.

    Contrary to DDS, which defines a data-driven (publish-subscribe) architecture, CORBA implements an object-oriented model which is more suitable for a hierarchical system architecture. Although it would, in principle, be possible to use CORBA in a data-driven scenario, that is not what it was designed for, which makes it hard to anticipate pitfalls we could encounter in the adoption process. Therefore, even though we explored some of these alternative-architecture systems, and some of them show some promise, adopting one seems like a larger risk than finding a suitable publish-subscribe alternative to DDS.

    Note

    It is worth noting that both CORBA and DDS standards are managed by the same organization, the Object Management Group (OMG) and both rely on the Interface Description Language (IDL).
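To illustrate the asyncio integration pattern mentioned above (the kind of native support salobj benefits from), here is a minimal, middleware-free sketch: a background reader task feeds an asyncio queue that a consumer awaits, so message handling composes naturally with timeouts, cancellation and other coroutines. In a real client the reader would be the middleware read loop; everything here is purely illustrative.

```python
import asyncio


async def reader(queue: asyncio.Queue) -> None:
    """Stand-in for a middleware read loop feeding an asyncio queue."""
    for i in range(3):
        await queue.put({"seq": i, "payload": f"telemetry-{i}"})
        await asyncio.sleep(0)  # yield control, as a real reader would
    await queue.put(None)  # sentinel: no more messages


async def consumer(queue: asyncio.Queue) -> list:
    """Native-asyncio message handling: simply await the next message."""
    received = []
    while (message := await queue.get()) is not None:
        received.append(message["payload"])
    return received


async def main() -> list:
    queue: asyncio.Queue = asyncio.Queue()
    read_task = asyncio.create_task(reader(queue))
    result = await consumer(queue)
    await read_task
    return result


print(asyncio.run(main()))  # ['telemetry-0', 'telemetry-1', 'telemetry-2']
```

Systems whose Python support is limited to blocking C bindings force an extra thread plus hand-off layer to achieve this same shape, which is part of the concern raised above.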

After extensively researching alternatives to ADLink-OpenSpliceDDS (and DDS) we believe that our best alternative is Kafka.

Kafka is an open source event streaming platform that is broadly used in industry. In fact, it is already an integral part of the Rubin-OCS, as it is used in the EFD to transport the data from DDS to influxDB (SQR-034 [2]). It is also used in the LSST Alert Distribution service [1]. Overall we already have extensive in-house expertise.

The fact that we are already using Kafka in the system reliably to ingest data into the EFD gives us confidence that it is, at the very least, able to handle the overall data throughput. Our main concern is then to verify that Kafka can handle the latency requirements of our system. In principle, Kafka is advertised as a “real-time” system and numerous benchmarks exist online showing it can reach latencies in the millisecond regime. Nevertheless, it is unclear whether those benchmarks apply under our system's constraints, given the typical message size, network architecture and other relevant factors.

We then proceeded to perform benchmarks with the intention of evaluating Kafka’s performance considering our system architecture. The results, which are detailed in TSTN-033, are encouraging. In summary, we obtained similar latency levels for both Kafka and DDS. In terms of throughput, DDS is considerably better than Kafka for smaller messages, though we obtained similar values for larger messages. It is also worth mentioning again that the overall throughput we achieve with Kafka, for small and large messages, is above our system requirements.
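The basic round-trip timing approach behind such benchmarks can be sketched as a simple harness. The example below is illustrative only: it uses an in-process list as a stand-in transport, so the timings it reports say nothing about Kafka or DDS performance, only how the measurement is structured; the function and variable names are ours.

```python
import statistics
import time


def measure_round_trips(send, receive, n_messages=1000):
    """Return per-message round-trip latencies in milliseconds.

    `send` and `receive` are stand-ins for the middleware client calls;
    a real benchmark would wrap Kafka producer/consumer operations.
    """
    latencies_ms = []
    for i in range(n_messages):
        t0 = time.perf_counter()
        send(i)
        assert receive() == i  # message round-trips through the transport
        latencies_ms.append((time.perf_counter() - t0) * 1000.0)
    return latencies_ms


# Trivial in-process "transport" standing in for the real middleware.
_mailbox = []
latencies = measure_round_trips(_mailbox.append, _mailbox.pop)
print(f"median={statistics.median(latencies):.4f} ms, "
      f"p99={sorted(latencies)[int(0.99 * len(latencies))]:.4f} ms")
```

Reporting percentiles rather than just the mean matters here, since the tracking requirement constrains worst-case delivery, not average delivery.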

Overall, our detailed study shows that Kafka would be a viable option for replacing DDS as the middleware technology in our system. For the full technical report see TSTN-033.

Transition Plan

The following transition plan establishes some important milestones with descriptions of the expected activities. We are not attaching dates to any of the following items, to allow the schedule to float as needed given the priorities of the project.

  1. Support for Python/salobj CSCs.

    The first stage of the transition plan is to add support for the Python/salobj CSCs. This stage is mostly completed, as salobj was converted to Kafka and used to provide benchmarks for our initial evaluation of the platform.

    Since salobj was ported to Kafka as a “drop-in” implementation, we expect that a small number of developers in the team will start using the Kafka version of salobj for development at this stage.

  2. Support for SAL.

    The next step in the process is porting SAL to Kafka. This will add support for the components written in C++ and Java.

    Having support for C++ is particularly critical as one of the core components of the system (the pointing component) is written in C++.

    We do not plan to support LabView for CSCs any longer. The components that are still written in LabView (namely, ATMCS and ATPneumatics) will be converted to TCP/IP and their CSCs ported to salobj.

  3. Initial support for CI in jenkins.

    This step will probably happen in parallel with the SAL work and will be dedicated to adding support for running CI using the Kafka version of our system on our jenkins server.

  4. Initial tests on TTS.

    Once SAL C++ is available we can start deploying the TSSW components on TTS and running some initial tracking tests. We can start doing these tests even without the Java libraries (used by the camera) so we hope to focus on the C++ libraries first.

    At this point DDS is still our main development and deployment target, e.g. both ts_salobj and ts_sal will support DDS and Kafka, with DDS remaining the main target. We do not expect developers to migrate to using Kafka for development yet, except for a few early adopters.

    However, for this stage of the process we expect to make the initial release candidates for salobj and SAL.

    Support for CI in jenkins will still be minimal; however, since this is supposed to be a drop-in implementation, we do not expect any roadblocks.

    We expect that ts_cycle_build will also be branched out to support building the system for Kafka.

    Because ts_idl will no longer be necessary, it will probably be marked for deprecation at this stage.

  5. CI fully supported in jenkins.

    At this point we expect that running CI jobs in jenkins will be fully supported.

    The main focus is to support running jobs for the salobj and SAL Kafka branches. We should also have in place the release jobs that will make the SAL libraries available for C++ and Java.

    For Python/salobj, there is no additional support needed as conda packaging will be able to provide all the required artifacts.

  6. Initial tests at the summit.

    Once we have certified that we can run the system reliably on the TTS, we should schedule testing at the summit. The idea is to deploy the same system that was tested on the TTS at the summit and execute a 3-5 night long AuxTel run as well as any ongoing tests with the SST. If possible, it would be desirable to run both telescopes at the same time, driven by the Scheduler. However, we must be attentive to the summit calendar, and Kafka testing should be done in a way that minimizes impact.

    For this step, the Kafka versions of both ts_salobj and ts_sal will be on a separate branch and we should rely on release candidate versions for deployment.

    Development will still target the DDS version of salobj and sal.

  7. Adoption evaluation.

    After testing at the summit concludes, we will evaluate the results and produce a technote with our findings.

    If the test succeeds we can keep the summit running the Kafka version of our stack.

  8. Developer migration.

    At this point TSSW developers are expected to migrate from the DDS version of salobj and SAL to the Kafka version.

    All development shall target the Kafka version of salobj and SAL. At this time we expect developers might have to take some time fixing potential issues with unit tests to run their components against the Kafka version of the tssw-stack.

    We do not expect code changes to be necessary to adapt to Kafka. However, if necessary, this will be the time to make those changes.

  9. Initial transition to Kafka.

    At this point all development and deployment will have been migrated to target the Kafka version of the tssw stack.

    We will start taking a census of the breaking changes that are planned for a full migration to Kafka. For example, the ts-idl package will be deprecated and all enumerations will migrate to ts-xml. Developers will start incorporating these changes into their software as time permits.

  10. Final transition to Kafka.

    Release of version 8 of salobj implementing breaking changes.

Summary

After considerable effort fine-tuning the DDS middleware configuration, we were finally able to obtain a stable system that is capable of operating at large scale with a low middleware-related failure rate. At the current advanced state of the project, which is approaching its final construction stages, one might be tempted to consider this part of the project concluded.

As we demonstrated, there are a number of issues hiding underneath that may pose significant problems in the future, or even be seen as violating system requirements.

Overall our experience with DDS has been frustrating and disappointing. Even though the technology is capable of achieving impressive throughput and latency, in reality it proved to be extremely cumbersome and hard to manage and debug in large scale systems. On top of it all, we also face a potential end-of-life of the adopted library, which makes the problem considerably worse.

After exploring different solutions to the problem of long-term maintenance of our middleware, we propose to replace DDS with the already-in-use Kafka. Our benchmarks show that Kafka is able to fulfill our system throughput and latency requirements. We also showed that transitioning to Kafka would require minimal effort and minimal code refactoring.

We also note that there are major advantages of transitioning to Kafka before the end of construction. For instance, developers are actively engaged with the system and motivated. Furthermore, it also gives us the opportunity to perform the transition while system uptime pressure is not as large as it will become once commissioning of the main telescope commences.

Given our development cycle and the current state of the system, we expect to be able to fully transition to Kafka in 1 to 2 deployment cycles (approximately 1-3 months), with no impact on the summit and minimal to no downtime on the Tucson Test Stand. This estimate is based on the assumption that we have finished porting all of our code base to support Kafka, including the remaining salobj-based services that were not ported as part of the TSTN-033 efforts, as well as providing a Kafka-based version of SAL to drive the C++, LabView and Java applications. We do not anticipate spending much time tuning Kafka, since these efforts have already been done by SQuaRE to support EFD ingestion. Overall, we expect the total effort to take between 6 months and a year.

References

[1]

[LDM-612]. Eric Bellm, Robert Blum, Melissa Graham, Leanne Guy, Željko Ivezić, William O'Mullane, Maria Patterson, John Swinbank, and Beth Willman, for the LSST Project. Plans and Policies for LSST Alert Distribution. 2020. Vera C. Rubin Observatory Data Management Controlled Document. URL: https://ldm-612.lsst.io/

[2]

[SQR-034]. Angelo Fausti. EFD Operations. 2021. Vera C. Rubin Observatory SQuaRE Technical Note. URL: https://sqr-034.lsst.io/

[3]

Hermann Kopetz. Real-time systems - design principles for distributed embedded applications. Volume 395 of The Kluwer international series in engineering and computer science. Kluwer, 1997. ISBN 978-0-7923-9894-3.

[4]

[LSE-62]. German Schumacher and Francisco Delgado. LSST Observatory Control System Requirements. 2019. URL: https://ls.st/LSE-62.