Modern web applications that serve data and content to millions of users run in clustered environments. Handling a huge number of simultaneous connections requires many resources, chiefly CPU time and RAM: each connection puts additional load on the server. A single server would need a vast amount of resources, and we already know that Moore's Law no longer delivers them reliably, so we need multiple servers ready to respond to users' requests. Web applications use the HTTP(S) protocol, which follows a request-response model: upon a user's request, the server prepares a response and sends it back. Every response must be processed by the server, and that is the key factor in provisioning server resources – requests differ, from simple static assets such as CSS files or images to more demanding operations like querying a database or filtering and aggregating data.

The time between request and response is called latency, and the obvious target is to keep it at a minimum. The first problem is overflowing the server with requests to the point where it no longer responds in a timely manner (or simply degrades). To preserve low latency, we introduce clustered environments. Clusters make it easy to build reliable, scalable, and more flexible services. The other aspect of a cluster is High Availability – making sure an application stays available even if some servers are down. Gartner has estimated that downtime costs $5,600 per minute.

In the long run, server clusters save money by reducing downtime. The entry threshold is higher because of hardware redundancy, but that redundancy helps maintain reliable services. Users whose experience is always available, fast, and error-free will come back more often. Companies benefit from clusters because they reduce not only downtime but also engineering effort, especially when it comes to system recovery.


How clusters work

A cluster is a group of servers running the same application with exactly the same configuration and communicating with each other. All instances in a cluster work together to provide high availability, reliability, and scalability. Each server can handle a request, making it easier for the whole system to carry the load. From the user's perspective the cluster is transparent: it looks like a single monolithic server, because users still use just one URL or IP address to connect. Each request is then routed to a server selected by a balancing algorithm, and there are many different algorithms to choose from.

The most popular are:

  • Round robin – servers are used in turns and the load is distributed equally
  • Least connection – the server with the fewest active connections receives the next one
  • Source – every user keeps using the server selected for the initial connection; users connecting from the same IP address will always reach the same server
  • URI – users requesting the same URI (either the left or the right side of the question mark) are directed to the same server
  • Hdr – similar to URI, but uses HTTP headers to choose the server
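The algorithms above can be sketched in a few lines of Python. This is a minimal illustration, not a real load balancer: the server names and connection counters are hypothetical, and the hash-based strategies simply map a key (client IP or request path) onto the server list, which is the core idea behind the source and URI strategies.

```python
import hashlib
from itertools import cycle

servers = ["app-1", "app-2", "app-3"]  # hypothetical backend names

# Round robin: servers are used in turns.
rr = cycle(servers)
def round_robin():
    return next(rr)

# Least connection: pick the server with the fewest active connections.
active = {"app-1": 12, "app-2": 3, "app-3": 7}  # illustrative counters
def least_connection():
    return min(active, key=active.get)

def _pick_by_hash(key):
    # Stable hash of the key, mapped onto the server list.
    digest = int(hashlib.md5(key.encode()).hexdigest(), 16)
    return servers[digest % len(servers)]

# Source: hash the client IP, so the same client reaches the same server.
def source(client_ip):
    return _pick_by_hash(client_ip)

# URI: hash the request path (left of the question mark) instead.
def uri(request_uri):
    return _pick_by_hash(request_uri.split("?")[0])
```

Note how the hash-based strategies are stateless: any load balancer replica computes the same mapping, whereas round robin and least connection need shared state.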

Servers are also commonly selected based on geolocation – routing requests to the servers closest to the user (based on IP location).

Let’s put everything together:

  • We have a balancing mechanism, used by a load balancer
  • The load balancer redirects user connections to the servers in a cluster, which offers uninterrupted service and session data persistence through session replication
  • The servers handle the user requests

That's pretty much it – fairly simple, yet very powerful, and as usual not so easy when it comes to implementation. That's why there are many different tools to help us manage clusters.

What is High Availability?

When a system goes down, it can be a disaster. It involves costs, because the company can no longer earn money, plus the effort of bringing the system back online. There are many reasons for a server to stop responding: a system failure, an outage, or an application error, to name a few. To prevent unexpected downtime we can make use of clustered environments. If more than one server can handle requests and one goes down, web traffic is routed to another server that is still online. This fallback is called High Availability. No matter what happens to one server, other servers can take over its traffic. Hardware redundancy makes it easy for the whole cluster to stay up even after a failure – the probability that all nodes will be down at the same time is low.


Amazon guarantees 99.999% (or, as they call it, "five nines") availability for emergency response systems. Why those three nines in the decimal part? Wouldn't 99% be enough? Interestingly enough, 99% availability means roughly 87 hours of unavailability per year – about 14 minutes a day! For systems available around the clock, that is certainly too much.
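The downtime figures follow directly from the availability percentage. A quick sketch of the arithmetic:

```python
# Downtime implied by an availability percentage.
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600

def downtime_minutes_per_year(availability_percent):
    # The unavailable fraction of the year, in minutes.
    return MINUTES_PER_YEAR * (1 - availability_percent / 100)

for availability in (99.0, 99.9, 99.99, 99.999):
    per_year = downtime_minutes_per_year(availability)
    print(f"{availability}% -> {per_year / 60:.1f} h/year, "
          f"{per_year / 365:.2f} min/day")
```

Running this shows 99% allows about 87.6 hours of downtime a year (14.4 minutes a day), while "five nines" allows barely five minutes a year.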

Achieving High Availability (or simply HA) can be made easy with the proper tools. The most important one in the toolbox is the load balancer mentioned earlier. It is a guide that tells each request which way to go to end up in the cozy arms of a server, where it gets taken care of. The great thing about load balancers is that they know how many requests each server has processed and maintain balance (hence the name), so every server is equally loaded. Load balancers also know exactly which servers are connected, which of them are performing well (by comparing the latency of similar requests), and which are down, so they can redirect users to the servers that are still up. Thanks to HA clusters we avoid single points of failure. It is like a car mechanic giving you a replacement car.

High Availability in ThingWorx

In ThingWorx, HA was introduced in version 8.0. It lets you create multiple ThingWorx instances and connect a load balancer to handle the traffic. There is one master server ready to handle requests. The others (also called standby or slave nodes) act as a backup, waiting patiently for the main server to go down (for any reason) to join the stage and continue the show. This approach is called Active-Passive: such a cluster has multiple nodes, but only one main node is active at a time. It helps a lot in building High Availability but is not very useful for scaling up the solution. Adding new nodes only makes the system more available – a bit like adding another nine to the decimal part.

In modern IoT applications, where everything is connected, it is critical to ensure the server is available. In such a case, everything is about the data. Devices continuously collect data, but their limited resources require a server to store and process it and turn it into value. Applications running on the ThingWorx platform are critical to the whole infrastructure – it is a central database that also allows you to analyze the data and apply Machine Learning models. As an outcome, edge devices can be controlled based on the current situation (e.g. start watering the plants if the ground is dry) or projected activities (e.g. optimizing wind farms based on weather forecasts). When ThingWorx is not available, none of the above can happen. That is the main driver for using Active-Passive clustering.

High Availability Overview

ThingWorx uses Apache Zookeeper for cluster management which exposes services like:

  • Configuration management
  • Leader election
  • Synchronization
  • Naming service
  • Cluster management

Zookeeper makes it easy to add new servers (nodes) to a cluster. All we need to do is set up a new server and add it to the cluster – no restarts, no downtime. Set up the server offline, test it, and connect it to the cluster. Zookeeper will make sure it is available and align the configuration across all nodes. A typical setup contains one master node, elected by Zookeeper when the cluster starts, and two or more standby nodes that are always up but do not handle requests. This layout introduces exactly the redundancy that High Availability requires.
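To make leader election concrete, here is a toy model of the idea Zookeeper implements: each joining node registers with an increasing sequence number (Zookeeper uses ephemeral sequential znodes for this), and the surviving node with the lowest number is the master. This is a simplified sketch with invented node names, not the real Zookeeper API.

```python
# Toy model of Zookeeper-style leader election.
class Cluster:
    def __init__(self):
        self._seq = 0
        self.nodes = {}  # node name -> sequence number

    def join(self, name):
        # Each joining node gets the next sequence number,
        # like an ephemeral sequential znode.
        self._seq += 1
        self.nodes[name] = self._seq

    def leave(self, name):
        # Simulates a crash: the node's entry simply disappears.
        self.nodes.pop(name, None)

    def leader(self):
        # The surviving node with the lowest sequence number leads.
        return min(self.nodes, key=self.nodes.get) if self.nodes else None

cluster = Cluster()
for node in ("twx-a", "twx-b", "twx-c"):
    cluster.join(node)
assert cluster.leader() == "twx-a"   # the first joiner is elected master

cluster.leave("twx-a")               # the master goes down...
assert cluster.leader() == "twx-b"   # ...and a standby takes over
```

The key property is that election is automatic and deterministic: no operator intervention is needed when the master disappears.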

High Availability diagram

In the diagram, all connections are handled by a load balancer; client applications and devices no longer connect directly. That ensures traffic is routed to the currently active ThingWorx instance. The load balancer knows which instances are connected and constantly monitors their health. When a cluster node goes down, the load balancer stops receiving health data from that node and automatically redirects requests to another node.
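The health-check-and-failover behaviour described above can be sketched as follows. The node names are hypothetical, and real load balancers use periodic probes with timeouts rather than explicit reports; this only shows the routing decision.

```python
# Sketch of failover routing: track each node's health and
# route only to nodes that are currently healthy.
class LoadBalancer:
    def __init__(self, nodes):
        # All nodes are assumed healthy until a check fails.
        self.health = {node: True for node in nodes}

    def report(self, node, healthy):
        # Stand-in for a periodic health check (e.g. heartbeat timeout).
        self.health[node] = healthy

    def route(self):
        healthy = [n for n, ok in self.health.items() if ok]
        if not healthy:
            raise RuntimeError("no healthy nodes: total outage")
        # Active-passive: always prefer the first healthy node.
        return healthy[0]

lb = LoadBalancer(["twx-active", "twx-standby"])
assert lb.route() == "twx-active"
lb.report("twx-active", False)       # the active node stops answering
assert lb.route() == "twx-standby"   # traffic fails over automatically
```

From the client's point of view nothing changes: the URL still points at the load balancer, which silently swaps the backend.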

Beyond High Availability

To handle peaks (heavy load or traffic in certain situations) we need another approach: Active-Active clustering, added in ThingWorx 9. Instead of just waiting, the other nodes actively take part and respond to traffic. Such clusters not only support HA but also provide horizontal scaling, which adds new resources to the cluster. Instead of replacing hardware in a server – which of course causes downtime – a new server is prepared and added seamlessly. Capacity planning is a challenge, but it is much easier to adjust capacity in a clustered environment.

The scalability of an application can be measured by the number of requests a server can handle simultaneously. The point at which an application can no longer handle additional requests effectively is the limit of its scalability. It is much easier to double the number of requests handled by adding a second server than to achieve the same on one server, especially if the base number is high. With ThingWorx 9 it is possible to build a scalable solution that handles hundreds of thousands of devices.
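The appeal of horizontal scaling is that, to a first approximation, capacity grows linearly with node count. A trivial sketch with a made-up per-node figure (real capacity planning must also account for load-balancer overhead and shared resources such as the database):

```python
# Horizontal scaling, first approximation: capacity is linear in node count.
PER_NODE_CAPACITY = 500  # hypothetical requests/second one node sustains

def cluster_capacity(node_count):
    return node_count * PER_NODE_CAPACITY

# Doubling the nodes doubles the throughput ceiling.
assert cluster_capacity(4) == 2 * cluster_capacity(2)
```

Vertical scaling, by contrast, means taking a server offline to upgrade it, and hits a hard ceiling on how big one machine can get.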

If you need help with implementing new or upgrading existing solutions based on ThingWorx, please contact us.
