Top Performance QA Issues in IoT Cloud Platforms

Today’s IoT cloud platforms promise massive scalability. However, not every platform can live up to those high expectations, and the business applications that customers build on top of the platform may bring unforeseen scalability challenges of their own. Based on our experience working with IoT cloud QA teams that regularly scale-test 1 million+ connected devices, the following issues stand out the most in an IoT cloud implementation. It is important to understand that many of these issues only surface once QA teams have simulated the workload at full scale with the IoTIFY platform. Had these bugs made it into production, it would already have been too late to change the architecture.

IoT Broker Limitations

Sounds surprising, right? Many commercial brokers struggle to handle a large number of simultaneous connection requests per second. This is partly due to the underlying network infrastructure as well as OS-level limitations. DNS- and TCP-level load balancing is required to distribute incoming connections and absorb peak connection events.
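As a rough illustration of the OS-level side of this, the sketch below (Python standard library only, Linux/macOS) checks and raises the per-process file descriptor limit before a connection-storm test. The target figure is a placeholder, not a recommendation.

```python
# Minimal sketch: check and raise the per-process file descriptor limit before a
# connection-storm test. Each accepted TCP connection consumes one descriptor,
# so a low ulimit caps concurrent connections long before the broker software does.
# (Listen backlog and ephemeral port range are separate, kernel-level knobs.)
import resource

soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
print(f"Current file descriptor limits: soft={soft}, hard={hard}")

target = 1_100_000  # hypothetical headroom for 1M+ device connections
new_soft = target if hard == resource.RLIM_INFINITY else min(target, hard)
if soft < new_soft:
    # The soft limit can only be raised up to the hard limit without root;
    # the hard limit itself comes from limits.conf or systemd's LimitNOFILE.
    resource.setrlimit(resource.RLIMIT_NOFILE, (new_soft, hard))
    print("Raised soft limit to", resource.getrlimit(resource.RLIMIT_NOFILE)[0])
```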

In addition, brokers impose limits on the maximum number of publishes per second, message payload sizes, the number of subscription topics per connection, and so on. These limits only manifest themselves once you reach a larger scale.
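A quick way to find these ceilings is to probe them directly. The sketch below is one minimal way to do that, assuming the paho-mqtt client library (1.x API); the broker hostname, topic, payload size, and target rate are all placeholders for your own environment.

```python
# Minimal sketch: probe a broker's sustained publish rate and payload-size
# tolerance using the paho-mqtt client (1.x API).
import time
import paho.mqtt.client as mqtt

BROKER = "broker.example.com"   # hypothetical broker endpoint
TOPIC = "loadtest/device-0001"
PAYLOAD = b"x" * 4096           # increase to find the payload-size ceiling
TARGET_RATE = 500               # publishes per second to attempt

client = mqtt.Client(client_id="qa-publish-probe")
client.connect(BROKER, port=1883, keepalive=60)
client.loop_start()             # background network loop handles acks

sent, start = 0, time.time()
while time.time() - start < 10:                # 10-second burst
    info = client.publish(TOPIC, PAYLOAD, qos=1)
    info.wait_for_publish()                    # blocks until the broker acks QoS 1
    sent += 1
    # Pace the loop toward TARGET_RATE; sleep only if we are ahead of schedule.
    time.sleep(max(0.0, (sent / TARGET_RATE) - (time.time() - start)))

elapsed = time.time() - start
print(f"Achieved ~{sent / elapsed:.0f} publishes/sec at {len(PAYLOAD)} B payload")
client.loop_stop()
client.disconnect()
```

Repeating the run with larger payloads or higher target rates quickly shows where the broker starts throttling or rejecting traffic.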

Overloaded Queues

Message queues are used to decouple ingress from the compute platform. Under normal operation, queue depths stay low and the system works reliably. However, when a problem occurs in the compute path, processing slows down and the queues become overloaded. As a snowball effect, devices retransmit their messages, overloading the queues even further. When the compute nodes recover, they are kept busy with redundant updates sent by the devices, and a lot of time is wasted before the system recovers entirely. In most cases, the only clean way to recover is to flush all the queues, at the risk of losing critical upstream updates. One mitigation, sketched below, is to make devices back off before retransmitting.
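The sketch below is a minimal version of that device-side mitigation: retransmit with exponential backoff and full jitter so that a recovering backend is not hit by the whole fleet at once. `send_message` is a placeholder for the actual transport call (MQTT publish, HTTPS POST, etc.).

```python
# Minimal sketch: device-side retransmission with exponential backoff and jitter.
# Spreading retries out prevents the thundering-herd effect that keeps
# overloading the queues while the compute path recovers.
import random
import time

def send_with_backoff(send_message, payload, max_attempts=8,
                      base_delay=1.0, max_delay=300.0):
    for attempt in range(max_attempts):
        try:
            send_message(payload)
            return True
        except Exception:
            # Exponential backoff capped at max_delay, with full jitter so a
            # fleet of devices does not retry in lockstep after an outage.
            delay = min(max_delay, base_delay * (2 ** attempt))
            time.sleep(random.uniform(0, delay))
    return False  # give up; let the device buffer or drop the stale update
```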

Slow Databases

Slow databases are the #1 performance issue in almost every IoT platform deployment. As the amount of data grows, queries become sluggish and database writes take longer. On top of that, long-running cron jobs make the situation worse by locking database access for extended periods. A proper implementation uses clustering, sharding, and master-slave replication to separate the write and read paths, as sketched below.
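A minimal sketch of the read/write split is shown below, assuming a PostgreSQL primary/replica pair accessed through psycopg2; the DSNs and the telemetry table are hypothetical.

```python
# Minimal sketch: route writes to the primary and reads to a replica so heavy
# analytics queries cannot block the ingest path.
import psycopg2

PRIMARY_DSN = "host=db-primary.internal dbname=iot user=app"   # hypothetical
REPLICA_DSN = "host=db-replica.internal dbname=iot user=app"   # hypothetical

class DbRouter:
    def __init__(self):
        self.primary = psycopg2.connect(PRIMARY_DSN)
        self.replica = psycopg2.connect(REPLICA_DSN)

    def write(self, sql, params=()):
        with self.primary.cursor() as cur:     # ingest path: primary only
            cur.execute(sql, params)
        self.primary.commit()

    def read(self, sql, params=()):
        with self.replica.cursor() as cur:     # dashboards/reports: replica only
            cur.execute(sql, params)
            return cur.fetchall()

db = DbRouter()
db.write("INSERT INTO telemetry (device_id, temp) VALUES (%s, %s)", ("dev-42", 21.5))
rows = db.read("SELECT avg(temp) FROM telemetry WHERE device_id = %s", ("dev-42",))
```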

Excessive Logging

Logging is the least suspected culprit when it comes to cloud scalability issues. Yet excessive logging makes it difficult to find the root cause of a problem, and it often drives up cloud costs when the retention period is long.
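A simple discipline helps: log verbosely only outside production and bound local retention. The sketch below shows one way to do this with Python's standard logging module; the environment variable, file name, and size limits are placeholders.

```python
# Minimal sketch: keep verbose logs out of production and cap local retention.
import logging
import logging.handlers
import os

# DEBUG in development, WARNING and above everywhere else.
level = logging.DEBUG if os.getenv("ENV") == "dev" else logging.WARNING

handler = logging.handlers.RotatingFileHandler(
    "service.log", maxBytes=50 * 1024 * 1024, backupCount=5)   # ~250 MB cap
handler.setFormatter(logging.Formatter("%(asctime)s %(levelname)s %(name)s %(message)s"))

root = logging.getLogger()
root.setLevel(level)
root.addHandler(handler)

logging.getLogger("ingest").debug("dropped in production")      # filtered out
logging.getLogger("ingest").warning("queue depth above threshold")
```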

Missing traceIds 

Debugging a lost message among a million payloads is not an easy task. IoTIFY has developed a write-optimized database to capture all logs, payloads, and state information for every client in every iteration. On the cloud side, however, proper debugging requires that every incoming flow message be tagged with a unique traceId and that all logging carry that trace tag. The flow then becomes much easier to debug: simply filter the logs for all messages corresponding to a particular traceId, as in the sketch below.
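A minimal sketch of this pattern follows, using Python's logging module together with contextvars; the `handle_message` function and the "traceId" payload field are hypothetical stand-ins for your own message schema.

```python
# Minimal sketch: tag every log line with the traceId carried by the incoming
# flow message, generating one if it is missing.
import contextvars
import logging
import uuid

trace_id_var = contextvars.ContextVar("trace_id", default="-")

class TraceIdFilter(logging.Filter):
    def filter(self, record):
        record.trace_id = trace_id_var.get()   # inject into every record
        return True

handler = logging.StreamHandler()
handler.setFormatter(logging.Formatter(
    "%(asctime)s %(levelname)s [trace=%(trace_id)s] %(message)s"))
handler.addFilter(TraceIdFilter())

logger = logging.getLogger("flow")
logger.setLevel(logging.INFO)
logger.addHandler(handler)

def handle_message(payload: dict):
    # Reuse the device-supplied traceId if present, otherwise mint one.
    trace_id_var.set(payload.get("traceId") or uuid.uuid4().hex)
    logger.info("message received")
    logger.info("message routed to rules engine")

handle_message({"traceId": "a1b2c3", "temp": 21.5})
handle_message({"temp": 19.0})   # no traceId supplied; one is generated
```

With this in place, filtering the logs for `trace=a1b2c3` reconstructs the entire journey of a single message.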

No System Design for Failure

Though failures are rare, they are bound to happen. Sometimes the infrastructure provider has an outage; sometimes network connectivity breaks; often the software hits a boundary condition and an undetected bug causes the service to fail continuously. In such cases, the system should be able to recover from a persistent failure. To test the resiliency of the system, apply chaos engineering: shut down nodes at random and judge the impact, as in the sketch below.
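A rudimentary chaos experiment can be as small as the following sketch, which deletes one randomly chosen pod with kubectl and lets the deployment controller reschedule it. It assumes a Kubernetes-based deployment and should only ever be pointed at a test cluster; the namespace is a placeholder.

```python
# Minimal sketch: kill one random pod and rely on the controller to reschedule it.
# Run only against a test cluster.
import random
import subprocess

def kill_random_pod(namespace="iot-staging"):
    result = subprocess.run(
        ["kubectl", "get", "pods", "-n", namespace, "-o", "name"],
        capture_output=True, text=True, check=True)
    pods = result.stdout.split()
    if not pods:
        return None
    victim = random.choice(pods)               # e.g. "pod/rules-engine-7f9c..."
    subprocess.run(["kubectl", "delete", "-n", namespace, victim], check=True)
    return victim

print("Killed", kill_random_pod())
```

Watching queue depths, error rates, and recovery time after each kill tells you whether the system actually heals itself or just limps along.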

Vertical scaling instead of horizontal

A common fix for performance issues is to allocate more CPU and RAM. Though this might fix the problem temporarily, it is rarely cost-effective and does not scale. A better solution is to redesign the component as a microservice and scale it horizontally with smaller nodes, as sketched below. However, this can be complicated by legacy code and race conditions.
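The key enabler for horizontal scaling is statelessness. The sketch below shows the shape of such a worker, assuming the redis-py client and a shared Redis list as the job queue (the host and queue name are placeholders): because the process holds no state, adding capacity means running more replicas rather than buying a bigger node.

```python
# Minimal sketch: a stateless worker that pulls jobs from a shared Redis list.
# Scale out by running more copies of this process behind the same queue.
import json
import redis

r = redis.Redis(host="queue.internal", port=6379)   # hypothetical shared queue

def process(job: dict):
    print("processing telemetry from", job.get("device_id"))

while True:
    item = r.blpop("telemetry-jobs", timeout=5)      # blocking pop from the queue
    if item is None:
        continue                                      # queue idle, keep polling
    _key, raw = item
    process(json.loads(raw))
```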

Lack of Infrastructure as Code

It is beyond belief how many one-off changes are made in production that only a few people know about. A software update rollout is often broken because DevOps forgot to update the config with a new parameter. To avoid such cases, QA teams must insist on infrastructure as code and deploy the environment cleanly from scratch before testing.

The complexity of testing an IoT cloud platform can be daunting. IoTIFY is a cloud-native QA tool specifically designed for IoT testing. Please contact us for a free consultation on your use case, or try the platform out yourself by signing up at IoTIFY.io.