chaos engineering Archives - SD Times https://sdtimes.com/tag/chaos-engineering/ Software Development News

SD Times news digest: CNCF moves LitmusChaos to incubator; Firefox 96; CircleCI’s free plan https://sdtimes.com/softwaredev/sd-times-news-digest-cncf-moves-litmuschaos-to-incubator-firefox-96-circlecis-free-plan/ Tue, 11 Jan 2022 15:47:44 +0000

The CNCF Technical Oversight Committee (TOC) has voted to approve LitmusChaos’ move from the CNCF Sandbox to Incubation level. 

LitmusChaos is an open-source chaos engineering platform that helps teams identify weaknesses and potential outages in infrastructures by inducing chaos tests in a controlled way. 

“The CNCF ecosystem has helped us build a strong and vibrant community around Litmus,” said Uma Mukkara, maintainer of the Litmus project and CEO of ChaosNative. “We have received consistent feedback from users since the launch of 1.0 release last year, which helped us come up with a robust set of features and a stable platform for cloud native chaos engineering.”

Firefox 96 released

Firefox 96 was released today with changes to CSS, HTTP, APIs and more. 

The hwb() function for use as a CSS color value has been implemented and Firefox now provides support for the ‘color-scheme’ property. Also, cookies sent from the same domain but using different schemes (for example http or https) are now considered to be from different sites with respect to the cookie SameSite directive. 
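Concretely, this "schemeful same-site" behavior means SameSite rules now distinguish the http:// and https:// versions of a domain, so setting cookie attributes explicitly matters more than relying on browser defaults. A small illustrative sketch using Python's standard library (the cookie name and values are made up; no particular web framework is assumed):

```python
from http.cookies import SimpleCookie

# Build a Set-Cookie header with explicit SameSite and Secure attributes,
# rather than depending on whatever default the browser applies.
cookie = SimpleCookie()
cookie["session"] = "abc123"
cookie["session"]["samesite"] = "Lax"  # sent on top-level cross-site navigations only
cookie["session"]["secure"] = True     # never sent over plain http

header = cookie["session"].OutputString()
print(header)  # contains "SameSite=Lax" and "Secure"
```

With Secure set, the http:// scheme never sees the cookie at all, which sidesteps most of the cross-scheme ambiguity the Firefox change addresses.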

The full list of changes that will affect developers is available here.

CircleCI adds extensive free plan

CircleCI launched a new free plan to give teams access to more build minutes, larger resource classes, and the most popular features that were formerly only available on paid plans. 

The new free plan includes up to 6,000 build minutes per month, unlimited users, free resource classes on Docker, Linux, and Windows, and more. 

“CircleCI is built to scale along with you. As you build on CircleCI and start to see the impact of the speed, flexibility, and power we bring your team, we have flexible, usage-based plans to support your team’s unique needs,” AJ Joshi, chief product officer at CircleCI, wrote in a blog post.

Apache Flink ML 2.0 now available

The Apache Flink community announced the release of Flink ML 2.0.0. 

Flink ML is a library that provides APIs and infrastructure for building stream-batch unified machine learning algorithms that aim to be easy to use and performant, with (near-) real-time latency.

The new release includes a major refactor of the earlier Flink ML libraries and introduces major features that extend the Flink ML API and the iteration runtime.

SD Times news digest: Gremlin Automatic Service Discovery, WhiteHat Attack Surface Management, and Jamf’s same-day Apple OS support https://sdtimes.com/softwaredev/sd-times-news-digest-gremlin-automatic-service-discovery-whitehat-attack-surface-management-and-jamfs-same-day-apple-os-supper/ Tue, 27 Apr 2021 15:51:36 +0000

Gremlin has added Automatic Service Discovery to its chaos engineering platform in an effort to help companies improve resilience and reduce downtime by identifying the various services running across distributed systems. 

“The rise in popularity of microservices necessitate services functioning as first-class citizens. The infrastructure layer is becoming more abstract and engineers are increasingly thinking about their systems as a collection of services,” said Matthew Fornaciari, the CTO and co-founder of Gremlin. “We want to replicate that mental model in Gremlin and reduce the cognitive load necessary to create controlled chaos.”

Gremlin also built a new way to track reliability progress by enabling SREs and DevOps teams to click into a particular service and view the full history of events that were run over time. 

More information is available here.

WhiteHat Attack Surface Management announced

WhiteHat Security released Attack Surface Management powered by Bit Discovery to offer enterprises a more streamlined way to discover, manage and secure their comprehensive attack surface. 

Bit Discovery automatically generates a comprehensive inventory of exposed assets, including websites, VPNs, DNS servers, IoT devices and phishing sites. Security teams can then use the dashboard to bring specific assets under WhiteHat’s application security service, according to the company.

“Attack Surface Management Powered by Bit Discovery not only bolsters WhiteHat’s platform with innovative tools that provide a tremendous amount of value for our clients, it also advances our vision to build security into each step of the entire software development lifecycle,” said Craig Hinkley, the chief executive officer at WhiteHat Security. 

Jamf announces same-day support for Apple OS releases

Jamf announced that it is prepared with same-day feature support and compatibility for Apple’s latest operating system releases including iOS 14.5, iPadOS 14.5, macOS 11.3 and tvOS 14.5 when they become available. 

Jamf said that this functionality is especially useful to education customers that are looking to access education apps in the Mac App Store and make them available to students. 

The company’s other products Jamf Now, Jamf Connect and Jamf Protect are also offering same-day support for the latest releases from Apple with compatibility for new operating systems.

Microsoft announces plans to end support for .NET Framework 4.5.2, 4.6 and 4.6.1

Microsoft announced it will be ending support for .NET Framework 4.5.2, 4.6 and 4.6.1 in one year, after which it will no longer provide updates, including security fixes, or technical support for these versions.

There will be no change to the support timelines for any other .NET Framework version including .NET Framework 3.5 SP1, which will continue to be supported as documented on the .NET Framework Lifecycle FAQ.

Microsoft found that updating .NET Framework 4.6.2 and newer versions to support newer digital certificates for the installers would satisfy the vast majority (98%) of users without their needing to make a change.

Additional details are available here.

Report finds chaos engineering can significantly decrease MTTR and increase availability https://sdtimes.com/monitor/report-finds-chaos-engineering-can-significantly-decrease-mttr-and-increase-availability/ Wed, 27 Jan 2021 14:25:56 +0000

A new report revealed those who have successfully implemented chaos engineering have 99.9% or higher availability and greatly improved their mean time to resolution (MTTR). 

Gremlin’s inaugural 2021 State of Chaos Engineering report found that 23% of teams who frequently run chaos engineering projects had an MTTR of under one hour, and 60% had an MTTR of under 12 hours.

Gartner echoed similar sentiments about the report’s availability finding by predicting that by 2023, 80% of organizations that use chaos engineering practices as part of SRE initiatives will reduce their MTTR by 90%.

According to Gremlin’s report, the highest availability groups commonly utilized autoscaling, load balancers, backups, select rollouts of deployments, and monitoring with health checks. 

RELATED CONTENT: 
Find outages before they become failures
Chaos engineering in serverless environments is more useful than you’d think
To build resilient systems, embrace the chaos

The most common way to monitor standard uptime was synthetic monitoring; however, many organizations reported using multiple methods and metrics. 

In the report, Gremlin also found that chaos engineering has seen much greater adoption recently, and that the practice has matured tremendously since its inception 12 years ago. 

“The diversity of teams using Chaos Engineering is also growing. What began as an engineering practice was quickly adopted by SRE teams, and now many platform, infrastructure, operations, and application development teams are adopting the practice to improve the reliability of their applications,” the report stated. 

While it’s still an emerging practice, the majority of respondents (60%) said they had run at least one chaos engineering attack, and more than 60% of respondents have run chaos experiments against Kubernetes. 

The most commonly run experiments reflected the top failures that companies experience, with network attacks such as latency injection at the top. 

However, some companies have not adopted chaos engineering, with lack of awareness, experience, and time cited as the main reasons (80% combined). Fewer than 10% of people said it was because of fear of something going wrong.

“It’s true that in practicing Chaos Engineering we are injecting failure into systems, but using modern methods that follow scientific principles, and methodically isolating experiments to a single service, we can be intentional about the practice and not disrupt customer experiences,” the report stated. “We believe the next stage of Chaos Engineering involves opening up this important testing process to a broader audience and to making it easier to safely experiment in more environments.”

Chaos engineering in serverless environments is more useful than you’d think https://sdtimes.com/test/chaos-engineering-in-serverless-environments-is-more-useful-than-youd-think/ Tue, 12 Jan 2021 17:00:11 +0000

Chaos engineering has been gaining a lot of traction over the last few years as it moved from its origins at Netflix to more and more companies across the industry. Many development teams use it to prevent downtime by trying to break their systems on purpose so that they can improve those systems before they cause problems down the line. 

Given the resilient nature of serverless computing, which is backed by cloud providers’ uptime and availability guarantees, it might seem that chaos engineering is one method of testing that wouldn’t be practical in serverless. But Emrah Samdan, vice president of product for Thundra, believes that serverless computing and chaos engineering actually go really well together. 

Because the cloud vendor guarantees availability and scalability, when doing chaos engineering in serverless environments, the goal is not necessarily to bring down the system, but to find application-level failures, such as those caused by lack of memory or time. “The purpose of chaos experiments is not to take the whole software down but to learn from failures by injecting small, controllable failures,” Samdan said. 

RELATED CONTENT: To build resilient systems, embrace the chaos

Some of the most common examples of chaos engineering in serverless that Samdan sees are injecting latency into serverless functions to check that timeouts work properly, and injecting failures into third-party connections.
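A minimal sketch of the first pattern, assuming a generic AWS Lambda-style handler signature; the decorator and handler here are hypothetical illustrations, not Thundra's API:

```python
import functools
import random
import time

def inject_latency(max_delay_s=2.0, probability=0.5):
    """Chaos decorator: randomly delays a handler so timeout handling can be tested."""
    def decorator(handler):
        @functools.wraps(handler)
        def chaotic(event, context):
            if random.random() < probability:
                time.sleep(random.uniform(0, max_delay_s))  # the injected fault
            return handler(event, context)
        return chaotic
    return decorator

@inject_latency(max_delay_s=0.05, probability=1.0)  # always inject in this demo
def handler(event, context):
    return {"statusCode": 200}

print(handler({}, None))
```

In a real experiment you would run this in staging first, as Samdan advises below, and watch whether upstream timeouts and retries behave as expected when the delay approaches the function's configured timeout.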

Samdan noted that defining the steady state, the first step of chaos engineering, is important but often overlooked. “People just want to break things, but the first step is actually to understand how they actually work, what are the ups and downs of the system, what are the limits, how resilient is your system already,” he said.

He believes that determining this baseline is even more important in serverless environments. This is because what is considered normal for serverless can be very different from what is considered normal in other systems. For example, in serverless, both latency and the number of executions are very important, which isn’t as true in other systems. 

Because of this, it is important that an engineering team have proper observability in place. “Chaos engineering experiments are all about asking questions to understand what actually happened during the experiment. You cannot achieve this by keeping an eye on metric charts, as they are designed to answer known questions. In order to ask questions about the unknowns of the distributed system, you need to have all three pillars of observability — logs, metrics, and traces — together and integrated. I see the adoption of correct observability still continues and we see more and more companies using modern tools for this purpose. I frankly believe that we’ll see more and more companies stepping into chaos engineering as modern observability becomes more widespread,” Samdan said. 
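One way to make "all three pillars together and integrated" concrete is to emit structured log events that carry a shared trace id, so log lines can later be joined with distributed traces and metrics for the same request. A minimal, vendor-neutral sketch (the field and event names are illustrative):

```python
import json
import logging
import time
import uuid

logging.basicConfig(level=logging.INFO, format="%(message)s")
logger = logging.getLogger("chaos")

def log_event(trace_id, event, **fields):
    """Emit one JSON log line; the shared trace_id is the join key
    between logs, traces, and metrics for the same request."""
    record = {"ts": time.time(), "trace_id": trace_id, "event": event, **fields}
    logger.info(json.dumps(record))
    return record

trace_id = uuid.uuid4().hex
log_event(trace_id, "latency_injected", target="payment-service", delay_ms=300)
log_event(trace_id, "timeout_observed", caller="checkout", after_ms=250)
```

Querying for one trace_id after an experiment then answers the "unknown" questions Samdan describes, rather than only the questions a pre-built metric chart was designed for.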

For those looking to get started with doing chaos experiments in serverless environments, Samdan recommends starting small and starting in the staging environment. Rather than throttling all serverless functions, he advises throttling or injecting latency into one or two downstream services. “It’s not only about testing failures on your system, it’s also about testing how your team will react to these failures. So starting small is actually very encouraging to persevere for more comprehensive experiments,” Samdan said. 

Like adopting any new methodology, changing culture is the biggest challenge. Chaos engineering needs to be initiated and sponsored by higher-level folks in the company, Samdan believes. “Teams should be able to work in harmony by planning, running and evaluating the game days. We should always keep in mind that chaos experiments are not for criticizing colleagues for the weaknesses in their modules. It’s more about fixing those weaknesses before customers get impacted and letting those colleagues grow as a result of the experiments,” said Samdan. 

Samdan also advised developers to remember that chaos engineering isn’t a silver bullet for finding each and every failure. It works best when used to complement other testing methodologies like unit tests and integration tests. “However, chaos engineering taps into a very different point than other tests. It tests the resiliency of other parts of your system when one part is having some problems due to latency or any type of failures. Considering the distributed systems serverless paradigm implies, running chaos experiments become a no-brainer to reveal the hidden traps before customers reveal them on production,” he said. 

 

AWS unveils new chaos engineering tool: Fault Injection Simulator https://sdtimes.com/test/aws-unveils-new-chaos-engineering-tool-fault-injection-simulator/ Wed, 16 Dec 2020 19:07:19 +0000

AWS is enabling teams to address application weaknesses with the introduction of the AWS Fault Injection Simulator at its virtual AWS re:Invent 2020 conference this week. The simulator is a chaos engineering tool expected to be generally available in 2021.

According to the company, the new offering will come packed with pre-built templates for creating the desired disruptions, whether that’s for server latency or database errors. It also contains controls and guardrails, such as automatically rolling back or stopping the experiment if certain conditions are met, so teams can quickly return to the pre-experiment state. 

RELATED CONTENT: To build resilient systems, embrace the chaos

Teams will also have access to a range of fine-grained controls during the experiments to gradually or simultaneously impair how different resources perform in a production environment as it is scaled up, according to AWS in a post.

“With a few clicks in the console, teams can run complex scenarios with common distributed system failures happening in parallel or building sequentially over time, enabling them to create the real world conditions necessary to find hidden weaknesses,” AWS explained on its website. 

AWS explained the new offering will be especially useful for simulating game days by creating high-traffic conditions or a new launch, or for integration right into continuous delivery pipelines so that teams can repeatedly test the impact of faults throughout the SDLC. 

Fault Injection Simulator can be used to generate tests in many AWS services, such as Amazon EC2, Amazon EKS, Amazon ECS, and Amazon RDS. 

Gremlin isolates its resource attacks to soundproof noisy neighbors https://sdtimes.com/test/gremlin-isolates-its-resource-attacks-to-soundproof-noisy-neighbors/ Wed, 18 Nov 2020 16:34:37 +0000

The software reliability company Gremlin announced three major platform updates at the virtual KubeCon North America 2020 conference this week to ensure users can safely and securely prepare solutions for failure regardless of the Kubernetes platform. The new features are: the ability to isolate its resource attacks into a single container, support for containerd and CRI-O container runtimes, and fine-grained namespace access control.

“Kubernetes is becoming the default way to build and operate applications at many enterprises, but along with the advantage of abstraction comes uncertainty,” said Lorne Kligerman, senior director of product at Gremlin. “We’re providing DevOps teams with better tooling to understand how their Kubernetes applications will behave under various stresses, such as when a neighboring container is spiking with traffic.”

RELATED CONTENT: 
The First 5 Chaos Experiments to Run on Kubernetes
To build resilient systems, embrace the chaos

According to Kligerman, because Kubernetes enables a higher tenant density on a host and increases infrastructure utilization, it can result in a “noisy neighbor” problem for DevOps teams. For instance, scaling or problematic services can impact one another if they are in the same cluster. “If applications aren’t tested for HPA and resource limits, it’s difficult to determine if your application is decoupled enough to scale out pods independently and to know if noisy neighbors can still break services sharing the same node,” Kligerman wrote in a post.

By isolating its resource attacks into a single container, users can test individual pod scaling and resource limits, and prevent “noisy neighbors.” 

The noisy neighbor problem can also result in security and access control concerns. The new fine-grained namespace access control aims to address this by ensuring only team members with correct permissions have access to specific Kubernetes objects. “This is crucial to ensuring the Chaos Engineering work an engineer is doing isn’t negatively impacting neighboring services,” the company stated in its announcement.

Lastly, support for the containerd and CRI-O container runtimes makes chaos engineering available on more platforms. The company also supports earlier versions of Amazon EKS and OpenShift, and added support for the new container runtimes in order to cover the latest versions of those platforms. 

“By supporting these additional runtimes, customers can now run attacks across their environment, even if it’s mixed, using a single UI and API. This makes testing heterogeneous environments even easier,” Kligerman explained. 

Engineering practices that advance testing https://sdtimes.com/test/engineering-practices-that-advance-testing/ Wed, 02 Sep 2020 16:00:05 +0000

Testing practices are shifting left and right, shaping the way software engineering is done. In addition to the many types of tests described in this Deeper Look, test-driven development (TDD), progressive engineering and chaos engineering are also considered testing today.

TDD
TDD has become popular with Agile and DevOps teams because it saves time. Tests are written from requirements in the form of use cases and user stories and then code is written to pass those tests. TDD further advances the concept of building smaller pieces of code, and the little code quality successes along the way add up to big ones. TDD builds on the older concept of extreme programming (XP).

RELATED CONTENT: There’s more to testing than simply testing

“Test-driven development helps drive quality from the beginning and [helps developers] find defects in the requirements before they need to write code,” said Thomas Murphy, senior director analyst at Gartner.

Todd Lemmonds, QA architect at health benefits company Anthem, said his team is having a hard time with it because they’re stuck in an interim phase.

“TDD is the first step to kind of move in the Agile direction,” said Lemmonds. “How I explain it to people is you’re basically focusing all your attention on [validating] these acceptance criteria based on this one story. And then they’re like, OK what tests do I need to create and pass before this thing can move to the next level? They’re validating technical specifications whereas [acceptance test driven development] is validating business specifications and that’s what’s presented to the stakeholders at the end of the day.”
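The test-first rhythm described above can be sketched in miniature: write a failing test from the acceptance criteria (red), then just enough code to make it pass (green). The discount rule and function names below are illustrative, not from any source:

```python
import unittest

# Red: the tests are written first, from the story's acceptance criteria
# ("orders over $100 get 10% off"). At this point apply_discount doesn't exist.
class TestDiscount(unittest.TestCase):
    def test_ten_percent_off_over_100(self):
        self.assertAlmostEqual(apply_discount(200.0), 180.0)

    def test_no_discount_at_or_under_100(self):
        self.assertAlmostEqual(apply_discount(100.0), 100.0)

# Green: just enough implementation to make the tests pass; refactor comes next.
def apply_discount(total):
    return total * 0.9 if total > 100 else total
```

Running the suite (for example with `python -m unittest`) now passes; the tests double as an executable record of the acceptance criteria Lemmonds describes.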

Progressive Software Delivery
Progressive software delivery is often misdefined by parsing the words. The thinking is if testing is moving forward (becoming more modern or maturing), then it’s “progressive.” Progressive delivery is something Agile and DevOps teams with a CI/CD pipeline use to further their mission of delivering higher-quality applications faster that users actually like. It can involve a variety of tests and deployments including A/B and multivariate testing using feature flags, blue-green and canary deployments as well as observability. The “progressive” part is rolling out a feature to progressively larger audiences.

“Progressive software delivery is an effective strategy to mitigate the risk to business operations caused by product changes,” said Nancy Kastl, executive director of testing services at digital transformation agency SPR. “The purpose is to learn from the experiences of the pilot group, quickly resolve any issues that may arise and plan improvements for the full rollout.”

Other benefits Kastl perceives include:

  • Verification of correctness of permissions setup for business users
  • Discovery of business workflow issues or data inaccuracy not detected during testing activities
  • Effective training on the software product
  • The ability to provide responsive support during first-time product usage
  • The ability to monitor performance and stability of the software product under actual production conditions including servers and networks

“Global companies with a very large software product user base and custom configurations by country or region often use this approach for planning rollout of software products,” Kastl said.

Chaos Engineering
Chaos engineering is literally testing the effects of chaos (infrastructure, network and application failures) as it relates to an application’s resiliency. The idea originated at Netflix with a program called “Chaos Monkey,” which randomly chooses a server and disables it. Eventually, Netflix created an entire suite of open-source tools called the “Simian Army” to test for more types of failures, such as a network failure or an AWS region or availability zone drop. 
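In spirit, the original Chaos Monkey loop is tiny: pick a random instance from the fleet and disable it. A toy sketch, where the instance list and terminate callback are stand-ins rather than Netflix's actual implementation:

```python
import random

def chaos_monkey(instances, terminate, rng=random):
    """Randomly select one running instance and disable it via the callback."""
    victim = rng.choice(instances)
    terminate(victim)
    return victim

# Toy run: "terminating" an instance just removes it from the fleet.
fleet = ["i-01", "i-02", "i-03"]
killed = chaos_monkey(fleet, fleet.remove)
print(f"terminated {killed}; fleet is now {fleet}")
```

The value is not the random kill itself but what it forces: any service that cannot survive losing one instance gets discovered in a controlled experiment rather than in an outage.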

The Simian Army project is no longer actively maintained but some of its functionality has been moved to other Netflix projects. Chaos engineering lives on. In fact, Gartner is seeing a lot of interest in it.

“Now what you’re starting to see are a couple of commercial implementations. For chaos to be accepted more broadly, often you need something more commercial,” said Gartner’s Murphy. “It’s not that you need commercial software, it’s going to be a community around it so if I need something, someone can help me understand how to do it safely.”

Chaos engineering is not something teams suddenly just do. It usually takes a couple of years because they’ll experiment in phases, such as lab testing, application testing and pre-production. 

Chris Lewis, engineering director at technology consulting firm DMW Group, said his firm has tried chaos engineering on a small scale, introducing the concept to DMW’s rather conservative clientele.

“We’ve introduced it in a pilot sense showing them it can be used to get under the hood of non-functional requirements and showing that they’re actually being met,” said Lewis. “I think very few of them would be willing to push the button on it in production because they’re still nervous. People in leadership positions [at those client organizations] have come from a much more traditional background.”

Chaos engineering is more common among digital disruptors and smaller innovative companies that distinguish themselves using the latest technologies and techniques.

Proceed with caution

Adopting more testing techniques can be beneficial when organizations are actually prepared to do so. One common mistake is trying to take on too much too soon and then failing to reap the intended benefits. Raj Kanuparthi, founder and CEO of custom software development company Narwal, said in some cases, people need to be more realistic. 

“If I don’t have anything in place, then I get my basics right, [create] a road map, then step-by-step instrument. You can do it really fast, but you have to know how you’re approaching it,” said Kanuparthi, who is a big proponent of Tricentis. “So many take on too much and try 10 things but don’t make meaningful progress on anything and then say, ‘It doesn’t work.’”

There’s more to testing than simply testing https://sdtimes.com/test/theres-more-to-testing-than-simply-testing/ Wed, 02 Sep 2020 13:30:44 +0000

Rapid innovation and the digitalization of everything is increasing application complexity and the complexity of environments in which applications run. While there’s an increasing emphasis on continuous testing as more DevOps teams embrace CI/CD, some organizations are still disproportionately focused on functional testing.

“Just because it works doesn’t mean it’s a good experience,” said Thomas Murphy, senior director analyst at Gartner. “If it’s my employee, sometimes I make them suffer but that means I’m going to lose productivity and it may impact employee retention. If it’s my customers, I can lose retention because I did not meet the objectives in the first place.”

Today’s applications should help facilitate the organization’s business goals while providing the kind of experience end users expect. To accomplish that, software teams must take a more holistic approach to testing than they have done traditionally, which involves more types of tests and more roles involved in testing.

“The patterns of practice come from architecture and the whole idea of designing patterns,” said Murphy. “The best practices 10 years ago are not best practices today and the best practices three years ago are probably not the best practices today. The leading practices are the things Google, Facebook and Netflix were doing three to five years ago.”

Chris Lewis, engineering director at technology consulting firm DMW Group, said his enterprise clients are seeing the positive impact a test-first mindset has had over the past couple of years.

“The things I’ve seen [are] particularly in the security and infrastructure world where historically testing hasn’t been something that’s been on the agenda. Those people tend to come from more traditional, typically full-stack software development backgrounds and they’re now wanting more control of the development processes end to end,” said Lewis. “They started to inject testing thinking across the life cycle.”

Nancy Kastl, executive director of testing services at digital transformation agency SPR, said a philosophical evolution is occurring regarding what to test, when to test and who does the testing. 

“Regarding what to test, the movement continues away from both manual [and] automated UI testing methods and toward API and unit-level testing. This allows testing to be done sooner, more efficiently and fosters better test coverage,” said Kastl.

“When” means testing earlier and throughout the SDLC.

“Companies are continuing to adopt Agile or improve the way they are using Agile to achieve its benefits of continuous delivery,” said Kastl. “With the current movement to continuous integration and delivery, the ‘shift-left’ philosophy is now embedded in continuous testing.”

However, when everyone’s responsible for testing, arguably nobody is, unless it’s clear who should test what, when and how. Testing can no longer be the sole domain of testers and QA engineers because finding and fixing bugs late in the SDLC is inadequate, unnecessarily costly and untenable as application teams continue to shrink their delivery cycles. As a result, testing must necessarily shift left to developers and right to production, involving more roles.

“This continues to be a matter of debate. Is it the developers, testers, business analysts, product owners, business users, project managers [or] someone else?” said Kastl. “With an emphasis on test automation requiring coding skills, some argue for developers to do the testing beyond just unit tests.”

Meanwhile, the scope of tests continues to expand beyond unit, integration, system and user acceptance testing (UAT) to include security, performance, UX, smoke, and regression testing. Feature flags, progressive software delivery, chaos engineering and test-driven development are also considered part of the testing mix today.

Security goes beyond penetration testing
Organizations irrespective of industry are prioritizing security testing to minimize vulnerabilities and manage threats more effectively.

“Threat modeling would be a starting point. The other thing is that AI and machine learning are giving me more informed views of both code and code quality,” said Gartner’s Murphy. “There are so many different kinds of attacks that occur and sometimes we think we’ve taken these precautions but the problem is that while you were able to stop [an attack] one way, they’re going to find different ways to launch it, different ways it’s going to behave, different ways that it will be hidden so you don’t detect it.”

In addition to penetration testing, organizations may use a combination of tools and services that can vary based on the application. Some of the more common ones are static and dynamic application security testing, mobile application security testing, database security testing, software composition analysis and appsec testing as a service.

DMW Group’s Lewis said his organization helps clients improve the way they define their compliance and security rules as code, typically working with people in conventional security architecture and compliance functions.

“We get them to think about what the outcomes are that they really want to achieve and then provide them with expertise to actually turn those into code,” said Lewis.

SPR’s Kastl said continuous delivery requires continuous security verification to provide early insight into potential security vulnerabilities.

“Security, like quality, is hard to build in at the end of a software project and should be prioritized through the project life cycle,” said Kastl. “The Application Security Verification Standard (ASVS) is a framework of security requirements and controls that defines a secure application and can be used when developing and testing modern applications.”

Kastl said that includes:

  • adding security requirements to the product backlog with the same attention to coverage as the application’s functionality;
  • a standards-based test repository that includes reusable test cases for manual testing and to build automated tests for Level 1 requirements in the ASVS categories, which include authentication, session management, and function-level access control;
  • in-sprint security testing that’s integrated into the development process while leveraging existing approaches such as Agile, CI/CD and DevOps;
  • post-production security testing that surfaces vulnerabilities requiring immediate attention before opting for a full penetration test;
  • and, penetration testing to find and exploit vulnerabilities and to determine if previously detected vulnerabilities have been fixed. 
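The session-management items in that list lend themselves to the reusable, automatable tests Kastl describes. As an illustration only, a minimal check of Set-Cookie hardening attributes, one of the ASVS Level 1 session-management concerns, might look like this in Python (the helper and its rules are a sketch, not part of the ASVS itself):

```python
# A minimal sketch of a reusable security test in the spirit of ASVS
# Level 1 session-management checks. The cookie attributes checked here
# (Secure, HttpOnly, SameSite) are standard; the helper itself is ours.

from http.cookies import SimpleCookie

REQUIRED_ATTRS = ("secure", "httponly")

def session_cookie_findings(set_cookie_header: str) -> list[str]:
    """Return a list of missing hardening attributes for a Set-Cookie header."""
    cookie = SimpleCookie()
    cookie.load(set_cookie_header)
    findings = []
    for name, morsel in cookie.items():
        for attr in REQUIRED_ATTRS:
            if not morsel[attr]:
                findings.append(f"{name}: missing {attr}")
        if not morsel["samesite"]:
            findings.append(f"{name}: missing samesite")
    return findings
```

A test like this can run in-sprint against responses the suite already captures, rather than waiting for a post-production scan or a full penetration test.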

“The OWASP Top 10 is a list of the most common security vulnerabilities,” said Kastl. “It’s based on data gathered from hundreds of organizations and over 100,000 real-world applications and APIs.”

Performance testing beyond load testing
Load testing ensures that the application continues to operate as intended as the workload increases with emphasis on the upper limit. By comparison, scalability testing considers both minimum and maximum loads. In addition, it’s wise to test outside of normal workloads (stress testing), to see how the application performs when workloads suddenly spike (spike testing) and how well a normal workload endures over time (endurance testing).
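The distinctions among these test types come down to the shape of the workload over time. A sketch, with purely illustrative request rates:

```python
# The workload shapes described above, sketched as request-rate profiles
# (requests per second sampled once per minute). All numbers are illustrative.

def load_profile(minutes, base=100, peak=500):
    """Load test: steady ramp toward the expected upper limit."""
    return [base + (peak - base) * m // max(minutes - 1, 1) for m in range(minutes)]

def spike_profile(minutes, base=100, spike=2000, at=None):
    """Spike test: normal traffic with one sudden burst."""
    at = minutes // 2 if at is None else at
    return [spike if m == at else base for m in range(minutes)]

def endurance_profile(minutes, base=100):
    """Endurance test: a normal workload held flat over a long window."""
    return [base] * minutes
```

Feeding profiles like these to a load generator makes the difference concrete: the same system that handles the ramp gracefully may fall over on the burst, or degrade slowly over the long flat window.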

“Performance really impacts people from a usability perspective. It used to be if your page didn’t load within this amount of time, they’d click away and then it wasn’t just about the page, it was about the performance of specific elements that could be mapped to shopping cart behavior,” said Gartner’s Murphy.

For example, GPS navigation and wearable technology company Garmin suffered a multi-day outage when it was hit by a ransomware attack in July 2020. Its devices were unable to upload activity to Strava’s mobile app and website for runners and cyclists. The situation underscores the fact that cybersecurity breaches can have downstream effects.

“I think Strava had a 40% drop in data uploads. Pretty soon, all this data in the last three or four days is going to start uploading to them so they’re going to get hit with a spike of data, so those types of things can happen,” said Murphy.

To prepare for that sort of thing, one could run performance and stress tests on every build or use feature flags to compare performance with the prior build.

Instead of waiting for a load test at the end of a project to detect potential performance issues, performance tests can be used to baseline the performance of an application under development.

“By measuring the response time for a single user performing specific functions, these metrics can be gathered and compared for each build of the application,” said Kastl. “This provides an early warning of potential performance issues. These baseline performance tests can be integrated with your CI/CD pipeline for continuous monitoring of the application’s performance.”
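A minimal sketch of that baseline idea, assuming a placeholder transaction and tolerance (both would be project-specific):

```python
# Sketch of the per-build baseline Kastl describes: time a single-user
# transaction on each build and fail the pipeline if it regresses past a
# tolerance. The timed function and the 20% tolerance are assumptions.

import time

def measure_response(fn, runs=5):
    """Median wall-clock time of one user performing a specific function."""
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        fn()
        samples.append(time.perf_counter() - start)
    return sorted(samples)[len(samples) // 2]

def check_against_baseline(current, baseline, tolerance=0.20):
    """True if the current build stays within tolerance of the stored baseline."""
    return current <= baseline * (1 + tolerance)
```

Wired into a CI/CD pipeline, a failing check becomes the early warning of a performance issue, long before the end-of-project load test.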

Mobile and IoT devices, such as wearables, have increased the need for more comprehensive performance testing and there’s still a lot of room for improvement.

“As the industry has moved more to cloud-based technology, performance testing has become more paramount,” said Todd Lemmonds, QA architect at health benefits company Anthem, a Sauce Labs customer. “One of my current initiatives is to integrate performance testing into the CI/CD pipeline. It’s always done more toward UAT which, in my mind, is too late.”

To effect that change, developers need to think about performance and how the analytics need to be structured in a way that allows the business to make decisions. The artifacts can be used later during a full systems performance test.

“We’ve migrated three channels on to cloud, [but] we’ve never done a performance test of all three channels working at capacity,” said Lemmonds. “We need to think about that stuff and predict the growth pattern over the next five years. We need to make sure that not only can our cloud technologies handle that but what the full system performance is going to look like. Then, you run into issues like all of our subsystems are not able to handle the database connections so we have to come up with all kinds of ways to virtualize the services, which is nothing new to Google and Amazon, but [for] a company like Anthem, it’s very difficult.”

DMW Group’s Lewis said some of his clients have ignored performance testing in cloud environments since cloud environments are elastic.

“We have to bring them back to reality and say, ‘Look, there is an art form here that has significantly changed and you really need to start thinking about it in more detail,” said Lewis.

UX testing beyond UI and UAT
While UI and UAT testing remain important, UI testing is only a subset of what needs to be done for UX testing, while traditional UAT happens late in the cycle. Feature flagging helps by providing early insight into what’s resonating and not resonating with users while generating valuable data. There’s also usability testing including focus groups, session recording, eye tracking and quick one-question in-app surveys that ask whether the user “loves” the app or not.
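Feature flagging works for this because each user can be bucketed deterministically, so cohorts stay stable while only a fraction sees the new experience. A sketch (the flag names, hashing scheme and percentages are illustrative, not any particular vendor’s API):

```python
# A minimal sketch of a percentage-rollout feature flag: hash the flag and
# user ID together so each user consistently lands in the same cohort.
# Everything here is illustrative, not a real flagging product's API.

import hashlib

def flag_enabled(flag: str, user_id: str, rollout_pct: int) -> bool:
    """Deterministically bucket a user into a 0-100 percentage rollout."""
    digest = hashlib.sha256(f"{flag}:{user_id}".encode()).hexdigest()
    bucket = int(digest, 16) % 100
    return bucket < rollout_pct
```

Comparing engagement between the enabled and disabled cohorts is what generates the early "is this resonating?" data described above.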

One area that tends to lack adequate focus is accessibility testing, however. 

“More than 54 million U.S. consumers have disabilities and face unique challenges accessing products, services and information on the web and mobile devices,” said SPR’s Kastl. “Accessibility must be addressed throughout the development of a project to ensure applications are accessible to individuals with vision loss, low vision, color blindness or learning loss, and to those otherwise challenged by motor skills.”

The main issue is a lack of awareness, especially among people who lack firsthand or secondhand experience with disabilities. While there are no regulations to enforce, accessibility-related lawsuits are growing exponentially. 

“The first step to ensuring an application’s accessibility is to include ADA Section 508 or WCAG 2.1 Accessibility standards as requirements in the product’s backlog along with functional requirements,” said Kastl.

Non-compliance with an accessibility standard on one web page tends to be repeated on all web pages or throughout a mobile application. To detect non-compliant practices as early as possible, wireframes and templates for web and mobile applications should be reviewed for potentially non-compliant components, Kastl said.

In addition to the design review, there should be a code review in which development teams perform self-assessments, using tools and practices to identify standards that have not been followed in coding, and take corrective action before application testing starts. Then, during in-sprint testing, assistive technologies such as screen readers, screen magnification and speech recognition software should be used to test web pages and mobile applications against the accessibility standards. Automated tools can detect and report non-compliance.
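Some of those automated checks are simple enough to sketch. WCAG, for instance, requires a text alternative for images, and because templates are shared, one missing alt attribute usually repeats across the whole site. A minimal illustration using only the Python standard library:

```python
# An illustrative automated accessibility check: count <img> tags with no
# alt attribute, one of the WCAG "non-text content" requirements. Real
# audit tools cover far more rules; this is only a sketch of the idea.

from html.parser import HTMLParser

class AltTextAuditor(HTMLParser):
    def __init__(self):
        super().__init__()
        self.violations = 0

    def handle_starttag(self, tag, attrs):
        # Flag images that provide no text alternative at all.
        if tag == "img" and "alt" not in dict(attrs):
            self.violations += 1

def count_missing_alt(html: str) -> int:
    auditor = AltTextAuditor()
    auditor.feed(html)
    return auditor.violations
```

Run against rendered templates in-sprint, a check like this catches a template-level violation once instead of letting it ship on every page.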

Gartner’s Murphy said organizations should be monitoring app ratings and reviews as well as social media sentiment on an ongoing basis.

“You have to monitor those things, and you should. You’re feeding stuff like that into a system such as Statuspage or PagerDuty so that you know something’s gone wrong,” said Murphy. “It may not just be monitoring your site. It’s also monitoring those external sources because they may be the leading indicator.”

Gremlin brings safety improvements to chaos engineering with Status Checks https://sdtimes.com/test/gremlin-brings-safety-improvements-to-chaos-engineering-with-status-checks/ Tue, 23 Jun 2020 15:14:55 +0000

The post Gremlin brings safety improvements to chaos engineering with Status Checks appeared first on SD Times.

]]>
Gremlin wants to make it safer to experiment in production with the release of Status Checks. The new capability automatically verifies systems are healthy and ready for chaos engineering. 

“More and more, companies want to do Chaos Engineering. And not only do it, but automate it. But they are concerned if they have attacks triggering automatically, it may perform a chaos attack at a bad time (say when a system is already experiencing an outage!). This is a huge concern,” Matt Schillerstrom, product manager at Gremlin, told SD Times in an email. “This is a huge safety improvement, in that it drastically mitigates the chances you break your own systems and impact customers while doing chaos engineering.”

RELATED CONTENT: 
To build resilient systems, embrace the chaos
The first 5 chaos experiments to run on Kubernetes 

Previously, companies would try to address safety concerns by running experiments in staging environments, then applying those findings to production. However, Gremlin explained, this approach is limited and doesn’t accurately mirror what can happen in production. “Without status checks, it’s very difficult to automate chaos engineering in production. Because then you are unleashing chaos without knowing if the infrastructure is ready — or you have to check manually if it’s ready,” Schillerstrom wrote.

With Status Checks, chaos engineering can be built right into CI/CD pipelines. It comes with third-party tool integration for PagerDuty, Datadog, New Relic and more. If a monitoring tool reports an active incident, Status Check will prevent the chaos attack, according to the company. 
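Status Checks are a Gremlin product feature, but the gating pattern itself is easy to picture. In this sketch the check names and callables are placeholders standing in for monitoring integrations, not Gremlin’s actual API:

```python
# A generic sketch of the gating pattern described above: consult each
# monitoring integration and run the chaos experiment only if all report
# healthy. Check names and callables are placeholders, not Gremlin's API.

def run_with_status_checks(checks, experiment):
    """Run the experiment only if every health check passes; halt otherwise."""
    for name, healthy in checks:
        if not healthy():
            return f"halted: {name} reported an active incident"
    return experiment()
```

The same gate fits naturally in a CI/CD stage: the pipeline step queries the monitors, and the attack simply never fires if the system is already under stress.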

“It’s very important to note that Gremlin doesn’t advocate for ‘chaos’ — the term chaos engineering can be a little misleading. We advocate for hypothesis-driven testing, in order to tame chaos. To better understand our systems in order to prevent chaos. It does no one any good to be attacking infrastructure that’s already under stress,” wrote Schillerstrom.

To build resilient systems, embrace the chaos https://sdtimes.com/test/to-build-resilient-systems-embrace-the-chaos/ Mon, 06 Apr 2020 13:20:35 +0000

The post To build resilient systems, embrace the chaos appeared first on SD Times.

]]>
It shouldn’t be news to you to hear that software needs to be tested rigorously before being pushed to production. Over the years, countless testing methodologies have popped up, each promising to be the best one. From automated testing to continuous testing to test-driven development, there is no shortage of ways to test your software.

While there may be variations in these testing methods, they all still rely on some form of human intervention. Humans need to script the tests, which means they need to know what they’re testing for. This presents a challenge in complex environments when a number of factors could combine to produce an unintended result — one for which testers wouldn’t have thought to test.

This is where chaos engineering comes in, explained Michael Fisher, product manager at OpsRamp. Chaos engineering allows you to test for those “unknown unknowns,” he said. 

RELATED CONTENT:
Chaos Engineering: Finding Outages Before They Become Failures
The first 5 chaos experiments to run on Kubernetes 

According to Shannon Weyrick, vice president of architecture at NS1, chaos engineering is “the practice of intentionally introducing failures in systems to proactively identify points of weakness.” Weyrick explained that aside from identifying weaknesses in a system, chaos engineering allows teams to predict and proactively mitigate weaknesses before they turn into problems that could impact the business. 

Matthew Fornaciari, CTO and co-founder of Gremlin, added that “traditional methods of testing are much more about testing how the underlying sections of the code functions. Chaos engineering focuses on discovering and validating how the system functions as a whole, especially under duress.”

Chaos engineering is considered to be part of the testing phase, but Hitesh Patel, senior director of product management at F5, believes that the core of chaos engineering goes back to the development phase. It is all about “designing software and systems in an environment that is mimicking what is really happening in the real world,” he said. This means that as a developer is writing code, they’re thinking about how failures will be injected into it down the line and as a result, they’re building more resilient systems. 

“Right now, chaos engineering is more about setting that expectation when you’re building the software or when you’re building the system that failures are going to happen and that you need to design for resiliency and bake that in at the beginning of a product or software life cycle rather than trying to add that on later,” said Patel.

The history of chaos engineering
The software development industry tends to latch onto practices and methodologies developed and successfully used at large tech companies. This happened with SRE, which originated at Google, and it’s also the case with chaos engineering. 

The practice first originated at Netflix almost 10 years ago, when the company built a tool called Chaos Monkey that would randomly disable production instances. “By running Chaos Monkey in the middle of a business day, in a carefully monitored environment with engineers standing by to address any problems, we can still learn the lessons about the weaknesses of our system, and build automatic recovery mechanisms to deal with them. So next time an instance fails at 3 am on a Sunday, we won’t even notice,” Netflix wrote in a blog post.

Since then, they have created an entire “Simian Army” of tools that they say keep their cloud “safe, secure, and highly available.” Examples of tools in this Simian Army include Conformity Monkey, which finds and removes instances that don’t adhere to best practices; Latency Monkey, which introduces artificial delays to see how services respond to service degradation; and Chaos Gorilla, which simulates an outage of an entire AWS availability zone. 
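The Chaos Monkey idea, reduced to a toy: pick a random victim, but only inside the carefully monitored business-hours window Netflix describes. Instance names and the window here are illustrative, not Netflix’s implementation:

```python
# A toy version of the Chaos Monkey pattern described above: choose a
# random production instance to disable, but only during a monitored
# business-hours window so engineers are standing by. All details invented.

import random

def pick_victim(instances, hour, business_hours=range(9, 17), rng=random):
    """Return an instance to disable, or None outside the safe window."""
    if hour not in business_hours or not instances:
        return None
    return rng.choice(instances)
```

The point of the window is the lesson from the blog post: failures injected while engineers are watching teach the same lessons as a 3 am outage, without the 3 am.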

“With the ever-growing Netflix Simian Army by our side, constantly testing our resilience to all sorts of failures, we feel much more confident about our ability to deal with the inevitable failures that we’ll encounter in production and to minimize or eliminate their impact to our subscribers,” Netflix said. 

Since then, several companies have adopted chaos engineering as part of their testing process, and it has even spawned companies like Gremlin, which provides chaos-engineering-as-a-service. 

Smaller companies can benefit
While chaos engineering originated at Netflix, a large company with a complex infrastructure and environment, Patel believes that in a lot of ways, smaller companies will find it easier to implement chaos engineering. Larger companies are going to have more complex compliance, auditing, and reporting requirements. “All of those things factor in when you’re trying to do what I would call a revolutionary change in how you operate things,” said Patel. Overall, there is less red tape to cut through at smaller and medium-sized companies. 

“There’s fewer people involved and I think it’s easier for a two-person team to get into a room and say ‘right, this is the right thing for the business, this is the right thing for our customers, and we can get started faster’,” said Patel.   

Weyrick doesn’t entirely agree with the idea that smaller means easier. Today, even small and medium-sized applications can be complex, increasing the surface area for those unpredictable weaknesses, he explained. He believes that microservice architectures in particular are inherently complex because they involve a number of disparate, interconnected parts and are often deployed in complex and widely distributed architectures. 

Fornaciari recalled being on the availability team at Amazon in 2010 as they were doing a massive migration from a monolithic to a microservices architecture. The point of the move was to decouple systems and allow teams to own their respective functions and iterate independently, and in that sense, the migration was a success. 

But the migration also led the team to learn the hard way that introducing the network as a dependency between teams introduced a new class of errors. “Days quickly turned into a never-ending deluge of fire fighting, as we attempted to triage the onslaught of new issues,” said Fornaciari. “It was then that we realized the only way we were ever going to get ahead of these novel failures was to invest heavily in proactive testing via Chaos Engineering.”

Fornaciari believes that as companies start to go through what Amazon went through ten years ago, chaos engineering will be “the salve that allows those companies to get ahead of these failures, as their systems change and evolve.”

According to Weyrick, if possible, teams should try to implement chaos engineering early on in an application’s life so that they can build confidence as they scale the application. 

“The depth of the chaos experiments involved may start simple in smaller companies, and grow over time,” said Weyrick.

Patel also recommends starting small. He recommends starting with a non-critical application, one that isn’t going to get your company into the news or get you dragged up to your boss’ boss if things go awry. Once an application is selected, teams should apply chaos engineering to that application end to end.

He emphasized that the most important part of this process early on is “building the muscle,” which he said is all about the people, not the technology. “Technology is great, but at the end of the day, it’s people who are using these things and putting them together,” said Patel. “And what you need to do is build the muscle in the people that are doing this. Build that subject matter expertise and do that in a safe environment. Do that in a way that they can mess up a little bit. Because nothing works right the first time when you’re doing this stuff…People can build the muscle and learn how to do these things, learn the subject matter expertise, gain confidence, and then start applying that in a broader manner. And that’s where I think a tie in with leadership comes in.”

According to Patel, having support from the top of the business will be crucial in helping companies prioritize where to apply chaos engineering. “[They’re] not just giving you aircover, but also saying we’re going to apply this in a way that makes sense to our business and to our user experience and matches where we want to go from a strategic standpoint,” said Patel. “So you’re not just applying the technology in areas that no one is going to notice. You’re applying it where you can derive the biggest customer benefit.”

Fornaciari added: “As companies grow their applications and the supporting infrastructure, they’ll undoubtedly introduce more failure modes into their system. It’s unavoidable. That’s why we call chaos engineering a practice — it’s something that must continually grow and evolve with the underlying systems.”

Embracing risk
Fisher also added that organizations will need to shift their mindsets from one of “avoiding risks at all costs” to “embracing risk to generate a greater outcome to their users.” This can be a massive cultural shift, especially for those larger, more risk-averse companies, or companies who haven’t already adopted some form of DevOps. 

“The team needs to evolve from the legacy belief that production is a golden environment that should be touched as little as possible and handled with kid gloves, lest outages occur,” said Weyrick. “Chaos engineering adopts a very different mindset: that in today’s world, this legacy belief actually creates fragile systems that fail at the first unexpected and unavoidable real world problem. Instead, we can build systems that consistently prove to us that they can survive unexpected problems, and rest easier in that confidence.”

The idea of purposefully trying to break things can be especially difficult for more traditional IT managers who are used to gatekeeping changes to the production environment, explained Kendra Little, DevOps advocate at Redgate Software. “Your inclination is, well we have to find a way to be able to test this before it gets to production,” she said. “So it’s kind of this reactionary viewpoint of as soon as I find something, I need to be able to write a test to be able to make sure that never happens again… I mean I used to very much have that perspective as an IT person, and then at a certain point, I and the higher ups in my organization as well began to realize, we can’t just be reactionary anymore. Failure is inevitable. Our system is complex enough and we need to be able to change it rapidly. We can’t just gate keep things out of there. We have to be able to change the system quickly. And there are just so many moving parts in the system and so many external factors that can impact us.”

Best practices for chaos engineering 
According to Shannon Weyrick, vice president of architecture at NS1, there are three main best practices that should be followed when using chaos engineering.

  1. Get buy-in to the chaos mindset across the team: Purposefully injecting failures into a system will require a shift in mindset. He recommends teams investigate the practice, understand the ramifications, and introduce it in small ways for legacy projects and directly for new projects. “Ensure your team knows how to run successful experiments, and minimize the blast radius to reduce or remove potential impact to customers when failures occur,” said Weyrick.
  2. Make the experiments real: The goal of chaos engineering is to increase reliability by exploring unpredictables through experiments. To get the most out of chaos engineering, teams should conduct their experiment using the most realistic data and environments possible. He also noted that it’s important to conduct experiments on the production system because it will always contain unique and hard-to-reproduce variables.
  3. Be sure people are part of your system: It’s important to remember that infrastructure and software are not the only parts of a system. “Before conducting chaos experiments, remember that the operators who maintain the system should be considered a part of that system, and therefore be a part of the experiments,” said Weyrick. 
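Weyrick’s advice to minimize the blast radius can be as simple as capping how many targets an experiment touches at first. A sketch, with an assumed 5% starting fraction:

```python
# "Minimize the blast radius" from the list above, sketched as sampling a
# small fraction of hosts for an experiment's first run. The 5% starting
# fraction is an assumption, not a prescription from the article.

import random

def blast_radius(targets, fraction=0.05, rng=random):
    """Sample a small fraction of hosts (never zero) to experiment on."""
    k = max(1, int(len(targets) * fraction))
    return rng.sample(targets, k)
```

Starting with a handful of hosts keeps a failed hypothesis from becoming a customer-facing incident; the fraction can grow as the experiments, and the team’s confidence, mature.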

Do chaos engineering on your databases, too
Kendra Little, DevOps advocate at Redgate Software, brought up the point that chaos engineering is not just for software applications. It is a practice that can be applied to databases too. 

Little believes that the approach to testing databases with chaos engineering remains the same as the approach one would take when testing a regular software application. A big difference, however, is that people tend to be more scared of it when it’s a database instead of an application. 

“When we think about testing in production with databases it’s very terrifying because if something happens to your data, your whole company is at risk,” she said. But with chaos engineering, what you’re really doing is controlled testing. She explained that with this process you’re not just dropping tables or releasing things that could put your company out of business.

It’s also important to note that we’ve reached a point in database and infrastructure complexity where it’s not possible to replicate your production environment accurately, Little explained. “If we don’t have a way to learn about how to manage our databases and to learn how our code behaves in databases and production, then in many cases we’re not gonna have anywhere we can learn it. So it is, I think, just as relevant in databases.”

