High
test stability is the cornerstone of every good test automation system.
Ideally, a test should fail only when there is a product bug. In
reality, tests tend to fail for other reasons too, but we should strive
to minimize those failures by either fixing or removing (and rewriting)
unstable tests.
The principle of high test stability (99.5% test
stability) has several important implications. It allows us to spend
less time chasing random test failures and more time discovering bugs in
the product. It also allows us to increase test coverage faster than
increasing support costs. It helps us land the product in a stable,
predictable manner. But most importantly, high test stability allows us
to drive testing upstream.
Defect prevention is the “holy grail” of quality management. Many studies show that preventing a defect from entering the product is much cheaper than discovering and fixing the defect after it has been introduced.
Obviously there are many different ways to go about defect prevention. One of the simplest and most robust ones is having a stable and fast suite of tests that can easily be run by developers prior to check-in.
The suite has to be stable – because no one likes to debug through hundreds of unrelated failures to confirm that the code works. The suite has to be fast – because no one likes to wait hours for test results to come out.
Having a stable and fast test suite allows us to practice a
continuous “develop – run tests – fix issues – repeat” cycle prior to
check-in.
About
5 years ago, we used to invest a lot in advanced failure investigation
tools, which allowed us to quickly deal with and resolve large numbers
of test failures. It turned out that having high test stability negated
the need for having advanced failure investigation tools. For the most
part, a NUnit-like report (typically XML with a XSLT transform to make
the data readable) is all the team needed. At this point in time, I assert that having an advanced failure investigation tool may actually be a dangerous practice for test teams, as it acts as a powerful crutch that pulls the team away from the underlying issues that need to be fixed – i.e. tests with low stability or a very low quality product.
There
is one important aspect of failure investigation tool development that
we need to account for though. SDETs are fundamentally software
engineers. As software engineers we like to develop software and a
failure investigation tool often presents opportunities to hone one’s
design and coding skills and to also experiment with new technologies
(web services, databases, extensibility frameworks, AI, GUI frameworks,
etc.). We clearly have to find ways to expose our SDETs to other opportunities to improve their transferable skills (design, algorithms, etc.).
Case Study: 99.5% pass rate of the WPF tests
99.5%
is obviously an arbitrary number. The right number to strive for is
always 100%. A realistic stability goal depends on the size of the team
and on the size of the test suite, but typically revolves around 99.5%.
For example, in the WPF test team, we had a suite of about 105,000 tests
which at 99.5% pass rate produced about 500 test failures for a team of
about 30 engineers, or about 18 failures per person, which seemed to be a reasonable number (so that every SDET spends no more than about 30 minutes investigating failures every day).
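The arithmetic behind that budget can be sketched in a few lines, using the numbers from the example above:

```python
# Failure-budget arithmetic from the WPF example above.
suite_size = 105_000        # tests in the suite
pass_rate = 0.995           # the stability goal
team_size = 30              # engineers investigating failures

expected_failures = suite_size * (1 - pass_rate)
failures_per_person = expected_failures / team_size

print(round(expected_failures))    # 525, i.e. "about 500"
print(round(failures_per_person))  # 18
```

The same calculation, run in reverse, tells you what stability level your team size can actually sustain as the suite grows.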
With time, 99.5 became a team mantra. SDETs, SDEs and PMs actively identified with it and fought for it.
At
Microsoft, we serve hundreds of millions of customers with great
environment variability. A typical “test matrix” consists of a number of
HW configurations, OS platforms, languages, locales, display
configurations, etc. and typically contains millions of variations.
Teams typically do the following to deal with the “configuration matrix explosion” problem:
- (a) Use matrix reduction techniques (“pairwise variation generation”, etc.)
- (b) Prioritize configurations (e.g. “English, x86, FRE, 96 DPI” is higher priority than “Portuguese, x64, CHK, 140 DPI” – no offense to Portuguese folks, I myself am Bulgarian and I have to admit that in the general case English is higher pri than Bulgarian).
- (c) Create and manage a deterministic configuration testing schedule (today we cover “English”, tomorrow we cover “German”, etc.) in an attempt to cover all high priority configurations.
Technique (b) specifically
often has the unfortunate effect of getting us to test mostly on
“vanilla configurations”, which results in missing bugs until late in
the testing cycle and results in “training” the tests to pass on vanilla
configurations only. Technique (c) tends to result in high testing
schedule management costs.
An easy way to combat the undesired effects of (b) and (c) is to switch to weighted random configuration management. The weight of a certain config can be calculated dynamically based on historical data (the pass rate the last time the config was run, and how often it runs).
One can even build predictive models that would allow the team to search for configurations that result in a large number of bugs.
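A minimal sketch of such weighted selection follows. The config names, data fields and the weight formula are illustrative assumptions, not the team's actual formula; the idea is simply that historically unstable and long-unvisited configs get picked more often:

```python
import random

# Hypothetical configs with historical data.
configs = [
    {"name": "English, x86, 96 DPI",     "last_pass_rate": 0.999, "days_since_run": 1},
    {"name": "German, x64, 120 DPI",     "last_pass_rate": 0.990, "days_since_run": 6},
    {"name": "Portuguese, x64, 140 DPI", "last_pass_rate": 0.970, "days_since_run": 14},
]

def weight(cfg):
    # (1 - pass rate) favors historically unstable configs;
    # days_since_run favors configs that have not run recently.
    return (1.0 - cfg["last_pass_rate"]) * 100 + cfg["days_since_run"]

picked = random.choices(configs, weights=[weight(c) for c in configs], k=1)[0]
print(picked["name"])
```

Any config can still be picked on any day, so the tests never get “trained” on a fixed rotation – the weights only tilt the odds toward the configs that deserve attention.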
Case Study: WPF test execution matrix management
Prior to early 2007, the WPF team used to invest a considerable amount of effort to plan day-to-day test execution. The team had a dedicated test lead
who managed the test execution matrix and sequence. Introduction of a
new OS to the matrix was a fairly disruptive event, necessitating a
regeneration of the whole execution sequence. Test passes took a long
time, because of completeness concerns.
At the beginning of 2007, the WPF test team switched to weighted random configuration management. We introduced a simple Excel spreadsheet, which was used by the lab engineers to generate a random testing config
every day that was then installed in the lab and used for test
execution. The team identified about 20 configurations as high priority
configurations. 4 out of 5 days in the week, the team used a randomly
selected high pri config. Every 5th
day or so, the team explored the rest of the configurations.
The
switch to random test config generation removed the need for test
execution scheduling, resulted in additional test stabilization (because
the tests were not trained to pass on vanilla configurations) and
enabled the team to find bugs off the beaten path earlier in the
development cycle.
Developer
unit tests use exactly the same test harness as the functional tests
developed by SDETs. This enables code reuse between the unit tests and
the functional tests. It also enables developers to easily run and debug
functional tests (because they know how to do it), thus enabling defect
prevention.
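As a rough illustration of the idea – with Python's unittest standing in for the actual harness, and made-up test names – both developer-style unit tests and SDET-style functional tests can derive from one shared harness base and run the same way:

```python
import unittest

class HarnessCase(unittest.TestCase):
    """Shared harness base: common setup used by unit and functional tests."""
    def setUp(self):
        self.events = []   # stand-in for shared logging / app fixtures

class ButtonUnitTest(HarnessCase):           # developer-style unit test
    def test_click_records_event(self):
        self.events.append("click")
        self.assertEqual(self.events, ["click"])

class ButtonFunctionalTest(HarnessCase):     # SDET-style functional test
    def test_scenario_records_events(self):
        self.events += ["launch", "click", "close"]
        self.assertEqual(len(self.events), 3)

loader = unittest.TestLoader()
suite = unittest.TestSuite([
    loader.loadTestsFromTestCase(ButtonUnitTest),
    loader.loadTestsFromTestCase(ButtonFunctionalTest),
])
result = unittest.TextTestRunner(verbosity=0).run(suite)
print(result.wasSuccessful())   # True
```

Because the invocation is identical, a developer who can run the unit suite can run (and debug) the functional suite with zero extra knowledge.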
Automated
tests are programs that get executed multiple times throughout the
lifecycle of the product. Having slow tests results in the following
problems:
- Delays bug discovery (because test runs take longer, need to be incomplete due to their length, etc);
- Precludes the team from effectively employing defect-prevention techniques such as TDD, test before check-in, etc;
- Wastes machine time and electricity;
- Creates test management complexity – splitting the test suite into different priorities (BVTs, “nightlies”, etc.) and creating the need for managing these different sub-suites, etc.
One
area that often gets overlooked is the speed of building the tests
compared to the speed of building the product. Tests often take much
longer to build because they are not properly packaged (e.g. too many
DLLs) and because no one really looks into improving the build times. In reality, having fast product and test builds enables defect prevention.
In theory, the idea of BVTs (build verification tests: a sub-suite of high priority tests that get run more often than the rest of the tests)
sounds good. In practice, BVTs tend to become just another “crutch” that
typically prevents test teams from addressing the root underlying
problem of test slowness and instability[1].
Introduction of the notion of BVTs also introduces various “management costs” (suite management and curation, execution schedule management, etc.) that move the focus away from more important activities directly related to the quality of the product.
So I highly discourage the use of BVTs.
Case Study: WPF BVTs
In
the WPF test team, we experienced the full cycle. We started with a
suite of tests that we used to run every day. As we expanded the suite
of tests, we saw that the run times of the suite became longer and that
the stability of tests became lower. Instead of investing in fixing test
perf, stability and duplication of coverage, we decided to segment the
suite into P0s, P1s, P2s, etc. We created various processes to handle
bugs produced by P0 tests, etc. (the “hot bug” concept). Because BVTs
were treated as special high-priority tests fortified by these
additional processes, SDETs tended to add more and more tests to the BVT
suite, which in turn increased the run times of the suite, reduced its
stability, and necessitated introducing additional “BVT nomination” and
“BVT auditing” processes. We had a BVT team (!!!) whose sole purpose of
existence was handling the BVT suite. The “BVT auditing” process did not
work, so we invested in further segmenting the BVT suite into
“micro-BVTs” and “regular BVTs”, we introduced micro-BVT execution time
budgets per area, etc, etc, etc. We lived with this crazy complicated
system for years.
At the
beginning of 2007, we decided to put an end to the madness, and focused
on optimizing our run times and improving our stability. We improved our
test build times 800% and we improved test run times 500%, which enabled us to do a complete test run in about 2 hours on 20 machines.
We did not really get rid of priorities, but we found that when the
tests are stable and fast, people tend to not care about test priority.
Today, nobody cares about priorities.
The
test result reports visibly display run times, so that SDETs, SDEs and
leads can address any excessively long tests. SDEs are also asked to
provide feedback whenever they feel that the test run times are too
long.
Run
times are reported by test (and aggregated into feature area run times)
to allow drilling into areas that take too long to run.
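A sketch of that aggregation (test names, areas and timings are made up):

```python
from collections import defaultdict

# Per-test run times as (area, test name, seconds) -- illustrative data.
results = [
    ("Animation", "FadeInTest",     12.4),
    ("Animation", "StoryboardTest", 48.9),
    ("Text",      "GlyphRunTest",    3.1),
    ("Text",      "HitTestTest",     7.7),
]

# Roll per-test times up into feature-area totals.
area_totals = defaultdict(float)
for area, _test, seconds in results:
    area_totals[area] += seconds

# Slowest areas first, so they can be drilled into.
for area, total in sorted(area_totals.items(), key=lambda kv: -kv[1]):
    print(f"{area}: {total:.1f}s")
```

Sorting by total time puts the most expensive areas at the top of the report, which is exactly where leads look first.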
Tests
have both build-time dependencies (the test code, etc.) and run-time
dependencies (test metadata such as owners, priorities, test support
files, configuration details, etc.) Some teams tend to keep the run-time
dependencies on a dedicated server or in databases. In theory that sounds great – after all, databases were created for storing data.
In practice, storing test metadata in databases is problematic, because
it introduces versioning issues. Handling code branch forks, RIs, FIs,
etc. becomes very difficult and error prone because you have to mirror
the branch structure on the server or in the database. Some teams have
designed elaborate processes to maintain referential integrity between
the source control system and the support servers / databases. Although
these processes may work (or can be made to work), they come with a
significant support cost and are typically not robust when left on their
own.
A better approach is to keep all test-related data in the
source control system. There are obviously exceptions (e.g. checking in
large video files may not be a good idea), and we have to be smart to
not bring the source control system to its knees, but in general this
approach is superior to maintaining separate databases, servers, etc.
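As a trivial sketch of the approach (the file name and metadata fields are hypothetical), test metadata can live in a file checked in next to the test, read by the harness from the enlistment rather than from a database:

```python
import json
import os
import tempfile

# Hypothetical metadata that would be checked in as FadeInTest.metadata.json
metadata = {"owner": "wpftest", "priority": 1, "supportFiles": ["input.xaml"]}

# A temp directory stands in for the source enlistment in this sketch.
enlistment = tempfile.mkdtemp()
path = os.path.join(enlistment, "FadeInTest.metadata.json")
with open(path, "w") as f:
    json.dump(metadata, f, indent=2)

# The harness reads metadata from next to the test code -- it branches,
# merges and versions together with the code, so no referential-integrity
# processes are needed.
with open(path) as f:
    loaded = json.load(f)
print(loaded["priority"])   # 1
```

Because the metadata file forks and merges with the code, RIs and FIs carry it along for free.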
It’s a good idea to keep dev and test code in the same branch. That helps catch and prevent build breaks resulting from breaking changes. It
also enables code reuse across the organization (SDEs can reuse test
framework pieces for unit tests, SDETs can reuse dev code for tests).
A
lot of teams tend to keep specs on SharePoint servers. While this makes
specs readily accessible and easily editable (if you can get around
some of SharePoint’s idiosyncrasies), it suffers from the following
problems:
- Spec versioning is very tricky – tracking revision history is even trickier;
- Spec discovery can be tricky – you have to know the address of the SharePoint server in addition to the source code location;
- Specs tend to get lost – as the team moves from SharePoint site to SharePoint site;
- Migration of SharePoint content (e.g. to a SE team) is trickier;
- Specs tend to become stale – folks typically don’t bother updating specs unless they are right next to the code;
- SharePoint servers may have additional maintenance costs (upgrades to new versions of SharePoint, security patching, etc.);
- Specs are not easily accessible offline (although Outlook does provide offline access facilities now).
As a best practice, specs should be checked in right next to the source code (or in an obvious location that makes sense, based on the source code directory organization).
In
principle, it’s a good idea to minimize the server dependencies you
have. Having server dependencies increases the maintenance costs,
complicates the move to Sustained Engineering (because SE has to
replicate all servers and record server management knowledge), prevents SDETs and SDEs from running tests when not connected to the network, and in general complicates the lab setup.
So we should subject every server dependency to careful examination and if we have to have servers, move them to the cloud.
Case Study: WPF server and database dependencies
At
the beginning of 2007, the WPF test team had about 30 servers and a
database-based test case management (TCM) system (its name was Tactics).
This resulted in significant maintenance costs that could no longer be
afforded. So the team decided to switch to a source-control-based TCM.
The
results were outstanding – the team reduced server dependencies to
about 6 servers (1 file server, 1 web server, 1 SharePoint server and 3
backup servers) and enabled SDETs and SDEs to build and run tests
without being connected to the network. It also practically removed the
need for any referential integrity related maintenance.
The
WPF team also checked in all specs in source depot. Specs are
bin-placed as part of the build process. This ensured that the team could produce a build with matching specs for every release of the product and could track and compare different versions of the specs.
For every piece of software, there are a number of cross-cutting concerns which we call “fundamentals” or “basics”. These are:
- Accessibility
- Compatibility (forward, backward, AppCompat, etc.)
- Compliance
- Deployment, serviceability and servicing
- Globalization and localization
- Performance and scalability
- Security
- Stability
- Etc.
All
of these are extremely important and have historically represented a
significant competitive advantage for Microsoft. Testing of fundamentals
is expensive so we need to have the right supporting processes and
automated systems in place to ensure that we produce solid software by
default.
Some of these fundamentals can be integrated within
the functional testing – others typically need dedicated systems and
processes. Below, I am only presenting two of the fundamentals above.
Performance
and Scalability are two fundamentals that can and should be automated.
Ideally, the team has an automated system that runs Perf / Scalability
tests (both micro-benchmarks and end-to-end scenarios are equally important) on target HW and SW configurations on every daily
build. The system runs a set of tests, which capture agreed-upon
performance goals and metrics. The system provides facilities for easy
visualization of key trends.
Having an automated Perf system
enables early discovery and fixing of Perf regressions. Due to the
domain-specific nature of the work, it’s typically necessary to have a
dedicated Performance team that develops and supports the necessary
processes, tools and systems.
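A toy sketch of the core of such a system – comparing today's metric against a rolling baseline of recent builds. The metric values and the 5% regression budget are illustrative assumptions, not actual WPF goals:

```python
# Hypothetical cold-startup times (ms) from recent daily builds.
baseline_startup_ms = [412, 405, 418, 409, 411]
today_ms = 468

# Compare today's build against the rolling baseline mean.
mean = sum(baseline_startup_ms) / len(baseline_startup_ms)
threshold = 1.05   # assumed policy: flag regressions beyond 5%

is_regression = today_ms > mean * threshold
print(is_regression)   # True -- today's build regressed startup time
```

Flagging the regression on the daily build that introduced it keeps the investigation scope to one day's worth of changes.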
Case Study: WPF performance
The
WPF team has a dedicated highly automated Performance lab. The
Performance infrastructure automatically picks up and installs daily
builds on a stable set of machines and configurations, runs a set of
performance tests, presents results (cold startup times, warm startup
times, working set size and various other metrics) and trends,
identifies regressions, and captures necessary traces for follow up
investigation.
The lab infrastructure also allows testing and generating diffs for individual dev changes (this feature is not really broadly used, although the lab does do dedicated perf runs of specific potentially disruptive changes).
Stress and Stability are another fundamental that requires a dedicated process and system. It can either be done on a dedicated pool of machines / devices or on the devices in engineers’ offices during off-work hours.
Some teams tend to invest a lot in automated triage systems, but in my
experience these tend to be expensive to maintain so should be avoided
at first.
The Stress system can also be used for security testing, by running fuzzers, fault injection and other penetration tools alongside the stress tests.
Case Study: WPF stress
The
WPF team has a simple dedicated stress framework, which consumes both
stress-specific test code and generalized testing blocks. The tests are
“deterministically random”, i.e. a stress test failure can in theory (and often in practice) be reproduced on demand. The tests are distributed to
about 100 machines every night (these machines are in the lab). Results
get analyzed and presented by vendors in China.
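The “deterministically random” idea reduces to seeding: pick a seed randomly, log it with the run, and any failure can be replayed by re-running with the same seed. A minimal sketch (the action sequence is a stand-in for real stress operations):

```python
import random

def stress_iteration(rng):
    # A randomized action sequence stands in for real stress operations.
    return [rng.randint(0, 9) for _ in range(5)]

seed = 1234   # in a real run: chosen randomly, then logged with the results
first = stress_iteration(random.Random(seed))
replay = stress_iteration(random.Random(seed))
print(first == replay)   # True -- same seed, same run
```

The only discipline required is that all randomness in the test flows from the one logged seed; any untracked source of nondeterminism breaks the replay guarantee.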
The
team used to have a fairly sophisticated stress system that was able to
do preliminary triage of failure stacks, map them to existing bugs,
etc. This turned out to be an over-automated system which had a
significant support cost, so we switched to the current significantly
simpler system where stress failures are triaged by a vendor team in
China, who manage the pool of the stress machines remotely. The system
is managed by 2 engineers in China with some support from local Redmond
SDETs.
Tests
are long-lived software. Tests actually tend to live longer than the
product code they test, due to AppCompat and other reasons – in Windows
we have tests that are 20 years old, dating back to the Win16 days. In order to build a stable test suite, teams need to invest in design and in code and asset reuse.
Proper software design is an acquired
taste. It is one of those things that doesn’t just happen on its own. It
requires focus on architecture, design reviews, code reviews. It also
requires training – both on the job (e.g. through mentoring, DR, CR) and
structured training (e.g. design pattern “brownbag” sessions, etc.)
A
test suite (as any other piece of software) needs to be able to evolve.
A test suite also needs to be portable. The single best way to create a
robust, evolvable, maintainable test suite is to construct it as a
combination of small self-contained building blocks.
We employ
the principles of componentization and aggregation as opposed to
inheritance and integration when constructing test suites. In that
sense, tests are like “lego models” constructed as an aggregation of
reusable “lego blocks” weaved together by test execution and
configuration policy.
There are two major techniques for automated code verification:
- Static code analysis
- Dynamic code analysis
Static code analysis should be performed prior to check-in, to prevent the team from checking in buggy code. Ideally, tools such as PreFast, PreFix, FXCOP, StyleCop and Polycheck should be integrated in the build environment, and any violation of their rules should result in a build break.
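The “any violation is a build break” policy amounts to a simple gate in the build script. A sketch, with a stub standing in for the actual analyzer output (file names are made up):

```python
import sys

def run_static_analysis(files):
    # Stub standing in for parsed FXCOP / StyleCop output.
    # A real gate would invoke the tools and collect their rule violations.
    return []

violations = run_static_analysis(["Button.cs", "TextBox.cs"])
if violations:
    for v in violations:
        print("error:", v)
    sys.exit(1)   # non-zero exit fails the build
print("static analysis clean")
```

The key design choice is the exit code: making violations indistinguishable from compile errors is what keeps the rules from being ignored.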
Dynamic
code analysis is another powerful technique done through in-situ
verification (asserts, special OS instrumentation such as that enabled
with AppVerifier and driver verifier, etc).
Product
features are created in order to enable end-to-end user scenarios. The
quality of a product feature does not matter if the e2e scenario doesn’t
work. So test teams should work closely with the PM and dev teams to
understand the e2e scenarios and to build tests that capture these e2e
scenarios early in the development cycle.
The
single fastest way to improve a team’s throughput is through training.
Of course all of us experience continuous on-the-job training (which is
one of the great things about the dynamic industry we are in), but I am a
big believer in providing a bit of a training structure for the
organization.
The fundamental training activities we engage in are:
- Team-wide brown-bag sessions:
- Sessions on fundamentals such as design patterns, data-driven testing, stress testing, perf and scalability testing, etc;
- Sessions on product-specific architecture and tools;
- Blitz sessions on product features.
- 1:1 mentoring:
- Done by leads or designated mentors;
- Typically done as part of test spec reviews, design reviews, code reviews.
- Official Microsoft training
Hopefully the two words that jump to mind after reading this post are “simplicity” and “rigor”. As a general rule, I’d always trade advanced
functionality for simplicity and we keep doing that all the time in the
team. It takes a lot of spine though to truly commit to simplicity – as
engineers our natural tendency is to solve the most complicated (and
interesting) problem. |