What could "simplified monitoring" of the implementation of the Web Accessibility Directive mean?
by Detlev Fischer
The Web Accessibility Directive (EU) 2016/2102, mandating the accessibility of public sector websites and apps in Europe, has been around since December 2016. The monitoring methodology that describes how the implementation of the Directive should be checked was published in October 2018 as Implementing Decision (EU) 2018/1524. So how should monitoring actually be done?
The text laying out the methodology is rather general, leaving scope to determine what actual monitoring should look like in detail. The following considerations are intended to provide input for a discussion of how monitoring may be turned into a practical method. We focus on simplified monitoring since this is basically a new field of activity—for in-depth monitoring, various WCAG-based testing schemes exist across Europe.
Target groups of this article / blog entry
Which methodologies will be adopted for simplified monitoring is of interest to several groups:
- The new Monitoring Bodies being set up at the federal and Federal State level need to establish a practical approach for implementing the monitoring and reporting functions required by the EU.
- Public Sector Bodies (and their IT service providers) picked to be monitored have an interest in getting useful and meaningful feedback so that they can improve the accessibility of their sites and apps.
- People with disabilities have an interest in being involved in monitoring to ensure that accessibility aspects that are critical for them are not ignored. For some, it will also create employment opportunities not just with Monitoring Bodies, but possibly also with (larger) Public Sector Bodies that want to ensure that the sites, services and applications they develop and purchase are evaluated by users with disabilities.
- People and organisations providing accessibility evaluation services want to create new services for "Simplified Monitoring", so they are interested in the gestation phase of methodologies—what requirements in terms of approach, quality and cost will be set, and to what extent will these requirements be defined across the board?
- Finally, all users of web sites and apps, and especially users with disabilities have an interest that the Monitoring Activity will provide a balanced overview of the actual state of accessibility of sites and apps, and that based on this overview, policies may better target the root causes of accessibility failures.
An overview of monitoring for the Web Accessibility Directive
Before we discuss feasible options for implementing simplified monitoring, we briefly cover the two types of monitoring, the criteria for drawing up the monitoring sample, the sample size calculation, and the timeline for monitoring.
The two types of monitoring
The monitoring methodology is to be applied annually by a monitoring body installed by each European Union Member State and, in cases of federal systems, also on levels below central government. The methodology will be applied to a sample of public websites and mobile apps.
The Implementing Decision 2018/1524 describes in its Annex 1 two monitoring methods: The "simplified monitoring" applied to the full sample of sites and apps, and the "in-depth monitoring", which is only applied to a minimum of 5% of sites and apps in the full sample.
The in-depth monitoring involves a test whether sites and apps in the sample satisfy all the requirements in the referenced standard, EN 301 549 - which for websites currently boils down to meeting WCAG 2.1, and for apps, to meeting the subset of WCAG 2.1 that is reproduced in clause 11 (Software) of the EN and is deemed applicable to native apps. In addition, apps must also meet requirements safeguarding interoperability with assistive technology and the application of user-selected platform settings, e.g., font size. (We will not be looking into in-depth monitoring approaches here; they may be carried out with any testing scheme that fully maps onto WCAG 2.1.)
The sampling criteria
The sampling of websites and apps to be drawn for "simplified monitoring" from all sites and apps in scope of the Directive is one important step per annual monitoring round. (Of the full sample, a sub-sample is then selected for in-depth monitoring.) So how is the sample chosen? The Implementing Decision requires that the composition of the sample for monitoring shall take into account different criteria, e.g.
- Cover the different levels of administration from state down to local websites / apps;
- Cover a variety of services such as social protection, health, transport, education, employment and taxes, environmental protection, recreation and culture, housing and community amenities and public order and safety;
- Use input from national stakeholders, in particular organisations representing people with disabilities, on selecting websites and apps to be monitored;
- include a share of sites / apps previously monitored (more than 10%, at most 50%);
- Consider Recital (5) of the Implementing Decision, which suggests that notifications from the feedback mechanism or input from the enforcement body may also inform sampling.
With regard to apps, there is also the requirement to include frequently downloaded mobile applications; to cover different operating systems (currently iOS and Android); and to test the latest app version.
The sample size calculation
The size of the sample is calculated based on the number of inhabitants of the EU Member State. For Germany, for example, this amounts to 1,715 websites in the first two years; from the third year onwards, the number rises to 2,535 websites. Due to the federal structure, this total will be distributed between the Federal Government and the Federal States. In addition, 88 mobile apps shall be tested. All these sites and apps are subject to simplified monitoring. Only a selection of 5% of this total sample (i.e. 86 websites in years 1 and 2, 127 in the following years) is examined in depth.
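The 5% figures above follow from simple arithmetic, rounding up to whole websites. A minimal sketch (the function name is ours, for illustration):

```python
import math

def in_depth_subsample(total_sites: int, share: float = 0.05) -> int:
    """Minimum number of sites selected for in-depth monitoring:
    5% of the full simplified-monitoring sample, rounded up."""
    return math.ceil(total_sites * share)

# Figures for Germany from the Implementing Decision:
print(in_depth_subsample(1715))  # years 1 and 2: 86 websites
print(in_depth_subsample(2535))  # from year 3 onwards: 127 websites
```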
When does monitoring start?
The first monitoring period for public websites runs over two years, from 1 January 2020 to 22 December 2021. Monitoring for mobile apps does not start until 23 June 2021 and also ends on 22 December 2021. From then on, monitoring runs annually.
What is "simplified monitoring"?
It is worth noting that simplified monitoring focuses on establishing the non-conformance of websites and apps under test. This is because a limited number of checks cannot establish full conformance across the many Success Criteria of WCAG - for that, checks would have to be thorough and complete.
Still, Annex I of the Implementing Decision (EU) 2018/1524 makes it clear that just finding isolated instances (or maybe even just one instance) of non-conformance is not the aim of simplified monitoring. It is easy to find issues on nearly every site tested (WebAIM estimates that only about one percent of web pages pass WCAG 2.1). The monitoring method tells us to examine, in addition to the home page, a number of pages proportionate to the estimated size and complexity of the website - and for these, to look at a range of different accessibility needs.
Nine "User Accessibility Needs" and corresponding WCAG Success Criteria
Interestingly, the simplified monitoring methodology does not refer to WCAG but rather to a set of nine "user accessibility needs" which map in various ways to WCAG success criteria:
- usage without vision;
- usage with limited vision;
- usage without perception of colour;
- usage without hearing;
- usage with limited hearing;
- usage without vocal capability;
- usage with limited manipulation or strength;
- the need to minimise photosensitive seizure triggers;
- usage with limited cognition.
Some of these user accessibility needs, such as usage without vocal capability, have no real correspondence to WCAG success criteria. Usage with limited cognition is partly targeted in some Success Criteria such as 2.4.6 Headings and Labels, 3.3.2 Labels or Instructions, or 2.2.1 Timing Adjustable—but these are actually important criteria for all users. More specific cognitive needs are covered in Success Criteria on WCAG conformance level AAA, which is not a mandatory level according to the Directive.
Other user accessibility needs, like the need to minimise photosensitive seizure triggers, map to exactly one Success Criterion (2.3.1 Three Flashes or Below Threshold).
Still others, like usage without vision, can be broken down into quite a number of WCAG Success Criteria (e.g. 1.1.1 Non-Text Content, 2.1.1 Keyboard, 2.4.3 Focus Order, 2.4.2 Page Titled, 4.1.2 Name, Role, Value, or 4.1.3 Status Messages).
There are also various many-to-many mappings: For example, 2.4.3 Focus Order is equally important to usage without vision as it is to usage with limited manipulation or strength (e.g. sighted keyboard users or switch users).
It seems reasonable to select a subset of Success Criteria that gives a good indication whether one of the user accessibility needs is not satisfied — if possible, preferring those that are known to be frequent and critical, but without making an explicit, "hard-wired" selection of the subset (more about that below in section Automated tests and the problem of selective optimization).
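One way to represent such a mapping is a many-to-many table from user accessibility needs to Success Criteria, from which a monitoring run could draw (and rotate) its subset. A minimal sketch, using only mappings mentioned above — the table is illustrative, not an official or complete mapping:

```python
# Illustrative many-to-many mapping of user accessibility needs to
# WCAG 2.1 Success Criteria (examples from the text, not normative).
NEED_TO_SC = {
    "usage without vision": [
        "1.1.1 Non-text Content", "2.1.1 Keyboard", "2.4.2 Page Titled",
        "2.4.3 Focus Order", "4.1.2 Name, Role, Value", "4.1.3 Status Messages",
    ],
    "minimise photosensitive seizure triggers": [
        "2.3.1 Three Flashes or Below Threshold",
    ],
    "usage with limited manipulation or strength": [
        "2.1.1 Keyboard", "2.4.3 Focus Order",
    ],
}

def needs_covered_by(sc: str) -> list:
    """All user accessibility needs a given Success Criterion speaks to."""
    return [need for need, scs in NEED_TO_SC.items() if sc in scs]

# 2.4.3 Focus Order serves two needs at once:
print(needs_covered_by("2.4.3 Focus Order"))
```

A structure like this makes the many-to-many relations explicit and lets a monitoring body prefer criteria that cover several needs at once.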
What is the role of automated tests in simplified monitoring?
The text states that "the simplified monitoring shall aim to cover the (...) user accessibility needs to the maximum extent it is reasonably possible with the use of automated tests". This is a very important statement since it has far-reaching implications for the set-up of actual methods for simplified monitoring.
Should the statement be read to indicate that the user accessibility needs listed need only be covered to the extent that they can be tested automatically? Or does it instead mean: "use automated testing to the extent that it is possible to produce valid results, and then use (additional) non-automatic testing wherever automated tests fail to provide definitive proof of non-conformance (of the selected Success Criteria)"? The text is (perhaps deliberately) ambiguous on this important point.
Stated objectives of the simplified monitoring
Looking at the intended outcome of monitoring, the Directive's Article 7 "Information on the monitoring results" states:
"If deficiencies have been identified, Member States shall ensure that the public sector bodies are provided with data and information on compliance with the accessibility requirements in relation to the deficiencies of their respective websites and mobile applications, within a reasonable time and in a format helping public sector bodies to correct them" (our emphasis).
Not all Success Criteria can and should be checked; otherwise this could not be called "simplified monitoring". The effort should be as low as possible (hence the emphasis on automated tests), but the result should still be meaningful, giving coverage across the nine user accessibility needs listed.
A straight copy of the automated test report run on each of the pages covered is not helpful for Public Sector Bodies and may even be misleading:
- It does not prioritise failures, mixing trivial and critical issues;
- It passes (fails to fail) content that is formally correct but deficient in terms of content;
- It often contains erroneous failures (so-called "false positives");
- It often fails to detect critical issues.
With the aim of "a format helping public sector bodies to correct [deficiencies]" in mind, let's look at the advantages and limitations of automated tests.
Automated tests and beyond
One big advantage of automated tests is that they arrive at results fast. A second advantage is that such a test is thorough in terms of what it can detect. The issues found on any given page can run into the hundreds. Running an automated test usually highlights important failures that may be overlooked in human testing. What is also apparent is that the reported result is unable (and does not attempt) to prioritise the issues found. Some may be trivial, others absolute 'show stoppers' — barriers completely preventing the use of the site for particular users.
The problem of false positives and limited coverage
Despite claims to the contrary, there are usually some so-called 'false positives' among the results of automated tests: i.e. a failure is flagged which on inspection turns out not to be a failure after all. A few examples:
- A custom slider control can only be operated with a pointer (e.g. mouse or touch). An automated check may flag this element as not keyboard-operable and register this as a FAIL of 2.1.1 Keyboard. However, keyboard users have an alternative way to enter the value in a text input right underneath. There is no actual barrier for keyboard users - the reported failure is a false positive.
- A video shows a dying polar bear; the sound contains some soft background music. The automated check finds no video captions and flags this as a FAIL of 1.2.2 Captions (Prerecorded). There is no audio content that would need a caption, so there is no problem for users without hearing.
- For a 'browse back' link in a page navigation, the automated check finds that contrast is insufficient. However, the element is exempt from meeting 1.4.3 Contrast (Minimum) since it is part of an inactive user interface component - another example of a false positive.
More important are the often serious issues that automated tests are unable to find. Experts agree that only about 25-30% of failures can be detected automatically (and I would argue that most of those automated checks require an additional human check or verification).
Formal and content aspects of testing
Why is a formal check using automated testing not sufficient if it can indeed identify cases of non-conformity? The short answer is: Because automated testing cannot identify other, non-formal aspects of non-conformity.
Many Success Criteria can only be partially checked by automated tests. For example, when checking the alternative text of images, an automated test will detect only formal issues. It may reveal that an alt attribute is missing on an image element (img). This is a clear "FAIL". In other cases, however, where an alt attribute is present, humans need to check whether the alternative text actually makes sense: Does it describe the link target of a linked teaser graphic? Does it describe an information graphic meaningfully? Here, the automated test alone cannot determine non-compliance. It would report a "PASS" where any human check would detect a "FAIL".
The difference between formal and content-related requirements runs through many of the WCAG success criteria. Formal properties of buttons, text fields, links or HTML page titles can be checked automatically (Is the element named? Are ARIA attributes valid here? Do they use allowed values? Are elements nested correctly?). However, it is not possible to determine automatically whether such elements have meaningful text content or whether attribute values are actually set correctly. A valid check of success criteria like 1.1.1 Non-text Content, 1.2.2 Captions (Prerecorded), 2.4.2 Page Titled, 3.3.2 Labels or Instructions or 4.1.2 Name, Role, Value is therefore not feasible without an additional human check. Formal errors can be found and cases of non-conformity identified — but meaningless, misleading or obscure labels, headings, page titles, video subtitles, etc. would also have to be assessed as non-conforming for the test approach to be valid overall.
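The formal half of such an alt-text check is easy to automate; the content half is not. A minimal sketch using Python's standard html.parser — it can only flag the formal failure (a missing alt attribute) and must hand images that do carry an alt text over to a human reviewer:

```python
from html.parser import HTMLParser

class AltChecker(HTMLParser):
    """Formal check only: an img without an alt attribute is a clear
    FAIL of 1.1.1 Non-text Content. An img *with* an alt attribute is
    queued for human review, because no automated test can judge
    whether the text is actually meaningful."""

    def __init__(self):
        super().__init__()
        self.formal_fails = []  # missing alt: automated FAIL
        self.needs_human = []   # alt present: meaningfulness unknown

    def handle_starttag(self, tag, attrs):
        if tag == "img":
            attr_map = dict(attrs)
            if "alt" not in attr_map:
                self.formal_fails.append(attr_map.get("src", "?"))
            else:
                self.needs_human.append((attr_map.get("src", "?"),
                                         attr_map["alt"]))

checker = AltChecker()
checker.feed('<img src="chart.png"><img src="teaser.png" alt="image">')
print(checker.formal_fails)  # ['chart.png']
print(checker.needs_human)   # [('teaser.png', 'image')]
```

Note that the second image would pass the automated check even though its alt text ("image") is useless - exactly the "PASS where a human would detect a FAIL" problem described above.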
Automated tests and the problem of selective optimization
Past experience with automated checks applied in public monitoring schemes has shown that site owners may selectively optimize their sites when it is known which automatic checks will be applied. Accessibility problems may remain uncorrected because they will not be detected by automatic checks. The implementing decision (EU) 2018/1524 addresses this known issue only in a vague manner, by demanding in point 1.3.3:
"After each deadline to submit a report, as established in Article 8(4) of Directive (EU) 2016/2102, Member States shall review the test rules for the simplified monitoring method."
A review does not necessarily mean a change in the selected success criteria and the test methods involved. However, such a change is in keeping with the spirit of the Directive. If it is ignored, the simplified monitoring may just run the same set of automatic tests (possibly elaborated by humans) with engrained gaps, failing to provide a varied snapshot of accessibility across the user accessibility needs.
Another risk is that Success Criteria which can be checked fully automatically, like 3.1.1 Language of Page, may be selected to stand in and 'saturate' a user accessibility need such as usage without vision where it demonstrably has some relevance. This might then be used to justify the exclusion of trickier Success Criteria mapping onto the same user accessibility need, like 1.1.1 Non-text Content — a check that would need an additional human step.
Automatic-only checks in conflict with the stated objective of monitoring
If the objective of simplified monitoring is to provide meaningful results to all Public Sector Bodies in the sample – and not just to the 5% of the sample that are evaluated in depth – issues that are both critical and frequent should be included. Many of these issues can be easily identified in human checks (including checks by people with disabilities) but often remain undetected in automated tests.
Here is a sample of critical and frequent issues that may 'fly under the radar' in automatic tests:
- Custom menus (such as the main navigation in responsive views, or flyout menus) that are not keyboard-operable
- Inaccessible custom dropdowns (pseudo-selects, date pickers)
- Focus order issues with interactive content (e.g., custom dialogs)
- Invisible keyboard focus making a site unusable for keyboard users
- Critical data that cannot be entered using the keyboard
- Animated content that cannot be stopped
- Context changes, like popups that open automatically on page load
- Issues with notifications that are not perceivable for blind users
- Issues with unexpected time-outs
Criteria for a valid, efficient and inclusive simplified monitoring
Variation and comparability
One objective of simplified monitoring (and reporting) is to generate results across all EU Member States. This suggests that the same success criteria should be tested over a certain period. It is unclear whether such a selection could be coordinated across Europe — we are not aware of any efforts in this direction so far. If different Member States go their own ways here anyway, a Europe-wide comparability of compliance to specific success criteria selected for simplified monitoring will be very limited. This is less of an issue for the results of in-depth monitoring since here, all WCAG Success Criteria must be included.
If it can be assumed that Europe-wide comparability is unrealistic, it is nevertheless possible to make a uniform selection of Success Criteria mapping onto all nine user accessibility needs on the level of one Member State. This will be a monitoring design decision of the new national and sub-national monitoring bodies.
There is however, a clear disadvantage of standardisation on a select set of Success Criteria: Since it is difficult to avoid the selection becoming known, providers are invited to selectively optimize, i.e. give priority to fixing the issues that are being examined. Other issues may remain unaddressed.
Restriction to indicative results, focussing on violations that are both frequent and critical
The focus on "non-conformity" suggests that a test of a site against a selected Success Criterion need not be completed across all instances on all pages. When a clear case of non-conformance is found on a page, it can be documented and the test step for that page aborted. Full and comprehensive feedback on deficiencies/barriers for people with disabilities is not feasible with simplified monitoring anyway. Nevertheless, the limited results are valid in that they demonstrate the non-conformance of an instance on a certain page. Again, coverage of both frequent and critical failures is desirable to make the (limited) result of simplified monitoring that is returned to the Public Sector Body as meaningful as possible.
It can be assumed that the instances of failures identified often suggest the existence of similar failures in other parts of the site. For the site owner, the test result would at least provide an initial overview of barriers across all nine user accessibility needs.
Inclusion of people with disabilities
The simplified monitoring can provide people with disabilities an opportunity to test those user accessibility needs that they can experience and verify. The Implementing Decision (EU) 2018/1524 states:
"Member States may also use tests other than automated ones in the simplified monitoring." (ANNEX I, 1.3.2)
Testers with disabilities may be unable to cover all user accessibility needs, but gaps can be filled in a team-based approach: other testers can add results for Success Criteria that the first tester cannot (fully) check due to his or her particular condition or skill set.
The question of expertise
For many Success Criteria, one can devise human checks that can be done by people who are not accessibility experts with extensive experience. Are the page title and headings descriptive? Can I zoom into a site without content being cut off or covered? Can I see the keyboard focus at any time when I tab through a page? Does the focus order follow the visual order? Is there a control to stop an animated carousel? All these checks, and more, do not require advanced accessibility knowledge.
Testers with more expertise may be needed at other points, for example, in order to verify if automated checks that have flagged violations in the use of roles, states or duplicate IDs are actual accessibility issues or may possibly be ignored in a 'results-driven' perspective.
A team-based approach enabling mutual learning
A team-based monitoring process can provide an opportunity for testers with and without disability to learn from each other and increase their competence. This is a win-win situation:
- By working with testers with disabilities, accessibility experts familiar with standard-oriented checks become more aware of the actual impact of the issues flagged. Some issues turn out to be trivial, others are critical. They will also appreciate the amount and the severity of issues in actual use that automatic testing alone could not identify.
- By working in a team, testers with disabilities learn more about the types of barriers they identify in use and other barriers that may be a 'non-issue' for them. They will see a growth both in their technical expertise and in their proficiency in using assistive technologies and the various tools for testing — skills that will be valuable also in other types of employment.
Testing options for "simplified monitoring"
For the simplified monitoring process, several valid approaches are conceivable (in our view, an automated-check-only process would not be valid). We just sketch four approaches here — others are certainly feasible.
It would probably need practical exploration (and a consideration of contextual factors) to arrive at the best, most workable approach. The process is likely to be revised until it is both inclusive and efficient, and produces meaningful results for Public Sector Bodies.
As far as the nine user accessibility needs are concerned, at least one test step relevant for the respective user accessibility need would have to be included — preferably one that can detect frequently found and critical issues. As there are so many Success Criteria critical for the needs of users without vision, it could be argued that several tests should cover this user accessibility need — preferably those that are equally critical for other groups, such as tests for 2.1.1 Keyboard or 2.4.3 Focus Order.
- Deficiency-oriented inspection: A quick human inspection identifies key deficiencies / accessibility issues and documents them in a matching Success Criterion until all nine user accessibility needs are covered for each page. This means that if visual inspection indicates weak text contrast, that contrast is measured and the non-conformity is recorded in checkpoint 1.4.3 by referencing the element affected. Further checks that are relevant for people with impaired vision may not be carried out, even if other areas, such as 1.4.4 Resize Text or 1.4.10 Reflow, also appear to be deficient.
- Check after a pre-selection of Success Criteria / checkpoints: This process would apply pre-selected checkpoints to the pages selected. The selection of Success Criteria (which should include frequent or particularly critical accessibility issues) may be pre-defined for all tests of a test period to increase comparability, or be selected at the start of the individual test. Different sets of pre-selected Success Criteria may be created that all sufficiently cover the nine user accessibility needs in different ways, so sets can be swapped. The examination itself would then run similarly to the deficiency-oriented inspection: if an indication of non-conformity is found (e.g. some text lacking sufficient contrast), a measurement is made and the instance of non-conformance is recorded for that page. The test is then aborted (and continued on the next page).
- Deficiency-based assessment by testers with disabilities: The assessment by testers with disabilities is a variant of the deficiency-based approach above. Testers examine content to the extent that it is testable for them, and record first instances of non-conformance in fitting Success Criteria until a matching user accessibility need is saturated. The assessment is continued by another team member in areas that the first tester cannot assess due to his or her disabilities. Depending on the user and disability, there may be preferences regarding the selection of Success Criteria / checkpoints. For example, a blind user may perform a check of text contrast using an automated tool while he or she may be unable to test other aspects, such as the contrast of graphics or text resizing. Care needs to be taken here to avoid systematic and predictable gaps in coverage.
- Automatic tests with a human supplementary test: Both types of testing (deficiency-based, or with pre-selected Success Criteria) may start with an automatic check (using tools like Lighthouse, Wave, axe, or Tenon, to name a few) on the pages selected. The errors flagged can then be assigned to specific Success Criteria (possibly only the pre-selected ones). For example, an incorrect use of an ARIA attribute can be considered a non-compliance of 4.1.2 "Name, Role, Value" (and in turn, may "saturate" usage without vision). As shown above, it will be important to take into account that a lack of reported issues does not mean that there is no failure. For example, an automated test report may suggest that there are no issues with an image while the additional human check may reveal that its alternative text is misleading or nonsensical. The same holds for other formal checks - most require an additional look at the content aspect (e.g. is the page title, the accessible name / role assigned, the value output, the error message, etc. correct or meaningful?).
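The "saturation" logic running through these approaches can be tracked with a simple structure: each recorded non-conformance is assigned to a Success Criterion, and a user accessibility need counts as covered once at least one mapped criterion has a documented failure. A hedged sketch — the SC-to-need mapping shown is illustrative, not an official one:

```python
# Illustrative mapping from Success Criteria to the user accessibility
# needs they "saturate" once a non-conformance is recorded.
SC_TO_NEEDS = {
    "4.1.2 Name, Role, Value": ["usage without vision"],
    "1.4.3 Contrast (Minimum)": ["usage with limited vision"],
    "2.1.1 Keyboard": ["usage without vision",
                       "usage with limited manipulation or strength"],
}

def saturated_needs(findings: list) -> set:
    """User accessibility needs covered by at least one recorded
    non-conformance against a mapped Success Criterion."""
    covered = set()
    for sc in findings:
        covered.update(SC_TO_NEEDS.get(sc, []))
    return covered

# An incorrect ARIA attribute, recorded as a 4.1.2 non-conformance,
# saturates "usage without vision":
print(saturated_needs(["4.1.2 Name, Role, Value"]))
```

A monitoring team could keep testing a page only until all nine needs are saturated (or the page is exhausted), which is exactly the early-abort behaviour the approaches above describe.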
We hope that these reflections can help shape the as yet fairly nebulous simplified monitoring methodology. We see a definite role for automated checks, but are wary of approaches that rely on them alone.
Automatic tools are great (and we all use them), but care must be taken that the way they are used is not limiting, does not lead to de-skilling. After all, it is the appreciation of the context of a real or apparent accessibility issue that is critical in deciding whether it can or should be called a failure. Context awareness evolves in the application of human skill, in mutual learning and discussion — not when pressing the "Audit now" button and expecting a tool to do all the work.
Finally, we think the opportunity should not be missed to create an inclusive approach that can draw on the expertise of people with disabilities as testers and experts. We also think that the process should be designed to be open for people with disabilities who are not (yet) experts but co-operate in a team to build up their expertise.