An Army experimental test pilot checks out cockpit electronics. (Army Redstone Test Center photo by Collin Magonigal)

AI safety is increasingly on decision-makers’ minds, from President Biden’s Executive Order full of policy suggestions to the OpenAI board trying to fire the CEO reportedly over how fast to push development. But how can government agencies and private companies put artificial intelligence through rigorous safety testing before it’s too late? While the details of the technology are very different, former Air Force flight test engineer Michael O’Connor, now a Space Force officer, argues in this op-ed there are vital lessons to be learned from the long and often lethal history of flight testing.

From “fake news” to human extinction, the potential harms from AI gone wrong range from the everyday to the existential. Governments, companies, groups, and individuals are struggling to address those risks through legislation, regulation, and corporate best practices. Some are even taking lessons from aviation, building databases of incidents, the same way aircraft near-misses and actual crashes are documented and scrutinized. And while the unpredictable and opaque nature of modern “neural network” artificial intelligence doesn’t map neatly onto the more deterministic logic of flight control systems, they are both human attempts to safely explore the cutting edge of technology.

So it’s worth studying how the military flight test world has mitigated risk over the last 120 years, evolving from the dangerous early days of the Wright Brothers and “Right Stuff” fighter pilots to modern scientific safety practices. The three crucial lessons to be learned:

  • establish design standards that consider the human operator, not just the technology in isolation;
  • test systems in accomplishing realistic tasks, not ideal conditions; and
  • systematically expand the “envelope” of safe performance.

Standards for a Plane and a Pilot

Before the first Wright Flyer took off on Dec. 17, 1903, there had been years of research into aircraft stability and the ability of a pilot to maintain control. As the discipline of flight test evolved, many of the lessons learned were codified and documented, with the Pentagon ultimately publishing MIL-STD-1797: Flying Qualities of Piloted Aircraft. (FAA regulations have similar requirements.) The standard provided quantified recommendations for how aircraft should perform, from the proper sensitivity of controls for different aircraft types to the whole plane’s stability in flight. Critically, the standard specified not only the behavior of the aircraft alone, but also its performance once a human enters the control loop.

Likewise, as researchers and developers learn how AI can fail, sharing those lessons widely with the entire AI community can reduce risk, eventually evolving into quantifiable performance standards that set the bar for commercial AI. (While aviation standards are maintained by government agencies, other organizations, such as public-private consortia, could manage a standard for AI.)

Such standards must evolve with the technology and be tailored to different applications of AI: A military or medical AI that makes life-or-death decisions should be subject to different levels of scrutiny than one that picks the best recipe for dinner. Most important, any standard must consider the human factor in design – particularly how human biases and blind spots feed into the training data for an AI and how it is used in the wider world.

Testing the System

Systems testing isn’t the Hollywood version of flight test made famous by The Right Stuff, with daring pilots pushing the physical limits of their aircraft – but it is the bulk of day-to-day test and evaluation (T&E). It’s all about making sure systems are safe for use and that they work as intended.

This is complicated enough for modern aircraft, with all their intricately interacting electronics. Trying to test and evaluate every possible use of an AI, and every potential method of abuse, is a virtually infinite task. But it’s still possible – and essential – to think about the world as a complex system-of-systems and then test that complex system in the messy real world. Testing should address the interactions between AI systems, from the initial training data used to build them to how they are deployed, employed, and modified.

Once systems testing establishes that the system works as planned, it’s also critical to push the limits of the system – carefully! That’s what engineers call “envelope expansion.”

All aircraft have a performance envelope that they are designed to operate within. Go too fast and pressure can cause structural failure; too high and the air grows too thin to fly; too slow and you stall. That envelope is defined by the designers’ requirements — and the tradeoffs they have to make to meet those requirements.

Testing the envelope does not start at the edge. Instead, testers start in the “heart of the envelope,” with conditions where they have the most confidence in the design and modeling. So, after low- and high-speed taxi testing, the first flight of an aircraft is often just a takeoff, a brief flight around the landing pattern, and a landing. From this initial foothold, testers conduct “envelope expansion,” characterizing aircraft behavior in ever more demanding conditions, step by careful step, until they show safe performance at the limits of the design. (Just because a system is designed to a particular limit does not mean that testers dwell at that limit: the fact that a particular jet fighter can go Mach 2 doesn’t mean it should, outside of a rare operational need.) Results from envelope testing are used to put operational limitations on what regular operators should do.
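
For readers who want something concrete, here is a rough sketch – purely illustrative, with made-up condition names, a made-up evaluation function, and an arbitrary 0.9 acceptance threshold – of what step-by-step envelope expansion could look like for an AI system:

```python
# Hypothetical sketch: expand an AI system's test envelope step by step,
# starting from the most benign conditions and stopping at the first
# condition where performance falls below an acceptance threshold.
from dataclasses import dataclass

@dataclass
class Condition:
    name: str       # e.g. "in-distribution", "noisy inputs", "adversarial"
    severity: int   # 0 = heart of the envelope; higher = more demanding

def expand_envelope(conditions, evaluate, min_score=0.9):
    """`evaluate` is any function mapping a Condition to a score in [0, 1]."""
    cleared = []
    for condition in sorted(conditions, key=lambda c: c.severity):
        score = evaluate(condition)
        print(f"{condition.name}: score = {score:.2f}")
        if score < min_score:
            print(f"Envelope edge reached at '{condition.name}'; stop and investigate.")
            break
        cleared.append(condition)  # this condition is now inside the cleared envelope
    return cleared

# Toy usage with a made-up evaluator whose score degrades as conditions get harder.
tests = [Condition("in-distribution", 0), Condition("noisy inputs", 1),
         Condition("adversarial", 2)]
expand_envelope(tests, evaluate=lambda c: 1.0 - 0.08 * c.severity)
```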

Now, an AI’s “performance envelope” may be more complex and multi-dimensional than an aircraft’s – and a constantly learning, self-modifying AI may have an envelope that keeps changing. AIs can be unpredictable and non-deterministic, with the same input giving rise to different outputs for opaque reasons – which can lead to undesired behaviors even within the boundaries of “acceptable performance.” This may require documenting how performance characteristics, like accuracy and confidence, change depending on the operational context.
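
One illustrative way to capture that kind of information – using a made-up test-log format and made-up context names – is a simple “performance card” that breaks accuracy and average confidence out by operational context:

```python
# Hypothetical sketch: summarize accuracy and mean confidence per operating context
# from a test log of (context, was_correct, confidence) records.
from collections import defaultdict

def performance_by_context(records):
    """records: iterable of (context_name, was_correct, confidence) tuples."""
    buckets = defaultdict(lambda: {"n": 0, "correct": 0, "conf_sum": 0.0})
    for context, correct, confidence in records:
        b = buckets[context]
        b["n"] += 1
        b["correct"] += int(correct)
        b["conf_sum"] += confidence
    return {
        ctx: {
            "accuracy": b["correct"] / b["n"],
            "mean_confidence": b["conf_sum"] / b["n"],
            "samples": b["n"],
        }
        for ctx, b in buckets.items()
    }

# Toy usage: the same model can look very different in different contexts.
log = [("clear_day", True, 0.97), ("clear_day", True, 0.95),
       ("heavy_rain", False, 0.91), ("heavy_rain", True, 0.62)]
for ctx, stats in performance_by_context(log).items():
    print(ctx, stats)
```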

This won’t be easy, because the potential configurations and states of an AI are – at least for now – far harder to nail down and analyze than even the most complex aircraft. We don’t yet know whether this difficulty stems from the inherent nature of AI, from a lack of appropriate instrumentation to measure how algorithms perform, or simply from the fact that AI has not yet accumulated the decades of experience that aviation testing is able to build on. It’s a cause for optimism that, in other complex endeavors like medicine, concerted effort, investment, and testing have created insight into complex, multivariable systems like the human body. System complexity is not an impenetrable barrier to understanding.

There are clear caveats to the comparison between aircraft and AI, of course. In particular, since imperfect software can be fixed or upgraded much faster than a flawed physical prototype, and its failures (so far) rarely kill anyone, the culture of Silicon Valley startups and even major IT firms is much more tolerant of risk – “move fast and break things” – than aircraft flight-safety testers. As AI moves into vehicles and those vehicles gain more autonomy, however, this line will blur, which means the two cultures must find safe common ground.

Despite these caveats, regulators and developers would do well to understand how other communities have solved their own problems of risk in developing technologies. That includes top-level lessons that may apply across domains:

  1. Test. The fundamental way to learn how a system behaves, be it AI or aerospace, is to test it.
  2. Consider the Human. While a rigorous, formal standard may work better for aircraft than AI, limitations always depend on use cases. In all cases, considering how the human interacts with the system is mandatory, and establishing standards may be advisable.
  3. Manage Complexity. While every permutation and use case can’t be predicted when a new system is fielded, perceptive testing can mitigate risk. Equally important is continued testing as the technology and the environment evolve. Such testing is not limited to optimizing performance; it also focuses on understanding how performance changes as the operational context shifts.
  4. Define the Envelope. An envelope should set limits on where the AI should be used, as well as on the behaviors that are acceptable for it. It can include controls and limiters built into the software to enforce those environments and behaviors, or at least to alert operators when they are breached (one possible sketch of such a limiter follows this list). At the very least, human systems, processes, and procedures can be put in place to mitigate undesirable behaviors. As the AI learns and adapts, the AI itself, its limits, and the assumptions behind them must also be regularly retested. And just as an AI may change, so too may the operating environment and the range of acceptable behaviors. Importantly, just because a limit keeps people safe most of the time doesn’t mean it always will.
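
As one purely illustrative example of point 4, here is a rough sketch of a software limiter – the envelope limits, the alert hook, and the fallback behavior are all invented for illustration, not a recommended design:

```python
# Hypothetical sketch: a runtime envelope guard that checks operating conditions
# against declared limits and alerts operators when the envelope is breached.
ENVELOPE = {
    "max_speed_mph": 45.0,          # hypothetical operational limits
    "min_sensor_confidence": 0.80,
}

def alert_operator(message):
    print(f"[ENVELOPE ALERT] {message}")  # stand-in for a real alerting channel

def within_envelope(speed_mph, sensor_confidence):
    """Check current conditions against the declared envelope; alert on any breach."""
    ok = True
    if speed_mph > ENVELOPE["max_speed_mph"]:
        alert_operator(f"speed {speed_mph} exceeds limit {ENVELOPE['max_speed_mph']}")
        ok = False
    if sensor_confidence < ENVELOPE["min_sensor_confidence"]:
        alert_operator(f"sensor confidence {sensor_confidence} below "
                       f"minimum {ENVELOPE['min_sensor_confidence']}")
        ok = False
    return ok

# The AI acts only while inside the envelope; otherwise it hands off to a human
# or falls back to a conservative default behavior (both hypothetical here).
if within_envelope(speed_mph=38.0, sensor_confidence=0.72):
    pass  # e.g. run_ai_policy()
else:
    pass  # e.g. hand_off_to_operator()
```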

As with any emerging technology, there are parallels to history. Not every lesson will apply exactly, but adopting the hard-earned knowledge and mindset of military flight test may benefit the future users of tomorrow’s AI systems.

Michael O’Connor is a U.S. Space Force Fellow at the Center for Security and Emerging Technology (CSET), Georgetown University. Prior to joining CSET, he was a space program test lead at Los Angeles Air Force Base, Calif. He previously served as an evaluator flight test engineer supporting remotely piloted aircraft testing at the Air Force Test Center at Edwards Air Force Base, Calif. The views expressed in this article are those of the author and do not necessarily reflect the official policy or position of the Department of the Air Force, the Department of Defense, or the U.S. Government.