Introduction

Every day, 8 million Americans board an airplane, putting their lives in the hands of the professional pilots, mechanics, flight controllers, and rampers who all make safety a priority. Air travel is statistically one of the safest ways to travel: between 1995 and 2000, commercial airlines in the United States recorded 3 deaths per 10 billion passenger miles traveled ( https://en.wikipedia.org/wiki/Aviation_safety#cite_note-2 ). Compare that to the 41,945 road deaths in the US for the year 2000 ( http://www.iihs.org/iihs/topics/t/general-statistics/fatalityfacts/overview-of-fatality-facts ) and the 425 fatalities for train travel ( https://oli.org/about-us/news/collisions-casulties ), and air travel starts to look really safe.

In this article we will explore all aircraft accidents recorded by the National Transportation Safety Board (NTSB) from 1962 up until the present. We will look at the number of persons injured through the lens of single variables like Purpose of Flight (for example Air Races, Firefighting, Flight Tests, Personal use, Public use, Skydiving and more), Type of Engine (Reciprocating propeller, Turbo Jets, Turbo Fans) and Make/Model. We will explore how two variables interact when we look at how Phase of Flight (the point in the flight at which the accident happened, for example Landing or Takeoff) and Aircraft Category combine to affect Total Injured. Finally, we will bring it all together in a summary and a final presentation of 3 main plots.

The data covers 79,141 flights recorded by the NTSB from 1962 onward. The dataset contains 31 variables, 5 of which we created in order to better explore the data. We will look strictly at the USA subset because it contains the most information.

Let’s take a look at a summary of our data.

As you can see we have factors that you would expect to see, like Date, City, State and Country. We also have Lat/Long pairs we will use to plot our accidents and we will use those coordinates to group these accidents together by state.

There is quite a bit of information in this summary so let me highlight some of the more interesting pieces:

Some interesting things appear when we start to look at the mean number of injuries. Per crash, we observe the following statistics:

Averages per crash

  • 1.67 total injuries
  • 1.45 minor injuries
  • 1.22 serious injuries
  • 0.33 fatal injuries
  • 6.33 uninjured

This gives an interesting factoid: according to this data, if you are in an aircraft accident you have a 26% chance of being injured and a 19% chance of being seriously injured.
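
The two percentages appear to correspond to dividing the average injury counts by the average number of uninjured occupants. That reading is an assumption, since the calculation itself is not shown, and the averages may have been taken over different sets of rows. A minimal sketch in R, using only the averages listed above, reproduces the figures and also computes the injured share of all occupants for comparison:

```r
# Per-crash averages quoted in the list above
avg_total_injured   <- 1.67
avg_serious_injured <- 1.22
avg_uninjured       <- 6.33

# Assumed derivation: injured occupants relative to uninjured occupants
ratio_injured <- avg_total_injured / avg_uninjured
ratio_serious <- avg_serious_injured / avg_uninjured

# Alternative definition: injured as a share of all occupants
share_injured <- avg_total_injured / (avg_total_injured + avg_uninjured)

round(c(ratio_injured, ratio_serious, share_injured), 2)
```

Under the second definition the injured share comes out somewhat lower than the ratio to uninjured occupants, so it is worth stating which denominator is meant when quoting a "chance of injury".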

Mapping Airline Crashes

Here we plot every known accident from its lat/long pair and place it on a grid. It is easy to see how the accidents themselves trace boundary lines and paint a picture of our data.

Here we can see the distribution of states and cities in our data set. For cities, we have limited the chart to the 15 most accident-prone cities to keep it readable.

Above we can see that most crashes are “Non-Fatal” and result in “Substantial” damage, with the aircraft not destroyed.

Here we can clearly see that Cessna built the most aircraft that were involved in accidents. It’s important to note that this number, although high, does not mean that Cessna makes poor aircraft, only that it makes a lot of them. We don’t have the numbers for non-crash statistics, but I think you would see a high number of Cessna aircraft in those statistics as well.

We can also see that the most common engine type is ‘Reciprocating’. This is a piston-driven engine (similar to what you find in a car) that drives a shaft that turns a propeller. Aircraft with these engines are commonly called ‘Props’ or ‘Prop Planes’.

The Model graph shows mostly Cessna aircraft (models 152 through 172) and one Piper (the PA-28-140).

As we will see in the next section, Phase of Flight is one of our most correlated factors. Phase of Flight details when the accident occurred during the flight, such as during landing or takeoff.

The first of the two charts above gives the total number of crashes per Aircraft Category, scaled with the log10() function, and the second shows the percentage of people injured during an accident. The higher the percentage, the more people are injured on average per crash. Rockets have the highest injury rate per crash at 100%, followed by Powered Parachutes.
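
As a rough illustration of how the two quantities behind these charts could be computed, here is a minimal base R sketch. The data frame `crashes` and its values are made up, and defining the injury rate as injured divided by all occupants is an assumption rather than the definition used for the charts.

```r
# Hypothetical toy data; the real analysis uses the NTSB records
crashes <- data.frame(
  Aircraft.Category = c("Airplane", "Airplane", "Airplane",
                        "Helicopter", "Helicopter", "Balloon"),
  Total.Injuries    = c(0, 1, 2, 2, 1, 3),
  Total.Uninjured   = c(4, 3, 0, 0, 1, 1)
)

# Crashes per category, and their log10 so that rare categories stay visible
crash_counts <- table(crashes$Aircraft.Category)
log_counts   <- log10(as.numeric(crash_counts))

# Sum injured and uninjured per category, then express injured
# as a percentage of all occupants (assumed definition)
totals <- aggregate(cbind(Total.Injuries, Total.Uninjured) ~ Aircraft.Category,
                    data = crashes, FUN = sum)
totals$Injury.Rate <- 100 * totals$Total.Injuries /
  (totals$Total.Injuries + totals$Total.Uninjured)

totals
```

Note that log10() of a category with a single crash is 0, so such a category would show no bar on the log-scaled chart unless the axis is adjusted.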

I ended up consolidating Purpose of Flight from 21 categories into 6 base categories. Many of the original categories had overlapping intents, so reducing them to 6 makes the data easier to visualize.
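
A consolidation like this can be sketched with a named lookup vector in base R. The original category names and the mapping below are hypothetical and are not the mapping used in this report; anything without an entry in the lookup falls into ‘Unknown’.

```r
# Hypothetical sample of original Purpose of Flight values
purpose <- c("Personal", "Instructional", "Business", "Ferry",
             "Skydiving", "Air Race/Show", "Public Aircraft", "Flight Test")

# Assumed lookup: original category -> collapsed category
collapse_map <- c(
  "Personal"        = "Personal",
  "Instructional"   = "Instructional",
  "Business"        = "Transport",
  "Ferry"           = "Transport",
  "Skydiving"       = "Recreation",
  "Air Race/Show"   = "Recreation",
  "Public Aircraft" = "Public"
)

# Look up each value; names missing from the map come back as NA
collapsed <- unname(collapse_map[purpose])
collapsed[is.na(collapsed)] <- "Unknown"

PurposeCollapsed <- factor(collapsed)
table(PurposeCollapsed)
```

Keeping the mapping in one vector makes it easy to review which original categories were merged and to change the grouping later.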

Univariate Summary

Our data is structured in ~80K rows and 36 columns. Our descriptive factors are what we will use to try to explain the total number of injuries. Can we create a model that explains the variance in the number of injuries using the variables Aircraft Category, Engine Type, Make, Model, Purpose of Flight and Phase of Flight? Let’s find out!

A Multivariate approach

In this section, we will look at how two or more variables can combine to affect our total number of injuries. Let’s start by looking at total number of plane crashes by state. State is a variable we created for our data based on its Lat/Long pair.

As we can see, California has the most accidents, followed by Texas and then Florida. All three rank among the most populous states, so those numbers line up with what we would expect from the number of people living there.

I’ve also color coded these by the most common source of accidents as seen through the Purpose of Flight variable.

After removing the overwhelming ‘Personal’ category, which accounted for 62% of all accidents, we notice that in most states the largest share of accidents comes from Instructional flights. Second after that is ‘Aerial Application’, more commonly referred to as ‘Crop Dusting’, which is the act of spreading pesticides on crops from an airplane. One state, Vermont, has most of its accidents coming from Glider Tow, where one plane pulls a glider (a plane with wings but no engine). And one state, Wyoming, has most of its accidents coming from Business Travel.

For reference, let’s look at the unaltered state graph.

This multivariate chart shows us the Total Injured, the Total Crashes, and the Total Fatalities as seen through the lens of Phase of Flight. We notice right away that each grouping has a different leader: the most injuries occur during the TAKEOFF phase, the most crashes happen during the LANDING phase, and the most fatalities occur during the MANEUVERING phase.
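
One way to find the leading purpose per state once ‘Personal’ is set aside is sketched below in base R. The data frame `flights` and its rows are invented for illustration, and this is not necessarily how the chart above was built.

```r
# Hypothetical toy data: one row per accident
flights <- data.frame(
  State   = c("CA", "CA", "CA", "CA", "VT", "VT", "VT", "WY", "WY"),
  Purpose = c("Personal", "Personal", "Instructional", "Instructional",
              "Personal", "Glider Tow", "Glider Tow", "Business", "Personal"),
  stringsAsFactors = FALSE
)

# Share of all accidents that fall in the Personal category
personal_share <- mean(flights$Purpose == "Personal")

# Drop Personal, then count accidents per state and purpose
non_personal <- flights[flights$Purpose != "Personal", ]
counts <- table(non_personal$State, non_personal$Purpose)

# For each state (row), take the purpose (column) with the largest count;
# which.max() returns the first maximum, so ties go to the first column
top_purpose <- colnames(counts)[apply(counts, 1, which.max)]
names(top_purpose) <- rownames(counts)

top_purpose
```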

This map gives us a good feel for the total number of injuries by state. We have darkened each state according to the total number of injuries that occurred there.

With this plot, we can clearly see that Personal accidents (in green) account for most of our crashes. We have adjusted the total number of injured with a log function in order to better visualize it, and we use that variable to control bubble size. We notice an unusual string of red in the central United States, representing accidents involving government flights.

In this plot we have removed the overwhelming Personal accidents to get a better feel for what other types of accidents occur most often and in which geographical regions. We notice a high concentration of Recreation injuries in the central part of the country, with very few on the East Coast and a medium amount on the West Coast.

Here we have a map of the 15 crashes with the most injuries in NTSB history. We have used the Total Injury count to set the size of the points on the map, with San Marcos being the location of the crash with the highest number of injuries.

Now let’s take a look at a “Pairs Panel” (pairs.panels) from the psych package.

As we can see, we have some pretty closely related data. We can see Pearson correlations for the factors Make, Model, Aircraft Damage, Phase of Flight and Purpose of Flight. Let’s start taking a look at what factors contribute to fatalities and injuries by using decision trees.

When we plot a tree using just the Amateur.Built variable, we notice right away a huge number of fatal flights (in green circles) when the aircraft was built by an amateur and not by a professional manufacturing company. Let’s go on to fit our Phase of Flight.

In this tree, we plot Aircraft.Category (Airplane, Helicopter, Balloon, Gyrocraft) along with our Phase of Flight. When the phase is Maneuvering, Cruise or Approach, we see a high survivability rate. When it is Landing or Takeoff and the aircraft is one of the types listed in the bottom left, the fatality rate jumps dramatically.

Indeed, Landing and TakeOff account for 56% of all aviation crashes in our report.

Modeling our Data

In this section we are going to use linear regression with ANOVA to relate our factors to the total number of injured persons in our aviation accidents. Let’s take a look at the summary.

## 
## Call:
## lm(formula = usaFlights$Total.Injuries ~ usaFlights$Broad.Phase.of.Flight + 
##     usaFlights$Weather.Condition + usaFlights$PurposeCollapsed + 
##     usaFlights$Engine.Type + usaFlights$Aircraft.Damage + usaFlights$Amateur.Built, 
##     data = usaFlights)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
##  -2.578  -0.745  -0.259   0.255 187.136 
## 
## Coefficients:
##                                              Estimate Std. Error t value
## (Intercept)                                  2.775681   0.334483   8.298
## usaFlights$Broad.Phase.of.FlightCLIMB       -0.057955   0.092271  -0.628
## usaFlights$Broad.Phase.of.FlightCRUISE      -0.227049   0.058833  -3.859
## usaFlights$Broad.Phase.of.FlightDESCENT     -0.242211   0.086786  -2.791
## usaFlights$Broad.Phase.of.FlightGO-AROUND   -0.083242   0.095428  -0.872
## usaFlights$Broad.Phase.of.FlightLANDING     -0.737156   0.047705 -15.452
## usaFlights$Broad.Phase.of.FlightMANEUVERING -0.041920   0.055921  -0.750
## usaFlights$Broad.Phase.of.FlightOTHER       -0.117063   0.315107  -0.372
## usaFlights$Broad.Phase.of.FlightSTANDING    -0.618402   0.102696  -6.022
## usaFlights$Broad.Phase.of.FlightTAKEOFF     -0.213779   0.049882  -4.286
## usaFlights$Broad.Phase.of.FlightTAXI        -0.964904   0.087451 -11.034
## usaFlights$Weather.ConditionIMC             -0.031305   0.245245  -0.128
## usaFlights$Weather.ConditionUNK             -0.909141   0.482240  -1.885
## usaFlights$Weather.ConditionVMC             -0.349065   0.239013  -1.460
## usaFlights$PurposeCollapsedGovernment       -0.454392   0.137991  -3.293
## usaFlights$PurposeCollapsedInstructional    -0.454232   0.085067  -5.340
## usaFlights$PurposeCollapsedPersonal         -0.303691   0.078880  -3.850
## usaFlights$PurposeCollapsedPublic           -0.301810   0.138459  -2.180
## usaFlights$PurposeCollapsedRecreation       -0.632355   0.090974  -6.951
## usaFlights$PurposeCollapsedTransport        -0.301280   0.082512  -3.651
## usaFlights$PurposeCollapsedUnknown          -0.332152   0.201236  -1.651
## usaFlights$Engine.TypeElectric              -1.207979   1.324827  -0.912
## usaFlights$Engine.TypeHybrid Rocket         -0.515233   1.897312  -0.272
## usaFlights$Engine.TypeNone                   0.355690   0.838755   0.424
## usaFlights$Engine.TypeREC, ELEC              0.027199   1.870267   0.015
## usaFlights$Engine.TypeReciprocating         -0.227361   0.074453  -3.054
## usaFlights$Engine.TypeREC, TJ, TJ            0.090768   1.325341   0.068
## usaFlights$Engine.TypeTF, TJ                -0.289154   1.871815  -0.154
## usaFlights$Engine.TypeTJ, REC, REC, TJ       1.035876   1.873958   0.553
## usaFlights$Engine.TypeTurbo Fan              0.231845   0.114762   2.020
## usaFlights$Engine.TypeTurbo Jet             -0.264626   0.163459  -1.619
## usaFlights$Engine.TypeTurbo Prop            -0.234510   0.094862  -2.472
## usaFlights$Engine.TypeTurbo Shaft           -0.054054   0.098023  -0.551
## usaFlights$Engine.TypeUnknown                0.006597   0.404957   0.016
## usaFlights$Aircraft.DamageDestroyed          1.066057   0.107267   9.938
## usaFlights$Aircraft.DamageMinor             -0.169508   0.124702  -1.359
## usaFlights$Aircraft.DamageSubstantial       -0.010900   0.100764  -0.108
## usaFlights$Amateur.BuiltNo                   0.139623   0.221292   0.631
## usaFlights$Amateur.BuiltYes                  0.074555   0.223926   0.333
##                                             Pr(>|t|)    
## (Intercept)                                  < 2e-16 ***
## usaFlights$Broad.Phase.of.FlightCLIMB       0.529951    
## usaFlights$Broad.Phase.of.FlightCRUISE      0.000114 ***
## usaFlights$Broad.Phase.of.FlightDESCENT     0.005261 ** 
## usaFlights$Broad.Phase.of.FlightGO-AROUND   0.383056    
## usaFlights$Broad.Phase.of.FlightLANDING      < 2e-16 ***
## usaFlights$Broad.Phase.of.FlightMANEUVERING 0.453483    
## usaFlights$Broad.Phase.of.FlightOTHER       0.710268    
## usaFlights$Broad.Phase.of.FlightSTANDING    1.76e-09 ***
## usaFlights$Broad.Phase.of.FlightTAKEOFF     1.83e-05 ***
## usaFlights$Broad.Phase.of.FlightTAXI         < 2e-16 ***
## usaFlights$Weather.ConditionIMC             0.898430    
## usaFlights$Weather.ConditionUNK             0.059411 .  
## usaFlights$Weather.ConditionVMC             0.144184    
## usaFlights$PurposeCollapsedGovernment       0.000993 ***
## usaFlights$PurposeCollapsedInstructional    9.41e-08 ***
## usaFlights$PurposeCollapsedPersonal         0.000118 ***
## usaFlights$PurposeCollapsedPublic           0.029285 *  
## usaFlights$PurposeCollapsedRecreation       3.74e-12 ***
## usaFlights$PurposeCollapsedTransport        0.000262 ***
## usaFlights$PurposeCollapsedUnknown          0.098844 .  
## usaFlights$Engine.TypeElectric              0.361884    
## usaFlights$Engine.TypeHybrid Rocket         0.785964    
## usaFlights$Engine.TypeNone                  0.671520    
## usaFlights$Engine.TypeREC, ELEC             0.988397    
## usaFlights$Engine.TypeReciprocating         0.002263 ** 
## usaFlights$Engine.TypeREC, TJ, TJ           0.945399    
## usaFlights$Engine.TypeTF, TJ                0.877234    
## usaFlights$Engine.TypeTJ, REC, REC, TJ      0.580424    
## usaFlights$Engine.TypeTurbo Fan             0.043373 *  
## usaFlights$Engine.TypeTurbo Jet             0.105481    
## usaFlights$Engine.TypeTurbo Prop            0.013440 *  
## usaFlights$Engine.TypeTurbo Shaft           0.581334    
## usaFlights$Engine.TypeUnknown               0.987002    
## usaFlights$Aircraft.DamageDestroyed          < 2e-16 ***
## usaFlights$Aircraft.DamageMinor             0.174065    
## usaFlights$Aircraft.DamageSubstantial       0.913857    
## usaFlights$Amateur.BuiltNo                  0.528084    
## usaFlights$Amateur.BuiltYes                 0.739179    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.868 on 19782 degrees of freedom
## Multiple R-squared:  0.08554,    Adjusted R-squared:  0.08379 
## F-statistic:  48.7 on 38 and 19782 DF,  p-value: < 2.2e-16
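
For readers who want to try a fit of this kind, here is a minimal sketch on synthetic data. The column names mirror some of those in the call above, but the data frame `toy` and its values are made up and will not reproduce the numbers in the summary. Naming columns directly in the formula and passing `data =` gives shorter coefficient labels than the `usaFlights$...` form, and `anova()` reports one row per factor instead of one per level.

```r
set.seed(42)
n <- 500

# Synthetic stand-in for the accident records (values are invented)
toy <- data.frame(
  Broad.Phase.of.Flight = factor(sample(
    c("APPROACH", "CRUISE", "LANDING", "TAKEOFF"), n, replace = TRUE)),
  Aircraft.Damage = factor(sample(
    c("Destroyed", "Minor", "Substantial"), n, replace = TRUE))
)

# Make destroyed aircraft carry more injuries on average
toy$Total.Injuries <- rpois(
  n, lambda = ifelse(toy$Aircraft.Damage == "Destroyed", 2.5, 1))

# Linear model with two categorical predictors
fit <- lm(Total.Injuries ~ Broad.Phase.of.Flight + Aircraft.Damage, data = toy)

fit_summary <- summary(fit)   # per-level coefficients, as in the output above
anova_table <- anova(fit)     # one F-test per factor
r_squared   <- fit_summary$r.squared

anova_table
```

In the summary above, the Multiple R-squared of 0.08554 means the model accounts for roughly 8.6% of the variance in total injuries, so most of the variation is left unexplained by these factors.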

Final Plots

Let’s start pulling together all of our data and wrap this report up with 3 plots that best showcase our data.

Plot 1

I like this map because it shows at a glance how aircraft accidents differ by purpose throughout the contiguous United States. We can clearly see dense clusters of brown, indicating Instructional crashes, along the East Coast. We can also see clusters of blue in the central United States, suggesting that more recreational flights happen in the central and Midwest regions. We see Transport accidents (accidents occurring during the transfer of goods like mail or Amazon packages) evenly dispersed throughout. Our Public accidents, in green, are also evenly distributed and appear to make up the fewest accidents.

Plot 2

These two plots give us good insight into the safety level of each Aircraft Category ( or Aircraft Type if it’s easier to think of it that way ). The top chart shows us raw numbers of total injuries per Aircraft Type, and the second chart shows us the percentage of people that get injured when an accident occurs in that type of aircraft.

We can see that the Rocket category has a 100% injury rate, indicating that anyone involved in a crash aboard a Rocket was injured, which lines up with our expectation that Rockets are dangerous to fly in. After that come Powered Parachutes and Gyrocraft, which are both single-manned vehicles, so we expect to see a high injury rate in these categories. Helicopters seem to have an unusually high injury rate given how common they are, compared with something like a Blimp, which has a surprisingly small injury rate.

Interestingly enough, the overwhelming majority of injuries came from people flying in planes (18,040 total injuries), but planes had the lowest share of people injured per crash (26%), indicating that planes are the safest aircraft type to crash in.

Plot 3

This last chart in our report does a great job of showing the different types of injuries and their quantities with respect to when the accident occurred, using the Phase of Flight variable.

We can glean some interesting insights from this chart, like the Takeoff phase having the highest number of total injuries despite the fact that the Landing phase accounts for the most crashes. From this, we can deduce that if you are in an accident during the Takeoff phase, you have a higher chance of getting injured than during the Landing phase.

Finally, we see that most fatal accidents occur during the Maneuvering phase. This phase has been defined as “any flight phase occurring during 0 to 1000 feet above ground level”. The 2nd through 4th most fatally dangerous phases are Takeoff, Approach and Cruise.

Reflection

Thanks to the NTSB’s meticulous record keeping, we were able to explore nearly 20,000 accidents in the United States alone. We looked at how our data breaks down at the single-variable level by displaying the total number of crashes with respect to Engine Type, Make, Phase of Flight, Purpose of Flight, and Aircraft Category.

We dug deeper into our data by comparing pairs of our factors against one another with the psych package’s pairs.panels, which showed the strongest correlations occurring for Phase of Flight and Purpose of Flight.

We explored some decision trees showing survivability across a variety of variables, including whether the aircraft was amateur built and the probability of a fatality during different Phases of Flight.

In our final plots section, we mapped various factors onto a grid representing the United States. This gave us a good visual indication of where these types of accidents occurred and how many people were injured at each point.

Despite the scary-looking graphs shown in this report, air travel remains one of the safest methods of transportation, and as we have shown, public airline transportation in particular has proven to be very safe.

Personal Reflection

This project has been extremely enlightening for me on a personal level. It has shown me that data usually comes as a blob of play-dough, and that your job as a data analyst is to press that play-dough into a mold and show it from different angles in order to highlight things about the data that wouldn’t normally be visible.

As this was my first major project doing Exploratory Data Analysis (EDA), it took me a long time to manipulate the data into something that I could graph and showcase. A large part of the struggle has been wrapping my mind around the data and, more importantly, the transformations the data has to go through in order to display it and report on it. Getting started, or finding a starting point for transforming the data, is the biggest hurdle. Once I got started, the ideas and visualizations started to flow more quickly.

I would have liked to show more correlation between variables, for example that certain engine types had higher failure rates than others, but this data was only concerned with accidents. In order to show the details I wanted, I would need data for all flights, not only those that ended in crashes.

I would also have liked to show the crash rate per public carrier, to showcase which of the major United States airlines were safest, but alas - we did not have the data for it.

Where to go next

If we wanted to continue our investigation into aviation accidents through the ages, we could get even better results by combining this data with the flight data from faa.gov. We could join the two on flight number and start to get a picture of the ratio of accident-free flights to flights with accidents. We could also look at specific engine and airplane manufacturers, and try to correlate crashes as a whole with specific engines, aircraft makers and many other variables such as city/state, geographic location, and even time of day or season.