A visual inspection of Baseball Statistics through the ages

In this report we will take a look at baseball stats starting from 1871, the first year covered by the data set. We will explore visually how statistics like Runs, Hits and Home Runs correlate with wins. We will also look at how the school you come from affects your average salary, at which teams you would want to play for based on historical wins, and at how MLB player salaries have changed over the years.

The data in this report was compiled with the open source tool RapidMiner and displayed using the Python packages pandas, numpy, and seaborn in a Jupyter notebook from an Anaconda distribution.

Reading the columns from left to right, we see that the stat most strongly correlated with the sum(HR) column (the sum of home runs over a player's career) is RBIs (Runs Batted In), followed by Runs and At Bats. This lines up with our expectations: every home run also counts as a run and as at least one RBI, so players who hit more home runs accumulate more of both.

Salary's strongest correlation is with home runs, which again lines up with our expectation that the more home runs you hit, the higher your salary. The next strongest is with runs batted in, which baseball aficionados have traditionally cited as the statistic most responsible for wins.

Our next group of stats, Runs, Hits, Doubles and Triples, are all closely correlated, as we can see clearly in the heat map below (created with a seaborn heat map applied to a pandas correlation matrix).

This heat map, applied to our correlation matrix, allows us to see at a glance which stats each statistic is most highly correlated with.
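A minimal sketch of how a correlation matrix and heat map of this kind could be produced with pandas and seaborn. The column names and all of the numbers are made up for illustration and are not taken from the data set. The plotting helper is defined but not called, so the snippet runs without a display.

```python
import numpy as np
import pandas as pd

# Hypothetical career totals for five players (illustrative values only).
batting = pd.DataFrame({
    "HR":     [10, 120, 45, 300, 5],
    "RBI":    [60, 450, 200, 1000, 30],
    "R":      [80, 400, 250, 900, 40],
    "salary": [0.5e6, 3.0e6, 1.2e6, 9.0e6, 0.4e6],
})

# Pairwise Pearson correlation between every pair of numeric columns.
corr = batting.corr()


def plot_heatmap(matrix):
    """Draw a correlation matrix as an annotated heat map."""
    import matplotlib.pyplot as plt
    import seaborn as sns

    ax = sns.heatmap(matrix, annot=True, vmin=-1, vmax=1, cmap="coolwarm")
    ax.set_title("Correlation between batting statistics")
    plt.show()


# Blank out the diagonal (each column correlates perfectly with itself),
# then pick, for every statistic, the other statistic it tracks most closely.
off_diagonal = corr.mask(np.eye(len(corr), dtype=bool))
strongest = off_diagonal.idxmax()
```

Calling plot_heatmap(corr) in a notebook cell would draw the figure, and strongest gives the same "at a glance" answer as a table.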

Does School Choice Matter?

In this section, we asked the question, "Does school choice matter if you are trying to make it into Major League Baseball?"

First, let us take a look at the schools that have produced the most MLB players since the year 2000. This data was created by outer joining the Master table to the Salary table on playerId, and then joining the result to a grouped College table that counts the schoolId of each player whose debut was after 2000.
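The joins in this report were done in RapidMiner, but the same steps can be sketched in pandas. The example below is an approximation under stated assumptions: the three tiny tables are made up, the debut and salary column names are hypothetical, and "after 2000" is read as a debut year greater than 2000.

```python
import pandas as pd

# Tiny stand-ins for the Master, Salary and College tables (made-up rows).
master = pd.DataFrame({
    "playerId": ["p1", "p2", "p3", "p4"],
    "debut": pd.to_datetime(
        ["1998-04-01", "2003-06-10", "2005-09-02", "2010-04-05"]
    ),
})
salary = pd.DataFrame({
    "playerId": ["p1", "p2", "p3"],
    "salary": [500_000, 1_200_000, 800_000],
})
college = pd.DataFrame({
    "playerId": ["p1", "p2", "p3", "p4"],
    "schoolId": ["stanford", "stanford", "calstfull", "stanford"],
})

# The outer join keeps players even when they have no salary record (p4).
players = master.merge(salary, on="playerId", how="outer")

# Keep players who debuted after 2000 (assumed cutoff), one row per player.
recent = players[players["debut"].dt.year > 2000].drop_duplicates("playerId")

# Attach each player's school, then count players per school.
players_per_school = (
    recent.merge(college, on="playerId", how="inner")
    .groupby("schoolId")["playerId"]
    .count()
    .sort_values(ascending=False)
)
```

The outer join matters here: an inner join would silently drop players with no salary row and undercount their schools.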

In the bar graph above we can see that Stanford is tied with California State at Fullerton for the number one spot among schools producing MLB players since the year 2000. In fact, California schools take 7 of the top 15 spots over the last 15 years, which is about 47% of the top 15 schools.

Next, let's take a look at the average salary of players who debuted in the 2000s, grouped by the school they attended.

We see a different picture when sorting by the salary of players who have entered the major leagues since the year 2000. Missouri has the highest average salary at 1.6 million dollars a year, quite a bit higher than the rest of the schools, which quickly level out to ~$900K a year.

Many California-based schools are still present, which, as we will see below, is contrary to the longer-term picture of the average salary of MLB players based on their choice of college. The University of Texas at Arlington holds the number one spot as the school producing the highest average salary for baseball players of all time.

Wins by the numbers

In the chart below, we see the 15 winningest franchises in MLB history, sorted by number of wins (many of these franchises had several different team names over the years). This data was created by grouping on franchiseId, summing each team's wins, and sorting by those wins, in the process named teams_by_wins.
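As a rough pandas equivalent of the teams_by_wins process, the grouping and sorting could look like the sketch below. The rows are invented and the name and wins column names are assumptions; only franchiseId and teams_by_wins come from the description above.

```python
import pandas as pd

# Made-up season records; one franchise can appear under several team names.
teams = pd.DataFrame({
    "franchiseId": ["NYY", "NYY", "SFG", "SFG", "CHC"],
    "name": [
        "New York Highlanders",
        "New York Yankees",
        "New York Giants",
        "San Francisco Giants",
        "Chicago Cubs",
    ],
    "wins": [70, 100, 90, 95, 85],
})

# Sum wins per franchise, sort from most to fewest, and keep the top 15.
teams_by_wins = (
    teams.groupby("franchiseId")["wins"]
    .sum()
    .sort_values(ascending=False)
    .head(15)
)
```

Grouping on the franchise rather than the team name is what lets wins earned under older names count toward the same total.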

A look at Player Salaries by position

The first table in this section lists players sorted by their average salary over the span of their career, along with the total games they played at each position throughout their MLB career. The motivating question behind this section is, "Do certain positions, on average, get paid more than others?"

This data was created by aggregating the Appearances table to sum each player's totals at each position, then inner joining the result to the Master table on playerId. That grouped and joined Appearances+Master table was then inner joined to an aggregated Salaries table.
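The aggregation and the two inner joins described above can be sketched in pandas as follows. This is an illustration rather than the RapidMiner process itself: the rows are invented, and the nameLast and salary column names are assumptions.

```python
import pandas as pd

# Made-up stand-ins for the Appearances, Master and Salaries tables.
appearances = pd.DataFrame({
    "playerId": ["p1", "p1", "p2", "p2", "p3"],
    "G_1b": [100, 120, 0, 0, 0],
    "G_p":  [0, 0, 30, 35, 0],
    "G_of": [5, 0, 0, 0, 150],
})
master = pd.DataFrame({
    "playerId": ["p1", "p2", "p3"],
    "nameLast": ["Alpha", "Bravo", "Charlie"],
})
salaries = pd.DataFrame({
    "playerId": ["p1", "p1", "p2", "p2"],
    "salary": [1_000_000, 3_000_000, 500_000, 700_000],
})

# Career totals per position: one row per player.
games_by_position = appearances.groupby("playerId", as_index=False).sum()

# Average salary over each player's career.
average_salary = (
    salaries.groupby("playerId", as_index=False)["salary"]
    .mean()
    .rename(columns={"salary": "average_salary"})
)

# Inner joins drop any player missing from one of the tables (p3 has no salary).
career = (
    games_by_position.merge(master, on="playerId", how="inner")
    .merge(average_salary, on="playerId", how="inner")
    .sort_values("average_salary", ascending=False)
    .reset_index(drop=True)
)
```

Aggregating each table before joining keeps the result at one row per player, so season-level rows are not multiplied by the join.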

 G_1b  G_2b  G_p  G_c   G_3b   G_ss   G_of  average_salary
    2     0    0    0  1,193  1,272      0      17,972,202
1,395     0    0    0      0      0      0      15,525,500
1,659     0    0    0     15      0     32      14,703,846
    0     0  452    0      0      0      0      13,831,633
1,315     0    0    0      0      0      0      12,891,450
    0     0  212    0      0      0      0      12,580,818
    0     0  269    0      0      0      0      12,444,375
  293     0    0  920      0      0      1      12,418,750
  831     0    0    0    699      0    347      12,339,279
1,702     1    0    0    108      1    309      11,936,029
    0     0    0    0      0      0  1,614      11,589,234
    0     0  354    0      0      0      0      11,495,000
    0     0  318    0      0      0      0      11,471,500
    0     0  307    0      0      0      0      11,433,333
    0   766    0    0     11     10  1,092      11,425,714

As we see above, the 15 highest average salaries, grouped by the position each player played most often, break down as follows:

  • Pitcher (6)
  • First Base (5)
  • Outfield (2)
  • Short Stop (1)
  • Catcher (1)
  • Second Base (0)
  • Third Base (0)
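The tally above can be reproduced from the table with a short pandas sketch. It assumes that a player's position is the column with the largest value in that player's row, which is one reasonable reading of "played most often".

```python
import pandas as pd

columns = ["G_1b", "G_2b", "G_p", "G_c", "G_3b", "G_ss", "G_of"]

# Position totals copied from the table above, in the same row order.
rows = [
    [2, 0, 0, 0, 1193, 1272, 0],
    [1395, 0, 0, 0, 0, 0, 0],
    [1659, 0, 0, 0, 15, 0, 32],
    [0, 0, 452, 0, 0, 0, 0],
    [1315, 0, 0, 0, 0, 0, 0],
    [0, 0, 212, 0, 0, 0, 0],
    [0, 0, 269, 0, 0, 0, 0],
    [293, 0, 0, 920, 0, 0, 1],
    [831, 0, 0, 0, 699, 0, 347],
    [1702, 1, 0, 0, 108, 1, 309],
    [0, 0, 0, 0, 0, 0, 1614],
    [0, 0, 354, 0, 0, 0, 0],
    [0, 0, 318, 0, 0, 0, 0],
    [0, 0, 307, 0, 0, 0, 0],
    [0, 766, 0, 0, 11, 10, 1092],
]
games = pd.DataFrame(rows, columns=columns)

# Assumed rule: the column with the largest total is the primary position.
primary_position = games.idxmax(axis=1)

# Number of top-15 earners per primary position.
position_counts = primary_position.value_counts()
```

Positions that nobody in the top 15 played most often, here second base and third base, simply do not appear in position_counts.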

Average Salary over the years

In this last section we will explore how the average salary has changed over the years since 1985, the first year we have salary data for. We can see a sharp and steady increase starting in 1990, with a solid increase in most years and only a few small dips (in 1993 and 2004 the average salary actually decreased). In the first section of this report we built a correlation matrix but left out the year because it was largely irrelevant to that comparison. Had we left it in, however, we would see that the feature second most strongly correlated with salary (behind home runs) was in fact the year in which the player played. This data was created by aggregating the Salary table.

In the last graph of this report, I thought it would be fun to show how the disparity in salaries has changed over time. We can see in this graph that the standard deviation of salary (in red) outpaces the growth in average salary (in green) by quite a lot. This tells us that although the average salary grew over the years, the spread between the highest paid and lowest paid players grew even faster.
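To make the comparison between average and standard deviation concrete, here is a small pandas sketch of the yearly aggregation. The salary figures are invented and the year and salary column names are assumptions; only the idea of aggregating the Salary table by year comes from the report.

```python
import pandas as pd

# Made-up salaries for two seasons (illustrative values only).
salaries = pd.DataFrame({
    "year": [1985, 1985, 1985, 2015, 2015, 2015],
    "salary": [100_000, 400_000, 700_000, 500_000, 2_000_000, 12_500_000],
})

# Mean and sample standard deviation of salary for each year.
salary_by_year = salaries.groupby("year")["salary"].agg(["mean", "std"])

# How many times larger each measure is in 2015 than in 1985.
growth = salary_by_year.loc[2015] / salary_by_year.loc[1985]
```

In this toy data the average grows 12.5 times while the standard deviation grows by more than 20 times, the same pattern the graph shows: a few very large contracts pull the spread up faster than the mean.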


In this report we have explored the Baseball data set, which contains baseball statistics from 1871 to 2015 spread across 27 different tables. We were able to visualize the data by first joining, filtering, grouping and aggregating with RapidMiner, then using pandas, numpy, and seaborn to further manipulate the data and finally display it.

We looked at correlations between offensive statistics. We studied which schools were producing the most MLB players along with average salaries out of each school. We took a look at franchise wins, and which positions paid the most. And finally we finished the report with a look at how average MLB salaries changed over the years.