Introduction

The Financial Benefits of Your Major#

Welcome to Data 88E: Economics Models! This class will explore the intersection of Data Science and Economics. Specifically, we will utilize methods and techniques in data science to examine both basic and upper-division Economics concepts. Throughout the course, we will consider a variety of economic problems both on the macro and micro level.

In the first demo of the course, we hope to give you a sense of the types of problems you can expect to explore this semester by considering a problem that may be of personal relevance to you: the post-graduate incomes of different majors at Cal.

We will be using various visualization techniques to analyze the median incomes of different majors at UC Berkeley, in addition to the median incomes of those same majors at other colleges. If you forgot, the median income is the “middle” value: if you sorted all the individual incomes of a major in ascending order, the median would be the value that’s exactly in the middle. The median is also called the 50th percentile: half of the individuals have an income lower than the median.

Do not be concerned if you don’t understand the code below: this entire exercise is purely a demo to motivate many profound concepts in Economics. If you’re interested, you may choose to come back to this demo at the end of the course and consider all the different techniques utilized in it - it’d be a great way of reflecting upon how much you’ve learnt!

Data Collection#

Before we can use data science to tackle any issue, we must, well, obtain data (kind of mind-boggling, I know). Recall that we want to examine the median incomes of different majors at UC Berkeley as well as the median incomes of those same majors at other colleges. The term ‘other colleges’ is a fairly general one, and in this case we shall consider the average median incomes of those majors at all other colleges in the United States.

In order to obtain a dataset, you can either collect it yourself (via surveys, questionnaires, etc.) or you can use datasets that others have gathered for you. In this demo, we are combining 3 different datasets:

The median income for each major at Cal was obtained from Cal’s 2019 First Destination survey.
The median income for each major overall was obtained from the American Community Survey (ACS) from 2010 to 2012, a very popular data source for Economics research! In the survey, ACS essentially calls college graduates and asks them their income as well as what they majored in at college. (As a side note, FiveThirtyEight later published the article The Economic Guide to Picking a College Major using the results of the survey.) In this project, we will be using a modified version of the ACS survey - we will only be looking at the respondents who are 28 or younger. Can you think of why we would do this?
The longitudinal data on long-run outcomes of UC Berkeley alumni was obtained from the University of California webpage. We will use this dataset later for a slightly different analysis.

Take a moment to consider the ways in which the 3 different datasets were created. Is it fair to draw direct comparisons between the datasets? What would be some potential issues and how could the differences in our datasets affect our analysis?

Mean vs Median#

Before proceeding further, it is important to consider why we are choosing to look at the median, and not the average, income. In order to answer this question, let us think about what the distribution of incomes for a population would look like. Most likely, you would see a large number of incomes around or slightly below the mean, with a few massive outlier incomes above the mean. For example, consider a theatre major who becomes a star on Broadway - while they’d be doing absolutely fantastic in their career, they are not representative of the typical theatre graduate from Berkeley and would likely pull the average income way up. For this reason, using the median is more robust: it gives us a better idea of what the typical graduate for any major can generally expect to earn.
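To see the Broadway effect in numbers, here is a minimal sketch. The incomes are made up purely for illustration (they are not from any of our datasets), and it uses Python's built-in `statistics` module rather than the `datascience` tables used in the rest of this demo.

```python
from statistics import mean, median

# Hypothetical incomes for nine graduates of one major (illustrative, not real data).
typical_incomes = [38000, 41000, 43000, 45000, 46000, 48000, 50000, 52000, 55000]

# Add one hypothetical Broadway star who earns far more than everyone else.
with_star = typical_incomes + [2_000_000]

print("Without the star: mean =", round(mean(typical_incomes)),
      "| median =", median(typical_incomes))
print("With the star:    mean =", round(mean(with_star)),
      "| median =", median(with_star))
```

A single outlier moves the mean by a large amount, while the median barely changes.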

Now we’ll load in all the data. Take a look at the tables for each dataset. Note that P25th refers to the 25th percentile of incomes (the income level at which 25% of incomes are lower) and P75th refers to the 75th percentile of incomes (the income level at which 75% of incomes are lower). You may not know what all the different columns in the tables mean. That’s okay! As data scientists, we often encounter a lot of irrelevant data that we will discard later.

# Import the datascience library, which provides the Table class
from datascience import *

# Load in table of all majors' median incomes at Cal
cal_income = Table.read_table("cal_income.csv")
cal_income.show(10)

Major	Cal Median	Cal P25th	Cal P75th
American Studies	55000	41600	60000
Anthropology	41600	36500	51000
Applied Mathematics	80004	65000	108000
Landscape Architecture	52000	45760	60000
Art	48880	38640	56390
Astrophysics	60000	50800	77182
Bioengineering	71000	54997	86500
Business Administration	75000	65000	85000
Chemical Biology	49920	44000	68000
Chemical Engineering	70000	65000	80000

... (39 rows omitted)

# Load in table of all other universities' average major median incomes
other_income = Table.read_table("recent-grads.csv") 
other_income.show(10)

Rank	Major_code	Major	Total	Men	Women	Major_category	ShareWomen	Sample_size	Employed	Full_time	Part_time	Full_time_year_round	Unemployed	Unemployment_rate	Median	P25th	P75th	College_jobs	Non_college_jobs	Low_wage_jobs
1	2419	PETROLEUM ENGINEERING	2339	2057	282	Engineering	0.120564	36	1976	1849	270	1207	37	0.0183805	110000	95000	125000	1534	364	193
2	2416	MINING AND MINERAL ENGINEERING	756	679	77	Engineering	0.101852	7	640	556	170	388	85	0.117241	75000	55000	90000	350	257	50
3	2415	METALLURGICAL ENGINEERING	856	725	131	Engineering	0.153037	3	648	558	133	340	16	0.0240964	73000	50000	105000	456	176	0
4	2417	NAVAL ARCHITECTURE AND MARINE ENGINEERING	1258	1123	135	Engineering	0.107313	16	758	1069	150	692	40	0.0501253	70000	43000	80000	529	102	0
5	2405	CHEMICAL ENGINEERING	32260	21239	11021	Engineering	0.341631	289	25694	23170	5180	16697	1672	0.0610977	65000	50000	75000	18314	4440	972
6	2418	NUCLEAR ENGINEERING	2573	2200	373	Engineering	0.144967	17	1857	2038	264	1449	400	0.177226	65000	50000	102000	1142	657	244
7	6202	ACTUARIAL SCIENCE	3777	2110	1667	Business	0.441356	51	2912	2924	296	2482	308	0.0956522	62000	53000	72000	1768	314	259
8	5001	ASTRONOMY AND ASTROPHYSICS	1792	832	960	Physical Sciences	0.535714	10	1526	1085	553	827	33	0.0211674	62000	31500	109000	972	500	220
9	2414	MECHANICAL ENGINEERING	91227	80320	10907	Engineering	0.119559	1029	76442	71298	13101	54639	4650	0.0573423	60000	48000	70000	52844	16384	3253
10	2408	ELECTRICAL ENGINEERING	81527	65511	16016	Engineering	0.19645	631	61928	55450	12695	41413	3895	0.0591738	60000	45000	72000	45829	10874	3170

... (163 rows omitted)

To make direct comparisons across majors, we combined all the tables above into a single one for us to use below.

majors = Table.read_table("cal_vs_all.csv")
majors.show(10)

Index	Major	Major Category	Median Income Difference	Cal P25th	Cal Median	Cal P75th	Overall P25th	Overall Median	Overall P75th
1	American Studies	Humanities & Liberal Arts	15000	41600	55000	60000	30000	40000	42000
2	Anthropology	Humanities & Liberal Arts	13600	36500	41600	51000	20000	28000	38000
3	Applied Mathematics	Computers & Mathematics	35004	65000	80004	108000	34000	45000	63000
4	Art	Arts	18380	38640	48880	56390	21000	30500	41000
5	Astrophysics	Physical Sciences	-2000	50800	60000	77182	31500	62000	109000
6	Bioengineering	Engineering	13900	54997	71000	86500	40000	57100	76000
7	Business Administration	Business	37000	65000	75000	85000	29000	38000	50000
8	Chemical Biology	Biology & Life Science	12520	44000	49920	68000	29000	37400	50000
9	Chemical Engineering	Engineering	5000	65000	70000	80000	50000	65000	75000
10	Chemistry	Physical Sciences	9000	37220	48000	61000	30000	39000	49900

... (39 rows omitted)

Our combined table above dropped the columns in the tables above that we didn’t need to conduct our exploration. It has a column Median Income Difference: this column is the Berkeley median income minus the overall median income for each major. It gives us a sense of the value of Cal over the average university: the difference is the additional income we receive from obtaining a Cal degree.
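The Median Income Difference column is a simple subtraction. As a rough sketch, the snippet below recomputes it in plain Python for three rows copied from the combined table; the variable names are our own and are not part of the course code.

```python
# (major, Cal median, overall median), copied from the combined table above.
rows = [
    ("American Studies", 55000, 40000),
    ("Applied Mathematics", 80004, 45000),
    ("Astrophysics", 60000, 62000),
]

# Median Income Difference = Cal median minus overall median.
differences = {major: cal - overall for major, cal, overall in rows}

for major, diff in differences.items():
    print(f"{major}: {diff:+,}")
```

A negative value, as for Astrophysics, means the Cal median is below the overall median in these datasets.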

Before moving forward, take a second to consider how well the above tables would match with each other. For example, Electrical Engineering and Computer Science (EECS) is a popular major at Berkeley. However, the ACS dataset didn’t have a direct equivalent for it. Instead, it had Electrical Engineering, Electrical Engineering Technologies and Computer Engineering as separate majors. Since in theory students in EECS focus more on computer engineering, we chose to use the computer engineering data for drawing comparisons in our final, combined table. However, there’s room for ambiguity here and that is another potential flaw in our exploration!
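One way to picture this matching step is a hand-written lookup from Cal major names to ACS major names. The sketch below is only an illustration: the dictionary and function names are hypothetical, the medians are copied from the tables in this demo, and it is not necessarily how the combined table was actually built.

```python
# Overall (ACS) medians keyed by ACS major name; values taken from the tables in this demo.
acs_median = {
    "CHEMICAL ENGINEERING": 65000,
    "NUCLEAR ENGINEERING": 65000,
    "ASTRONOMY AND ASTROPHYSICS": 62000,
    "COMPUTER ENGINEERING": 60000,
}

# Cal medians keyed by Cal major name.
cal_median = {
    "Chemical Engineering": 70000,
    "Nuclear Engineering": 62850,
    "Astrophysics": 60000,
    "Electrical Eng & Comp Sci": 115000,
    "L&S Data Science": 111500,
}

# Hypothetical hand-written crosswalk. Each entry is a judgment call.
cal_to_acs = {
    "Chemical Engineering": "CHEMICAL ENGINEERING",
    "Nuclear Engineering": "NUCLEAR ENGINEERING",
    "Astrophysics": "ASTRONOMY AND ASTROPHYSICS",
    # EECS has no direct ACS equivalent, so Computer Engineering stands in for it.
    "Electrical Eng & Comp Sci": "COMPUTER ENGINEERING",
    # The ACS counterpart used for L&S Data Science is not stated in this text,
    # so this sketch deliberately leaves it unmatched.
}

def overall_median_for(cal_major):
    """Return the ACS median matched to a Cal major, or None if no match is listed."""
    return acs_median.get(cal_to_acs.get(cal_major))

differences = {}
for major, cal in cal_median.items():
    overall = overall_median_for(major)
    if overall is not None:
        differences[major] = cal - overall

print(differences)
```

Writing the crosswalk out explicitly makes every matching decision visible, which is exactly where the ambiguity described above enters the analysis.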

The below graph displays all the median salaries for all the majors in our dataset side by side. Feel free to look at the values for a few seconds - do you find anything interesting? [Following image is a bar chart of Median Incomes for Selected Majors]

The most lucrative major#

Let’s imagine that Belfort is a freshman at Berkeley. Belfort is a unique individual - he is someone who would be equally good at all majors and enjoys all majors equally. He also believes that money is the most important thing in the world. These assumptions are vast oversimplifications and not the case for anyone in real life; but in Economics you’ll find that we end up making significant simplifications to abstract away all the potential complications!

Since Belfort is equally good at and happy with all majors at Cal, the major he chooses is purely dependent on how much money he can earn after college with that major. Therefore, he will choose the major with the highest median income.

Let us sort our table by the Cal Median column in descending order to see which major that would be.

majors.sort('Cal Median', descending=True).show(10)

Index	Major	Major Category	Median Income Difference	Cal P25th	Cal Median	Cal P75th	Overall P25th	Overall Median	Overall P75th
15	Electrical Eng & Comp Sci	Engineering	55000	105000	115000	120000	45000	60000	75000
28	L&S Data Science	Computers & Mathematics	70200	90000	111500	120000	20000	41300	46000
27	L&S Computer Science	Computers & Mathematics	57000	103000	110000	120000	39000	53000	70000
30	L&S Ops Research & Mgmt Sci	Business	35250	75000	86250	106500	38000	51000	60000
39	Music	Arts	51000	51060	82000	121250	22300	31000	42000
3	Applied Mathematics	Computers & Mathematics	35004	65000	80004	108000	34000	45000	63000
48	Statistics	Computers & Mathematics	35000	61000	80000	107000	26700	45000	60000
35	Materials Science & Eng	Engineering	28000	72700	80000	84000	35000	52000	62000
24	Industrial Eng & Ops Rsch	Engineering	23000	72250	80000	87500	37900	57000	67000
16	Engineering Physics	Engineering	22000	65250	80000	100000	25000	58000	74000

... (39 rows omitted)

However, what if instead of just wanting the maximum amount of cash after college, Belfort was super proud of getting into Berkeley and instead wanted to maximise the amount of additional benefit he receives from attending Berkeley over other universities? Let’s sort our table by the Median Income Difference column in descending order to see which major would give Belfort the highest additional income from going to Berkeley.

Do any values in either of our sorted tables surprise you? Why might that be?

majors.sort("Median Income Difference", descending=True).show(10)

Index	Major	Major Category	Median Income Difference	Cal P25th	Cal Median	Cal P75th	Overall P25th	Overall Median	Overall P75th
28	L&S Data Science	Computers & Mathematics	70200	90000	111500	120000	20000	41300	46000
27	L&S Computer Science	Computers & Mathematics	57000	103000	110000	120000	39000	53000	70000
15	Electrical Eng & Comp Sci	Engineering	55000	105000	115000	120000	45000	60000	75000
39	Music	Arts	51000	51060	82000	121250	22300	31000	42000
7	Business Administration	Business	37000	65000	75000	85000	29000	38000	50000
30	L&S Ops Research & Mgmt Sci	Business	35250	75000	86250	106500	38000	51000	60000
3	Applied Mathematics	Computers & Mathematics	35004	65000	80004	108000	34000	45000	63000
48	Statistics	Computers & Mathematics	35000	61000	80000	107000	26700	45000	60000
12	Cognitive Science	Biology & Life Science	34000	53000	75000	105000	20000	41000	60000
43	Physics	Physical Sciences	29364	54996	74364	100000	30000	45000	68000

... (39 rows omitted)

Let’s now do the opposite and sort our table by the Median Income Difference column in ascending order.

majors.sort("Median Income Difference").show(10)

Index	Major	Major Category	Median Income Difference	Cal P25th	Cal Median	Cal P75th	Overall P25th	Overall Median	Overall P75th
40	Nuclear Engineering	Engineering	-2150	57625	62850	80675	50000	65000	102000
5	Astrophysics	Physical Sciences	-2000	50800	60000	77182	31500	62000	109000
9	Chemical Engineering	Engineering	5000	65000	70000	80000	50000	65000	75000
41	Nutritional Science	Health	6200	36500	41200	49980	26000	35000	45000
13	Conserv & Resource Stds	Agriculture & Natural Resources	6600	37580	41600	57020	25000	35000	42000
10	Chemistry	Physical Sciences	9000	37220	48000	61000	30000	39000	49900
22	Geography	Social Science	10000	35000	48000	60000	30000	38000	50000
32	L&S Social Welfare	Psychology & Social Work	10000	35520	40000	50000	25000	30000	35000
25	Integrative Biology	Biology & Life Science	10280	33280	43680	56160	24000	33400	45000
21	Forestry and Natural Resources	Agriculture & Natural Resources	10600	35360	45600	51000	28600	35000	48000

... (39 rows omitted)

As you can see in the table, the median income difference for both nuclear engineering and astrophysics is negative. This is rather peculiar, especially since Cal has great programs for both those majors. Why would Cal graduates earn less income than the graduates of the average college for those 2 majors?

Hint: consider what each of our datasets describes. Is it an apples-to-apples comparison?

The answer is that the dataset with the median incomes for all colleges actually described recent graduates (people who are 28 and younger), rather than just fresh college graduates. It is likely that most people who graduate with Nuclear Engineering and Astrophysics degrees from other colleges tend to stay in the industry for their fields and keep accumulating experience. This would explain why a dataset that includes data from 28-year-olds would have a higher median income than fresh Cal graduates in the same field.
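To get a feel for how much a few years of experience can matter, here is a small sketch. It borrows two median incomes of Cal computer science alumni from the Incomes Over Time table later in this demo, so it illustrates the general age effect only. It says nothing specific about Nuclear Engineering or Astrophysics.

```python
# Median incomes of Cal CS alumni at two ages, copied from the table later in this demo.
cs_median = {23: 106548, 28: 135852}

# Relative growth in the median between a typical first-job age and age 28.
growth = cs_median[28] / cs_median[23] - 1

print(f"Median income at 28 is {growth:.1%} higher than at 23")
```

A sample that mixes in people up to age 28 will therefore tend to show a higher median than a sample of fresh graduates, even for the same major.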

Generalization and Causal Inference#

While Belfort would be equally good at, and equally happy with, every major, the conclusions we drew for him may not generalize to anyone else.

If you decide to major in EECS, it does not mean that you will have more income post-grad than you would from majoring in anything else. For example, someone who is a fantastic artist but not very good at computer science would probably make far more money majoring in art than in EECS. In addition, just because EECS has the highest median income overall doesn’t mean that any individual is guaranteed a high income if they decide to major in EECS. Overall, our results are not completely generalizable to others.

The causal effect of majoring in EECS on incomes may also be overstated. Consider the possibility that those who major in EECS tend to come from families whose parents are already working in tech. As a result, these students will likely also find better jobs post-grad. This is what we call a confounding variable: it is positively correlated with both our treatment variable (majoring in EECS) and our outcome (post-grad income). If we do not observe this confounding factor, our analysis would likely overstate the effect of an EECS degree on post-grad incomes.
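The simulation below is a rough sketch of how a confounder inflates a naive comparison. Every number in it is an assumption chosen for illustration (the share of students with a parent in tech, the probabilities of choosing EECS, and both dollar effects); none of it comes from our datasets.

```python
import random

random.seed(0)

TRUE_EFFECT = 20_000    # assumed causal effect of majoring in EECS on income
FAMILY_EFFECT = 30_000  # assumed income boost from having a parent in tech
N = 20_000              # number of simulated students

records = []
for _ in range(N):
    parent_in_tech = random.random() < 0.3
    # Assumption: students with a parent in tech are more likely to choose EECS.
    eecs = random.random() < (0.7 if parent_in_tech else 0.2)
    income = (50_000 + TRUE_EFFECT * eecs + FAMILY_EFFECT * parent_in_tech
              + random.gauss(0, 10_000))
    records.append((parent_in_tech, eecs, income))

def avg(values):
    values = list(values)
    return sum(values) / len(values)

def income_gap(rows):
    """Average income of EECS majors minus average income of non-EECS majors."""
    return avg(y for _, t, y in rows if t) - avg(y for _, t, y in rows if not t)

# Naive comparison: ignores family background entirely.
naive_gap = income_gap(records)

# Adjusted comparison: compute the gap within each family background,
# then average the two gaps weighted by group size.
groups = [[r for r in records if r[0] == z] for z in (True, False)]
adjusted_gap = sum(income_gap(g) * len(g) for g in groups) / N

print(f"Naive gap:    {naive_gap:,.0f}")
print(f"Adjusted gap: {adjusted_gap:,.0f} (assumed true effect: {TRUE_EFFECT:,})")
```

The naive gap mixes the effect of the major with the effect of family background. The adjusted gap is only available here because the simulation lets us observe the confounder, which is often not the case with real data.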

A brief discourse on utility#

From seeing our results, you may be wondering: why doesn’t everyone just major in EECS then?

For Belfort, money is the only thing in the world that provides happiness. However, this is once again an oversimplification for the vast majority of people in the real world. People derive utility from sources other than just their bank account; often, we choose relatively lower-paying jobs in order to derive more happiness in our lives.

Utility is a measure of ‘satisfaction’ or ‘happiness’ when we consume a good, while disutility is a measure of ‘dissatisfaction’ or ‘harm’ when we consume a bad. In economics, one assumption we will always make is that people will always seek to maximize their utility and minimize their disutility.

However, we encounter diminishing marginal utility as we consume more and more of a good. For example, the utility of the first cookie you eat is likely much higher than the 20th cookie you eat. Similarly, we may encounter increasing marginal disutility as we consume more and more of a bad. For example, the disutility of the first hour doing something you dread is probably not as bad as the 10th hour of doing something you dread.
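Here is a minimal sketch of both ideas. The functional forms are assumptions picked only because they are simple: a square root for utility (each extra cookie adds less) and a square for disutility (each extra hour hurts more). Real preferences need not look like this.

```python
from math import sqrt

def utility(cookies):
    """Assumed utility from eating a number of cookies (concave)."""
    return sqrt(cookies)

def disutility(hours):
    """Assumed disutility from hours spent on a dreaded task (convex)."""
    return hours ** 2

def marginal(f, n):
    """Extra (dis)utility contributed by the n-th unit."""
    return f(n) - f(n - 1)

mu_first_cookie = marginal(utility, 1)
mu_twentieth_cookie = marginal(utility, 20)
md_first_hour = marginal(disutility, 1)
md_tenth_hour = marginal(disutility, 10)

print(f"Cookie 1 adds {mu_first_cookie:.2f} utility; cookie 20 adds {mu_twentieth_cookie:.2f}")
print(f"Hour 1 adds {md_first_hour} disutility; hour 10 adds {md_tenth_hour}")
```

Under these assumed functions, marginal utility falls with every cookie while marginal disutility rises with every hour.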

For many of us, money provides a source of utility, while working or the studying required to get there may provide a source of disutility. This presents a tradeoff: for some, the disutility from studying EECS or working as a programmer greatly outweighs the utility from the money. Perhaps another job in a different field requires significantly lower disutility to achieve, without much impact on the utility of the income associated with it. This phenomenon would explain why not everyone decides to major in EECS!

Incomes Over Time: Computer Science vs Economics#

So far our exploration of the data has considered a rather general timeframe of “post graduation”, but perhaps it would be more insightful if we dove in and took a look at how incomes differed across various ages. We can do so using the table below that consists of data collected by the UC ClioMetric History Project. By compiling a database of digital student transcripts of all UC undergraduates and then linking California wages to individuals in the transcript database, they were able to produce comprehensive dashboards that visualize demographics, major choices, and long-run incomes of each UC campus’s alumni over the past 70 years.

You can learn more about how they collected the data in Incomes Over Time, but for now we will just focus on a snapshot of the data. In particular, we are interested in exploring how percentile incomes compare between Cal alumni who majored in computer science versus those who majored in economics. Run the following cell to check it out.

cs_vs_econ = Table.read_table("cs_econ_by_age.csv")
cs_vs_econ.show(10)

Age	CS 90th Percentile	CS 75th Percentile	CS Median	CS 25th Percentile	Econ 90th Percentile	Econ 75th Percentile	Econ Median	Econ 25th Percentile
23	176659	137090	106548	79617	117187	90009	69737	52210
24	222420	157764	118886	92054	127454	100859	81425	62719
25	232527	165109	128178	103616	151641	114465	90069	69207
26	261114	175940	132967	105601	166973	125092	97294	74010
27	261653	172378	132357	104193	176418	131773	102361	75553
28	252461	175559	135852	107205	192756	140982	107022	76929
29	259180	180283	141214	111220	207451	153648	115540	82523
30	276090	193007	148817	118115	235170	168052	123457	87689
31	299140	203522	154232	121146	254526	177747	129407	90303
32	314279	212274	160000	126626	279246	187899	135302	93089

... (30 rows omitted)

Here we have a table where each row corresponds to a certain age. It begins with age 23 (typically how old people are when getting their first job out of college), and goes up until age 62 (when people retire from the workforce).

Let’s graph our data using some line plots: line plots are especially useful when trying to visualize trends that change over time.

Run the following cell to visualize the data. The yellow lines correspond to CS incomes, while the blue lines correspond to Econ incomes.

[Following image is a graph of “Comparing Incomes of Computer Science Majors vs Economics Majors”]

Let’s take a look at how the incomes compare at each percentile. The most notable trend is probably the fact that, for the most part, CS majors typically make more money than Econ majors. We especially observe this trend in the 25th and 50th percentiles, and this is consistent with our exploration earlier in the notebook, which showed that L&S Computer Science had a higher median income than pretty much every other major.

The graph above helps explain why CS is one of the most popular majors here at Berkeley. Software engineering and related disciplines have a huge demand for fresh talent, so individuals entering these fields are well compensated. This in and of itself is a big part of the appeal for students wanting to major in computer science.

Right off the bat out of college, fresh CS graduates are looking at a six-figure income: even the 25th percentile of CS majors is making over $103,000 by age 25. At any given percentile level, CS graduates outearn their Economics counterparts immediately post-graduation.
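The claim about every percentile can be checked directly against the age-23 row of the table above. The sketch below copies that row into plain Python lists; the variable names are our own.

```python
# Incomes at age 23, copied from the first row of the table above.
percentiles = ["90th", "75th", "Median", "25th"]
cs_at_23 = [176659, 137090, 106548, 79617]
econ_at_23 = [117187, 90009, 69737, 52210]

# CS income minus Econ income at each percentile.
gaps = {p: cs - econ for p, cs, econ in zip(percentiles, cs_at_23, econ_at_23)}

for p, gap in gaps.items():
    print(f"{p:>6}: CS leads Econ by ${gap:,}")

cs_leads_everywhere = all(gap > 0 for gap in gaps.values())
print("CS ahead at every percentile:", cs_leads_everywhere)
```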

Intertemporal Effects#

However, we don’t just choose our majors based on the immediate post-graduation income. Intuitively, we would want to maximize the total income we make over our lifetime. This idea is called intertemporal utility maximization: Economists often take into account one’s utility across all possible time periods.

Considering lifetime income, which major would be better? Well, it depends: if you are at the 25th or 50th percentile, CS is better. This suggests that the typical Economics major will likely never outearn the typical CS major.

However, we observe that Economics majors at higher percentiles ultimately overtake CS majors in earnings. Intuitively, this should make sense. Perhaps top Economics majors are more likely to become successful executives later on in their careers and make more money than top CS majors, who continue to work as software engineers or become engineering managers.
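As a sketch of what adding up income over time looks like, the snippet below sums the ten rows shown in the table above (ages 23 to 32). It is not a lifetime total: the rows for ages 33 to 62 are omitted from this page, so the later crossover at high percentiles cannot be confirmed from these numbers alone.

```python
# Incomes for ages 23 through 32, copied from the ten rows shown in the table above.
ages = list(range(23, 33))
cs_median = [106548, 118886, 128178, 132967, 132357, 135852, 141214, 148817, 154232, 160000]
econ_median = [69737, 81425, 90069, 97294, 102361, 107022, 115540, 123457, 129407, 135302]
cs_p90 = [176659, 222420, 232527, 261114, 261653, 252461, 259180, 276090, 299140, 314279]
econ_p90 = [117187, 127454, 151641, 166973, 176418, 192756, 207451, 235170, 254526, 279246]

# Cumulative earnings over the visible window for the median earner in each major.
cs_total = sum(cs_median)
econ_total = sum(econ_median)
print(f"Median earner, ages 23-32: CS ${cs_total:,} vs Econ ${econ_total:,}")

# Year-by-year gap (CS minus Econ) at the 90th percentile.
gap_p90 = [cs - econ for cs, econ in zip(cs_p90, econ_p90)]
for age, gap in zip(ages, gap_p90):
    print(f"Age {age}: 90th-percentile gap ${gap:,}")
```

Comparing totals rather than a single year is the point of thinking intertemporally: what matters is the sum over all periods, not the starting salary.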

Risk#

We observe that Economics majors have a much larger spread in their lifetime earnings than CS majors: top Economics graduates make significantly more than bottom Economics graduates, compared to top CS and bottom CS graduates.
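One simple way to put a number on spread is the ratio of the 90th-percentile income to the 25th-percentile income. That ratio is our own choice of measure for this sketch, not something defined in the dataset, and the figures are copied from the age-23 and age-32 rows of the table above.

```python
# (90th percentile, 25th percentile) incomes copied from the table above.
incomes = {
    ("CS", 23): (176659, 79617),
    ("Econ", 23): (117187, 52210),
    ("CS", 32): (314279, 126626),
    ("Econ", 32): (279246, 93089),
}

# Spread measured as how many times more a 90th-percentile earner makes
# than a 25th-percentile earner in the same major at the same age.
spread = {key: p90 / p25 for key, (p90, p25) in incomes.items()}

for (major, age), ratio in spread.items():
    print(f"{major:>4} at age {age}: 90th/25th ratio = {ratio:.2f}")
```

In these rows the Economics ratio is larger than the CS ratio at both ages, and it grows faster between 23 and 32.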

This brings to light the notion of risk. Majoring in CS carries relatively lower variance: whether you are a top-notch developer in the 75th percentile or a subpar developer in the 25th percentile, you will probably make a solid living. Since most CS majors typically end up pursuing similar careers in software development, there is less variance in their occupations. There is also less risk: most CS majors can expect to land a software engineering job and continue working as one twenty years into their career.

The same story is less true for Economics majors: a top-notch Economics major will significantly out-earn a subpar Economics major. This is partly due to the vastly differing occupations Economics majors go into: analysts, bankers, consultants, just to name a few (Economics is a very versatile degree, after all). Twenty years down the line, the top Economics majors will have higher ceilings, perhaps with a lot of opportunity for leadership or executive positions. There is a lot more variance in the types of jobs and thus salaries that Economics majors will attain. In addition, there is a lot more risk: not every Economics major will become a top executive, and most end up with more ‘average’ jobs that earn less than those of CS majors at a similar percentile.

This means that if you know you are at the top regardless of your major, you will tend to ultimately earn more income as an Economics major than a CS major. But a caveat: it’s impossible to guarantee that you will be in the top, no matter how hard you work. Often, life has uncertainties and involves a great deal of luck to become successful.

The table and graph above only considered the percentile incomes of CS and Economics majors, but if you’re interested in drawing more comparisons, check out Long Run Outcomes for UC Berkeley Alumni. You can compare between all types of popular majors at UC Berkeley, and even toggle around with more features like gender, ethnicity, or even specific courses.

An Afterword#

In this demo, we’ve covered a series of fundamental Economics ideas such as income, utility, and risk. We’ve also gone over a series of more statistical concepts that are also at the heart of empirical Economics such as causal inference, generalization, and unobserved variables. We did this all using data science techniques and visualizations, while being cognizant of potential sources of error. We will be revisiting a lot of these themes in later parts of the course.

From this exercise you may have noticed that pursuing Economics, Data Science, or both is a great option in terms of post-graduation incomes. This means you’ve probably chosen wisely to have ended up in this class.

Welcome to Data 88E!
