April 1, 2024 • 5 min (est.) • Vol. 81 • No. 7

The Hidden Rigors of Data Science

Mahmoud Harding

Rachel Levy

Not all students will become data scientists, but in an era dependent on data, they’ll need to know how to work—and think—using data.

Teaching Strategies

Header image credit: Hiraman / iStock

Hidden Figures, the Oscar-nominated biopic, shares the story of three female African American mathematicians who were hired by NASA during the Space Race to work as “human computers,” performing and verifying calculations by hand. These women—Katherine Johnson, Dorothy Vaughan, and Mary Jackson—were literally hidden away from the other NASA engineers. As a result, the outsized contributions they made to the fields of aerospace engineering and space exploration went unrecognized for decades.

In one scene in the movie, Katherine Johnson proposes employing an iterative method to solve a problem about the trajectory of a spacecraft. Iterative methods are a common way computers solve problems: they start with a guess and then improve the prediction or solution until it is “good enough.” Today, computers often solve problems using iterative techniques fueled by data. The methods by which that data is collected, processed, analyzed, modeled, and used in decision-making are what we call data science, which may eventually be taught throughout K–12.
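To make the idea of "start with a guess and improve it" concrete, here is a minimal Python sketch. It is not the method from the film or from Johnson's report; it is a classroom-style example that approximates a square root by averaging. The function name, the tolerance, and the step limit are all assumptions chosen for illustration.

```python
def iterate_sqrt(target, guess=1.0, tolerance=1e-6, max_steps=50):
    """Approximate the square root of target by repeatedly improving a guess."""
    steps = 0
    # Keep going until guess * guess is "good enough" (within tolerance of target).
    while abs(guess * guess - target) > tolerance and steps < max_steps:
        # Improve the guess by averaging it with target / guess.
        guess = (guess + target / guess) / 2
        steps += 1
    return guess, steps


estimate, steps_taken = iterate_sqrt(2.0)
print(f"estimate = {estimate:.6f} after {steps_taken} steps")
```

The tolerance is where the judgment lives: a looser tolerance stops sooner with a rougher answer, and a tighter one takes more steps.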

If you were to look at a key paper Katherine Johnson co-wrote in 1960, you’d probably think it looks like “rigorous math.” It features symbols, Greek letters, and lots of equations. But the “rigor” in that paper wasn’t only in the mathematical computations. Johnson, Vaughan, and Jackson also analyzed test flight data and used the results to make decisions. Their mathematical ideas were applied to solve problems that, prior to the first manned space flight, had never before existed. They needed to make precise estimates of potential error, apply theory to a new and ambiguous problem with many unpredictable factors, account for multiple contingencies—and ultimately produce a numeric result.

What Do We Mean by Rigor?

When people talk about a rigorous class, they sometimes use the words rigorous and difficult interchangeably, or equate rigor with precision. These definitions don’t capture the ways in which subjects like data science are rigorous. The idea of rigor as difficult or precise also doesn’t reveal what the concept of rigor often means in scientific research. In subjects like statistics and data science, which provide quantitative foundations for trustworthy research in the sciences, a practice that’s rigorous needs to account for inevitable error and variability—which is different from not having any error or variance at all. It’s important that teachers have a grasp of what rigor means in data science.

In mathematics, rigor is a prized concept because an argument that isn’t rigorous falls apart. The mathematical form of argumentation known as a proof (used to establish the validity of a mathematical statement) is one standard of mathematical rigor. Proofs often start with axioms, things assumed to be true but not proven (for example, “parallel lines do not cross”). The person creating the proof then establishes from those axioms a logical chain of ideas that leads to a general result: a proven truth (a theorem) that can be used in theoretical and applied mathematics. The proof includes a series of statements that follow each other in a logical order to show that if the axioms are true, the theorem must also be true. Understanding proofs is important in some K–12 math classes, such as geometry.

Proofs are considered rigorous because they are thorough, sound, logical, reasoned, and “airtight.” There are no errors and no ways to find fault or “break” the proof so that the theorem isn’t true. In high school mathematics, students are frequently tasked with finding answers to math problems that rely on axioms, theorems, and proofs. These problems are often structured so there is little to no variation in their answers or in the “steps” used to arrive at them. As a result, many people assume that answers within the context of math problems can only exist on the narrow spectrum of right or wrong.

In academic research, industry, and especially in life, people use data to solve problems that don’t always have a single, correct and precise answer. We often need answers that are “good enough” instead of exact. For instance, you need to choose shoes in a size that fits well enough—which may mean fitting comfortably when you buy them, but can mean real comfort only after they have time to “break in.”

To understand rigor in data science, one needs to comprehend what rigor looks like in its applied domains. Data science instruction in secondary school should represent the way rigor is understood and practiced in the STEM workforce. Many jobs outside of STEM, such as sports management, can also be data-intensive.

Four Ideas Illuminate Rigor in Data Science

In data science, rigor is manifested in ways that often appear different from what rigor looks like in high school courses like AP Calculus or AP Statistics. These courses are indeed rigorous. But data science, which includes elements from those courses and more, is also rigorous in its own right. The true rigor of data science, much like the contributions made by the protagonists of Hidden Figures, is often concealed, but four general ideas can shed light on it.

1. Rigorous data isn’t the same as “truth.”

Within the context of data science, rigorous data is authentic, unbiased, and truly representative. Often, the data needs to be a random sample, because it’s too time-consuming, expensive, or simply impossible to collect all the data related to a phenomenon. If the data collection method isn’t rigorous, then everything else is unlikely to be useful. This kind of rigor isn’t always obvious in educational settings because students are often only exposed to “clean data”—like in predefined experiments in a textbook. Students rarely encounter (or are challenged to collect) data sets containing three or more variables; data sets that contain quantitative and qualitative variables; or data composed of words, pictures, or sound. But these types of information feed the decision-making processes in industry and government.
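As a rough illustration of why a random sample can stand in for data that is too large to collect in full, the following Python sketch draws a simple random sample from a simulated population and compares the two means. The population is generated by the code, not taken from real measurements, and the sizes (10,000 and 200) are arbitrary assumptions.

```python
import random
import statistics

random.seed(0)  # fixed seed so the sketch gives the same result each run

# Hypothetical "population": 10,000 simulated heights in cm (not real data).
population = [random.gauss(165, 10) for _ in range(10_000)]

# Draw a simple random sample instead of measuring everyone.
sample = random.sample(population, 200)

population_mean = statistics.mean(population)
sample_mean = statistics.mean(sample)

# Standard error: how much the sample mean is expected to vary from sample to sample.
standard_error = statistics.stdev(sample) / len(sample) ** 0.5

print(f"population mean: {population_mean:.1f}")
print(f"sample mean:     {sample_mean:.1f} (standard error {standard_error:.1f})")
```

The sample mean will not match the population mean exactly. Reporting the standard error alongside it is one way of accounting for that variability rather than pretending it away.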

2. Decision-making requires both data and models.

A fundamental element of data science is utilizing data along with a combination of technology, mathematics, and statistics to gain insight or make decisions, often using a model. You might be familiar with a simple model, like the line y = 4x, which describes a relationship between two variables x and y with a slope of 4: every time you increase x by 1, y increases by 4. Some data can be described well enough by a linear model. For example, if you take a roomful of people and compare their arm spans (x) to their heights (y), in general all the “points” of these measurements will line up pretty well. If you then draw the line that comes closest to all the points, you won’t hit them all, but you’ll see you have a pretty good linear model, a rigorous option that describes the data well without doing something overly complicated to hit all the points. In STEM, we learn a whole toolkit of mathematical functions like lines, exponentials, and trig functions that can model many phenomena in science and in daily life.
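The arm span and height example can be sketched in a few lines of Python. The measurements below are invented for illustration, not collected from real people, and ordinary least squares is one common way to choose a best-fitting line rather than the only one.

```python
# Hypothetical measurements in cm (invented for illustration, not real data).
arm_spans = [150, 158, 163, 170, 176, 182]
heights = [152, 157, 165, 169, 178, 181]

n = len(arm_spans)
mean_x = sum(arm_spans) / n
mean_y = sum(heights) / n

# Least-squares slope and intercept for the line y = intercept + slope * x.
sum_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(arm_spans, heights))
sum_xx = sum((x - mean_x) ** 2 for x in arm_spans)
slope = sum_xy / sum_xx
intercept = mean_y - slope * mean_x

# Residuals: how far each measured height sits from the line's prediction.
residuals = [y - (intercept + slope * x) for x, y in zip(arm_spans, heights)]

print(f"height is roughly {intercept:.1f} + {slope:.2f} * arm span")
print("largest miss (cm):", round(max(abs(r) for r in residuals), 1))
```

The line misses every point by a little, and the residuals show by how much. Looking at those misses is part of judging whether the model is good enough.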

However, much of real life is multivariable. For example, if you want to choose a restaurant, you’ll likely consider many variables: the type of food, location, cost, hours of operation, and ratings. If you’re with a group, you’ll need more than just data on various restaurants; you’ll need to know the preferences of your group. People have a lot of practice doing these types of calculations in their heads, while balancing priorities and constraints, such as others’ preferences. We don’t have to write down a formula to decide where to eat. But for a business that has to make decisions involving thousands of calculations with different variables and constraints, an informal solution isn’t going to work. A model is needed. In data science, many models are still based on linear models, but with many more variables and many more equations. And just like with the arm span and height example, where you needed enough data points to conclude that they were lining up, when you have more variables and more equations, you’re going to need more data. Handling all this data requires more sophisticated tools.
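One simple way to turn the restaurant decision into a model is a weighted score, which is a linear model with several variables. The sketch below is only an illustration: the restaurant names, the 0 to 10 scores, and the weights standing in for a group's priorities are all made up.

```python
# Hypothetical restaurants scored 0-10 on each variable (invented values).
restaurants = {
    "Noodle House": {"food": 8, "location": 6, "cost": 9, "rating": 7},
    "Taco Stand": {"food": 7, "location": 9, "cost": 8, "rating": 6},
    "Bistro": {"food": 9, "location": 5, "cost": 4, "rating": 9},
}

# Hypothetical group priorities; the weights sum to 1.
weights = {"food": 0.4, "location": 0.2, "cost": 0.3, "rating": 0.1}


def score(features, weights):
    """Linear model: multiply each variable by its weight and add them up."""
    return sum(weights[name] * features[name] for name in weights)


scores = {name: score(features, weights) for name, features in restaurants.items()}
best = max(scores, key=scores.get)

for name, value in scores.items():
    print(f"{name}: {value:.1f}")
print("top choice:", best)
```

Changing the weights changes the winner, which is a small reminder that a model's output depends on the priorities built into it.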

This is why people studying data science may start with spreadsheets but eventually want tools that can store and process more complicated information. The reality is that some models these days are getting so complicated that it’s difficult for humans to completely understand how the models work, and when they are going to fail. To develop transparent, trustworthy data science and artificial intelligence models, our STEM students are going to need to make connections between the way those algorithms work and the foundational concepts in STEM. Most of us won’t be the ones looking under the hood of those algorithms, but we will still need to test and double-check the output and decide whether the solution provided should be trusted.

3. Everyone needs some grasp of data science—and programming.

While not all students will be interested in becoming the algorithm developers of the future, all young people will use these algorithms, such as when they rely on an online recommendation system to choose a restaurant or navigate the route to get there. Having some insight into the data science processes used to develop algorithms, an idea of how they work, and a sense of how to test their answers for accuracy (and whether any solutions provided should be trusted) is essential, and it is something many of us lack.

Teachers may wonder whether all K–12 students need to learn computer science. We would answer that all students should have an understanding of how programming works. Artificial intelligence (AI) may change how we program computers, but a conceptual understanding of how programming works, what it can and can’t do, and how to test a computer program for flaws will remain essential skills. And for students going into the data science profession, learning some computer programming skills is a must.

The incorporation of technology doesn’t reduce the rigor involved when students are learning data science. Computers and software allow us to explore data, visualize data (including data with multiple variables that hold numbers, words, sounds, or even pictures), and simulate experiments. In this context, rigor requires using technology appropriately, learning how to iterate through versions of a solution, and evaluating whether you can declare victory because your solution is “good enough.”

4. Data science demands understanding of context.

Data science requires that students understand the domain of the data and the context of the problem. For the mathematicians portrayed in Hidden Figures, that context was space flight. Katherine Johnson wasn’t an astronaut. But she needed to develop a working knowledge of the discipline of aerospace engineering as it related to the context she was working within. Her data analysis and mathematical results had to be useful for astronauts and accurate enough to help them safely return home.

Data science skills could be taught in the context of many subjects. Stand-alone data science courses were first taught as an elective in a limited number of high schools. Currently, in our state of North Carolina, data science appears in computer science requirements, while in neighboring Virginia, it appears in math requirements. However a state or district ushers in data science courses, offering them can provide students opportunities to engage in rigorous data science practices and prepare to navigate the complexities of a society that’s become dependent on data.

In traditional mathematics instruction, students are given steps to solve specific problem types and expected to learn how to solve these problems using only those steps. However, in data science education, it’s imperative that students move beyond “toy problems” that lack real data and that rely on predefined steps. Students need to tackle scenarios that reflect authentic challenges—and this means exposing students to datasets where they can genuinely practice the key components of data science.

It will also require sharpening students’ critical thinking. Regardless of how educators think data science can or should be taught in K–12 education, we know college professors want to teach students who are adept at critical thinking and analysis. The workforce requires the same skills. This will be even more true as artificial intelligence allows repetitive and menial tasks to be automated. Students can develop critical thinking and analysis skills by using a data-science methodology that accounts for the major phases of problem solving: exploration, prediction, and inference. Each phase of this methodology in and of itself is connected to mathematics and aligned to the thoughtful, responsible, and rigorous use of technology.

Infusing Data Science into K–12 Education

The reality is, not enough students have access to computer science courses, even as AI is changing the nature of computation and programming. We also can’t redirect all math or science teachers into data science because we still need to teach the fundamentals of core STEM subjects. Fortunately, there are lots of ways to approach data science in current instruction. Teachers in many subjects can infuse certain lessons or units with data explorations and data storytelling so students can make sense of data, understand the context around it, and use data to gain insight about the world and facilitate responsible decision-making. For example, a history teacher teaching about the industrial revolution could explore many kinds of data with students, such as the change in population around that time, how the foods people ate changed, and how the types of pollution in the environment shifted. Students could explore data in some aspect that interests them and—with support from their math teacher—make connections through visualizations and modeling.

In a world where algorithms make recommendations about everything from who gets a loan to who gets an organ transplant, it’s imperative that we introduce all students to data science, giving them a foundation to navigate data.

Even though Katherine Johnson and her colleagues were “hidden,” the evidence of their contributions to the space race became visible, at least to NASA engineers, once a manned rocket safely returned to Earth from outer space. Eventually a big-screen motion picture spread awareness of their work, inspiring a new generation of programmers and scientists. If we want to reveal the possibilities of data science to our students, we’ll all need to grow and explore new forms of rigor.

End Notes

1 See the PDF of Skopinski and Johnson's report for NASA here.

2 National Academies of Sciences, Engineering, and Medicine, Division of Behavioral and Social Sciences and Education, and Board on Science Education. (2023). Foundations of data science for students in grades K-12: Proceedings of a workshop. National Academies Press.

Mahmoud Harding is instructional design director for Data Science 4 Everyone and a former math and data science teacher at the North Carolina School of Science and Mathematics.

Rachel Levy is executive director of the Data Science Academy, a professor of mathematics at North Carolina State University, and former deputy executive director of the Mathematical Association of America (MAA).