There’s More to Data Science than Math and Programming
By Aaron R. Williams and Claire McKay Bowen of the Urban Institute
🎥 Video Interview: DS4E Communications Specialist Shea Stripling speaks with Aaron R. Williams, Lead Data Scientist for Statistical Computing at the Urban Institute, and Claire McKay Bowen, Senior Fellow at the Urban Institute, about why data science requires more than math and programming, highlighting the real-world skills that shape meaningful and ethical data work and explaining how educators can introduce these skills to students in their classrooms.
Data-driven decision-making and the data that fuel it are important in research, business, and government. Since September 2025, we’ve chronicled the importance of federal statistics, where federal statistics come from, the many daily uses of federal statistics, and more.
Our posts have highlighted how data scientists and statisticians play a critical role in data-driven decision-making. People interested in becoming data scientists often hear a lot about the importance of math and programming. Those skills matter, but many other, often overlooked skills also lead to success in the field.
1. Answering Useful Questions
George E. P. Box famously wrote in 1976, “All models are wrong, but some are useful.” Nearly 50 years later, this statement remains relevant in the age of AI (the aphorism is even said to predate Box’s initial writings). Math and programming can answer many questions, but not always useful ones. An example is Zillow’s house-flipping[1] venture, which ended in 2021 after losing over $300 million in a single quarter due to inaccurate price predictions from its Zestimate models. Once boasting a median absolute percentage error of about 5% across 110 million homes, these models deteriorated over time, leading Zillow to overpay for properties and incur massive losses. The failure underscores five critical lessons for building models that answer useful questions:
· Data quality matters: small inaccuracies in data (e.g., number of rooms and distance from schools) can “snowball” into massive financial impacts.
· Ensure humans are in the loop: algorithms shouldn’t be the sole decision‐makers, especially in high‑stakes domains.
· Anticipate people gaming the system: market participants may manipulate data, so fraud‑detection is crucial.
· Adopt holistic modeling: forecasting should encompass not just property attributes, but also buyer behavior and demand dynamics.
· Consider external factors: cost inflation, labor shortages, and market shifts must be factored in.
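To make the accuracy metric mentioned above concrete, here is a minimal Python sketch of a median absolute percentage error. The sale prices and estimates are made-up numbers chosen for illustration; they are not Zillow data, and the function name is our own. The sketch shows how a median of about 5% can coexist with individual predictions that miss by far more.

```python
from statistics import median


def median_absolute_percentage_error(actual, predicted):
    """Median of |predicted - actual| / |actual|, expressed as a percentage."""
    if not actual or len(actual) != len(predicted):
        raise ValueError("actual and predicted must be non-empty and the same length")
    errors = [abs(p - a) / abs(a) * 100 for a, p in zip(actual, predicted)]
    return median(errors)


# Hypothetical sale prices and model estimates (illustrative values only).
sale_prices = [200_000, 350_000, 500_000, 275_000, 420_000]
estimates = [210_000, 340_000, 450_000, 280_000, 500_000]

mdape = median_absolute_percentage_error(sale_prices, estimates)

# The largest single miss, which the median does not reveal.
worst_miss = max(abs(p - a) / a * 100 for a, p in zip(sale_prices, estimates))

print(f"Median absolute percentage error: {mdape:.1f}%")
print(f"Largest single error: {worst_miss:.1f}%")
```

Because the median ignores the size of the worst misses, a model can look accurate on this metric while still overpaying badly on some homes, which is one reason the lessons above stress human review and data quality.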
2. Questionnaire Design
Data are not found; they are created. Gathering responses to questionnaires is a widespread process for creating data, so the design and wording of questionnaires are incredibly important. For example, in January 2003, Pew Research Center found that question wording had a major impact on responses. When asked whether they “favor or oppose taking military action in Iraq to end Saddam Hussein’s rule,” 68% of respondents said they favored military action while 25% said they opposed it. However, when asked whether they “favor or oppose taking military action in Iraq to end Saddam Hussein’s rule even if it meant that U.S. forces might suffer thousands of casualties,” responses were dramatically different: only 43% said they favored military action, while 48% said they opposed it.
The history of questionnaire design is full of these examples. In response, data scientists and statisticians have developed a rich set of tools for building questionnaires, including evidence-based ways of designing questions, cognitive interviews, and validation.
3. Subject Matter Expertise
Subject matter expertise (i.e., topical knowledge) remains essential to understanding how models will perform in the real world. Data scientists follow a series of best practices when developing models to ensure their models perform well on new data. How well a model performs on new data compared to the original data is called generalization.
Math and programming are still necessary for building models, but, as in our first section, subject matter expertise is necessary for knowing when the model will be useful or not. Google Flu Trends is a famous example of what can go wrong when applying a model to unseen data. Using historical data and 50 million common Google searches, Google developed a model that could predict flu-like illnesses. Their model could report data almost immediately while the CDC’s method took about two weeks. Speeding up information about flu-like illnesses could be crucial to understanding and remedying flu outbreaks. This was an incredible development and showed how massive corporate data could benefit the public good.
Google launched Google Flu Trends (GFT) in 2008 but eventually shut it down in 2015 after it failed to predict the 2009 flu pandemic and consistently overestimated illnesses in later years.
So, what happened? Changes in people’s search behaviors and changes to the Google search algorithm meant the model, which was trained on historical data, did not generalize to new data. Data science requires subject matter expertise, attention to detail, and persistence to ensure that what worked in the past works in the future.[2]
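The idea of generalization in this section can be sketched in a few lines of Python. This is a toy illustration under stated assumptions, not a reconstruction of Google Flu Trends: the data are invented, the model is a simple least-squares line, and the "new" data assume the underlying relationship shifted by a constant amount after the model was fit.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


def mean_absolute_error(xs, ys, intercept, slope):
    """Average absolute gap between observed values and the line's predictions."""
    return sum(abs(y - (intercept + slope * x)) for x, y in zip(xs, ys)) / len(xs)


# Hypothetical historical data: y is roughly 2 * x, with small alternating noise.
xs = list(range(1, 11))
noise = [0.5 if i % 2 == 0 else -0.5 for i in range(10)]
historical_ys = [2 * x + e for x, e in zip(xs, noise)]

# Hypothetical new data: same inputs, but the relationship has shifted upward by 5
# (an assumed stand-in for changed behavior after the model was built).
new_ys = [2 * x + 5 for x in xs]

intercept, slope = fit_line(xs, historical_ys)
historical_mae = mean_absolute_error(xs, historical_ys, intercept, slope)
new_mae = mean_absolute_error(xs, new_ys, intercept, slope)

print(f"Error on historical data: {historical_mae:.2f}")
print(f"Error on new data:        {new_mae:.2f}")
```

The model's error on the historical data stays small, while its error on the shifted data is many times larger. Nothing in the math flags the problem; recognizing that the world has changed is where subject matter expertise comes in.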
4. Ethics
Finally, just because a data scientist can do something doesn’t mean a data scientist should. Government statistical agencies like the U.S. Census Bureau and the Statistics of Income Division at the IRS use statistical disclosure control and disclosure review boards to protect individual privacy. Public-sector data science organizations use institutional review boards to ensure that any projects using human subjects data are ethical and responsible. In contrast, many private companies use data science to learn customers’ secrets to grow their businesses.
Math and programming are important skills for data science, but successful data science requires more than technical skills. It requires a clear understanding of the question being asked, an understanding of the process used to create the data, the subject matter expertise to ensure the solutions meet the needs of users, and the ethical knowledge to decide if the project is appropriate.
Closing note from DS4E:
These four skills map directly onto the K–12 Data Science Learning Progressions. Asking “useful” questions is exactly what Substrand D2: Problem Identification & Question Formation is about, making sure the question is worth answering before you build a model. Questionnaire design lives in Substrand B2: Designing for Data Collection and Substrand B3: Measurement & Datafication because the way we ask, measure, and record shapes everything that follows. Subject matter expertise shows up in Substrand C5: Models of Data, where context determines whether a model will actually hold up as the world changes. And ethics is central to Concept A2.1: Data Use Risks & Benefits, reminding us that data work always carries real consequences for real people.
[1] “In finance, flipping is purchasing an asset to quickly resell (or “flip”) it for profit. Within the real estate industry, the term is used by investors to describe the process of buying, rehabbing, and selling properties for profit.” Accessed January 2, 2026. https://en.wikipedia.org/wiki/Flipping
[2] Check out the CDC FluSight Challenge that encouraged academic, industry, and government forecasting teams to develop models to forecast the influenza season. https://www.cdc.gov/flu-forecasting/evaluation/2024-2025-report.html

