Overview: Data Availability
The foundation of any robust machine learning model lies in the quality and relevance of the data it is trained on. Decisions made throughout data acquisition, collection, and ingestion directly impact a model's ability to learn, predict, and generalize. These processes encompass not only the gathering of data but also its preparation and integration into the modeling workflow.
Source of Data
Identifying reliable data sources is the first step in the data acquisition process; it helps establish which models and approaches can be used to achieve the desired business outcomes from the data. Organizations and individuals that provide data do so from myriad contexts, including academic research, industry-specific datasets, and open-source platforms. For instance, the U.S. Census Bureau offers demographic data that uncovers insights into population trends and societal changes. Such datasets are invaluable for models that aim to predict market trends, social behaviors, or policy impacts, but they are not without bias. Historical and ongoing disparities in access to resources, opportunities, and representation can lead to skewed or incomplete data, failing to accurately capture the experiences, perspectives, and needs of marginalized communities.
To increase the transparency and explainability of the model for the project stakeholders, developers should document these items:
Data Description
A comprehensive description of the dataset, including details about what the data represents, its scope, and its relevance to the specific model being developed. For example, a dataset containing student access, participation, and completion data, student survey data, funding data, and spending data over a decade offers insights into long-term educational success outcomes. However, documenting the dataset is not enough; an equity audit should also be conducted on it. Such an audit might reveal, for instance, that Black students received less funding than White students over the decade. These findings should inform how the data is used during model development, and they are essential when building predictive models in education, such as one predicting successful college graduation. Initiatives like the Digital Promise Data Equity cohort can help schools identify equity gaps within their systems and work towards addressing disparities in student outcomes through data-informed decision-making.1
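To make the equity-audit idea concrete, here is a minimal sketch in pandas of a funding-disparity check by student group over time. The DataFrame columns and figures are hypothetical stand-ins, not data from any real district.

```python
# A minimal funding-equity audit sketch; column names and values are
# hypothetical, used only to illustrate the shape of such an analysis.
import pandas as pd

df = pd.DataFrame({
    "year":       [2013, 2013, 2022, 2022],
    "race":       ["Black", "White", "Black", "White"],
    "enrollment": [1200, 1500, 1300, 1450],
    "funding":    [9.6e6, 13.5e6, 11.1e6, 14.9e6],  # dollars per year
})

# Per-student funding by group and year exposes longitudinal gaps.
df["funding_per_student"] = df["funding"] / df["enrollment"]
audit = df.pivot_table(index="year", columns="race",
                       values="funding_per_student")
audit["gap_dollars"] = audit["White"] - audit["Black"]
print(audit)
```

A persistent positive gap like the one printed here is exactly the kind of finding that should shape how the dataset is used downstream.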
Link to Data (if publicly available)
Whenever possible, providing access to the data enhances transparency and allows for independent verification of model results. Public repositories like Kaggle and the UCI Machine Learning Repository are excellent sources of datasets that have been used widely in the machine learning community for research and benchmarking purposes. If the data is not publicly available, one should consider synthetic data generation (for example, using tools such as Synthea, generative AI, or Mockaroo).2 3 One synthetic data generation tool geared toward equity considerations is Synthesized, a data development framework designed to create high-quality data products with an equity focus.4
It is important that these tools leverage techniques such as generative adversarial networks (GANs), differential privacy, and data augmentation to create synthetic data that closely mimics the statistical properties of real-world data without compromising individuals' privacy or perpetuating existing biases.
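As a rough illustration of "mimicking statistical properties," the sketch below fits the mean and covariance of a numeric dataset and samples new records from a multivariate normal, adding Laplace noise to the fitted mean as a crude nod toward differential privacy. The data is simulated, and this carries no formal privacy guarantee; production tools such as GANs or copula-based generators are far more sophisticated.

```python
# A minimal sketch of statistically mimetic synthetic data generation.
# All data is simulated; this offers NO formal differential-privacy
# guarantee (no sensitivity or epsilon analysis is performed).
import numpy as np

rng = np.random.default_rng(0)
# Stand-in for a real numeric dataset: 1,000 rows, 2 features.
real = rng.normal(loc=[50.0, 100.0], scale=[5.0, 20.0], size=(1000, 2))

mu = real.mean(axis=0)            # fitted means
cov = np.cov(real, rowvar=False)  # fitted covariance

# Perturb the sufficient statistics before sampling.
mu_noisy = mu + rng.laplace(scale=0.1, size=mu.shape)

# Sample synthetic records that mimic the data's first two moments.
synthetic = rng.multivariate_normal(mu_noisy, cov, size=1000)
print("real mean:     ", real.mean(axis=0))
print("synthetic mean:", synthetic.mean(axis=0))
```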
Context of Collection
The rationale behind data collection efforts is as important as the data itself. Understanding why data was gathered helps in assessing its suitability for a particular modeling task. Bias in data collection methods can reinforce existing stereotypes and inequalities. For example, surveys conducted primarily in affluent neighborhoods may overlook the experiences of low-income or marginalized communities. Similarly, biased sampling techniques or exclusionary criteria can lead to data that inaccurately reflects the diversity of the population. Data inequity can occur when:5
- Data is collected without proper consent or transparency, as some applications may claim not to collect personal data while actually exfiltrating it
- Data collected is not representative of the target population or context, leading to biased models and inaccurate predictions
- Data collection and analysis may not consider privacy concerns, potentially leading to data breaches and misuse
Population Represented by the Data
A critical aspect of data collection is ensuring that the dataset accurately represents the population of interest. This includes considering demographic diversity, geographic coverage, and the temporal span of the data. Models trained on datasets that lack diversity or are biased towards certain groups may not perform equitably across different segments of the population. For example, facial recognition technologies have faced scrutiny for biases in datasets that were not representative of a diverse population, leading to disparities in accuracy.6
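One lightweight way to check representativeness is a goodness-of-fit test comparing a dataset's demographic composition against reference population shares (for example, from census figures). The group counts and shares in the sketch below are hypothetical.

```python
# A minimal representativeness check using a chi-square goodness-of-fit
# test; the counts and reference shares below are hypothetical.
from scipy.stats import chisquare

observed = [620, 250, 90, 40]                # group counts in the dataset
population_share = [0.55, 0.28, 0.11, 0.06]  # reference proportions

n = sum(observed)
expected = [p * n for p in population_share]

stat, pvalue = chisquare(f_obs=observed, f_exp=expected)
print(f"chi-square = {stat:.1f}, p = {pvalue:.4f}")
# A small p-value signals that the sample's composition deviates from the
# reference population, suggesting reweighting or further collection.
```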
Factors Affecting the Quality and Fairness of the Data
In addition to the careful design and execution of data acquisition and collection processes, it is crucial to consider various factors that may influence the quality and fairness of the data. This includes assessing the representativeness of the data sample, identifying and mitigating potential biases in the data, ensuring compliance with privacy regulations and ethical guidelines, and promoting transparency and accountability throughout the data collection pipeline. Collaboration with diverse stakeholders, including domain experts and community representatives, can also provide valuable insights and perspectives to inform the development of more inclusive and socially responsible data collection.
Ensuring that data acquisition and collection processes are thoughtfully designed and executed is paramount for developing fair, accurate, and reliable machine learning models. The integrity of these processes directly influences the model's performance and its applicability in real-world scenarios.
An example of equity-related data acquisition exists in the public health sector, where the CDC Foundation has developed Principles for Using Public Health Data to Drive Equity.7 These principles aim to create a more equitable data life cycle, ensuring that the data-gathering process, from collection and reporting to analysis, dissemination, and stewardship, is grounded in equity principles. The principles emphasize recognizing and defining systemic factors that affect individual health outcomes, using equity-mindedness as a guide for language and action, allowing for cultural modification, creating shared data agreements, and facilitating data sovereignty.
For more guidance on data acquisition and preparation, explore these helpful resources:
Data Science for Business: What You Need to Know about Data Mining and Data-Analytic Thinking
Data Preparation for Data Mining
For more guidance and support with transparency, explore this helpful resource:
Transparency Throughout the AI Development Lifecycle
Overview of Synthetic Data
Synthetic data in machine learning and AI refers to artificially generated data that mimics the characteristics and distribution of real-world data. This data is created using algorithms or simulation techniques rather than collected from observations or measurements.
There are several use cases that can be addressed with synthetic datasets. For example, they can be used to generate privacy-preserving datasets for training machine learning models without exposing sensitive or personally identifiable information. Another use involves augmenting existing datasets, particularly in cases where collecting additional real-world data may be costly or impractical, or where real-world datasets overrepresent specific populations. During the model testing and validation phases, synthetic data allows for the creation of diverse scenarios and edge cases to thoroughly test and validate machine learning models under various conditions.
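As a simple illustration of the augmentation use case, the sketch below oversamples an underrepresented group by resampling its rows with small random jitter on a numeric feature. The columns and values are hypothetical; principled methods such as SMOTE or generative models are preferable in practice.

```python
# A minimal augmentation sketch: oversample an underrepresented group
# with jittered resampling. Columns and values are hypothetical.
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
df = pd.DataFrame({
    "group": ["A"] * 900 + ["B"] * 100,  # group B is underrepresented
    "score": rng.normal(70, 10, 1000),
})

minority = df[df["group"] == "B"]
extra = minority.sample(n=400, replace=True, random_state=42).copy()
extra["score"] += rng.normal(0, 1.0, len(extra))  # small jitter

augmented = pd.concat([df, extra], ignore_index=True)
print(augmented["group"].value_counts())
```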
As with any decision regarding data, synthetic data has both benefits and risks. Synthetic datasets can address issues of representation and inclusivity by generating data that reflects diverse populations and scenarios, thereby mitigating biases present in real-world data. Additionally, synthetic datasets can provide a way to explore sensitive or underrepresented topics without compromising individuals' privacy or security, promoting ethical data practices.
More specifically, these benefits include:
- Data Diversity: Synthetic data can help introduce diversity into training datasets, enabling machine learning models to generalize better and perform well on unseen data.
- Privacy Protection: It allows organizations to leverage data for training models while preserving privacy and complying with data protection regulations.
- Cost Savings: Generating synthetic data is often less expensive and time-consuming than collecting real-world data, particularly in domains where data collection is challenging or costly.
- Scalability: Synthetic data generation techniques can be scaled up to quickly produce large volumes of data, facilitating the training of complex machine learning models.8
Some of the risks of synthetic data use include:
- Lack of Realism: Synthetic data may not accurately capture the complexities and nuances present in real-world data, potentially leading to poor generalization and performance of machine learning models.
- Bias Amplification: If synthetic data generation techniques are not carefully designed, they may inadvertently introduce or amplify biases present in the training data, leading to biased model predictions.
- Overfitting: Machine learning models trained on synthetic data may overfit to the synthetic data distribution, resulting in poor performance on real-world data (a simple check for this appears after this list)
- Ethical Concerns: There may be ethical considerations associated with using synthetic data, particularly if it is used to generate data that could be harmful or discriminatory if applied in practice.
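One practical way to surface the realism and overfitting risks above is a train-on-synthetic, test-on-real (TSTR) comparison: train one model on synthetic data and another on real data, then evaluate both on a real held-out set. The sketch below uses simulated stand-ins for both datasets; a large accuracy gap would flag a synthetic set that fails to capture the real distribution.

```python
# A minimal train-on-synthetic, test-on-real (TSTR) check. Both the
# "real" and "synthetic" datasets here are simulated for illustration.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(7)
X_real = rng.normal(size=(2000, 5))
y_real = (X_real[:, 0] + 0.5 * X_real[:, 1] > 0).astype(int)

# Imperfect synthetic copy: wrong scale and an oversimplified label rule.
X_syn = rng.normal(scale=1.2, size=(2000, 5))
y_syn = (X_syn[:, 0] > 0).astype(int)

holdout = slice(1500, 2000)  # real held-out set for both models
real_model = LogisticRegression().fit(X_real[:1500], y_real[:1500])
syn_model = LogisticRegression().fit(X_syn, y_syn)

print("train-real :", accuracy_score(y_real[holdout], real_model.predict(X_real[holdout])))
print("train-synth:", accuracy_score(y_real[holdout], syn_model.predict(X_real[holdout])))
```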
When developing an equitable solution, it is important to weigh these benefits and risks, including the potential impacts of synthetic data on marginalized or vulnerable groups, and to address any disparities or biases that may arise. Furthermore, involving diverse stakeholders in developing and validating synthetic datasets can help ensure that they accurately reflect the experiences and perspectives of all users, leading to more equitable outcomes in AI and data-driven decision-making processes. Overall, while synthetic datasets hold promise for promoting equity in data science and AI, it is essential to carefully evaluate their use in specific contexts and mitigate potential risks to ensure the reliability, fairness, and ethical soundness of machine learning models trained on synthetic data.
Developers wishing to dive deeper into the technical aspects of ensuring equity in AI can access our GitHub site.
Reference this resource we created, Data Availability Guiding Questions, to support your discussion at this phase.
- Nguyen, M., Ross, S., Barron, A., Bates, S., & Smith, K. (2021). How Districts Are Using Data Equity to Drive Decisions and Improvements. Digital Promise. digitalpromise.org
- Your Machine Learning and Data Science Community. (n.d.). Kaggle. kaggle.com
- UCI Machine Learning Repository. (n.d.). UC Irvine Machine Learning Repository. archive.ics.uci.edu
- API-driven production-like data provisioning in non-production environments. (n.d.). Synthesized. synthesized.io
- Ghosh, T. (2023). How EdTech Companies Get Away With Exploiting Data Of Minors. BOOM. boomlive.in
- López-López, E., Pardo, X. M., Regueiro, C. V., Iglesias, R., & Casado, F. E. (2019). Dataset bias exposed in face verification. IET Biometrics, 8(4), 249-258. doi.org
- Hill, F., & Smith, L. (n.d.). Principles for Using Public Health Data to Drive Equity. CDC Foundation. cdcfoundation.org
- Jordon, J., Szpruch, L., Houssiau, F., Bottarelli, M., Cherubin, G., Maple, C., Cohen, S., & Weller, A. (2022). Synthetic Data - what, why and how? arxiv.org