Overview: Data Methodology
The integrity and effectiveness of machine learning models depend on the methodology adopted for data collection, manipulation, and use. This section outlines the methodological factors for working with data that help ensure comprehensiveness, fairness, and ethical compliance.
Data Collection
Data is collected through a combination of automated and manual methods to ensure diversity and depth. Automated systems, such as web scraping tools and APIs, efficiently gather large volumes of data from digital platforms and databases. While these systems are relatively fast, accurate, and cost-effective, they introduce challenges around data privacy, system complexity, and overreliance on technology. Manual methods, such as surveys and interviews, complement them by capturing qualitative insights, especially from populations less represented online. This dual approach facilitates a rich dataset that is both broad in scope and deep in insight.1
Data Collector(s)
Data collectors play a crucial role in shaping the datasets that underpin AI models, and their actions can have significant implications for equity and fairness. They should strive to ensure that the datasets they collect represent the diverse populations they aim to serve, which involves actively seeking out data from underrepresented groups and communities to prevent biases and disparities in the resulting AI models. Data collectors are also responsible for upholding ethical standards in their collection practices: obtaining informed consent from individuals before collecting their data, respecting individuals' privacy rights, and ensuring that collection methods do not harm or exploit vulnerable populations.2 Moreover, data collectors should be transparent about how data will be used and shared, providing clear explanations and options for individuals to opt out if desired. Finally, data collectors should be vigilant for biases and discriminatory practices in their collection processes, recognizing and addressing biases that may be present in the selection of data sources, the design of data collection instruments, or the interpretation of collected data. By actively mitigating bias and discrimination, data collectors can help ensure that AI models produce equitable outcomes and do not perpetuate or exacerbate existing disparities.
Bias Considerations
Bias introduced in the data collection phase can lead to skewed and unfair outcomes, affecting the performance and reliability of machine learning models.3 Those leading AI efforts must ensure that the dataset reflects the diversity of the population of interest, including implementing strategies to include underrepresented groups. Several types of bias can arise during collection:
- Selection bias occurs when the way participants or data points are selected for a study yields a sample that is not representative of the population intended for analysis. This can happen through non-random assignment to groups based on certain characteristics, or when certain groups are more likely to be included than others because of the study design or recruitment process. To mitigate selection bias, researchers can use randomization techniques so that all population members are equally likely to be selected (see the sampling sketch after this list).
- Sampling bias, a subset of selection bias, arises when some population members are systematically more likely to be selected in a sample than others. This skewing can limit the generalizability of findings because the sample does not accurately represent the broader population. Strategies to avoid sampling bias include defining a clear target population, using random sampling methods, and ensuring the sampling frame matches the target population as closely as possible.
- Confirmation bias is the tendency to search for, interpret, and recall information in a way that confirms one's preconceptions, leading to statistical errors and skewed data interpretation. To combat confirmation bias, researchers should pre-register their studies, stating hypotheses and analysis plans in advance, and seek out data and analyses that challenge their hypotheses.
- Measurement bias occurs when the data collection method systematically distorts the measurements in a particular direction. This distortion can be due to flawed measurement instruments, differences in data collection procedures, or inconsistent data recording. Ensuring that measurement instruments are calibrated and standardized across all data collection points can help reduce measurement bias.
- Observer bias arises when researchers' expectations or knowledge of the study's purpose influence their observations or interpretations of data, which is particularly problematic in studies involving subjective measurements. Blinding researchers to the study's hypotheses or to the participants' group assignments helps minimize this bias by preventing researchers from consciously or unconsciously influencing study outcomes based on their preconceived notions.
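To make the randomization strategy concrete, here is a minimal sketch of simple random sampling from a population frame. The `population_frame.csv` file and its `region` column are hypothetical names chosen for illustration:

```python
import pandas as pd

# Hypothetical population frame; the file and column names are illustrative.
population = pd.read_csv("population_frame.csv")  # one row per eligible individual

# Simple random sampling: every member of the frame has an equal chance of
# selection, which guards against convenience-based or self-selected samples.
sample = population.sample(n=500, random_state=42)

# Sanity check: compare the sample's demographic mix against the frame's.
print(population["region"].value_counts(normalize=True))
print(sample["region"].value_counts(normalize=True))
```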
AI systems can inadvertently perpetuate and amplify biases in their training data, leading to unfair treatment of certain groups of students. Using diverse and representative datasets and regularly auditing AI systems for bias can help mitigate this situation. Additionally, transparency in AI decision-making processes can help identify and correct biases.
Relevance for Education
In the context of Education/EdTech, biases can lead to racial disparities in student achievement and discipline, as educators' implicit biases may contribute to entrenched inequality in academic achievement and school discipline between Black and White students. Training programs to reduce bias have shown little promise so far, and experts are exploring other approaches to address the connection between implicit bias and decision-making. EdTech can worsen racial inequality if it reflects the biases of its developers or of the data it is trained on, potentially placing students on a slower learning track. Assessment and grading practices can also be influenced by bias, affecting both students and instructors. Implicit bias in schools can lead to excessive discipline, lower teacher expectations, and over-critical grading, which are indirectly linked to higher dropout rates and poorer higher-education outcomes.
Mitigation Techniques
Common techniques for addressing potential biases include:4
- Diverse Sampling Strategies: Implementing diverse sampling techniques such as stratified sampling, oversampling of minority groups, or collecting data from multiple sources to ensure adequate representation of all demographic groups within the population of interest (illustrated in the sketch after this list).
- Data Augmentation: Generating synthetic data points or augmenting existing data samples to increase the representation of underrepresented groups in the dataset while ensuring that the synthetic data accurately reflects the characteristics and distributions of the original data. Please see the "Overview of Synthetic Data" section.
- Human-in-the-Loop Approaches: Involving human annotators or domain experts in data labeling to identify and correct biases, clarify ambiguous labels, and ensure the cultural and contextual relevance of the collected data.
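As a concrete illustration of diverse sampling, the sketch below shows stratified sampling and simple oversampling with pandas. The `survey_responses.csv` file and `demographic_group` column are hypothetical:

```python
import pandas as pd

# Hypothetical dataset; the file and column names are illustrative.
df = pd.read_csv("survey_responses.csv")

# Stratified sampling: draw the same fraction from every demographic group
# so that small groups are not crowded out by large ones.
stratified = df.groupby("demographic_group", group_keys=False).apply(
    lambda g: g.sample(frac=0.10, random_state=0)
)

# Simple oversampling: resample smaller groups with replacement until every
# group matches the size of the largest group.
target = df["demographic_group"].value_counts().max()
balanced = df.groupby("demographic_group", group_keys=False).apply(
    lambda g: g.sample(n=target, replace=True, random_state=0)
)
```

Note that oversampling duplicates records, so downstream evaluation should use the original distribution to avoid overstating performance for the oversampled groups.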
For more guidance on bias mitigation strategies in data collection, explore these helpful resources:
Social Data: Biases, Methodological Pitfalls, and Ethical Boundaries
By adopting these methodologies and considerations, the data collection process aims to be comprehensive, fair, and ethically sound. This approach enriches the dataset and ensures that the development of machine learning models is grounded in responsible and inclusive practices.
Data Ingestion
Data ingestion involves transforming raw data into a usable format that can be leveraged for insights and decision-making.5 When organizations attend to this phase, they can minimize errors, improve data quality, and streamline the subsequent stages of data processing, analysis, and interpretation, ultimately enhancing the reliability and effectiveness of their data-driven initiatives. Bias can enter at different stages of this process, beginning with the selection of data sources: if the sources are not representative of the population or phenomenon being studied, the resulting dataset may be biased. For example, if a recruitment platform predominantly attracts job seekers from certain demographics, the collected data may not reflect the diversity of the workforce.
Data Format
Data may be structured in various formats, including CSV (Comma Separated Values), JSON (JavaScript Object Notation), and SQL databases, among others. The format is determined by the source of the data and the requirements of the system it's being ingested into. CSV files are widely used for their simplicity and compatibility with most data processing tools. JSON is preferred for nested data structures, while SQL databases are used for structured data that requires relational databases.
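The difference between these formats is easiest to see with the same record in each. This small sketch uses only the Python standard library; the field names are hypothetical:

```python
import csv
import io
import json

# The same hypothetical record in two common ingestion formats.
csv_text = "student_id,score\nS001,87\n"
json_text = '{"student_id": "S001", "assessment": {"score": 87, "attempts": 2}}'

# CSV: flat and tabular, compatible with most data processing tools.
row = next(csv.DictReader(io.StringIO(csv_text)))

# JSON: supports nesting, preferred for hierarchical structures.
record = json.loads(json_text)

print(row["score"], record["assessment"]["score"])  # 87 87
```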
Ingestion Process
The ingestion process typically involves ETL (Extract, Transform, Load) steps. Data is first extracted from its source, then transformed to fit the system's requirements (which may include cleaning, normalization, and integration steps), and finally loaded into the target system. Data integrity checks are performed throughout to ensure accuracy and completeness. Data engineers commonly use tools like Apache NiFi, Talend, Apache Spark, and Informatica for managing complex data pipelines.
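A minimal ETL sketch in Python, assuming a hypothetical `raw_events.csv` source and illustrative column names, with an integrity check between the transform and load steps:

```python
import sqlite3

import pandas as pd

# Extract: pull raw records from the source (hypothetical file and columns).
raw = pd.read_csv("raw_events.csv")

# Transform: clean and normalize to the target schema.
raw = raw.dropna(subset=["student_id"])                  # drop incomplete records
raw["event_time"] = pd.to_datetime(raw["event_time"], utc=True)
raw["score"] = raw["score"].clip(lower=0, upper=100)     # enforce valid range

# Integrity check before loading.
assert not raw.duplicated(subset=["student_id", "event_time"]).any(), \
    "duplicate events detected"

# Load: write the cleaned data into the target system (here, SQLite).
with sqlite3.connect("warehouse.db") as conn:
    raw.to_sql("events", conn, if_exists="replace", index=False)
```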
Pre-Sanitization
Pre-sanitization refers to the steps taken to ensure data quality and integrity before ingestion. This process can include removing personal identifying information (PII) to maintain privacy, correcting errors, and standardizing formats. The data provider often undertakes initial sanitization efforts, but additional checks are typically performed during the ingestion process.
Pre-sanitization processes are essential for safeguarding privacy, mitigating biases, and promoting fairness in data-driven initiatives. While the primary focus of pre-sanitization is often on removing personal identifying information (PII) to protect individuals' privacy rights, it also presents an opportunity to address equity concerns. For instance, in addition to removing PII, pre-sanitization efforts should prioritize the identification and mitigation of biases that may disproportionately impact marginalized communities. These efforts could involve correcting errors and standardizing formats to ensure the data accurately reflects diverse perspectives and experiences. Moreover, transparency and accountability in pre-sanitization procedures are crucial for building trust among stakeholders, particularly those from marginalized groups. By implementing rigorous pre-sanitization processes, organizations can uphold fairness, inclusivity, and ethical integrity throughout the data lifecycle, ultimately fostering more equitable outcomes in data-driven decision-making processes.
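A minimal pre-sanitization sketch, assuming hypothetical column names: it drops direct-identifier columns and redacts email- and SSN-like patterns left inside free-text fields:

```python
import re

import pandas as pd

# Hypothetical raw dataset; the file and column names are illustrative.
df = pd.read_csv("raw_survey.csv")

# Drop columns that are direct identifiers.
df = df.drop(columns=["full_name", "email", "ssn"], errors="ignore")

# Redact PII patterns that may remain inside free-text responses.
EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+")
SSN = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")

def scrub(text: str) -> str:
    text = EMAIL.sub("[EMAIL]", text)
    return SSN.sub("[SSN]", text)

df["comments"] = df["comments"].astype(str).map(scrub)
```

Pattern-based redaction is a first pass, not a guarantee; sanitized outputs should still be spot-checked during ingestion.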
For more guidance on building and managing data ingestion pipelines, explore this helpful resource: Apache NiFi documentation
Removing Data
Removing data is a common practice in modifying datasets to address bias, privacy, or irrelevance issues. This approach may involve removing sensitive or identifying information to protect privacy, eliminating biased or skewed data points to reduce algorithmic bias, or filtering out irrelevant data to improve model performance and efficiency. However, careful consideration must be given to the potential impact of data removal on the integrity and representativeness of the dataset and the implications for the fairness and effectiveness of AI systems trained on the modified data.
Criteria for Removal
When considering the removal of data points or features from a dataset, it is essential to conduct a thorough analysis to determine their relevance and potential impact on the problem being addressed. Features that are irrelevant to the problem at hand or have the potential to introduce bias should be carefully evaluated for removal. Additionally, concerns regarding data privacy should be considered, particularly when dealing with sensitive or personally identifiable information. Decisions to remove data should be made judiciously, balancing the need to mitigate bias and ensure privacy while preserving data integrity and representativeness. Furthermore, documentation of the rationale behind data removal decisions is crucial to maintain transparency and accountability in the data modification process. Overall, a systematic approach to data removal ensures that only truly problematic or unnecessary data is excluded from the dataset while preserving its suitability for analysis and modeling purposes.
Before removing data, the development team must comprehensively assess the potential impact on the dataset's representativeness and the performance of the model trained on it. This evaluation involves analyzing how the removal may affect data distribution across different demographic groups and whether it could introduce new biases or exacerbate existing ones. Additionally, careful consideration should be given to the consequences of data removal on the model's predictive accuracy and fairness, as removing important data points could lead to skewed or unreliable outcomes. Strategies such as stratified sampling or sensitivity analysis may be employed to assess the potential effects of data removal on model performance and equity. Overall, a thoughtful and rigorous approach to evaluating the implications of data removal is essential to maintain the integrity and fairness of AI systems.
Proxying Data
Replacing direct identifiers like names and social security numbers with indirect identifiers or aggregates, such as age ranges and zip code areas, helps keep data anonymous while still being useful for analysis and modeling. This step is crucial in today's digital world to ensure data privacy and protection; this is particularly important in the education/EdTech environment because it involves sensitive information about students and their academic performance.6 For more details, please visit the "Overview of Synthetic Data" section in our guide.
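A minimal sketch of this kind of proxying, using illustrative values: exact ages become age ranges, and five-digit ZIP codes become three-digit area prefixes:

```python
import pandas as pd

# Illustrative records; real datasets would be read from a source system.
df = pd.DataFrame({
    "age": [14, 17, 15, 16],
    "zip_code": ["02139", "02139", "60615", "94110"],
})

# Replace exact age with an age range (a coarser, indirect identifier).
df["age_range"] = pd.cut(df["age"], bins=[13, 15, 18], labels=["14-15", "16-18"])

# Replace the full ZIP code with its 3-digit prefix (a broader area).
df["zip_area"] = df["zip_code"].str[:3]

# Drop the direct values once the proxies are in place.
df = df.drop(columns=["age", "zip_code"])
print(df)
```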
Definition of Proxying
In the context of machine learning and AI, proxy data refers to secondary or substitute data that developers use as a stand-in or approximation for the primary data of interest. Proxy data is often employed when direct measurements or observations of the target variable are unavailable, impractical to obtain, or too costly. This substitute data is chosen based on its correlation or relationship with the target variable, with the assumption that it can adequately represent or approximate the underlying phenomenon of interest. Proxy data can include various types of information, such as related variables, surrogates, or indicators that are believed to be associated with the target variable and can serve as a reasonable proxy for modeling purposes. However, it's essential to carefully consider the validity and reliability of proxy data and assess its suitability for the specific task at hand to avoid introducing bias or inaccuracies into machine learning models.
Proxying Techniques
Development teams can employ several techniques for proxying data, including k-anonymity, l-diversity, and differential privacy. K-anonymity ensures that each record is indistinguishable from at least k-1 other records with respect to a set of quasi-identifiers (attributes that could be combined to re-identify individuals). L-diversity extends k-anonymity by requiring diverse sensitive attributes within each equivalence class. Differential privacy provides a mathematical guarantee that the privacy of individuals in the dataset is protected, even in the presence of auxiliary information. Each technique has strengths and limitations, which should be weighed in the specific project context.7
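A simple way to verify k-anonymity after generalization is to confirm that every combination of quasi-identifier values occurs at least k times. The sketch below does exactly that; the column names are hypothetical:

```python
import pandas as pd

def is_k_anonymous(df: pd.DataFrame, quasi_identifiers: list[str], k: int) -> bool:
    """Return True if every combination of quasi-identifier values
    appears at least k times in the dataset."""
    group_sizes = df.groupby(quasi_identifiers).size()
    return bool((group_sizes >= k).all())

# Hypothetical usage with generalized attributes (age ranges, ZIP prefixes).
data = pd.DataFrame({
    "age_range": ["14-15", "14-15", "16-18", "16-18", "16-18"],
    "zip_area":  ["021",   "021",   "021",   "021",   "021"],
})
print(is_k_anonymous(data, ["age_range", "zip_area"], k=2))  # True
```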
In addition to k-anonymity, l-diversity, and differential privacy, there are other techniques commonly employed for proxying data in machine learning and AI:
- Data masking involves obfuscating or anonymizing sensitive information in the dataset to protect privacy while retaining the overall structure and utility of the data. Common data masking techniques include randomization, perturbation, and encryption.
- Homomorphic encryption allows computations to be performed directly on encrypted data without decrypting it first, preserving privacy while enabling data analysis and machine learning tasks to be performed on sensitive data.
- Federated learning enables machine learning models to be trained across multiple decentralized devices or servers without exchanging raw data. Instead, model updates are aggregated centrally, preserving data privacy while enabling collaborative model training.
- Distributed differential privacy extends differential privacy to distributed settings, allowing privacy guarantees to be maintained across multiple data sources or parties while enabling collaborative data analysis and machine learning.
These techniques are vital in mitigating the risks of re-identification and unauthorized disclosure of sensitive information, particularly for marginalized or vulnerable populations. K-anonymity ensures that individuals cannot be singled out based on specific identifiers, safeguarding against discriminatory practices and potential privacy breaches. L-diversity goes further by requiring diversity in sensitive attributes within each equivalence class, enhancing protection for individuals with intersecting identities. Additionally, differential privacy offers a robust mathematical framework for quantifying and guaranteeing privacy protection, even in the face of external data attacks or auxiliary information. However, each technique has its strengths and limitations, and their application should be tailored to the specific project context while considering potential equity implications. Organizations must prioritize transparency, accountability, and stakeholder engagement in implementing data-proxying techniques to ensure privacy protection measures align with ethical principles and promote equitable outcomes for all individuals involved.
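For differential privacy specifically, the Laplace mechanism is the textbook starting point. This sketch releases a noisy count for a counting query, whose sensitivity is 1 because adding or removing one individual changes the count by at most 1; the epsilon value and the example query are illustrative:

```python
import numpy as np

def laplace_count(true_count: int, epsilon: float, rng=None) -> float:
    """Release a count under epsilon-differential privacy via the
    Laplace mechanism; noise scale = sensitivity / epsilon = 1 / epsilon."""
    rng = rng or np.random.default_rng()
    return true_count + rng.laplace(loc=0.0, scale=1.0 / epsilon)

# Hypothetical query: how many students requested extra study materials?
print(laplace_count(42, epsilon=0.5))
```

Smaller epsilon values add more noise and stronger privacy; setting that trade-off is exactly the kind of threshold decision discussed next.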
Thresholds for Data Modification
Thresholds should be defined for when and how to modify data, such as removing or proxying variables, so that the utility and integrity of the data needed for analysis and model training are balanced against privacy and fairness considerations.
Equity considerations require organizations to establish thresholds that respect the privacy expectations of the individuals represented in the data, especially those from marginalized or vulnerable communities. These thresholds should be justified based on the sensitivity of the data and the analytical goals of the project, ensuring that privacy protections are not disproportionately burdensome or restrictive. Moreover, transparency and stakeholder engagement are essential in establishing thresholds, allowing for input from diverse perspectives, and promoting accountability in decision-making processes.
For more guidance on privacy and fairness considerations, explore this helpful resource: The Algorithmic Foundations of Differential Privacy
Defining Thresholds
The process for defining thresholds includes evaluating the minimum sample size required to maintain statistical significance, determining the level of privacy protection needed, and considering the acceptable trade-off between data utility and fairness. These thresholds should be defined so that data modifications do not undermine the dataset's integrity or the resulting models' efficacy.5
There are several additional considerations when defining thresholds for data modifications in privacy-preserving techniques:
- Consider the granularity of the data being analyzed. Data modifications should be applied at an appropriate level of granularity to balance privacy protection and data utility effectively. For example, aggregating or anonymizing data at too high a level of granularity may lose important information, while applying modifications at too low a level may increase the risk of re-identification (see the suppression sketch after this list).
- Take into account contextual factors that may influence the sensitivity of the data or the privacy expectations of individuals. For instance, the nature of the data (e.g., financial, health, location) and the context in which it is collected (e.g., online interactions, workplace activities) can impact the level of privacy protection required and the acceptable trade-offs between privacy and utility.
- Assess the performance requirements of the models being trained or deployed using the modified data. Define thresholds that ensure data modifications do not compromise the resulting models' accuracy, reliability, or fairness. This process may involve conducting sensitivity analyses or performance evaluations to assess the impact of different thresholds on model performance.
- Recognize that thresholds may need to be adjusted dynamically based on evolving data characteristics, privacy regulations, or project requirements. Implement mechanisms for monitoring and adapting thresholds over time to maintain the dataset's integrity and the resulting models' efficacy.
- Document the rationale behind threshold decisions and ensure transparency in the data modification process. Clearly communicate the thresholds used and the implications for data privacy, utility, and model performance to stakeholders, including data subjects, project teams, and regulatory authorities.
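One common, concrete threshold is small-cell suppression: any aggregate computed from fewer than a minimum number of individuals is withheld. A minimal sketch, with a hypothetical threshold, file, and column names:

```python
import pandas as pd

MIN_CELL_SIZE = 10  # hypothetical threshold; set per project, policy, and regulation

# Hypothetical aggregate: average score per demographic group.
df = pd.read_csv("student_scores.csv")
summary = df.groupby("demographic_group").agg(
    n=("score", "size"), mean_score=("score", "mean")
)

# Suppress any cell whose underlying group is smaller than the threshold,
# since small cells carry a higher re-identification risk.
summary.loc[summary["n"] < MIN_CELL_SIZE, "mean_score"] = None
print(summary)
```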
Example
Consider an AI-based adaptive learning system that aggregates student interaction data, test results, and exercise completion times to adjust the difficulty of tasks and recommend additional study materials. The granularity of the data collected is crucial: too detailed, and it could risk re-identification of students; too coarse, and it might lose its effectiveness in personalizing learning. Additionally, the system must consider contextual factors such as the sensitivity of the data (e.g., health information within an educational health app) and the context in which it is collected (e.g., data gathered in a public or private setting). Performance requirements are also a key consideration: the AI system must ensure that data modifications do not degrade the accuracy or fairness of the models used to adapt learning experiences for students, which might involve sensitivity analyses to determine the impact of different levels of data granularity on model performance. Moreover, as the characteristics of the data and privacy regulations evolve, the system may need to dynamically adjust its thresholds to maintain the integrity of the dataset and the efficacy of the resulting models. Transparency in the data modification process is essential, and the rationale behind threshold decisions must be documented and communicated clearly to stakeholders, including students, educators, and regulatory authorities.
By considering these additional factors and adopting a systematic approach to defining thresholds for data modifications, organizations can effectively safeguard data privacy while preserving the integrity and utility of their datasets and models.
Validation
Validating that the established thresholds are appropriate involves conducting sensitivity analysis, retraining models to assess the impact of data modifications, and consulting with domain experts or stakeholders. This validation process ensures that the thresholds achieve the intended balance between privacy, fairness, and utility.
Equity considerations extend beyond technical validation and require engaging with domain experts, community representatives, and affected stakeholders to assess the social and ethical ramifications of data modifications. By soliciting input from those directly impacted by data-driven decisions, organizations can ensure that thresholds strike an appropriate balance between privacy protection, fairness, and utility. Moreover, transparency throughout the validation process is essential for building trust and accountability, allowing stakeholders to understand the rationale behind threshold decisions and providing opportunities for feedback and adjustment. Ultimately, validating thresholds from an equity perspective involves recognizing the nuanced and context-dependent nature of privacy and fairness considerations and prioritizing inclusivity, transparency, and ethical integrity in decision-making.
Sensitivity analysis is a technique used to assess the sensitivity or responsiveness of a model's output to small changes in input variables or parameters. It involves systematically varying the input variables or parameters within a specified range and observing the resulting changes in the model's output. Sensitivity analysis aims to understand how uncertainties or variations in input factors impact the model's predictions, insights, or conclusions. Some sensitivity analysis techniques include:
- Multi-Way Sensitivity Analysis: This involves simultaneously varying multiple input variables or parameters to assess their collective impact on the model's output. It allows for examining interactions between different factors and their combined influence on the model's predictions.8
- Scenario Analysis: Scenario analysis involves analyzing the model's output under different hypothetical scenarios or conditions. Scenarios may represent different assumptions, events, or outcomes, allowing stakeholders to explore a range of possible future states and their implications.
- Monte Carlo Simulation: Monte Carlo simulation is a probabilistic sensitivity analysis technique that randomly samples input variables or parameters from probability distributions. By running many simulations with different sets of sampled values, it provides insight into the overall uncertainty and variability of the model's predictions, as sketched below.
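A minimal Monte Carlo sketch, using an illustrative two-input model: inputs are sampled from assumed distributions and the spread of outputs summarizes the model's sensitivity to input uncertainty:

```python
import numpy as np

rng = np.random.default_rng(0)

def model(difficulty: np.ndarray, prior_score: np.ndarray) -> np.ndarray:
    """Hypothetical model: predicted mastery as a function of two inputs."""
    return 1.0 / (1.0 + np.exp(-(0.8 * prior_score - 1.2 * difficulty)))

# Sample inputs from assumed distributions and propagate them through the model.
difficulty = rng.normal(loc=0.5, scale=0.2, size=10_000)
prior_score = rng.normal(loc=0.0, scale=1.0, size=10_000)
outputs = model(difficulty, prior_score)

print(f"mean={outputs.mean():.3f}, std={outputs.std():.3f}")
print("5th-95th percentile:", np.percentile(outputs, [5, 95]))
```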
Retraining models in this context refers to the process of updating or adjusting machine learning models based on modified data that has undergone privacy-preserving modifications. After applying thresholds for data modifications, such as anonymization or perturbation techniques, the impact of these modifications on model performance needs to be evaluated. Retraining models involves using the modified data to retrain the machine learning models and assess how the changes affect the model's accuracy, fairness, and overall utility.
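A minimal before-and-after retraining comparison, using synthetic stand-in data; in practice, `X_mod` would be the dataset after anonymization or perturbation rather than the simulated noise used here:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Stand-in data; X_mod simulates a privacy-preserving perturbation.
X, y = make_classification(n_samples=2000, n_features=10, random_state=0)
rng = np.random.default_rng(0)
X_mod = X + rng.normal(scale=0.3, size=X.shape)

# Use one set of split indices so both models see the same train/test rows.
idx = np.arange(len(y))
train_idx, test_idx = train_test_split(idx, random_state=0)

baseline = LogisticRegression(max_iter=1000).fit(X[train_idx], y[train_idx])
retrained = LogisticRegression(max_iter=1000).fit(X_mod[train_idx], y[train_idx])

print("baseline accuracy :", baseline.score(X[test_idx], y[test_idx]))
print("retrained accuracy:", retrained.score(X_mod[test_idx], y[test_idx]))
```

Accuracy alone is not enough; the same comparison should be run for fairness metrics across the demographic groups the system serves.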
The final validation method, consulting with domain experts and stakeholders, ensures that the validation process is comprehensive and inclusive, incorporating diverse perspectives and expertise to confirm the appropriateness of established thresholds. This collaborative approach helps build consensus, foster transparency, and enhance the credibility and trustworthiness of the analysis and its outcomes.
Monitoring Methodology
A continuous monitoring process is essential to evaluate the impact of data modification on model performance and fairness over time. This process includes a plan for adjusting thresholds or practices based on feedback, new insights, or changes in the data or the model's application context. Regular review meetings with stakeholders can facilitate this adaptive approach.
For more guidance and support with stakeholder engagement, explore this helpful resource:
Stakeholder Engagement Throughout The Development Lifecycle
Compliance with Regulations
In the context of AI, data modification practices refer to the processes and techniques used to prepare, preprocess, manipulate, or augment data to improve the performance, fairness, privacy, or utility of machine learning models and AI systems. These practices may include various methods for data preprocessing, transformation, cleaning, anonymization, aggregation, or augmentation to enhance data quality, relevance, and suitability for model training, evaluation, and deployment.
All data modification practices must comply with relevant data protection laws and industry standards, such as the General Data Protection Regulation (GDPR), the Family Educational Rights and Privacy Act (FERPA), the Health Insurance Portability and Accountability Act (HIPAA), or others pertinent to the project's domain. This compliance ensures that data handling meets legal requirements and protects individual rights.
For more guidance on ethics, governance, and regulations for data, explore these helpful resources:
European Union's General Data Protection Regulation (GDPR) portal
Health Insurance Portability and Accountability Act
Developers wishing to dive deeper into the technical aspects of ensuring equity in AI can access our GitHub site.
Reference this resource we created, Data Design Guiding Questions, to support your discussion at this phase.
For more guidance and support with transparency, explore this helpful resource: Transparency Throughout the AI Development Lifecycle
- Porter, N. D., Verdery, A. M., & Gaddis, S. M. (2020). Enhancing big data in the social sciences with crowdsourcing: Data augmentation practices, techniques, and opportunities. PLOS ONE, 15(6), e0233154. doi.org
- Schlosser, L., Hood, C. E., Hogan, E., Baca, B., & Gentile-Mathew, A. (2022). Choosing the right educational technology tool for your teaching: A data-privacy review and pedagogical perspective into teaching with technology. Journal of Educational Technology Systems. journals.sagepub.com
- Jager, K. J., Tripepi, G., Chesnaye, N. C., Dekker, F. W., Zoccali, C., & Stel, V. S. (2020). Where to look for the most frequent biases? Nephrology, 25(6), 435-441. doi.org
- Sharma, S., Zhang, Y., Aliaga, J. M. R., Bouneffouf, D., Muthusamy, V., & Varshney, K. R. (2020). Data augmentation for discrimination prevention and bias disambiguation. doi.org
- Tae, K. H., Roh, Y., Oh, Y. H., Kim, H., & Whang, S. E. (2019). Data cleaning for accurate, fair, and robust models: A big data - AI integration approach. dl.acm.org
- Abouelmehdi, K., Beni-Hessane, A., & Khaloufi, H. (2018). Big healthcare data: Preserving security and privacy. doi.org
- Urban Institute. (2023). Do no harm guide: Applying equity awareness in data privacy methods. urban.org
- Ntoutsi, E., Fafalios, P., Gadiraju, U., Iosifidis, V., Nejdl, W., Vidal, M., Ruggieri, S., Turini, F., Papadopoulos, S., Krasanakis, E., Kompatsiaris, I., Kinder-Kurlanda, K., Wagner, C., Karimi, F., Fernández, M., Alani, H., Berendt, B., Kruegel, T., Heinze, C., ... Staab, S. (2020). Bias in data‐driven artificial intelligence systems—An introductory survey. WIREs Data Mining and Knowledge Discovery, 10(3). doi.org