Overview: Testing
Thorough testing supports the development of accurate, reliable, and robust models in real-world scenarios. This section covers strategies, recommendations, and guidelines for evaluating algorithms and outputs for bias, mitigating bias, and conducting impact assessments.
Ensure Model is Human Interpretable
The simplest way to ensure models are interpretable is to use models that are interpretable by design.1 An interpretable model is a machine learning model that produces results or predictions that can be easily understood, explained, and interpreted by humans, particularly those without a deep technical background in machine learning or artificial intelligence. These models are designed to provide transparency and clarity about how they make predictions or decisions, which is essential for gaining trust, identifying potential biases, and ensuring accountability in AI systems.
Interpretable models, such as linear regression, logistic regression, and decision trees, provide visibility into the factors influencing predictions, enabling stakeholders, including those affected by the model's outcomes, to understand the reasoning behind decisions. This transparency fosters trust and facilitates meaningful stakeholder engagement, empowering stakeholders to challenge or validate model outputs and identify potential biases or disparities. While the neural network models commonly used in deep learning lack inherent interpretability, techniques exist to enhance their transparency, thereby improving understanding and promoting equitable outcomes. Examples of such explanation techniques include SHAP, LIME, and permutation importance. For more information on these methods, refer to the resource "Interpretable Machine Learning". By prioritizing interpretability in model development and deployment, organizations can promote fairness, mitigate bias, and uphold ethical principles, ultimately advancing equity in data-driven decision-making processes.
Depending on the model's complexity, methods for model interpretability can be classified into intrinsic analysis and post-hoc analysis. Developers can apply intrinsic analysis to interpret low-complexity models (simple relationships between the input variables and the predictions). They can apply post-hoc analysis to interpret both simpler models and more complex models, such as neural networks, which can capture non-linear interactions.2 Post-hoc methods are often model-agnostic and provide mechanisms to interpret a trained model based on its inputs and output predictions; the analysis can be performed at a local or global level.
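As a concrete illustration of model-agnostic post-hoc analysis, the sketch below applies permutation importance to a trained classifier using scikit-learn. The synthetic dataset and the choice of a random forest are illustrative assumptions, not a prescribed setup.

```python
# Minimal sketch: global post-hoc interpretation via permutation importance.
# The synthetic data and the model choice below are illustrative assumptions.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, n_features=6, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = RandomForestClassifier(random_state=0).fit(X_train, y_train)

# Permutation importance: shuffle one feature at a time and measure the drop
# in held-out score; larger drops indicate more influential features.
result = permutation_importance(model, X_test, y_test, n_repeats=10, random_state=0)
for i in np.argsort(result.importances_mean)[::-1]:
    print(f"feature_{i}: {result.importances_mean[i]:.3f} +/- {result.importances_std[i]:.3f}")
```

Because the technique only requires predictions and a held-out score, the same code works for any fitted model, which is what makes it useful for the complex models discussed above.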
An AWS white paper provides a detailed explanation of the various techniques for model interpretability.
Evaluate Algorithms and Output for Bias
Algorithmic bias occurs when AI/ML model design, data, and sampling result in measurably different model performance for different subgroups.3 The two major categories of algorithmic bias are subgroup invalidity and label choice bias. Subgroup invalidity occurs when AI/ML predicts an appropriate outcome or measure, but the model performs poorly for a particular subgroup; it typically arises when models are trained on non-diverse populations or on data that underrepresents the subgroup or omits risk factors specific to it. Label choice bias occurs when the algorithm's predicted outcome is a proxy variable for the actual outcome it should be predicting.4
Algorithmic bias can cause serious, even life-threatening, harm to the people affected by the output of AI/ML models. An example is a facial recognition algorithm that recognizes white faces more readily than Black faces, leading to potential discrimination and hindering equal opportunity. Hence, there is a practical need to evaluate algorithms and output for bias. The following steps can be implemented:5
- Inventory of algorithms - Maintain a detailed inventory of all algorithms being used or developed. Include information such as the purpose, functionality, development stage, and key components. Form a diverse committee comprising internal and external stakeholders, including subject matter experts, data scientists, legal experts, and ethicists. This committee should meet regularly to discuss algorithmic developments, ethical considerations, and potential impacts.
- Screen each algorithm for bias - Clearly define what bias means in the context of the application. Bias can manifest in various ways, such as racial, gender, or socioeconomic bias.
- On the technical side, Python libraries like Aequitas can be used to test for bias in the features and output of the algorithm. Aequitas is an open-source bias audit toolkit that helps machine learning developers, analysts, and policymakers audit machine learning models for discrimination and bias and make informed, equitable decisions around developing and deploying predictive risk-assessment tools.6 A sketch of this kind of group-level audit follows this list.
- Encourage users to provide feedback on the algorithm's outputs, especially if they believe bias may be present. This feedback loop can be valuable for continuous improvement. For more information, see section Customer Feedback and Impact.
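The sketch below shows the kind of group-level audit that toolkits such as Aequitas automate and extend: computing an error rate per demographic group and the disparity between groups. The column names, data, and reference group are illustrative assumptions.

```python
# Minimal sketch of a group-level bias audit (the kind of audit Aequitas
# automates). Column names and data are illustrative assumptions.
import pandas as pd

df = pd.DataFrame({
    "group":      ["A", "A", "A", "A", "B", "B", "B", "B"],
    "label":      [0, 1, 0, 1, 0, 1, 0, 0],   # ground truth
    "prediction": [1, 1, 0, 1, 0, 1, 1, 1],   # model output
})

# False-positive rate per group: mean prediction among true negatives.
fpr = df[df["label"] == 0].groupby("group")["prediction"].mean()
print(fpr)

# Disparity ratio relative to a reference group; values far from 1.0 suggest
# the model makes false-positive errors more often for one group than another.
print(fpr / fpr["A"])
```

In practice the same calculation is repeated for several metrics (false negatives, precision, selection rates) and for every protected attribute before deciding whether the disparities are acceptable.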
For more guidance and support with stakeholder engagement, explore this helpful resource: Stakeholder Engagement Throughout The Development Lifecycle
For more guidance and support with transparency, explore this helpful resource: Transparency Throughout the AI Development Lifecycle
Mitigating Biases Between vs. Within Groups
Between-group Bias
Between-group bias in machine learning refers to biases arising from differential treatment or outcomes experienced by distinct demographic groups within a dataset or model predictions. Machine learning models may exhibit differential performance across demographic groups, leading to disparate outcomes. For example, a model predicting loan approval may be more accurate for one demographic group compared to another, leading to unequal access to financial opportunities. Biases present in the training data can lead to between-group bias in machine learning models. If certain demographic groups are underrepresented or misrepresented in the training data, the model may learn to make inaccurate or unfair predictions for those groups. Between-group bias can also arise from differences in outcomes experienced by different demographic groups as a result of model predictions. For example, if a hiring algorithm disproportionately selects candidates from one demographic group over others, it may perpetuate existing disparities in employment opportunities.
The bias mitigation strategies below address between-group bias in machine learning.
- Diverse Representation in Data: Ensure the training data includes diverse samples from all relevant demographic groups to avoid under-representation or marginalization.
- Fair Sampling: Use techniques such as stratified sampling to ensure that each subgroup is adequately represented in the training data, preventing biases related to group size.
- Demographic-Aware Features: Develop features sensitive to demographic differences, ensuring that the algorithm considers and appropriately adjusts for variations between groups.
- Equalized Odds: Implement fairness metrics like equalized odds to ensure that the algorithm's predictions are equally accurate across different demographic groups (a sketch follows this list).
- Fair Pre-Processing: Apply pre-processing techniques to the data to mitigate biases before training the algorithm, addressing disparities between groups.
- Bias-Aware Algorithms: Use algorithms specifically designed to detect and mitigate biases. Fairness-aware machine learning models incorporate fairness constraints during training. For example, developers can add a fairness regularization term to the loss function or modify the optimization algorithm to ensure fairness.
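The sketch below illustrates the equalized odds check mentioned above: comparing true-positive and false-positive rates across groups. The labels, predictions, and group assignments are illustrative assumptions.

```python
# Minimal sketch: checking equalized odds by comparing true-positive and
# false-positive rates across groups. Data and group labels are illustrative.
import numpy as np

y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0, 1, 0])
y_pred = np.array([1, 0, 0, 1, 1, 0, 1, 0, 0, 0])
group  = np.array(["A", "A", "A", "B", "B", "B", "A", "B", "B", "A"])

def rates(y_t, y_p):
    tpr = (y_p[y_t == 1] == 1).mean()   # true-positive rate
    fpr = (y_p[y_t == 0] == 1).mean()   # false-positive rate
    return tpr, fpr

for g in np.unique(group):
    tpr, fpr = rates(y_true[group == g], y_pred[group == g])
    print(f"group {g}: TPR={tpr:.2f}, FPR={fpr:.2f}")

# Equalized odds holds (approximately) when both TPR and FPR are similar
# across groups; large gaps indicate a fairness violation to investigate.
```

The same per-group rates can feed a fairness regularization term during training, as noted in the bias-aware algorithms item above.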
Bias detection and mitigation methods often prioritize "between-group fairness," aiming to address disparities in model predictions among distinct demographic groups, which is crucial for rectifying long-standing inequalities. However, it is equally important to consider complexities within these groups, known as "within-group fairness."7 Existing approaches may overlook within-group issues and exacerbate disparities, and blindly optimizing fairness metrics may not reveal who within each group is impacted.
Within-group Bias
Within-group bias refers to the tendency of machine learning algorithms to favor or prioritize certain groups of individuals over others based on specific characteristics such as race, gender, or ethnicity. This bias can arise when the data used to train the machine learning algorithm is imbalanced or represents a limited range of individuals. Datasets that train AI software are usually skewed towards one group or against another.8 Intersectionality is a concept that expands the understanding of biases by considering how various social categories such as race, gender, and class intersect to create unique experiences of discrimination and privilege. In the context of within-group bias in machine learning, intersectionality highlights how biases are not only based on individual characteristics but also on the intersections of these characteristics. For example, a machine learning algorithm that exhibits within-group bias may favor one racial group over others and disproportionately favor one gender within that racial group, which means that individuals who belong to multiple marginalized groups may face compounded biases that are not accounted for by simply considering each characteristic in isolation.9 Understanding within-group bias through an intersectional lens allows for a more comprehensive examination of the complexities of bias in machine learning and emphasizes the need for more inclusive and diverse datasets to mitigate these biases.
Using the strategies below can help mitigate within-group bias.
- Subgroup Analysis: Conduct subgroup analysis to identify biases within specific groups. This analysis involves examining how the algorithm's performance varies across subpopulations (a sketch follows this list).
- Fine-Grained Fairness Metrics: Define and use fine-grained fairness metrics that assess performance within subgroups, preventing the perpetuation of biases.
- Customized Mitigation Strategies: Develop targeted mitigation strategies for specific subgroups if biases are identified. These strategies could involve adjusting model parameters or implementing custom fairness interventions.
- Localized Training Data: Consider using localized or customized training data for certain subgroups to address specific cultural or contextual nuances.
- Sensitivity Analysis: Conduct sensitivity analyses to understand how changes in input features affect outcomes within different subgroups. These analyses can help identify and address biases at a granular level.
- User Feedback Loops: Establish feedback mechanisms that allow users to report biases or issues specific to their subgroup. Use this feedback to improve the algorithm continuously.
- Inclusive Design Principles: Incorporate inclusive design principles, involving users from diverse backgrounds in the development process to ensure that the algorithm meets the needs of all user groups.
- Regular Audits: Implement regular audits and evaluations of the algorithm's performance, focusing on both overall fairness and fairness within specific subgroups.
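The sketch below illustrates the subgroup analysis described above through an intersectional lens, breaking model accuracy down by combinations of demographic attributes. The demographic columns and data are illustrative assumptions.

```python
# Minimal sketch: intersectional subgroup analysis of model accuracy.
# The demographic columns and data are illustrative assumptions.
import pandas as pd

results = pd.DataFrame({
    "race":   ["X", "X", "Y", "Y", "X", "Y", "X", "Y"],
    "gender": ["F", "M", "F", "M", "F", "F", "M", "M"],
    "label":  [1, 0, 1, 0, 1, 1, 0, 0],
    "pred":   [1, 0, 0, 0, 1, 0, 1, 0],
})
results["correct"] = results["label"] == results["pred"]

# Accuracy for each race x gender intersection; small or missing cells also
# flag subgroups where the evaluation itself is under-powered.
summary = results.groupby(["race", "gender"])["correct"].agg(["mean", "count"])
print(summary)
```

Reporting the cell counts alongside accuracy matters: a subgroup with few rows may look fine or terrible purely by chance, which is itself a signal to collect more representative data.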
To approach bias mitigation comprehensively, developers should address both systemic biases between groups and biases within specific subgroups. Additionally, involving stakeholders from diverse backgrounds throughout the development process enhances the likelihood of creating fair and unbiased algorithms.
Thresholds in Post-Development
In machine learning, thresholds play a significant role in various post-development activities, especially in model deployment, monitoring, and decision-making processes. Thresholds are established to manage risk exposure and mitigate potential negative consequences of model predictions. Risk thresholds determine acceptable levels of false positives or false negatives based on the application's specific context and risk tolerance.
Model Deployment
Thresholding for Classification: In classification tasks, thresholds are used to determine the class assignment of model predictions.10 For example, in binary classification, a threshold is applied to convert probability scores into binary outcomes (e.g., class 1 if probability > 0.5, class 0 otherwise).
Threshold Optimization: Thresholds may be optimized during deployment to achieve specific performance objectives, such as maximizing accuracy, precision, recall, or minimizing false positives or false negatives.
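The sketch below shows both ideas: converting probability scores to classes with the default 0.5 rule, then sweeping candidate thresholds on validation data to pick one that maximizes a chosen metric (F1 here). The scores, labels, and metric are illustrative assumptions.

```python
# Minimal sketch: thresholding classifier scores and optimizing the threshold
# against F1 on validation data. Data and the metric choice are illustrative.
import numpy as np
from sklearn.metrics import f1_score

y_true  = np.array([0, 1, 1, 0, 1, 0, 1, 0, 1, 1])
y_score = np.array([0.2, 0.8, 0.6, 0.3, 0.9, 0.4, 0.55, 0.1, 0.45, 0.7])

# Default deployment rule: class 1 if probability > 0.5.
y_pred_default = (y_score > 0.5).astype(int)
print(f"F1 at default 0.5 threshold: {f1_score(y_true, y_pred_default):.2f}")

# Threshold optimization: evaluate candidate thresholds and keep the best one.
thresholds = np.linspace(0.1, 0.9, 17)
f1s = [f1_score(y_true, (y_score > t).astype(int)) for t in thresholds]
best = thresholds[int(np.argmax(f1s))]
print(f"best threshold by F1: {best:.2f} (F1={max(f1s):.2f})")
```

The same sweep can optimize precision, recall, or a cost-weighted metric instead of F1, depending on which errors are most harmful in the application.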
Model Monitoring
Anomaly Detection: Thresholds are set to detect anomalies or deviations in model predictions or performance metrics. For example, thresholds on prediction probabilities or residuals may trigger alerts when values exceed predefined limits, indicating potential issues or changes in data patterns.
Data Drift Monitoring: Thresholds are used to detect data drift by comparing current data distributions or statistical properties with baseline or historical data. Deviations beyond threshold values may signal shifts in data characteristics requiring model retraining or recalibration.
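The sketch below illustrates a simple drift check of this kind: comparing a production feature distribution against a training-time baseline with a two-sample Kolmogorov-Smirnov test and alerting when the statistic exceeds a fixed threshold. The data and the 0.1 threshold are illustrative assumptions.

```python
# Minimal sketch: data drift monitoring with a two-sample KS test and a fixed
# alert threshold. The baseline/current data and threshold are illustrative.
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(0)
baseline = rng.normal(loc=0.0, scale=1.0, size=5000)   # training-time feature
current  = rng.normal(loc=0.4, scale=1.0, size=5000)   # production feature

stat, p_value = ks_2samp(baseline, current)

DRIFT_THRESHOLD = 0.1  # assumed; tune to the application's risk tolerance
if stat > DRIFT_THRESHOLD:
    print(f"ALERT: drift detected (KS statistic {stat:.3f} > {DRIFT_THRESHOLD})")
else:
    print(f"No drift alert (KS statistic {stat:.3f})")
```

Running such a check per feature, and per demographic subgroup where sample sizes allow, helps catch shifts that degrade performance for some groups before they surface as complaints.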
Decision-Making Processes
Actionable Thresholds: Thresholds define decision boundaries for taking specific actions based on model predictions. For instance, in risk assessment applications, thresholds determine whether to approve or deny a loan application, initiate medical interventions, or escalate security alerts.
Ethical Thresholds: Thresholds can be set to enforce ethical considerations and regulatory requirements, such as fairness, privacy, or non-discrimination. Models may be designed to comply with predefined thresholds for fairness metrics or privacy constraints.
Explaining thresholds to end-users or decision-makers can facilitate understanding and trust in model predictions. Transparent communication about threshold settings fosters confidence in the reliability and fairness of model decisions.
Impact Assessment of Synthetic Data
Before deploying the model to end users, developers should conduct impact assessments through simulated tests on synthetic data. This is a prudent way to understand potential outcomes, evaluate risks, and improve the model's performance.
By simulating various scenarios and testing the model's performance on synthetic data that reflects the diversity of the population it will impact, developers can identify and address inequities or adverse outcomes before deployment.11 Additionally, incorporating an equity-focused blueprint for conducting impact assessments can provide a structured framework for evaluating risks, identifying potential sources of bias, and implementing corrective measures to enhance the model's fairness and effectiveness. The following blueprint can be useful in conducting such an assessment.
Define Objectives and Scenarios
Objective Setting: Clearly define the objectives of the impact assessment, such as evaluating model performance and fairness for underrepresented demographic groups, identifying potential biases, or assessing the business impact on those groups.
Scenario Generation: Develop a set of realistic scenarios that represent different use cases, edge cases, and potential challenges the model may encounter in production.
Synthetic Data Generation
Data Synthesis: Generate synthetic datasets that mimic the characteristics and distributions of real-world data. Ensure that synthetic data covers a wide range of scenarios, has ample representation of various demographic groups, and includes relevant features and patterns observed in actual data.
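The sketch below generates a small synthetic tabular dataset with an explicit demographic column and a known outcome rule, so that subgroup behavior can later be checked against ground truth. All distributions, group proportions, and the outcome rule are illustrative assumptions.

```python
# Minimal sketch: generating synthetic tabular data with explicit demographic
# representation for impact testing. All distributions are assumptions.
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
n = 10_000

synthetic = pd.DataFrame({
    # Group proportions chosen so every subgroup is well represented.
    "group":  rng.choice(["A", "B", "C"], size=n, p=[0.4, 0.3, 0.3]),
    "income": rng.lognormal(mean=10.5, sigma=0.6, size=n),
    "age":    rng.integers(18, 90, size=n),
})
# A synthetic outcome with a known relationship to the features, so that model
# behavior on each subgroup can be compared against a known ground truth.
synthetic["outcome"] = (
    (synthetic["income"] > synthetic["income"].median()) & (synthetic["age"] > 30)
).astype(int)

print(synthetic.groupby("group").size())
```

More realistic syntheses fit the marginal distributions and correlations of the actual data, but the principle is the same: every scenario and subgroup of interest must appear with enough samples to be evaluated.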
Model Evaluation
Model Testing: Apply the model to synthetic data to simulate real-world predictions and outcomes. Evaluate model performance using predefined metrics and benchmarks.
Bias Detection: Assess potential biases in model predictions by analyzing outcomes across different demographic groups or sensitive attributes. Use fairness metrics and techniques to identify and mitigate biases.
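The sketch below illustrates one common bias check at this stage, a demographic-parity comparison of selection rates across groups. The column names and the four-fifths (0.8) cutoff are illustrative assumptions.

```python
# Minimal sketch: demographic-parity check on simulated predictions over
# synthetic data. Column names and the 0.8 cutoff are assumptions.
import pandas as pd

preds = pd.DataFrame({
    "group": ["A", "A", "B", "B", "B", "A", "B", "A"],
    "pred":  [1, 0, 1, 1, 1, 0, 1, 1],
})

# Selection rate per group and the ratio to the most-selected group.
selection_rate = preds.groupby("group")["pred"].mean()
disparate_impact = selection_rate / selection_rate.max()
print(selection_rate)
print(disparate_impact)  # ratios well below ~0.8 warrant a closer look
```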
Risk Assessment
Risk Identification: Identify potential risks or vulnerabilities associated with model predictions and decision-making processes. Consider ethical, legal, and operational risks that may arise from model deployment.
Impact Analysis: Assess the potential impact of model errors or failures on stakeholders, customers, and the organization. Quantify risks and prioritize mitigation strategies based on severity and likelihood.
Sensitivity Analysis
Parameter Variation: Conduct sensitivity analysis by varying model parameters, hyperparameters, or input features to understand their impact on model predictions. Adopt a more granular approach by conducting sensitivity analysis on racial and demographic subgroups to understand how model predictions change for such groups. Identify robustness thresholds and determine how changes in parameters affect model performance and stability.
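The sketch below shows a granular sensitivity analysis of this kind: perturbing one input feature by a fixed amount and measuring the average shift in predicted probability per subgroup. The model, features, and perturbation size are illustrative assumptions.

```python
# Minimal sketch: sensitivity analysis by perturbing one input feature and
# measuring the prediction shift per subgroup. Model and data are illustrative.
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 2000
df = pd.DataFrame({
    "group": rng.choice(["A", "B"], size=n),
    "x1": rng.normal(size=n),
    "x2": rng.normal(size=n),
})
y = (df["x1"] + 0.5 * df["x2"] + rng.normal(scale=0.5, size=n) > 0).astype(int)

model = LogisticRegression().fit(df[["x1", "x2"]], y)

base = model.predict_proba(df[["x1", "x2"]])[:, 1]
perturbed = df[["x1", "x2"]].copy()
perturbed["x1"] += 0.5  # small, fixed perturbation of one feature
shift = model.predict_proba(perturbed)[:, 1] - base

# Average prediction shift per subgroup: a large difference suggests the model
# is far more sensitive to this feature for one group than another.
print(pd.Series(shift).groupby(df["group"]).mean())
```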
Validation and Iteration
Cross-Validation: Validate model performance on synthetic data using cross-validation techniques to ensure robustness and generalization.
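The sketch below illustrates the cross-validation step on a synthetic, class-imbalanced dataset using stratified folds. The model choice, metric, and data are illustrative assumptions.

```python
# Minimal sketch: validating on synthetic data with stratified k-fold
# cross-validation. The model, metric, and synthetic features are illustrative.
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

X, y = make_classification(n_samples=3000, n_features=10, weights=[0.8, 0.2],
                           random_state=1)

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=1)
scores = cross_val_score(GradientBoostingClassifier(random_state=1), X, y,
                         cv=cv, scoring="roc_auc")
print(f"ROC AUC: {scores.mean():.3f} +/- {scores.std():.3f}")
```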
Iterative Refinement: Iterate on model development based on insights gained from impact assessment. Refine model architecture, feature engineering, or training strategies to address identified risks and improve performance.
Documentation and Reporting
Results Documentation: Document findings from the impact assessment, including model evaluation results, risk assessments, and mitigation strategies.
Recommendations: Provide recommendations for model improvement, risk mitigation, and deployment strategies based on assessment outcomes.
Communication: Communicate assessment results and recommendations to relevant stakeholders, including data scientists, business leaders, and compliance officers.
Continuous Learning and Improvement
Feedback Incorporation: Incorporate feedback from the impact assessment into the model development process. Continuously monitor model performance in production and refine strategies based on real-world feedback and data.
Developers wishing to dive deeper into the technical aspects of ensuring equity in AI can access our GitHub site.
Reference this resource we created, Testing Guiding Questions, to support your discussion at this phase.
1. Molnar, C. (2023). Chapter 5: Interpretable Models. Interpretable Machine Learning. christophm.github.io
2. Model Interpretability. (2024). Machine Learning Best Practices in Healthcare and Life Sciences. Amazon Web Services. aws.amazon.com
3. Tuck, B. (2022). Four Steps to Measure and Mitigate Algorithmic Bias in Healthcare. ClosedLoop. closedloop.ai
4. Closing the AI Accountability Gap: Defining an End-to-End Framework for Internal Algorithmic Auditing. (2020). ACM Conferences. dl.acm.org
5. 4 Steps to Mitigate Algorithmic Bias. (2024). American Hospital Association. aha.org
6. Aequitas - The Bias Report. (2018). Dssg.io. aequitas.dssg.io
7. Goethals, S., Calders, T., & Martens, D. (n.d.). Beyond Accuracy-Fairness: Stop Evaluating Bias Mitigation Methods Solely on Between-Group Metrics. arXiv. arxiv.org
8. In-group Bias. (n.d.). The Decision Lab. thedecisionlab.com
9. Buolamwini, J., & Gebru, T. (2018). Gender Shades: Intersectional Accuracy Disparities in Commercial Gender Classification. Proceedings of Machine Learning Research, 81, 1–15. proceedings.mlr.press
10. Introduction to Operating Thresholds. (2018). The Official Blog of BigML.com. blog.bigml.com
11. Zhang, X., Pérez-Stable, E. J., Bourne, P. E., Peprah, E., Duru, O. K., Breen, N., Berrigan, D., Wood, F., Jackson, J. S., Wong, D. W. S., & Denny, J. (2017). Big Data Science: Opportunities and Challenges to Address Minority Health and Health Disparities in the 21st Century. Ethnicity & Disease, 27(2), 95–106. doi.org