5. Data Provenance

A description of the way in which the training data was collected should be maintained by the builders of the algorithms, accompanied by an exploration of the potential biases induced by the human or algorithmic data gathering process. Public scrutiny of the data provides maximum opportunity for corrections. However, concerns over privacy, protecting trade secrets, or revelation of analytics that might allow malicious actors to game the system can justify restricting access to qualified and authorized individuals.
Principle: Principles for Algorithmic Transparency and Accountability, Jan 12, 2017

Published by ACM US Public Policy Council (USACM)

Related Principles

· 2. Data Governance

The quality of the data sets used is paramount for the performance of the trained machine learning solutions. Even if the data is handled in a privacy preserving way, there are requirements that have to be fulfilled in order to have high quality AI. The datasets gathered inevitably contain biases, and one has to be able to prune these away before engaging in training. This may also be done in the training itself by requiring a symmetric behaviour over known issues in the training set. In addition, it must be ensured that the proper division of the data which is being set into training, as well as validation and testing of those sets, is carefully conducted in order to achieve a realistic picture of the performance of the AI system. It must particularly be ensured that anonymisation of the data is done in a way that enables the division of the data into sets to make sure that a certain data – for instance, images from same persons – do not end up into both the training and test sets, as this would disqualify the latter. The integrity of the data gathering has to be ensured. Feeding malicious data into the system may change the behaviour of the AI solutions. This is especially important for self learning systems. It is therefore advisable to always keep record of the data that is fed to the AI systems. When data is gathered from human behaviour, it may contain misjudgement, errors and mistakes. In large enough data sets these will be diluted since correct actions usually overrun the errors, yet a trace of thereof remains in the data. To trust the data gathering process, it must be ensured that such data will not be used against the individuals who provided the data. Instead, the findings of bias should be used to look forward and lead to better processes and instructions – improving our decisions making and strengthening our institutions.

Published by The European Commission’s High-Level Expert Group on Artificial Intelligence in Draft Ethics Guidelines for Trustworthy AI, Dec 18, 2018

· 7. Respect for Privacy

Privacy and data protection must be guaranteed at all stages of the life cycle of the AI system. This includes all data provided by the user, but also all information generated about the user over the course of his or her interactions with the AI system (e.g. outputs that the AI system generated for specific users, how users responded to particular recommendations, etc.). Digital records of human behaviour can reveal highly sensitive data, not only in terms of preferences, but also regarding sexual orientation, age, gender, religious and political views. The person in control of such information could use this to his her advantage. Organisations must be mindful of how data is used and might impact users, and ensure full compliance with the GDPR as well as other applicable regulation dealing with privacy and data protection.

Published by The European Commission’s High-Level Expert Group on Artificial Intelligence in Draft Ethics Guidelines for Trustworthy AI, Dec 18, 2018

· Prepare Input Data:

1 Following the best practice of responsible data acquisition, handling, classification, and management must be a priority to ensure that results and outcomes align with the AI system’s set goals and objectives. Effective data quality soundness and procurement begin by ensuring the integrity of the data source and data accuracy in representing all observations to avoid the systematic disadvantaging of under represented or advantaging over represented groups. The quantity and quality of the data sets should be sufficient and accurate to serve the purpose of the system. The sample size of the data collected or procured has a significant impact on the accuracy and fairness of the outputs of a trained model. 2 Sensitive personal data attributes which are defined in the plan and design phase should not be included in the model data not to feed the existing bias on them. Also, the proxies of the sensitive features should be analyzed and not included in the input data. In some cases, this may not be possible due to the accuracy or objective of the AI system. In this case, the justification of the usage of the sensitive personal data attributes or their proxies should be provided. 3 Causality based feature selection should be ensured. Selected features should be verified with business owners and non technical teams. 4 Automated decision support technologies present major risks of bias and unwanted application at the deployment phase, so it is critical to set out mechanisms to prevent harmful and discriminatory results at this phase.

Published by SDAIA in AI Ethics Principles, Sept 14, 2022

Plan and Design:

1 The planning and design of the AI system and its associated algorithm must be configured and modelled in a manner such that there is respect for the protection of the privacy of individuals, personal data is not misused and exploited, and the decision criteria of the automated technology is not based on personally identifying characteristics or information. 2 The use of personal information should be limited only to that which is necessary for the proper functioning of the system. The design of AI systems resulting in the profiling of individuals or communities may only occur if approved by Chief Compliance and Ethics Officer, Compliance Officer or in compliance with a code of ethics and conduct developed by a national regulatory authority for the specific sector or industry. 3 The security and protection blueprint of the AI system, including the data to be processed and the algorithm to be used, should be aligned to best practices to be able to withstand cyberattacks and data breach attempts. 4 Privacy and security legal frameworks and standards should be followed and customized for the particular use case or organization. 5 An important aspect of privacy and security is data architecture; consequently, data classification and profiling should be planned to define the levels of protection and usage of personal data. 6 Security mechanisms for de identification should be planned for the sensitive or personal data in the system. Furthermore, read write update actions should be authorized for the relevant groups.

Published by SDAIA in AI Ethics Principles, Sept 14, 2022

· Prepare Input Data:

1 An important aspect of the Accountability and Responsibility principle during Prepare Input Data step in the AI System Lifecycle is data quality as it affects the outcome of the AI model and decisions accordingly. It is, therefore, important to do necessary data quality checks, clean data and ensure the integrity of the data in order to get accurate results and capture intended behavior in supervised and unsupervised models. 2 Data sets should be approved and signed off before commencing with developing the AI model. Furthermore, the data should be cleansed from societal biases. In parallel with the fairness principle, the sensitive features should not be included in the model data. In the event that sensitive features need to be included, the rationale or trade off behind the decision for such inclusion should be clearly explained. The data preparation process and data quality checks should be documented and validated by responsible parties. 3 The documentation of the process is necessary for auditing and risk mitigation. Data must be properly acquired, classified, processed, and accessible to ease human intervention and control at later stages when needed.

Published by SDAIA in AI Ethics Principles, Sept 14, 2022