https://www.coursera.org/learn/data-science-ethics/home/welcome
Started on [[22-04-2022]]
- My aim for this course: I want to prepare myself so that when I am working on [[HOPES Project Index]] I can be guided by ethics. I am concerned about the impact of biases that are built into designs.
- ## What are Ethics? - Introduction
  - The word "Ethics" is derived from the Greek word Ethos, meaning habit or custom. Ethics tells us about right and wrong. Philosophers have long puzzled over this important subject, and have many insightful things to say; in fact, there are even multiple schools of philosophical thought on the topic. Our course is not about these deep philosophical questions, but about the ethical practice of Data Science. If it is the study of Ethics, broadly, that you care about, please choose a suitable course in philosophy. For example, Michael Sandel's course on Justice is fantastic.
  - To most people, ethics has to do with morality: some innate sense of what is "good." For many, their morals are informed by religious teachings. Even though people may follow different religions, most religions agree on many moral principles; for example, that it is wrong to steal. However, there are significant differences too, which is why we can have passionate debates on topics such as the morality of abortion and of the death penalty. Furthermore, corporations are not human beings, and it is tricky to ascribe morality to companies. Yet we do want corporations to act ethically.
  - In this course, we would like to sidestep these difficult questions so that we can get straight to what matters to us: the ethical practice of Data Science. Our goal is to understand the large areas where we can quickly come to near-universal agreement, and to leave for future discussion the boundaries of these areas, where there can be substantial disagreement. Furthermore, we would like our framework to include both human and corporate actors.
  - For these reasons, we choose a very simple utilitarian framework for ethics. If you are a philosopher, you will likely find this framework overly simple. Yet it suffices for the questions we discuss in this course, and for most questions that arise in Data Science.
  - From an ethicist's point of view, the "interesting" questions are the ones where it is not easy to determine right from wrong. Many ethics courses begin with a discussion of the "trolley" problem. In a nutshell: a runaway trolley car is about to run over five innocent people; you can throw a switch to divert it to a different track where it will run over only one person. Should you throw the switch? In other words, should you perform an action that kills one person to save five others? If you begin from the moral position "it is wrong to kill," you have to work hard to develop the exception that lets you throw the switch. And if you do come to a clear answer, it is easy to change the problem statement slightly to make your choice more difficult: save the life of one child by killing one adult, perform the killing through a more direct action (like pushing a person onto the track) rather than just throwing a switch, and so on.
  - Our needs in this course are different. Data Science is a young field. Big Data is having a huge impact on society, and we are still trying to understand this impact. We would like to practice Data Science ethically, and understand what that involves.
  - So we are most interested in the broad areas where it is easy for us to come to a consensus about right and wrong. We would like to define the basic norms of "socially acceptable behavior" for practitioners of Data Science. Yes, there will remain issues where it is not so easy to agree about what is right. Our goal is not to focus on those issues, but rather on the many more issues where we can come to agreement.
- When I take a picture of you, or collect data about you, who owns that data? What are the implications of using that data on you? These are complex ethical questions; Data Science is a young field with large implications.
- ## What are Ethics? #ethics
  - Data science is not just about techniques and methods; it also needs to consider ethics.
  - Ethics: shared principles of "right and wrong" accepted by people.
  - Religions promote ethics, but ethics need not be religious.
  - Ethics are not laws; they are norms of social behavior. Laws often follow ethics, and are often created to enforce ethical behavior.
  - Economics (shared value): the Tragedy of the Commons. If every farmer lets their sheep graze freely on the common shared field, the common space will be destroyed.
  - ![Screenshot 2022-04-22 at 7.59.45 AM.png](./assets/Screenshot_2022-04-22_at_7.59.45_AM_1650585587568_0.png)
  - Everyone benefits if we all refrain from consuming too much.
  - e.g. driving rules, littering rules: if everyone does whatever they like, chaos!
- ## Data Science Needs Ethics
  - Civilisation is based on rules.
  - With data today there are many possibilities of what we *can* do, but *should* we do them? Are there things we can do but shouldn't?
  - How does data science have impact? Consequences for privacy, fairness, etc.
- ## Human Subjects Research and Informed Consent: Part 1
  - The Tuskegee syphilis study; informed consent requirements were created in its aftermath.
  - Human participants must give voluntary consent and retain the ability to withdraw from participation.
  - Benefit vs harm: if the benefit goes to one party and the harm to another, the two must be balanced.
  - Institutional Review Board (IRB).
  - Exceptions to informed consent (e.g. in psychology research) must be reviewed, and there must be minimal harm to participants.
- ## Human Subjects Research and Informed Consent: Part 2
  - Traditional studies use prospective data collection: first reviewed by an IRB, then the data is collected, then analysed. But this applies to research institutions, not to businesses.
  - Businesses conduct A/B testing all the time: showing different segments of consumers different versions of an app or web design (a minimal sketch in code appears below, after the Limitations of Informed Consent section). ![Screenshot 2022-04-22 at 8.39.36 AM.png](./assets/Screenshot_2022-04-22_at_8.39.36_AM_1650587978783_0.png)
  - Good for business, but the implication is that humans become subjects of their research without consent.
  - Facebook/Cornell experiment: users were shown more positive or more negative posts to study emotional contagion. Published in PNAS in 2014, but Facebook users did not know they were research subjects.
  - OkCupid: the business thought such experiments were a good idea, but the social consensus now is that they are not.
- ## Limitations of Informed Consent
  - Retrospective data analysis: nowadays data is collected first and analysed later.
  - Consent is not really informed; it is hidden in the fine print of terms and conditions.
  - We may consent to give data to a merchant for a specific service, but not for it to be used for any other purpose.
  - "Context" matters: data collected in a specific context should not be repurposed for another context.
  - Repurposing of data: e.g. credit card companies, medical data. Must consider the impact on the humans involved.
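- As referenced above, a minimal sketch of what an A/B test might look like in code. Everything here is invented for illustration (the bucketing rule, the 5% vs 6% click rates, the sample sizes); it is a toy, not any particular company's procedure.
```python
import random

from scipy import stats

def assign_variant(user_id: int) -> str:
    """Deterministically bucket a user into variant A or B (toy rule)."""
    return "A" if user_id % 2 == 0 else "B"

# Simulate observed outcomes: 1 = clicked, 0 = did not click.
# The underlying 5% vs 6% click rates are made up for this sketch.
random.seed(42)
clicks_a = [1 if random.random() < 0.05 else 0 for _ in range(10_000)]
clicks_b = [1 if random.random() < 0.06 else 0 for _ in range(10_000)]

# Two-sample t-test: is the difference in click rate plausibly real?
t_stat, p_value = stats.ttest_ind(clicks_a, clicks_b)
print(f"A: {sum(clicks_a) / len(clicks_a):.2%}")
print(f"B: {sum(clicks_b) / len(clicks_b):.2%}")
print(f"p-value: {p_value:.4f}")
```
- The lecture's point stands regardless of the statistics: each entry in `clicks_a` and `clicks_b` stands for a human being who was experimented on without informed consent.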
- ## Data Ownership
  - It is hard to give credit for data use, because data is mixed with many other things to create a model.
  - With old technology, if I took that photo, it was owned by me.
  - The collection and aggregation that create value become theirs, even though the data is sourced from other places.
  - Data is owned by whoever is recording the data.
  - Expectation of privacy: changing rooms, etc. Expectations apply in situations where there is no contract.
  - Telcos knowing your location, etc.
  - Limit the use and recording of data.
- Government surveillance: record first... use later if a need arises.
- Watched the case study about "Rate My Professors": third-party ratings of professors, who have little recourse over what is said about them.
  - I am thinking about IMH data about patients: it belongs to IMH, and there is little patients can do, no recourse to change that data.
---
# Week 2 - Privacy
- Panopticon #privacy How is our behavior affected because we think, or know, that we are being observed?
- Privacy is a basic human need: toileting (though the Romans had communal latrines), voting.
- History of privacy: the type of disclosure, and the sort of metadata... what is considered private or not can be hard to decide. E.g. giving the phone company the number you are dialling, vs the content of the conversation.
- Privacy values are a changing attitude.
- No option to exit: there is no escape now. Big Data never forgets. In the past people forgot; you could move to a new place and start a new life.
  - Right to be forgotten: not really forgotten, just harder to find, since once something is on the internet, it is there forever.
- Degrees of privacy
  - Privacy is not non-disclosure. It is the exercise of control.
  - Sharing depends on purpose and context.
- How to control
  - All-or-nothing agreements are problematic.
  - We have to expect that free services come at a price.
  - Better to provide graduated choices, allowing users to choose how much privacy to trade for how much service.
- Friends can harm privacy: you may not share something, but your friends post it.
- Relatives can harm privacy: if you share your DNA data, researchers also learn about your relatives' DNA (at least parts of it).
- Collection vs use: just because data is collected doesn't mean it will be used, or will harm you. Like a security camera.
- Data surveillance: I am OK with you collecting, but I am not OK with you sharing it without my permission. I need a certain sense of control over that.
- Modern privacy risks
  - Main drivers: surveillance, advertising.
  - Data sources: sensors; open government vs privacy.
  - Data brokers: aggregate and link information from multiple sources to create a more complete profile of people.
  - "Waste" data collection: e.g. showing proof of age, or TraceTogether. What will they do with that data?
- Metadata: caller, callee, time/date, duration... location data, type of phone. From metadata alone, one can infer your location, that you have been to church, your denomination, some of your beliefs (a toy sketch in code appears at the end of this week's notes).
- Do not under-estimate analysis: a smart water meter monitors signatures of water use and can tell whether you are flushing the toilet or washing clothes. The water company can tell when you go to the toilet; that might be considered an invasion of privacy.
- Trust => Design: in the past we shared information on the basis of trust; in modern data systems there is no basis for trust, only legal agreements, which means there is no trust. Privacy must be by design, not left to defaults.
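- To make the metadata point above concrete, a toy sketch: all records below are fabricated, and the point is only that a simple groupby over who/when/how long, with no conversation content at all, already suggests religion and medical status.
```python
import pandas as pd

# Fabricated call metadata: no conversation content, only who/when/how long.
records = pd.DataFrame({
    "caller": ["alice"] * 6,
    "callee": ["church_hotline", "pharmacy", "church_hotline",
               "pharmacy", "church_hotline", "oncology_clinic"],
    "weekday": ["Sun", "Mon", "Sun", "Mon", "Sun", "Tue"],
    "duration_min": [12, 3, 15, 4, 11, 22],
})

# Pure metadata analysis: call counts and total time per callee and weekday.
profile = (records
           .groupby(["callee", "weekday"])
           .agg(calls=("duration_min", "size"),
                total_min=("duration_min", "sum")))
print(profile)
# Every-Sunday calls to a church hotline suggest a religious affiliation;
# repeated calls to an oncology clinic suggest a medical condition.
```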
---
# Week 3 - Data Validity
- What is validity? Bad data and bad models can be harmful, with real-world consequences.
- Different sources of error
  - ![[Screenshot 2022-05-10 at 8.42.35 PM.png]]
- **Choice of Representative Sample**
  - The drunk looking for his keys under the lamp post: we use whatever data is available, but that data may not be valid.
  - Representativeness of the population: we may hear only those who are the loudest.
  - "Not everything that can be counted counts, and not everything that counts can be counted" - William Bruce Cameron.
  - To get representative data, we can balance important attributes (see the stratified-sampling sketch just before the case studies below). But we may not know which data is missing and therefore needs balancing.
  - When an AI doesn't have enough training data, it mislabels patterns. Like Google mislabelling Black people as gorillas.
  - Projecting to a future population
    - ==Something to think about: can past or current data predict future psychosis or other mental illness? [[HOPES Project Index]]== ^1c8a63
    - ![[Screenshot 2022-05-10 at 8.48.30 PM.png]]
    - Society changes over time. A model trained a while ago may no longer work.
- **Choice of Attributes and Measures**
  - Limited by what is available ![[Screenshot 2022-05-10 at 8.52.40 PM.png]]
  - What attributes are included, and how you measure them.
  - Are we measuring what we want to measure?
- **Errors in Data Processing**
  1. Technology errors: extracting sentiment from text, recognising faces in photos, merging two records for the same person, etc.
  2. Lots of human and subjective errors: e.g. credit reports. Agencies compile credit reports and give a score; errors are commonly found, and they affect whether people can take a loan or must pay higher interest rates.
  - Third-party data: we have no way to correct erroneous data.
  - Desiderata (what should be in place): substantiated, access, accountability.
- **Errors in Model Design**
  - ![[Screenshot 2022-05-10 at 9.04.21 PM.png]] ![[Screenshot 2022-05-10 at 9.04.37 PM.png]]
  - Extrapolation may not represent the real world. ![[Screenshot 2022-05-10 at 9.06.13 PM.png]] ![[Screenshot 2022-05-10 at 9.07.02 PM.png]]
  - A true statement, but it says nothing about taller vs shorter men. ![[Screenshot 2022-05-10 at 9.07.56 PM.png]]
  - Fallacy: ![[Screenshot 2022-05-10 at 9.09.19 PM.png]]
    - It doesn't mean women were discriminated against; rather, more women applied to Hard U and were not accepted (this is Simpson's paradox; a worked example in code appears just before the case studies below).
- **Managing Change**
  - ![[Screenshot 2022-05-10 at 9.12.38 PM.png]] The real world is a complex system that changes, and when it changes, is the analysis still valid?
  - ![[Screenshot 2022-05-10 at 9.13.32 PM.png]] Social decision making: working on the metric.
  - ![[Screenshot 2022-05-10 at 9.14.29 PM.png]] Because we chose ONE metric as important, people derive ways to manipulate that metric.
  - Managing incentives: bad data, e.g. people just giving fake information.
  - ![[Screenshot 2022-05-10 at 9.15.42 PM.png]]
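- A minimal sketch of "balancing important attributes" via stratified sampling, as referenced in the representative-sample section above. The group labels and sizes are invented, and the lecture's caveat applies: this only works for attributes we know we need to balance on.
```python
import pandas as pd

# Fabricated convenience sample: the majority group is heavily
# over-represented ("those who are the loudest").
population = pd.DataFrame({
    "group": ["majority"] * 900 + ["minority"] * 100,
    "outcome": range(1000),
})

# Stratified sample: draw the same number of rows from each group so the
# downstream analysis is not dominated by the majority.
n_per_group = 100
balanced = (population
            .groupby("group", group_keys=False)
            .apply(lambda g: g.sample(n=n_per_group, random_state=0)))
print(balanced["group"].value_counts())
# majority    100
# minority    100
```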
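- The "Hard U" fallacy above is Simpson's paradox. A worked toy example with invented numbers: within each school women are admitted at a higher rate than men, yet in aggregate they appear worse off, purely because far more women applied to the school that rejects most applicants.
```python
import pandas as pd

# Invented admissions numbers in the spirit of the "Hard U" example.
apps = pd.DataFrame([
    ("Easy U", "men",   800, 480),   # 60% admitted
    ("Easy U", "women", 100,  65),   # 65% admitted
    ("Hard U", "men",   200,  40),   # 20% admitted
    ("Hard U", "women", 900, 225),   # 25% admitted
], columns=["school", "gender", "applied", "admitted"])

# Within EACH school, women are admitted at a HIGHER rate than men...
per_school = apps.assign(rate=apps["admitted"] / apps["applied"])
print(per_school)

# ...yet aggregated over both schools, women look far worse, only because
# many more women applied to the school that rejects most applicants.
overall = apps.groupby("gender")[["applied", "admitted"]].sum()
overall["rate"] = overall["admitted"] / overall["applied"]
print(overall)   # men: 520/1000 = 52%, women: 290/1000 = 29%
```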
- **Case Studies**
  1. Male and female mice have different biological processes, and experiments need to be conducted on both male and female subjects.
     1. Collect data from the right sample, one that represents the population.
  2. Algorithms and race: racial discrimination. Search engines show ads they predict are relevant to the user; the engine associated Black-sounding names with a higher chance of criminal records.
  3. Algorithms in the office: who to hire.
     1. What kind of qualities will succeed in the firm? For a male-majority firm, the AI concludes that males will perform better, so the AI perpetuates the hiring of male candidates.
  4. Germanwings crash
     1. Privacy concerns: should mental health records be released for reasons of public health/safety?
     2. If people worry about confidentiality, they will not seek treatment, making it more dangerous for everyone.
     3. More random tests, like alcohol tests for pilots, to pick up issues instead of revealing medical records?
  5. Google Flu
     1. Systems change over time.
     2. Need some ground-truth data first; then search results can help build a better predictive model.
---
# Algorithmic Fairness
- Algorithms can be biased; they do not just "follow the data".
  - E.g. hiring algorithms at a company with a "boys' club" culture: bias against women in the hiring process.
- Correlated attributes
  - Racial discrimination via surrogates: one cannot decide by race, so postal codes (richer or poorer neighbourhoods) are used instead.
  - Proxy attributes: attributes that are correlated with race (a toy sketch appears at the end of these notes).
- Unintentional discrimination: intent matters; decide what kind of discrimination to avoid.
- Correct but misleading results
  - ![[Screenshot 2022-05-11 at 8.14.33 PM.png]] Visualisation of data can be misleading.
  - Hotel rankings use average scores, which hide the distribution: a hotel rated mostly 3s can average the same as one rated a mix of 1s and 5s, e.g. {3, 3, 3, 3} and {1, 5, 1, 5} both average 3.0 yet describe very different hotels.
- Diversity suppression (the handedness example): the minority loses, because the majority will influence the minority, e.g. in a hiring process, and the AI keeps learning from the flawed result.
  - E.g. medical: when comparing two groups of patients, if Group A is the majority, only it may have enough samples to reach a suitable significance level.
  - Don't interpret results in an oversimplified manner. #research
- P-hacking
  - P-value: p < 0.05, i.e. 5%.
  - Multiple hypothesis testing ![[Screenshot 2022-05-11 at 8.26.52 PM.png]]
  - If you test many things, something will come out significant purely by chance (see the simulation at the end of these notes).
  - Unreported failures: one should design the experiment first, then look for data. Many now reverse that, developing the hypothesis after collecting the data.
- Case Studies
  - The Chummy Maps example. Drama: taking crime statistics and overlaying them on neighbourhoods creates bias against Black neighbourhoods and businesses.
  - Ratings come from user reviews, and the users are mostly white.
  - Real-world impact on businesses that people stop visiting. ==Importance of sampling data.==
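- A toy sketch of the proxy-attribute point above: a decision rule that never looks at the protected attribute can still discriminate, because a correlated surrogate (here a fabricated postal code) carries the same signal. All names and numbers are invented.
```python
import random

random.seed(0)

# Fabricated population in which postal code is strongly correlated with a
# protected attribute (think: segregated neighbourhoods).
people = []
for _ in range(10_000):
    group = random.choice(["blue", "green"])
    home_area = "1xxxx" if group == "blue" else "2xxxx"
    other_area = "2xxxx" if group == "blue" else "1xxxx"
    postal = home_area if random.random() < 0.9 else other_area
    people.append((group, postal))

def approve(postal: str) -> int:
    """A 'group-blind' rule: it only reads the postal code (favours area 1)."""
    return 1 if postal == "1xxxx" else 0

# Approval rates still split starkly by group, without the rule ever reading it.
for g in ("blue", "green"):
    rates = [approve(postal) for group, postal in people if group == g]
    print(g, round(sum(rates) / len(rates), 2))
# blue  ~0.9
# green ~0.1
```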
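- A small simulation of the multiple-hypothesis-testing point above: run enough tests on pure noise and some will come out "significant" at p < 0.05 by chance alone. The numbers (100 tests, 50 samples each) are arbitrary; a Bonferroni correction is shown as one standard remedy.
```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
n_tests, n_samples, alpha = 100, 50, 0.05

# 100 hypotheses, ALL null: both groups are drawn from the same distribution,
# so any "significant" result is a false positive.
p_values = [stats.ttest_ind(rng.normal(size=n_samples),
                            rng.normal(size=n_samples)).pvalue
            for _ in range(n_tests)]

false_hits = sum(p < alpha for p in p_values)
print(f"'Significant' results among {n_tests} null tests: {false_hits}")
# Roughly alpha * n_tests = 5 expected purely by chance.

# Bonferroni correction: require p < alpha / n_tests instead.
survivors = sum(p < alpha / n_tests for p in p_values)
print(f"Surviving Bonferroni correction: {survivors}")  # typically 0
```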