Data is the new IP – AI can’t exist without a strong data acquisition and curation strategy. Planning the data pipeline, governance, growth, and regular model updates needs to be part of the AI strategy from the outset.
Kirsten Gokay: Thank you so much for joining my session, AI Data Acquisition and Governance: Considerations For Success. My name is Kirsten Gokay and I’m a Product Manager at Appen. Today, I’m going to cover a few key areas in the AI life cycle. These are defining AI governance, data governance and how it fits into AI governance, the growing necessity of AI, planning the training data pipeline, and maintaining the models you use to power your AI. Let’s start with defining AI governance. AI governance is the framework that guides an organization’s AI usage and implementation. How an organization defines its AI governance framework may be influenced by the industry it’s a part of, internal corporate rules and regulations, or its local laws.
There’s no one-size-fits-all approach here, so each organization has to determine what works best to model its values and suit its needs. Depending on your source, you may see different key areas of AI governance, as again, there is no one concrete definition. But here, I’ll talk about three that consistently pop up, which are performance, transparency, and ethics. Performance includes things like accuracy and bias. Accuracy refers to how well AI performs when tested on real-world data. Bias, or fairness, is generally pretty reflective of human bias and prejudice. Digging a bit more into performance, let’s look at accuracy. We can view accuracy in different ways.
You may choose to look at overall accuracy of your data. So essentially, for a randomized subset of your AI’s predictions, what percentage of the predictions are correct in total? This can be problematic if your data set is not well-rounded. For example, you might audit 100 randomized rows of your data, see that your overall accuracy is 98%, and think that means you have an accurate model. But if you dig into the 2% error rate and see that it mostly consists of an underrepresented class, then the accuracy in that class could be very close to zero because of how sensitive it is to errors. This is where precision and recall come in as more insightful accuracy indicators.
In that scenario, you’d instead see that your precision and/or recall are quite low for that class, indicating that your model needs more training there. There is a trade-off for precision versus recall. In some cases, precision might be more important, and vice versa. A good rule of thumb is to consider the real-world implications of what happens if your model isn’t precise enough or misses too many cases. An example I like to use here is in healthcare versus e-commerce. If you’re training a model to identify pneumonia in lung x-rays, you probably want pretty high recall because you want to find as many cases of pneumonia as possible at the cost of precision. It’s probably better overall to have a false positive than a false negative in this case.
On the other hand, if you’re training a natural language processing model to recognize negative product reviews for customer support follow-up, you may want higher precision at the cost of recall so that your support agents aren’t wasting time reviewing potentially non-negative write-ups. The implication of the pneumonia model having too low recall could mean a patient doesn’t get the treatment they need. High recall may simply mean the doctors spend a little extra time reviewing x-rays. The implications of the product review model having low recall could potentially mean fewer customers, but high precision means support agents aren’t spending extra time reviewing positive write-ups, which they really wouldn’t need to be doing in this case.
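To make the accuracy-versus-recall distinction concrete, here is a minimal sketch. The confusion-matrix counts are hypothetical, chosen to mirror the "98% accurate" audit described above, where the rare class is almost entirely missed.

```python
# Why overall accuracy can hide poor performance on an underrepresented class.
# The counts below are hypothetical illustration, not real audit data.

def precision_recall(tp, fp, fn):
    """Compute precision and recall from confusion-matrix counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall

# 100 audited predictions: 97 true negatives, 1 true positive,
# 0 false positives, 2 false negatives (the underrepresented class).
tp, fp, fn, tn = 1, 0, 2, 97

accuracy = (tp + tn) / (tp + fp + fn + tn)
precision, recall = precision_recall(tp, fp, fn)

print(f"accuracy:  {accuracy:.2f}")   # 0.98 -- looks great overall
print(f"precision: {precision:.2f}")  # 1.00
print(f"recall:    {recall:.2f}")     # 0.33 -- two of three positives missed
```

The 98% headline number and the 33% recall describe the same model; which one matters depends on the real-world cost of a miss, as in the pneumonia example.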
The completeness of context also plays a big part in the performance of your model. For example, if you’re using AI to predict how busy a store might be at a given time, it needs to take into account the day of the week, and whether it’s a holiday, when most businesses tend to close. Only having the patron density data of the store might cause AI to predict that it will be busy at 6:00 PM on a Thursday. But if that Thursday happens to be Thanksgiving, the store is probably closed. AI needs to perform in the real world, so being able to use context in the way that humans do can help its accuracy. Another aspect of performance is bias or fairness. There are different ways to introduce bias into AI, two of which I’ll go over here.
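One way to give a model that context is to encode calendar information as features alongside the raw density data. The holiday table and function names below are hypothetical; a real system would use a proper holiday calendar.

```python
# A hypothetical sketch of adding calendar context to patron-density
# features, so a model can tell an ordinary Thursday from a holiday.
from datetime import date

# Toy holiday table for illustration only.
US_HOLIDAYS = {date(2023, 11, 23): "Thanksgiving"}

def make_features(when: date, hour: int, recent_density: float) -> dict:
    """Combine raw patron density with calendar context for a busyness model."""
    return {
        "hour": hour,
        "day_of_week": when.weekday(),      # 0 = Monday ... 6 = Sunday
        "is_holiday": when in US_HOLIDAYS,  # most stores close on holidays
        "recent_density": recent_density,
    }

# 6:00 PM on Thanksgiving 2023: without is_holiday, the model only sees
# a normal Thursday evening and would over-predict busyness.
print(make_features(date(2023, 11, 23), 18, 0.8))
```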
First, there’s sampling bias. This comes down to the type of data collected to train a model, and it’s usually a result of collecting a skewed dataset. This can directly impact the precision and recall of your model. So accuracy and bias are pretty tightly coupled in this way. In order to train an accurate model, you need to ensure your data set reflects the real world as much as possible in the ways that your AI will be interacting with it. For example, thinking about our x-ray scenario, if you don’t have enough x-rays showing pneumonia, your model is biased towards healthy lungs and it will underpredict pneumonia. Another type of bias here is in the annotating of the training data, or the human bias.
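One common mitigation for a skewed dataset is to rebalance it before training, for example by oversampling the underrepresented class. This is a minimal sketch with made-up labels; real pipelines often use library support or collect more data instead.

```python
# A minimal oversampling sketch: duplicate minority-class rows until the
# classes are balanced. Labels below are illustrative only.
import random

def oversample(rows, label_key="label", seed=0):
    """Return a copy of rows with minority classes duplicated to parity."""
    rng = random.Random(seed)
    by_label = {}
    for row in rows:
        by_label.setdefault(row[label_key], []).append(row)
    target = max(len(group) for group in by_label.values())
    balanced = []
    for group in by_label.values():
        balanced.extend(group)
        balanced.extend(rng.choices(group, k=target - len(group)))
    return balanced

# A skewed x-ray dataset: 95 healthy lungs, only 5 with pneumonia.
xrays = [{"label": "healthy"}] * 95 + [{"label": "pneumonia"}] * 5
balanced = oversample(xrays)
print(sum(r["label"] == "pneumonia" for r in balanced))  # 95
```

Oversampling is only a partial fix: duplicated rows add no new information, so collecting genuinely representative data is still the better option when possible.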
This is hard to avoid, but it can be mitigated by training the people annotating data to avoid introducing their own prejudices into their work. Bias can have pretty severe consequences in the real world, so it’s paramount that organizations work to mitigate it. How you measure performance in your AI implementation is important for governance. You want to ensure you have high accuracy in whichever area is most important for your organization while also making sure the data used to train your AI is well-rounded to eliminate as much bias as possible. Next up, let’s talk about transparency. One aspect of this is explainability, which is the ability to explain why AI has come to a particular decision.
This does not mean that you need to open up the hood, so to speak, and point to an exact line of code as the culprit, but you need to have an understanding of the data the model is trained on, the features, the inputs. This is important for heavily regulated industries, such as credit or loan processing, to ensure that no legal bias has influenced the AI decision. This also comes up more as more countries are making laws similar to the European Union’s General Data Protection Regulation, or GDPR, which gives individuals the right to know how their data was used to reach a decision. This is one particular area of AI governance that is often impacted by legislation. You should also keep in mind the objective of the AI implementation. What’s the end goal? How’s it going to be used?
Part of transparency is being able to answer these and other questions. Finally, we get into ethics. This is a pretty sticky area, because obviously, not everyone has the same set of ethics. This is where the industry that your organization operates within will probably play a big role. Based on these ethics, your AI governance framework should address the intent of your AI implementation and the responsibility you have to ensure that its actions align with that intent. This is different from the objective, which is the end goal of the implementation. For example, an objective might be to accurately forecast weekly sales at a grocery store. The intent behind this objective might be to decrease food waste in the supply chain.
So intent should drive objective. Those that are building out artificial intelligence have a responsibility to use it with its impact in mind. AI is a really powerful technology and it should be treated as such. Let’s round out this section with some case studies on the consequences of bias, just for fun. The COMPAS algorithm, which is probably one of the most common examples used when discussing bias in AI, stands for Correctional Offender Management Profiling for Alternative Sanctions. It’s used in various parts of the US to provide a recidivism score for a defendant. This score is one piece of information used by judges in their sentencing of that defendant.
The problem with this algorithm is that it’s been shown to provide a high-risk score for black defendants at a rate twice as high as that for white defendants, though the actual rate of recidivism is about the same for defendants of either race. The algorithm itself doesn’t use race as a factor in its calculations, but it’s trained on biased data, such as the area that the defendant is from. Often, neighborhoods where people of color live are targeted more heavily by police, resulting in higher arrest numbers. So the training data is biased. It’s also important to note that the COMPAS algorithm is considered proprietary, so only the company that built it knows how it works, which really provides no transparency into the AI. Here, we have bias and transparency problems.
In the world of facial recognition, most of the major tech companies have their own AI offering and several have been shown to have pretty abysmal error rates for black people and people of color. In a 2018 study by the Algorithmic Justice League, it was found that the error rates for guessing gender based on a data set of roughly 1,300 faces was about 21% for darker skinned women in the Microsoft model and 35% in the IBM model while each having an error rate below 1% for light skinned men. This is a direct result of the training data being heavily skewed towards white males, and therefore, the models having inadequate data points for training.
It is important to note that IBM has recently stopped working on facial recognition technology after recognizing the associated risks. Amazon had built an AI to recommend candidates in their hiring process. It was trained on resumes submitted to the company over a 10-year period, which unfortunately were mostly submitted by men. So the model often downgraded resumes from women, marking them as less qualified. Amazon scrapped the project once they realized what was going on. Finally, just another really insidious way that bias impacts AI is in US healthcare. One particular algorithm that’s commonly used to help hospitals and health insurance companies determine which individuals may benefit from high-risk care management programs was found to incorrectly calculate similar risk scores for black and white patients.
In a group of patients who were all given high risk scores, the black patients actually had about 26% more chronic illnesses than the white patients. So all people in this group were given similar risk scores despite being at dissimilar risks. The data used to train this model relied heavily on previous healthcare costs for individuals. And because of all the various social issues relating to healthcare use, access, and treatment, black people are often getting healthcare treatments less often than they should be, putting their costs on par with those of white people. I will say all of this is actually really fascinating stuff and there’s a ton of literature out there, so I really highly recommend looking into it, and it’s not supposed to be alarmist at all.
It’s a bit of a cautionary tale, right? Just good to take into consideration the impact of bias when building out AI. And with that, let’s move on to data governance. Data governance is how an organization manages the data in its system. It’s crucial for working within the organization’s AI governance framework, and it focuses on availability, usability, integrity, and security. Data availability refers to how accessible the data is to the users who need to consume it. Not everyone in an organization needs to have access to all data, so part of availability is determining who can see what. For example, an engineer at an e-commerce company may not need to know the purchase history of a customer, but they probably need the event logs in case they need to debug an issue.
On the flip side, a customer support agent at the same company probably doesn’t need that customer’s event logs, but they might need their purchase history to help with support inquiries. Data usability ensures the data that users have access to is clearly structured, queryable, and easy to use. Users at an organization can easily waste a ton of time trying to wrangle data that’s not already usable, so good data governance can help to avoid these headaches. Data integrity ensures the data maintains its structural qualities and completeness across its life cycle. As it’s transferred between systems or viewed by different users in different contexts, these aspects need to remain the same.
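The availability principle above is often enforced as role-based access rules. The roles and data categories below are hypothetical, matching the engineer and support-agent examples just mentioned.

```python
# A hypothetical role-based access sketch: each role maps to the data
# categories it is allowed to read.
ROLE_PERMISSIONS = {
    "engineer": {"event_logs"},
    "support_agent": {"purchase_history"},
    "data_scientist": {"event_logs", "purchase_history"},
}

def can_access(role: str, data_category: str) -> bool:
    """Return True if the role is allowed to read the data category."""
    return data_category in ROLE_PERMISSIONS.get(role, set())

print(can_access("engineer", "event_logs"))        # True
print(can_access("engineer", "purchase_history"))  # False
```

Unknown roles get no access by default, which is the safer failure mode for availability rules.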
This ties pretty directly into data security, which ensures that data is protected from corruption or unauthorized modification. There’s a whole world of thinking around how to keep data secure, from ensuring your code has no vulnerabilities, to preventing users from taking advantage of features in the product, to securing any external API endpoints. Security may mean that data is inaccessible to external parties, but as I just went over, it can also be applied internally. Especially at large organizations, there will generally be divisions among who has access to what data. Now, we have a few case studies of when security has unfortunately failed. Unless you’ve taken very extreme measures to stay offline, your information is most certainly stored in some company’s database somewhere, and probably a lot of companies, quite frankly.
The majority of the time, this really isn’t an issue because organizations tend to work pretty hard to keep their users’ data secure, but it does seem that more and more, we are hearing of pretty massive data breaches. There is actually a really great visualization on Information Is Beautiful that shows data breaches since 2004 in which more than 30,000 records were lost. Once you get to about 2013, it really starts exploding. This actually makes sense because people have become more online in the last decade. So it’s not surprising, but it really underscores the need for greater data security. With that, we’ll go into these case studies. So in 2017, Equifax, which is one of the three main consumer credit reporting agencies in the US, reported that they suffered a data breach affecting 143 million of their American users.
The data lost included names, addresses, and social security numbers, among other records. That number alone is pretty staggering, as it is nearly half of the United States population, but what made this breach so much worse is the type of data that was accessed. Equifax was sued by some local governments, and investigated by state and federal governments, and there have been reforms proposed to give users greater control over their credit data, along with more transparency into how their scores are calculated. So here, we’re looking at both security and transparency. In 2019, some unprotected databases containing 419 million records of Facebook users were discovered online.
These databases were not password protected, so they were available to anyone with an internet connection who could find them. The records contained each user’s public account ID, along with the phone number that was listed on the account, which put all of these millions of users at risk of getting even more spam calls than I’m sure they probably already get, not to mention opening up the possibility of SIM swapping attacks, which is when an attacker tricks a cell carrier into basically switching someone else’s phone number onto their own SIM card. So then this routes all calls and texts to this new device, meaning any accounts using two factor authentication with that phone number are now compromised. Even if a phone number leak doesn’t seem inherently dangerous, it can lead to compromises of a person’s online accounts.
In 2018, Aadhaar, which is India’s national ID database, was found to have a security hole that compromised the identities of all 1.1 billion registered Indian citizens. This database includes citizens’ unique identity numbers, along with certain bank and utility services information. This is a government-controlled database that the vast majority of the country uses. It’s not mandatory to register, but without doing so, one cannot access government services. So it is functionally mandatory for most people to register. The Equifax data breach was due to a known vulnerability in their web application framework that they failed to patch, despite being notified.
The Facebook breach was a few unsecured databases with scraped information. The Aadhaar breach was due to an API endpoint that was both unsecured and had no rate limiting in place, so an attacker could potentially continuously hit the endpoint to retrieve all available information. Each one of these breaches could have been avoided by having proper data security controls in place. It’s not something to be taken lightly, but it’s also not something that’s unavoidable. Okay. Now that we have a general understanding of data and AI governance, let’s talk about where to use AI. The use of AI has been expanding dramatically over the past several years, and of course, for good reason.
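The rate limiting the Aadhaar endpoint lacked can be as simple as a per-client request budget within a sliding time window. This is a minimal in-memory sketch; production systems typically use shared stores and more efficient algorithms.

```python
# A minimal sliding-window rate limiter: each client may make at most
# max_requests calls per window. Class and variable names are illustrative.
import time

class RateLimiter:
    def __init__(self, max_requests, window_seconds):
        self.max_requests = max_requests
        self.window = window_seconds
        self.requests = {}  # client_id -> timestamps of recent requests

    def allow(self, client_id, now=None):
        """Return True if the client is still under its request budget."""
        now = time.monotonic() if now is None else now
        recent = [t for t in self.requests.get(client_id, [])
                  if now - t < self.window]
        allowed = len(recent) < self.max_requests
        if allowed:
            recent.append(now)
        self.requests[client_id] = recent
        return allowed

limiter = RateLimiter(max_requests=3, window_seconds=60)
print([limiter.allow("attacker", now=i) for i in range(5)])
# [True, True, True, False, False]
```

With a control like this, an attacker cannot enumerate the whole database by hammering the endpoint, since requests beyond the budget are rejected until the window rolls over.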
It allows organizations to scale quickly and efficiently. It can also provide data quickly, improve user experiences, or even save lives. You may interact with AI in your day-to-day life in more ways than you’re fully aware. There are the very clear examples, such as when you ask Alexa to play a song, or you send a text to a friend simply by dictating it to your phone, both of which are examples of natural language processing. But there are also the little things, like when you start typing something into Google and it seems to know exactly what you’re looking for, or when you unlock your phone with your face. All these little things are just the daily benefits of AI. In thinking of opportunities for AI, these are just a few areas where it can provide massive benefits.
So 24/7 chatbots to assist with customer support while the workforce is offline. AI can use natural language processing to assist with clear-cut questions and requests, and down the road, more complex interactions as well. Product recommendations, so going beyond the basic analytics that show that customers who bought this also bought that. AI can actually be used to identify products similar to one that a shopper is viewing or has already viewed, sometimes by comparing textual product details or even using computer vision to identify key characteristics of the product. I’ve actually worked with a customer who was training a computer vision model to identify key parts of clothing. So for the service that they provide, most of their products are user-generated, and a lot of the information accompanying those products is incorrect.
In order to better recommend other products to someone who’s viewing a certain black dress, for example, they might pick out the length of the sleeves, or if there’s a pattern, or what style of dress it is to surface other similar products. This is really different from just using analytics to look at items that people often buy together, because in this case, it’s not necessarily about upselling, it’s about making sure that the user finds what they’re looking for before they leave your site. Businesses can also use AI to scale their internal processes.
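One simple way to rank products by extracted attributes, rather than by purchase co-occurrence, is set similarity over those attributes. The attribute sets below are hypothetical stand-ins for what a computer vision model might extract.

```python
# A hypothetical attribute-similarity sketch: rank catalog items by how
# much their extracted attributes overlap with the item being viewed.
def jaccard(a: set, b: set) -> float:
    """Jaccard similarity between two attribute sets (0.0 to 1.0)."""
    return len(a & b) / len(a | b) if a | b else 0.0

viewed = {"dress", "black", "long-sleeve", "midi"}
catalog = {
    "A": {"dress", "black", "long-sleeve", "maxi"},
    "B": {"dress", "red", "short-sleeve", "midi"},
    "C": {"jeans", "blue"},
}

ranked = sorted(catalog, key=lambda k: jaccard(viewed, catalog[k]),
                reverse=True)
print(ranked)  # ['A', 'B', 'C']
```

Because the ranking uses what the product actually looks like, it still works when the user-generated product descriptions are wrong, which was exactly the problem in the clothing example.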
So they can extract information from contracts to automatically populate a CRM using Optical Character Recognition. Another super common use for OCR is receipt processing. So companies from reimbursement apps, to banks, to personal shopping apps use OCR to capture key information from receipts so end users don’t have to enter it themselves. Moving on into the world of robotics, autonomous vehicles are obviously a hot topic, but there’s so much more potential than that. Robots can perform small movements with extreme precision and they also don’t get tired. So they’re really capable of handling long or repetitive surgical operations.
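After OCR turns a receipt image into text, pulling out a specific field like the total is often a pattern-matching step. This is a simplified sketch over made-up receipt text; real receipt formats vary widely and need much more robust parsing.

```python
# A hypothetical post-OCR extraction step: find the total on a receipt,
# skipping the subtotal line.
import re

def extract_total(ocr_text: str):
    """Return the amount on the line labeled TOTAL, or None if absent."""
    for line in ocr_text.splitlines():
        match = re.search(r"(?<!SUB)TOTAL\s*\$?(\d+\.\d{2})", line,
                          re.IGNORECASE)
        if match:
            return float(match.group(1))
    return None

receipt = """GROCERY MART
MILK        $3.49
BREAD       $2.99
SUBTOTAL    $6.48
TAX         $0.52
TOTAL       $7.00
"""
print(extract_total(receipt))  # 7.0
```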
Robots can also provide greater visibility into what’s going on inside a person being operated on, so the surgeon can be made rapidly aware of any problems during surgery. Robots can be assistants to our seasoned doctors with all their expertise, so we get the best of both worlds. Again, these are really just a few areas where AI can provide real value, whether it’s for efficiency, or capital, or safety, but there are many others. The key is to identify what products or processes in your organization can benefit from an AI implementation. Now, what to consider for the training data pipeline and the maintenance of your AI implementation.
Of utmost importance here is the data used to train the models that power your AI. There’s a lot that goes into the training data pipeline, but I’ll go over data acquisition, data annotation, auditing, and then updating the models. The first piece of the training data pipeline is getting the data. An organization may already collect the data that it needs to get started and it’s just a matter of organizing it for the right purpose. That’s not always the case, though, and the data may need to be obtained externally. There are tons of open source data sets out there that are great for getting started with machine learning, but very often, they’re not particularly nuanced, they’re more for general use.
A common approach is to work with a third party vendor to generate data. For example, if you’re training a voice commanded virtual assistant, you need a ton of data to train it to understand human speech. There are companies, such as Appen, that work with people around the world to generate voice utterances based on particular prompts. This is great for getting a broad representation of actual human language. Once you have the data you need, it does need to be annotated. Using that same voice utterance example, this part of the annotation may be transcribing the audio, or it could be identifying different parts of speech, whatever is needed to train the specific model.
Having annotated data is critical for supervised machine learning. You should also frequently audit the data that you’re using to train your model. This is part of understanding the data and providing transparency. If your annotations are low quality, your model and AI will also be low quality. It is, as they say, garbage in, garbage out. This means your data needs to be well-rounded and account for edge cases to avoid overfitting the model. This also ties into keeping the model up to date. In general, you really shouldn’t use a static model, which is one that’s not retrained, but rather, you should use a dynamic model, which is frequently retrained and updated to reflect changes in real-world data.
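A dynamic model usually needs a trigger for retraining. One common approach, sketched here with hypothetical thresholds, is to monitor live accuracy against the baseline measured at deployment and flag the model when it drifts too far.

```python
# A hypothetical retraining trigger: flag the model when recent accuracy
# drops more than `tolerance` below its deployment baseline.
def needs_retraining(recent_accuracy: float,
                     baseline_accuracy: float,
                     tolerance: float = 0.05) -> bool:
    """Return True when live accuracy has drifted below the baseline."""
    return (baseline_accuracy - recent_accuracy) > tolerance

# Model shipped at 94% accuracy; real-world data has drifted.
print(needs_retraining(recent_accuracy=0.86, baseline_accuracy=0.94))  # True
print(needs_retraining(recent_accuracy=0.92, baseline_accuracy=0.94))  # False
```

The tolerance is a judgment call per application: a pneumonia detector might warrant a much tighter threshold than a product recommender.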
As long as you’re keeping in touch with your model performance, you’ll have a lot of opportunities to reassess your AI implementation. Something I’ve been excited about in my own work is a product we’re developing to render annotated data within our own SaaS platform. This will give data and machine learning scientists meaningful insight into the data being used to train their models, which will provide greater transparency and help them understand what their data looks like before it trains the model. All right, I’m going to highlight one example of successfully maintained AI and one of, unfortunately, a very public failure.
So first the good, which is Wellio, which is a platform that helps people plan and prepare at-home meals. They wanted to deploy AI that would help people learn how to cook healthy meals, how to plan and shop, and how to adapt mid-recipe if something goes awry. They had tons of data, but it was all unlabeled, so they worked with Appen to label the data. Taking it one step further, they built out a training data pipeline that sends new, unlabeled data through the Appen platform, then takes the labeled data and feeds it back into their models. They also use actual chefs in the annotation process so they can be confident that the new annotations are high quality.
Their models were trained, sent into the real world, and are constantly being retrained. This is a really great example of constantly updating AI with real-world data to keep it fresh and ensure long-term accuracy. On the flip side, a very public failure of AI was Microsoft’s Tay chatbot. A few years ago, this chatbot was released into the Twittersphere, having been trained on public data, along with some content they got from comedians. It was supposed to write like a teenage girl who is very much online, and at first, that was the case. But it was also learning from real-world Twitter data, which of course was unlabeled, and it rapidly devolved into a racist misogynist as a group of people targeted vitriolic tweets at the bot.
It was learning from real-world data, and quite frankly, its speech and its syntax did fit reasonably well into what you would expect from a teenager who’s tweeting, but it had no, or at least very few, built-in controls to ensure that it acted within any defined set of social rules. Here, we have not a problem of AI going stale, but rather one of AI running way out of bounds to the point that it had to be brought down almost immediately. Maintaining your AI means not only keeping it up to date with fresh data, but also ensuring that it continues to act appropriately in the real world.
If it starts drifting away from the intent with which it was built, it needs to be reined back in. Alternatively, you may actually find that your organization’s intent has changed and that needs to be applied to your AI. On that note, I conclude the session on AI and data governance. I do hope that you learned from the defined frameworks, the successes, as well as the cautionary tales and best practices outlined here. I hope this was valuable and I look forward to the Q&A session to come. Thanks.
Kirsten has been with Appen for over five years and has seen countless ways in which people succeed - and fail - when implementing AI in their business processes or products. As a Product Manager, she...