Recently, there has been growing interest in applying AI in sectors that have traditionally been reluctant to adopt technology. The legal sector is one of them. Machine learning approaches are used to improve the work of entry-level lawyers. One application consists of extracting relevant information from the tons of documents that law firms possess for a case. In this talk, we are going to present a way to process unstructured data by means of Azure Cognitive Search and Databricks/MLflow in order to extract that information for the lawyers. Another application relies on a solution for a class action case in which the law firm needs to select lead cases in order to represent the whole set of claimants in court. Cases of this kind involve hundreds, or even thousands, of claimants. Different optimization approaches can be used for this problem. We are going to talk about the one that we followed and implemented in Databricks/MLflow. In summary, different uses of AI are presented to help the legal sector in its modernization.
– Hello, and welcome everybody to this talk about how artificial intelligence can help in the legal sector.
So here we are: my former partner Fernando Ortega Gallego, data engineer at Plain Concepts, and me, Eduardo Matallanas, former AI team lead at Plain Concepts UK, recently moved on to a new journey by joining Cabify in the last couple of months as a senior data scientist.
So you are listening to us today maybe because you are wondering how a human lawyer could be replaced by a robot or, even further, you are wondering if you are going to be processed by an artificial judge.
An artificial judge that would give its verdict based on the evidence. However, there are still too many implications in different fields, such as ethics, feasibility, etc., so for now we can only dream about it. Maybe you joined us thinking that this was clickbait, but it is not at all. Today we are going to focus on how artificial intelligence can help in the legal sector.
So we can take several advantages from using AI, like saving time or requiring less effort, which translates directly into savings or reduced costs. It can be used to reduce the effort of some tasks. In addition, it can be used to obtain better insights from the data and to evaluate a situation objectively through a function. These are some of the applications that we can expect from AI applied to the legal sector.
One of them could be electronic discovery, which relies on identifying and collecting useful information in legal cases; in addition, it is necessary to categorize the information to make it useful. The next application could be contract review, which consists of automating the review of contract clauses to validate them in a contract management workflow. The third one is document generation, which supports the automatic generation of legal documents from client data. And the last one is legal research, which helps law firms make decisions in legal cases, for instance in a class action. That is the application that we are going to present today, and we are going to focus on it for the rest of the talk.

So let's move now to this application, which is an example of how AI can be applied in the legal sector. Many of you may be wondering: what is a class action? One of the most famous class actions is the one related to the Volkswagen emissions scandal. A class action is a lawsuit in which a group of people is represented collectively by one of the parties. As you may know, this case was a big mess for countries all over the world, so there have been a lot of information and headlines around it.
In addition, lots of different law firms have been involved in building the class action in different parts of the world. To put it simply, and to put some numbers to this case, we can see that many people were harmed by this specific case, spread across different countries throughout the world.
My former office is based in the UK. That is why one of the biggest law firms, called Slater and Gordon, contacted Plain Concepts to help them with this case. We are going to present how we worked on this problem during the previous ten months.
As you know, Slater and Gordon opened a portal to collect all the possible information from the people affected by the emissions. However, they found that there was a lot of data coming from the different claimants: there were more than 100K of them. Not all of them were eligible, though. There were maybe 90K possible claimants whose forms were complete, but after applying some data processing, only 70,000 claims were eligible to be part of the class action, due to some restrictions based on the emissions scandal. So the problem we had at hand was how to reduce these 70,000 cases.
The idea is to reduce these 70,000 cases to only 200 cases, so they can be presented in court. These are called the lead cases. But what is a lead case?
Well, a lead case is a person who represents a class of the class action by means of his or her own case, which is selected depending on how interesting it is for the attorneys' strategy. In the past, law firms handled all this information manually: people were looking for those cases and filtering them by hand. So imagine the cost, time and human effort that was involved in the lead case selection of those class actions.
So our idea was mainly to reduce that problem, but there are also other problems that come with this kind of solution. There isn't a suitable system to process and store all the claimants in one place. This leads to another problem, which is related to analyzing the data and selecting those lead cases. It is also difficult to track the status of the different claimants, of the documents, or of the case. So I'm going to let my partner explain which solution we came up with, and he is going to explain everything for you. – Thanks Eduardo for giving us a broad introduction. Now I am going to explain our solution. We based it on two key concepts: a Datalake architecture and a genetic algorithm. A Datalake is a centralized repository that allows you to store all your data at any scale.
This Datalake architecture uses several Azure services, like Azure Storage to store the data of the Datalake, and Azure Data Factory to orchestrate the ingestion pipelines between the different data services.
Azure Databricks also hosts the different data processing steps, like the genetic algorithm for lead case selection.
Finally, the Datalake tables are accessible from several client applications, like the lead case selection one. In our case, this application reads the data of the claimants' cases and the results of our selection of lead cases. These results are then rendered in some PowerBI reports that are offered to the lawyers for their use.
The data managed in this architecture corresponds to raw input data and features for the genetic algorithm. Raw input data is the data provided by Slater and Gordon from their claimant portal: the legal documents of the vehicle acquisition, documents of the vehicle sale, if any, and the claimant's personal data. Obviously this data needed to be cleaned, validated and preprocessed before extracting the features for the genetic algorithm. Regarding the features, there are two types of them: required and extra. Required features are those that a lead case must fulfill in order to be accepted by the courts, for instance the vehicle make, or whether the claimant is an individual or a business. Extra features were requested by the lawyers to prioritize some cases over others, for instance if the claimant sold the vehicle for less than expected, or if he or she is concerned about the environment. Obviously we applied some NLP processing and rule-based knowledge extraction to obtain these features.
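As a rough sketch of this split between required and extra features, a per-claimant record could be modelled as follows. The field names here are our own illustration (the real schema is covered by the NDA mentioned later), but the distinction mirrors the one described above:

```python
# Illustrative sketch of a claimant record with required vs. extra features.
# Field names are hypothetical; they are not the real Slater and Gordon schema.
from dataclasses import dataclass, field

@dataclass
class Claimant:
    claimant_id: str
    # Required features: conditions a court-acceptable lead case must hold,
    # e.g. vehicle make, or individual vs. business claimant.
    required: dict = field(default_factory=dict)
    # Extra features: lawyer-requested priorities, e.g. sold the vehicle
    # below expectation, or concern about the environment.
    extra: dict = field(default_factory=dict)

def count_required(claimant: Claimant) -> int:
    """Number of required features this claimant fulfils."""
    return sum(1 for v in claimant.required.values() if v)

c = Claimant(
    claimant_id="C-001",
    required={"vehicle_make_vw": True, "is_individual": True},
    extra={"sold_below_expectation": False, "environment_concern": True},
)
print(count_required(c))  # 2
```

Keeping the two feature families separate makes the later fitness computation straightforward: required features gate court eligibility, extra features only reweight priorities.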
Lead case selection is a kind of optimization problem, so we can use genetic algorithms to solve it. As you probably know, genetic algorithms evolve a population of solutions and are well suited to this kind of optimization.
Now let's present our genetic algorithm approach. First, it's necessary to initialize the population. In this problem, a gene corresponds to a claimant, and a chromosome, a selection of 200 genes, to a candidate selection.
In general, the population must be as random as possible. In our case, we implemented a heuristic that takes samples from a clustering of the cases. The goal is to obtain an initial population of chromosomes that is as diverse as possible.
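A minimal sketch of that initialization heuristic, assuming the cluster labels have already been computed (e.g. by k-means over the case features); the round-robin sampling per cluster is our own simplification of "take samples from a clustering", not the exact implementation from the talk:

```python
# Sketch: diverse population initialization by sampling across clusters.
import random
from collections import defaultdict

def init_population(claimant_ids, clusters, pop_size, chrom_len, seed=0):
    """Build an initial population of chromosomes (lists of claimant ids)
    by drawing round-robin from precomputed clusters, so every chromosome
    mixes claimants from all clusters. `clusters` maps id -> cluster label."""
    assert len(claimant_ids) >= chrom_len
    rng = random.Random(seed)
    by_cluster = defaultdict(list)
    for cid in claimant_ids:
        by_cluster[clusters[cid]].append(cid)
    population = []
    for _ in range(pop_size):
        # Shuffle each cluster independently, then pick round-robin.
        buckets = [rng.sample(v, len(v)) for v in by_cluster.values()]
        chromosome, i = [], 0
        while len(chromosome) < chrom_len:
            bucket = buckets[i % len(buckets)]
            if bucket:
                chromosome.append(bucket.pop())
            i += 1
        population.append(chromosome)
    return population

# Toy demo: 20 claimants spread over 4 clusters, chromosomes of 8 genes.
ids = [f"C{i}" for i in range(20)]
clusters = {cid: i % 4 for i, cid in enumerate(ids)}
pop = init_population(ids, clusters, pop_size=5, chrom_len=8, seed=1)
```

By construction every chromosome here draws from every cluster, which is exactly the diversity goal described above.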
Next, we begin our loop by computing the fitness of the population. The fitness is defined as a multi-objective function: the number of required features fulfilled by the list of claimants, plus other objectives based on weighting the priorities between extra features. After that, the algorithm checks whether to stop or not. In case it continues, it's necessary to select which chromosomes pass to the next generation. In this case, tournament selection was the best approach: it consists of selecting the best chromosome out of small random groups.
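The two pieces just described can be sketched like this. We collapse the multi-objective fitness into a weighted sum for simplicity (a common reduction; the talk's exact objective combination and weights are not public), and `required_count` / `extra_score` are assumed per-claimant scores precomputed from the features:

```python
# Sketch: weighted multi-objective fitness plus tournament selection.
import random

def fitness(chromosome, required_count, extra_score, weights=(1.0, 0.5)):
    """Score a chromosome: weighted sum of required features fulfilled and
    the extra-feature priority scores of its claimants (a simplification
    of the talk's multi-objective function)."""
    req = sum(required_count[c] for c in chromosome)
    ext = sum(extra_score[c] for c in chromosome)
    return weights[0] * req + weights[1] * ext

def tournament_select(population, scores, k=3, rng=None):
    """Pick the best chromosome out of a small random group of size k."""
    rng = rng or random.Random()
    group = rng.sample(range(len(population)), k)
    best = max(group, key=lambda i: scores[i])
    return population[best]
```

Tournament selection keeps selection pressure moderate: weaker chromosomes can still survive if they never meet a stronger one in their group, which helps preserve diversity.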
Now it's time to apply the crossover operation to mix chromosomes. In this case, the two-point approach was the best one that we obtained in our experimentation. Next, a mutation probability is evaluated to apply random changes to a gene of each chromosome.
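A sketch of these two operators, assuming each chromosome must keep 200 unique claimants: plain two-point crossover can introduce duplicates when both parents share genes, so this version adds a repair step, which is our own assumption about how such a constraint would be handled:

```python
# Sketch: two-point crossover (with duplicate repair) and mutation.
import random

def two_point_crossover(p1, p2, rng):
    """Copy the middle slice of p2 into p1, then replace any duplicated
    genes with unused genes from p1, keeping the chromosome unique."""
    n = len(p1)
    i, j = sorted(rng.sample(range(n + 1), 2))
    child = p1[:i] + p2[i:j] + p1[j:]
    seen, dupes = set(), []
    for idx, g in enumerate(child):
        if g in seen:
            dupes.append(idx)
        else:
            seen.add(g)
    spare = [g for g in p1 if g not in seen]
    for idx, g in zip(dupes, spare):
        child[idx] = g
    return child

def mutate(chromosome, pool, p_mut, rng):
    """With probability p_mut, swap one random gene for a claimant
    currently outside the chromosome."""
    if rng.random() < p_mut:
        outside = [g for g in pool if g not in chromosome]
        if outside:
            chromosome[rng.randrange(len(chromosome))] = rng.choice(outside)
    return chromosome

rng = random.Random(42)
child = two_point_crossover(list(range(8)), list(range(4, 12)), rng)
mutated = mutate(list(child), list(range(12)), p_mut=1.0, rng=rng)
```

Swapping against unused claimants (rather than flipping arbitrary values) keeps every mutated chromosome a valid candidate selection.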
Obviously, now we need to compute the fitness again to evaluate whether this iteration of the loop improves the results. When the loop finishes, or the criteria to stop it are met, it returns the best chromosome of the current population.
But how many iterations were necessary? How did we obtain the best configuration?
We used MLflow to track the experiments, obviously to compare them easily and to visualize the optimization results.
As usual, it was necessary to run a grid search over several values of the genetic algorithm parameters. With the MLflow interface it was possible to filter and explore these results. We discarded the runs in which the number of mandatory features fulfilled was less than 21, which was the number of required features that the court needed for this case selection. And obviously we sorted them by a score metric that we applied and tracked in MLflow, which summarizes all the scores that we explained before. In this case, we can compare the top five results.
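The grid search and filtering workflow might look roughly like this. The parameter names, grid values and the dummy scoring are hypothetical placeholders; in the real pipeline each `run_ga` call would be wrapped in `mlflow.start_run()` with `mlflow.log_param(...)` / `mlflow.log_metric(...)` calls, and the filtering/sorting would be done in the MLflow UI or via its search API:

```python
# Sketch: grid search over GA parameters, then filter and rank runs
# the way the MLflow UI was used (>= 21 mandatory features, sort by score).
from itertools import product

# Hypothetical parameter grid; the real values from the talk are not public.
grid = {
    "population_size": [100, 200],
    "p_mutation": [0.01, 0.05, 0.1],
    "tournament_k": [3, 5],
}

def run_ga(params):
    """Placeholder for one full GA execution. Returns (score, mandatory
    features fulfilled). Deterministic dummy so the sketch is runnable."""
    score = params["population_size"] * params["p_mutation"] * params["tournament_k"]
    return score, 21

runs = []
for values in product(*grid.values()):
    params = dict(zip(grid.keys(), values))
    score, mandatory = run_ga(params)
    runs.append({"params": params, "score": score, "mandatory": mandatory})

# Keep only runs fulfilling all 21 mandatory features, sort by score,
# keep the top five for comparison.
eligible = [r for r in runs if r["mandatory"] >= 21]
top5 = sorted(eligible, key=lambda r: r["score"], reverse=True)[:5]
print(len(runs), len(top5))  # 12 5
```

The same filter-then-sort step is what MLflow's run table gives interactively, so tracking one summary score per run is what makes the top-five comparison cheap.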
This is a comparison of the top five results. All of them improve at more or less the same pace. They also reach a plateau, a maximum at which improvements are not possible anymore. This gave us the number of iterations that it is necessary to execute to attain the best result: in this case, almost 80 iterations.
This is a summary of the top five configurations, in which it's easy to realize that they share almost the same parameters, except for the mutation function applied to a chromosome and the probability of mutation. These changes make a clear difference in attaining the best result; as you can see here, this is clearly the value that makes the top configuration the best one.
Finally, the best configuration, the one that we saw in the previous slide and tracked in MLflow, gave us the selection of lead claimants. This selection is stored in a table of the Datalake, so that the lead case selection application can read this data.
Regarding the results, we signed a non-disclosure agreement, so we cannot share more than blurred PowerBI reports.
This report is a summary of the claimants' data. We can see here that there is a high volume of cases for Volkswagen cars, especially for the Volkswagen Golf model, and for other brands owned by Volkswagen.
This report shows a summary of the selected cases: the 200 cases from our automatic selection, plus one special case that Slater and Gordon requested us to add manually because the courts had a special interest in it. Except for the first feature, which was the typical mandatory feature for these courts, the other ones are somewhat well balanced. That was a requirement from Slater and Gordon: to obtain the most diverse group of lead cases possible.
This report shows a listing of the selected cases and their features. As you can see, it is organized by the kind of feature: mandatory, which we formally named required features, and additional or extra features. So it's easy to debug what kind of lead case group we selected automatically, which also provides us a kind of feedback to improve the iterations of the genetic algorithm.
Finally, it's also possible to see a detailed analysis of a case, with the specific values of each feature and other sections retrieved from several questionnaires. The lawyers revised them after our automatic selection, because after this phase they needed to narrow the 200 cases down to more or less 20 cases to finally take to the courts.
So now it's time for the conclusions.
We are very proud of having successfully faced a lead case selection problem.
We also solved it by means of a Datalake-based architecture.
And finally, it's clear that genetic algorithms aren't outdated; they can still be used in many problems, like our lead case selection.
So thank you all for attending our talk.
Data Engineer, PhD in Computer Science at University of Seville. His thesis is about NLP. Specifically, he developed several Deep Learning models for the identification of conditional clauses in user opinions about different products and services. In addition, he developed a scalable system based on microservices to integrate these models for the identification of conditional clauses with aspect-based sentiment analysis. Previously, he worked on different projects at the University of Seville and at Dinamic Area, a company that he cofounded, where he led the development of Opileak, a product based on the analysis of social network services.
AI Team Lead @ Plain Concepts UK. Data Engineer and PhD in Artificial Intelligence from Universidad Politécnica de Madrid. His thesis focuses on the application of collaborative artificial neural networks to optimize electrical demand. He has published several articles about applications of AI algorithms in international journals. He has spent more than 10 years studying and applying several AI algorithms and techniques in a wide range of areas like digital signals, robotics, image processing and natural language processing. Finally, he has also developed different applications with virtual assistants and chatbots to improve everyday workers' tasks.