Using Machine Learning Algorithms to Construct All the Components of a Knowledge Graph

Our machine learning algorithms are the heart of our ability to deliver products at Reonomy. Our unique data asset is a knowledge graph that connects information on all commercial properties in the United States to the companies and people that own and work in those properties. This graph is built with models that perform the entity resolution defining the vertex types and the attributes on the vertices, and that create multiple edge types in the graph. Other similar data assets focus on a significantly smaller subset of properties and/or are manually constructed. The volume of the data, as well as the required quality of the connections, restricted us to best-in-class tools, computational power, and technical stacks. It also provides an exciting opportunity to build something that is not yet widespread enough for there to be well-known formulas for how to build the data asset and construct deliverables. Having your models define the shape of the data asset used to build all of the company's products makes every choice critical, especially when you are a growing startup supporting about a dozen different models. I'll walk through examples of critical code design choices, cluster configuration choices, and algorithm choices that were necessary to successfully build the graph components. You'll walk away with key points to consider when implementing production-quality models embedded in high-volume data pipelines, as well as a logical framework for building knowledge graphs that can, for example, support a diverse set of property intelligence products.


 


Video Transcript

– Hello everyone. Thank you for joining my session. I’m Maureen Teyssier, Chief Data Scientist at Reonomy, which is a commercial real estate company, creating actionable property intelligence.

Using Machine Learning Algorithms to Construct All the Components of a Knowledge Graph

Today, I'll speak to you about knowledge graphs: why we use one and how to use machine learning algorithms to construct all of the components of a knowledge graph. A little bit about me: I was an academic for well over a decade. I used cosmological simulations to study the evolution of galaxies. On the right side, you can see a simulation of the filamentary structure of the universe, and on the left side, you can see a visualization of that same simulation where we resolved one of the nodes in that filamentary structure to show you a spiral galaxy.

I have been in the startup space for a little over five years, and I have worked on projects that have used knowledge graphs for about four years.

I have always straddled the line between data science and engineering. I love it. And the work that I'm currently doing in the commercial real estate space, which I'll speak about today, gives me the opportunity to continue to do just that.

This is what people think about when they think of commercial real estate. But there are 52 million parcels of land across the United States that cover much more than office space and retail space: they cover agricultural land, transit hubs, and vacant land. We track over a hundred attributes on each of these properties, as well as their financial histories. And then we track even more data if there are multiple structures on the land.

The information that we have about these structures has historically been stored on paper, in forms, in filing cabinets, and in books in over 3,000 local tax assessors' offices across the United States.

In this modern age, we have an electronic version of what is essentially just those same forms, built for the use case of collecting taxes.

How do humans interact with these physical structures?

But if you think about how most of the human beings in this country interact with all of these physical structures, they are not collecting taxes. They are building these structures or expanding them. They are working within them. They are doing maintenance on the buildings. They're improving their infrastructure. Or maybe they're using the value of the properties as one component of a much larger market, tracking that over time, and trying to make predictions about the future. Or possibly they own a portfolio of these structures that they use to generate income.

The information from these structures, the very detailed tax information that we have, really only covers one aspect of a picture that connects people, companies, the environment the structures sit within, demographic information for the population that lives there, the financial history of the structures, as well as the structural information itself. So we collect information from the tax assessors' offices, but we also have data sources that cover companies across the United States and all of their subsidiaries, DBAs, and LLCs that make up the company structures that are both the owners and the tenants of these properties. We also collect demographic information, location information, and information about the financial histories of these properties. Individually, each of these pieces of information is only one aspect of this complete picture. And none of them, even the ones that are focused on people, their contact information, and their demographics, covers all of the aspects of a person, because of their complex relationships with companies and the relationships they have, through those companies or directly, with properties. It's only when you combine all of these datasets that you're able to create actionable data. For example, you can reach out to a person who owns a portfolio of commercial properties about additional investments, or you can find the maintenance person for a company, collect their contact information, and submit a bid to do some construction or maintenance work on their building.

We've created this actionable data with a knowledge graph, which is an ontological structure that is filled with data.

An ontology is defined by the Oxford dictionary as a set of concepts and categories in a subject area or domain that shows their properties and the relations between them. What that means is that you are organizing the data in a way that reflects the real world, with real-world concepts.


Structures that can support Property Intelligence

So we capture this actionable data through a knowledge graph, an ontological structure filled with data. What that means in practice is that we're capturing the entire world of commercial real estate in the way that the human beings who interact with it understand it.

Structural Components

We're organizing the data to reflect the real world. So we don't have context-specific buyers, owners, and sellers. We have a company that is a tenant. We have a person that is a buyer and also a seller. It is recognizing the real-world concepts, and by approaching it that way, you can build much more flexible products, and your data is holistically aggregated onto the things that are the fundamental components of your data structures. So here is a cartoon of our knowledge graph, a very simplified version. You can see the connections between the different entity types and the different edge types within our graph. You can see that there's often a hierarchical structure for companies. And in fact, we capture ownership which is often hidden, entirely legally, for strategic and sales purposes.

Looking a little bit more closely at the structural components of the knowledge graph.

Structural Components

You have your entities. All of the nodes are different entities with different types. Looking at the entity type person: each of these nodes, these entities, is going to have a unique ID, and they all have attributes that live on the entity. The attributes for a person would be first name, last name, middle name, gender, age, an income bracket, et cetera. You also have provenance that lives on the entity, so that you know where the attribute information came from, and your confidence level for the attribute information for each entity.

We also have edges. Here I'm showing an edge type that is a manager edge type. Edges are structurally very simple. They only require a type, a from ID, and a to ID. Other components that could make up the structure of the edges are optional, based on your product requirements.

So your edges could have a strength. They could have a directionality, but it's not a requirement. The power of using a knowledge graph, of using these structural components, is that you're capturing the entity in this very general way that I described a little bit earlier. So instead of maintaining code that is specific to owners, code that is specific to secretaries, and code that is specific to signatories, you're maintaining pipelines and code that are specific to one structure: a person structure. The contextual information is contained on the edges. And the edge structure in and of itself is also very simple; you try to maintain one edge structure for the entire graph. So you're actually decreasing, very dramatically, the amount of code that you generally see in these types of pipelines by using this graph structure.

It’s enormously powerful and allows you to build much more quickly than you otherwise would.
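To make that concrete, here is a minimal sketch of what these generic structural components might look like in Scala. The field names (entityType, provenance, strength, and so on) are illustrative assumptions, not Reonomy's actual schema.

```scala
// Minimal sketch of generic graph components; the field names are
// illustrative assumptions, not Reonomy's actual schema.
case class Entity(
  id: String,                        // unique ID produced by entity resolution
  entityType: String,                // e.g. "person", "company", "property"
  attributes: Map[String, String],   // e.g. "firstName" -> "Jenny", "age" -> "34"
  provenance: Map[String, String],   // attribute name -> source it came from
  confidence: Map[String, Double]    // attribute name -> confidence in that value
)

case class Edge(
  edgeType: String,                  // e.g. "owner", "tenant", "manager"
  fromId: String,                    // required
  toId: String,                      // required
  strength: Option[Double] = None,   // optional: model confidence for the edge
  directed: Boolean = false          // optional: directionality if the product needs it
)

// One person entity is reused in every context (owner, buyer, seller, ...);
// the context lives on the edges rather than in separate per-role structures.
val jenny = Entity(
  id         = "person-123",
  entityType = "person",
  attributes = Map("firstName" -> "Jenny", "lastName" -> "Smith"),
  provenance = Map("firstName" -> "source-A", "lastName" -> "source-A"),
  confidence = Map("firstName" -> 0.98, "lastName" -> 0.98)
)
val owns = Edge(edgeType = "owner", fromId = "person-123", toId = "property-456", strength = Some(0.91))
```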

Before building these components, it is worth taking a step back and looking at the current status of this field. There have been publications on the benefit you get when you combine data sources outside of the context-specific view they were delivered in since the mid-1800s. There was a bit of a resurgence in the idea of knowledge graphs in the '60s, led by Newcombe and Acheson, but you don't really see strong growth of the field until the mid-'90s. I pulled data from the arXiv API to generate this figure, where you see the number of publications per year, and you see how quickly the publication rate is growing.

Construction Methodologies

So this is excellent news for academics. There are a lot of things to discover.

It’s a phenomenal time to be doing research into knowledge graphs.

For industry, it’s also a very exciting time. It means that the tools and the methodologies are growing very quickly. They’re becoming more and more scalable and able to handle larger volumes of data.

But what it also means for industry is that this is a very nascent field. There isn't a textbook that you can go to that will tell you how to use modern engineering tools to build a knowledge graph. Essentially, we go to newspapers, we read white papers, we read academic papers, we read anything that we can get our hands on, and we do tests, we build prototypes, and then we do full implementations. That's how we've collected the knowledge that I'm speaking to you about today.

So let’s talk about some of the modern ways that you can build these components.

One of the wonderful and terrifying things about knowledge graphs is that you will have many models living within your engineering pipeline. There's no separation from your pipes. You don't have a situation where you're delivering data into a data lake and then your data scientists operate on the data lake. Your machine learning algorithms, your AI, your statistical methodologies live within your engineering pipelines. They define the shape of the lake. So I've separated the construction methodologies into two different sections in order to emphasize the ramifications for the engineering. Because although, when I speak about these methodologies, they might look like different sides of the same coin, what you'll see is that they have very different implications for what the pipelines and the architecture look like.

So let's speak about the maximalistic, destructive methodology that you can use. If you choose to go this direction, you'll be creating your edges first. What I mean by that is you will be creating as many edges as you can using explicitly defined information in the source data. So if you have primary and foreign keys, go ahead and use them; you will need them and then some. You will also have information that is implicitly defined in the source data: just the fact that information is present together in the same row will mean that there should be an edge between the entities that live together in that row. So you have these edges, you throw them up into a graph structure, and then you start examining the edges and the entities that you have, which are unresolved entities.

And you go through the operation that's called edge contraction in order to create entities. I have two graph fragments here: Jenny Smith is represented twice. Both Jennys have the phone number 867-5309, and luckily a phone number is a very unique identifier. So you can use the fact that it is a unique identifier to contract the phone number, and then traverse to the Jenny Smith entity and contract that entity. You end up with a structure that has some aggregated attribute information: we have her age and her location, and we have the fact that there are three properties connected to her, whereas previously we only had part of the complete picture. So this is the absolute simplest case.
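To make the contraction concrete, here is a small, purely in-memory sketch of that simplest case. The node and edge classes, IDs, and attributes are all made up, and a production version would operate on graph fragments at scale rather than local collections.

```scala
// Purely in-memory sketch of edge contraction via a shared unique identifier.
// All IDs and attributes are made up; prop1/prop2/prop3 are property node IDs
// whose node records are omitted for brevity.
case class Node(id: String, nodeType: String, attrs: Map[String, String])
case class Rel(fromId: String, toId: String, relType: String)

val nodes = Seq(
  Node("p1", "person", Map("name" -> "Jenny Smith", "age" -> "34")),
  Node("p2", "person", Map("name" -> "Jenny Smith", "location" -> "New York")),
  Node("ph1", "phone", Map("number" -> "867-5309"))
)
val rels = Seq(
  Rel("p1", "ph1", "has_phone"),
  Rel("p2", "ph1", "has_phone"),
  Rel("p1", "prop1", "owns"),
  Rel("p2", "prop2", "owns"),
  Rel("p2", "prop3", "owns")
)

// Persons attached to the same phone node are treated as the same entity.
val personsByPhone: Map[String, Seq[String]] = rels
  .filter(_.relType == "has_phone")
  .groupBy(_.toId)
  .map { case (phoneId, rs) => phoneId -> rs.map(_.fromId) }

// Pick a canonical ID for each group of duplicate persons.
val canonical: Map[String, String] =
  personsByPhone.values.flatMap(ids => ids.map(_ -> ids.head)).toMap

// Contract the duplicates into one node, merging their attributes.
val mergedPersons = nodes
  .filter(_.nodeType == "person")
  .groupBy(n => canonical.getOrElse(n.id, n.id))
  .map { case (cid, ns) => Node(cid, "person", ns.map(_.attrs).reduce(_ ++ _)) }

// Re-point every edge at the canonical IDs; exact duplicates collapse away,
// leaving one Jenny Smith connected to all three properties.
val contractedRels = rels
  .map(r => r.copy(
    fromId = canonical.getOrElse(r.fromId, r.fromId),
    toId   = canonical.getOrElse(r.toId, r.toId)))
  .distinct
```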

And one of the benefits of exploding your data into these graph structures right off the bat, well, after you've done some standardization and some cleaning, is that you can try to capture as many of these simple cases as you can. But the reality is that you will go very quickly from simple cases to something that requires a lot of machine learning. Potentially you're using a deep learning model that takes the structure of the graph as well as the attribute information and the entity types, and makes a decision about whether to do a contraction or not.

This traversal-heavy methodology has implications for the runtime, and a lot more; that'll come up again later. But let's go to the other avenue, the minimalistic, constructive way that you can approach building a knowledge graph.

With this approach, you're taking an entities-first approach, probably. To illustrate a simple example, let's say that you have person information in one data source, our orange data source, and you want to compare all of the records in that orange data source against the person information that lives within all the records in the yellow data source. You would be doing a huge volume of comparisons, and you have to mitigate that; that's problematic. The way that's done is with an adaptive blocking methodology. You take this huge volume of comparisons and you use a fast, high-recall method for eliminating a lot of the comparisons that would be completely unreasonable to do.

So there are a lot of methodologies that you can use for adaptive blocking.

The simplest and most common one is to use LSH: you can do some joins with a threshold, and then whittle down the number of comparisons that you actually have to explode into a much more thorough feature set and feed into your high-accuracy model in the second stage.
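As a hedged illustration of that first stage, here is roughly what LSH blocking can look like with Spark ML's MinHashLSH and an approximate similarity join. The column names, the token featurization, and the 0.6 distance threshold are illustrative choices, not our production settings.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.col
import org.apache.spark.ml.feature.{HashingTF, MinHashLSH, RegexTokenizer}

val spark = SparkSession.builder.appName("blocking-sketch").getOrCreate()
import spark.implicits._

// Toy stand-ins for the "orange" and "yellow" person sources.
val orange = Seq(("o1", "jenny smith new york"), ("o2", "john smith chicago")).toDF("id", "personText")
val yellow = Seq(("y1", "jennifer smith ny"), ("y2", "jon smyth chicago il")).toDF("id", "personText")

// Tokenize and turn each record into a sparse binary vector for MinHash.
val tokenizer = new RegexTokenizer().setInputCol("personText").setOutputCol("tokens")
val tf = new HashingTF().setInputCol("tokens").setOutputCol("features")
  .setNumFeatures(1 << 18).setBinary(true)
def featurize(df: DataFrame): DataFrame = tf.transform(tokenizer.transform(df))

val orangeFeat = featurize(orange)
val yellowFeat = featurize(yellow)

// Fit the LSH model and keep only candidate pairs within a Jaccard-distance
// threshold, instead of scoring the full cross product of the two sources.
val lsh = new MinHashLSH().setInputCol("features").setOutputCol("hashes").setNumHashTables(5)
val model = lsh.fit(orangeFeat)

val candidates = model
  .approxSimilarityJoin(orangeFeat, yellowFeat, 0.6, "jaccardDistance")
  .select(
    col("datasetA.id").as("orangeId"),
    col("datasetB.id").as("yellowId"),
    col("jaccardDistance"))
// `candidates` is the whittled-down set of pairs that goes on to the
// expensive, high-accuracy model in the second stage.
```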

And this high-accuracy model is making the decision: do these pieces of information belong on the same entity or not? So you're using a clustering algorithm, a classification algorithm, or a probabilistic methodology.
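For the classification route, a second-stage pairwise model might look something like the sketch below, where each row is a candidate pair that survived blocking. The features and training labels are placeholders, not real data.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.ml.classification.GBTClassifier

val spark = SparkSession.builder.appName("match-classifier-sketch").getOrCreate()
import spark.implicits._

// Placeholder pair features and labels: 1.0 means "same entity".
val labeledPairs = Seq(
  (0.95, 1.0, 0.80, 1.0),
  (0.40, 0.0, 0.10, 0.0),
  (0.88, 0.0, 0.72, 1.0),
  (0.30, 0.0, 0.55, 0.0)
).toDF("nameSim", "phoneMatch", "addressSim", "label")

val assembler = new VectorAssembler()
  .setInputCols(Array("nameSim", "phoneMatch", "addressSim"))
  .setOutputCol("features")

// A gradient-boosted tree classifier answering: do these two records
// belong on the same entity or not?
val gbt = new GBTClassifier().setLabelCol("label").setFeaturesCol("features").setMaxIter(20)
val matchModel = gbt.fit(assembler.transform(labeledPairs))

// At scoring time, `prediction` (and `probability`) drive the merge decision
// for each candidate pair that survived blocking.
val scored = matchModel.transform(assembler.transform(labeledPairs))
```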

Now, although these probabilistic methodologies have been shown to be very successful in academic papers, they usually don't scale well because they involve the calculation of priors. It's only recently that work has been done that makes an approximate calculation for the prior and then iterates through this two-step methodology multiple times.

You can take a look at some open source code that has been made available by an academic group; the project is called dblink. I highly recommend it. It's written in Spark Scala, so if you're using a Scala pipeline, you should be able to play with it right away.

Adaptive Blocking and Skew

So after we've created our entities, we've resolved our entities, now you have something: you have unique IDs that you can attach your edges to. To create these edges, you grab everything that is available from the source data, of course, but you will end up doing the same two-step process, adaptive blocking and then a high-accuracy model, to capture your edges the same way that you did to create resolved entities.

So instead of answering the question "does all of this information belong on the same entity or not?", you're just asking a slightly different question: "is this entity related to this other entity or not?"

And there's really no other way to build those edges, using this minimalistic, constructive methodology, if you are trying to make connections between completely different datasets.

Okay.

One thing that is a byproduct of the more minimalistic, constructive way of building a knowledge graph is that you will be generating a lot of skew if you are doing adaptive blocking well. If your blocking methodology is working really well, then you will be creating skew that mimics the underlying distribution of the data.

And to illustrate that a little bit consider the prevalence of John Smiths in the United States, right?

Construction Methodologies

There are about 50,000 John Smiths in the United States. There's one Jamie Samoa. Naively, you could say, "Okay, I'm going to have blocks that are 50,000 times larger than other blocks." The reality is a little bit worse: it's not likely that your data sources have resolved the John Smiths as well as they should, just because there are so many of them. So you're dealing with a ratio that's a little bit over 50,000 to one.

But in your distributed pipeline, you can mitigate the skew that's created by adaptive blocking methodologies by changing your partitioning strategy, increasing the size of the partitions, and increasing the instance sizes for the machines that make up your cluster, in order to prevent overflow from the super-large blocks and make sure that your job is still chugging along at a good clip.
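A hedged sketch of what those mitigations can look like in a Spark pipeline follows. The table name, the salt count of 32, and the partition setting are all illustrative assumptions; the right values depend on your data and cluster.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, concat_ws, floor, rand}

val spark = SparkSession.builder.appName("skew-sketch").getOrCreate()

// 1. Fewer, larger partitions (paired with bigger executors) so even an
//    oversized block like "john smith" fits without spilling everywhere.
spark.conf.set("spark.sql.shuffle.partitions", "2000")
// On Spark 3+, adaptive execution can also split skewed partitions for joins:
// spark.conf.set("spark.sql.adaptive.skewJoin.enabled", "true")

// 2. Salt the blocking key so one huge block is spread across many partitions;
//    pairwise comparisons then run within each salted sub-block.
//    "blocked_candidates" and its columns are hypothetical.
val salted = spark.table("blocked_candidates")
  .withColumn("salt", floor(rand(42) * 32))   // 32 sub-blocks per key, chosen arbitrarily
  .withColumn("saltedKey", concat_ws("#", col("blockKey"), col("salt").cast("string")))
  .repartition(col("saltedKey"))
```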

So to sum up the two approaches to constructing your knowledge graph: the maximalistic, destructive approach generates your edges first. It throws all of the data into a graph structure early, and to do that, you're exploding your data. So you're generating edges and nodes that won't end up being there when you have your completed knowledge graph, and this in and of itself could be a big problem for you if you have very large volumes of data.

And if you're intent on throwing these graph structures into your graph database early, you're going to have to load all of these additional edges and nodes into Neo4j or AWS Neptune, and that's going to increase your load time. It's also a traversal-heavy methodology, and, as anybody who has worked with a graph knows, traversing graphs is slow. The greater the traversal, the slower your query, and the slower you'll be resolving your entities.

However, this approach does give you maximal edge creation. You are going to have the benefit of the existence of all of these edges, regardless of the source, and potentially you will create edges that you didn't anticipate using for the product but that will be beneficial for doing resolution. So there could be unexpected advantages to taking this approach.

For the minimalistic, constructive approach, you're creating your entities first. It's much more compact, so if you have larger volumes of data, you're not loading things that you're going to destroy into any kind of graph structure, or saving those things; you're eliminating them from your pipeline as early as possible. You will be creating a lot of skew that you will have to compensate for.

You will be creating only what you think you need, right? Which could be great, or it could cause you to miss some of the relationships that are present in the data.

This minimalistic, constructive approach also tends to give a little more flexibility in attribute choice. When you take the other approach, you want to explode the data as much as possible, so that you are able to take as simple an approach to doing the entity contractions as you can. You don't need to worry about that so much if you're taking the minimalistic, constructive approach; you can collapse the data as much as you need to in order to get the data through the pipelines. So, there are product ramifications for any of the choices that you make around the methodology that you choose and around the models that you choose. Really, in entity resolution itself, there are only two ways to get it wrong: you can under-aggregate your information for the entity, or you can over-aggregate. If you are under-aggregating, that means that you'll be missing some of the attribute information, but you'll also be missing some of the edges.

And this means that whatever products you're building off of the graph will be missing that information, or you'll have duplicates that show up in your UI or your API, where your customers are accessing the information.

If you're over-aggregating, you run the risk of a kind of cascade effect, where you're connecting large graph components that really should not be connected. That causes similar product problems, but in the other direction.

So it’s important to kind of think about what those ramifications are when you’re tuning your model and when you’re approaching how to build the pipelines.

Another thing to consider: if you are constructing the edges using a model, it's non-deterministic; the model is going to have a confidence level, or there's going to be a probability that comes from your statistical methodology, that you can translate into the strength of an edge.

So you can choose to record that information on the edges and surface multiple edges, or only surface the best edge that you have, not per edge type, but between two entities.
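As a sketch of that choice, here is one way to keep only the strongest edge between each pair of entities using a Spark window function. The edges table and its columns (fromId, toId, edgeType, strength) are assumptions for illustration.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, row_number}

val spark = SparkSession.builder.appName("best-edge-sketch").getOrCreate()

// Hypothetical edges table with fromId, toId, edgeType, and a model-derived strength.
val edges = spark.table("edges")

// Rank every edge between the same pair of entities by strength...
val byPair = Window.partitionBy(col("fromId"), col("toId")).orderBy(col("strength").desc)

// ...and keep only the strongest one.
val bestEdgePerPair = edges
  .withColumn("rank", row_number().over(byPair))
  .filter(col("rank") === 1)
  .drop("rank")
// Alternatively, skip the filter and surface every edge with its strength,
// letting the product decide what to display.
```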

And these choices also will affect the product, affect your ability to traverse the graph and affect how the information is displayed in your product.

Another tip is that it’s always beneficial to include your domain experts when you’re building your ontology.

Spark Evaluator & Habit

Even if they're completely non-technical, it will actually be easy for them to engage with the way that you're organizing the information, because ideally you're representing the world in as true-to-life a way as possible.

Another point is that your graph is not going to directly support your product. You are not going to be querying the graph from the UI. You would be delivering the data through AWS Lambda or through an Elasticsearch index, in order to mitigate the issue people run into when there's an unexpected amount of traversal while collecting information from your graph.

And then a last point here is that the collaboration between data science and data engineering has to be on point. It will absolutely affect the way that you're able to deliver the data and build the graph. To elaborate on that a little bit more and provide an example:

I'd like to talk about the Spark evaluator and the way that we think about data. Data scientists handle data and think about data in a column-wise way.

We create distributions of data. We do feature comparisons. A lot of the handling we do is column-wise.

Well, the Spark evaluator is not column-wise.

It's row-wise.

And I'm so happy that we no longer have to think about load balancing and message passing and writing our own tools. We have Spark to do that for us, but we still have to think about how data is being moved around on distributed systems.

And we can't be fighting the Spark evaluator. When you have a habit of thinking about things in a column-wise way, you write code that is reflective of that thinking. That's something I see everywhere online, in all of the resources that we use to learn how our peers are writing code: I see UDFs everywhere, I see SQL queries everywhere, and that reflects this column-wise thinking, this column-wise habit. When you have your data scientists and your data engineers working together on a platform like Databricks, where they're not only collaborating on code but can also show each other big chunks of data, you will be able to overcome the habits on both sides that prevent really good, robust, healthy pipelines from being built. The code examples that you're looking at on the slide show a record-wise way to write features for a model using Scala's apply method, completely circumventing the need to use any of these UDFs.
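I can't reproduce the slide here, but the sketch below is in the same spirit: record-wise feature generation on a typed Dataset through a case class companion's apply method, with no UDFs or SQL strings. The record and feature names are made up for illustration.

```scala
import org.apache.spark.sql.{Dataset, SparkSession}

// Illustrative record and feature types; not the actual Reonomy code.
case class PairRecord(nameA: String, nameB: String, phoneA: String, phoneB: String)
case class PairFeatures(nameExactMatch: Double, phoneMatch: Double, nameLengthDiff: Double)

object PairFeatures {
  // Plain Scala: trivial to unit test, and mistakes fail at compile time
  // rather than at runtime inside a SQL string or UDF.
  def apply(r: PairRecord): PairFeatures = PairFeatures(
    nameExactMatch = if (r.nameA.equalsIgnoreCase(r.nameB)) 1.0 else 0.0,
    phoneMatch     = if (r.phoneA == r.phoneB) 1.0 else 0.0,
    nameLengthDiff = math.abs(r.nameA.length - r.nameB.length).toDouble
  )
}

val spark = SparkSession.builder.appName("features-sketch").getOrCreate()
import spark.implicits._

val pairs: Dataset[PairRecord] = Seq(
  PairRecord("Jenny Smith", "Jenny Smith", "867-5309", "867-5309"),
  PairRecord("John Smith", "Jon Smyth", "555-0100", "555-0199")
).toDS()

// map runs record-wise, which the Spark evaluator handles efficiently; the
// result is still a typed Dataset, and you can flip to a DataFrame afterwards
// for column-wise analysis of the generated features.
val features: Dataset[PairFeatures] = pairs.map(r => PairFeatures(r))
val featuresDF = features.toDF()
```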

Summary

This code is easy to unit test. It's easy to read.

It errors at compile time, which SQL queries don't.

And it's two, four, or six times faster, depending on your situation; it can be many, many times faster, because record-wise code that looks like this makes the Spark evaluator really happy. And it still creates objects and Datasets, or you can create a DataFrame if you want, in order to do a column-wise analysis after you've generated your features.

But I fully believe in bringing your data engineers and your data scientists as closely together as possible, especially when you have these complex pipelines where you have many, many models living within the guts of the pipes. They have to be performant, they have to be good, and it absolutely is possible to have your data scientists writing production-quality code. I know that because we're doing it. So I'd like to share this with you: it's just a snapshot from our UI. It's a property-centric snapshot, and it's not as snazzy as some of our predictive models, like our likely-to-sell model; it's built off of the graph for our beta portfolios project. But what it does show is a property in San Jose.

And with this property, we're able to bring together building and lot information, so environmental information as well as structural information, and bring it to bear on individual properties

in an easy-to-absorb way. We're able to surface ownership information: we have an ownership dropdown and a tenant dropdown, which is the same company information with different edge types. And because we have our property resolved, it has a unique ID, so it's very easy to attach the financial histories for sales and taxes to this property. So even though we have a knowledge graph supporting the guts, you don't see the graph; you just see clean, easy-to-use products.

Okay. So today I've spoken about why we would use a knowledge graph, what situations would call for it, and what the components of the knowledge graph are.

Structurally, you're decreasing the number of structural components you need to maintain, which is of great benefit. We talked about some of the modern ways to go about constructing large-scale knowledge graphs, and what the engineering and product considerations need to be during that construction.

I also spoke a little bit about the impact of cross functional teamwork and tools.

I hope that, in covering these topic areas today, I've created a strong enough framework that you will be able to contextualize the information and decisions that you may come across when you have the opportunity to build a knowledge graph of your own.


 
About Maureen Teyssier

Reonomy

Maureen is Chief Data Scientist at Reonomy, a property intelligence company which is transforming the world's largest asset class: commercial real estate. For about 20 years, Maureen has run algorithms and simulations on terabytes of data including: location, click, image, streaming and public data. Maureen drove technological advancements resulting in 500% year over year BtoB contract growth at Enigma, a data-as-a-service company, delivered models anticipating human behavior at Axon Vibe, and researched interactions between Dark Matter and Baryons at Rutgers. Maureen's Ph.D. is in Computational Astrophysics from Columbia University on simulating the cosmological evolution of galaxies.