Lower latency through the right data model with Azure Cosmos DB


(ambient music) Hello and welcome. Please enjoy this short Readiness video on Azure Cosmos DB. Good morning, Thomas, how are you doing today? Hi, Sanjay, I'm great. It's great to be here with you today. Of course, same here. So can you please tell us more about your role at Microsoft? Absolutely, I am a program manager on the Azure Cosmos DB team. Fantastic, so what are you going to talk about today? Yes, so in this session, I would like to cover some data modeling techniques that we usually apply when we design a data model for Cosmos DB. We are going to look at co-locating entities, embedding, denormalization, and, most importantly, we are going to look at the performance gains you can get when you apply such techniques. All right, so as you know, I am a relational database guy, I wrote SQL for years, and Cosmos DB is a non-relational database. So what does it mean in terms of data modeling? For me, it was all entity-relationship diagrams, entities, foreign keys… So what it certainly doesn't mean is that you don't need to care about data modeling. Proper data modeling is always critical when you want to achieve the best performance for any kind of database, and Cosmos DB is no exception. What's important to understand is that the modeling techniques and best practices you are going to apply when working with Cosmos DB are quite different from the best practices you used to follow when working with a relational database. So it's very important for our users coming from a relational background to really understand what it means to model data for Cosmos DB. All right, let's get started. Sure, so let's look at a simple example. Here I have a very basic relational diagram showing two entities. I have a customer on the left, and each customer can have multiple addresses on the right. That's your very typical one-to-many relationship. Of course.
Right. So let's head to the Azure portal, and let's see different ways to actually model that data. So here I have a first version where I have modeled my data… So what tool are we looking at? Yes, I'm sorry. This is the Azure portal, and I'm looking at the Data Explorer that is embedded in the portal. This lets me explore my Cosmos DB database directly from the portal. Okay, fantastic. As a first version, I have modeled my data over two different containers, and that's probably how someone coming from a relational background would have done it. Because you would have done it with two tables. Yes, right (chuckling). So you have a first container for customers, and as you would expect, each document here represents one customer. And I have a second container with addresses; of course, each document here represents one address, referencing the customer ID. Of course, one-to-many from customer to addresses. That's correct. Now if my requirement was to fetch one customer and all the corresponding addresses, I would need to issue two different operations, because we are a non-relational database. And because my data is spread across two different containers, I have to do two steps. Let's look at that. I'm going to use our brand new Jupyter notebook integration, which makes it very easy and convenient to explore data directly from the Azure portal. So first I'm going to retrieve the customer, and I'm going to do that through an operation that we call a point read. Basically, I'm going to pass the customer ID to Cosmos DB, and I'm going to get back the corresponding document. Then, once I have the customer, I'm going to fetch the corresponding addresses, and I need to do that as a second step by issuing a full query against the second container. So let's do that. As we can see, I get my customer, so that's my first document, and then I get one, two, three addresses that match. One customer, three addresses, and three address documents for that one customer. Absolutely, absolutely. So let's look at the latency, because this session is all about performance. The total response time here was about 12 milliseconds, which, honestly, is not that bad. It's not the kind of latency that we should be ashamed of, but we are Cosmos DB, we can do much better. All right!
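(For readers who want to follow along outside the portal, here is a minimal sketch of this two-step fetch using the azure-cosmos Python SDK. The account, database, and container names, and the assumption that the customers container is partitioned on id, are illustrative, not taken from the demo.)

```python
from azure.cosmos import CosmosClient

# Illustrative connection details; substitute your own account values.
client = CosmosClient("https://<account>.documents.azure.com:443/", credential="<key>")
db = client.get_database_client("mydb")

customers = db.get_container_client("customers")
addresses = db.get_container_client("addresses")

customer_id = "cust-1"

# Step 1: a point read -- the id plus partition key identify one document directly.
customer = customers.read_item(item=customer_id, partition_key=customer_id)

# Step 2: a full query against the second container for the matching addresses.
customer_addresses = list(addresses.query_items(
    query="SELECT * FROM a WHERE a.customerId = @id",
    parameters=[{"name": "@id", "value": customer_id}],
    enable_cross_partition_query=True,
))

print(customer["id"], len(customer_addresses))  # one customer, three addresses
```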
As a second version, what I have done in order to optimize my data model is work with just one single container. In that single container, I am storing both my customers, with type 'customer', and my addresses, with type 'address'. So here I am leveraging the schema-agnostic nature of Cosmos DB. We are schema-agnostic, which means that we don't have to predefine any schema before ingesting data. We'll just happily ingest any shape of JSON that you write to the database. The typical situations where you want to do that, where you want to mix different entity types in the same container, are either because those different entities share the same access patterns, which is the case in our example (whenever I fetch a customer, I also want to fetch the addresses), or because those different entities share the same partition key. Yep. So now that I have mixed both my customers and addresses in the same container, I can more easily fetch one customer and the corresponding addresses with just one query. I will just do a 'SELECT * FROM c' where the customer ID matches my customer ID. A little trick I've applied here to make sure that the customer is the first result coming back from the query: I am ordering the results by descending type. And because 'customer' starts with a C and 'address' starts with an A, the customer will be the first one to come back.
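(A minimal sketch of that query, continuing the code above; the mixed container name and the customerId and type property names are assumptions based on the documents described, including the assumption that both entity types carry a customerId property.)

```python
# Continuing the sketch above: one container now holds both entity types.
container = db.get_container_client("customer-data")

# Ordering by type descending puts the customer document first, because
# "customer" sorts after "address" alphabetically and we reverse the order.
results = list(container.query_items(
    query="SELECT * FROM c WHERE c.customerId = @id ORDER BY c.type DESC",
    parameters=[{"name": "@id", "value": customer_id}],
    enable_cross_partition_query=True,
))
# results[0] is the customer document; the rest are its addresses.
```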
So let's execute that. That's a nice trick, in a way! Yeah, yeah. (Laughing) Tricky. Absolutely, and as we can see here, I'm getting my results: the customer first, and the three addresses just after that. And that's interesting, because we usually say that with non-relational databases you cannot do joins, but look at me, I just did a join over customers and addresses. Certainly you cannot do joins across different containers, because we are non-relational, but if you need to do joins across different entity types, you just co-locate them in the same container, and you can do joins, as I just showed here. Let's look at the response time. This time we are down to nine milliseconds.
From 14? From 12. Yes, from 12 down to nine. Which is better, but not enough for me. So let's see how far we can push this thing. Now that we have mixed customers and addresses in the same container, the next step is to actually store each customer and its corresponding addresses in the same JSON document. And that's what we call embedding. Embedding is very suitable when you have a one-to-one or a one-to-few relationship between your entities. That's our case, right? We have a one-to-few relationship between a customer and the addresses: there is a limited number of addresses that relate to one customer, so it makes sense to put them together in the same document. It's also very suitable when those different entities are queried or updated together, which, once again, is our case here. So that's my v3. In my v3 I still have just one container, but now when we look at my documents, I can see that, starting from the original customer JSON document, I have added an addresses array directly in the document. This array holds the different addresses for each customer. Now the same operation I did before, fetching one customer and the corresponding addresses, can be done even more efficiently with just a point read. I just pass the customer ID, and what I get back is one document containing the customer data and the addresses. Little drum roll! What's the latency now? I'm guessing… whoa, that's… We are down to three milliseconds. I was guessing six milliseconds, but, yeah, that's good. Even better! Even better, this one. What we see here is that by following just a set of basic best practices, I have improved my latency by 75%, right down from 12 milliseconds to three milliseconds.
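(A sketch of what that point read looks like against the embedded v3 shape, with the same illustrative names as above.)

```python
# v3 document shape: the addresses are embedded in the customer document, roughly:
# {"id": "cust-1", "type": "customer", "name": "...",
#  "addresses": [{"street": "...", "city": "..."}, ...]}

# A single point read now returns the customer and its addresses together;
# no second query step is needed.
customer = container.read_item(item=customer_id, partition_key=customer_id)
for address in customer["addresses"]:
    print(address["city"])
```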
So I have a live question for you. Please! I'm putting you on the spot. What if you have a scenario where, instead of a limited number of addresses per customer, you have thousands of them on the many side? A customer that has many, even thousands, of something else. Absolutely. So if you're in a one-to-many situation, especially when that 'many' is unbounded, where there is no upper limit to how many addresses you may have, then you would keep things separated. You would keep customers and addresses in separate documents, just referencing each other. Okay, but still, we're efficient because of Cosmos DB's architecture, I'm sure. (laughter) So what's next, what are some other techniques? To illustrate another technique, I have another quick example to cover, also a basic one. What I have on the left-hand side is a product entity, and on the right-hand side I have product tags. But this time I have a many-to-many relationship between products and tags: each product can have multiple tags, and each tag can reference multiple products. So once again… The most tricky scenario, many-to-many? Not that tricky, actually, as we will see. Heading back to the Azure portal, let's see how we can model that. Well, I need a container for my tags, and that container will contain one document per tag. Here I have the 'Awesome' tag, here I have the 'Practical' tag, etc. And then for my products, what I have done as a first version is embed an array of tag IDs directly in my products. Okay. Those IDs are basically pointers, pointing at the actual tag data that sits in the other container. Now, if I need to fetch a list of products, how would I do that? Going back to my script, I will start by issuing a query that fetches my different products. That just returns the top 5 products from a specific category, but that wouldn't be enough, because what I would get back from this query is a set of products containing just tag IDs, and my end users don't care about tag IDs; what they want to see is the actual names of the tags. So on top of that, for each of the products that I have fetched, I will need to issue an additional query that fetches the name of each of the tags from the IDs. So let's do that. And indeed I get five tags… five products, sorry, and for each of them I've aggregated the tag names coming back from the additional queries. That was a lot of work just to get a small set of products, and as you can probably imagine, the response time will not be that great. More work, more… yeah. Yeah, that was 63 milliseconds. Again, it's not catastrophic, but for a very simple operation like that, we would expect this thing to be much quicker. The way to go in that situation, switching back to my slides, goes by the label of denormalization. Usually, the rule when you work with a non-relational database is that you look for ways to optimize your data model to make sure that all the data you need to serve is ready to be served. Yeah, okay. So in my example, whenever I return a product, I need to make sure that all the names of the tags are also embedded in the product, so when I read the product, everything is ready to be read. And that's my v2 here. In that version, when you look at the products, instead of storing an array of tag IDs, I have copied the actual tags, ID and name, directly into the products. That's what we call denormalization: we have duplicated data from the list of tags directly into the products. So the same data now lives in different places.
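(Roughly, the two product shapes being contrasted look like this; all field names and values here are hypothetical.)

```python
# v1: the product only references tag ids; reading it requires follow-up
# queries against the tags container to resolve each name (the slow path).
product_v1 = {
    "id": "prod-1",
    "category": "bikes",
    "tags": ["tag-1", "tag-2"],
}

# v2: the id *and* name of each tag are denormalized into the product,
# so a single read returns everything the UI needs to display.
product_v2 = {
    "id": "prod-1",
    "category": "bikes",
    "tags": [
        {"id": "tag-1", "name": "Awesome"},
        {"id": "tag-2", "name": "Practical"},
    ],
}
```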
Absolutely, absolutely. And that yields very nice benefits in terms of latency and overall performance, because now, going back to my notebook, when I want to fetch a list of products, I just have to issue a single query, and when I do, I get all the data I need in order to present the products to my end users: the product and the names of the tags. And as you can probably imagine, the latency is going to be much better in that case. 40 milliseconds? We're down to nine. Nine! Okay, cool. Very cool.
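(A minimal sketch of that single query, continuing the code above; the container name and category value are illustrative.)

```python
# With the tag names denormalized, one query returns display-ready products;
# no per-product follow-up queries are needed anymore.
products_container = db.get_container_client("products")

top_products = list(products_container.query_items(
    query="SELECT TOP 5 * FROM p WHERE p.category = @category",
    parameters=[{"name": "@category", "value": "bikes"}],
    enable_cross_partition_query=True,
))
for product in top_products:
    print(product["id"], [tag["name"] for tag in product["tags"]])
```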
All right, fantastic, any other closing thoughts? Something that maybe you haven't thought about: in that scenario, what happens if I rename a tag? Oh yes, that's true. This is going to work nicely until I rename a tag. Yes. If I don't do anything else, as soon as I rename a tag, I will get discrepancies between the actual tag list and the denormalized tag names I have put in my products. The good thing is that Cosmos DB has exactly the right feature to handle that: the Cosmos DB change feed. In our example, we would consume the change feed of the tags container, so every time there is an update in that container, it fires an Azure Function that propagates the rename to all the corresponding products. And that's exactly what's running on my database here; I will finish on that. Yes. Again, if I look at my items here, I have, for example, a product that has the tag 'Practical'. Yeah. So if I go to my tags and rename that tag, from 'Practical' to 'Very Practical'… Very Practical, okay (chuckling). I'm hitting save, and what's happening right now is that this fires an Azure Function, which picks up that change and propagates the rename to all corresponding products. And now if I go back to my products… by now it should already be done? Yes, the tag on the product has been renamed to 'Very Practical'. So I still get consistency across my data set.
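(A minimal sketch of the propagation Thomas describes, written as an Azure Function whose change feed trigger watches the tags container. The binding configuration, all names, and the partial-match ARRAY_CONTAINS query are assumptions about one way to implement it, not the demo's actual code.)

```python
# __init__.py for an Azure Function with a cosmosDBTrigger binding on the
# tags container (binding declared in function.json, not shown here).
import azure.functions as func
from azure.cosmos import CosmosClient

# Illustrative connection details; in practice these come from app settings.
client = CosmosClient("https://<account>.documents.azure.com:443/", credential="<key>")
products = client.get_database_client("mydb").get_container_client("products")

def main(documents: func.DocumentList) -> None:
    # Each document in the batch is a tag that was created or updated.
    for tag in documents:
        # Find every product holding a denormalized copy of this tag
        # (partial match on the embedded tag object's id).
        stale = products.query_items(
            query='SELECT * FROM p WHERE ARRAY_CONTAINS(p.tags, {"id": @id}, true)',
            parameters=[{"name": "@id", "value": tag["id"]}],
            enable_cross_partition_query=True,
        )
        # Rewrite the copied tag name and save each product back.
        for product in stale:
            for t in product["tags"]:
                if t["id"] == tag["id"]:
                    t["name"] = tag["name"]
            products.replace_item(item=product["id"], body=product)
```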
To close, some key takeaways. The first thing that I hope our viewers take back home is that when you're working with Cosmos DB, or any non-relational database, you first have to understand your data access patterns, and then you work backwards from those patterns to optimize your data model. In our example, we needed to return the customer and the addresses, and we optimized the data model just for that request, to make sure it was efficient. There are different ways to materialize relationships between entities: you can co-locate them in the same container, you can embed them in the same document, and you can also add levels of denormalization across your data sets. And what's good to remember is that when you apply denormalization, you can use Cosmos DB's change feed to make sure that your data remains consistent across your data sets. Fantastic. Thank you so much for this nice mix of demos and slides. I like that. My pleasure! Thank you, Sanjay. Thanks for watching this short Azure Readiness video.