All right, let's go ahead and get started. Yesterday we left off in lesson six, where we were going through Ingress, so we'll do a quick review here. Ingresses enable us to map our traffic to our backends based on rules that we define. We utilize an ingress controller as well as an Ingress manifest file. Ingress is being replaced by the Gateway API, which we're not going to use today because I don't believe you can install it fully on minikube. Then we talked about CNIs and how Cilium is a CNI that's useful on bare metal. It provides a lot of features out of the box and allows you to replace kube-proxy, and it includes IPAM support if you want to feed it your own pool of IP addresses. It does not, however, provide load balancers; you still need an external load balancer such as kube-vip.

All right, we're on slide 29 here. We created a Cilium install, deleted it, and saw what Cilium installs; Cilium is actually a good one to practice with. So now we're going to ensure a fresh minikube environment, and we're going to deploy Cilium in HA mode with six nodes. We want a fresh minikube environment first, so stop the existing cluster; I would probably do stop first. Now do an ls real quick. Good, it didn't recreate; it's the same one we left off with, so you don't have to type all of that back in again.

So how do you feel about your grasp of Deployments, StatefulSets, and ReplicaSets? Yeah, Kubernetes has such a steep learning curve starting out, so just knowing what to review can be a huge time saver. Oh, interesting. And that's to get your certification? Okay, interesting. I know some of the courses use a script to spin up K3s inside ECS, so they spin up a little K3s cluster.

All right, let's see if we can Ctrl-C out of that. Okay, so what does it say here? Let's try to scroll up to the top; there may be too many logs there. We can stop it and then start it again, and watch right when it starts to fail. A simple minikube start works, but if you do it in HA mode, we have to fix our cluster. Although I didn't expect it to actually try to join servers; usually it just fails out. Usually it gives you an idea of what happened and what went wrong, but this one actually tried to join the etcd cluster, so that's interesting. We'll see if we can fix this.

If you just start it back up, it might give us the error right at the very beginning, and then just do Ctrl-C. It looks like it did three nodes, so minikube-m03 started, then it started minikube-m04, and that's when the error happened. That one never started, so let's see what happened. Let's see if it'll tell us at m04, and whether there's an error before it tries to join. Okay, see where it failed before. Stopping node. Okay, now go ahead and Ctrl-C.
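For reference, the fresh-start sequence being attempted here would look roughly like the sketch below. The flag names (--ha, --nodes, --cni, --driver) are from memory of recent minikube releases and the driver choice is an assumption, so check minikube start --help on your own install before relying on them.

    # Tear down the old profile so we start clean
    minikube stop
    minikube delete --all

    # Start a six-node cluster in HA mode (multiple control planes) with Cilium as the CNI
    minikube start --ha --nodes 6 --cni cilium --driver docker

    # Watch the nodes come up
    kubectl get nodes -o wide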
All right, let's see what it says here. So we need to fix the max files issue, and there are three commands we need to put in there. This happens because minikube is using Docker, with Docker and Kubernetes inside that outer Docker, so it exceeds the normal Ubuntu/Debian max-files limits. Assuming the Ubuntu 24 instance you're on is the same as what I'm on, this should fix it. Let's cross our fingers that they didn't do anything else to this Ubuntu 24 instance that makes it different; everyone builds their own little distributions.

Cool, let's go ahead and check the pods. You'll notice there may be a few issues; why do you think that is? Yep, still settling out. Looks like we have everything running now.

So now we're going to check Cilium: type cilium status. So what are those warnings? Hubble is a telemetry relay, and it allows you to visualize the telemetry throughout your entire Kubernetes cluster, so it's very useful in troubleshooting. When you start to have networking issues in your cluster and you're using Cilium, you can enable the Hubble relay; it has a UI and gives you a graphical representation of what's going on inside the cluster. They don't have that configured here, so it's probably going to keep throwing warnings. What else do we see? The operator is okay, Envoy is okay, and you can see where it says Hubble Relay is disabled, so Hubble won't be able to start up. Cluster mesh is disabled as well; Cilium has the ability to add a cluster mesh and join two clusters together so they operate almost as one cluster, even though they're two completely separate clusters.

So what would we do if we were trying to actually deploy this CNI in HA mode? First, let's take a look at the pods. What do we have here? We've got a Cilium agent and a Cilium Envoy on each node, and we have a Cilium operator. Let's do a wide listing; you might have to widen your terminal so the output isn't confusing. Then add -o wide. There we go. The Cilium operator is running on the minikube node, which is our number one control plane node, and we only have one Cilium operator, so that's not highly available. Now let's look down at kube-vip, which manages our high availability. We can see we have kube-vip, a load balancer with a VIP, running on all three control plane nodes. That provides a VIP to everything that needs to operate against the control plane. However, Cilium isn't using that VIP, so we would need to provide the kube-vip VIP address to Cilium for this to work.
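For reference, the "three commands" for the max files issue are not shown on screen in the transcript, but on a kind/minikube-with-Docker setup this error is usually the inotify limits, so the fix was most likely something along these lines. This is an assumption, not a transcription of what was actually typed; the exact values are illustrative.

    # Likely shape of the fix: raise the inotify limits that nested
    # Docker/Kubernetes exhausts on a default Ubuntu/Debian install.
    sudo sysctl -w fs.inotify.max_user_instances=8192
    sudo sysctl -w fs.inotify.max_user_watches=524288
    # Persist the new values so they survive a reboot
    echo -e "fs.inotify.max_user_instances=8192\nfs.inotify.max_user_watches=524288" | sudo tee /etc/sysctl.d/99-inotify.conf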
We'd also need to enable the Hubble telemetry relay. We'd need to enable kube-proxy replacement, which turns on kube-proxy replacement in Cilium, and then we would remove kube-proxy; you can see the kube-proxy pods, one running on each node there. We'd enable Gateway API as the Ingress replacement, and you can actually run both side by side, so you can run Gateway API and Ingress at the same time.

Now, if you're using Gateway API, you'd need to create an HTTPRoute, or if you're using the Ingress API, you'd need to create an Ingress for the Hubble UI so you can see the telemetry. And if you are installing Gateway API, you would need its CRDs. All of this is probably best done with Helm rather than trying to use this built-in implementation of Cilium; this just gives you a representation of what it looks like when you go to install Cilium yourself.

In fact, let's look at one of those Cilium agents. You can pick any one of the agent pods, the ones named cilium- followed by a hash. Did that work or no? Okay, it doesn't work on mine either; it always does something weird when I do that. CNIs, once you learn how to use them, make your life a lot easier.

All right, see what we have. You can see that it's a DaemonSet, right? You can see it was installed in the kube-system namespace, which is appropriate. It looks like minikube-m06 was the last node created, and their script behind the scenes started installing Cilium before node six was up. Rather than waiting until all the pods on the node were up, it just determined that the node was accessible but not ready, and then it tried to install Cilium on it. That would be a mistake in your own scripting, but still, what they've done here is pretty amazing from an engineering standpoint; we just have a little mistake there. You would wait until all of your pods are up and then put Cilium on all of the nodes.

But what version are we running? It's right there in the events. If we look at the container image where it says "Successfully pulled image," it tells us that the script they're running pulled from quay.io, Cilium version 1.17.4. They switched to Quay from the Docker repos, probably because Docker was charging them too much, or they had timeouts or DNS issues; a few upstream projects have switched to quay.io from the Docker repos.

All right, let's get out of that and take a look at the operator. There's only one operator, so you can tell this is not a high-availability setup even though we requested high availability. The Cilium operator is what manages all of the individual agents and makes sure the agents are running and spun up properly. Kubernetes uses an operator principle, and you'll see that especially with StatefulSets: when you're running a StatefulSet, there will often be an operator in front of it as well, for example with your databases. Most mainstream databases have converted over to operators, typically written in Go, under the hood, and the operator manages spinning up your tenants.
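For reference, a Helm-based Cilium install with the pieces discussed above enabled (kube-proxy replacement, Gateway API, Hubble relay and UI, and the agents pointed at the kube-vip VIP) would look roughly like this. The chart value names are from memory of recent Cilium charts, and the VIP address is a placeholder, so verify against helm show values cilium/cilium for the version you install.

    helm repo add cilium https://helm.cilium.io

    # Gateway API CRDs must exist before Cilium reads them; apply the upstream
    # gateway-api release manifest that matches your Cilium version first.

    helm upgrade --install cilium cilium/cilium \
      --namespace kube-system \
      --set k8sServiceHost=192.168.49.100 \   # placeholder: the kube-vip virtual IP
      --set k8sServicePort=6443 \             # placeholder: your API server port
      --set kubeProxyReplacement=true \       # Cilium takes over kube-proxy's job
      --set gatewayAPI.enabled=true \
      --set hubble.relay.enabled=true \
      --set hubble.ui.enabled=true

    # Wait for the agents, operator, and Hubble components to settle
    cilium status --wait

With kube-proxy replacement enabled and working, the kube-proxy DaemonSet seen earlier could then be removed.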
So you have an operator and then you have tenants. If you recall, when we studied namespaces, we had a MinIO operator, right? And then we had MinIO tenants for Loki for the logging, and a MinIO tenant for GitLab. This is a similar concept: the operator spins up the tenants, which in this case would be the Cilium agents, and we'll see what it does here.

So it has a replica count of one, right? It's controlled by a ReplicaSet. Liveness and readiness probes; it's very important for Kubernetes to have liveness and readiness probes. Let's continue down through; everything is true. We don't see the one replica directly; this is an individual pod, and it just says this individual pod is controlled by a ReplicaSet, and we can look at that ReplicaSet in a minute when we get out of this. Did you see up there where it was controlled by a ReplicaSet? Okay. And you can see the conditions are all true, so this thing is in good shape. The versions should be the same across Cilium, with the exception of, I believe, Hubble. So when you use Cilium in your own cluster, if you decide to, you'll see the versions are pretty much the same except when you get into the Hubble components.

All right, let's take a look at the ReplicaSets across all namespaces. There we go: desired one, current one, ready one. That's all they gave it in the config. Then let's take a look at the pods again and look at one of the Envoy pods. What's the difference between the operator and the Envoy? Correct. So when the script spins it up, think about how it's deploying each one internally; what's the difference between the two? Correct. The DaemonSet runs one pod per node whether you do HA mode or not, and the operator is set up so that whether you do HA mode or not, it only installs one. They're probably using the same script for every Cilium install, so you end up with one operator.

Let's see if we can find anything else in here. In this case the Envoy pod has everything true, and it has liveness, readiness, and startup probes. Notice that? You don't see a startup probe too often; you see liveness and readiness, and sometimes you'll only see readiness, which is not good, because you want to see liveness as well. Startup provides that additional measure. Readiness means the pod is ready to accept workloads. Liveness means "we're alive"; it's basically pinging: hey, are you there? Yes, I'm here. Okay, great, I can keep sending workloads to you. Readiness is: I'm not ready yet, don't send me anything; okay, now I'm ready. Then we go to liveness, right? And startup covers the pod as it's starting up. What kind of timeouts do we have here on startup? Okay, a period of five seconds, a timeout of one second, and the failure threshold, does that say 105? Yeah. And that's because they evidently see a lot of failures starting this up before it's ready. That's not unusual, by the way; somebody actually tested it out and said that was the number that worked. So this team actually tests.
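For reference, a minimal sketch of what liveness, readiness, and startup probes look like on a container spec. The path, port, and numbers are illustrative rather than Cilium's actual values; the large failureThreshold on the startup probe mirrors the idea of the 105 seen above, giving a slow-starting container a long window before liveness checks take over.

    # Illustrative probe configuration inside spec.containers[] (not Cilium's real values)
    livenessProbe:
      httpGet:
        path: /healthz
        port: 9878
      periodSeconds: 10
      timeoutSeconds: 1
    readinessProbe:
      httpGet:
        path: /ready
        port: 9878
      periodSeconds: 5
    startupProbe:
      httpGet:
        path: /healthz
        port: 9878
      periodSeconds: 2
      timeoutSeconds: 1
      failureThreshold: 105   # tolerate a long startup before liveness kicks in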
That's a good sign, by the way, when you see that; it means they have a complete team. If the startup probe fails, it'll wait, then go into a failure state and tell you the startup probe failed, and then it'll try to restart it if that's how it's set up. It's a DaemonSet, so it may kill the pod and start up another.

So let's see if there's anything else good in here. We have a ConfigMap, look at that, an Envoy config; we'll check that out here. And we have kube API access with a token, so it was given a token to be able to access the kube API. That's the kube root CA cert, which is probably in the kube-system namespace, so it automatically has access to that. Node selectors are pretty basic; it just says you can install on any node that's Linux.

Okay, what do we have here? Node affinity: four did not satisfy. It says four nodes didn't satisfy node affinity because each of the first four already had one of these pods, and the fifth one was still spinning up. This must be five; is this five? Scroll to the top; it should tell us what node we're on. Node minikube-m05, yep. So it says the first four already had the DaemonSet pod running, so they're not available, because node affinity says we can only run one per node. The fifth node is where it wanted to install, but it wasn't available yet, and it also errored because it didn't have any free ports for the requested pod ports; the ports weren't up yet on the node. And if we look at the version, this is not 1.17, because this is Envoy. Not everyone installs Envoy, but the engineering team behind minikube included one.

All right, now we're going to look at the DaemonSet, so we'll describe the DaemonSet. Wow, it had to create those pods quite a few times, didn't it? Notice that? What we have is a label: cilium-envoy, part of Cilium. They're using a deprecated DaemonSet template, so they might be using a newer version of the image but an older script. Okay, let's see our liveness and readiness probes, and see if they have a startup probe; there's the startup probe with the 105. We have an environment variable, the K8s node name. We have mounts; what do we have here? BPF mounts, config. And we have a ConfigMap; what's the name of the ConfigMap? It's saying: I need to read this ConfigMap, cilium-envoy-config. And where do you think that ConfigMap is located, which namespace? Yeah, that would be my guess as well.

All right, let's scroll down to the bottom and look at our events and see how many times this thing churned: created three pods, deleted one, created three, deleted four, created three, deleted two, created three, deleted two, and finally got six out of all of that. But that's what it's designed to do; that's part of the self-healing and everything it needs to do when it's spinning up.
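For reference, the inspection commands used in this part of the walkthrough, roughly. Resource names like cilium-envoy and cilium-envoy-config match what a typical Cilium install creates, but verify them with kubectl -n kube-system get all on your own cluster; the pod name is a placeholder.

    kubectl -n kube-system get pods -o wide
    kubectl -n kube-system describe pod <cilium-envoy-pod-name>
    kubectl -n kube-system describe daemonset cilium-envoy
    kubectl -n kube-system get configmap cilium-envoy-config -o yaml
    kubectl -n kube-system get events --sort-by=.lastTimestamp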
So that's part of the magic that makes Kubernetes so great. Okay, let's take a look at the pods again and see if they're all still running, or if any of them have crashed now that they've been running for a while. Yeah, we have a few that restarted, but not too bad. So you can see Cilium adds a little bit of complexity, but it takes over a lot for you. Notice something that's missing in this cluster, something that's in a normal single-node cluster when we spin it up? Think back to what is typically shown in our pods when we just do a minikube start. Obviously nowhere near the number of pods we have showing right now, but what's normally in there that's missing? Remember kindnet? Kindnet is minikube's CNI, and it's gone because we're using Cilium.

Okay, so in lesson six we learned how Ingress works: how Ingress enables clients to access workload endpoints, how Ingress uses an ingress controller, and how Ingress is managed through the Kubernetes API. Remember, we used the Ingress file, and we used kubectl, which sent it to the Kubernetes API. We learned how the Ingress API is being replaced by the Gateway API, and that when you switch to Gateway API with an existing workload setup, you must convert everything of kind Ingress to Gateway API when migrating. Gateway API requires its CRDs to be installed first, so those get installed before Cilium, and then Cilium reads those CRDs when you enable the Gateway API flag. We covered how Gateway API relies on a Gateway and an HTTPRoute, how a Gateway can share many HTTPRoutes across namespaces, and how Gateway API provides greater flexibility, standardization, and scalability. There's one more experimental feature they're testing out relating to TLS routes, which very few practitioners actually need yet.

We talked about CNIs: CNI plugins are used for cluster networking and to manage network and security capabilities. For example, Cilium can enable pod-to-pod and node-to-node encryption. If these nodes were spread out across multiple bare-metal instances with a network cable running between them, it would encrypt the traffic node to node, pod to pod, node to pod, and pod to node; that's transparent encryption. That way you don't need a TLS cert when you're communicating into the pod directly to the container anymore. The Kubernetes method is that TLS is terminated at the gateway now, and then we use pod-to-pod and node-to-node encryption natively, which saves a lot of time when you're managing hundreds and hundreds of containers with TLS. We saw how a CNI can be used to deploy the Gateway API, how Cilium is a CNI, and how Cilium is a networking, observability, and security solution; the observability is through Hubble, so you have to actually enable the relay and also enable the UI. And Cilium works well on bare metal with kube-vip.

All right, let me go ahead and start lesson seven. In lesson seven we're going to learn how to define computational resources using requests and limits.
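For reference, a minimal Gateway API sketch of the Gateway plus HTTPRoute pair described in the review, wired to the Hubble UI as an example backend. The namespaces, the shared-gateway name, and the hubble-ui service name and port are assumptions; the gatewayClassName depends on which controller you run (Cilium registers one named cilium).

    apiVersion: gateway.networking.k8s.io/v1
    kind: Gateway
    metadata:
      name: shared-gateway        # hypothetical name
      namespace: infra            # hypothetical namespace
    spec:
      gatewayClassName: cilium
      listeners:
      - name: http
        protocol: HTTP
        port: 80
        allowedRoutes:
          namespaces:
            from: All             # lets HTTPRoutes in other namespaces attach
    ---
    apiVersion: gateway.networking.k8s.io/v1
    kind: HTTPRoute
    metadata:
      name: hubble-ui
      namespace: kube-system
    spec:
      parentRefs:
      - name: shared-gateway
        namespace: infra
      rules:
      - backendRefs:
        - name: hubble-ui         # assumed service name/port for the Hubble UI
          port: 80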
In Kubernetes, resource limits are crucial for enabling efficient resource utilization and preventing resource starvation. If resources are constrained within the control plane, for example, the Kubernetes API server or etcd may become unavailable. If insufficient resources are available for a particular node or pod, then those resources may become unavailable for use. Available resources are defined within the node status and may be accessed using the kubectl CLI. Node resources consist of CPU, memory, storage, and pods.

Requests enable pods to reserve a specific resource and ensure its availability when needed. Limits, different from a request, define the maximum resources available to the pods on the node. Total node limits may exceed 100% of the available capacity: if you have 12 CPUs, you can set the limits for everything running on that node to 15 CPUs. This concept is based on the realization that not all pods will hit their limits at the same time.

The kubelet monitors node resources and will proactively terminate pods to reclaim resources when a particular resource is under pressure. The kubelet can fail one or more pods in order to reclaim that resource. The kubelet will set the pod status of the evicted pod to Failed and then terminate the pod. So when you look at your pods with kubectl, you'll see it show as failed, and sometimes you can't even see it; it'll go straight to terminating and then it just disappears. The kubelet will attempt to reclaim resources before evicting pods, such as when experiencing disk pressure, and it will delete old unused container images first before evicting. So if you experience disk pressure, it'll try to delete the unused container images on that node first. If you remember back to the example where we were installing a DaemonSet on all three nodes and it took longer, that's because it had to download the image onto each node; those images stack up after a while and aren't deleted until they're garbage collected. One area in which Kubernetes cluster operators experience node pressure is the disk filling up on the node due to unmanaged log collection without a proper log shipping and rotation process. This can cause failures in a production environment several months after a Kubernetes cluster has been provisioned. That's one of the areas that can get you.

All right, OOM killing. Killing a process due to out-of-memory can happen for both node processes and pods. OOM killing a process or pod is usually due to unavailable resources or constraints. When upgrading containers with Helm charts to a new version with tightly constrained memory limits, it is not unusual to experience OOM kills for the pod that was just upgraded. This is one of the reasons that upgrades should be tested on a production-like cluster before deploying to a real production cluster. And this often happens because the upstream maintainer, the team whose resource you're using, forgot to test in a production-like environment before pushing to production.
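For reference, the standard way to read what was just described from the node status: pressure conditions, capacity, allocatable, and the currently allocated requests and limits. The node name is whatever kubectl get nodes shows on your cluster.

    kubectl get nodes
    kubectl describe node minikube       # node name from `kubectl get nodes`
    # Relevant sections in the describe output:
    #   Conditions:          MemoryPressure / DiskPressure / PIDPressure
    #   Capacity:            cpu, memory, ephemeral-storage, pods
    #   Allocatable:         what is actually schedulable after system reserves
    #   Allocated resources: summed requests and limits of the scheduled pods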
What will happen is you'll go to run it in your cluster and all of a sudden you're being OOM killed, either during startup, or it will run for a week and then start OOM killing, and that's because the upstream team forgot to test it. So a lot of upstream CNCF teams that are funded through CNCF use kind, which is different from the kind field on the YAML template; kind here is Kubernetes in Docker. CNCF provides a lot of templating to teams, so they just spin up and test in kind: it pulls everything in, says all pods are running, you've completed your test, pretty basic, and then it pushes automatically to production. So with new Helm charts this is typical.

Container resource requests and limits are optional; we don't have to provide that information in our templating or manifest files. The most common resources to specify are CPU and memory. Both CPU and memory may be requested and limited in the container configuration; that would be under spec.containers, and further down would be your resources block.

When you specify a resource request, the kube-scheduler will determine which node to place the pod on. So if you need five CPUs and you only have one node with five available, that's where it'll be placed. If you need 10 gig of RAM and only one of your nodes has 10 gig available, that's where it goes. The kubelet then handles the reservation of resource requests on the node. A container is allowed to consume more resources than requested if necessary. Container requests are set as follows: spec.containers[].resources.requests.cpu and spec.containers[].resources.requests.memory.

The kubelet enforces limits on each node. A container may consume more resources than specified short term, but it is generally not allowed to consume more than specified over time. So you might see a pod consuming more than you've allowed, but that's not allowed to go on for very long. When you specify a resource limit, the kubelet will enforce that limit: CPU limits are enforced by throttling, and memory limits are enforced with OOM kills. A container that exceeds its memory limit may not be OOM killed immediately. Limits are set as follows: spec.containers[].resources.limits.cpu, which is different from before where we saw requests, and spec.containers[].resources.limits.memory.

All right. Determining the correct setting for requests or limits can be accomplished in multiple ways. The first method is to review the documentation for the application running inside the pod. Typically, the engineering teams who maintain these products will post a minimum set of requests and limits, along with advice on whether certain limits such as CPU should be avoided. In other words, the application may experience a temporary spike in CPU, and if that happens they recommend not setting a limit because it's temporary; it may only last a few seconds, it may happen during startup, for example, and they don't want it to throttle and then fail. That's usually because the team has tested their product extensively. So you want to look for advice about avoiding CPU limits in the documentation.
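For reference, a minimal container spec showing where requests and limits live, matching the spec.containers[].resources paths just described. The pod name, image, and numbers are illustrative.

    apiVersion: v1
    kind: Pod
    metadata:
      name: resource-demo        # illustrative name
    spec:
      containers:
      - name: app
        image: nginx:1.27        # illustrative image/tag
        resources:
          requests:              # reservation the scheduler uses for placement
            cpu: "500m"          # half a CPU
            memory: "256Mi"
          limits:                # ceiling the kubelet enforces
            cpu: "1"             # throttled beyond this
            memory: "512Mi"      # OOM killed beyond this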
In the absence of good documentation on resources, the next best method is setting a practical resource value for both requests and limits for CPU and memory. Using your favorite deployment method, start up the pod with a container and monitor both the events and logs for any error messages. Adjust the settings as necessary to eliminate startup issues first, and then proper monitoring of the container logs is necessary to troubleshoot any issues that may arise after the initial startup sequence is finished. So there are two steps involved when determining requests and limits: first you get through startup, and then you monitor it over time.

Now for the practical application. Ensure you have a fresh minikube profile, and you're going to get the node resources; once that comes up, let me know. We'll be right back. Wow, what's going on with your disconnects up there? That's weird, I don't think mine does that at all. Have you ever seen that pop up when you're not in session? They only allowed the student environment to run for 15 minutes. Have you noticed whether it does that every 15 minutes? I know when we were setting up the VMs, we had to hurry because they were only up for 15 minutes at a time. Okay, so the VM is staying up, but it's disconnecting the network and reconnecting.

All right, we're going to look for the maximum capacity for CPU. Where do you find that? It's under Capacity, cpu. What are the allocated resources for CPU? Scroll down just a little bit, there we go. How much do we have allocated? 750m, and given that 1000m is one CPU, our requests come to three quarters of a CPU, and limits are zero. Which pod has the greatest CPU request? Yeah, the most important one in the whole thing, huh? And what is the disk pressure condition? Yep, no disk pressure. And if we look, what is the memory pressure? And how about the PIDs? That's the one that gets people; they don't even realize it's possible to run out. And which component maintains and reports on pressure? What handles that process? Right, the kubelet. Looks like Neil used a 16-CPU setup; how much memory do we have there? 12.2 gig, it looks like.

So now we'll create a new deployment file using the nginx app, but we're going to copy it to an nginx-app-limits YAML file. All right, so what are we doing here? 12 megabytes, correct. Okay, let's deploy the nginx app. Yep, we can see we have a memory and a CPU limit. Now let's go back to the pods and see what's going on here. Let's take a look at one of them; we can figure this out. So it's being OOM killed, but it's not telling us why. Scroll down to the bottom: successfully pulled the image, it's created it now five times, it's already present, and it's just in this back-off restarting failed container loop. See if we can look at the logs; you can get the logs for that container. All right, go ahead and delete that, just delete the deployment, and we're going to modify that file.
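For reference, a reconstruction of what the nginx-app-limits exercise file likely looks like: a Deployment with a deliberately tiny memory limit so the containers get OOM killed, as seen above. The 12Mi figure and the three replicas come from the discussion; the labels, image tag, and CPU limit are assumptions.

    # nginx-app-limits.yaml (sketch): memory limit set far too low on purpose
    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: nginx-app-limits
    spec:
      replicas: 3
      selector:
        matchLabels:
          app: nginx-app-limits
      template:
        metadata:
          labels:
            app: nginx-app-limits
        spec:
          containers:
          - name: nginx
            image: nginx
            resources:
              limits:
                cpu: "1"
                memory: "12Mi"   # too small for nginx, so it gets OOMKilled / CrashLoopBackOff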
No, it's not telling us why it's being OOM killed; it's just telling us OOM kill. So we're going to see if we can prod it to tell us something. Oh, we didn't delete it yet; go ahead and delete it. Okay, so let's modify this, that's correct, to 1Mi, deploy it, and we'll get the node resources. And are the pods running? Let's check: the deployment shows zero of three ready. So let's look at the pods again; we have a container create error, so describe the pod. There's a minimum memory on this; it failed to create the pod sandbox, and down at the bottom it says the container init was OOM killed, memory limit too low. This is interesting: for some reason your minikube is treating this slightly differently than mine does. When I run this, it actually tells me what the minimum is; on yours it just says unknown, where mine says minimum 6Mi. And yours wouldn't run even with 12Mi, which is quite interesting.

Okay, so let's try to check the logs; we have no logs, right? And after an hour the events are deleted, so if this were sitting here for an hour, the early events would be gone. Okay, close that out. I'm going to go off the script here because for some reason your minikube was acting differently with that deployment. What it says, second to last above "pod sandbox changed," is where yours says unknown; mine says minimum memory is 6Mi, but we tried it with 12 and it didn't work either. So let's go ahead and delete that deployment and change it to 100Mi, so we can figure out what's going on here, because this changed in the last day or two with this deployment, and I wonder if the version is the same. That's weird.

Let's take a look at it first, though: describe it while it's running and see what it says, and look at the events down at the bottom again. Yeah, everything looks good there. Okay, so go ahead and delete that and try it; you said you wanted to try six. There we go. All right, let's take a look at the OOM kill; we know for sure it had an OOM kill, and there will probably be a good event in there. Well, it didn't give us the message on the RAM, that's interesting. When I run this on my minikube, it actually gives me the event with the minimum requirement, so that was interesting. I see up there memory 6Mi, CPU 250m, and nothing else. The log shows configuration complete, ready for startup, and it started one, two, three, four, five, six worker processes, and then it's OOM killing before it gets any further. That's interesting, because nginx automatically reads how many CPUs are available on the node; in this case there are 16 CPUs, so you should see "start worker process" sixteen times, and you only see six. So that's why the crash loop back-off, and no readout on the minimum.
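For reference, the troubleshooting loop being used here, roughly; the deployment and pod names are placeholders matching the exercise above.

    kubectl get deployments
    kubectl get pods                                 # look for OOMKilled / CrashLoopBackOff
    kubectl describe pod <nginx-app-limits-pod>      # events: pod sandbox, OOM kill reason, exit code 137
    kubectl logs <nginx-app-limits-pod> --previous   # logs from the container that was killed
    kubectl get events --sort-by=.lastTimestamp
    kubectl delete deployment nginx-app-limits       # then edit the YAML and re-apply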
All right, that's a nuance of your setup. And in that case, with a crash loop back-off, there are generally no logs if the pod stays in ContainerCreating; ContainerCreating wouldn't have a log, and when it goes into crash loop back-off you may not have a log either.

Okay, so we're going to delete that and query the node. We're going to check what is allocatable for CPU and RAM. And how much RAM, 12.2 gig? Yep. And how many pods were in our last deployment, the nginx app; how many pods did it deploy? Three, there we go.

Okay, so let's do this. We'll create a new nginx app file; just copy it over and call it nginx-app-requests. All right, so this exercise is designed for a 12-CPU cluster, and we have three replicas. So we're going to set the CPU limit first: three replicas times three CPUs is nine CPUs, and three times three gig is nine gig of memory, right? Let's set the limit to three for CPU, so 3.0, and for memory set 3000Mi, and then for requests set the CPU to 7.0. Neil has a slightly different setup for you than what I used, so we'll need to modify this here in a little bit. Make sure that's correct, and we're going to deploy it.

All right, so what we have here is limits of three and requests of seven on CPU, and it says the request of seven must be less than or equal to the limit. So you can't request more than your limit. Right, if you're limited to three, absolutely, and the same applies to the RAM.

So now we're going to modify it. A request is a reservation. It's like reserving a hotel room: hey, I need to reserve seven CPUs and four gig of memory. The scheduler and the kubelet say, okay, I've got seven CPUs available and four gig of memory. And if there isn't, it says, hey, I have no nodes to put you on, you'll have to wait until I have a node available, and it will loop and loop until a node is available. The limit is: okay, you've requested it, and now my maximum is whatever we have there. So in this case we're going to modify this so that they both say the same thing: CPU 7 and memory 4000 on both.

It did not schedule, no. And it actually gave you a very good, verbose error message; kubectl sometimes does provide very verbose messages. All right, that's the first step in troubleshooting. And what is the next step? Yep, because it hasn't started up yet, there are no logs. So now what's the next step? Describe the node. Let's take a look and see what that looks like. So what do we have running on there? And how much are they consuming of our requests? 92%? Yep. So if you look at allocated resources: 14750m, so 14.75 CPUs requested. And the limits were...
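For reference, the relevant part of the nginx-app-requests container spec after the fix discussed above: requests may not exceed limits, so both are set to the same values here (7 CPUs and 4000Mi per the walkthrough). The container name and image are assumptions.

    containers:
    - name: nginx
      image: nginx
      resources:
        requests:             # must be <= the corresponding limits
          cpu: "7"
          memory: "4000Mi"
        limits:
          cpu: "7"
          memory: "4000Mi"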
Per individual pod, correct. And so this is saying we're trying to deploy pod number three, but 92 percent of CPU is already requested, and limits are at 87 percent, which doesn't matter; the limits can go over a hundred, as you can see. And we have plenty of memory available, although we're maybe a little short. So it's telling you we're full. We're going to modify the limits, and you can go ahead and take that down. All right, delete the deployment and let's modify.

And what should we modify? Let's see, we have 16 CPUs available, and normally we have 12 when we do a minikube cluster, so we have four additional available. Normally we would change the request to 6, or 2, or 3. Let's go ahead and change it to six and see. Go ahead and apply it. We can see it's still pending, so we can go straight to the node and check the node resources. And what do we have for our allocated? It's lower. Okay, so let's change it again; yeah, let's try five. Let's check the one that's still pending, describe the pending pod and see what's going on. Okay, let's look at the node. All right, so how much memory is requested? Okay, and let's go up and deduct that from how much is available. If we scroll up a little higher, it tells us the total allocatable was 12236Mi; scroll back down, and subtracting the roughly eight gig already requested leaves us with about 4 gig, and for some reason it's still saying that's not enough. So it won't let you go to 100% on requests; that's what it's telling you.

Okay, so let's change it again. Again, your minikube cluster is two days newer than mine and slightly different. You'll run into this with Kubernetes clusters; that's the nuance of two different clusters spun up on two different VMs, or two different clouds. Okay, so we're going to change the requested memory; let's change it to 3Gi. So it won't let you go to 100% for requests. All right, boom, all three run. Let's take a look at the nodes. And you can see we've got 98% allocated on the CPU, right? So is there any room left for more deployments? Yeah. And this is what mine would look like too, but in general you shouldn't have any more than 90% of a node requested on either CPU or memory. And as you can see, it wouldn't even let us do 100% with everything running; it cut us off at about 99%.

And minikube is actually great for testing concepts out; it just has some limitations. It's great Docker engineering magic. Oh, interesting, it's restarting the network again. All right, so, questions: Can requests be greater than available resources? Can limits be greater than available? Correct, that is correct. All right, we're going to do a review, and then, actually, you know what?
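For reference, the checks used to figure out why the third replica stays Pending; the label selector and node name are placeholders for whatever your deployment and cluster use.

    kubectl get pods -l app=nginx-app-requests
    kubectl describe pod <pending-pod>           # look for a FailedScheduling event,
                                                 # e.g. "Insufficient cpu" / "Insufficient memory"
    kubectl describe node minikube | grep -A 10 "Allocated resources"
                                                 # compare requested totals against Allocatable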
Yeah, it's 11 o'clock; that went a little over and took a little longer than I thought, so we'll take a 15-minute break, come back at 11:15, and then we'll do our review and go to the next lesson. See you in 15.

Let's go ahead and review. In lesson seven we learned about node resources and limits, and how constrained control planes can affect the Kubernetes API server. Available resources are defined in the node status, and the node status is accessible by describing the node using kubectl. Node resources consist of CPU, memory, storage, pods, and also PIDs. Requests enable pods to reserve a specific resource, and requests ensure a resource is available if needed. Limits define the maximum resources available to a pod. Total node limits may exceed 100% of the maximum resources, and that is due to the fact that most pods will not hit 100% at the same time.

The kubelet monitors node resources for node pressure. The kubelet will proactively terminate pods, it can fail one or more pods to reclaim resources, and it will try to reclaim resources before evicting pods, by deleting old container images first. We learned how node disk pressure can be caused by unmanaged logs, and why a proper node log shipping and rotation process is important.

OOM killing a process can happen to node processes and pods. So if you run into a situation where your pods are working but your node is unresponsive in some respect, and you can still access it with the Kubernetes API server, you might see that a process has been killed inside the node itself, which you don't normally interact with, but you may see that in your events. OOM killing can happen due to unavailable resources or constraints. This can happen when upgrading a container with tight constraints. Upgrades should be tested on a production-like cluster first, and upstream maintainers sometimes forget to test in a production-like environment before pushing to production; that's actually more common than you would think it should be when working with Helm charts, by the way. Container resource requests and limits are optional, and the most common are CPU and memory.
I've actually rarely seen anything else used. Resources may be requested and limited in the container config, and here's a new one: we didn't go over this because it's in beta and not enabled in the cluster, but you can also request and limit at the pod level in the pod config. Think about what that might mean: a pod can contain multiple containers, right? So if we can set requests and limits for the pod, that helps when we can't add them for the container because we might not have access to those configs; if we have access to the pod config we can add it there, or we can just set the request or limit for the pod only and skip the containers. There are certain situations where that might be advantageous; somebody requested it and it's now in beta.

The kube-scheduler determines which node to place the pod on, and the kubelet handles reservation of the resources; the kubelet runs on each node, and the kube-scheduler runs as a pod and handles the scheduling. Containers may consume more resources than requested. The kubelet enforces resource limits: a container may temporarily consume more than its limit, but containers may not consume more than their limits over time, though there might be a delay. CPU limits are enforced by throttling, memory limits are enforced with OOM kills, and OOM killing a container may not happen immediately, although as we saw when we practiced it, it looked pretty immediate to me.

Determining the correct resources can be done in two ways: review the documentation for recommendations, or set a practical resource value for both requests and limits, monitor the events and logs for any error messages, and adjust as necessary first to eliminate startup issues.
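For reference, a sketch of the pod-level requests and limits mentioned above. This is a recent feature behind a feature gate and is not enabled in the course cluster; the field layout shown here (a resources block directly under spec) is from memory of the proposal, so verify against the Kubernetes documentation for the version you run before using it.

    apiVersion: v1
    kind: Pod
    metadata:
      name: pod-level-demo            # illustrative name
    spec:
      resources:                      # applies to the pod as a whole (beta, feature-gated)
        requests:
          cpu: "1"
          memory: "512Mi"
        limits:
          cpu: "2"
          memory: "1Gi"
      containers:
      - name: app
        image: nginx                  # containers may omit their own resources blocks
      - name: sidecar
        image: busybox
        command: ["sleep", "infinity"]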