Visit the Apache Nifi - GROUP 1 course recordings page
WEBVTT--> All right, Randy, get CSVL, add the name after me, read input seasons, figure long, no, no. --> Randy, yours looks good as well. --> If you want to apply a label or any of those types of things, have at it, but it looks --> like yours is exactly like mine, and hopefully it worked. --> If it didn't, let me know if you have any questions. --> See, Brett, you added some color to your processors, like I did. --> Good job. --> I didn't explain that. --> So, you know, good job finding that out. --> You are now, you know, bending your arrows, you're beautifying this. --> I think it looks great. --> I really like the color scheme. --> You can go through and add labels like I did. --> If you wanted to, you could group these into a label. --> But overall, this looks great. --> If you have any questions, let me know. --> Amanda, Amanda, also the same, you know, added some colors to this. --> The only thing I do, maybe Amanda, is just like left to right. --> You have obtained a CSV down to update schema. --> You may just want to put these processors in order from left to right, and then, --> you know, go down to your log messages. --> So, you know, you keep your failures going left to right, and you keep your success --> left to right, and then use, you know, any downward flow as your log message. --> But it looks like you got it, you understand. --> So, you know, that's just any tips I would give. --> And finally, Alyssa. --> Perfect, perfect. --> You got naming. --> If you want, you can name your connections. --> You can apply some labels, you know, change color of your processors, --> those types of things. --> But yours looks good as well. --> Hopefully, you didn't have any issues. --> If you do, let me know. --> But no, great job, and it looks, you know, spot on. --> Okay. --> So, any final questions about your data flow? --> And again, you know, we used this to pick up a CSV file. 
--> We created a controller service, and, you know, then we used that service --> to do a CSV reader and have the data read in. --> We created an Avro schema that we just copied and pasted in, but, you know, --> if you were creating a schema, you could copy and paste your own schema in. --> And then we, you know, we used that schema to run the CSV files through --> to create a JSON document, push that back to the file system where we --> renamed it to a .json, even though it still had the .csv. --> During lunch, I'm going to work on mine to come up with a regex pattern to really --> clean that name up, you know, while I'm sitting here eating, and I'll paste in chat, --> you know, how to do that, you know, for those that want to update --> their file name. --> So, any final questions before we move on to NiFi Registry? --> Go ahead. --> So, on the GetFile, I noticed there's like a little symbol on it, --> it's like a shield, like a half red, half white shield. --> Did you cover what that meant yet? --> Let me see here. --> Right here? --> Is that the one you're talking about? --> Yeah, so there's that little shield, I'm just curious what that means. --> Let me see where I set that. --> It's on mine, too. --> So, there's the processor name shield. --> The red and white shield. --> The red and white shield, okay, I see this, you're talking about right here, --> this little icon here, right? --> Yes. --> Okay, sorry. --> So, that red and white shield, you know, in the UI indicates --> a restricted component processor. --> Meaning, these processors can be used to run unsanitized code --> or get data on a host. --> You know, it's just a quick visualization that, you know, --> it's like risky, bingo, that's the word I'm looking for. --> So, if you noticed, it's on the GetFile --> and the PutFile. --> So, you know, the risk level is a little higher there. --> I figured that's what it was.
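The file-name cleanup mentioned above (output that should end in .json but still carries .csv) comes down to one regular expression. The sketch below only illustrates the pattern in plain Python; it is not the instructor's pattern, and in NiFi the rename would more likely be done on the filename attribute. The function name `csv_name_to_json` and the sample file names are hypothetical.

```python
import re

def csv_name_to_json(filename: str) -> str:
    """Replace a trailing .csv (or .csv.json) with .json; leave other names alone."""
    # \.csv        the old extension
    # (\.json)?    optionally followed by an already-appended .json
    # $            only at the very end of the name
    return re.sub(r"\.csv(\.json)?$", ".json", filename, flags=re.IGNORECASE)

print(csv_name_to_json("inventory.csv"))       # inventory.json
print(csv_name_to_json("inventory.csv.json"))  # inventory.json
print(csv_name_to_json("notes.txt"))           # notes.txt
```

Anchoring the pattern with `$` keeps a name such as `csv.report.txt` untouched, because only the final extension is rewritten.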
--> So, if we were going to run like, I'm assuming there's like --> some sort of Python code. --> Would it also have that symbol on it? --> It could. --> So, processors that are able to run code unchecked, --> you know, may have that as well. --> So most likely, I think it does. --> Actually, let's just look at it and see. --> We can drag the processor down to run Python code. --> Come on, wait and see. --> Yeah, you see the shield is right beside --> ScriptedValidateRecord, InvokeScriptedProcessor, ExecuteScript, --> you know, those types of things. --> So, yeah, it is, yeah. --> It's like some sort of security. --> Yeah, it's a more risky thing. --> You can see, you know, a lot of these --> are where it interacts with something outside of NiFi, --> right, because if it's in NiFi, --> if we're creating attributes, if we're dealing with data, --> you know, those types of things, you know, --> it keeps it within NiFi, --> or if there's like a secure connection, --> you know, those types of things, --> then, you know, we won't have that shield. --> But yeah, that's a great question, --> I've actually never been asked that question, --> why there's a shield. --> But yeah, it's risky, you know, --> and it's a restricted component processor --> that can run unsanitized code, you know, --> or change or get data on the whole system, --> you know, those types of things. --> Great question though. --> Okay, any other questions, you know, about our flows? --> I had a question. --> Yes. --> So I know you mentioned that we have to create --> our own schema for the files. --> Yep. --> Okay. --> Well, what if, for example, in our office, --> we have a bunch of CSV files. --> Well, they don't follow like a very straightforward --> comma-separated values format. --> How can we come up with like a custom schema?
--> So what I like to do in those cases --> is I like to read the CSV in, --> and you know, --> you can actually do like a lookup on that CSV as well. --> You can do a regex pattern. --> You know, so in the past, --> what I've done is used like a regex pattern --> to extract the data out of the CSV files. --> And I will take that data, extract it, --> and push it as an attribute. --> So if you look, you know, for JSON, for instance, --> you know, I can read that JSON with --> EvaluateJsonPath, for instance. --> You know, I can read that JSON that we, --> you know, wrote. --> I could change the destination to flowfile-attribute --> and now I can read that JSON in --> and take all the fields and I can go through --> and start defining the fields, adding properties, --> you know, regex patterns or other things --> to pull the data out of a JSON --> and create an attribute with it. --> And then once it's an attribute, --> I can do a lookup if I needed to. --> I can match it to some other data if I needed to. --> And then I could, you know, extract that data --> even from a CSV, pull that data in, --> have it as an attribute, and then turn around --> and write that as proper JSON or proper CSV. --> So there's a couple of different ways to handle, --> you know, a mix of data. --> But yeah, --> if it's a lot of different formats --> and a lot of different things, --> you are now looking at, you know, --> doing regex patterns to extract that data --> or using the CSV reader --> and that controller service. --> So, you know, --> if you remember in the CSV, --> this is a really good question, --> so let's go into that. --> In the CSV reader, --> we had like, you know, what is the record separator, --> the value separator, those types of things, right?
--> So you could actually, depending on the content, --> read the contents of a CSV file, --> determine the, you know, the separator --> or something else that distinguishes it as different --> and send it to its own CSV reader. --> In some of these, --> the quote character may be different. --> Instead of a CSV, it may be tab-separated, a TSV. --> So, you know, you may have a tab in here. --> So you may have a few different --> CSV reader controller services --> and you will read the contents of that message --> or that piece of data --> and filter and route, you know, --> to the correct CSV reader. --> Does that help answer your question? --> If you have, --> like, a better example of the data itself, --> I can help you quickly, just on a flow even. --> So, go ahead. --> Yeah, yeah, that definitely helps with that. --> I mean, I can show you what I have, --> but I don't think it's pretty. --> No, no. --> But another quick thing, --> is there any way to, since I'm a programmer, --> so I think I like to code, --> are there any ways we can embed any code --> in like a data set or not? --> So, do you like Python? --> I mean, I know everyone works with Python. --> I haven't worked with it in a long time, --> but I mean, I'm trying to figure it out. --> You could --> do an ExecuteScript. --> Oh, come on, latency. --> You can do an ExecuteScript --> where you send that data and execute a script on it. --> You know, they're starting to get away --> from some of the scripting languages --> that are used, just because of security risks. --> But, you know, you could do --> Clojure, for instance, or Groovy. --> In this processor, --> we have some others, --> and you can still do Python. --> It's just, you know, --> they're letting you know --> they're getting rid of that. --> We also have... --> Are they getting rid of Python? --> No, they're not. --> No, no, no, not at all. --> Actually, they're getting rid of it --> through that processor.
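The routing idea described earlier (look at a file's content, work out its separator, and hand it to the matching CSV reader) can be sketched outside NiFi in a few lines of Python. This is a minimal illustration under stated assumptions: the separator is guessed from the header line only, and the reader names in `READERS` are hypothetical labels, not real controller service names.

```python
# Hypothetical reader labels, one per separator we expect to see.
READERS = {
    ",": "csv-reader-comma",
    "\t": "csv-reader-tab",
    ";": "csv-reader-semicolon",
    "|": "csv-reader-pipe",
}

def detect_separator(text, candidates=(",", "\t", ";", "|")):
    """Return the candidate that occurs most often in the first line, or None."""
    lines = text.splitlines()
    first_line = lines[0] if lines else ""
    counts = {sep: first_line.count(sep) for sep in candidates}
    best = max(counts, key=counts.get)
    return best if counts[best] > 0 else None

def pick_reader(text):
    """Map file content to a reader label; unknown content goes to 'unmatched'."""
    return READERS.get(detect_separator(text), "unmatched")

print(pick_reader("id\tname\tamount\n1\tacme\t10.5\n"))  # csv-reader-tab
print(pick_reader("id,name,amount\n1,acme,10.5\n"))      # csv-reader-comma
```

Counting only the header line is a deliberate simplification; data that embeds separators inside quoted values would need a more careful check.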
--> In the newest version of NiFi, NiFi 2.0, --> you can actually create Python processors --> because, remember, you know, --> yesterday we went over what a processor is --> and, you know, basically it's a Java jar. --> And so, you know, --> instead of creating a Java jar, --> a Java NAR in this instance, --> with your own custom Java logic, --> you can actually create a Python processor --> to accept that incoming data. --> And then you may have, like, --> a Python script to parse that --> and push that out as an attribute, --> for instance. --> You do have that capability. --> You also have... --> You can use InvokeScriptedProcessor. --> You can use ScriptedFilterRecord. --> There's a few different ways --> where you can actually then use code, --> you know, to do these types of things. --> Even a ScriptedPartitionRecord, --> those types of things. --> So you can run your code, --> your custom code, you know, in this. --> You can invoke scripts. --> You can, you know, --> do those types of things if you'd like. --> Awesome. --> And if you get a free second, --> just send me in chat, like, --> a couple of the CSV examples you're talking about --> and, like, later tonight --> I can whip up a quick... --> I'm not going to do the flow front to back, --> but I can show you an example --> of how I would do it --> using best design principles. --> Yeah, I mean, I can give you... --> Because we mostly deal with financial data --> and the way that it comes from... --> It comes from a website --> that we have to manually download the files from. --> So it comes with a lot of extra tags and stuff. --> We have to strip out a lot of that. --> If you noticed in the downloads folder, --> one of my other hands-on exercises --> is actually pulling from a website. --> How do you download that? --> You manually download it, you said. --> Is it like a zip file every day or what? --> It's a combination.
--> It's a long process, --> but it comes in different CSV files --> or some of them have to be converted --> into Excel files. --> Oh, okay. --> I was thinking if it's like an API you hit --> and it will give you the file, you could... --> Charlie, we've been asking for that --> for many years now. --> It's not a thing that has happened. --> Oh, okay. --> Yeah, so it sounds like the source of the data --> isn't as automated as you would hope, --> but if they do get to that point, --> just remember we have HTTP processors as well. --> So you can actually get... --> You can handle HTTP requests. --> You can use InvokeHTTP. --> You can actually listen or post. --> There's a lot of capabilities there. --> So just FYI, --> if your source gets to that capability --> where you can automatically pull, --> you may set up a dataflow to pull once a day --> and then you can take their output, --> filter, sort, parse it out, put it back together, --> and then you even have the capability --> to export as Excel. --> How about... --> I'm sorry, I don't mean to take over. --> Any SQL, MS SQL? --> We work with SQL a lot. --> Yeah, so mostly if you work with like --> MS SQL, for instance, --> and I know you have Access and stuff like that, --> you have tons: PutSQL, QueryDatabaseTable, --> QueryDatabaseTableRecord, QueryRecord. --> You can list the database tables. --> You have tons of SQL capabilities. --> Does that help, Pedro? --> Yes, sir, I appreciate it. --> Okay, perfect, perfect. --> All right, any other final questions before we move on? --> Awesome, awesome, awesome. --> All right, so the next thing we're going to do, --> now that we've built a couple of dataflows. --> You know, we do not want to take it too far, --> because we're taking a chance on turning a flow on --> and basically ingesting our own NiFi instance --> and breaking it. --> Maybe this crashes or something else like that, --> so let's start talking Registry, NiFi Registry.
--> So if you can, --> you want to go back into your folder. --> You want to go back to your downloads. --> And we are going to install NiFi Registry. --> So to do that, you want to go to the NiFi Registry bin.zip, --> which is right here. --> Go ahead and extract that, --> but don't run it yet if you don't mind. --> So extract it into the folder. --> You should see NiFi Registry, --> you know, just like you were doing with NiFi. --> Give you just a minute on that. --> Okay. --> So you should have that extracted. --> We want to go into the NiFi Registry folder. --> I have already executed mine, --> but, you know, again, it's the same principle as with NiFi. --> You're going to have some already set folders --> and then when we start this, --> it will create the rest of the folders that you need. --> But the main folder that we are concerned about --> with Registry is the conf folder. --> And so if you can, you know, go into your conf folder --> and then look at the NiFi Registry properties file. --> So Registry is pretty easy compared to NiFi. --> There are a lot fewer properties, thankfully, --> it's a little bit easier to configure, --> you know, but we'll kind of go through --> some of the more advanced configuration options. --> Your web properties, you should be able to see that. --> You know, we are basically listening on 0.0.0.0. --> You know, so it's going to bind to all the ports, --> or rather bind to all the hosts. --> And then also the port is 18080. --> You can change that here. --> I would just leave it the same --> because mine's set up for that --> and we're going to work off of that. --> There's how many threads the Jetty server should have; --> for some of this more advanced, sysadmin-type work, --> you know, you may want to mess with those. --> There's a lot of different tuning --> and performance considerations to take in --> if you're building this out in a more scalable, --> excuse me, production way.
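The web properties described above can be read back with a short script to confirm which address the Registry UI should answer on. This is a sketch, not part of the course material: the two property names are the ones assumed to appear in conf/nifi-registry.properties, the sample values mirror what is described in the walkthrough (0.0.0.0 and 18080), and `parse_properties` and `registry_url` are hypothetical helper names.

```python
SAMPLE_PROPERTIES = """\
# web properties (sample values, mirroring the walkthrough)
nifi.registry.web.http.host=0.0.0.0
nifi.registry.web.http.port=18080
"""

def parse_properties(text):
    """Parse simple key=value lines, skipping blanks and # comments."""
    props = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        props[key.strip()] = value.strip()
    return props

def registry_url(props):
    """Build the browser URL; a wildcard or empty bind host is reached via loopback."""
    host = props.get("nifi.registry.web.http.host", "")
    port = props.get("nifi.registry.web.http.port", "18080")
    if host in ("", "0.0.0.0"):
        host = "127.0.0.1"
    return f"http://{host}:{port}/nifi-registry"

print(registry_url(parse_properties(SAMPLE_PROPERTIES)))
```

With the sample values this prints `http://127.0.0.1:18080/nifi-registry`, which is plain HTTP because security is not enabled in this exercise.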
--> Security properties, you know, --> here's where you're going to go in. --> You're going to start setting your key store, --> your trust store. --> You know, you can use the authorizers.xml --> that you may have created with NiFi. --> And so, you know, you can reference those. --> For this scenario we do not have, --> you know, security enabled for the Registry. --> You know, it's pretty much wide open, --> but, like NiFi, there's single sign-on as well. --> So, you know, there's a lot there you can configure. --> And then the providers, you know, --> this is where you will go in. --> And so with Registry, --> Registry is backed by your version control. --> I mean, what do you all use for version control? --> Do you use GitLab, GitHub? --> Azure has a Git service, I think. --> Is anyone able to tell me? --> We use Azure DevOps. --> Azure DevOps, okay. --> Okay, so there is a, --> let me take note of that. --> You can use Azure DevOps --> and I'll take note of that. --> I haven't configured NiFi Registry to use that, --> but I'm taking a note because I want to make sure --> I send you the right documentation for that. --> So, in your real world environment. --> It does. --> Okay, yeah. --> Yep, so perfect, perfect, perfect. --> Yeah, no, so under the hood it uses Git, --> but we need to specify some things --> in that, you know, for Azure. --> So, and then you can use GitLab and GitHub as well. --> You know, the beauty of using Azure, --> and this isn't really well known yet. --> So, Microsoft is incorporating NiFi --> more and more into its Azure stack. --> And, you know, I don't have 100% confirmation, --> but, you know, Microsoft has been reaching out --> to some of us because they plan to make it --> a service, as part of Azure.
--> And so, you know, you will potentially --> in the future be able to configure this --> a little bit easier just because it's going to be --> officially supported as part of the Azure stack, --> if that makes sense. --> So, anyway, so, yeah, in the providers file. --> In case it's relevant to your notes, --> we have an on-prem Azure DevOps, --> not the cloud version. --> There are differences. --> I don't know if you're... --> Since you were writing that down, --> I just figured... --> No, thank you. --> No, and like what I like to do --> is just get as much information as I can --> because at the end of this class, --> I want to make sure that I send all these notes out --> as well as some more advanced capabilities --> and give you like a little handbook to work off of. --> And if I know the environment you're working off of, --> I can tailor that so when you get the email, --> we can go from there. --> So, it's Azure DevOps on-prem. --> Okay. --> So, you know, you'll actually go into your providers, --> into your conf directory to do that. --> If you look, you will see a providers file. --> In this one, for instance, --> you know, everything is commented out, --> but, you know, you would just use Git, --> the access user, the password, --> the repository to clone, --> and those types of things. --> Like I said, I do not have a GitHub set up for this, --> but what I will do is go through --> and give you some directions --> on setting your GitHub up for this --> or GitLab or Azure DevOps --> and, you know, it will help you --> when you're configuring this in the future. --> What we are worried about right now is NiFi to Registry, --> but like I said, I'll give you instructions --> on Registry to your Git version control. --> But you will define that in your providers, --> you know, just like the properties say. --> As for database properties, under the hood, --> Registry and NiFi --> like to use H2, --> which is constantly being updated.
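For reference, the commented-out Git section of the providers file mentioned above has roughly the shape embedded in the sketch below. The provider class and property names are given as they are understood to appear in NiFi Registry's documentation; the user, token, and repository URL are placeholders, and `flow_provider_settings` is a hypothetical helper that simply reads the values back so the structure is easy to see. Nothing here has been checked against an on-prem Azure DevOps server.

```python
import xml.etree.ElementTree as ET

# Sample fragment: values are placeholders, not a working configuration.
SAMPLE_PROVIDERS = """\
<providers>
    <flowPersistenceProvider>
        <class>org.apache.nifi.registry.provider.flow.git.GitFlowPersistenceProvider</class>
        <property name="Flow Storage Directory">./flow_storage</property>
        <property name="Remote To Push">origin</property>
        <property name="Remote Access User">example-user</property>
        <property name="Remote Access Password">example-token</property>
        <property name="Remote Clone Repository">https://example.com/your-org/nifi-flows.git</property>
    </flowPersistenceProvider>
</providers>
"""

def flow_provider_settings(xml_text):
    """Return the flow persistence provider class and its properties as a dict."""
    root = ET.fromstring(xml_text)
    provider = root.find("flowPersistenceProvider")
    if provider is None:
        return {}
    settings = {"class": provider.findtext("class", default="")}
    for prop in provider.findall("property"):
        settings[prop.get("name")] = (prop.text or "").strip()
    return settings

for name, value in flow_provider_settings(SAMPLE_PROVIDERS).items():
    print(f"{name}: {value}")
```

The access user, password, and repository to clone named in the walkthrough correspond to the last three properties in the fragment.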
--> There's an extensions directory as well; for AWS, --> there's some special configuration there. --> Identity mapping, you know, --> there's some additional security, --> Kerberos properties, you know, --> those types of things. --> So, you know, when you start putting this --> more in production, --> you're going to look at your conf directory first, --> start getting that filled out, --> and go to your providers and get that checked. --> What I will do is I will try to get you --> a good example of your providers --> so you can use your Azure DevOps, --> but, you know, this is where you will define it. --> So, with that being said, we have... --> Can I ask, what do we get out of connecting it to DevOps? --> Because I get the value of a registry, --> but if we just don't connect it to DevOps, --> what would be the difference? --> So, the way this process works is --> NiFi communicates with Registry --> to store and version, you know, --> all of its data flows. --> But by default --> Registry is not backed by Git --> or a true version control system, --> and so you would plug your Registry --> into GitHub, for instance, --> and so when those flows are committed, --> you know, Registry will take those, --> you know, create the history file --> and then push, as a Git push, --> to your Git repo. --> So, that way, --> you know, you may decide that --> as part of your CI/CD process, --> you will take a flow and push that out. --> You know, --> say you had an Ansible --> playbook set up to deploy NiFi, --> and you need to feed it --> the flow that it's going to use, --> you can pull that --> from your Git repo, like GitHub, --> for instance. --> Does that... --> Okay, so you could do that --> without connecting the Registry --> to like a Git... --> Yes, yeah. --> Because we are going to save the versions --> into Registry --> and not Registry to GitHub.
--> So, that final step, --> you know, would be in your environment: --> you would have NiFi to Registry --> and then Registry to your Azure DevOps. --> Think of Registry --> as your translation layer --> to get your data flows --> into version control, --> as well as a UI --> to manage your data. --> Okay, because I know --> I was speaking with James, --> he had, from what you see, --> he has been the one --> who has been messing with NiFi. --> He's not in this training, unfortunately. --> He developed a flow --> on what was set up as a production instance, --> even though we don't have --> a real production instance, --> and then he exported it --> to like an XML or something --> and then had to convert it --> and then import it --> to another instance. --> So, this would be --> instead of that --> tedious process. --> Yeah, so, yeah. --> And I'll show you --> when we set up Registry, --> you know, the beauty is --> Registry will segregate --> everything into buckets. --> And so, you can go in --> with permissions, for instance, --> and say that, you know, --> user X only has access --> to these buckets, --> user Y only has access --> to these buckets, --> but, you know, --> if you have X and Y --> working on the same thing, --> user X can, you know, --> commit to Registry, --> and then user Y --> can check that out --> and pull the latest version --> and continue working --> on the flow as well. --> And then it's just, --> you know, --> they commit back and forth, --> and then Registry is going --> to take that flow --> and push it to --> a version control system --> like GitHub --> that will keep track of it, --> you know, --> you can branch --> and all those things. --> And then --> you can use that flow --> as part of a CI/CD process --> to push that flow out. --> You know, --> here's the version --> that you need to push out --> for dev or prod or whatever. --> And you can have that --> as part of your --> CI/CD process as well.
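The bucket, flow, and version layout described above is also exposed over Registry's REST API, which is one way a CI/CD job could pick up the latest version of a flow. The sketch below only builds the URLs and offers an optional fetch helper. It assumes an unsecured Registry on 127.0.0.1:18080 and the endpoint paths as they are understood from the NiFi Registry REST API documentation; the bucket and flow identifiers in the example are placeholders.

```python
import json
from urllib.request import urlopen

# Assumed base URL for an unsecured local Registry.
BASE = "http://127.0.0.1:18080/nifi-registry-api"

def buckets_url(base=BASE):
    """URL that lists all buckets."""
    return f"{base}/buckets"

def flows_url(bucket_id, base=BASE):
    """URL that lists the flows stored in one bucket."""
    return f"{base}/buckets/{bucket_id}/flows"

def latest_version_url(bucket_id, flow_id, base=BASE):
    """URL for the most recently committed version of one flow."""
    return f"{base}/buckets/{bucket_id}/flows/{flow_id}/versions/latest"

def get_json(url, timeout=5):
    """Fetch a URL and decode the JSON body (needs a running Registry)."""
    with urlopen(url, timeout=timeout) as resp:
        return json.load(resp)

# Placeholder identifiers, for illustration only.
print(latest_version_url("my-bucket-id", "my-flow-id"))

# With Registry running, something like this would list bucket names:
#   for bucket in get_json(buckets_url()):
#       print(bucket["name"])
```

Keeping the URL builders separate from the network call makes the layout easy to inspect without a server.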
--> Thank you. --> Yep, yep. --> Great questions, --> and this is --> the perfect time to ask them. --> Okay. --> So, --> we now have --> all of our files extracted. --> We've looked at --> our conf directory. --> There is a database directory, --> you know, --> it's an H2 that's --> keeping track of things. --> You know, --> there is docs; --> every NiFi component, --> except for like --> some of the MiNiFi stuff --> that we will go into later, --> you know, --> has docs built in --> just in case you're --> on a network --> that doesn't have access --> to the internet. --> Lib directory, --> you know, --> just like NiFi --> has that lib directory. --> There is an extensions directory --> if you want to build --> onto Registry. --> You know, --> just like NiFi, --> where you can build --> a processor --> and put it into --> the extensions --> and hot load it, --> there's some extensions --> you can build --> for Registry. --> I don't know --> if I've ever seen one --> built, --> to be honest, --> but the capability's there. --> And of course, --> your logs directory, --> because that way you can --> see what's going on --> and those types of things. --> So for this exercise, --> let's go into --> the bin directory. --> And we want to run --> NiFi Registry. --> And so that should bring up --> a new window --> that, --> you know, --> is our NiFi Registry. --> Let me check your screen. --> And then once it starts, --> if you can, --> open your browser. --> If you remember, --> you know, --> we didn't change our port, --> it was 18080. --> So if you can, --> when it's up and running, --> bring up a new tab --> and you want to --> go to 127.0.0.1 --> colon 18080. --> Sorry, Joshua, --> I'm stuck for a little bit. --> Can I make that last step? --> Oh, no worries. --> So in your bin folder --> in NiFi Registry, --> you've extracted --> NiFi Registry, --> you've extracted that zip file.
--> In the bin folder --> is the run-nifi-registry bat file. --> And so you want to run that --> and it's going to bring up --> a new command line window --> that's running. --> So you should have two, --> one for NiFi, --> one for NiFi Registry. --> And then once Registry starts, --> you know, --> you've got to give it a minute, --> you should be able to --> get to it --> in the browser. --> Ben, where are you at? --> Ben, you blew up. --> Oh, I'm kind of like, hold on. --> I downloaded 2.0 and that wasn't that 45. --> Can you say it? --> Yeah, the one, --> 2.5 should be, --> oh, you're re-downloading. --> Yeah, 45. --> So make sure you're downloading Registry, --> so scroll back up. --> So you're downloading, --> and click on Registry --> and you can use 126, --> binaries. --> I know, I know. --> If I could change it, --> you know, change this, --> trust me I would. --> Okay. --> So Ben's downloading --> the Registry now. --> It should also be in your folder. --> If you don't have it, --> you can go to the NiFi website --> and download it. --> The reason, --> you know, --> I was told this is the more technically advanced class. --> You know, --> if it wasn't the technically advanced class, --> I may have already had this installed --> and running for everyone. --> But, --> you know, --> because we're the advanced class, --> we're going to go through --> some of the technical aspects of this, --> setting it up and things like that. --> So yeah, --> if you don't already have it downloaded, --> you can download it yourself. --> Again, all of this is free and open source. --> And we just have it. --> What was the URL for that new network? --> Yeah. --> So if you want to download it yourself, --> it should be available on your --> NPEG or I'll pull your screen up. --> And go to nifi.apache.org. --> Click download --> and you click on Registry. --> Scroll back up.
--> There's three tabs, --> NiFi, MiNiFi, Registry. --> Well, four tabs, and FDS --> is the Flow Design System. --> Click on Registry. --> Go up, --> up top. --> There you go. --> Perfect. --> Scroll down a little bit. --> And you see the 125 registry binaries. --> Not the source, --> unless you want to build it. --> There you go. --> And it should have been available. --> I think you're good. --> You click the HTTP site. --> If it doesn't respond in a minute, --> click the backup site. --> But it should download those files. --> And what I was saying is --> the NiFi Registry --> should already be in your downloads folder. --> When I created the VMs, --> I created them with it all as a zip file. --> So you don't have to download it, --> but you may have picked all your zips up --> and destroyed them. --> I think we all messed up the zips. --> Ah, there we go. --> I had to re-download. --> No worries, no worries. --> Like I said, --> we use that zip flow file --> as an example of what bad can happen as well. --> So yeah, just download it. --> Once you've got that zip file, --> extract it and go into... --> We're not going to change anything in the properties, --> but then go into the bin folder --> and click Run Registry. --> And I'll give everyone... --> I get the connection refused, so... --> Right. --> So I think I have that. --> So then what happens after I run the Registry? --> Okay, so after you... --> So Pedro, it's running. --> I think so, yeah. --> Let me look at your screen. --> Yep, it's running. --> So now in your address bar, --> go to... --> Just type in 127.0.0.1 --> and you want to go to colon --> 18080 --> and hit Enter. --> Oh, after that, put nifi-registry. --> Do a forward slash, --> nifi-registry. --> nifi-registry. --> Oh. --> Okay. --> So it should look like that. --> You don't need to use HTTPS or anything else. --> Okay, perfect, Pedro. --> You're in. --> In my clean. --> Okay, anyone else having an issue? --> Yeah, I am. --> Sean.
--> Hey, Sean. --> Let's take a look. --> Mine's not detected. --> No worries. --> We'll see what we got. --> Got it running. --> Uh-huh. --> I think I'm doing something wrong here. --> Yeah, I think you have HTTPS. --> It's just HTTP because... --> Yeah, perfect. --> So take your HTTPS out. --> There you go. --> Hit Enter. --> And you're good. --> Okay. --> So NiFi out of the box --> comes with a self-signed certificate --> with a username and password. --> And, you know, the reason being --> is we had tons of people downloading it, --> putting it in AWS, --> leaving it wide open to the world. --> And I can create a data flow --> to download whatever kind of, --> you know, malicious activity I want --> and, you know, go from there. --> So Registry, because Registry's not actually, like, --> executing code or anything, --> they decided to just leave that open. --> But you do have the capability --> to configure security in the conf files. --> You know, you want to, you know, --> modify your authorizers, those types of things. --> But you're where you need to be. --> So how's everybody else doing?
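The "connection refused" seen during startup usually just means Registry has not finished starting, which is why the advice is to give it a minute. A small polling loop makes that wait explicit. This is a sketch under stated assumptions: the URL is the plain-HTTP address used in this exercise, and `http_ok` and `wait_for_registry` are hypothetical helper names. The probe and sleep functions are injectable so the loop can be exercised without a running server.

```python
import time
from urllib.error import URLError
from urllib.request import urlopen

REGISTRY_UI = "http://127.0.0.1:18080/nifi-registry"  # plain HTTP, not HTTPS

def http_ok(url):
    """Return True if the URL answers with a non-error status, False otherwise."""
    try:
        with urlopen(url, timeout=3) as resp:
            return 200 <= resp.status < 400
    except (URLError, OSError):
        # Covers connection refused while the server is still starting.
        return False

def wait_for_registry(url=REGISTRY_UI, attempts=30, delay=2.0,
                      probe=http_ok, sleep=time.sleep):
    """Poll until the probe succeeds or the attempts run out."""
    for _ in range(attempts):
        if probe(url):
            return True
        sleep(delay)
    return False

# Typical use once the Registry start script has been launched:
#   if wait_for_registry():
#       print("Registry is up:", REGISTRY_UI)
```

Using `https://` against this unsecured setup would fail in the same way as the browser did, so the scheme is worth checking first.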