Perfect.
Thomas, did you make it back yet, or did he get called out to another call or something?
Sorry. Were you looking for me?
Tom? Hey, Tom. Yeah, if you can, go ahead and start your desktop and get logged in. I think it should be good to go.
Sure.
Thank you. All right, looks like everyone is coming up.
Peter, just so you know, the Uploads folder has what I uploaded. I actually need to upload a newer presentation, because the PDF you see there has a couple of errors. So later today, after the class, I will give you an updated PDF. I will also email it out; it's there just for reference for the most part. I still need to email it out to the previous class, and I have all of your email addresses now, so I will handle that.
So let's get started. All right, it looks like everyone is logged in to the desktop; how smoothly that goes depends on the system you're using, proxies, and things like that. In the last training class, a couple of folks got proxied to death, and they had to have some fixes applied before things were working. But once you get logged in, you're going to have your virtual desktop. The latency can sometimes be an issue: you may point and click at something and it takes a second to respond. You just have to bear with it, but this way we have a common training environment to work off of.
So that being said, feel free to follow along with me. There's nothing specific you need to do right this minute except follow along. You can go through NiFi, install it, and stuff like that if you want. That's what I'm going to do, but we will also have time to work on that in this presentation.
All right. So just like I mentioned earlier, everything that is taught in this training class, except for the actual data flows and those types of things, is available online.
So for some quick resources, nifi.apache.org is the website to go to. The documentation is very, very extensive. As you can imagine, being a government system, it requires lots of documentation. You can download NiFi from here as well. So if you're at home and want to play around with NiFi, have at it: go to nifi.apache.org, click Download, and download the latest release.
NiFi 2.0 is out now. Well, it's not a full-fledged release version yet, but it's coming. You'll notice that even on the 16th of May, four days ago, they had an updated release of this pre-release. You're more than welcome to download that. For this training class, though, we're going to work off of NiFi 1.26, just because it's the current major version, not the next major version, and it's what most people have installed. I think you all are running 1.25. In some instances you may have 1.26, but you can download that.
You can also download the source. If you want the source files and want to compile NiFi yourself, you can have the source. It is an open-source application, so if you're a software engineer, you're able to go in and make changes. If you're running scans for vulnerabilities, have at it. There are a lot of capabilities. The administrator guide actually goes into how to build NiFi from source and those types of things.
But for this class, we are going to work off of the standard NiFi 1.26 binary. This is already pre-built. It's ready to go. We just need to install it, but it's not an install like you would think of with a normal Windows application.
Now, I chose Windows for this class. In the last class, there were a couple of folks who really liked Ubuntu Linux.
But for ease of use, everyone in this class is using Windows. NiFi can run on Windows. It can run on Linux. It can run on numerous types of devices and operating systems.
So again, when I downloaded NiFi for everyone for this class, I just went to this link, went to the 1.26 binaries, and clicked the HTTP link. There's a backup site as well; they're chosen based upon the CDN and stuff like that. It takes a second, and then it will start downloading.
NiFi itself is pretty big. The reason is that NiFi has all of its processors, and just everything going on here, bundled up into this zip file. So as you can imagine, with everything that comes with NiFi, it's over a gig to download. The zip file is pretty big, and when we extract it, it's going to get even larger. Some of the newer versions of NiFi should be a little bit smaller. I know that some of the processors we're going to talk about today may not be available in the newer version of NiFi, just because they are taking some of those processors out and offering them as an optional download. Just an FYI.
It actually is downloading. You can see that it's downloading another copy of it; it's 1.2 gig. I'm going to delete it because we already have it downloaded. I downloaded it for all of you in case there was a network connection problem. And honestly, with seven or eight people downloading it all at once, things might slow down a little bit.
So again, if you want, you can follow along with me. You can pull up the website. You can look at the documentation, those types of things. The main parts of the documentation that we are going to go over are part of the admin guide and the user guide. But like I said, this is extremely well documented.
For an open-source application, it's actually one of the best documented products out there. Not only do you have documentation about NiFi and what you need to do as an administrator or user, but you also have documentation about every single processor that they support. Now, I say "they support" because the NiFi product itself, as you're going to see, has 300-plus processors, but there are other processors out there as well. They just may not have the documentation that you'll see with the Apache-level product.
So like I said, we're going to go over some of the things in the sysadmin guide. We're also going to go through some of the NiFi user guide. Those are two links that are included in the presentation I like to send out after the class, just so you have them for reference.
We've already kind of talked about what NiFi is and some of these things. Here are some additional requirements and use cases and those types of things. But yeah, we've talked about a flow file, a processor, a connection, those types of capabilities. Again, we'll walk through this. I find hands-on learning works best, but if you have any questions, you can refer back to this documentation.
Another great thing about NiFi is that all of the documentation you're seeing right here is included with the NiFi product itself. So if you have a question about what the GetFile processor does, for instance, you can go into your NiFi instance, even if you do not have internet, and pull up the documentation. For this one, though, I like to just work off of the official documentation. But yeah, I can go right here, and as soon as the internet wants to respond, I will have the documentation on GetFile. There we go.
So for instance, the GetFile processor creates flow files from files in a directory.
It will ignore files it doesn't have at least read permission for. And then each processor has properties. Some are required, and some are optional, and then there are also some that we can add. There are relationships for the processor. There are attributes and other things like that we can take a look at. But for this part of it, just remember that the documentation is there.
So everyone has NiFi downloaded, and I'm going to kind of walk you through it. I was debating whether to include this as part of the class, but I think we can all accomplish this pretty easily. So what we're all going to do is install our NiFi and walk through what some of that means. If you don't understand something, or you have any questions, or you need some additional details, again, feel free to interrupt me. I've set aside the time between now and lunch just to get this up and running and go over what some of it means. Then when we come back from lunch, we'll actually start building a data flow.
My goal is that by the end of today you will be able to download NiFi yourself onto your own device if you need to, install it, get it up and running, and then build your own data flow. So by the end of the day, you're going to go from potentially never touching NiFi to having your own data flow running. Let's make that a goal, and I think that's a goal we can accomplish.
All right, so that being said, everybody should be in their desktop. If you can, bring up your File Explorer. It might take a second to load, like mine. Everything that I'm going over is either in the Downloads folder or the Uploads folder on your desktop. For this case, we're going to go to the Downloads folder. There we go.
And you're going to see a bunch of files that I've downloaded. Again, I've downloaded NiFi twice, so I'll delete that. This is an executable I use to install Notepad++, because we're going to need to be able to edit files. But NiFi is really easy to download. Once it's downloaded, you'll have a zip file. In this instance, we have nifi-1.26.0-bin. That tells me that this is not the source code; it's actually a binary that's ready to go.
So for this exercise, if you can, click on nifi-1.26.0-bin, just a single click, and then right-click and choose Extract All. Again, this is just a zip file, and we're extracting it. I'm going to leave it in the Downloads folder where it's at. You can change the location if you want to; it's totally your call. And I'll leave "Show extracted files when complete" checked, just because I want to go over some of that stuff. So I'll just click Extract.
It takes a minute. Again, this is a virtual machine, and all of us are running on it. I've given the machine eight gigs of RAM, eight virtual cores, and I think it was like 300 or 400 gig of space, but we won't even use that much.
While that is extracting... oh, it looks like everyone is good to go. Leroy, did you get yours extracted? Let's see. Oh, there you go. Give it another second. Like I said, it takes a minute to extract all of that zip file. That zip file is roughly a 1.2 gig file, and when it extracts, it's going to be even bigger. That's probably the biggest complaint I know of, not from the NiFi community but from the user community in general: just the massive size of the download. But if any of you have played Xbox, you know some of these games can be 50, 60 gigs now. So yeah.
All right. It looks like most everyone got that extracted.
Give me just another second for Alderius to finish up and for Peter's to finish up. All right. Let's go back to mine here. So it looks like it's finished. Perfect.
So it should open a new folder in Windows, and the only folder in that folder is nifi-1.26.0. If you can, double-click and go into that, and then you should see a bin folder, a conf folder, docs, extensions, and lib. These are not all of the folders that NiFi creates. This is just the initial downloaded install. When we get it up and running, it's going to create some additional folders for our content repository, provenance repository, and a couple of other repositories.
I was talking earlier about how it keeps track of all the changes to data: that data provenance, that lineage, that pedigree. It keeps track of that, and the system itself stores all of that information locally. So keep that in mind, those of you who are only on the infrastructure side of the house. And for those that will install this, some of the sysadmins, you all would need to know this as well.
But in general, these are the files and folders that NiFi gives you when you extract it. As soon as we start the application, it's going to create some additional files, and that all lives locally. Now, depending on your strategy for deploying and scaling this and some of the other things, you may want to have some of these content repositories and other repositories on different network-attached storage. I know that the content, flow file, and provenance repositories are usually stored on high-speed drives, just because there's a lot of reading and writing back and forth. And then you'll have some of the other repositories that really don't get used as much.
They're still needed. So people may break this up a little bit and put some of these folders on high-speed drives and some of the other folders on normal drives, for cost savings and performance gains. But that all depends on your deployment strategy. Whenever we have time, and we're going to have plenty of time for this, if you want to go into very technical details, I'll be happy to give you my opinion on that. I can get very technical. I still write software. I still write software for NiFi, even. But I kind of like the training part as well.
So anyway, when we extract it, we've got the bin folder, the conf folder, docs, extensions, and lib. Docs is docs. As I mentioned earlier, everything that you can find on the website, you're going to get in the docs folder as well, and NiFi uses that docs folder to provide you information.
The bin folder, that's your binaries, right? This is where you would execute the start of NiFi and those types of things. We'll go into more of that once we start. The bin folder for NiFi contains both Windows batch files and Linux shell scripts. So if you're running this on Linux, you have a way to start NiFi. If you're running this on Windows, you have a way to start NiFi. So that's how you would start NiFi, and there are also some binaries you can use if you need to change a username or password or something like that.
The conf directory, which we are going to go into, is where all the configuration for NiFi exists. So all your properties: what IP address is NiFi running on, what port number is NiFi running on, those types of things. There is a lot of configuration. A lot of this is the security.
Plugging in that security infrastructure and those types of things, you would do the configuration here. So what I am going to do, and it's totally up to you whether you want to follow, is go to nifi.properties. You should see nifi.properties in the conf folder. I am going to open that and go over some of the key points of the properties, just so those of you who are sysadmins and others will have this information. I know for some of you who may not be that technical, this may be a little overwhelming. Again, this is just for information. You're more than welcome to follow along, but there are some key points that I feel like everyone needs to see in the properties file.
So anyway, this is your core properties section. Again, a lot of this is documented, and a lot of it relates back to the website. What is the main flow configuration file, and where is it located? Of course, it's going to be in your conf directory. Where is the JSON file? It's also there. You have archive enabled, those types of things. So those are some of your core properties.
Another one is your authorizers configuration file. As was mentioned earlier, the other organization is trying to work on getting NiFi installed and up and running with multi-tenancy and multiple users. I think it was Brett who is working on some of that. So Brett would go in here and configure these properties. He would take a look at the authorizers.xml file and start building in the configuration he would need for security, user permissions, identity management, and all that fun stuff. That's where you would find that.
But there's one key property where you can just tell that it came from the government, and that is nifi.ui.banner.text.
Now, this property lives there for a couple of different reasons. In this banner, as you can imagine, you could put Unclassified. You could put Secret. You could put Top Secret. You could put CUI. You could use whatever classification header you need. What that does is give the government an easy way to put the classification of the system on a banner. So when you pull up this NiFi instance, you immediately see the classification of the system.
Also, because that is such a government property, the way that commercial companies, and others in the government as well, use it is to say: this is our dev instance of NiFi, this is our test instance, or this is prod. I know a lot of companies that use this banner as a description. So you can go to the UI and immediately see, "I am working on the test system" or "I'm working on the dev system."
So for me, I am actually going to put something in, and I'll say, "This is a test system." I can put in whatever. Okay. And again, you don't necessarily need to do this. If you're following along, feel free to put in whatever you would like. This is your own personal NiFi instance. Or you can just leave it blank.
Some of the other properties that you would potentially need to look at if you're a sysadmin are where the NAR library is, where all of the processors are. If you're familiar with software engineering, and if you're not, it's okay, in Java we usually create a Java JAR, a JAR file. We will then run java -jar with the name of the JAR file, give it memory configuration and stuff like that, and execute and run that application.
In NiFi, they're called NARs, N-A-Rs. So it didn't take a lot of imagination to see where we stole that from. NARs are basically Java JARs built specifically for NiFi. So there are properties like: where is that NAR library? Where is the autoload library?
One of the things that we are going to do during this class is import a new processor and have it up and running and usable without ever stopping our data flow and without ever restarting the system, right? The data is still flowing, it's still working, and I'm going to go in, deploy a new connection type, a new processor, build a flow for it, and have that flow up and running as well. I'll show you how we do that and how that works.
But just so you know, the lib directory is the library directory; that's where all the core NiFi files reside. We have the extensions directory that we talked about. It's empty right now. And then we have the lib directory, which should be pretty full, and there are the NARs. Even in the file names, you can see the NiFi Avro NAR, the NiFi AWS service, the Azure NARs, the NiFi Dropbox extensions, NiFi GeoHash, gRPC, and HL7, which is the medical format. All of these are processors that come out of the box. They all live in the lib directory, and they will be immediately available as soon as we start NiFi.
And then, like I said, there's an extensions directory. It's empty.
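To make the lib and extensions layout concrete, here is a minimal sketch that lists the NAR files in both folders. It only assumes the folder names mentioned above; the install path `nifi-1.26.0` and the helper name `list_nars` are placeholders for illustration, not anything NiFi ships.

```python
from pathlib import Path


def list_nars(nifi_home):
    """Return the .nar file names found under lib/ and extensions/."""
    home = Path(nifi_home)
    found = {}
    for sub in ("lib", "extensions"):
        directory = home / sub
        if directory.is_dir():
            found[sub] = sorted(p.name for p in directory.glob("*.nar"))
        else:
            # A missing folder is reported the same way as an empty one.
            found[sub] = []
    return found


if __name__ == "__main__":
    # Hypothetical location: point this at wherever you extracted the zip.
    for folder, nars in list_nars("nifi-1.26.0").items():
        print(folder, "contains", len(nars), "NAR files")
```

On a fresh extract you would expect a long list under lib and nothing under extensions, which matches what the folders look like in File Explorer.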
This is where a special processor would go. Say you had a CI/CD process set up where a developer creates a processor and checks it in, and the pipeline builds it, tests it for vulnerabilities, and goes through that whole CI/CD and DevSecOps set of policies that you have in place. Ultimately, it will spit out a NAR file, and that NAR file could be automatically installed into the extensions directory, and you would have immediate access to that processor. Not only would you have immediate access, but if the permissions and the policy were there, everyone would have access to that same processor.
That's part of the usability points I was making earlier, where you're able to reuse these components. So say I build a connector to, well, it's already built, but let's talk SQL Server. If I build a connector for SQL Server and test it out, and it's gone through the processes that you may have set up, it gets deployed to that extensions directory. Well, now everyone can use that connector. So as a different organization, I don't have to go and build a new connector. I can just reuse one that was already built, but I may be connecting to a different instance. I may be connecting with different usernames, passwords, and authentication methods in SQL Server. It may be the same SQL Server, just pulling from a different database or a different table, those types of things.
So that extensions directory is pretty important here, and that is how we hot-load processors. That means we do not need to stop data from flowing. We do not need to turn data flows off. We don't need to restart the application. It can run.
Data can continuously flow through the system, and now I have a newer capability so I can connect to new data sources. So that's the purpose of extensions and lib. Again, all of that is referenced in nifi.properties. You can change it. I've seen some folks change it to a different lib directory depending on their policies and things like that. As a sysadmin, this is the section to do that in.
And then, of course, you need state management. If you're working in a clustered environment, you would need ZooKeeper, which is another open-source software application. If you've been around any kind of distributed system or clustered system, you've heard of ZooKeeper. ZooKeeper is widely used across the board, in government and commercial alike. So here's where you would manage some of that state management.
We have a database directory, so that's our database repository. Again, we have multiple repositories here, and where you store them is configurable. Depending on whether it's a repository that needs a lot of reads and writes, you may store it on a different type of system. To save costs, I know some companies really streamline and fine-tune this, where some repositories will live on a very high-speed SSD, or even potentially in memory, mapped back to the file system. And then for the repositories that really don't have a lot of reads and writes, you don't worry as much about the performance aspect. They may go onto a slower drive, instead of you having to choose one or the other for everything.
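As a rough illustration of "where you store those is configurable", the sketch below parses properties-style text and picks out the entries that say where each repository lives. The key names are written from memory of a 1.x nifi.properties and should be checked against your own file; the `/mnt/fast` path and the helper names are made up for the example.

```python
def parse_properties(text):
    """Parse key=value lines, skipping blank lines and # comments."""
    props = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        props[key.strip()] = value.strip()
    return props


def repository_dirs(props):
    """Keep only the properties that point at a repository directory."""
    return {
        key: value
        for key, value in props.items()
        if "repository.directory" in key or key == "nifi.database.directory"
    }


# Sample text for illustration only; /mnt/fast is a hypothetical fast drive.
SAMPLE = """\
# repository locations
nifi.database.directory=./database_repository
nifi.flowfile.repository.directory=./flowfile_repository
nifi.content.repository.directory.default=/mnt/fast/content_repository
nifi.provenance.repository.directory.default=./provenance_repository
nifi.ui.banner.text=This is a test system
"""

if __name__ == "__main__":
    for key, path in repository_dirs(parse_properties(SAMPLE)).items():
        print(f"{key} -> {path}")
```

Here only the content repository has been moved to the faster location, which is the kind of split described above: busy repositories on fast storage, quieter ones left on the default drive.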
Because this is so highly configurable, you can do those types of things: reducing your cloud costs, your server resources, your on-prem resources. I know you guys do a lot of stuff on-prem, so it may help reduce some of those resources. And that's the database settings.
Then there's the flow file repository and the content repository we talked about. One of the things I like to point out here is that the content repository is basically keeping your flow file content. So if you told it to ingest a CSV, that content repository is keeping a copy of that CSV for the time being. That's because if NiFi were to crash and shut down, when you restart it, the processor that was processing that flow file is going to go back to the content repository and say, "Give me back that file, I need to finish processing it."
Say we've got three or four processors chained together: we have GetFile, and we send the file to the next step, and the next step. Well, if something crashes before that next step completes, when NiFi comes back up, it will reprocess that content, that file, with whatever processor was working on it, because of that content repository.
And just so you have a little bit of the under-the-hood workings of this: when a flow file, a piece of data, goes to a processor, the system will not release that flow file and that data until the next processor has it. What that does is guarantee that a copy of the data is on the next processor doing its function, and that the handoff between the previous processor and the next one got a thumbs up that it was complete.
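The handoff rule described here can be pictured with a small toy model. This is only a sketch of the general "do not release until the step finished" idea, not NiFi's actual implementation; the function names and the simulated crash are invented for illustration.

```python
def split_lines(content, crash_at=None):
    """Toy processor: emit one record per line, optionally 'crashing' partway."""
    for index, line in enumerate(content.splitlines()):
        if crash_at is not None and index == crash_at:
            raise RuntimeError("simulated crash")
        yield line


def run_with_restart(queue, crash_at_on_first_attempt=None):
    """Process queued content, keeping each item queued until its step completes."""
    delivered = []
    first_attempt = True
    while queue:
        content = queue[0]  # peek only: the stored copy is not released yet
        crash_at = crash_at_on_first_attempt if first_attempt else None
        first_attempt = False
        try:
            for record in split_lines(content, crash_at):
                delivered.append(record)
        except RuntimeError:
            # "Restart": the content is still queued, so it is processed again
            # from the beginning, including the part already delivered.
            continue
        queue.pop(0)  # released only after the step finished
    return delivered


if __name__ == "__main__":
    out = run_with_restart(["a\nb\nc\nd"], crash_at_on_first_attempt=1)
    print(out)                       # nothing is lost, but "a" shows up twice
    print(list(dict.fromkeys(out)))  # order-preserving de-duplication downstream
```

Nothing is lost in the toy run, but the record produced before the simulated crash is delivered twice, so a downstream de-duplication step is what restores the original four records.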
So that way, if something crashes, you don't lose data. Now, if it is in the middle of processing data and it crashes, it's going to try to reprocess that data. So keep in mind that you may get some initial results from NiFi while some additional processing still needs to happen. You may get duplication of data because it produced 25% of the output before it crashed, and when it comes back up, it's going to try to redo that. So you may get some duplicated data.
Now, with that being said, we do have ways of dealing with that as well. There's actually a dedupe processor. I don't know if it's in this latest version, but I do know it's out there, because duplicate data has been a pretty big issue in my experience in all the years I've been with the government. So, yeah, that is our content.
Then you have all the provenance events. They have their own repository. When we start NiFi, that new folder is going to be created, and that's where the provenance events will go.
You can specify how much to retain. Richard, I'm thinking about you here, where you may have an overarching data governance plan and strategy. So you want your NiFi to retain the last 14 days, and during those 14 days, you're offloading all that provenance information into a larger data governance system. Informatica has one. There are a couple of open-source ones: there's Knox and Tika, or not Tika, but Ranger. Apache Ranger, Apache Knox, and a few of those tools work well with NiFi. So you may have a corporate-wide or unit-wide governance policy, and this is where that would get configured.
Right now it's configured to keep all the provenance events for 30 days, with a max storage size of 10 gigs. So keep that in mind when you're building and designing your system. If you have a ton of data coming through and those data provenance events are being offloaded, you don't need to keep 30 days' worth of data. You may only need to keep it for a week or a day. I've seen this configured to keep it for only a couple of hours, because as those events happen, all of the data governance events are being offloaded to the corporate-wide data governance system.
So this is highly configurable. For you sysadmins out there, as you start working through getting it installed, pay attention to some of these properties, because this one, for instance, will take 30 days or 10 gigs to fill up, and you may want to adjust those settings.
Again, a lot of times you see applications with settings where you can just go to a menu, select the setting, and change it. We have some of that in NiFi, but this is part of the core settings. There is no UI for this. That was one of the things we went over in the last training class: you're going to have to go in and edit these files. You're going to have to put in different properties based upon your organization. I wish there was an easier way to do this, but I find that this way is not too bad.
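Since these settings are changed by editing the file rather than through a UI, here is a small sketch of that edit done programmatically on properties-style text. The two key names are written from memory of a 1.x nifi.properties, so confirm them in your own file; `set_property` is a hypothetical helper, and NiFi would still need a restart to pick up the change.

```python
def set_property(text, key, value):
    """Return the properties text with `key` set to `value`; other lines are untouched."""
    lines = text.splitlines()
    replaced = False
    for index, line in enumerate(lines):
        if line.strip().startswith(key + "="):
            lines[index] = f"{key}={value}"
            replaced = True
    if not replaced:
        # The key was not present, so add it at the end.
        lines.append(f"{key}={value}")
    return "\n".join(lines) + "\n"


# The 30 days / 10 GB values are the ones mentioned in the session.
SAMPLE = """\
nifi.provenance.repository.max.storage.time=30 days
nifi.provenance.repository.max.storage.size=10 GB
"""

if __name__ == "__main__":
    # Example: keep one week of provenance locally because events are offloaded elsewhere.
    updated = set_property(SAMPLE, "nifi.provenance.repository.max.storage.time", "7 days")
    print(updated, end="")
```

Working on a copy of the file first, and diffing it against the original, is a cheap way to confirm that only the intended line changed.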
I find that setting up security and those types of things is the more difficult part. The problem with that is, if you do run into an issue, you're asking the community, you're asking Google, or you're emailing me and saying, "Hey, Josh, how do I do this?" You'll get my contact information at the end of every day, I think it is. I will be happy to answer any quick questions after this training. Do remember that I'm delivering training, and the support after the class falls upon you. But you now have a contact who is an original contributor, who still contributes, still uses it, still builds it, and still designs and architects solutions around this. So I'll give you my contact information, and if you have a quick question, feel free to reach out after this training class.
But anyway, that's the properties file. Another quick property you might want to take a look at is the remote host. That comes in when we go into site-to-site, because two NiFi instances can talk to each other and send data from one to the other, those types of things. So that's a good property to take a look at.
We have our web properties. Right now, because of security, everything is going to run on your localhost and the backup secure port. Also, when we start NiFi, we are going to have to go into the log file and find our username and password. A couple of versions ago, they implemented a change where every time you download and install it, it requires a username and password to log in, even on your local machine.
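As a sketch of what "go into the log file and find our username and password" can look like, the snippet below searches log text for the generated credentials. It assumes the wording `Generated Username [...]` and `Generated Password [...]`, which is how I recall recent 1.x releases writing them to the application log; the sample lines and values are invented, and the function name is hypothetical.

```python
import re

# Assumed log wording; adjust the pattern if your log lines differ.
CRED_PATTERN = re.compile(r"Generated (Username|Password) \[([^\]]+)\]")


def find_generated_credentials(log_text):
    """Return {'username': ..., 'password': ...} for whatever the log contains."""
    creds = {}
    for kind, value in CRED_PATTERN.findall(log_text):
        creds[kind.lower()] = value  # if logged more than once, the last entry wins
    return creds


# Made-up sample lines, shaped like log output, for illustration only.
SAMPLE_LOG = """\
2024-05-20 10:15:01,123 INFO [main] example.Logger Starting up
2024-05-20 10:15:09,456 INFO [main] example.Logger Generated Username [11111111-2222-3333-4444-555555555555]
2024-05-20 10:15:09,457 INFO [main] example.Logger Generated Password [notARealPasswordJustAnExample123]
"""

if __name__ == "__main__":
    print(find_generated_credentials(SAMPLE_LOG))
```

In practice you would read the text from the application log under the logs folder that appears after the first start, or simply search that file for "Generated" in Notepad++.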
--> The reason being, and I've seen it a thousand times, --> as a matter of fact, if you do some good Googling, --> you can still find NIFI instances --> sitting on an EC2 instance publicly exposed --> and no username, no password. --> So you can actually go into that instance, --> create a data flow that, you know, --> picks up data from something and delivers it to yourself. --> And this whole data flow is residing --> in someone else's instance, --> and so you're not paying for that resource. --> You're not paying for the EC2 --> and the data and stuff like that. --> So, you know, there was a lot of people --> that was just downloading, installing, and running, --> and they were getting, you know, --> just hammered by malicious activity. --> So NIFI said, you know what, --> we are going to mandate a username and password --> on every install. --> So that way, like, you know, --> that way nobody can just randomly come in, --> once you see the username and password, --> it's actually very difficult. --> You can't guess it. --> So, you know, it's going to be very difficult --> to make sure that, you know, --> someone can just come in and run a flow. --> So we'll go through more of that, --> but, you know, just a little background, --> a little history, you know, of what's going on here. --> Again, we're going to be, you know, --> our NIFI instance is going to be on localhost, --> which is the IP address of 127.0.0.1. --> You can also use localhost in your domain name, --> but I like going just to the IP address. --> The port is going to be 8443. --> So if you are at home and download this --> and you're like, oh, I can go to this 127.0.0.1 IP --> and it will work, no, we specify the port. --> And so we use 8443 as the default. --> 443 port, as you may know, --> is the secure port for, you know, --> most websites that you see. 
--> So when you go to Google, you know, --> you can go to http://google.com --> and it will automatically redirect you --> to the secure version, --> the HTTPS of google.com. --> You know, same kind of principle here. --> Instead of using the typical 443 port, --> it uses the backup SSL port, --> which is usually 8443. --> So we will need to specify the port in our browser. --> If this were set to 443, --> we wouldn't even need to specify the port. --> We can just say HTTPS and send it. --> So, you know, keep that in mind. --> Underneath NiFi, underneath the hood, --> is the Jetty server. --> Jetty is another open source package. --> If you've ever heard of things like Apache Tomcat, --> JBoss, you know, these web server applications, --> you know, NiFi is a web app. --> And so you need a server to run that web app. --> So under the hood of NiFi, that server is Jetty. --> There's a lot of configuration you can use. --> I don't see a lot of people messing with that. --> You know, just because, you know, Jetty, it works great. --> It's very lightweight and the performance is there. --> Then you have some additional security. --> There's Apache Knox, already mentioned. --> Some SAML properties, additional properties for multi-tenancy, --> you know, identity mapping, those types of things. --> You know, you have clustering and where's your ZooKeeper. --> So, you know, as a sysadmin, --> you may need some of these properties, --> but for the sake of time and, you know, for this class, --> we don't really need to worry about any of those others. --> The main ones that we need to worry about --> are just where is this running? --> What's the IP address and port? --> I kind of like showing off the banner because, you know, --> it has that government, you know, --> even though it's an open source product, --> that property is still there, --> and it's there because of the government. --> The government is a contributor to this as well.
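To make the host-and-port point from the web properties concrete, here is a small Python sketch that builds the address you would type into the browser. The function name `nifi_ui_url` is hypothetical, and the `/nifi` path is an assumption about where the UI is served.

```python
def nifi_ui_url(host, port, path="/nifi"):
    """Build the HTTPS address for the UI from a host and port."""
    # 443 is the default HTTPS port, so a browser lets you leave it out.
    if port == 443:
        return f"https://{host}{path}"
    # Any other port, such as 8443, has to be spelled out.
    return f"https://{host}:{port}{path}"


print(nifi_ui_url("127.0.0.1", 8443))
print(nifi_ui_url("127.0.0.1", 443))
```

The first call is the case in this class: a non-default port, so it must appear in the address.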
--> So, you know, they keep tabs on NiFi. --> It's widely used, and so, you know, --> government employees and contractors, right, you know, --> they provide information back to the Apache Foundation. --> Hey, you know, either we need to build this --> or we built a patch and we want to include it. --> So, you know, just keep those things in mind. --> So, what we're going to do is close out of that. --> We'll just close out. --> Actually, we'll bring that up one more time, --> make sure I saved it. --> So, I'll put in my banner. --> This is a test system. --> Saved. --> All right. --> So, also, if you notice, --> there's no repository folders and stuff like that --> that we talked about. --> There's also no logs directory --> because all of these things are going to be created --> when we first execute NiFi. --> So, I went over the lib directory, --> the extensions directory, the docs and what they have, --> the conf directory. --> A lot of you will not even touch this probably, right? --> You're going to rely on your sysadmins --> and others to get this going. --> But I like to go over it because, you know, --> anyone can download the application. --> You know, I was teaching my nine-year-old how to do this. --> So, she can download it and start playing around with it --> and run her own configuration if needed. --> You know, so that information is there. --> Most likely, you know, you won't need it. --> But for you sysadmins on the training today, --> you know where to get things. --> Well, that being said, now I'm going to open the bin directory again. --> And so, running NiFi requires a few things. --> So, when I say we are installing NiFi, --> technically it's installed. --> When you extracted that zip file, --> you installed NiFi. --> You installed it into that directory --> that, you know, it was extracted into. --> You know, that's a positive --> because you don't need to actually install anything onto the operating system.
--> You know, you can run NiFi without an installer --> and without installing it in Windows. --> A lot of times, Windows has some restrictions on what gets installed --> and those types of things. --> So, NiFi is very portable. --> You can download it. You can run it. --> You know, your Windows at home or something may block, --> like localhost and those types of connections. --> Usually they don't, --> but there could be some additional security you would need to worry about. --> But for this instance, we should be good to go. --> So, what I like to do is, you know, that is installed. --> It's up and running. --> You know, those types of things. --> So, what I like to do then is actually run NiFi --> and that way I can go in and start looking at it. --> So, you don't necessarily need to follow along, --> but you're more than welcome to. --> So, what I like to do is just double-click run-nifi --> and, you know, Windows is going to make sure that I can run it --> and all those things can go from there. --> So, I'm going to say run it. --> It's going to bring up a command line prompt --> and it's going to start, you know, --> it's generating a self-signed certificate, --> you know, those types of things. --> You know, so give it just a minute. --> It's going to come back up and running. --> While I'm waiting on NiFi to come back, --> again, a lot of this is in the sysadmin guide. --> You know, how to install and start NiFi. --> You know, so here's the Windows version. --> Here's the Linux version. --> When NiFi starts up, the following files and directories are created. --> You know, we talked about these repositories, --> the logs directory. --> There's a work directory, but it's like basically here's the PID, --> which is the process ID. --> Not a lot of information in that one. --> And then in the conf directory, this flow.json.gz file is created --> because that's where the actual data flows that you've built get saved.
--> And so, you know, it makes it where that's quasi-portable as well. --> But that's how it reads what initial flows it needs to load, --> you know, upon startup. --> The flow.json.gz is empty for us because, you know, --> this is a brand new install. --> But, you know, once we start building some flows --> and those get automatically saved, --> you're going to see the size of that file increase. --> So again, all of that is here. --> You know, if you want to run it on Windows, --> just double-click and, you know, just start NiFi, right? --> There's also a capability to install NiFi as a service --> on both Windows and Linux. --> So when Linux starts, you know, you may have a startup where, --> you know, as the server starts, --> it automatically starts NiFi and it's up and running. --> You may want to do that with your Windows laptop --> or your Windows machine at home. --> Or, you know, if you have permission --> to install this at work, --> you're able to install this as a service --> and then that way, every time your laptop starts, --> it automatically starts NiFi as well. --> And so, you know, it requires admin rights on the box --> to do the service, you know, so kind of keep that in mind. --> But, you know, you do have that capability. --> But again, you can download the source code --> and build a custom distribution. --> I know a lot of people who do this that deal with the CI/CD process --> because NiFi is massive. --> You know, we installed it and we started it. --> We haven't even brought the UI up. --> We haven't even built a flow or anything else. --> And the download was 1.26 gig. --> And now, just extracted, it's 2.46 gig. --> So, you know, that's a pretty substantial size application. --> But if you look at something like MiNiFi that can go on an edge device, --> that's less than one meg. --> And so, you know, there's a lot of capabilities here, --> a lot of flexibility.
--> So I know a lot of people who will build their own distribution --> just so they can make sure they only include processors they need --> and not any of these additional processors --> that will either never be used --> or are additional assets that need to be managed. --> You know, you got to look at the, you know, --> is there a vulnerability, right? --> Remember the log4j vulnerability? --> I know you guys know about log4j --> because it was brought up multiple times in the last class. --> But, you know, NiFi, for instance, --> swapped to logback, which is another logging library. --> It's based off of log4j; --> the original contributor to log4j --> started, you know, another logging service that's more secure. --> And so, just FYI, --> NiFi uses logback instead of log4j. --> Now that's not saying someone can't create a processor --> that has some log4j components inside and utilize those. --> So, you know, just keep that in mind. --> But, you know, for security reasons --> and just distribution reasons, --> you may want to build your own from source --> and not include some of these processors, --> you know, and those types of things. --> But for us, we're going to run with what we have. --> Okay, so if you're following along here, --> you should, you know, get a message that, you know, --> your final message should be something like launched Apache NiFi, --> but could not determine the process ID. --> That's totally fine. --> It's just a warning. --> It can't determine the process ID. --> There's some additional configuration we need to do, --> but it's okay. --> It's there. --> So NiFi, again, it does take a few minutes --> on that first time of starting to actually be up and running. --> So even though it tells me that it launched NiFi, --> you know, I would give it a couple of minutes --> just because it is creating the content repository. --> It's creating those logs and everything else.
--> And then once it's up and running, --> and, you know, once you get that message back in Windows --> that, you know, it's running and can't find its PID, --> but that's okay. --> What I like to do now is go back and look, --> and you see we have the different repositories created. --> You know, initially, we only had like five folders. --> We've doubled that. --> We have our provenance repository folder now, --> our flow file, the database, the content. --> We have logs. --> We have work. --> We have run. --> We have state. --> You know, there's a lot of additional folders. --> But for this exercise, I'm going to go into the logs directory. --> And, you know, --> you've got five different logs here, --> but the primary log that you will be working with, --> you know, if you need to work with the logs, --> is the nifi-app.log. --> That's where most of the activity is recorded. --> You know, users logging in, data flows being added, --> processors being added, you know, --> data flowing through the system, right? --> Any warnings or errors. --> Also, you'll see when we're building a flow, --> I really like to use the LogMessage processor. --> So when I do that, it will send a log message, you know, --> to this log about a data flow, right? --> And so I like this log, you know, from a sysadmin perspective. --> If I put my sysadmin hat on, --> this is my favorite log to look at. --> So with that being said, I am actually going to open this. --> We'll go more in depth later, --> but of course it's going to tell me it's been changed --> because this is a live log. --> So I mentioned that NiFi, --> up until recently, you could download it, --> you could install it, --> and have it up and running in a few minutes, --> but everybody in the world could access it --> if it was on a public IP or something.
--> So what they did is they went through and said, --> okay, we are now going to secure every install. --> We're going to generate a username and password --> that is unique to every install. --> So to find that information, --> you actually have to go into the nifi-app.log file --> and look for username. --> And you're going to see in this log file --> a generated username and a generated password. --> That is going to be our username and password to log in. --> Yours is going to be different. --> This is a unique UUID that is generated. --> And so, you know, your username and your password --> are going to be different. --> I'm going through this right now, --> but, as an exercise, you know, --> I'm going to have you all, you know, --> basically do the same. --> What I like to do, --> because there's no way I can remember that much information, --> is I like to copy it, --> and I will actually put it in a new document, --> because that log file is going to go away. --> You know, as we process data, --> it rolls over to a new log file. --> You know, there's a lot of information in that log file, --> so I like to just pull out that username and password, --> that initial username and password, --> and have it readily available. --> So what I did is I just created a new text file. --> I copied and pasted the username and password, --> and then I'm going to just save it as text, --> and I'll just throw it in my downloads. --> And I'll just name it there in my downloads. --> Perfect. --> So now, you know, I've downloaded NiFi, --> I've extracted NiFi, --> I've double-clicked on run-nifi, --> it went through, it created everything it needed to --> get up and running, --> and then, you know, it's up and running now, --> so it's just waiting on me to log in.
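Copying the generated credentials out of the log by hand works, but the same step can be scripted. The Python sketch below pulls them out of log text with a regular expression. The `Generated Username [...]` / `Generated Password [...]` line format and the sample values are assumptions for illustration, so compare them with your own nifi-app.log; `extract_credentials` is a hypothetical helper name.

```python
import re

# Hypothetical log excerpt. The bracketed line format is an assumption,
# and the values are placeholders, not real credentials.
SAMPLE_LOG = """\
some earlier startup output
Generated Username [example-user-0000]
Generated Password [example-password-0000]
some later startup output
"""

CREDENTIAL_LINE = re.compile(r"Generated (Username|Password) \[([^\]]+)\]")


def extract_credentials(log_text):
    """Return {'username': ..., 'password': ...} for whatever is found."""
    found = {}
    for kind, value in CREDENTIAL_LINE.findall(log_text):
        found[kind.lower()] = value
    return found


creds = extract_credentials(SAMPLE_LOG)
print(creds)
```

On a real install you would read the text from the logs directory instead of a string, and save the result somewhere safer than a scratch file before the log rolls over.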
--> So what I like to do then is I'll bring up my browser, --> and, you know, --> if you remember, the IP address was 127.0.0.1, --> which is localhost, --> and we were on 8443, that port, --> so HTTPS, because it's secure, --> and colon 8443. --> Now, you'll learn that you need to add slash nifi, --> but to show you what happens, --> let's just go to this one. --> And like I said, the initial running of NiFi --> can take a few minutes, --> so if you are following along, --> and you're trying to do this, --> and you're getting page not found, --> then, you know, I don't know, --> but it also helps that I put in the right port, --> 8443. --> But again, you can put in the correct IP address, --> the correct port, and it still may not load. --> In the last class, I noticed, you know, --> it took even three or four minutes --> before it was fully up and running. --> Even though NiFi would report that it's running, --> it still took three or four minutes to initialize. --> Again, we're working in a high-latency --> virtual desktop environment, --> and so your own environment --> may be much better or different --> to allow that to run. --> So anyways, I'm at 127. --> It's going to come back and tell me --> my connection is not private. --> It's a self-signed certificate, right? --> All this was set up just to add --> that username, password, security layer. --> So what I like to do is I'll go advanced, --> and I'll go ahead and proceed. --> And then I didn't specify slash nifi, --> but it caught it. --> It's automatically going to redirect me, --> and now I will be at the login screen. --> So it's asking for a username and password. --> I have it right here, luckily. --> That's why I said, you know, --> copy and paste it when we get to that part, --> when we go through this more hands-on. --> Make sure you copy and paste it --> into something a little bit easier.
--> That log is going to go away, --> so tomorrow when we log in, --> if you did not copy and paste it somewhere, --> you're going to have to find that old log, --> and we're going to have to get it. --> Put in the username, the password, log in. --> Perfect. --> We are now at the application. --> So this is the NiFi application. --> It is web-based. --> You know, there's a lot of buttons --> and a lot of things, --> and we're going to go over every one of those. --> But again, it's a web-based application. --> You know, there's some server technologies --> under the hood that's running this, --> you know, Jetty and some other things. --> But, you know, it's all browser-based, --> mostly to work with the data flows. --> But again, there's no point-and-click, --> you know, properties manager, --> so you've got to, you know, hand-edit that. --> You know, with a lot of applications, --> you know, you're going to have to edit the properties. --> But once you get it up and running, --> you shouldn't need to go back to the logs directory --> or any of those other properties --> unless you, like, have a warning or an error --> that you need to look at in the logs directory. --> But if you're running this as a standalone, --> in your spare time on your laptop, --> you know, even at work, you know, --> you probably don't need to go back --> and take a look at those, --> but make sure you keep that username and password. --> So we're logged in. --> I can actually now start building my data flows. --> But what I'm going to do is actually go back --> in my presentation, --> where we talked about some of the core components of NiFi. --> So we talked about processors, connections, --> flow files, flow controller, --> all of these things that we talked about. --> And let's take a look at them. --> Let's, you know, see more about what they are in NiFi. --> So what I like to do is this is your canvas. --> This is a blank canvas.
--> So you don't have any processors running. --> You don't have, you know, any process groups. --> You don't have any data flows or anything else. --> You know, you don't have any of that. --> So, you know, it's a blank canvas. --> So this section up here, --> you can see the NiFi logo. --> You know, oh, I want to point out, --> there's my banner that this is a test system. --> So I can put in capital letters unclassified even, right? --> Or I can put dev or test. --> And that property, when NiFi has started, --> it's going to read that property --> and put that as the banner. --> So anyway, so the, you know, this is the main canvas. --> The NiFi UI has multiple tools to create and manage, --> you know, your first data flow. --> So what this is is the components toolbar. --> So if you see, you know, you should see processor. --> You see input port, output port, process group. --> If you just hover over them, remote process group, --> funnels, templates, and labels. --> So, you know, with the last group, --> I did not mention funnels on purpose. --> And the last group was able to actually work it in --> to their, you know, their data flow --> as it was pretty understandable. --> They just referenced the document. --> But anyways, this is your components bar. --> Now, right below your components bar is the status bar. --> So, you know, how many bytes are going in and out of the system, right? --> How many processors are started? --> How many are stopped? --> How many are disabled? --> You know, how many have a warning? --> You know, all of these things. --> Now, the canvas itself only updates automatically every five minutes. --> But at any time, and you'll hear me say this a few times --> when we're building a hands-on data flow, --> you can go ahead and refresh, you know, your canvas. --> So, when I say refresh your canvas, --> that doesn't mean, you know, go up here and refresh from the browser.
--> That's actually just anywhere on this canvas --> without clicking on any component, --> you can hit right-click and hit refresh, --> and it will automatically refresh the stats. --> But anyways, so that is your status bar. --> This is our operate palette. --> You know, and we'll go more into that. --> But that operate palette allows me to control that whole process group. --> And so, if I have a process group right here, I can start, stop, --> I can enable, I can disable, I can, you know, --> I can adjust the properties and those types of things --> right here on my operate palette. --> And so, you know, when we build our data flow, --> we are actually going to create a data flow. --> And then afterwards, we're going to put that data flow --> into a new process group --> to get ready for some additional hands-on data flows. --> And so, we'll go through how to do that. --> But once everything is up and running, your data flow is going, --> you know, you have that capability to just click on that process group --> and say, stop. --> You know, I want to stop the whole thing. --> So, you know, you do have that. --> And some of the other parts and pieces of the NiFi Canvas --> is the global menu. --> So, that's right here. --> So, you know, you have a summary of your data flows, --> how much data is coming in. --> You know, a lot, what you see on the status bar, --> but a lot more detail, --> as well as counters and a bulletin board --> in case of, you know, any kind of messages there. --> You have a new section, --> another section called data provenance. --> So, you know, that way, --> right now we have zero data provenance. --> So, if I click on it, it's going to show zero events --> just because we have yet to do anything. --> But later, we will actually go to the provenance. --> We will dive into that, --> and that's where we are going to be able to replay our data, --> look at the lineage, those types of things. --> You have controller settings. 
--> We'll go into controller settings, --> but, you know, I mentioned what a controller is already. --> You know, a controller is, you know, that reusable component. --> So, you may have a controller --> that provides a connection to SQL Server, --> and you really don't want to share the username and password --> to everybody that needs to connect to the SQL Server. --> So, what you're able to do is actually create --> a new controller service for SQL Server --> where your sysadmin plugs in the correct information --> they need to connect to that database --> and push data to that database. --> But, you know, you don't want to have your username and password --> running around to just anyone. --> So, a sysadmin can create a service --> that is a SQL Server connection service. --> And so, now, when I build my data flow --> and my colleague builds a data flow, --> you know, you may have a whole team, --> but everybody's having to write back to SQL Server. --> They don't need to worry about the connection details. --> Where is that SQL Server at? --> What port is running? --> You know, the IP address. --> They don't need to worry about username and password --> unless you set this up for them to specify --> that username and password they connect with. --> But you don't need to worry about that. --> There's a few things that once you set this up, --> you know, everyone, if they have the security permissions, --> can access that service. --> So, what they would do is just reference that service --> in their data flow when they get it built. --> We are going to build a CSV service to read CSV. --> We're going to build a JSON service --> to read and write JSON documents, --> as well as a controller service for our registry --> and a few other things. --> But, yeah, so that's data provenance. --> That's controller settings. --> You have parameter context. --> So, you know, you may put a parameter in. 
--> And depending, you know, in your data flow, --> you could say something like, you know, --> you have a dev parameter, a test parameter, --> a prod parameter. --> And, you know, you may have dev as an IP of 1.1.1.1. --> And test is 2.2.2. --> And prod is 3.3.3. --> So, you have that key value parameter --> that you can reference in your data flows. --> And so, this is a global parameter that can be used. --> And then that way, I can say, you know, --> connect to dev instead of having to put the IP address --> and things like that into the processor. --> You know, I need to say dev, --> and it will automatically know the IP address --> that's associated with that because of that parameter. --> So, that's where you would set your parameters. --> Flow configuration history gives you the history --> of your flow, like data flows. --> We'll go into that so you can see, like, --> when we make changes to things or add, you know, --> it keeps a history of that as well. --> Node status history is basically how many bytes --> are coming in and out of this node. --> How about if you have, like, site-to-site setup --> and some other clustering technologies set up for this, --> you may want to see what your node 1 is doing. --> You may want to look at node 12, you know, --> those types of things. --> So, this gives us our status history. --> Templates, I've went over templates a little bit already, --> but, you know, templates are there for, you know, --> re-usability, share things with your colleagues. --> You know, you can build a data flow, --> save it as a template, export that template out, --> and send it to a colleague. --> They can import it and, you know, run that same data flow. --> You have help. --> Like I mentioned, this documentation should look --> very, very similar to what's online. 
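To picture the dev / test / prod parameters described above, here is a small Python sketch of the idea: a processor property holds a reference by name, and the value is looked up when it is needed. The `#{name}` reference syntax follows NiFi's parameter notation; the addresses are placeholders, and `resolve` is a hypothetical helper rather than anything NiFi exposes.

```python
import re

# Placeholder key/value parameters, in the spirit of the dev/test/prod example.
PARAMETERS = {
    "dev": "1.1.1.1",
    "test": "2.2.2.2",
    "prod": "3.3.3.3",
}

PARAM_REF = re.compile(r"#\{([^}]+)\}")


def resolve(value, parameters):
    """Replace each #{name} reference with its value from the parameters."""
    return PARAM_REF.sub(lambda m: parameters[m.group(1)], value)


# A property that says "connect to dev" instead of hard-coding the address.
print(resolve("https://#{dev}:8443/ingest", PARAMETERS))
```

The benefit is the one described above: the flow refers to `dev` by name, and only the parameter needs to change when the address does.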
--> The reason that it is, you know, --> is because, you know, the documentation is shipped --> with NiFi, so a lot of times we have, --> you know, closed systems, you know, --> we have one-way transfers and things like that. --> You have systems --> that don't ever touch the internet, --> and they're on, you know, their own closed network. --> So, you know, you may not have access to the internet --> in your NiFi instance, and because of that, --> NiFi ships with all of the documentation you see online. --> So, as new releases of NiFi come out, --> the documentation has to be updated as well --> for it to be a proper release. --> And so, you know, when you go to help, --> you're going to be able to go through the documentation. --> You know, if you want to use a DeleteDynamoDB processor, --> right, and you need to understand, --> you know, the properties and things like that, --> you know, here it is, --> without ever having to go to the internet. --> And then, of course, you have About. --> About is easy. --> So, this is version 1.26.0 of NiFi. --> It was built on May 3rd. --> It was tagged as release candidate one. --> And, you know, the branch and everything else, --> you know, you can actually pull a lot of information. --> So, you know, again, if you go to GitHub, for instance, --> I wonder if it'll let me search. --> And here's NiFi. --> NiFi's GitHub repo, --> where all the NiFi source code is located, is here. --> And so, you know, this is the main branch, --> but, you know, you can go through --> and see all the different branches, --> release candidates, those types of things. --> Here is the source code to all of NiFi. --> So, not only can you download it from that link earlier, --> you can do a git clone if you are familiar with GitHub --> and Git and others, --> clone this and build it yourself as well. --> You know, so, you know, just keep that in mind. --> Again, it's very open. --> It's very well supported.
--> There's a lot of documentation for it --> and things like that. --> So, that's the help section. --> So, that is an overview of the canvas --> and all of the components on the canvas. --> And so, before I start diving into, you know, --> some of the finer workings of NiFi, --> I want to pause there. --> Are there any questions I can answer up until this point? --> Well, hopefully, I'm teaching so well --> that it's very clear and understandable. --> I always worry about my southern accent, --> you know, playing a part in this. --> So, but again, if you have a question, --> feel free to interrupt me, --> or if, like, I need to translate something --> or speak proper English, you know, --> just feel free to yell at me. --> I got one quick question. --> Yeah, go ahead, Tom. --> I'm understanding that when you run the command, --> I do see that, and I've run this on a container before, --> so I've seen during the execution of the, --> you'll see the password in there, right? --> But I'm sorry, I missed a part where, --> let's just say you don't see that here. --> How do you, is it written in the log --> or you can go and retrieve that username and password? --> No, that's a great question. --> So, when we start NiFi, --> it's automatically going to create this log --> called nifi-app.log, --> and that's where almost, --> that's where 99% of NiFi activity is written. --> And so, yes, on that first install, --> you're going to see, you know, --> generated username, generated password, --> and it's only going to be in the logs. --> The problem with, do I? --> No, I was just saying, yep, I see it there. --> Okay. --> You know, the problem with that is, --> we're going to start doing some hands-on exercises --> and some work here, --> and so that log is going to roll over, --> and so it's going to rename this old log, --> give it, I think, a date at the end of the log, --> and it's going to start a fresh one.
--> And so, if you do not capture --> your username and password pretty quickly, --> it's going to be in another log, --> and it could be in, you know, --> a log that was generated days ago, --> if you didn't, you know, set everything up, right? --> Or, you know, you may go in and put a data flow in --> and run it. --> It's generating all these log messages, --> and now your username and password --> is sitting in a five-day old log file. --> So that is where you initially --> get your username and password, --> but do know that it can go away, --> especially if we're doing a lot of operations --> very quickly. --> Great question. --> And then we are going to, you know, --> go through installing and getting it up and running --> and all that fun stuff. --> If you didn't follow along, --> you know, I like to kind of go ahead --> and show you what we're doing, --> and then that way we can all go hands-on. --> That's where we're going to get our username and password. --> You can find that log in the logs directory, --> and it's nifi-app. --> Let's see if I do it right. --> Yeah, I haven't generated enough data --> for it to roll over, but, you know, tomorrow, --> I bet there's going to be a nifi-app.log --> with a date on it, 5-21-2024, something like that. --> Okay, any other questions? --> All right. --> So what I'm going to do is kind of go --> more in depth through the components --> and, you know, go through some of those things. --> We will then take a break and go to lunch --> and come back from lunch --> and, you know, get everyone else up and running --> and, you know, get your own version of NiFi going --> so we can start building some data flows. --> You know, so that being said, --> you know, on the components toolbar, --> the first thing I have is processors. --> So I actually just click that --> and hold it and drag it down, --> and here are all of my processors. --> So, you know, with this version of NiFi, --> this install, I have 359 processors available.
--> So, you know, I have processors to handle Amazon, --> Azure, you know, AWS tags, JSON, CSV, --> you know, all kinds of things. --> So what you're seeing here is just like a word cloud, --> you know, from all the processors --> and those types of things. --> So then you also have, you know, --> a list of all your processors --> and, you know, the description. --> Just because it was asked the last time, --> you will see the shield, --> the little red and white shield beside the processor. --> It is specifically called out --> because you can now create a policy and security within NiFi --> that will allow you to lock down certain processors. --> You know, so for this one, --> this is a reference remote resources processor. --> So it falls within that reference remote resources group. --> And so because of that, --> you may set a policy that says, you know, --> my data engineers cannot, you know, --> see these processors in this group --> because, you know, they're not needed --> and, you know, for security reasons, you know, --> we're just not going to allow that. --> Or you may have it where, you know, --> I have database admins that need access to this group --> that contains the database connection details --> and those types of things to set it up. --> But another group doesn't have access to it, --> doesn't need it. --> They can just reference, you know, --> that processor from a controller service, right? --> You know, so that is a reason for that little shield. --> But anyway, so all of these are processors, --> 359 processors. --> And the one I like to really start with is GetFile. --> So, you know, as you can imagine, --> there's 359 processors. --> It's going to be hard, you know, --> I can scroll through this to see GetFile. --> Sometimes I'll skip over it. --> But you can use the filter. --> So I can type Get F, or I can just type Get. --> So I'm going to pull up GetFTP or GetFile. --> You know, that's a really nice way to narrow it down.
--> I can use the, you know, little tag cloud here. --> And I want to see all processors with get --> in the description, right? --> And there should be my GetFile right here. --> So, you know, that's how you would select the processor. --> So what I like to do though is, --> I already know the name of the processor, --> so I'm going to say GetFile. --> I see it. --> It's highlighted. --> I say add. --> Boom. --> New processor on my canvas. --> So this processor is just the GetFile processor. --> You know, it's got a single function. --> Its function is to pick files up --> and retrieve those from the file system. --> You know, it's not trying to extract things. --> It's not, you know, doing any kind of ETL. --> It's not a model or anything else. --> This processor is doing one function and one function only, --> and it does it very well. --> And that's the GetFile. --> Also, on a processor, you can see again that little shield --> that shows it belongs to a group. --> You know, you can imagine you may have a convert text processor, right? --> You know, from a security aspect, that's a very low risk, --> you know, just because you're converting data --> that you already pulled in --> and you're converting it to other formats and sending it out. --> But, you know, --> a convert text processor, for instance, --> it doesn't have the connection details. --> It can't get a file. --> It can't put a file. --> It can't connect to a database or anything else like that. --> So because this one can actually get data, --> you know, there is a security group for it. --> You may want to, you know, depending on your security policies, --> you may want to lock this down where, you know, --> folks can't do a GetFile or a PutFile. --> You know, they can build in the logic of the data flows --> and everything else, --> and they may get their data from another processor.
--> And then that way, you know, you don't run the risk of, --> you know, someone doing a GetFile. --> We actually had this happen in the last class --> with a couple of people --> where during the exercise we put down a GetFile. --> They specified the same directory as NiFi itself to get from. --> They told it to not keep the source file. --> And they also told it to ingest everything. --> And so what they did is they built a flow --> that self-destructed. --> And so what it did is, you know, they ran that flow. --> GetFile went and grabbed everything in the directory, --> including, you know, NiFi itself, passed that data on as flow files to the next processor, --> and then it crashed because, you know, it just couldn't work --> because it consumed itself. --> And so, you know, there are some security thoughts --> that go into this, you know, --> as you're planning this deployment out. --> But anyway, so that is our GetFile processor. --> You know, you can take a look at it. --> It's going to give you some real quick information. --> How many bytes came in? --> How many bytes read and written? --> How many bytes went out? --> And how many tasks and the time it took to execute those tasks? --> All of this, again, is for the last five minutes. --> But if you hit refresh on the canvas, --> so I click off of that processor and hit refresh, --> if data was flowing through, that would be updated. --> And so, you know, that's how you would get a quick refresh --> of what's going on with that processor. --> Now, every processor, you should be able to click on it. --> It will put a little black box around it to highlight it, --> and then you right-click on it and you have options. --> So the option that we will use most is probably configure, --> so we can actually configure the processor. --> You know, there is a disable, --> if you want to disable it. --> You can view the data provenance for this specific processor. --> You can replay the last event through the processor as well.
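Editor's sketch: the self-destructing flow described above comes from two settings combined, an input directory that overlaps the NiFi install and keep source file set to false. A pre-flight check along these lines could flag that combination before a flow is started. This is only an illustration under stated assumptions: the function name is_risky_input_dir and the example paths are hypothetical, and NiFi itself does not provide this check as far as this transcript shows.

```python
from pathlib import Path


def is_risky_input_dir(input_dir, nifi_home, keep_source_file):
    """Return True when a GetFile-style pickup could remove NiFi's own files.

    The risk only exists when source files are not kept, and when the input
    directory is the NiFi home, sits inside it, or contains it.
    """
    if keep_source_file:
        return False  # files are left in place, nothing is consumed
    src = Path(input_dir).resolve()
    home = Path(nifi_home).resolve()
    return src == home or home in src.parents or src in home.parents


# Hypothetical paths for illustration only.
print(is_risky_input_dir("/opt/nifi", "/opt/nifi", keep_source_file=False))
print(is_risky_input_dir("/data/weather", "/opt/nifi", keep_source_file=False))
```

The check treats a parent of the NiFi home as risky too, because a recursive pickup from the parent would reach the install directory.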
--> You can view the status, the usage, its connections. --> You can center it in view. --> You can change the color of that processor. --> So, you know, we're going to get into, you know, some of this. --> But just FYI, you know, in the hands-on exercise, --> one of the things I look for is some of these, like, you know, --> coloring, you know, labels, naming conventions. --> You know, some of these types of things that are very non-technical, --> but, you know, I look for those just because of usability, --> ease of use, and those types of things. --> So anyway, so that's my GetFile. --> I have my configure, disable, provenance. --> I can group them. --> I can create a template. --> I can select multiple processors and create a template. --> I can copy it and paste it. --> I can delete it. --> But for this scenario, I want to say configure. --> So this is how I configure that specific processor. --> You know, it has a name, GetFile. --> Now, you know, I don't like that GetFile name because, you know, --> it doesn't tell me a whole lot. --> If I had a data engineer looking at my flow, you know, --> I want them to be able to look at my flow, --> quickly understand what's going on and how this maps together, --> and that way they can accomplish the task that they need to do. --> So what I like to do is I go into my name, you know, --> during the configuration and I'll say Get file from system. --> So there we go. --> That is an easier, more human-readable description of what --> this processor is going to do. --> Also, you know, if there is a penalty or error or something --> else like that, it will penalize the flow file. --> And the penalty duration is basically how long you want that penalized. --> So right now everything is set to the default, 30 seconds. --> After 30 seconds, it's going to retry and reprocess that flow file. --> But, you know, 30 second penalty.
--> The bulletin level, you know, --> what kind of logging do we want from this processor? --> You know, --> the bulletin level is set to warn. --> But if you want to log everything, you may put it at debug. --> Most times you keep it at warn or error. --> And so what that means is if this processor has a warning or error, --> it is going to push that to the nifi-app.log. --> You know, in that area. --> So, you know, if you're building a data flow for your first time, --> you may put debug and that is going to log everything. --> You usually do not need that much detail, but, you know, --> it's there in case you need to set around one, --> but in about 15, 20 minutes. --> Okay. And then you have yield duration, --> just how long that is going to yield before it's scheduled to run again. --> You know, so one second is pretty standard. --> Again, you may change these settings when you start building your own data flows, --> you know, in the real world. --> But most of the time these properties all stay the same, --> except for the name. The next part of this is scheduling. --> There's a couple of scheduling strategies. --> There's timer driven and cron driven. --> So you can set this, you know, almost all processors default to timer driven. --> So it's going to run, you know, it's running constantly. --> So you can actually set a run schedule. --> You can say, hey, I don't want to run this processor every one second. --> I want to run this processor every 10 minutes or 10 hours. --> You know, so that scheduling strategy --> is going to dictate, you know, the running of this processor. --> You may have a cron where --> this processor runs only between 10 p.m. and 11 p.m. --> with a run schedule of every one minute. --> And so, you know, it's going to run 60 times during that hour. --> Then you have concurrent tasks, which is how many tasks run at once.
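Editor's sketch: the cron example above (run once a minute, but only between 10 p.m. and 11 p.m.) can be checked with a little arithmetic. The sketch assumes a Quartz-style expression such as "0 * 22 * * ?" (second 0 of every minute during hour 22); that expression is an illustration, not something shown in the session, and the helper names fires and count_runs_in_day are hypothetical.

```python
from datetime import datetime, timedelta


def fires(ts, hour=22):
    """Mimic the assumed expression '0 * 22 * * ?': second 0, any minute, hour 22."""
    return ts.second == 0 and ts.hour == hour


def count_runs_in_day(day_start):
    """Step through one day minute by minute and count the scheduled runs."""
    runs = 0
    ts = day_start
    for _ in range(24 * 60):
        if fires(ts):
            runs += 1
        ts += timedelta(minutes=1)
    return runs


print(count_runs_in_day(datetime(2024, 5, 20)))  # prints 60
```

Sixty runs in the one-hour window agrees with the figure given in the session.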
--> So this processor is doing a GetFile, and it's running one task to get files. --> Now, one of the things that I had the class do last time --> is actually use GetFile to pick up that 1.5, --> 1.2 gig zip file and decompress it. --> And so we had a few folks where the file got duplicated or they picked up everything. --> And so, you know, what happened was it kind of slowed the system down. --> It was taking a while to pick things up and send them off, --> you know, because it was processing large amounts of data. --> But if they wanted to make that quicker, you know, the run schedule is already running full speed. --> So if they wanted to make that quicker, they could have put the concurrent tasks at five. --> That gives it five concurrent tasks to execute this. --> Properties. So this is the big one. --> This is the configuration for the processor itself. --> And so, you know, if it's bolded, it is a required field. --> So, you know, for this processor, the GetFile processor, of course, it needs to know where to go to get that file. --> So it has an input directory. Right now, --> it's blank, and I need to feed it a value. --> So what I like to do is go right here. --> You know, and again, we're going to go down, you know, through this. --> Let's see what I can do is pull. --> We'll reuse this sample data I have for another scenario later on. --> So what I like to do is put this here. --> All right. --> So I have now a folder with data sitting in it. --> And so let me go up here. --> Here is my file path. Right. --> It's in C colon user student downloads weather data. --> So I actually just take that and copy it. --> Paste it in. --> Say OK. --> So that is the input directory for my GetFile processor to get input from. --> You know, we'll go into more detail when, you know, we're building some flows. --> But one of the things that, you know, people mess up all the time is it says keep source file and they put false.
--> I'm going to change that to true. --> I want to keep that source file exactly the way it is because, you know, I don't want to take a chance on picking that file up, sending it to another processor and doing these different operations. --> And then somehow I've messed something up. --> Well, now I don't have the source file because, you know, I told it no to keep source file. --> It went through a process and it's now corrupt or, you know, I didn't do something correctly. --> You know, one of those types of things. --> So, you know, from the beginning of creating a data flow, I like to keep the source file. --> Once I've tested this and I'm very sure that it's working and I don't have a requirement or a need to keep the source file, I'll turn that to false. --> But for this exercise, we're just actually going through the components of a processor and the menus and stuff like that. --> We're not really worried about data flow building right this minute. --> So anyway, there's other, you know, properties for this GetFile processor. --> I can tell it to filter on the file. --> So I can put a filter in to say only pick up CSVs. --> You know, the polling interval. --> Do I want to recurse subdirectories? --> You know, is there a minimum or maximum file size or age? --> You know, those types of things. --> You know, so again, I am now bringing in data. --> I've got a processor that's getting a file based upon what I tell it to do in the configuration, --> bringing that file up, and ready to send it to the next processor. --> And I did not have to write any code. --> I was pointing and clicking and filling in properties and calling it a day. --> So that is the properties. --> All processors have a relationship. --> So, you know, the relationship for a GetFile is usually success. --> If it can't get the file, like, you know, it doesn't have permission. --> It doesn't know.
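Editor's sketch: to make the file filter, recurse, and size properties concrete, the selection they describe can be imitated in a few lines. This is a rough model of the behavior described in the session, not NiFi's implementation. The function name select_files and its parameters are hypothetical stand-ins for the processor properties, and the regular expression for CSV files is an assumed example value.

```python
import re
from pathlib import Path


def select_files(input_dir, file_filter=r".*\.csv", recurse=True,
                 min_size=0, max_size=None):
    """Return the files a GetFile-like pickup would consider.

    file_filter is a regular expression matched against the file name only;
    min_size and max_size are in bytes.
    """
    pattern = re.compile(file_filter)
    root = Path(input_dir)
    candidates = root.rglob("*") if recurse else root.glob("*")
    picked = []
    for path in candidates:
        if not path.is_file():
            continue  # skip directories
        if not pattern.fullmatch(path.name):
            continue  # name does not match the filter
        size = path.stat().st_size
        if size < min_size or (max_size is not None and size > max_size):
            continue  # outside the allowed size range
        picked.append(path)
    return sorted(picked)

# Example call (hypothetical folder): select_files("weather data", recurse=False)
```

Nothing in this sketch deletes or moves files, which corresponds to running with keep source file set to true.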
--> It only is going to read files that it has access to. --> Now, you know, so it doesn't really have a failure path just because that processor just is pulling and pulling and pulling. --> Some of the other processors have different relationships. --> One is success. One is failure. --> You may have a relationship that sends the original document to another processor, --> and then it takes the extracted information from that document and puts it to another processor. --> So there's a lot of power there, a lot of capabilities that we'll go into. --> But that is the reason for relationships. --> And then there's a comment section, right? --> So, you know, think of this, you know, from a software engineering aspect, --> when you are committing your code back to a repository like GitHub or GitLab or something, --> you need to make a comment on your code. --> Same thing here. --> What is this processor doing? --> Like, what, you know, give me the, you know, who built this? --> You know, you may have a policy set up that, you know, you need to put in all this information. --> That's where some of this would go. --> So I'm going to just say this is a test processor and leave it at that. --> Apply, done. --> And so I have now, you know, created my first processor, dragged and dropped it down. --> I have, you know, started building my data flow. --> And I gave it a good name that I can understand. --> Get file from the system. --> I even put a color on it to distinguish it from maybe other GetFiles. --> You know, and I went in and configured it. --> Now I just need to start building my data flow. --> And so, you know, to do that, right, you know, once you get your file, --> you may bring down another processor that, you know, identifies my file type. --> And this is where that relationship comes into play. --> Because now I can take this, drag my arrow to my next processor, --> and for success send it there. --> Done.
--> And once that processor is configured properly and has the connections it needs, --> it goes from a yellow yield to a stop, you know, red square. --> You know, and it's ready to run. --> If you hover over the little yellow yield on your processor when you get to start building them, --> it's going to tell you why it's not ready. --> In this instance, the relationship success is invalid --> because the relationship is not connected to any other component. --> So that tells me, like, I'm going to have to put another processor to send this to --> once it identifies the MIME type, the file type, right? --> We will go into more of the flow building. --> This is mainly to point out the components and the user interface part of NiFi, --> you know, just so, you know, everyone understands. --> After lunch, you know, we'll do that. --> We're almost close to breaking for lunch. --> But there's a couple more things I wanted to go over before we did that. --> So that's the processor. --> You have input port and output port. --> So you may have, you know, data coming from a process group, a group of processors, --> and you are sending all of that to an output port. --> And then you may have another process group with a bunch of processors living underneath it --> with an input port. --> And so, like, you are pushing data out to an output port, --> and then you are able to receive that data from an input port. --> So that way, it helps you manage, you know, those data flows. --> And, you know, depending on some rules or something else, --> you may have different input and output ports. --> But that's the reason for an input and output port. --> So you got input and you got output. --> Next is a process group. --> And that's what we talked about. --> So a process group is, you know, what it says. --> It's a group of processors. --> So I am naming this process group get files. --> Something easy. --> And so get files. --> I can then put all my processors in this.
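Editor's sketch: going back to the yellow warning indicator described at the start of this passage, the rule behind it can be modeled in a few lines: a processor is not ready to run while any of its relationships has nowhere to go. This is a simplified teaching model under stated assumptions, not NiFi's code. The Processor class, its methods, and the wording of the message are hypothetical, and the model only covers connections between components.

```python
from dataclasses import dataclass, field


@dataclass
class Processor:
    name: str
    relationships: tuple                              # for example ("success",)
    connections: dict = field(default_factory=dict)   # relationship -> target name

    def connect(self, relationship, target):
        """Route one relationship to a downstream component."""
        if relationship not in self.relationships:
            raise ValueError(f"{self.name} has no relationship '{relationship}'")
        self.connections[relationship] = target

    def validation_errors(self):
        """List every relationship that is not connected yet."""
        return [
            f"Relationship '{rel}' is not connected to any component"
            for rel in self.relationships
            if rel not in self.connections
        ]

    def is_valid(self):
        return not self.validation_errors()


identify = Processor("IdentifyMimeType", ("success",))
print(identify.validation_errors())   # one unconnected relationship
identify.connect("success", "next processor")
print(identify.is_valid())            # True once success has a destination
```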
--> I can also put an output port in this process group. --> And I can have it go into another process group through an input port. --> But once you have a process group, you just double-click, --> and you can go in. --> One thing you notice is there's a breadcrumb trail. --> So there is, you know, the main canvas I have is NiFi Flow. --> But once I get into my get files group, I'm now, you know, just a level deeper. --> And the way NiFi handles security is, you know, for instance, --> this whole canvas and everything associated is the root level. --> The root level has a unique UUID assigned to it. --> So, you know, I don't think anyone here, I didn't hear anyone saying --> they're having to set up some of the security stuff. --> I do know Brett and Ben and some others are working on this. --> But, yeah, if you were to define a policy for NiFi, --> you would say this is the root canvas and only, you know, --> group A has access to that. --> And then, you know, group B has access to the get files process group --> where you can get all the files. --> And so, you know, that get files process group with, you know, --> its own set of processors, let me just add one. --> You know, you may have a policy that says that they can access that. --> Each one of these has a UUID. --> So the main, we're in the 7DC, we're going to get files, 389. --> So, you know, the UUID is going to change. --> But based upon your security, you can lock this down to process groups. --> You can also allow people, you know, everyone to have access --> to the main canvas, depending on what you all want to do. --> So that is a process group. --> A remote process group is exactly what it says. --> It's going to pull in a process group from a remote NiFi. --> So what you may do then is you may have a process group --> that lives on another NiFi instance. --> You can pull that in and run that remote process group. --> You have funnels.
--> Funnels kind of, you know, just like the name says, --> it just funnels some of the data together. --> It has some configurations and things like that that we can work off of. --> But, you know, that's the point. --> Template. --> So if I were to hold my shift key, I can highlight this whole thing. --> And then, actually right here from my operate palette, --> I can say create template. --> I can say get file. --> Give it a description. --> And create a template. --> Now I can actually take my template, I can drag it down, --> and now I have a get file template. --> And I can say add, and it's going to lay out that get file template. --> Also, I can download that template. --> I can go over here to my templates. --> It now should show up, and I can download it or delete it. --> And so in this case, I am deleting it. --> Okay. --> So that's templates. --> The last component that we will go into before lunch here is the label. --> So again, when I created this processor, I configured it. --> I went right-click. --> Let me get off of this. --> I went right-click. --> I went to configure. --> Come on, latency. --> And under that, I gave it a very, you know, human-readable, --> understandable name. --> But with a label, I drag this down, and I can now, you know, --> create a box. --> You know, think of this as almost like a PowerPoint kind of --> capability. --> And then I just double-click and type this. --> I'm going to call this a test group. --> And so what that label does is I'm able to open up NiFi. --> I can look at this, and because of the labels, you know, --> I may have a label that, you know, that is, let's see if I can --> paste that. --> There we go. --> I may have processors, all part of this process group, --> and I have processors. --> So the first label is picking up the file and identifying the --> type of file. --> I may have another label that, you know, --> is for the processors that are handling the file type.
--> So if it's a zip, it will unzip or something else. --> And then I may have another, like, set of processors that I've --> chained together that's doing another function. --> So what this allows me to do is, you know, kind of drag, --> like, you know, maybe, you know, imagine this IdentifyMimeType --> as its own little category. --> So I can actually put these, you know, labels on either a --> processor or a group of processors or even a process --> group, and that way I can quickly look at this and say, --> okay, you know, here's where they're doing the GetFile --> and what's happening. --> Here's where they're actually doing the ETL or, you know, --> those types of things. --> So that is the purpose of the label. --> That is a lot to ingest. --> But with that being said, I'm going to pause here and see --> what questions we have before we go to lunch. --> When we come back from lunch, we will make sure we have --> our NiFi installed and up and running. --> So if you did not do this while I went through it, --> it's perfectly okay. --> We're going to get it installed. --> We're going to get it up and running and start creating --> our own flow files within our own data flow. --> While we are creating that, we will also visit, --> you know, some menus and components that, you know, --> I haven't already gone over, because now we're at a --> point where, you know, let's just get it going and --> we've got a processor, our first processor deployed, --> and, you know, now we can learn some additional concepts --> as we build out this flow. --> But any questions before we go to lunch? --> Okay. --> I have a question. --> Yes, go ahead. --> It's actually another question. --> So I can't type on the remote desktop. --> So is there, did I press something that's prevented --> me from typing? --> Let's see. --> You shouldn't have any issues typing. --> Let me pull yours up right quick. --> Okay.
--> And then when you click on the, like, --> click on the browser itself or the toolbar, --> like click on this toolbar and start typing, --> can you now try to type? --> No. --> It will not let you. --> Huh. --> Do I just need to reconnect? --> I will probably maybe stop the machine. --> There we go. --> There's Maria. --> There we go. --> Let's see. --> All right. --> So while we're at lunch, if you can, --> just do a regular, can you still click to shut down? --> I can. --> I can click. --> My mouse works, but my keyboard doesn't. --> Okay. --> So that's really weird. --> Yeah. --> If you can, just shut it down and restart it. --> It may be a proxy configuration or something like that. --> We've run into a couple of those types of issues, --> but you should be able to type. --> But no, thank you for asking. --> Thank you. --> And then any other questions? --> All right. --> Well, if there is no other questions, --> let's take about a 45-minute lunch. --> What I will do is I'm going to continue sharing my screen, --> and let's just do an afternoon. --> Well, it is afternoon for me. --> And we will return back here at, --> it is 11.45 your time. --> We are going to return back here at 12.30 your time. --> 2.30 my time. --> 2.30 p.m. --> 2.30 p.m. CST. --> Okay. --> So if there's no further questions --> or anything I can help with, --> go have a great lunch when we get back. --> We are going to be very hands-on. --> We're going to start building some flows --> and things like that. --> And we're going to be doing a lot of that --> over the next few days. --> And so if there is a use case or a scenario --> you kind of want to play on, --> I'd be happy to integrate that in. --> I have a weather station scenario --> where we pull data from three different sources. --> Two of those have different data formats. --> We need to get it into the same format. --> We need to do some reporting on that data. --> So that's one of those scenarios I like working with. 
--> But, you know, with that being said, --> if you have anything specific --> that you would like for me to tailor this conversation with, --> I would be happy to help. --> And if there's no other questions, --> I will see everyone back in about 45 minutes. --> Thank you.
--> Well, hopefully everyone had a great lunch and work and all coming back.