Apache NiFi - GROUP 1 course recordings (2024-05-06 to 2024-05-08)
So for those of you that are on your desktop, if you can, bring up your desktop. You should see a blank canvas, a blank desktop like mine: you'll have Microsoft Edge, Docker Desktop, Uploads, those types of things. One of the things I want you to check is whether you can open your File Explorer. When you go to Downloads, there should be quite a few downloads: NiFi, MiNiFi, NiFi Registry, the NiFi Toolkit, those types of things. If you do not have those files, let me know, and I can go in and replicate the information to your desktop.

Cody, you have it. Aaron, nothing yet? Sean, Brett, you look good too. Amanda, you should have it as well; I remember replicating your desktop. Perfect. Pedro, are you able to see the files? We'll come back to Pedro. Alyssa, if you look in your Downloads (File Explorer, then go to Downloads) you should see a list of files. No worries. Again, I'm here on my ranch in central Texas, on Elon Musk's Starlink, so if he decides he wants my data today, he gets it. Okay, you look good. Pedro, Alyssa, Randy, yours is good. Yep, I see your screen; you look good to me. Perfect.

So the downloads all worked and we're good to go there. What we'll do is go through some more of the PowerPoint presentation, then take a quick break. I like to do a 10 to 15 minute break, depending on how many questions we get, those types of things. It's my understanding everyone's in Arizona.
So my lunch time is usually in about an hour, but on Arizona time we will try to go to lunch about 11:30 your time. I like to do 45 minutes for lunch, but if you need an hour we can do that as well. If there's not a lot of questions or a lot of interaction, we usually end a little early; there's some time built in for those interactions. So everybody's logged into their desktop and able to see the files we're going to work with.

When we get to this part, we're going to actually do an install of NiFi on the Windows operating system. If you were installing this on Linux it would actually be a little easier, but there is a ton of documentation. As you can imagine for a government product, the documentation for NiFi is very extensive. Everything I'm teaching today, everything I'm going over, is in the NiFi docs. You can follow along: if you go to nifi.apache.org you will see tons and tons of documentation. What is NiFi, the core concepts, the architecture, some of these things that we're going over. Even the architecture, for instance: you have the OS and host, which in our case is Windows, and your JVM. For those that are technical, it's a Jetty web server serving the UI. Then we have the flow controller and processors, plus the flowfile repository we talked about, the content repository, and the provenance repository. Those are local storage. Now, when it says local storage, that doesn't necessarily mean it's being stored to a local disk; I've seen local storage be a NAS or some other type of network-attached storage system, those types of things. So if you want to follow along with some of the documentation, it's all there. If you go to nifi.apache.org, you'll see everything. The documentation is very good.
We're working off of the NiFi version 1 documentation, just because version 2 only recently came out; we'll touch on some of that. What I like to go off of is the Admin Guide or the User Guide. Those are the two guides that I work with, and when we go into some of the processors we'll actually go in and talk about some of that. If you look at the Admin Guide: for an open-source product, the documentation for NiFi is amazing. Usually we don't see this type of documentation, but as you can imagine, being a government product that was released to the open-source world, a lot of documentation had to be written before that release. The documentation is also built into NiFi, even down to the processors. When you're developing a processor (for those that have developed a processor before) you have a place where you can include a description as well as other documentation. So we'll go through that, but I want to make sure that your desktops are running and that, in your browser, you can pull up the NiFi User Guide and Admin Guide and follow along as well.

Okay. A flow file is an abstraction that represents a single piece of information or a data object within a data flow. Take this case, and I'm using this example because I just implemented a huge prototype for it: you have messages coming in. You may have a single message, and when it comes into NiFi, NiFi treats it as a flow file. That flow file is that message; it can be in any format, arrive over any kind of protocol, those types of things. So when NiFi receives it, it generates a flow file, and that flow file consists of two major components: the metadata and the data payload. The metadata is the attributes, and we'll go into more of that, where we're able to take that data
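The two-part flow file model just described, attributes plus a payload, can be sketched outside NiFi as a plain data structure. This is an illustrative Python sketch, not NiFi's actual Java API; the class name, field names, and sample values are invented for the example (NiFi does assign a `uuid` and an entry date to every flow file it creates).

```python
from dataclasses import dataclass, field
import time
import uuid

@dataclass
class FlowFile:
    """Illustrative stand-in for a NiFi flow file: content plus attributes."""
    content: bytes                                   # opaque payload, any format
    attributes: dict = field(default_factory=dict)   # key/value metadata

    def __post_init__(self):
        # NiFi assigns core attributes as soon as it first touches the data
        self.attributes.setdefault("uuid", str(uuid.uuid4()))
        self.attributes.setdefault("entryDate", str(int(time.time())))

ff = FlowFile(b'{"temp": 21.5}', {"source": "sensor-A"})
print(ff.attributes["source"])   # sensor-A
print(len(ff.content))           # 14
```

The point of the sketch is that the payload stays opaque bytes while the attributes are cheap key/value metadata that processors can read and route on without ever touching the content.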
and make it into an attribute. A flow file has a lot of attributes. As soon as NiFi touches a piece of data, the flow file gets assigned attributes: a date-time group of when it was first seen, what the source was, those types of things. When we're using a processor that grabs data from an HTTP site, for instance, it will record what URL it grabbed the data from, the location, those types of things. All of that is metadata. Now, metadata is separate from the actual data payload, but we're able to work with that metadata, because you may want to route your data based on its source, for instance. When the data is coming in, if you see source X you may want to send it one way; if you see source Y you may want to send it another way, and all of that would be in the metadata as the data is received. We can also look at the content itself, but in a lot of cases we like to use those attributes, and we'll go into that when we're building. I'm very interactive; I like to do a lot of hands-on work. So we're going to start building some data flows, and as we go through those we'll basically repeat what we have here on the slide.

So: attributes are key/value pairs that store metadata about the data. They include basic information such as file name, size, and timestamp, plus any additional metadata added by processors. Again, say you're using GetHTTP, or GetFTP, which pulls a file from an FTP server: it will add metadata such as the server name, IP address, some of those types of things that it can capture. Content: the content of a flow file is the actual data carried by the file. Depending on the application it can be text.
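The source-X-versus-source-Y routing described above is what NiFi's RouteOnAttribute processor does, using Expression Language predicates such as `${source:equals('X')}`. As a rough sketch of the decision logic only, not NiFi code (the relationship names below are made up, except `unmatched`, which is RouteOnAttribute's default relationship):

```python
def route_on_attribute(attributes: dict) -> str:
    """Choose an outgoing relationship from metadata alone, never the payload."""
    source = attributes.get("source", "")
    if source == "X":
        return "to-archive"       # hypothetical relationship name
    if source == "Y":
        return "to-analytics"     # hypothetical relationship name
    return "unmatched"            # RouteOnAttribute's default relationship

print(route_on_attribute({"source": "X"}))    # to-archive
print(route_on_attribute({"source": "Y"}))    # to-analytics
print(route_on_attribute({"filename": "a"}))  # unmatched
```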
It can be binary, or any other format. We have used it before to detect heart murmurs and things like that from heartbeat data: we would actually bring in audio recordings of your heart, filter and sort those, and use some additional processors to extract that data and look for heart murmurs, those types of things. Like I said, I've seen almost every type of data go through NiFi. So yes, the content is what is processed or transformed by the processors. There are processors to handle attributes, but most of the processors work on the content; they actually work on that package of data.

Life cycle of a flow file: flow files are created by source processors that ingest data into NiFi. They are processed, and potentially split, merged, or transformed, as they move through the flow. Flow files are finally exported out of NiFi by a destination processor, or stored. So, as you can imagine, the life of a flow file: it's ingested into the system, it goes through different operations, and at the end it goes to its destination.
So the final step is to push that flow file out to its final destination, record that in the data governance, and then drop the flow file. Why flow files are important: understanding the structure and lifecycle of flow files is crucial because they are the backbone of the data flows in NiFi. One of the things I like to do is talk about the efficiency of a data flow. Efficient management of flow files ensures that data is processed reliably and efficiently, maintaining data integrity and traceability.

One of the things I will do at the end of this class is take back any questions that I can't answer immediately. So if I pause for a second when you're asking a question, it's so I can write it down. At the end of the class I like to send out this presentation as well as the Q&A portion: any questions asked, I can write down, get answered, and get incorporated into this presentation. The class is over on Wednesday, and most likely around Friday or next Monday I will send this presentation out to the wider audience, just so you'll have it for reference, some of that training material that we can leave behind.

I think I've covered some of the key concepts of NiFi in depth, but just in case: processors are the primary component within NiFi.
We'll talk about that a lot. There are different types of processors, tailored for different tasks. Just so you know (and because I'm still part of that community, I know what's coming up and some of the nuances there): when you download NiFi now, you still get, I think, around 300 processors out of the box. One of the biggest complaints is "I just don't really need all these processors" or "I need my own processor." The download is about one and a half gigabytes just for NiFi, and most of that space is actually the processors. So one thing to keep in mind is that as NiFi continues to release updates, they are going to bundle fewer processors, and you can go to different sources, like Maven repositories online and others, to pull them down. They will still be built and ready to go, and there will be some where you'll need to compile, build, and deploy the source code yourself. But for today, we have all the processors we will need within our downloaded Apache NiFi, and we'll go through those.

Custom processors we've already talked about too. In my experience, what comes out of the box will work about 95% of the time. I do run into cases where we will need a custom processor; I can think of a couple from this past implementation
I did, where we needed some specialized connectors for some of the tools, as well as log systems, things like Graylog and others out there. So being able to interface with different applications is usually when we build a new processor. There are also models that you can run in flight: image classification, image recognition models, things like that, as the data is coming through. Depending on the output of that model, you may filter, change direction, or send the data to a different flow. So there's a lot of capability in custom processors, those types of things.

Connections are links that route flow files between processors; we'll go into that and talk about back pressure. They not only transfer data but also control data-flow management, such as prioritization, back pressure, and load balancing. There are a few different policies within NiFi: you can do FIFO (first in, first out) prioritization, and you can do some very advanced routing with a rules engine, for instance; you can do all kinds of things. We'll go into back pressure and what it does, as well as some of the load balancing. And then, to finish this off, enhancing data flow with connections: connections can be configured with specific settings to manage how data moves through the system. You may have a use case where you need one packet of data to arrive at a processor before another packet arrives; you can set that up.
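Back pressure, mentioned above, means a connection stops accepting new flow files once its queue crosses a threshold; in NiFi each connection has both an object-count threshold and a data-size threshold. A toy sketch of the object-count case, with a made-up limit (this is not NiFi code, just the idea):

```python
from collections import deque

class Connection:
    """Toy FIFO connection with an object-count back-pressure threshold."""

    def __init__(self, max_objects):
        self.queue = deque()          # FIFO: first in, first out
        self.max_objects = max_objects

    def backpressure(self):
        # When True, NiFi would stop scheduling the upstream processor
        return len(self.queue) >= self.max_objects

    def offer(self, flowfile):
        if self.backpressure():
            return False              # upstream must wait and retry
        self.queue.append(flowfile)
        return True

conn = Connection(max_objects=2)      # made-up threshold for the demo
print(conn.offer("ff1"), conn.offer("ff2"), conn.offer("ff3"))  # True True False
```

In real NiFi the upstream processor simply stops being scheduled until the queue drains below the threshold, rather than receiving a refusal.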
You may have a data flow that you want to take priority in its processing, while other data flows are a lower priority; you can set those types of things up. There's a lot of capability here, a lot of customization; that's part of NiFi, and again, that's the power of it. When we get down to some of the design principles and how to do things, we'll see this even in this class, in some of the tasks where we will have to build a flow, and how they will differ.

All right, we're getting close to done with the presentation, then we'll go on break, and when we get back from break we'll work on getting NiFi up and running. But first, templates and version control. Templates in NiFi are predefined configurations of a data flow; they can be saved and reused. You see this quite a bit, although most organizations have moved away from templates and toward version control, as you can imagine, because you can integrate that into your CI/CD process. You can't work with templates the way you can with a flow backed up into Git, whether GitLab or GitHub, or into the NiFi Registry, which we will also go over. But you can create a template. I like creating templates sometimes because I don't have to worry about the GitLab or GitHub connection, those types of things. I can go to the canvas, build my flow, save it as a template, and send it to a colleague. My colleague can quickly import the template, that flow will be up and running on their canvas, and they can go from there.
So templates are pretty important, but lately it's more and more about version control. Templates encapsulate a set of processors, connections, and controller services for a specific task or workflow. Templates simplify the deployment of common patterns and promote best practices by allowing users to deploy tested flows quickly, and that's the key here: tested flows, quickly. If you develop a flow, you can save it as a template, export it as an XML file, and send it to your colleague, who should be able to quickly get that flow up and running and go from there. So that's templates.

NiFi does integrate with NiFi Registry, which we will go over, and which supports versioning of data flows. Version control is crucial for managing changes to data flows over time, allowing users to track modifications, revert to previous versions, and ensure that deployments across different environments are consistent. That's the main approach we will be working off of.
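A saved template really is just an XML file describing processors, connections, and controller services, which is why it can be mailed to a colleague and re-imported. The fragment below is a simplified stand-in, not NiFi's real template schema, just to show that the exported artifact is ordinary, inspectable XML:

```python
import xml.etree.ElementTree as ET

# Simplified stand-in for an exported NiFi template file (not the real schema)
template_xml = """
<template>
  <name>csv-to-s3</name>
  <snippet>
    <processors><name>GetFile</name></processors>
    <processors><name>PutS3Object</name></processors>
  </snippet>
</template>
"""

root = ET.fromstring(template_xml)
names = [n.text for n in root.find("snippet").iter("name")]
print(root.findtext("name"))  # csv-to-s3
print(names)                  # ['GetFile', 'PutS3Object']
```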
I know we have, let's see, a few folks that I've written down that would be interested in that, some sysadmins and those folks. So the main thing here is that we're going to touch on templates, and we'll probably save a template, but version control will be our main avenue for saving our flows, those types of things. And we'll go into using NiFi Registry for version control. NiFi Registry allows for storing, retrieving, and managing versioned flows. When we go to the NiFi desktop, after we get Registry up and running, you're going to be able to save your flows, check them in, check them out, those types of things. Then we will talk about how you can version-control those from Registry into your own GitHub or GitLab environment. If someone wants to let me know what your environment looks like, I can focus on that, but we can work with a lot of different version control systems.

Okay, let me check the chat. So what I like to do is pause here before we go for a quick break. What challenges do you anticipate in implementing or migrating to NiFi in your current workflow? I'd like to hear from the group on some of the challenges you may have. Like I said, that helps me tailor the conversation as well as what we will be training on. So what are some of your challenges in implementing or migrating to NiFi in your current process? And feel free, someone, to just start talking.

Considering we're not running it at all: fear of the unknown. We're deploying it, but the thing we haven't gotten working is multi-tenancy. It's still single-user mode, and it seems like whatever option we select, deploying it from a container,
it's single-user mode. So I'm wondering if, when deploying it as a container, single-user mode is your only option.

It is not, but we will touch on that. I can understand that pain point as well. Okay.

One of the biggest challenges for us could be the cybersecurity aspect that you touched on at the beginning. Even though we know it's been deployed to multiple locations and all that, we still have to go through the whole rigmarole for our actual environment. So that's going to be my challenge.

No, and all of those are good things, and I really like "fear of the unknown." When we go through this, I feel like you'll have less of that fear, just because once you see how easy it is to operate and to start off, I think it's pretty quick to get up and running. Then it can get a little overwhelming because of all the capabilities and the options you may have. But those are some things I will definitely touch on. Multi-tenancy is not necessarily in this class, but what I will do is take that back, and I'm going to work it in for tomorrow or Wednesday, to definitely go over some of that and what it would look like. We do have Docker Desktop on all of our VMs, so we can touch on that and see how it works. And we can definitely hit security aspects all day long.

Okay: how can the features of NiFi, such as data provenance, enhance your data governance practices?
And I ask that because I'm trying to get a better understanding of some of the data governance requirements you may have, some of the thoughts. There are big data governance packages out there; do you have those types of requirements? Those types of things help me tailor this to what you expect. Anybody want to speak on their data governance practices and how NiFi fits in?

Off the top of my head, no, but when you were talking earlier about what some of the telecoms and all of them were doing, one of the ideas I had in my head was getting event logs or whatever into NiFi. I know from the central log server SRG, they want to make sure that the data hasn't been modified and all that stuff. So I think if we went down a road like that, data governance could help in that aspect. But for the test community, that would have to be answered by Tyler or Randy.

Well, that's a good point. That chain of custody, right: you can see if the data was manipulated, with the security aspect behind it. And honestly, that's why telecoms are using this, because of some of those capabilities.

Yes, in the SRG they wanted message hashing, digests, all that fun stuff, and associating it to that particular message. And it's not
So --> This could us, you know --> Maybe this is something that we could look into at some point that could help us close those gap --> Oh, definitely --> Okay, anybody else with uh, some of their your data governance --> Perfect. Um, are there any specific processes in your operation that could immediately benefit from the nine five capabilities? --> And I know that that's kind of broad but you know where you know, I like to hear from the audience --> Where do you see nine five fitting in and how it can fit in and how can it just help you? --> You know do that data orchestration? --> I bet I think bam's taking a break --> Uh, we have a compliance database written and access that we use that pulls from --> A bunch of different sources typically using rest api. Um, and all that stuff's done manually --> Oh, wow, I feel like that that's really low-hanging fruit --> Wow --> Yeah --> It's been done manually by somebody on that's team for --> quite a while now --> and and that actually touches on some of the --> Initially kick this off where? --> You know we see scripts just a single python script running just to do something right and and --> You know, it seems kind of small and you're putting this this this big project in front of it --> But you know really understanding, you know those data sources getting those data sources into --> Did you say access like microsoft access? --> Next you're going to tell me you use excel. Um --> So, uh, you know, just yeah being able to read the data from uh access push that data in and keeping those --> you know keeping the --> um --> A record of all of that, you know is definitely needed. I think it will help you with you know --> Some of your compliance issues, uh, it'll help automate that. Um, it'll you know --> There's a lot of rules and a lot of triggers and things like that you can build in --> um, you know and so --> Perfect. 
Okay. What other immediate benefits do you hope to get from NiFi?

For us, the data processing branch, we're working on a more long-term process, so it's not really immediate, but it's our only use case right now. It's essentially a real-time data stream and pipeline from one of our test sites, to do some verification on data: running it through some machine learning models to identify things like bad sensor data, and doing some data verification while it's coming through the pipeline, so we can sort of facilitate an automated QA on the data.

Oh, very nice. Okay. I've also heard, and this may be related to the real-time pipeline, that you're trying to get data from a TAK, or get data to a TAK, and smartly filter it, those types of things. Some of the earlier questions were: how do you get MiNiFi to pull that data in and send it to your TAK, where the TAK can filter it, and what that kind of architecture looks like. I took note of that previously, but I think that's still valid. Yes or no?

Yes. Right now our plan for that is to use MiNiFi for the ingestion on the instrumentation side.

Oh, beautiful. Where you plan to use MiNiFi, what does that look like? Is it an edge device running Linux? A Windows laptop? If you can, go into the details of what that looks like.

There could also be future use cases on some more restricted instrumentation, possibly a microcontroller.

One of those things, okay. All right. And then: how might you use NiFi's scalability and flexibility to improve data handling and processing in future projects?
So, I ask this question because, I think it was Sean... no, Sean's a dev. Amanda and Aaron, the SAs, are looking at deploying this in a multi-tenant, scalable fashion. That's why I'm asking how you plan to use NiFi for this. That will help me tailor the conversation when we go into some of the scalability, some of the flexibility and such.

Right, you want to take that one?

You know, I don't think we really have a use case for the scalability, but I can say that we're designing it this way to account for it: there are a lot of data analysts at YPG, and we're expecting a lot of people, when they see the platform, to want to use it. So we're trying to design it up front to be scalable. But I would say our immediate use case doesn't really need that. Okay.

Okay. Tyler, you have the scalability issue?

We definitely will have a need for scalability in the future. I don't know exactly what that looks like yet, but there are several different locations that will be creating a lot of data throughout the day. At least this initial project is just going to be, you know, one site, and then it's going to expand out to multiple sites for a single mission area, and then it might move out to more sites. So scalability initially isn't going to be extremely important.
But as it goes on, there are probably going to be quite a few workflows, so I can imagine it being significantly more important in the future. Okay.

So one of the use cases that I think we might have in the future: we have a data lake that's going to be in the cloud, but there's a lot of talk from our chief data officer about having an on-prem data lake as well.

Oh, very nice.

Getting that data to both places, the test data; and if that were the case, we produce a lot of data, so I think we would definitely need that.

And what storage, what database and storage solution, are you looking at for your on-prem? Yep. Okay. The beauty of NiFi is that it does have an S3 processor, and it has an Azure Blob Storage processor, those types of things. Since you're using MinIO, correct? So there are processors for both of them, perfect. The MinIO one doesn't come out of the box yet, I don't think, but it is available as a processor on GitHub. So, perfect. I like that; I've actually seen that quite a bit lately with MinIO. It's folks coming out of the cloud and still keeping things local for security reasons, compliance reasons, and just overall process and network activity. Yep. Exactly, exactly. And I'll make sure to touch on some of those things as we go through and start building flow files and those types of stuff. We could potentially, on the third day, even do a flow where we pick data up and put it into MinIO. Okay.

All right, that being said, let's take our first break.
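On the MinIO point above: NiFi's AWS processors can generally be pointed at any S3-compatible store by overriding the endpoint URL, so a dedicated MinIO processor is often unnecessary. Property names vary by NiFi version, so treat the names below as approximations to check against your processor's documentation; the endpoint and bucket values are placeholders.

```python
# Approximate PutS3Object-style settings for an S3-compatible store such as
# MinIO; verify the exact property names against your NiFi version's docs.
put_s3_properties = {
    "Bucket": "test-data",                               # placeholder bucket
    "Object Key": "${filename}",                         # Expression Language
    "Endpoint Override URL": "http://minio.local:9000",  # placeholder endpoint
    "Region": "us-east-1",                               # often ignored by MinIO
}

# The Expression Language value is resolved per flow file at runtime,
# so each incoming flow file lands under its own object key.
print(put_s3_properties["Endpoint Override URL"])
```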
I need to get water since I'm talking a lot; I want to make sure I keep my voice throughout the day. Let's take a 15-minute rest and bio break. Get some water, and we'll meet back here at 11:50, that's 11:50 my time, which I think is 9:50 your time. Then we'll go through installing NiFi on Windows and start working on building our first flow. So we'll see everybody back here in about 15 minutes, and if you need anything, just put it in the chat; I'll be running back and forth getting water and hitting the restroom. All right, see you soon.

And then we are going to get started on installing NiFi. I don't know if you're all back or not. Being a former soldier, having been within the Army itself so many years, I completely understand the nuances. We'll give it a couple more minutes while we wait. Usually during software training you just kind of run through the software, but I felt it was pretty critical for us to actually do an install within Windows, just so everyone has that experience. If you're going to be working within NiFi, even in a local environment, who knows, you may want to spin up your own instance on your own laptop: get it working, get your flow built, test some things out, save it as a template, and then export that to your dev environment, your test environment, those types of things.

When we're installing NiFi, and I'll go over this in detail, there are some key things to look at, because there are some specific directories being created, and there's a reasoning behind that. There are some specific directories that you will need to understand and learn about as well.
So that's one of the reasons I like to really go in depth. I'm taking a risk here because I don't have it installed; we're going to all do it together. I do have Java on everyone's machine, so we'll go through some of the basics, but we'll give it just another minute and then we'll get started. If you're back, can you just let me know how long you all think you need for lunch? Like I said, around 45 minutes is kind of what I like to go off of, but I can do an hour as well, no problem.

Right. Yeah. Actually, we usually just do 30 minutes; 45 minutes is fine. Whatever people need.

Okay, we'll do 45, and that will give you the capability to eat and then also play around with whatever we've already built and done, because you're going to have this desktop environment throughout the training, and you do have the capability to download any information that you have there. I'll upload the presentation as well, so you can have it on the desktop environment. So there's a lot of capability there, but we will go ahead and get started. Let me exit all this.

Okay, so if everyone can go ahead and start working off your desktop; I'm sharing my screen. Let's go ahead and get logged into the desktop environment. Let me see if I can pull everyone up. Looks like everyone is good to go.
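For reference ahead of the install: the specific directories the speaker mentions are configured in `conf/nifi.properties`. The entries below are the NiFi 1.x defaults, relative to the install directory; on Windows you start NiFi with `bin\run-nifi.bat`, and on Linux with `bin/nifi.sh start`.

```properties
# conf/nifi.properties, NiFi 1.x defaults for the three repositories
nifi.flowfile.repository.directory=./flowfile_repository
nifi.content.repository.directory.default=./content_repository
nifi.provenance.repository.directory.default=./provenance_repository

# UI endpoint; recent 1.x releases default to HTTPS on port 8443
nifi.web.https.port=8443
```

Pointing these repository paths at a NAS or other network-attached mount is how the "local storage" described earlier can end up not being a local disk at all.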