
news/2024/7/24 12:10:46 Tags: python, big data, artificial intelligence, machine learning, data mining


Deploy Operators and DAGs to an AWS-hosted Apache Airflow and execute your Data Pipelines with DAG and Data Lineage Visualisation.

Wanna hear occasional rants about Tensorflow, Keras, DeepLearning4J, Python and Java?

Join me on Twitter at twitter.com/hudsonmendes!

Taking Machine Learning models to production is a battle. I share my learnings (and my sorrows) there, so we can learn together!

Data Pipeline for Data Science Series

This is a large tutorial, which we tried to keep conveniently small for the occasional reader. It is divided into the following parts:


  • Part 1: Problem/Solution Fit
  • Part 2: TMDb Data “Crawler”
  • Part 3: Infrastructure As Code
  • Part 4: Airflow & Data Pipelines
  • (soon available) Part 5: DAG, Film Review Sentiment Classifier Model
  • (soon available) Part 6: DAG, Data Warehouse Building
  • (soon available) Part 7: Scheduling and Landings

The Problem: Deploying DAGs with Airflow

This project has the following problem statement:


Data Analysts must be able to produce reports on demand, as well as run several roll-up and drill-down queries into what the Review Sentiment is for both IMDb films and IMDb actors/actresses, based on their TMDb Film Reviews. The Sentiment Classifier must be our own.

In Part 3 we set up the entire Airflow & Redshift infrastructure.


Now all we have to do is use it to run our DAGs with Airflow and deploy our Data Warehouse into AWS Redshift, so that we can run our analytics.

The full source code for this solution can be found at: https://github.com/hudsonmendes/nanodataeng-capstone


Deploying DAGs with Jupyter? No!

Our code to deploy DAGs lives in a Jupyter Notebook, but we made this choice only because of the explanatory nature of this project.

Very Important: your code to deploy DAGs must NOT be in a Jupyter Notebook. You should instead use a Continuous Integration system such as Jenkins.

Your Continuous Integration pipeline must cover the following:


  • Builds the code

  • Runs Lint + Static Code Analysis

  • Runs Unit & Integration Tests

  • Deploys DAGs into the server


Shared Components

The following components are used throughout the DAG setup process.


AWS EC2 KeyPair

In the previous article (Part 3) we created a very important file: the "udacity-data-eng-nano" key pair.

This file will now be used to SCP (copy) the DAG files into our Airflow Server.

SCP Upload Procedure

As the next step, we have a routine that copies files from the local machine into our Airflow DAGs folder.
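A routine of this kind can be sketched roughly as follows. This is a minimal sketch, not the project's actual code: `build_scp_command` and `upload_file` are hypothetical helpers, and the key file path, host name, `ec2-user` login and remote DAGs folder are assumed values used only for illustration.

```python
import shlex
import subprocess
from pathlib import Path
from typing import List


def build_scp_command(key_path: str, local_file: str, host: str,
                      remote_dir: str, user: str = "ec2-user") -> List[str]:
    """Build the scp argument list that copies one local file to the server."""
    file_name = Path(local_file).name
    remote_target = f"{user}@{host}:{remote_dir.rstrip('/')}/{file_name}"
    return ["scp", "-i", key_path, local_file, remote_target]


def upload_file(key_path: str, local_file: str, host: str, remote_dir: str) -> None:
    """Run scp and raise CalledProcessError if the copy fails."""
    subprocess.run(build_scp_command(key_path, local_file, host, remote_dir), check=True)


# Assumed values for illustration only; replace them with your own.
command = build_scp_command(
    key_path="path/to/keypair.pem",
    local_file="dags/my_dag.py",
    host="airflow.example.com",
    remote_dir="/home/ec2-user/airflow/dags",
)
print(shlex.join(command))
```

Building the command as a list (rather than one shell string) avoids quoting problems when a path contains spaces.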

Once files are dropped into that folder, they are automatically picked up by Airflow, as long as they contain no syntax or import errors!


Airflow Folder Structure

In order to run our Data Pipeline, we need a few folders created:


  • Data: where the raw data copied from HTTP sources is dropped

  • Model: where we place the TensorFlow SavedModel

  • Working: temporary folder for files being generated

  • Images: where we place the assets used in the generated PDF report
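One way to create these folders is a single `mkdir -p` issued over SSH. The sketch below only builds that command. The lowercase folder names, the base directory, the host name and the `ec2-user` login are assumptions for illustration, and `build_mkdir_command` is a hypothetical helper.

```python
import shlex
from typing import List

# Folder names mirror the list above; the exact names on disk are an assumption.
FOLDERS = ["data", "model", "working", "images"]


def build_mkdir_command(key_path: str, host: str, base_dir: str,
                        user: str = "ec2-user") -> List[str]:
    """Build an ssh command that creates every pipeline folder in one call."""
    paths = [f"{base_dir.rstrip('/')}/{name}" for name in FOLDERS]
    # mkdir -p is idempotent: it does not fail if a folder already exists.
    remote_command = "mkdir -p " + " ".join(shlex.quote(path) for path in paths)
    return ["ssh", "-i", key_path, f"{user}@{host}", remote_command]


# Assumed values for illustration only.
mkdir_command = build_mkdir_command(
    key_path="path/to/keypair.pem",
    host="airflow.example.com",
    base_dir="/home/ec2-user/airflow",
)
print(mkdir_command[-1])
```

The resulting list can be passed to `subprocess.run(mkdir_command, check=True)` to execute it.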


Airflow Configuration

Our DAGs use a number of Variables and Connections (check the Airflow documentation for a fuller explanation of both).


Here we set up both, so that our DAGs can run.
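As a rough illustration, Variables and Connections can be registered through the Airflow command line. The sketch below assumes the Airflow 2.x CLI syntax (`airflow variables set` and `airflow connections add`); older 1.10 releases use a different, flag-based syntax. The variable names, the connection id and the URI are hypothetical placeholders, not the values this project uses.

```python
from typing import Dict, List


def variable_commands(variables: Dict[str, str]) -> List[List[str]]:
    """One 'airflow variables set KEY VALUE' command per variable (Airflow 2.x)."""
    return [["airflow", "variables", "set", key, value]
            for key, value in variables.items()]


def connection_command(conn_id: str, conn_uri: str) -> List[str]:
    """An 'airflow connections add' command that registers a connection by URI."""
    return ["airflow", "connections", "add", conn_id, "--conn-uri", conn_uri]


# Hypothetical names and values, for illustration only.
commands = variable_commands({
    "data_dir": "/home/ec2-user/airflow/data",
    "working_dir": "/home/ec2-user/airflow/working",
})
commands.append(connection_command(
    "redshift_dw",
    "postgres://user:password@redshift.example.com:5439/dw",
))

for command in commands:
    print(" ".join(command))
```

These commands must run on the Airflow server itself (for example over SSH), since they write to Airflow's metadata database.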

Uploading Operators

DAGs are graphs whose vertices are operators, connected to one another by directed edges. In this step we upload our custom operators to the DAGs folder.
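To make the graph idea concrete, here is a small standalone sketch that does not use Airflow at all. It models three hypothetical tasks (the task names are invented for illustration) and uses the standard library's `graphlib`, available from Python 3.9, to derive an order in which every task runs after the tasks it depends on.

```python
from graphlib import TopologicalSorter

# Each key is a task (vertex); its value is the set of tasks it depends on (edges).
# The task names are hypothetical and only illustrate the structure.
dependencies = {
    "download_reviews": set(),
    "classify_sentiment": {"download_reviews"},
    "load_warehouse": {"classify_sentiment"},
}

# static_order() yields each task only after all of its dependencies.
# It raises graphlib.CycleError if the graph is not acyclic.
execution_order = list(TopologicalSorter(dependencies).static_order())
print(execution_order)
```

A scheduler such as Airflow does the same kind of ordering, which is why a cycle between operators makes a DAG invalid.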

Finally, Uploading DAGs

The DAGs (Directed Acyclic Graphs) are the main programs in our Data Pipeline.


Once these are uploaded, and provided they contain no syntax or import errors, Airflow picks them up automatically and presents them in the Airflow web interface.
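Because a broken file is not picked up, it can help to check each DAG file locally before uploading it. The following is a minimal sketch using only the standard library; `find_syntax_error` is a hypothetical helper. It catches syntax errors only, so a file that parses can still fail on the server if one of its imports is missing there.

```python
import ast
from typing import Optional


def find_syntax_error(source: str, filename: str = "<dag>") -> Optional[str]:
    """Return a short description of the first syntax error, or None if the code parses."""
    try:
        ast.parse(source, filename=filename)
    except SyntaxError as error:
        return f"{filename}:{error.lineno}: {error.msg}"
    return None


# Two tiny inline samples; in practice you would read each DAG file from disk.
valid_source = "x = 1\n"
broken_source = "def broken(:\n    pass\n"

print(find_syntax_error(valid_source, "valid_dag.py"))
print(find_syntax_error(broken_source, "broken_dag.py"))
```

A check like this fits naturally into the lint stage of the Continuous Integration pipeline described earlier.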

Launching DAGs in the Airflow Web Interface

Once Airflow has recognised the DAGs, it displays them in its web interface.

In order to launch them, you must first enable them and then click the Launch button.
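The same two steps can also be performed from the command line instead of the web interface. The sketch below assumes the Airflow 2.x CLI (`airflow dags unpause` and `airflow dags trigger`); the DAG id is a hypothetical placeholder and `launch_commands` is a helper invented for this example.

```python
from typing import List


def launch_commands(dag_id: str) -> List[List[str]]:
    """Commands that enable (unpause) a DAG and then start one run of it."""
    return [
        ["airflow", "dags", "unpause", dag_id],
        ["airflow", "dags", "trigger", dag_id],
    ]


# Hypothetical DAG id, for illustration only.
for command in launch_commands("example_pipeline"):
    print(" ".join(command))
```

As with the configuration commands, these need to run on the Airflow server.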


After doing that, you can follow the DAG run in the execution view and inspect the data lineage view.

In Summary

Using Python, we have:


  1. Used scp to copy the DAG Operators and the DAGs

  2. Used the Web Interface to Launch the Data Pipelines

  3. Checked the DAG Run


Now that our Data Pipeline has run, our Data Warehouse is waiting for us to run queries and achieve our objective of running drill-downs and roll-ups on our facts table.


Next Steps

In the next article, Part 5: DAG, Film Review Sentiment Classifier Model, we will look at what the Classification Model Trainer DAG does and how we leveraged Data Science to produce our Data Warehouse.


Find the end-to-end solution source code at https://github.com/hudsonmendes/nanodataeng-capstone.


Wanna keep in Touch? Twitter!

I’m Hudson Mendes (@hudsonmendes), coder, 36, husband, father, Principal Research Engineer, Data Science @ AIQUDO, Voice To Action.

I’ve been on the Software Engineering road for 19+ years, and occasionally publish rants about Tensorflow, Keras, DeepLearning4J, Python & Java.


Join me there, and I will keep you in the loop with my daily struggle to get ML Models to Production!


Translated from: https://medium.com/@hudsonmendes/data-pipeline-for-data-science-part-4-airflow-data-pipelines-e9a2a1db4fea



