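Artificial intelligence can now drive so-called feature engineering, allowing users to automatically discover and create the features used in data science work. This opens up a new approach to data science, one that appears to threaten the role of the data scientist.
AutoML has grown rapidly over the past few years. With a recession now looking unavoidable, the idea of automating artificial intelligence (AI) and machine learning development is bound to become more attractive. The new platforms now coming to market (https://dotdata.com) offer ever more automation.
So should data scientists be concerned about these developments? What is the data scientist's role in an automated process? And how will organizations evolve given this newfound automation?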
![The traditional data science process](https://specials-images.forbesimg.com/imageserve/5e7e6c76f40ef500079f8933/960x0.jpg?fit=scale)
The traditional data science process (Image: dotData, Inc.)
**AutoML 2.0, More Automation for Data Science**
First-generation AutoML platforms focused on automating the machine learning part of the data science process. In a traditional data science workflow, however, the longest and most challenging part is the highly manual step known as feature engineering. Feature engineering involves connecting data sources and building a flat "feature table" with a rich, diverse set of "features" that is then evaluated against multiple machine learning algorithms.
The challenge of feature engineering is that it requires a high level of domain expertise to "ideate" new features, and it is highly iterative, since features are repeatedly evaluated and then rejected or chosen. New platforms have recently emerged that provide additional capabilities and automation aimed at this challenge. Platforms with "Automated Feature Engineering" capabilities can now create feature tables automatically from relational data sources as well as flat files. This ability to "auto-generate" features in the data science process is a game-changing capability.
Suddenly, "citizen" data scientists, meaning Business Intelligence (BI) analysts, data engineers, and other technically savvy members of the organization with deep domain knowledge, can become valuable contributors to an organization's development of ML and AI models. Through Automated Feature Engineering, BI teams can develop sophisticated predictive analytics algorithms in days, significantly accelerating their productivity with minimal help from data scientists.
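To make the "flat feature table" idea concrete, here is a minimal sketch of the manual step that these platforms aim to automate. The tables, column names, and the `build_feature_table` helper are all hypothetical and chosen only for illustration; they are not taken from any particular AutoML product.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical relational data: one row per customer, many rows per transaction.
customers = [
    {"customer_id": 1, "region": "east"},
    {"customer_id": 2, "region": "west"},
]
transactions = [
    {"customer_id": 1, "amount": 20.0},
    {"customer_id": 1, "amount": 40.0},
    {"customer_id": 2, "amount": 15.0},
]


def build_feature_table(customers, transactions):
    """Aggregate the many-side table and join it onto the one-side table."""
    # Group transaction amounts by the key shared with the customers table.
    amounts = defaultdict(list)
    for row in transactions:
        amounts[row["customer_id"]].append(row["amount"])

    # Emit exactly one flat row per customer, with hand-picked aggregates.
    table = []
    for customer in customers:
        spent = amounts.get(customer["customer_id"], [])
        table.append({
            "customer_id": customer["customer_id"],
            "region": customer["region"],
            "txn_count": len(spent),
            "txn_total": sum(spent),
            "txn_mean": mean(spent) if spent else 0.0,
        })
    return table


feature_table = build_feature_table(customers, transactions)
for row in feature_table:
    print(row)
```

Every aggregate here (`txn_count`, `txn_total`, `txn_mean`) was chosen by hand. Deciding which aggregates to compute, over which columns and tables, is the iterative, expertise-heavy part that the article describes.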
**Automating Data Science: Democratization**
One of the chief benefits of AutoML 2.0 platforms is true data science democratization.
When data science automation can accelerate and automate the process of discovering and creating features, a larger and more diverse group of users can contribute to the data science process. Automating feature creation allows the "citizen" data scientist to build highly useful, highly optimized use cases. Because citizen data scientists typically have a high degree of domain expertise, they can focus on use cases that are of high value to the organization with minimal, if any, assistance from the data science team.
An added benefit of enabling citizen data scientists is that the business can expand its use of data science without having to hire armies of data scientists. The ability to empower new data science contributors is especially significant given the difficulty organizations in the US have had in hiring data scientists, as examined in [a 2018 LinkedIn study](https://news.linkedin.com/2018/8/linkedin-workforce-report-august-2018).
With economic uncertainty facing the global community, enabling a new class of AI/ML developers with minimal investment becomes a game-changing value proposition for maintaining or increasing competitive advantage.
**Automating Data Science: Productivity, Not Replacement**
Any conversation about AutoML 2.0 platforms, however, is misplaced if it focuses on replacing or displacing the data scientist. Most data scientists see feature engineering as one of the most significant obstacles in their work. Automation can only help, accelerating the process with productivity gains that would not otherwise be possible.
By leveraging AutoML 2.0, data scientists can often accelerate their work dramatically, cutting it from months to days. In addition, the AI-based feature engineering in AutoML 2.0 platforms allows data scientists to discover features they would never have considered. AI-based feature engineering automatically builds, evaluates, and exposes features by combining data from multiple columns, often across different tables and sources.
The ability of AutoML 2.0 to self-discover features allows data scientists to explore the so-called "unknown unknowns": features the data scientists would never have considered, because of either lack of time or lack of domain expertise.
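The build-and-evaluate loop described above can be sketched in a few lines. This is a deliberately naive illustration under stated assumptions, not how any AutoML 2.0 product works: the data, the `AGGREGATIONS` and `COLUMNS` choices, and the use of absolute Pearson correlation as the scoring rule are all assumptions made for the example.

```python
from collections import defaultdict
from math import sqrt

# Hypothetical data: a binary target per customer and a related transactions table.
targets = {1: 1, 2: 0, 3: 1, 4: 0}
transactions = [
    {"customer_id": 1, "amount": 5.0, "items": 1},
    {"customer_id": 1, "amount": 7.0, "items": 2},
    {"customer_id": 2, "amount": 50.0, "items": 1},
    {"customer_id": 2, "amount": 70.0, "items": 2},
    {"customer_id": 3, "amount": 6.0, "items": 2},
    {"customer_id": 4, "amount": 60.0, "items": 1},
    {"customer_id": 4, "amount": 80.0, "items": 2},
]

# Candidate features are every (aggregation, column) pair; both lists are assumptions.
AGGREGATIONS = {
    "sum": sum,
    "max": max,
    "min": min,
    "count": len,
    "mean": lambda values: sum(values) / len(values),
}
COLUMNS = ["amount", "items"]


def pearson(xs, ys):
    """Pearson correlation; returns 0.0 when either input is constant."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / sqrt(var_x * var_y)


def discover_features(targets, transactions):
    """Generate every candidate feature and rank it by |correlation| with the target.

    Assumes every customer in `targets` has at least one transaction.
    """
    grouped = defaultdict(lambda: defaultdict(list))
    for row in transactions:
        for column in COLUMNS:
            grouped[row["customer_id"]][column].append(row[column])

    ids = sorted(targets)
    ys = [targets[i] for i in ids]
    scored = []
    for column in COLUMNS:
        for name, aggregate in AGGREGATIONS.items():
            xs = [aggregate(grouped[i][column]) for i in ids]
            scored.append((f"{name}_{column}", abs(pearson(xs, ys))))
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


ranked = discover_features(targets, transactions)
for feature_name, score in ranked:
    print(f"{feature_name}: {score:.3f}")
```

Even this toy version enumerates candidates that nobody explicitly asked for, which is the sense in which automated generation can surface "unknown unknowns". A real system would search a far larger space, span multiple tables, and use a more robust evaluation than a single correlation score.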
**AutoML 2.0: Creating A More Productive, More Inclusive AI/ML Program**
Rather than being a threat to the livelihood of data scientists, AutoML 2.0 platforms are, in fact, an enabling technology that helps accelerate and democratize the data science process. AutoML 2.0 provides the acceleration and automation data scientists need to be more productive, giving them the ability to scale their work and deliver an even greater benefit to the business. This two-fold advantage, democratizing and accelerating the data science process, is the most significant selling point of AutoML 2.0 platforms and the key to scaling data science in the modern organization.