SPARK + AI SUMMIT 2019: Key Announcements

7 May 2019 By Lucrezia Noli

Spark + AI Summit 2019 hosted a vibrant community of users and developers of Machine Learning solutions. Talks were clearly differentiated by technical level, so every participant could find what they were looking for. During the keynotes, some of the principal figures at Databricks made several exciting announcements, which are the subject of this blog post.

Specifically, I will be concentrating on three main announcements:

Koalas
Delta Lake
Managed MLflow

There were a number of other announcements, such as the introduction of generalized UDFs. I will cover those in a separate blog post once more information is available.

Koalas

This is my personal favorite. Presented alongside Apache Spark 3.0, which is expected later this year, Koalas lets you keep using Pandas without either downsampling the data or migrating to PySpark. Being single-node, single-threaded and in-memory, Pandas works well with small and medium-sized data. If you have large amounts of data and want to use Pandas, you’ll probably have to downsample the dataset. At the same time, you might not want to migrate your already-written code to PySpark, whose DataFrame API, though similar, is not identical to that of Pandas.

Simply by using the Koalas package, data scientists can now keep their Pandas code essentially as it is while fully exploiting Apache Spark’s distributed environment. This means that, with nearly the same code used in Pandas, a job on large data can run much faster just by executing it in the Spark environment. Moreover, because Koalas runs on the Apache Spark execution engine, we can still switch between all the languages supported by the engine within a single notebook.

Beyond its capabilities, another piece of great news is that the project is open source.

Delta Lake

While the Delta Lake project was already announced by the company months ago, the news here is that Databricks is making it open source.

Delta Lake is a storage layer that sits between a company’s existing data lake and the tools used to query or analyze the data. Its primary aim is to make the data reliable, which it does in a number of ways:

provides ACID transactions across multiple writes
automatically checks that what is being written to the data lake is compatible with a set schema
stores metadata information in the transaction log instead of the metastore
lets you read and use previous versions of a table or directory
can be used both for batch writes and as a streaming sink
will feature merge, update, delete DML commands
will support data expectations on tables and directories
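
To make a few of these ideas concrete, here is a minimal, purely illustrative sketch in plain Python. It is not the Delta Lake API: the class `ToyVersionedTable` and its methods are hypothetical names invented for this example. It only shows how an append-only log of commits can give all-or-nothing writes, schema checks, and access to earlier versions of a table.

```python
from dataclasses import dataclass, field


@dataclass
class ToyVersionedTable:
    """Illustrative stand-in only; this is NOT the Delta Lake API."""

    schema: dict  # column name -> expected Python type
    log: list = field(default_factory=list)  # one entry per committed write

    def append(self, rows):
        rows = list(rows)
        # Schema enforcement: validate every row before committing anything,
        # so a bad batch never becomes partially visible.
        for row in rows:
            if set(row) != set(self.schema):
                raise ValueError(f"columns {sorted(row)} do not match schema")
            for column, expected in self.schema.items():
                if not isinstance(row[column], expected):
                    raise TypeError(f"{column!r} must be {expected.__name__}")
        self.log.append([dict(row) for row in rows])
        return len(self.log) - 1  # version number of this commit

    def read(self, version=None):
        # "Time travel": replay the log up to and including `version`.
        if version is None:
            version = len(self.log) - 1
        return [row for commit in self.log[: version + 1] for row in commit]


# Hypothetical example data.
table = ToyVersionedTable(schema={"id": int, "name": str})
v0 = table.append([{"id": 1, "name": "a"}])
v1 = table.append([{"id": 2, "name": "b"}])
first_version = table.read(version=v0)  # only the rows from the first commit
latest = table.read()                   # rows from both commits
```

In this sketch a write that violates the schema raises an error before anything is added to the log, so readers only ever see complete commits. The real project works on files in a data lake rather than in-memory lists.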

Managed MLflow

MLflow has been around for almost a year now, helping data scientists manage the overall Machine Learning lifecycle and track different experiments. One of the differentiating features of this open source Machine Learning platform is that it’s designed to work with any Machine Learning library, programming language, deployment tool, etc. In just 10 months, the project has already attracted more than 80 contributors from 40 companies.

The three main components of MLflow are:

tracking, which lets you inspect different runs of ML models and is used in various ways to assess which parameters work best and to compare different models
projects, which make it easy to specify which dependencies a piece of code needs by adding this information to a file in the repo, so that everyone working with it has everything needed to run the project
models, a way to package the created models and deploy them using built-in tools, irrespective of which packages were used to create them
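
The idea behind the tracking component can be sketched in a few lines of plain Python. This is a conceptual illustration only, not MLflow itself: `ToyTracker` and `ToyRun` are hypothetical names, and the parameter values and scores are made up. MLflow exposes its own tracking calls, such as `mlflow.log_param` and `mlflow.log_metric`, and stores the results so they can be browsed and compared.

```python
from dataclasses import dataclass, field


@dataclass
class ToyRun:
    """One training run: the parameters used and the metrics obtained."""

    params: dict
    metrics: dict = field(default_factory=dict)


class ToyTracker:
    """Illustrative stand-in only; this is NOT the MLflow API."""

    def __init__(self):
        self.runs = []

    def log_run(self, params, metrics):
        # Record a copy of the inputs so later changes do not alter history.
        run = ToyRun(dict(params), dict(metrics))
        self.runs.append(run)
        return run

    def best_run(self, metric, lower_is_better=True):
        # Compare every recorded run on a single metric.
        candidates = [r for r in self.runs if metric in r.metrics]
        if not candidates:
            raise ValueError(f"no run has logged {metric!r}")
        choose = min if lower_is_better else max
        return choose(candidates, key=lambda r: r.metrics[metric])


tracker = ToyTracker()
# Hypothetical parameter values and scores, made up for the example.
tracker.log_run({"alpha": 0.1}, {"rmse": 0.82})
tracker.log_run({"alpha": 0.5}, {"rmse": 0.74})
tracker.log_run({"alpha": 1.0}, {"rmse": 0.79})
best = tracker.best_run("rmse")
print(best.params)
```

Keeping parameters and metrics together for each run is what makes it possible to ask, after the fact, which configuration worked best.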

There have been a few different announcements about the MLflow project:

Microsoft is joining the community by supporting the MLflow tracking API in Azure ML
MLflow 1.0, a more stable API for long-term use
Two new components:

- workflows, an easy way to design & share multistep pipelines
- model registry, a feature that lets you manage, tag and version models on the server

The three keynotes announcing this news can be found online:

Delta Lake by Ali Ghodsi & Michael Armbrust
MLflow by Matei Zaharia
Koalas by Reynold Xin & Brooke Wenig
