SPARK + AI SUMMIT 2019: Key Announcements


Spark + AI Summit 2019 hosted a vibrant community of users and developers of Machine Learning solutions. Talks were clearly differentiated by technical level, so every participant could find something suitable, while in the keynotes some of the leading figures at Databricks made several exciting announcements, which are the subject of this blog post.

 

Specifically, I will be concentrating on three main announcements:

  • Koalas
  • Delta Lake
  • Managed MLflow

 

While there were a number of other announcements, such as the introduction of generalized UDFs, I will cover those in a separate blog post once more information is available.

 

Koalas

This is my personal favorite. Part of Apache Spark 3.0, which will be released later this year, Koalas lets you use the Pandas API without having to either downsample the data or migrate to PySpark. Being single-node, single-threaded and in-memory, Pandas works well with small and medium-sized data. If you have large amounts of data and want to use Pandas, you’ll probably have to downsample the dataset. At the same time, you might not want to migrate your existing code to PySpark, whose DataFrame API, though similar, is not identical to that of Pandas.

 

By simply using the Koalas package, data scientists can now keep their Pandas code largely as it is, while at the same time fully exploiting Apache Spark’s distributed environment. This means that code written against the Pandas API can run much faster on large datasets just by executing it in the Spark environment. Moreover, because Koalas runs on the Apache Spark execution engine, we can still switch between all the languages supported by the engine itself within a single notebook.
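To make the idea concrete, here is a minimal sketch, assuming Pandas is installed locally. The helper `revenue_by_city` and its sample data are hypothetical and only serve as illustration. The function is written against the Pandas-style API and receives the dataframe module as an argument, so the same body could in principle be handed the Koalas module on a Spark cluster.

```python
import pandas as pd


def revenue_by_city(frames):
    """Build a small frame and aggregate it.

    `frames` is any module exposing a Pandas-style `DataFrame`
    (hypothetical helper, not part of Pandas or Koalas).
    """
    df = frames.DataFrame({
        "city": ["Rome", "Milan", "Rome", "Turin"],
        "revenue": [10.0, 20.0, 5.0, 7.5],
    })
    # Same groupby/aggregate call one would write in plain Pandas.
    return df.groupby("city")["revenue"].sum()


# Local, single-node execution with Pandas.
result = revenue_by_city(pd)
print(result.to_dict())

# Assumption: on a cluster with Spark and the Koalas package installed,
# the only change would be the module that is passed in:
#   import databricks.koalas as ks
#   result = revenue_by_city(ks)  # a Koalas Series backed by Spark
```

Not every Pandas method is necessarily covered by Koalas, and row order in distributed results may differ from Pandas, so existing code should be checked rather than assumed to run unchanged.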

 

Another piece of great news, beyond its capabilities, is that the project is open source.

 

Delta Lake

While Databricks had already announced the Delta Lake project months ago, the news here is that the company is making it open source.

Delta Lake is a storage layer that sits between a company’s existing data lake and the tools used to query or analyze the data. The number one aim of Delta Lake is to make the data reliable, which it does in a number of ways:

  • provides ACID transactions across multiple writes
  • automatically checks that what is being written to the data lake is compatible with a set schema
  • stores metadata information in the transaction log instead of the metastore
  • allows reading and using previous versions of a table or directory
  • can be used both for batch writes and as a streaming sink
  • will feature merge, update, delete DML commands
  • will support data expectations on tables and directories
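Three of the ideas in this list (all-or-nothing writes, schema checks, and reading previous versions) can be illustrated with a toy in-memory model. This is only a sketch of the concepts under simplifying assumptions: the class `ToyVersionedTable` and its methods are hypothetical, and none of this is Delta Lake's API or implementation.

```python
from typing import Dict, List, Optional

Row = Dict[str, object]


class ToyVersionedTable:
    """Toy table: an append-only commit log plus schema enforcement."""

    def __init__(self, schema: Dict[str, type]) -> None:
        self.schema = schema
        self._log: List[List[Row]] = []  # one entry per committed write

    def _validate(self, row: Row) -> None:
        # Reject rows whose columns or value types differ from the schema.
        if set(row) != set(self.schema):
            raise ValueError(f"columns {sorted(row)} do not match the schema")
        for name, expected in self.schema.items():
            if not isinstance(row[name], expected):
                raise ValueError(f"column {name!r} expects {expected.__name__}")

    def append(self, rows: List[Row]) -> int:
        """Validate every row first, then commit all of them or none."""
        for row in rows:
            self._validate(row)
        self._log.append([dict(row) for row in rows])
        return len(self._log) - 1  # version number of this commit

    def read(self, version: Optional[int] = None) -> List[Row]:
        """Replay the log up to `version` (the latest one if omitted)."""
        if version is None:
            commits = self._log
        else:
            if not 0 <= version < len(self._log):
                raise ValueError(f"unknown version {version}")
            commits = self._log[: version + 1]
        return [dict(row) for commit in commits for row in commit]


table = ToyVersionedTable({"id": int, "name": str})
v0 = table.append([{"id": 1, "name": "ada"}])
v1 = table.append([{"id": 2, "name": "bob"}, {"id": 3, "name": "cy"}])

# A batch containing one bad row ("5" is not an int) is rejected as a whole.
try:
    table.append([{"id": 4, "name": "dee"}, {"id": "5", "name": "eve"}])
    rejected = False
except ValueError:
    rejected = True

print(len(table.read()), len(table.read(version=v0)), rejected)
```

The design choice worth noticing is that the log is the source of truth: the current table is just the replay of all commits, which is what makes reading an older version cheap to express.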

 

Managed MLflow

MLflow has been around for almost a year now, helping data scientists manage the overall Machine Learning lifecycle and track different experiments. One of the differentiating features of this open source Machine Learning platform is that it’s designed to work with any Machine Learning library, programming language, deployment tool, etc. The project already counts more than 80 contributors from 40 companies after just 10 months.

 

The three main components of MLflow are:

  • tracking, which lets you inspect different runs of ML models, assess which parameters work best, and compare different models
  • projects, which makes it easy to specify the dependencies a piece of code needs by adding this information to a file in the repo, so that everyone working with it has everything needed to run the project
  • models, a way to package the created models and deploy them using built-in tools, irrespective of which packages were used to create them

 

There have been a few different announcements about the MLflow project:

  • Microsoft joining the community by supporting the MLflow tracking API in Azure ML
  • MLflow 1.0, a more stable API for long-term use
  • Two new components:
    • workflows, an easy way to design & share multistep pipelines
    • model registry, a feature for managing, tagging and versioning models on the server

Recordings of the three keynotes announcing this news can be found online:

  • Delta Lake by Ali Ghodsi & Michael Armbrust
  • MLflow by Matei Zaharia
  • Koalas by Reynold Xin & Brooke Wenig

 
