Manifold Blog

Using Dask in Machine Learning: Preprocessing

Posted by Jason Carpenter on Apr 25, 2019 6:00:00 AM


This is the second post in a five-part series about using Dask in machine learning workflows:

  • Using Dask in Machine Learning: Best Practices
  • Using Dask in Machine Learning: Preprocessing
  • Using Dask in Machine Learning: Feature Engineering
  • Using Dask in Machine Learning: Model Training
  • Using Dask in Machine Learning: Model Evaluation

Starting with this post, each installment includes data snapshots and code snippets that illustrate the problem we are working on. The code lives in a public, self-contained GitHub repo, so you can pull it, run the code yourself, and follow along more closely.

Read More

Topics: Data engineering, Machine learning

Your Project Needs a Data Readiness Audit

Posted by Vinay Seth Mohta on Mar 21, 2019 6:00:00 AM

In the early phase of a new project, we dive into the “Understand” step of our Lean AI framework. We aim for two main forms of understanding: business understanding and data understanding.

Read More

Topics: Data engineering

Efficient Data Engineering

Posted by Jakov Kucan on Feb 7, 2019 7:32:34 AM

A typical data engineering problem, often referred to as extract, transform and load (ETL), consists of the following:

  1. take data in one place (extract)
  2. change its form (transform)
  3. move it to a new place, in this new form (load)

This process gets interesting when data volumes are large, and you have to consider performance. Long turnaround time (e.g., a run taking several hours or days) makes the typical serially iterative software engineering approach inefficient. In this article, we offer some tips on re-structuring the software engineering process and leveraging the cloud to make iteration more efficient.
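
The three steps above can be sketched as a small streaming pipeline. This is a minimal illustration, not code from the article: the function names, the CSV and JSON Lines formats, and the in-memory source and sink are all assumptions chosen to keep the example self-contained. Generators are used so that records flow through one at a time, which matters once data volumes grow.

```python
import csv
import io
import json


def extract(source):
    """Yield one dict per CSV row so the whole file never sits in memory."""
    yield from csv.DictReader(source)


def transform(rows):
    """Change the form of each record: tidy the name and cast the amount."""
    for row in rows:
        yield {"name": row["name"].strip().title(), "amount": float(row["amount"])}


def load(records, sink):
    """Write records to the destination as JSON Lines; return the count written."""
    count = 0
    for record in records:
        sink.write(json.dumps(record) + "\n")
        count += 1
    return count


# Hypothetical in-memory source and sink standing in for real storage.
source = io.StringIO("name,amount\n alice ,10.5\nBOB,3\n")
sink = io.StringIO()
written = load(transform(extract(source)), sink)
```

Because each stage only consumes an iterator, a stage can be tested on a handful of rows before it is pointed at the full data set.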

Read More

Topics: Data engineering

Using Dask in Machine Learning: Best Practices

Posted by Jason Carpenter on Jan 31, 2019 6:00:00 AM


The Python ecosystem offers a number of incredibly useful open source tools for data scientists and machine learning (ML) practitioners. One such tool is Dask, available from Anaconda. At Manifold, we have used Dask extensively to build scalable ML pipelines.

Read More

Topics: Data science, Data engineering, Machine learning

Incremental Synchronization: Replicating Actions vs. State

Posted by Jakov Kucan on Jan 15, 2019 7:00:00 AM

Whenever you search for an optimal solution to a problem, you face design decisions about the appropriate architecture and approach. This post uses one such problem, synchronizing data from one store to another, to highlight the key decision points. We contrast two approaches and pose questions that can help inform the design: replicating the source actions (insert, update, delete) at the destination data store, and replicating the state of the source store in the destination store.
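
The difference between the two approaches can be sketched with plain dictionaries standing in for the data stores. This is an illustrative assumption, not the design from the post: `apply_actions` and `sync_state` are hypothetical helpers, and a real system would also have to deal with ordering, failures, and scale.

```python
def apply_actions(dest, actions):
    """Replay an ordered log of (op, key, value) actions on the destination."""
    for op, key, value in actions:
        if op in ("insert", "update"):
            dest[key] = value
        elif op == "delete":
            dest.pop(key, None)
        else:
            raise ValueError(f"unknown action: {op}")
    return dest


def sync_state(dest, source):
    """Make the destination match the source by comparing current state."""
    # Remove keys that no longer exist at the source.
    for key in list(dest):
        if key not in source:
            del dest[key]
    # Add or overwrite keys whose values differ.
    for key, value in source.items():
        if key not in dest or dest[key] != value:
            dest[key] = value
    return dest


# Hypothetical example data: the same change expressed as actions and as state.
source_before = {"a": 1, "b": 2}
action_log = [("update", "a", 10), ("delete", "b", None), ("insert", "c", 3)]
source_after = {"a": 10, "c": 3}

by_actions = apply_actions(dict(source_before), action_log)
by_state = sync_state(dict(source_before), source_after)
```

Both calls end with the same destination contents. Replaying actions needs a complete, ordered log, while comparing state needs read access to the full source.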
Read More

Topics: Data engineering

3 Ways Artificial Intelligence Could Boost the Success of Your Business

Posted by Vivek Mohta on Jan 4, 2019 7:00:00 AM

As the artificial intelligence field continues to grow, businesses across the country are finding that techniques are moving out of the research lab and into applied use, where they can benefit their operations.

Read More

Topics: AI at the edge, Computer vision, Data engineering

Custom Loss Functions for Gradient Boosting

Posted by Prince Grover on Sep 28, 2018 3:27:51 PM

By Prince Grover and Sourav Dey


Gradient boosting is widely used in industry and has won many Kaggle competitions. The internet already has many good explanations of gradient boosting (we've even shared some selected links in the references), but we've noticed a lack of information about custom loss functions: the why, when, and how. This post is our attempt to summarize the importance of custom loss functions in many real-world problems — and how to implement them with the LightGBM gradient boosting package.
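
As a rough idea of what a custom loss involves, gradient boosting libraries such as LightGBM let you supply an objective that returns the per-sample gradient and Hessian of the loss with respect to the prediction. The sketch below is an assumption for illustration, not the authors' code: it uses plain Python lists, a hypothetical `under_penalty` parameter, and an asymmetric squared error that penalizes under-prediction more heavily. How such a function is wired into LightGBM depends on the API and version, so that step is omitted.

```python
def asymmetric_squared_error(y_true, y_pred, under_penalty=10.0):
    """Return (grad, hess) for a squared error that weights under-prediction more.

    Loss per sample: weight * (y_true - y_pred) ** 2, where weight is
    under_penalty when the prediction is below the target and 1.0 otherwise.
    """
    grad, hess = [], []
    for target, pred in zip(y_true, y_pred):
        residual = target - pred
        weight = under_penalty if residual > 0 else 1.0
        grad.append(-2.0 * weight * residual)  # first derivative w.r.t. pred
        hess.append(2.0 * weight)              # second derivative w.r.t. pred
    return grad, hess


# One under-prediction (target 3, predicted 1) and one over-prediction.
grad, hess = asymmetric_squared_error([3.0, 1.0], [1.0, 2.0])
```

The under-predicted sample produces a much larger gradient, which pushes the next boosting round to correct it more aggressively.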

Read More

Topics: Data science, Data engineering

A Python Toolkit for Docker-First Data Science

Posted by Alexander Ng on Apr 19, 2018 7:00:00 AM

As interest in Artificial Intelligence (AI), and specifically Machine Learning (ML), grows and more engineers enter this popular field, the lack of de facto standards and frameworks for how work should be done is becoming more apparent. A new focus on optimizing the ML delivery pipeline is starting to gain momentum.

Read More

Topics: MachOps, Data engineering, Orbyter
