# Manifold Blog

Whenever searching for an optimal solution to a problem, one is faced with design decisions about the appropriate architecture and approach. This post discusses one such problem, data synchronization from one store to another, in order to highlight key decision points. We contrast two approaches and pose questions that can help inform the design decisions. The approaches we look at are: replicating source actions (insert, update, delete) at the destination data store, and replicating the state of the source store in the destination store.

## The Problem

Given a source-of-truth data store (the source) and a read-only copy (the destination), how does one keep the destination consistent with the source? Examples of this type of problem include database read replicas, keeping a search index up to date, etc.

To define the problem more precisely, we make the following assumptions:
• Source and destination data stores are not necessarily the same (e.g., source can be a MySQL database and destination may be an Elasticsearch index).
• Data will be transformed in some fashion when copied from source to the destination (e.g., we may denormalize source data by joining two tables into a single table in the destination store).
• The synchronization must be done in near real-time. As a practical matter, this means that we cannot copy the entire source data store on a periodic basis, but need to keep the destination store up to date incrementally.
• To make the process possible, we assume that the source data store will send an event whenever a piece of data is modified.

### Example

Consider the source data store to be a MySQL database with one table (for simplicity), and we want to keep an Elasticsearch index up to date with the source DB. In the source we have a pets table:

```
id   | name | kind  | dob
--------------------------------
1    | Rex  | dog   | 2018-01-01
2    | Joe  | snake | 2017-03-02
```

...which will be replicated in the Elasticsearch pets index as:

```
{
  "id": 1,
  "name": "Rex",
  "kind": "dog",
  "dob": "2018-01-01"
},
{
  "id": 2,
  "name": "Joe",
  "kind": "snake",
  "dob": "2017-03-02"
}
```

If a change (INSERT, UPDATE, DELETE) statement is executed in the source DB, we will receive an event with the old and new record values. For example, if we update dob in record 1, we should get an event:

```
{
  event_type: UPDATE,
  before_values: { id: 1, name: Rex, kind: dog, dob: 2018-01-01 },
  after_values: { id: 1, name: Rex, kind: dog, dob: 2018-03-17 },
}
```

With this information, we can update the record in the Elasticsearch index:

```
PUT pets/_doc/1
{ "name": "Rex", "kind": "dog", "dob": "2018-03-17" }
```
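To make the event handling concrete, here is a minimal Python sketch of mapping a change event to the document id and body of the destination upsert. The event shape and field names are the hypothetical ones from the example above, not a real connector API.

```python
def event_to_upsert(event):
    """Map an INSERT/UPDATE event to a (doc_id, document) pair for the index."""
    # UPDATE events carry after_values; INSERT events carry values.
    values = event.get("after_values") or event["values"]
    doc_id = values["id"]
    # The destination document omits the primary key, mirroring the example.
    doc = {k: v for k, v in values.items() if k != "id"}
    return doc_id, doc

event = {
    "event_type": "UPDATE",
    "before_values": {"id": 1, "name": "Rex", "kind": "dog", "dob": "2018-01-01"},
    "after_values": {"id": 1, "name": "Rex", "kind": "dog", "dob": "2018-03-17"},
}
doc_id, doc = event_to_upsert(event)
```

In a real implementation, `doc_id` and `doc` would feed a `PUT pets/_doc/<doc_id>` request.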

### Formal Problem Definition

A source system S is defined as a set of records s, and a destination system D as a set of records d. A state of the source (destination) is the set of records s (d), respectively.

There is a transformation T: S → D between the source and the destination. A destination state d is said to be consistent with the source state s if d = T(s).

For a source system, there is a set of actions A. The effect of an action a in A on the state s is denoted as a new state a(s).

Suppose we have a sequence of actions a1, a2, ..., an transforming the state s to the state s' = an(...(a2(a1(s)))...). This transition, call it X(s, s'), can be represented as the sequence of actions (a1, ..., an), as a difference between the states (s' - s), or as a combination of both.

The synchronization problem can be defined as follows. For a transition X(s, s') of the source, find an action b (or sequence of actions) that will map state d = T(s) of the destination such that b(d) = T(s').

## Replicating Actions

In this approach, for every source change event, we execute a corresponding action updating the destination store. In the example above, if we receive an insert event:

```
{
  event_type: INSERT,
  values: { id: 3, name: Primrose, kind: cat, dob: 2015-03-01 }
}
```

...we can update the index by executing:

```
PUT pets/_doc/3
{ "name": "Primrose", "kind": "cat", "dob": "2015-03-01" }
```

Formally speaking, given X(s, s') = a (or equivalently s' = a(s)), we need to find a' such that T(s') = a'(T(s)) or equivalently T(a(s)) = a'(T(s)).
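A minimal in-memory sketch of action replication, with a plain dict standing in for the destination index and the hypothetical event shapes from the examples: each event type maps directly to a destination operation, without ever consulting the source store again.

```python
def apply_action(index, event):
    """Apply a source change event directly to the destination index."""
    etype = event["event_type"]
    if etype in ("INSERT", "UPDATE"):
        values = event.get("after_values") or event["values"]
        # Upsert the document, keyed by primary key.
        index[values["id"]] = {k: v for k, v in values.items() if k != "id"}
    elif etype == "DELETE":
        index.pop(event["values"]["id"], None)
    return index

index = {}
apply_action(index, {"event_type": "INSERT",
                     "values": {"id": 3, "name": "Primrose",
                                "kind": "cat", "dob": "2015-03-01"}})
```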

## Replicating State

In this approach, we use the source change event merely as a trigger: we query the source data and update the destination store based on the result of that query. In our example, we would get the event:

```
{
  event_type: INSERT,
  values: { id: 3, name: Primrose, kind: cat, dob: 2015-03-01 }
}
```

...query the source using:

```
SELECT * FROM pets WHERE id = 3
```

...then, using the response of the query, update the destination store by executing:

```
PUT pets/_doc/3
{ "name": "Primrose", "kind": "cat", "dob": "2015-03-01" }
```
More formally, given X(s, s') = s' - s, we need to find action a' such that a'(T(s)) = T(s').
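The same flow can be sketched in Python, with plain dicts standing in for the source table and the destination index; the helper name `replicate_state` is ours, not a library API. Note that the event contributes only the record id; the document body comes from a fresh read of the source.

```python
def replicate_state(source, index, record_id):
    """Refresh one destination document from the current source state."""
    row = source.get(record_id)          # SELECT * FROM pets WHERE id = ?
    if row is None:
        index.pop(record_id, None)       # row gone at the source: delete the doc
    else:
        index[record_id] = dict(row)     # PUT pets/_doc/<id>
    return index

source = {3: {"name": "Primrose", "kind": "cat", "dob": "2015-03-01"}}
index = {}
replicate_state(source, index, 3)
```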

### Contrasting the Two Approaches

The two approaches look similar; the differences only become apparent when we look at the implementation architecture. A generic architecture for such a synchronization flow consists of an Event Generator, an Event Queue, and an Event Processor sitting between the Source and Destination Data Stores.
We assume that the Event Generator is either part of the Source Data Store or tightly coupled with it, so that each change generates a single event and the events are generated in the same order the changes are executed in the Source Data Store.

The event generated by the Event Generator can be either the action performed on the source store (e.g., an INSERT or UPDATE statement) or the change in the store's state (e.g., a list of modified rows).

The Event Queue serves as a buffer, so that events can be generated by the Event Generator as fast as they occur, regardless of how fast they are processed by the Event Processor.

Finally, the Event Processor updates the Destination Data Store using one of the strategies above.
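The generator → queue → processor pipeline can be sketched with a stdlib in-process queue standing in for a durable Event Queue such as Kafka or SQS; the `run_processor` helper and the `None` shutdown sentinel are illustrative choices, not part of any real system.

```python
import queue

def run_processor(event_queue, handle):
    """Drain the queue, applying `handle` to each event in arrival order."""
    processed = 0
    while True:
        event = event_queue.get()
        if event is None:                 # sentinel: generator is done
            break
        handle(event)
        processed += 1
    return processed

q = queue.Queue()                         # the Event Queue buffer
seen = []
q.put({"event_type": "INSERT", "values": {"id": 1}})
q.put({"event_type": "DELETE", "values": {"id": 1}})
q.put(None)
count = run_processor(q, seen.append)     # `handle` would normally write to the destination
```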

Let's contrast the two approaches by looking at the following aspects.

### Events

What kind of events are we able to receive from the source? Each event describes a transition from some state s to a state s' of the source. As described in our formal treatment of the problem above, this can be in terms of actions or in terms of state differences.

Examples of action representation are verbatim SQL statements executed in the source system or HL7 SIU messages (see https://corepointhealth.com/resource-center/hl7-resources/hl7-siu-message/).
```
DELETE FROM pets WHERE dob < '2000-01-01'
```
On the other hand, the state difference will give us detail of what rows (in the database example) have been modified.
```
{
  event_type: DELETE,
  values: { id: 3, name: Shiva, kind: cat, dob: 1999-03-01 }
}
```
Clearly, the former makes it difficult to infer what part of the source state was affected by the change. It is, however, much more compact in cases where the change affects many rows. For a MySQL discussion of the differences, see https://dev.mysql.com/doc/refman/8.0/en/replication-sbr-rbr.html.

### Performance

In a naive implementation, the Event Processor will process each event serially, in the same order it was generated by the Event Generator. This implies that the events must be processed (on average) as fast as they are generated, which may not always be possible, effectively putting a cap on the system's throughput. Typical factors to consider are:
• Writes to the Destination Data Store may be slower than writes to the Source Data Store. For example, the Destination Data Store may be read-optimized at the expense of writes (e.g., AWS Redshift).
• Event processing may be non-trivial and require extra time. For example, the processing may include calling an additional service.
• Writing to the destination store may not be parallelizable. This may be due to Destination Store limitations, or in order to preserve semantics (see the replicating-actions example below).
Replicating state is generally more expensive than replicating actions, as it requires additional queries to the Source Data Store. However, for complex transformations between the source and the destination, this may be a good tradeoff.

To get around the problem of keeping up with an incoming stream of events, we may want to optimize the system by:
• processing events in batches, as it is often more efficient to process multiple records at once than one-by-one; or
• processing events in parallel.
The latter approach will inevitably result in events being processed in a different order than they were generated in. Replicating the state is generally more resilient to processing order than replicating actions.
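The batching idea can be sketched as follows; the batch size and the `batches` helper are arbitrary illustrative choices. In the Elasticsearch example, each batch would become a single `_bulk` request instead of many individual PUTs.

```python
def batches(events, size):
    """Group a list of events into fixed-size chunks for bulk processing."""
    for i in range(0, len(events), size):
        yield events[i:i + size]

events = [{"values": {"id": n}} for n in range(7)]
# 7 events in batches of 3 -> batch sizes 3, 3, 1
sizes = [len(batch) for batch in batches(events, 3)]
```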

### Processing order

Consider the following changes are made to the source data:
```
UPDATE pets SET name='Jill' WHERE id = 7
UPDATE pets SET name='Jack' WHERE id = 7
```
We expect the resulting state to be that the cat with record id 7 is named 'Jack'. However, if we parallelize processing of the events generated by these updates, depending on the order we process them, we will either get:
```
{ "name": "Jack", "kind": "cat", "dob": "2015-03-01" }
```
...or:
```
{ "name": "Jill", "kind": "cat", "dob": "2015-03-01" }
```
If we, however, replicate the state instead of the actions, the following is a possible sequence that shows the approach's invariance with respect to the execution order.
```
-- first update
UPDATE pets SET name='Jill' WHERE id = 7
```

```
-- second update
UPDATE pets SET name='Jack' WHERE id = 7
```

```
-- process second event
SELECT * FROM pets WHERE id = 7

PUT pets/_doc/7
{ "name": "Jack", "kind": "cat", "dob": "2015-03-01" }
```

```
-- process first event
SELECT * FROM pets WHERE id = 7

PUT pets/_doc/7
{ "name": "Jack", "kind": "cat", "dob": "2015-03-01" }
```
Even though the events have been processed in reverse order, the resulting state is correct. The only downside is that we have made the same change to the Destination Store twice. This can be mitigated by batching up updates, thus trading latency for throughput.

Note: In the above discussion, we made an assumption that processing each event is atomic. This assumption is rarely true in the real world. For example, the event processing above can have the following sequence:
```
-- first update
UPDATE pets SET name='Jill' WHERE id = 7
```

```
-- (start) process first event
SELECT * FROM pets WHERE id = 7
```

```
-- second update
UPDATE pets SET name='Jack' WHERE id = 7
```

```
-- process second event
SELECT * FROM pets WHERE id = 7

PUT pets/_doc/7
{ "name": "Jack", "kind": "cat", "dob": "2015-03-01" }
```

```
-- (finish) process first event
PUT pets/_doc/7
{ "name": "Jill", "kind": "cat", "dob": "2015-03-01" }
```
We can prevent this sequence by locking the source row we read (so that processing of the second event blocks until we finish processing the first), but locking is generally counterproductive when trying to increase throughput through parallel processing.

An alternative method is to partition the source state so that events processed in parallel refer to disjoint parts of the state (e.g., disjoint sets of records). For example, one can shard the space based on a set of keys in the source tables, such as tenant or user ID. Such partitioning may be difficult when replicating actions: it is much harder to determine that two actions in the source system affect disjoint sets of records.
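One simple partitioning scheme can be sketched as routing each event to a worker by its record id; the modulo routing and event shape are assumptions for illustration. All events for a given record land on the same worker and are processed serially there, while different records can proceed in parallel.

```python
def partition(event, num_workers):
    """Route an event to a worker index based on its record id."""
    values = event.get("after_values") or event["values"]
    return values["id"] % num_workers

e1 = {"event_type": "DELETE", "values": {"id": 7}}
e2 = {"event_type": "INSERT", "values": {"id": 8}}
```

In practice the partition key would be something coarser, such as a tenant or user ID, as mentioned above.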

### Idempotency

Replicating state has an additional advantage, as is clear from the above example. It is idempotent: the same event can be processed multiple times with no ill effect. This allows the Event Queue to relax the delivery guarantee from exactly once to at least once (AWS SQS is an example of such a queue).

The same property makes error handling easier.
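A quick in-memory check of the idempotency claim (plain dicts stand in for the source table and the destination index): handling the same event twice re-reads the same source row and rewrites the same document, so duplicate, at-least-once delivery is harmless.

```python
def handle(source, index, record_id):
    """State-replication handler: refresh the doc from the current source row."""
    row = source.get(record_id)
    if row is None:
        index.pop(record_id, None)
    else:
        index[record_id] = dict(row)

source = {1: {"name": "Rex", "kind": "dog", "dob": "2018-03-17"}}
index = {}
handle(source, index, 1)
once = dict(index)             # destination state after the first delivery
handle(source, index, 1)       # duplicate delivery of the same event
```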

### Transformations

Another difference between the approaches is in how they handle transformations between the source and destination that are not one-to-one. For example, say the source contains one more table, owners, and the pets table has a foreign key owner_id into the owners table. In our destination data store, we want to flatten this structure by repeating the owner's information in each pet document. For example, if John owns two pets, the destination records may look like:

```
{ "pet_name": "Joe", "kind": "snake", "owner_name": "John", "city": "Boston", ... }
{ "pet_name": "Rex", "kind": "dog", "owner_name": "John", "city": "Boston", ... }
```
In this scenario, we want to change the owner's city from Boston to Cambridge. The update would be:

```
UPDATE owners SET city = 'Cambridge' WHERE id = 33
```

...and the corresponding event:

```
{
  event_type: UPDATE,
  table: owners,
  before_values: { name: John, city: Boston, ... },
  after_values: { name: John, city: Cambridge, ... }
}
```
In the first approach (replicating actions), we would need to translate the action above into a corresponding action that updates both documents. Note that finding corresponding actions may be difficult or impossible; thus, the complexity of the transformation is a key decision point in determining which strategy to use.

However, if we use the state replication approach, we can simply run:

```
SELECT * FROM pets LEFT JOIN owners ON pets.owner_id = owners.id WHERE owners.id = 33
```

...and update the destination based on the response of the query:

```
PUT pets/_doc/1
{ "pet_name": "Joe", "kind": "snake", "owner_name": "John", "city": "Cambridge", ... }

PUT pets/_doc/2
{ "pet_name": "Rex", "kind": "dog", "owner_name": "John", "city": "Cambridge", ... }
```
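With in-memory stand-ins for the two tables (shapes taken from the hypothetical example), a state-replication handler for an owners change might look like the following; the `on_owner_change` helper and field names are ours.

```python
def on_owner_change(pets, owners, index, owner_id):
    """Rebuild every denormalized pet document belonging to one owner."""
    owner = owners[owner_id]                  # fresh read of the owner row
    for pet_id, pet in pets.items():          # stand-in for the JOIN query
        if pet["owner_id"] == owner_id:
            index[pet_id] = {                 # PUT pets/_doc/<pet_id>
                "pet_name": pet["name"],
                "kind": pet["kind"],
                "owner_name": owner["name"],
                "city": owner["city"],
            }
    return index

owners = {33: {"name": "John", "city": "Cambridge"}}
pets = {1: {"name": "Joe", "kind": "snake", "owner_id": 33},
        2: {"name": "Rex", "kind": "dog", "owner_id": 33}}
index = {}
on_owner_change(pets, owners, index, 33)
```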
Note: in both cases, depending on the transformation, a single change in the source may result in changes to a large number of records/documents in the destination store.

Another consideration is whether the transformation loses some information from the source. When mapping external systems, we may not have access to complete internal representation. When faced with this kind of lossy transformation, keeping the destination exactly synchronized with the source by replicating actions is impossible.

### Deletions

Replicating state is a bit more difficult when records are deleted, given that we cannot query the state of what is no longer there. In this case, we need to find the image of the deleted source records in the Destination Store and delete those destination records.
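A minimal sketch of delete handling, again with a dict standing in for the destination index: since the row cannot be re-read from the source, the event's before-image supplies the key needed to remove the matching document (`DELETE pets/_doc/<id>` in the Elasticsearch example).

```python
def on_delete(index, event):
    """Remove the destination document identified by the deleted row's key."""
    deleted = event["values"]            # before-image of the deleted source row
    index.pop(deleted["id"], None)
    return index

index = {3: {"name": "Shiva", "kind": "cat", "dob": "1999-03-01"}}
on_delete(index, {"event_type": "DELETE",
                  "values": {"id": 3, "name": "Shiva",
                             "kind": "cat", "dob": "1999-03-01"}})
```

For denormalized destinations, the before-image may need to identify several documents rather than one.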

### Transactions

Finally, we should consider how the system treats transactions. If the source or destination is not transactional, this may be a moot point. However, assume that both systems are transactional (e.g., relational databases). Ideally, we would like to ensure that the destination is never in a state that does not correspond to some state of the source.

The strictest requirement is that if the source passes through the states s1, s2, ..., sn, the destination will pass through exactly the corresponding states T(s1), T(s2), ..., T(sn).

We can relax this constraint to allow the destination to skip some intermediate states: if the source transitions through states s1, s2, ..., sn, then the destination should transition through some subsequence of T(s1), T(s2), ..., T(sn). This relaxation is useful when we want to process multiple changes in batches, or in parallel.
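Coalescing a queued batch down to one event per record is one way to exploit this relaxation; here is a sketch (event shapes as in the examples, helper name ours). The destination jumps straight to each record's latest state and skips the intermediate ones.

```python
def coalesce(events):
    """Keep only the latest event per record id from a batch."""
    latest = {}
    for event in events:
        values = event.get("after_values") or event["values"]
        latest[values["id"]] = event      # later events overwrite earlier ones
    return list(latest.values())

events = [
    {"event_type": "INSERT", "values": {"id": 7, "name": "Jill"}},
    {"event_type": "UPDATE", "after_values": {"id": 7, "name": "Jack"}},
    {"event_type": "INSERT", "values": {"id": 8, "name": "Primrose"}},
]
remaining = coalesce(events)
```

With state replication, even this coalescing is optional: processing the stale duplicates would only repeat the same reads and writes.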

## Conclusion

Both replicating state and replicating actions can be viable options when we need to keep data in sync. Replicating actions is simpler and more efficient, but it is best suited to simple transformations and an established, fixed sharding strategy.

Replicating state instead of actions has the advantage of being more tolerant of processing order and idempotent. This yields a system that is easier to scale and more robust.

Here are some key questions to ask:

What events are pushed out by the source system (state or actions)?

If state is pushed, replicating state may be easier than reverse-engineering actions. Likewise, if actions are pushed, replicating actions may be easier. Can you modify the source system to better suit your case?

Topics: Data engineering
