Provide a configurable structure for modularized information processing
Complicated data processing involves many distinct and repetitive steps. Each of these steps can be mapped to a software module that is independently developed and then assembled for particular cases of data processing.
Given its elastic nature, cloud computing is a natural platform for data processing. We need a solution that is flexible in two ways:
1. Modularized components for data processing;
2. Configurable so that different modules can be re-used easily in various cases.
Define the key steps of processing and map each of them to a virtual machine. For better performance, one option is to have multiple VMs for one step, but that may mean more coordination work.
When deciding what the steps are, think about not only the current project but also how the corresponding virtual machines could be re-used in future projects. Looking at the problem from the perspective of data processing in general, rather than the immediate project alone, helps you come up with re-usable virtual machines.
The topology of the pipeline is pretty straightforward. You have virtual machines in a line as shown below:
A variation of the serial structure illustrated above is to have multiple VMs in one step as a cluster to improve performance. All the virtual machines in one cluster have the same configuration to simplify management.
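To make the serial structure and its clustered variation concrete, here is a minimal in-process sketch. It is only an illustration under assumptions: the `Step` class, the step names, and the round-robin sharding are hypothetical, and plain function calls stand in for real virtual machines.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Step:
    """One pipeline step; in a real deployment this would map to one or more VMs."""
    name: str                         # hypothetical step name
    process: Callable[[list], list]   # the logic a VM in this step would run
    vm_count: int = 1                 # >1 models a cluster of identically configured VMs


def run_step(step: Step, items: list) -> list:
    # Split the input round-robin across the identical "VMs" of this step.
    shards = [items[i::step.vm_count] for i in range(step.vm_count)]
    out: list = []
    for shard in shards:
        out.extend(step.process(shard))
    return out


def run_pipeline(steps: List[Step], items: list) -> list:
    # Serial topology: the output of each step is the input of the next.
    for step in steps:
        items = run_step(step, items)
    return items


# The pipeline is pure configuration: reorder or swap steps to re-use them elsewhere.
PIPELINE = [
    Step("parse", lambda xs: [int(x) for x in xs]),
    Step("square", lambda xs: [x * x for x in xs], vm_count=2),
    Step("filter_even", lambda xs: [x for x in xs if x % 2 == 0]),
]

result = run_pipeline(PIPELINE, ["1", "2", "3", "4"])
```

Note that sharding across a cluster can change the order of records, which is one example of the extra coordination work a multi-VM step brings.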
As you can see, you need to flow information from one step to its next step. There are two types of flows: control flow and data flow. The control flow signals the completion of the current step so that its next step can get started. The data flow is the processed output data from the current step to its next step as input.
There are two different strategies for passing information. You can do it from a centralized repository or stream from one VM to the next. Here are the main things to consider:
2. Indirect messaging. It leverages standard messaging queues and decouples the sender (current VM) and receivers (next VMs). An added benefit is that the messages can be stored and delivered later in case the receiver is busy or offline.
3. DB sharing. It uses standard databases as the common information store. All the VMs involved in the processing should have access to the database.
5. Cloud storage. With this approach, you open an account with a service provider and store your data there.
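As a rough sketch of the indirect messaging option, the snippet below uses Python's standard `queue.Queue` as a stand-in for a real message queue service. The function names and message format are assumptions made for illustration; the point is that the sender never calls the receiver directly, and messages wait in the queue until the receiver is ready.

```python
import queue

# Stand-in for a message queue service shared by the two steps (assumption).
broker: "queue.Queue[dict]" = queue.Queue()


def sender_vm(records):
    """Current step: publish each processed record, then a completion marker."""
    for record in records:
        broker.put({"type": "data", "payload": record})
    broker.put({"type": "done"})


def receiver_vm():
    """Next step: consume messages until the completion marker arrives."""
    received = []
    while True:
        msg = broker.get()
        if msg["type"] == "done":
            break
        received.append(msg["payload"].upper())
    return received


# The receiver is "offline" while the sender runs; messages are simply stored.
sender_vm(["a", "b", "c"])
pending = broker.qsize()   # messages waiting for delivery

# Later the receiver comes online and drains the queue.
output = receiver_vm()
```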
So which one should you choose for your project? From a business perspective, these options all come with varying costs for development, initial setup and on-going maintenance. From a technical perspective, the major technical considerations in your choice are:
1. Amount of data. When there are huge amounts of data, you don’t want to move data from one VM to another very often. In that case, having a centralized database makes a lot of sense. Whenever centralized data is accessed by multiple parties, synchronization becomes critical for smooth processing.
2. Timing requirement. If you have a demanding requirement on timing, you may want to consider direct messaging and sacrifice the benefits of decoupling your VMs.
As mentioned earlier, the control and data flows are two different flows. You can choose a different approach for each of them. For example, you can choose a message queue to pass control messages and a database for the data flow.
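The combination just described can be sketched as follows. This is a simplified, single-process illustration: an in-memory SQLite database stands in for the shared database, `queue.Queue` stands in for the message queue, and the table layout and function names are hypothetical.

```python
import queue
import sqlite3

# Shared data store for the data flow (in-memory SQLite as a stand-in).
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE records (id INTEGER PRIMARY KEY, step TEXT, value INTEGER)")

# Message queue for the control flow (stand-in for a real queue service).
control: "queue.Queue[dict]" = queue.Queue()


def step_one(values):
    """Write output to the database, then signal completion on the queue."""
    db.executemany(
        "INSERT INTO records (step, value) VALUES ('step1', ?)",
        [(v,) for v in values],
    )
    db.commit()
    control.put({"event": "step_done", "step": "step1"})


def step_two():
    """Wait for the control message, then read the input from the database."""
    msg = control.get()
    rows = db.execute(
        "SELECT value FROM records WHERE step = ? ORDER BY id", (msg["step"],)
    ).fetchall()
    doubled = [(v * 2,) for (v,) in rows]
    db.executemany("INSERT INTO records (step, value) VALUES ('step2', ?)", doubled)
    db.commit()
    return [v for (v,) in doubled]


step_one([1, 2, 3])
step2_output = step_two()
```

Only the small control message travels between the steps; the data itself stays in one place, which matters when the volume is large.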
If you don’t have complicated data processing, the simple and autonomous pipeline architecture may be good enough. If your data processing is complicated (for example, multiple pipelines are interwoven and data flows along different routes depending on the current processing result), you may need a management application to manage and coordinate the processing.
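A management application of this kind could, in its simplest form, look like the sketch below. Everything here is an assumption for illustration: each step returns its output together with the name of the next step, and a small coordinator loop follows that route. The step names and record fields are hypothetical.

```python
from typing import Callable, Dict, List, Optional, Tuple

# Each step returns (output record, name of the next step or None to stop).
StepFn = Callable[[dict], Tuple[dict, Optional[str]]]


def validate(rec: dict) -> Tuple[dict, Optional[str]]:
    ok = isinstance(rec.get("amount"), (int, float))
    # The route depends on the processing result.
    return {**rec, "valid": ok}, ("enrich" if ok else "quarantine")


def enrich(rec: dict) -> Tuple[dict, Optional[str]]:
    return {**rec, "amount_cents": int(rec["amount"] * 100)}, None


def quarantine(rec: dict) -> Tuple[dict, Optional[str]]:
    return {**rec, "quarantined": True}, None


STEPS: Dict[str, StepFn] = {
    "validate": validate,
    "enrich": enrich,
    "quarantine": quarantine,
}


def coordinate(record: dict, start: str = "validate") -> Tuple[dict, List[str]]:
    """Drive one record through the steps, recording the route it took."""
    path: List[str] = []
    current: Optional[str] = start
    while current is not None:
        path.append(current)
        record, current = STEPS[current](record)
    return record, path


good, good_path = coordinate({"amount": 12.5})
bad, bad_path = coordinate({"amount": "n/a"})
```

In a real deployment the coordinator would dispatch work to VMs rather than call functions, but the routing decision stays in one place either way.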
Consider a VM pipeline pattern when you want to:
- Divide data processing into multiple steps that can map to virtual machines;
- Freely assemble and configure specialized VMs as steps of information processing;
- Easily isolate and encapsulate processing logic into VMs;
- Better control the resource allocation and management for processing efficiency;
- Design your system for better scalability.
Mapping each step to a virtual machine instead of a process has extra overheads that range from more storage consumption to slower performance due to runtime VM switching. You will also need extra IP addresses and see more network traffic. To manage these virtual machines, you may need to acquire or upgrade your management system. These are costs you should weigh when considering this approach for scalable data processing.
It’s a common practice to use a process pipeline pattern for information processing. I personally haven’t read about the use of this approach at the virtual machine level. If you have any examples to share, please feel free to leave a comment!
VM Factory: create new VM instances for each step.
Stateless VM: minimize the management of the VMs, especially those used within one step.