In this post, I will cover my real-world experience with the following case:
- The Source: While in theory it does not really matter what your source is (a very similar approach works for the majority of the most popular databases), Oracle was the database system of choice in big corporate companies for many years, and that is where my focus will be.
- The Target: There is no reason to be specific on this side. You can pick any target database in AWS, and the approach will still fit.
- The Mode: You can have a full refresh or an incremental refresh, and a batch data load (source and target states are delayed) or a (near) real-time data load. Both will be touched on here.
- The Frequency: You might want a one-time migration followed by a full switch to the cloud, or you might require a transition period with the data kept up to date on both sides simultaneously, which implies developing daily synchronization between on-premise and AWS. The former is simpler and makes far more sense, but the latter is more often requested and has far more break points. I will cover both here.
Problem Description
The requirement is often simple: just copy what we have on-premise into the cloud. But before going any further, there is something to consider:
Do not jump into the idea of “just copy what we have and deal with it later” too fast. Yes, it is the easiest thing you can do, and it will be done quickly, but it has the potential to create a fundamental architectural problem that will be impossible to fix later without serious refactoring of the majority of the new cloud platform. Just imagine: the cloud ecosystem is completely different from the on-premise one. Several new services will be introduced over time, and naturally people will start using the same data very differently. It is almost never a good idea to replicate the on-premise state in the cloud in a 1:1 fashion. It might be in your particular case, but be sure to double-check this. Question the requirement with some meaningful doubts, like:

- Who will be the typical user utilizing the new platform? While on-premise it may be a transactional business user, in the cloud it can be a data scientist or data warehouse analyst, or the data’s main consumer might be a service (e.g., Databricks, Glue, machine learning models, etc.).
- Are the regular day-to-day jobs expected to stay even after transitioning to the cloud? If not, how are they expected to change?
- Do you plan substantial growth of data over time? Most likely the answer is yes, as that is often the single most important reason to migrate into the cloud. The new data model should be ready for it.
- What general queries do the end users anticipate running against the new database? This will define how much the existing data model has to change to stay performant.
Setting up the migration
Once the target database is chosen and the data model is satisfyingly discussed, the next step is to get familiar with the AWS Schema Conversion Tool. There are several areas in which this tool can serve. Here are a few tips for using the Schema Conversion Tool. Firstly, the output should almost never be used directly. Consider it more like reference results, from which you make your own adjustments based on your understanding of the data, its purpose, and the way the data will be used in the cloud. Secondly, on-premise, the tables were probably selected by users expecting quick, short results about some concrete data domain entity. In the cloud, the data might be selected for analytical purposes. For example, database indexes that worked well in the on-premise database will now be useless and will definitely not improve the performance of the DB system under this new usage. Similarly, you might want to partition the data on the target system differently than it was partitioned on the source system. It might also be good to consider doing some data transformations during the migration process, which basically means changing the target data model for some tables (so that they are no longer 1:1 copies). Later on, the transformation rules will need to be implemented in the migration tool.
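To make the “adjust, do not copy” point concrete, here is a minimal sketch of what a hand-tuned target DDL might look like instead of the 1:1 output. Everything in it is hypothetical: the `sales` table, the partitioning column, and the connection details are illustrative assumptions, and it assumes a PostgreSQL-compatible target accessed via psycopg2.

```python
# Hypothetical example: applying a hand-adjusted target schema instead of the raw
# converted output. The table, columns, and partitioning scheme are illustrative only.
import psycopg2

TARGET_DDL = """
-- Analytical target: range-partitioned by date, no OLTP-style secondary indexes.
CREATE TABLE IF NOT EXISTS sales (
    sale_id     BIGINT        NOT NULL,
    customer_id BIGINT        NOT NULL,
    sale_date   DATE          NOT NULL,
    amount      NUMERIC(12,2)
) PARTITION BY RANGE (sale_date);

CREATE TABLE IF NOT EXISTS sales_2024 PARTITION OF sales
    FOR VALUES FROM ('2024-01-01') TO ('2025-01-01');
"""

def deploy_target_schema(dsn: str) -> None:
    """Run the adjusted DDL on the target database before the migration starts loading."""
    with psycopg2.connect(dsn) as conn:
        with conn.cursor() as cur:
            cur.execute(TARGET_DDL)

if __name__ == "__main__":
    # Placeholder connection string; replace with your target database details.
    deploy_target_schema("host=my-target-endpoint dbname=dwh user=admin password=secret")
```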
Configuring the migration tool
If the source and target databases are of the same type (e.g., Oracle on-premise vs. Oracle in AWS, PostgreSQL vs. Aurora PostgreSQL, etc.), then it is best to use a dedicated migration tool that the given database supports natively (e.g., data pump exports and imports, Oracle GoldenGate, etc.). However, in most cases, the source and target databases won’t be compatible, and then the obvious tool of choice is AWS Database Migration Service. AWS DMS basically allows you to configure a list of tasks on the table level, each of which defines:
- The exact source database and table to connect to.
- The statement specifications that will be used to obtain the data for the target table.
- The transformation rules (if any), defining how the source data shall be mapped to the target table data (if not 1:1).
- The exact target database and table to load the data into.
The DMS task configuration is done in a user-friendly format like JSON. In the simplest scenario, all you need to do is run the deployment scripts on the target database and start the DMS task. But there is far more to it than that.
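As an illustration of what such a task definition can look like in practice, here is a minimal sketch using boto3 and the JSON table-mapping format that DMS expects. The schema name, table name, ARNs, and the column rename are hypothetical placeholders, not values from the project described here.

```python
# Minimal sketch of defining and starting a DMS task with boto3.
# All identifiers, ARNs, schema and table names below are hypothetical placeholders.
import json
import boto3

dms = boto3.client("dms")

# Table mappings: which source tables to pick up and how to transform them.
table_mappings = {
    "rules": [
        {
            "rule-type": "selection",
            "rule-id": "1",
            "rule-name": "include-sales",
            "object-locator": {"schema-name": "SALES_OWNER", "table-name": "SALES"},
            "rule-action": "include",
        },
        {
            # Example of a non-1:1 mapping: rename a column on the way in.
            "rule-type": "transformation",
            "rule-id": "2",
            "rule-name": "rename-amount",
            "rule-target": "column",
            "object-locator": {
                "schema-name": "SALES_OWNER",
                "table-name": "SALES",
                "column-name": "AMT",
            },
            "rule-action": "rename",
            "value": "amount",
        },
    ]
}

task = dms.create_replication_task(
    ReplicationTaskIdentifier="sales-full-load",
    SourceEndpointArn="arn:aws:dms:...:endpoint:SOURCE",
    TargetEndpointArn="arn:aws:dms:...:endpoint:TARGET",
    ReplicationInstanceArn="arn:aws:dms:...:rep:INSTANCE",
    MigrationType="full-load",  # or "cdc" / "full-load-and-cdc", more on that below
    TableMappings=json.dumps(table_mappings),
)

dms.start_replication_task(
    ReplicationTaskArn=task["ReplicationTask"]["ReplicationTaskArn"],
    StartReplicationTaskType="start-replication",
)
```

In practice, these mapping documents (together with the target DDL scripts) are the artifacts you keep refining as the target data model diverges from the 1:1 copy.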
One-time Full Data Migration
The easiest case to execute is when the request is to move the whole database into the target cloud database once. Then, basically, all that is necessary is to configure the DMS tasks for a full load and run them. If the DMS configuration is done well, nothing bad should happen in this scenario. Every single source table will be picked up and copied over into the AWS target database. The only concerns are the performance of the activity and making sure the sizing is right at every step, so that it does not fail due to insufficient storage space.
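Keeping an eye on progress and sizing is mostly a monitoring exercise. Below is a small sketch of how one might poll task and table statistics with boto3 so that problems surface early; the task ARN is a hypothetical placeholder.

```python
# Sketch: polling a full-load DMS task so sizing and performance issues surface early.
# The task ARN is a hypothetical placeholder.
import time
import boto3

dms = boto3.client("dms")
TASK_ARN = "arn:aws:dms:...:task:SALES-FULL-LOAD"

def wait_for_full_load(task_arn: str, poll_seconds: int = 60) -> None:
    while True:
        task = dms.describe_replication_tasks(
            Filters=[{"Name": "replication-task-arn", "Values": [task_arn]}]
        )["ReplicationTasks"][0]
        status = task["Status"]
        progress = task.get("ReplicationTaskStats", {}).get("FullLoadProgressPercent", 0)
        print(f"status={status} progress={progress}%")

        if status in ("stopped", "failed"):
            break
        time.sleep(poll_seconds)

    # Per-table row counts and states help spot tables that need more attention.
    for stats in dms.describe_table_statistics(ReplicationTaskArn=task_arn)["TableStatistics"]:
        print(stats["SchemaName"], stats["TableName"], stats["FullLoadRows"], stats["TableState"])

wait_for_full_load(TASK_ARN)
```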
Incremental Daily Synchronization
This is where things start to get complicated. I mean, if the world were ideal, it would probably work just fine all the time. But the world is never ideal. DMS can be configured to operate in two modes:
- Full load – the default mode described and used above. The DMS tasks start either when you start them or when they are scheduled to start. Once finished, the DMS tasks are done.
- Change Data Capture (CDC) – in this mode, the DMS tasks run continuously. DMS scans the source database for changes on the table level. When a change happens, it immediately tries to replicate that change in the target database, based on the configuration inside the DMS task related to the changed table.
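In terms of the task definition shown earlier, the difference is essentially the migration type and, for CDC, the point from which to start reading changes. A minimal sketch, where the ARNs, the mappings file, and the start time are all placeholders:

```python
# Sketch: the same kind of task definition, switched to continuous replication (CDC).
# ARNs, the mappings file, and the start time are hypothetical placeholders.
import datetime
import boto3

dms = boto3.client("dms")

dms.create_replication_task(
    ReplicationTaskIdentifier="sales-cdc",
    SourceEndpointArn="arn:aws:dms:...:endpoint:SOURCE",
    TargetEndpointArn="arn:aws:dms:...:endpoint:TARGET",
    ReplicationInstanceArn="arn:aws:dms:...:rep:INSTANCE",
    MigrationType="cdc",  # "full-load" and "full-load-and-cdc" are the other options
    TableMappings=open("table_mappings.json").read(),
    CdcStartTime=datetime.datetime(2024, 1, 1),  # where in the change stream to begin
)
```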
When going for CDC, you need to make yet another choice – namely, how the CDC will extract the delta changes from the source DB.
#1. Oracle Redo Logs Reader
One option is to use the native redo log reader from Oracle, which CDC can utilize to get the changed data and, based on the latest changes, replicate the same changes on the target database. While this might look like the obvious choice when dealing with Oracle as the source, there is a catch: the Oracle redo log reader utilizes the source Oracle cluster and so directly affects all the other activities running in the database (it actually creates active sessions in the database). The more DMS tasks you have configured (or the more DMS clusters running in parallel), the more you will probably need to upsize the Oracle cluster – basically, adjust the vertical scaling of your primary Oracle database cluster. This will surely influence the total cost of the solution, even more so if the daily synchronization is going to stay with the project for a long period of time.
#2. AWS DMS Log Miner
Unlike the option above, this is a native AWS solution to the same problem. In this case, DMS does not affect the source Oracle DB. Instead, it copies the Oracle redo logs onto the DMS cluster and does all the processing there. While it saves Oracle resources, it is the slower solution, as more operations are involved. And, as one can easily assume, the custom reader for Oracle redo logs is probably slower at its job than the native reader from Oracle. Depending on the size of the source database and the number of daily changes, in the best-case scenario you might end up with almost real-time incremental synchronization of the data from the on-premise Oracle database into the AWS cloud database. In any other scenario it won’t be near real-time synchronization, but you can try to get as close as possible to the accepted delay (between source and target) by tuning the performance configuration and parallelism of the source and target clusters, or by experimenting with the number of DMS tasks and their distribution between the CDC instances. You will also want to learn which source table changes are supported by CDC (the addition of a column, for example), because not all possible changes are. In some cases, the only way forward is to make the target table change manually and restart the CDC task from scratch (losing all existing data in the target database along the way).
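Whichever of the two approaches you settle on, the choice is typically expressed on the Oracle source endpoint via extra connection attributes. The sketch below uses the commonly documented `useLogminerReader` and `useBfile` settings; treat the exact attribute names and the ARN as assumptions to verify against the AWS documentation for your DMS version.

```python
# Sketch: switching how DMS reads the Oracle redo logs, configured on the source endpoint.
# Attribute names follow the commonly documented DMS Oracle settings; verify them
# against the AWS documentation for your DMS version. The ARN is a placeholder.
import boto3

dms = boto3.client("dms")

# Read and process the redo logs on the replication instance instead of inside the
# Oracle cluster (flip both values to push the reading work back to Oracle).
dms.modify_endpoint(
    EndpointArn="arn:aws:dms:...:endpoint:ORACLE-SOURCE",
    ExtraConnectionAttributes="useLogminerReader=N;useBfile=Y",
)
```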
When Things Go Wrong, No Matter What
I learned this the hard way, but there is one specific scenario connected to DMS where the promise of daily replication is hard to achieve. DMS can process the redo logs only at a certain speed. It does not matter how many DMS instances are executing your tasks: each DMS instance reads the redo logs at a single defined speed, and each of them must read them in full. It does not even matter whether you use the Oracle redo log reader or the AWS log miner; both have this limit. If the source database produces so many changes within a day that the Oracle redo logs get really crazy big (like 500GB+ big) every single day, CDC is just not going to work. The replication will not be completed before the end of the day, and it will carry unprocessed work into the next day, where a new set of changes to be replicated is already waiting. The amount of unprocessed data will only grow from day to day. In this particular case, CDC was not an option (after many performance tests and tries we executed). The only way to ensure that at least all delta changes from the current day would be replicated on the same day was to approach it like this:
- Separate the really big tables that are not used so often and replicate them only once per week (e.g., during weekends).
- Configure the replication of the not-so-big-but-still-big tables to be split between several DMS tasks; one table was eventually migrated by 10 or more separate DMS tasks in parallel, ensuring the data split between the tasks is distinct (custom coding involved here, see the sketch after this list), and execute them daily.
- Add more instances of DMS (up to 4 in this case) and split the DMS tasks between them evenly, which means not only by the number of tables but also by their size.
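The “custom coding” mentioned above boiled down to generating several task definitions per table, each selecting a distinct, non-overlapping slice of the data. The sketch below is only an illustration of that idea, not the original code: the table, the numeric split column, the ranges, and the ARNs are hypothetical, and it relies on the source filters that DMS selection rules support.

```python
# Sketch: splitting one large table across several full-load DMS tasks,
# each task picking a distinct, non-overlapping range of a numeric column.
# Table, column, ranges, and ARNs are hypothetical placeholders.
import json
import boto3

dms = boto3.client("dms")

SCHEMA, TABLE, SPLIT_COLUMN = "SALES_OWNER", "SALES", "SALE_ID"
# Distinct, contiguous ranges covering the whole table (derived from source statistics).
RANGES = [(1, 10_000_000), (10_000_001, 20_000_000), (20_000_001, 30_000_000)]

def range_mapping(start: int, end: int, rule_id: int) -> str:
    """Build a table-mapping document that selects only rows within [start, end]."""
    return json.dumps({
        "rules": [{
            "rule-type": "selection",
            "rule-id": str(rule_id),
            "rule-name": f"{TABLE}-slice-{rule_id}",
            "object-locator": {"schema-name": SCHEMA, "table-name": TABLE},
            "rule-action": "include",
            "filters": [{
                "filter-type": "source",
                "column-name": SPLIT_COLUMN,
                "filter-conditions": [{
                    "filter-operator": "between",
                    "start-value": str(start),
                    "end-value": str(end),
                }],
            }],
        }]
    })

for i, (start, end) in enumerate(RANGES, start=1):
    dms.create_replication_task(
        ReplicationTaskIdentifier=f"sales-slice-{i}",
        SourceEndpointArn="arn:aws:dms:...:endpoint:SOURCE",
        TargetEndpointArn="arn:aws:dms:...:endpoint:TARGET",
        ReplicationInstanceArn="arn:aws:dms:...:rep:INSTANCE",
        MigrationType="full-load",
        TableMappings=range_mapping(start, end, rule_id=i),
    )
```

The important part is that the ranges are distinct and together cover the whole table, so the parallel tasks never load the same rows twice.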
Basically, we used the full-load mode of DMS to replicate the daily data, because that was the only way to achieve at least same-day completion of the data replication. Not a perfect solution, but it is still in place, and even after many years it still works the same way. So maybe not that bad a solution after all. 😃