Many companies receive their data from multiple sources and often store it in different databases. These databases may require to be synchronized at the particular time basis (daily, hourly, etc). The main challenge of such sync task is to achieve almost real-time update of modified data in both source and target databases.
Straight forward approach to database synchronization involves full scanning of the original and the destination databases to find all the data that was added, modified or removed. After this procedure the following actions are performed over selected rows:
- Insert all records from source tables that are missing the destination database
- Delete all records missing in source tables from the target database
- Update all records that have been modified
The main issue of this method is that the synchronization process can take extremely long time on large databases. Let’s consider two databases living on SQL Server and MySQL and containing more than one million rows. The average performance of the synchronization process described above is about 100 rows per second and it cannot be improved since this method required row-by-row analysis over the data. Every pass of synchronization going this way takes over 2 and a half hours for databases considered above, which makes it impossible to run on an hourly basis.
Fortunately, there is another approach to real-time synchronization. The main idea is to synchronize only those rows which has been modified since the last run of synchronization process. This method can be implemented using algorithm that is known as incremental or trigger-based synchronization:
- When running the first time the program does straight forward database synchronization if it is necessary
- Also, triggers on insert, delete and update are generated for every table being synchronized. Each trigger writes information about modified rows into a special service table
- Starting from the second run the program gets information about modified rows from the service table and updates these rows in the destination database
Now, it’s time to estimate how incremental method can reduce the duration of sync procedure on the example of MS SQL and MySQL databases considered above. The incremental synchronization can be run in near to real time mode even with record-by-record processing, since it will have to process the modified part of database instead of iterating millions of rows. However, synchronization performance can be increased even more via bulk processing method. It is possible to implement this method because all modified data is stored in one table with order by type of required modification – insert, delete or update. The bulk processing can increase performance of the synchronization process in more than 10 times.
Incremental synchronization is the most efficient way to keep all data up-to-date but it also applies some restrictions on source and destination databases. Trigger-based synchronization method requires:
- sufficient privileges for synchronized databases to create triggers and service table
- each table being synchronized must have primary key or unique index
The overall advantages of incremental synchronization can be tested using SQL Server and MySQL Sync tool developed by Intelligent Converters.
Comments