Why Gemfury? Push, build, and install  RubyGems npm packages Python packages Maven artifacts PHP packages Go Modules Debian packages RPM packages NuGet packages

Repository URL to install this package:

Details    
  bin
  lib
  Gemfile
  README.md
  uoy-faculty-datasync.gemspec
Size: Mime:
  README.md

DataSync

This gem provides data synchronization facilities for databases.

Installation

Add these lines to your application's Gemfile:

source 'https://gem.fury.io/universityofyork/' do
  gem 'uoy-faculty-datasync', '~> 0.1', require: 'datasync'
end

And then execute:

$ bundle

Or install it yourself as:

$ gem install uoy-faculty-datasync

Usage

The gem fundamentally is about synchronizing data between databases. There are four types of mapper - implicit, source table, target table, and link table.

Mapper types

There are fundamentally four mapper types:

  • ImplicitMapper uses the ID column in both records to imply links between the objects. For example, usernames in the source and target system.
  • TargetLinkMapper uses a column in the target table to refer to the source object it's linked to. The column must be able to store the source record ID in its native form, and must be nullable. If a default value is set it should be NULL.
  • SourceLinkMapper is similar to TargetLinkMapper but stores the target record ID in the source table. The same constraints exist for the link column.
  • LinkTableMapper uses a separate table to store the links independently of the source and target records. It must contain three columns, none of which are nullable:
    • source_id - the source ID in its native format
    • target_id - the target ID in its native format
    • mapping - the mapping name as text

Defining a mapping

To define a mapping, use the DSL:

require 'datasync'

my_mapping = Datasync.create_mapper(Datasync::ImplicitMapper) do
  title 'Implicit mapping'
  source(:username) { DB[:source].where(active: true) }
  target(:username) { DB[:target] }

  match_using :eppn
  match_when :unlinked_source, :unlinked_target

  limit unlinked_target: 100, total: 10000

  parser(:forename, &:capitalize)
  parser(:surname, &:capitalize)
  mapping do
    map username: username
    map eppn: [username, 'york.ac.uk'].join('@')
    map display_name: [forename, surname].join(' ')
  end
  formatter(:username, &:downcase)

  on(:unlinked_source) { insert }
  on(:linked) { update }
  on(:matched) do
    link
    update
  end
  on(:unlinked_target) { delete }
end

Most of the options are mandatory:

  • title specifies a human-readable title
  • source and target set the source and target unique ID column, and a block which returns the source or target data as a Sequel::Dataset
  • match_using specifies the mapped object fields to use for matching
  • match_when specifies which records can be eligible for matching
  • limit allows you to limit the total number of links permitted in each state, as well as the overall total or number of errors.
  • parser(:field) { |value| ... } specifies a block used to parse a source field
  • mapping specifies the field mapping in terms of parsed source object fields, which are in scope. Additionally, @source and @target are in scope for the input and output, respectively.
  • formatter(:field) { |value| ... } specifies a block used to format a target field
  • on(:situation) { ... } specifies how to handle a particular situation. The helpers insert and update use the in-scope object and take hash arguments to append metadata to the target object if you need to. link and unlink affect linking where it is used, and delete is available to remove the target. You can also use raw_source, source and object to get the raw source object, parsed source object, and mapped object, and target_row as a Sequel::Dataset. Both @source_id and target_id are also in scope.
  • target_key_from :object specifies that the transformed object contains the target id column already
  • target_key_from :database specifies that the target id column is produced by the database on insert (typically an autoincrementing primary key). This is the default.

Synchronization situations

The following situations can occur in all mappers:

  • unlinked_source: A source record exists but is not linked to a target
  • unlinked_target: A target record exists but is not linked to a source
  • linked: A source and target record exists and are linked
  • matched: A source and target record which were eligible for matching have been matched up, and unlinked if they were previously linked to other records

The following situations can only occur in certain mappers:

  • missing_target: A source record exists and is linked to a target record which is missing (occurs in LinkTableMapper, SourceLinkMapper)
  • missing_source: A target record exists and is linked to a source record which is missing (occurs in LinkTableMapper, TargetLinkMapper)
  • orphan: A link exists, but both the source and target records are missing (occurs in LinkTableMapper only)

Running a synchronization

Once you have a defined mapping, you can call my_mapping.sync! on it. This synchronizes the data according to the specified rules, and returns the SyncRunner object, which you can interrogate for information:

  • The total sync time in seconds: #runtime
  • The start and end times: #start_time, #end_time
  • The total number of each link situation, plus the overall total and number of handler errors: #totals
  • The raw link states, as a hash: #link_states
  • Whether any errors occurred: #ok?
  • All errors from the run: #errors
  • Whether the sync aborted: #aborted?
  • The error which aborted the sync: #terminating_exception
  • A report (as an array of strings): #report

Additional mapper options

There are some options which can be defined in the create_mapper block for some of the types:

  # Options for TargetLinkMapper
  option link_column: :source_id

  # Options for SourceLinkMapper
  option link_column: :target_id

  # Options for LinkTableMapper
  option(:link_table) { DB[:link] }
  option mapping_name: 'test-mapping'

Link column specifies the column which will be managed by Datasync to store the references to the other table.

Link table is used to specify a table (Sequel::Dataset) that Datasync can use to store the links between source and target records itself. It also contains a column so that several mappings can be stored in one table, where the IDs are of the same type.

Development

You can run the spec standalone against an in-memory SQLite3 database. To do this, run bundle install and then bundle exec rake spec.

To run against a postgres image (as in CI), first install the bundle, using docker compose run --rm app bundle install, and then run the spec with docker compose run --rm app. Between runs, run docker-compose down to remove the database so you have a fresh empty copy for every run.

Contributing

The repository is at https://github.com/university-of-york/faculty-dev-datasync-gem