Optimal Sync

About

Optimal Sync is a multi-platform[1] BPM utility[2] that transfers data between different sources and destinations. It is also able to perform some transformations along the way.

Workflow

To merge the source data into the destination data, do the following:

  • Select source (file, spreadsheet, web page or database table)
  • Select destination (file (can be HTML), spreadsheet or database table)
  • Map source fields to destination fields
  • Select key fields (not always required, but usually better to)
  • If necessary, add some transformations. (Remember that you can iterate through the data using the arrow keys.)
  • Select what actions the merge should make to the destination data (insert, update, delete)
  • Preview the merge result (what would happen if you committed now)
  • Save the definition XML
  • Either commit the merge later via scripting (optimal_sync -d definition.xml) or click "Commit merge" to do it right away.


Data sources/destinations

  • Database servers:
    • PostgreSQL
    • MySQL
    • SQL Server
    • IBM DB2
    • Oracle
  • File formats
    • Flat file formats (CSV)
      • All options exposed by the Python csv module (fixed-width files scheduled for 0.9)
    • XML
      • Flexible mapping using a row-level XPath and relative field-level XPaths.
    • HTML
      • Same mapping as XML, with the ability to read badly formed HTML
    • Spreadsheets
      • Excel (read-only; writing scheduled for 0.9)

Actions

There are three possible actions on each key field match or non-match:

  • insert - rows existing in the source but not in the destination are added
  • update - rows existing in both structures are updated if they differ
  • delete - rows existing in the destination but not in the source are deleted

If all these actions are set to occur, the destination dataset is made into a duplicate of the source, given that all fields are mapped, since the update action only affects the mapped fields (see the sketch below).


  • Updates and deletes can only occur if keys are specified.
  • Inserts can occur without keys if the target is empty. It is however good form to use keys anyway.
  • If the destination file or database table doesn't exist, it will be created. Database table column data types are inferred from the source data with the least possible amount of intelligence: only the data in the first row is examined.
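
To make these semantics concrete, here is a minimal Python sketch of how a key-based merge could decide between the three actions. The function and variable names are invented for this illustration; this is not Optimal Sync's actual implementation.

# Illustrative sketch of key-based merge semantics, not Optimal Sync's code.
# Rows are dicts; "keys" names the key fields used to match rows.

def plan_merge(source, destination, keys,
               do_insert=True, do_update=True, do_delete=True):
    src = {tuple(row[k] for k in keys): row for row in source}
    dst = {tuple(row[k] for k in keys): row for row in destination}

    inserts = [r for k, r in src.items() if k not in dst] if do_insert else []
    updates = [r for k, r in src.items()
               if k in dst and r != dst[k]] if do_update else []
    deletes = [r for k, r in dst.items() if k not in src] if do_delete else []
    return inserts, updates, deletes

source = [{"id": 1, "name": "Ann"}, {"id": 2, "name": "Bo"}]
destination = [{"id": 2, "name": "Bob"}, {"id": 3, "name": "Cy"}]

# With all actions enabled, the destination is made a duplicate of the source:
# id 1 is inserted, id 2 is updated, id 3 is deleted.
print(plan_merge(source, destination, keys=["id"]))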

Field mappings

Fields are mapped using a reference string:

  • in a database or a file with a header, it is a field name
  • in a header-less file, it is a column index reference
  • in XML/HTML, it is an XPath.

The XPath usage for XML/HTML might need some explaining, since it can reach several levels of data, greatly simplifying the iteration.

XML mapping example, using some customer data:

/customers/customer - For each node matching this path, a new row is generated.
                   firstname - A relative path to the /customers/customer/firstname node value
                   contact/phone/home - A relative path to the /customers/customer/contact/phone/home node value
                   contact/phone/work - A relative path to the /customers/customer/contact/phone/work node value

This setup would loop over all customers and map the three references to, for example, a flat database or file structure. Of course, it works in both directions.
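
For illustration, here is a minimal Python sketch of the same extraction using lxml. The XML sample is invented for this example, and the code is not Optimal Sync's internal implementation.

# Illustrative sketch of row-level plus relative XPath extraction with lxml;
# the sample data is invented, and this is not Optimal Sync's internal code.
from lxml import etree

xml = b"""<customers>
  <customer>
    <firstname>Ann</firstname>
    <contact><phone><home>555-1</home><work>555-2</work></phone></contact>
  </customer>
</customers>"""

root = etree.fromstring(xml)
fields = ["firstname", "contact/phone/home", "contact/phone/work"]

rows = []
for node in root.xpath("/customers/customer"):               # row-level XPath
    rows.append([node.findtext(field) for field in fields])  # relative paths

print(rows)  # [['Ann', '555-1', '555-2']]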

Transformations

For each field mapping, a chain of any number of the following transformations is supported:

  • Replace - Replace string A with B, an optional C number of times
  • Regular Expressions Replace - Replace pattern A with value B, an optional C number of times
  • Cast - Cast from one data type to another, optionally using format strings
  • If empty - If a field is empty, insert the provided value
  • Trim - Trim whitespace around the value, optionally specifying beginning, end or both.
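
As an illustration of how such a chain composes, here is a minimal Python sketch. The helper names are invented for this example and do not reflect Optimal Sync's actual API.

# Illustrative sketch of chained field transformations; the helper names
# are invented and do not reflect Optimal Sync's actual API.
import re

def replace(a, b, count=-1):                 # -1 = replace all occurrences
    return lambda v: v.replace(a, b, count)

def regexp_replace(pattern, repl, count=0):  # 0 = replace all occurrences
    return lambda v: re.sub(pattern, repl, v, count=count)

def if_empty(default):
    return lambda v: v if v else default

def trim(where="both"):
    return {"both": str.strip, "beginning": str.lstrip, "end": str.rstrip}[where]

def apply_chain(value, chain):
    for transform in chain:
        value = transform(value)
    return value

# For example, cleaning up "mr. Smith " to "Smith":
print(apply_chain("mr. Smith ", [replace("mr. ", ""), trim()]))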

Substitution

The following substitution variables are available:

  •  ::identity::
  •  ::uuid:: A random UUID (GUID).
  •  ::username:: The name of the logged-in user
  •  ::curr_datetime:: The current date and time
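
As an illustration, here is a minimal Python sketch of how variables on the ::name:: form could be resolved. ::identity:: is omitted since its semantics are not described above, and this is not Optimal Sync's implementation.

# Illustrative resolution of ::name:: substitution variables;
# not Optimal Sync's implementation.
import datetime
import getpass
import re
import uuid

RESOLVERS = {
    "uuid": lambda: str(uuid.uuid4()),
    "username": getpass.getuser,
    "curr_datetime": lambda: datetime.datetime.now().isoformat(),
}

def substitute(text):
    def resolve(match):
        name = match.group(1)
        return RESOLVERS[name]() if name in RESOLVERS else match.group(0)
    return re.sub(r"::(\w+)::", resolve, text)

print(substitute("Imported by ::username:: at ::curr_datetime::"))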

GUI

If optimal_sync.py is run with the -e parameter, the GUI is shown (if applicable to the environment, of course; this is under Ubuntu 14.04.1 LTS, and more platforms will be added).

[Screenshot: OS csv to csv.png]

The application has an easy-to-use GUI in which the XML files are built.

Scripting

If optimal_sync.py is run without the -e parameter, it just runs the merge/replication without bringing up a GUI, acting as a headless script. It takes the following parameters:

-d, --definition file    replication definition to be run
-e                       initialize the editor (GUI)
-l, --log_level          log level
--help                   display this help and exit

Usage

Please start by downloading the example files from Sourceforge.

unzip optimalbpm-tools-examples.zip

Since the user is not likely to have the servers installed on the local host, the examples are not likely to all work as-is; rather, they should be regarded as inspiration.

Creating a definition file

Clicking "Load" loads the definition file. "Save" saves it. To create a completely empty definition do just:

optimal_sync -e

...and edit it to your liking (combining -e with -d opens the specified file directly).

Non-obvious things in the GUI

  • Source/Destination dataset type defines the kind of data source or destination.
    • RDBMS - Relational databases
      • Scan and select uses Nmap to scan the network defined in "IP (for subnet scan)".
        If you are in an enterprise setting, it is advisable to check with your IT department so that port-scanning alarms aren't triggered.
      • The database is a drop-down box not because it actually shows the available databases, but because it used to.
    • XML/XPath
      • Rows XPath refers to the parent node, or a list of nodes, which constitute the rows in the dataset.
        W3Schools has a great tutorial on XPath; it is really not difficult. Also see the XPath example above.
    • Flatfile
      • The properties of a flatfile are basically the relevant properties of the standard Python csv module.
    • Spreadsheet
      • Until version 0.9, it is only possible to read from spreadsheets, and only Microsoft Excel spreadsheets at that. However, this is likely to be the most common use case.
  • Mappings are connections between data in the source dataset and their destinations in the destination dataset.
    • For XPath datasets, mappings are relative to the row node.
    • For merges to work properly, it is always necessary to define one or more key fields (the checkbox, currently max 3) that can be used to identify rows in the source and the destination.
      Without keys, the system cannot determine with certainty whether a row already exists in the destination dataset.
  • Transformations - shown in the right pane when a mapping is clicked, they detail what changes are to be made to a field between its source and its destination.
    • Common usages of transformations
      • Cleaning up strings, like removing "mr." from "mr. Smith" (using the "Replace" transformation)
      • Making key fields match, like trimming extra characters off an id (using the "Trim" transformation)
  • Preview merge - simulates the merging of the data from the source into the destination.
    • The "Delete", "Insert" and "Update"-checkboxes control what actions the merge is allowed to do. Checking
      • Insert only adds new data
      • Insert and Update add new and updates existing data in the mapped fields.
      • Only Delete removes all in Destination that doesn't exist in source.
      • Insert, Update and Delete makes the destination look like the source dataset(in the mapped fields)
      • There is no "Commit" button for security reasons. Anyway it is very easy to run as a script (optimal_sync -d filename.xml).
        Probably a commit button will be added when logging has been sorted, likely in 0.9.
    • The window in the bottom details what happened what was affected and what action would be carried out if run.
  • Annoyances
    • If the data types of the key fields in the respective datasets do not match, you might get an unorderable field error (will be fixed in 0.9).
  • Gotchas
    • Think carefully about when to use relative path names and when not to. Some files might be relative to the position of the definition file (../../files/input.xml), and some at a specific position in the system (/var/files/input.xml).
    • Don't forget to subscribe to product updates, and if there is stuff you don't understand, create a ticket and someone is likely to answer.

Execute

Using the -d parameter without the -e parameter executes the definition and carries out the changes:

optimal_sync.py -d web_test.xml

Installation

Linux, from packages

As there is not yet a repository available, using gdebi is recommended to avoid having to walk through all dependencies manually (see note 2 below). To install gdebi, use:

sudo apt-get install gdebi

Install packages, available above, in the following order:

sudo gdebi python3-qal_0.4.0_all.deb
sudo gdebi optimalbpm-tools_0.9.0_all.deb

There are also some example files; use unzip to unpack them:

unzip optimalbpm-tools-examples.zip

Notes:

  1. Implicitly, the install requires Ubuntu Trusty or Debian Jessie, or any other distribution with python3-postgresql 1.1.0 (this is because python3-postgresql 1.0.2 is broken: https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=724597).
  2. gdebi has quite a lot of dependencies, so on a very tight system you might want to use dpkg -i and manually install the dependencies instead.

Windows, installation files

To install, first uninstall any earlier versions, then download and run optimalbpm-tools-bundle_0.9.0.exe, which installs the Optimal BPM Tools package and its dependencies (see below).

To install the example files, select "custom" installation and choose "Will be installed on local hard drive". Don't forget to choose an install location; it is recommended to choose a non-read-only location where one can easily access the files, like the home folder.

Notes:

  1. The Oracle and DB2 client libraries have to be installed manually until the legality of redistribution has been researched.
    See: http://www.optimalbpm.se/wiki/index.php/QAL.DAL for instructions for some versions.
  2. These packages are installed:
    1. Python 3.4 (into its default directory at C:\Python34).
    2. prerequisites.0.9.0 - A staging package that installs binary lxml and pyodbc* and fixes paths
    3. Using pip install, it installs QAL, Py-Postgresql, PyMySQL, xlrd, xlwt** and openpyxl

*) The current lxml and pyodbc easy_install packages require Visual Studio components to be installed.
**) The xlwt installation doesn't work for Python 3 yet; however, the pip package soon will. Optimal BPM has been tested with the Python 3 branch. This means that old-timey .xls files can't be written to yet without some manual work.

From source

git clone git://git.code.sf.net/p/optimalbpm/code optimalbpm-code

Dependencies

Optimal Sync, as the rest of Optimal BPM, is a Python 3 application, which means that it requires Python 3 to be installed. Version 3.4 is recommended, as it includes setuptools and pip.

On Windows, the prerequisites can be installed with Chocolatey:

choco install python3 easy.install pip ActiveTcl nmap

  • Python packages (use pip3 if you have several Python versions installed):
pip install qal
pip install py-postgresql
pip install pymysql
pip install xlrd
pip install xlwt
pip install openpyxl
pip install cx_oracle
pip install python-nmap

QAL might need some additional packages to be able to connect to or use some data sources. See the QAL DAL documentation for details.

Note: Nmap is only used for auto-detecting database servers; if you know all addresses, it can be omitted. Some IT departments aren't aware of the beneficial uses of Nmap, so running it could cause issues.

Footnotes

  1. Unix, Windows, Mac OS X
  2. BPM - Business Process Management