GEDCOM Match and Merge


Introduction

You sometimes find distant relatives who have a GEDCOM you want to merge into your database. You can't just add their file to the end of yours. There are 2 main problems:

  1. The same INDI and FAM IDs (e.g. @I99@) may be used in both databases.
  2. You will get duplicate people/families (INDIs and FAMs).

The first problem is easy to solve. As the program reads in the new GEDCOM, it can generate unique IDs and replace the ones it reads in.
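
As a rough illustration of that ID replacement, here is a minimal Python sketch. It is not GDBI code: the function name remap_ids, the used_ids set, and the numbering scheme are all assumptions made for the example, and it ignores GEDCOM edge cases such as escaped at-signs.

```python
import re

# Matches GEDCOM cross-reference IDs such as @I99@ or @F12@.
XREF = re.compile(r"@([A-Za-z0-9_]+)@")


def remap_ids(import_lines, used_ids):
    """Return import_lines with every xref ID replaced by one not in used_ids.

    used_ids is the set of IDs (without the @ signs) already taken in the
    primary database. It is updated in place as new IDs are handed out.
    """
    mapping = {}   # old import ID -> freshly generated ID
    counter = 1

    def new_id(match):
        nonlocal counter
        old = match.group(1)
        if old not in mapping:
            prefix = old[0] if old[0].isalpha() else "X"
            # Keep counting until the candidate is free in the primary database.
            while f"{prefix}{counter}" in used_ids:
                counter += 1
            mapping[old] = f"{prefix}{counter}"
            used_ids.add(mapping[old])
        return f"@{mapping[old]}@"

    return [XREF.sub(new_id, line) for line in import_lines]
```

Because the mapping is remembered, every later reference to an old ID (for example a HUSB line pointing at @I1@) is rewritten to the same new ID as the record itself.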

You almost always have the second problem, because if you didn't have people in common there would be no reason to read in their file. But creating duplicates is the last thing you want.

When people ask about merging in another GEDCOM, what they really want is to integrate common people and then merge in the rest.

But before you can integrate the common people, you must first match them. It may be obvious to us (us nerds) that James T. Kirk and Jim Kirk are the same person, but a computer program would see them as different people.
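
To make the James/Jim example concrete, here is a small Python sketch of one way a program could treat the two names as equal. The nickname table and the function names are hypothetical, and the names are assumed to be plain text rather than GEDCOM's slash-delimited form.

```python
# Hypothetical nickname table; a real one would be much larger.
NICKNAMES = {
    "jim": "james",
    "jimmy": "james",
    "bill": "william",
    "betty": "elizabeth",
}


def name_key(full_name):
    """Reduce a name to (canonical first name, surname), ignoring middle parts."""
    parts = full_name.lower().replace(".", "").split()
    if not parts:
        return ("", "")
    first = NICKNAMES.get(parts[0], parts[0])
    return (first, parts[-1])


def same_name(a, b):
    """True when both names reduce to the same key."""
    return name_key(a) == name_key(b)
```

With this table, same_name("James T. Kirk", "Jim Kirk") is true, while a plain string comparison of the two names is not.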

After you match everyone up, there is still one more problem. In some cases the imported details are completely wrong! Even when the differences are minor, like James vs. Jim, you still have to decide which details to accept.

This means that a GEDCOM merge actually consists of 4 main steps:

  1. Find all matching common people.
  2. Choose the details to accept.
  3. Merge in the accepted details for existing people.
  4. Merge in the rest of the people they have.

Details on the 4 Main Steps

  1. Find all matching common people.

    This is the hardest part of the merge process. The first problem is getting started. Once you have the same person identified in both GEDCOMs, the program can move up and down the family tree matching everyone else.

    At each point in the family tree, the program can present side-by-side families, and let the user verify they match up. If a man was married twice, he might be paired up with the wrong wife.

                Primary    Match    Import
    Husband     Allen      OK       Allen
    Wife        Betty      ?        Francis
    Child 1     Charles    ?        Gary
    Child 2     Daniel     ?        Henry

    When you reject the Wife match, the program should then find the other wife and show that family. But in the 2nd GEDCOM the children may not be listed in the right birth order.

                Primary    Match    Import
    Husband     Allen      OK       Allen
    Wife        Betty      OK       Betty
    Child 1     Charles    ?        Daniel
    Child 2     Daniel     ?        Charles

    When you then reject the Child 1 match, the program should rearrange the children until it finds the right match.

                Primary    Match    Import
    Husband     Allen      OK       Allen
    Wife        Betty      OK       Betty
    Child 1     Charles    OK       Charles
    Child 2     Daniel     OK       Daniel
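
    The rearranging of children can be sketched in Python as follows. This is only an illustration under assumptions: children are represented by their names alone, similarity comes from the standard difflib module, and the rejected set stands in for the pairs the user has already turned down.

```python
from difflib import SequenceMatcher


def pair_children(primary, imported, rejected=frozenset()):
    """Pair each primary child with the most similar unused imported child.

    rejected holds (primary_name, imported_name) pairs the user has refused.
    A primary child with no candidate left is paired with None.
    """
    remaining = list(imported)
    pairs = []
    for child in primary:
        candidates = [c for c in remaining if (child, c) not in rejected]
        if not candidates:
            pairs.append((child, None))
            continue
        # Pick the imported name that is textually closest to this child.
        best = max(
            candidates,
            key=lambda c: SequenceMatcher(None, child.lower(), c.lower()).ratio(),
        )
        pairs.append((child, best))
        remaining.remove(best)
    return pairs
```

    The result is only a proposal. The user still confirms or rejects each pair, and each rejection is added to the rejected set before the function is called again.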

    After this family is matched, the program should move on to the spouse and children of each member of this family.
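
    Moving outward from one matched person can be pictured as a breadth-first walk. The Python sketch below assumes a deliberately simple, hypothetical data model (each database is a dict of id -> name plus a flat list of relative ids) and pairs relatives by exact name, where a real tool would show the families side by side and ask the user.

```python
from collections import deque


def walk_matches(primary, imported, seed):
    """Breadth-first walk from a seed (primary_id, import_id) pair.

    Each database is a dict: id -> {"name": str, "relatives": [ids]}.
    Returns a dict mapping primary ids to the import ids they matched.
    """
    matched = {seed[0]: seed[1]}
    queue = deque([seed])
    while queue:
        p_id, i_id = queue.popleft()
        # Index the imported person's relatives by name for quick lookup.
        by_name = {imported[r]["name"]: r for r in imported[i_id]["relatives"]}
        for rel in primary[p_id]["relatives"]:
            other = by_name.get(primary[rel]["name"])
            if (
                other is not None
                and rel not in matched
                and other not in matched.values()
            ):
                matched[rel] = other
                queue.append((rel, other))
    return matched
```

    Each newly matched pair goes onto the queue, so its own relatives are examined in turn until no new matches turn up.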

  2. Choose the details to accept.

    ... to do ...

  3. Merge in the accepted details for existing people.

    ... to do ...

  4. Merge in the rest of the people they have.

    ... to do ...

Plan for GDBI

Adding merging to GDBI has always been part of the plan, and work on it has finally begun. At this point it is just analysis code to help decide whether 2 people match. Eventually you should be able to run GDBI, open your database, specify a GEDCOM text file, and begin the merge process (described above).

What GDBI has so far is a simple test program for matching people. The hard part of that program is comparing all the details for each possible pair of people (one from the old database, one from the new) so that it can find the matching people automatically. (Matching them all manually is tedious.) It has a basic GUI for choosing a primary and import database, and then selecting the starting person in each database. It needs to be enhanced to choose which details to take, and then finally to merge.
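
The pairwise comparison can be illustrated with a short Python sketch. It does not show how GDBI's test program works: the person dicts, the two fields compared, the weights, and the threshold are all assumptions chosen to keep the example small.

```python
from difflib import SequenceMatcher
from itertools import product


def score(a, b):
    """Score two person dicts from 0.0 to 1.0; the weights are arbitrary."""
    name = SequenceMatcher(None, a["name"].lower(), b["name"].lower()).ratio()
    same_birth = bool(a.get("birth")) and a.get("birth") == b.get("birth")
    return 0.7 * name + 0.3 * (1.0 if same_birth else 0.0)


def best_matches(primary, imported, threshold=0.8):
    """Compare every primary/import pair; keep the best import per primary person.

    Both arguments are dicts of id -> person dict. Pairs scoring below the
    threshold are dropped, so some primary people may have no match at all.
    """
    results = {}
    for (p_id, p), (i_id, imp) in product(primary.items(), imported.items()):
        s = score(p, imp)
        if s >= threshold and s > results.get(p_id, (None, 0.0))[1]:
            results[p_id] = (i_id, s)
    return results
```

Comparing every pair costs time proportional to the product of the two database sizes, which is one reason a good starting person and unique IDs are so helpful.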

Before Starting

If you plan to merge in your relative's GEDCOM from time to time, make sure they have unique IDs for all of their records. That way you will only have to drudge through the matching process once. After that your database will have their IDs, and every time you re-merge, it will find the matching IDs.

Not all genealogy programs have this feature, but some will generate unique values for the REFN, RFN, or _UID tags. REFN and RFN are standard tags, and _UID is a popular extension. If they can't generate these tags automatically, it helps if they add a few manually to give the matching program a starting point.
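
Once both files carry such tags, re-matching becomes a simple lookup. The following Python sketch assumes the tags appear as level-1 lines directly under each record and that a matching tag and value mean the same person; the function names are made up for this example.

```python
ID_TAGS = ("REFN", "RFN", "_UID")


def index_by_unique_id(lines):
    """Map (tag, value) -> record xref for every level-1 REFN, RFN or _UID line."""
    index = {}
    current = None
    for line in lines:
        parts = line.strip().split(" ", 2)
        if len(parts) < 2:
            continue
        if parts[0] == "0":
            # Record lines look like "0 @I1@ INDI"; HEAD and TRLR have no xref.
            current = parts[1] if parts[1].startswith("@") else None
        elif parts[0] == "1" and current and parts[1] in ID_TAGS and len(parts) == 3:
            index[(parts[1], parts[2])] = current
    return index


def match_by_unique_id(primary_lines, import_lines):
    """Return primary xref -> import xref for records sharing a tag and value."""
    primary = index_by_unique_id(primary_lines)
    imported = index_by_unique_id(import_lines)
    return {primary[k]: imported[k] for k in primary.keys() & imported.keys()}
```

Even a handful of matches found this way gives the matching program several starting points instead of one.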

Unique ID Generation

If we expect other programs to generate unique IDs, GDBI needs to do it as well. This can be done either by GDBI itself or by the databases that it connects to (PGV, GenJ, jLL). We recently discussed adding it to PGV.
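
One widely available way to produce an ID that is very unlikely to collide is a random UUID. The Python sketch below is an assumption for illustration only, not a technique GDBI or PGV is known to use, and the 32-hex-digit value format is likewise just a choice made here.

```python
import uuid


def new_unique_id():
    """Return a random 128-bit ID as 32 uppercase hex digits."""
    return uuid.uuid4().hex.upper()


def add_uid(record_lines):
    """Append a level-1 _UID line to a record unless it already has one.

    record_lines is the list of GEDCOM lines for a single record.
    A new list is returned; the input is left unchanged.
    """
    has_uid = any(line.split(" ", 2)[1:2] == ["_UID"] for line in record_lines)
    if has_uid:
        return list(record_lines)
    return list(record_lines) + [f"1 _UID {new_unique_id()}"]
```

Leaving an existing _UID alone matters: an ID is only useful for re-merging if it never changes once assigned.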

The only thing worse than having no ID is having a non-unique ID, because it defeats the purpose. We need a technique for generating good IDs. This was recently discussed on the GEDCOM-L mailing list, and these ideas were proposed:



$Header: /cvsroot/gdbi/doc/webpage/htdocs/merge.html,v 1.8 2005/02/11 03:15:01 dkionka Exp $