Daily Archives: July 2, 2011

My Struggle with Legacy Family Tree Sourcing – Part 2

In my previous post (link), I started the discussion of sources and citations and lumping and splitting as related to creating sources in Legacy Family Tree software. Part 2 will talk a little about the database structure and the trade-0ffs of lumping and splitting.

I made a very simplistic, high-level diagram of how things work in Legacy. It shows a source table with 5 sources. Next is the Citation Table which shows 5 citations all pointing back to source 1. Notice, however, the first three of these citation records contains identical data. This identical data is stored three times because each one points to a different event. The 4th and 5th Citation Data records are also identical.

Diagram 1 - Legacy Source/Citation Database Overview

This data model involves redundant storage of data and is very bad in terms of data maintainability. Now, in order to maintain data integrity, if any piece of the citation data needs to be changed it is necessary to find all copies and change each of them. Legacy shifts this onus of maintaining data integrity to the user, supplying only a Search and Replace tool. But there are several cases where Search and Replace cannot make the desired change easily and some cases where it cannot make the desired change at all. It then falls on the user to edit each copy of the citation data manually.

Diagram 2 illustrates the exact same relationships, but the citation data/event links are pulled out into a separate table. This allows for each unique citation record to be stored once, regardless of the number of events to which it is linked.

Diagram 2 - Alternate Data Model

Had Legacy implemented a model similar to diagram 2, I would be a very happy camper. But they didn’t. So now it all comes down to trade-offs. On one hand become a splitter. This entails creating a multitude of very specific, repetitious sources and minimizing the citation data content. In many, but not all cases, it gets rid of the data integrity problem as described above. On the downside, since you have so many sources, you need to be very careful and vigilant when you create them so that all the repetitious data is consistent from one source to the next. (That is if you care about consistent wording and formatting.) Admittedly, the ability to create a new source by copying an existing one helps with this.

It is also extremely important to come up with an organization scheme that is built into the source name. All Legacy does is present you with an alphabetized list of sources – no filtering or classification on the list. Granted, this is also a problem for lumpers, but the problem is magnified for splitters because they have so many more sources! Over the past couple of years, I’ve tweaked my naming system a couple of times to force the sort order of the source list. I don’t want to even image how much more laborious and time-consuming this would have been were I an extreme splitter! And while I don’t know of a hard-limit on sources, if you have a large database of people you may find that extreme splitting creates just too many sources to be practical.

On the flip-side. lumpers are more likely to have a more manageable number of sources. This has the advantage of making it somewhat easier to tweak them. I have actually tweaked my naming convention to force sort order (as mentioned above) as well as tweaked wording and content of source fields for better consistency in how footnotes and endnotes appear in reports. I also feel that some degree of lumping is more in keeping with database and software design principles. The downside of lumping is, of course, the citation maintainability problem which is the focus of the article.

So where does all this leave me. Well, I’ve already characterized myself as a lumper – although I am sure there are some who are more extreme. Over the years I’ve done a lot of experimenting and tweaking and I like the system I have for deciding what to include in the source table vs what to include in the citation. What works best for me is to define a source for each dataset or data collection. So each database in Ancestry gets their own source. I also tend to create separate sources for each of the datasets stored on the USGenWeb county pages. On the splitter/lumper spectrum, I would say that I’m a moderate lumper. In terms of Source Writer, I have recently decided to primarily use just 2 templates  for online data – online database or online database with images. (But that’s starting to get into a whole other discussion — the multitude of source writer templates!)

My struggle with Legacy sourcing is nothing new. It’s an issue I’ve had to deal with from the time I first started using it back in early 2004. Based on what people say on the LUG email list, the database implementation has pushed some people toward extreme splitting. That never felt right to me, and by now I’m just too entrenched in the way I have been doing things to even consider changing. Why then did I bother writing this post? After creating the SAR source and citation and linking it to about 30 events (so 30 copies of the citation) it occurred to me I should have included something additional in the citation. I tried to fix it with search and replace and guess what — it didn’t work!!!! It said it found and made 30 changes, but when I looked at the data, it had NOT changed. So mostly out of frustration I wrote this post. I still have to go back and try to fix things up, but at least I had a chance to vent.

Maybe some day Legacy will fix this design. Maybe some day some other company will design and build genealogy software that does not have this issue. Maybe some company already has — sometimes it’s hard to tell from the marketing oriented data on their websites. Maybe somebody will see this post and let me know!!

My Struggle with Legacy Family Tree Sourcing – part 1

Yesterday afternoon I got an email from Ancestry touting their newly added US Sons of the American Revolution Application dataset. (By the way, you can access it FREE this weekend in celebration of the Fourth of July holiday.) So I naturally logged on to Ancestry, went to the new SAR dataset and entered one of the surnames that I research to see what would pop up. And sure enough, I got some hits! The third or fourth one on the list I recognized as being one of my 5x great-grandfathers. Clicking on his name sent me to a screen where I could access the actual scanned image of a SAR application. And in viewing the application I was able to see birth, death and marriage dates and places for the generations that separated the applicant from the patriot.

Now, for me, this is both a blessing and a curse! Why? Because now I have to decide how to structure this within the confines of the method Legacy Family Tree has implemented sources and citations. Basically, I need to decide what information from the SAR application I want to store in the source and what I want to store in the citation. This discussion comes up often on the Legacy Users Group email list and is generally referred to as “lumping and splitting.” (FYI: Legacy has chosen not to provide a forum/message board, and with their email archiving being fractured and possibly dropping messages, many questions are repeated periodically.)

I think the both the source/citation and so-called lumping/splitting concepts are best illustrated with examples. The most easily understood source example is a book. Most people (regardless of whether they are lumpers or splitters) would agree that the book information (title, author, publisher, date, etc) should be stored in the source table and the page (or chapter, page range, etc) should be stored in the citation. As this illustrates, the purpose of the citation is to link the source to the event, and as such it should refer to the specific subset of the source that contains the data that was extracted and inserted into the event or fact.

Another common source – census data – is much less straight forward in terms of what part is source and what part is citation. If you follow Legacy Source Writer templates for US Federal Censuses, your sources will be specific to year, state, county and online database. In other words:

  • source 1 – 1880 Census, Pennsylvania, Berks, Ancestry.com
  • source 2 – 1880 Census, Pennsylvania, Berks, HeritageQuest
  • source 3 – 1880 Census, Pennsylvania, Chester, Ancestry.com
  • source 4 – 1800 Census, Pennsylvania, Chester, HeritageQuest

Splitters may break this down further, with separate sources for each town or city. Some have even suggested each census sheet is a separate source! As you can see, splitters have a very large number of very specific sources. Very little, if any, additional data is stored in the citation and it becomes just a link between source and event.

On the other hand, I fall into the lumper category. I still use Source Writer, but I leave the state and county fields blank when I create the census source. Then, when I add a citation, I preface the municipality field with the state and county. Since the source writer citation form does not have a logical place to store the online database (i.e. Ancestry vs. HeritageQuest, etc), my sources are based on year and online database. I’d rather it just be year, but I want to know which online database I used, so I make this concession in order to use source writer. Thus my sources look something this:

  • Source 1 – 1880 US Federal Census, Ancestry
  • Source 2 – 1880 US Federal Census, HeritageQuest

In general, lumpers have fewer sources and those sources are (for the most part) not repetitious. My guess is that most Legacy users who have a background in software or database design will probably lean toward the lumper end of the spectrum because that more closely follows the design principles to which we are accustomed.

Getting back to the SAR application and how that fits into the source/citation structure. A splitter would most likely consider each application a separate source. On the other hand, I consider the Ancestry SAR Application Collection the source, thus the specific data on an individual application would be part of my citation. But for a document like a SAR application, where it will serve as a citation for many events or facts, the fact that Legacy stores multiple copies of the citations is trouble waiting to happen. — more on this later in part 2 of this article [link].