|Academic Computing and Communications Center|
Identifying and Correcting Dates with Two-Digit Years
The year 2000 is over now, but you might still have Y2K problems to consider -- dates with two-digit years in your programs and data, which this article discusses. Another very valuable resource is IBM's The Year 2000 and 2-Digit Dates: A Guide for Planning and Implementation, a 250-page book that's available online. "Chapter 4. Identifying 2-Digit-Year Exposures" and "Chapter 5. Reformatting Year-Date Notation" cover the questions of how to find two-digit date problems and what to do with affected data and the programs you use to process it.
|Pinpointing the Problems|
Consider all your processes, projects, and studies, and ask yourself, "Does
this process, project, or study have anything to do with the passage of time?"
Does it, for example, record the date on which anything occurred? And then consider
today's date? Subtract the two, and you might have a problem.
Now you've got to go looking for those dates.
The examples in this section are fragments of SPSS code. SPSS is neither better nor worse than other packages and languages with regard to Year 2000 problems. The problems and solutions here apply also to SAS, BMDP, C, FORTRAN, UNIX's s program, Rexx, and so on. SPSS is commonly used on all computing platforms, from mainframes to UNIX, PCs, and Macs, and so it seemed to be a good choice for examples.
(But beware, dates can hide! You might, for example, solve all the problems in your data and programs, but forget about the data's containers -- file names themselves might contain dates. For instance, your data from July 17, 1998 might be stored in a file named DT980717.DAT. That's a date, a date with a two-digit year that has a year 2000 problem!)
|-- Finding Year Problems in Programs|
A reasonable way to start to find dates is to use a file searching program, such
as grep on UNIX or CMS, or Advanced mode Find in Windows 95, and
look for the string "date" in all files with extensions suggesting that they are
SPSS programs, SAS programs, or the like. You might also look for the strings
"yy", "year", or "yr". (This will help you find dates in your programs, and, assuming
that you know which data files are/were used as input to those programs, it will
also identify data files that need to be fixed.)
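The search described above can also be scripted. The following is a minimal Python sketch, not a replacement for grep: the function names `scan_text` and `scan_tree` and the extensions `.sps` and `.sas` are assumptions made for illustration, so substitute whatever extensions your own program files use.

```python
import re
from pathlib import Path

# The strings the article suggests searching for, matched case-insensitively.
PATTERN = re.compile(r"date|yy|year|yr", re.IGNORECASE)

# Hypothetical extensions for SPSS and SAS program files; adjust for your site.
EXTENSIONS = {".sps", ".sas"}


def scan_text(text):
    """Return (line number, line) for every line containing a search string."""
    hits = []
    for number, line in enumerate(text.splitlines(), start=1):
        if PATTERN.search(line):
            hits.append((number, line.strip()))
    return hits


def scan_tree(root):
    """Scan every program file under root; return (path, line number, line)."""
    hits = []
    for path in sorted(Path(root).rglob("*")):
        if path.is_file() and path.suffix.lower() in EXTENSIONS:
            text = path.read_text(errors="replace")
            hits.extend((str(path), number, line) for number, line in scan_text(text))
    return hits
```

Calling `scan_tree(".")` lists the candidate lines under the current directory. A search like this only finds dates that are named as dates, so treat an empty result as "nothing obvious" rather than "nothing there".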
Here's an obvious example of the kind of thing you're looking for, one that this method will point out. This SPSS code fragment reads dates, and it is easy to tell from the variable names what is going on, and that there is a Y2K problem -- it will report that babies born in 2000 are 100 years old:

DATA LIST / datemo 1-2 datedy 3-4 dateyr 5-6    /* read mmddyy data */
COMPUTE thisyr = XDATE.YEAR($DATE)              /* Get this year - 4 digits */
COMPUTE age = thisyr - (dateyr+1900)            /* calculate age by subtracting */

The year part of the date is stored in a variable named dateyr and it's only two digits wide (columns 5-6) in the input. (It is interesting to notice in this example how the decisions made two decades ago were compensated for one decade ago -- by simply adding 1900 to the year. It was assumed by most until just recently that the twentieth century would last forever.)
If only all Y2K problems were this easy to spot!
Now let's look at another SPSS procedure that does the same thing, but this one will be nearly impossible to spot by mechanical searching. These kinds of non-descriptive variable names are actually encountered more often in the real world than the nice, descriptive ones in the example above.

DATA LIST / v1 1-2 v2 3-4 v3 5-6 v4 7-11 v5 12-16
SORT CASES v3 v1 v2
REPORT ...

...and now you go looking for those recent babies down at the bottom of the file, and you find to your alarm that they've been kidnapped! (They're still there of course, just before the oldest people in the study, at the top of the file.) Look at it carefully -- would this be easily spotted as a year 2000 problem? There are no clues. The clue comes from your knowledge of what this program does: "Oh, that's the one we run to print a list of subjects arranged by age."
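Both failures can be reproduced with a few lines of arithmetic. This Python sketch is only an illustration: the birth years are invented, and `naive_age` is a hypothetical name for the calculation the first SPSS fragment performs.

```python
def naive_age(two_digit_year, this_year):
    """Age as the first fragment computes it: the year is always taken as 19xx."""
    return this_year - (two_digit_year + 1900)


# Hypothetical birth years for four subjects, oldest first.
birth_years = [1973, 1985, 1999, 2000]
two_digit_years = [year % 100 for year in birth_years]

sorted_two_digit = sorted(two_digit_years)
sorted_four_digit = sorted(birth_years)

print(naive_age(0, 2000))   # a baby born in 2000, evaluated in 2000
print(sorted_two_digit)     # the year-00 baby sorts ahead of the oldest subject
print(sorted_four_digit)    # four-digit years sort in true order
```

The first line of output shows the 100-year-old baby; the second shows the same baby moved to the top of the sorted list.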
|Remediation: a fancy word for repair that you'll see in some of the literature about the Year 2000|
|-- Fixing File Names that Are Dates|
It is extremely commonplace to encounter schemes for storing data within computer
files where the file name contains the date when the data was collected. For instance,
the data for July 17, 1998 might be stored in a file called DT980717.DAT. If that
file naming scheme contains a two-digit year, raise The
Proverbial Red Flag. When those two digits become 00, you may no longer
be able to find your newest data. Worse, in further data analysis, you could reprocess
the data for December 31, 1999 as the data for every day thereafter, because its
file name would have a greater value. You might not spot such a problem until
your research assistant came in to tell you that, since New Year's Eve, the data
for every day seems to look strikingly similar, despite the care you took to make
sure your lab instruments were ready for 2000. By then, the raw data files for
some days might have been lost -- especially if you automatically erase the oldest
data to conserve disk space. (This is essentially the same problem as a warehouse
where the stock stopped rotating, resulting in unexpected spoilage.)
If you can, you should expand the year part of the file name to four digits. However, that might not be an option due to limits on file name length. Another option might be to shrink the month down to a single alphabetic character, where A=January, B=February, and so on. Then use the space you gained for a century key, where 1900s=0, 2000s=1, 2100s=2 and so on. The file name DT980717.DAT would become DT098G17.DAT, and the data collected on February 29, 2000 would be in file DT100B29.DAT. Note that these two file names would sort correctly. (As you devise file name schemes, do not depend on whether numbers will sort before letters, or after them -- this differs on different computers.)
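The century-key scheme can be sketched in Python as follows. The function name `encode_name` and its `prefix` and `suffix` parameters are hypothetical; only the layout (century key, two-digit year, month letter, two-digit day) comes from the description above.

```python
import datetime

MONTH_LETTERS = "ABCDEFGHIJKL"   # A=January, B=February, ... L=December


def encode_name(day, prefix="DT", suffix=".DAT"):
    """Build an 8.3 file name: prefix, century key, yy, month letter, dd."""
    century_key = day.year // 100 - 19   # 1900s -> 0, 2000s -> 1, 2100s -> 2
    if not 0 <= century_key <= 9:
        raise ValueError("year is outside the range a one-digit century key can hold")
    month_letter = MONTH_LETTERS[day.month - 1]
    return f"{prefix}{century_key}{day.year % 100:02d}{month_letter}{day.day:02d}{suffix}"


old_name = encode_name(datetime.date(1998, 7, 17))
new_name = encode_name(datetime.date(2000, 2, 29))
print(old_name, new_name, old_name < new_name)
```

Because every character position holds either always a digit or always a letter, two names never compare a digit against a letter, so the ordering does not depend on which of the two a given computer sorts first.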
|-- Fixing Two-Digit Years at the Data Source|
The best way to fix problems involving two-digit years is at the source. It is
also the hardest. Change your data entry procedures to record all four digits
of the year. While you're at it, examine whether or not you can ship the data
file around in a database-type format, instead of as raw data. That can really
save a lot of effort. For instance, let's say that your study's data entry has
evolved from the keypunch machine of 25 years ago to a form-filling-out thing
based on MS/Access on a PC. After the data is collected, you have Access format
it with everything in all the same columns as you used in your 25-year-old punched
cards, write it out into a text file, and then use that text file as input into
SPSS. Complete with two-digit year.
A remediation strategy here might be to have Access require the user to enter all four digits of the year (Access can do that), then to save the input file it creates in a database format such as .dbf, which SPSS can read directly. If variable names do not match anymore, they can be renamed easily.
Data acquired from another source requires investigation of that source and review of its year 2000 compliance. You also need to review the format you are receiving the data in. If possible, receive data in "database" format, such as a dBase .dbf file, an SPSS System File, or a SAS dataset. These database formats store date data in internal formats which will carry on well into the next century without any problem. Such formats are also preferable to text formats in that they contain data dictionary information as a part of the file.
(Note that it's not enough to make sure that data is in database format; you should also examine the assigned format widths to ensure that they are sufficient for a four-digit year. For instance, if the SPSS DISPLAY DICTIONARY command shows a format type of ADATE8, you will only get a two-digit year, even though it might be stored internally in the database with all four digits. It must be at least ADATE10 for a four-digit year to be displayed.)
So, now that first example becomes:

DATA LIST / datemo 1-2 datedy 3-4 dateyr 5-6 datehund 81-82   /* read it */
COMPUTE centmark = datehund                    /* Century digits, if present */
IF (SYSMIS(datehund)) centmark = 19            /* Not there? It's old data. */
COMPUTE dateyr = (centmark * 100) + dateyr     /* Combine year parts */
COMPUTE thisyr = XDATE.YEAR($DATE)             /* Get this year - 4 digits */
COMPUTE age = thisyr - dateyr                  /* calculate age by subtracting */

...and the recently born babies are young again.

See IBM's The Year 2000 and 2-Digit Dates: A Guide for Planning and Implementation.
The methods discussed in the IBM book (including the implications, both pro and con, examples, and how-to information) are:
|And What If That Cannot Be Done?|
In this case, you need to adopt a technique called "windowing". Windowing means taking the two-digit year and applying common sense to determine the century that it belongs in.
Obviously, windowing cannot be used for data that might span a period greater than 100 years, such as the birth years of people in the general population. There have always been a number of people living beyond the age of 100, and as health care improves that number can only increase. If you code a birth year as "96", that could be either 1996, which would indicate a two-year-old, or 1896, which could indicate a 102-year-old. For a short time after the year 2000, there will be people alive who were born in three different centuries!
Example 1: 100 Year Fixed Window, 1973 to 2072
In the 100-year-old babies study, we're studying children, and the oldest ones in our study were born in 1973. So we don't have to worry about data spanning more than 100 years until 2073. In this case, we can establish our "window of time" as being 1973-2072, since we already know we have no data from before 1973. This simplest form of windowing is called a "fixed window".

DATA LIST / datemo 1-2 datedy 3-4 dateyr 5-6      /* read mmddyy data */
COMPUTE dateyr = dateyr + 1900
IF (dateyr < 1973) dateyr = dateyr + 100          /* Years 2000-2072 */
COMPUTE thisyr = XDATE.YEAR($DATE)                /* Get this year - 4 digits */
COMPUTE age = thisyr - dateyr                     /* calculate age by subtracting */

Once again -- young babies.
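For readers working outside SPSS, the same fixed window can be sketched in Python. The function name `fixed_window` and its `window_start` parameter are hypothetical; the default of 1973 is the earliest birth year in this example's study.

```python
def fixed_window(two_digit_year, window_start=1973):
    """Expand a two-digit year into the 100 years beginning at window_start."""
    if not 0 <= two_digit_year <= 99:
        raise ValueError("expected a two-digit year between 0 and 99")
    century = window_start - window_start % 100   # 1900 for a 1973 start
    year = century + two_digit_year
    if year < window_start:                       # too early: next century
        year += 100
    return year


for yy in (73, 99, 0, 72):
    print(yy, fixed_window(yy))
```

The four sample values cover both edges of the window: 73 is its first year and 72 its last.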
Example 2: One Year Window
Consider data output by a lab instrument that contains a two-digit year. You have verified that it will correctly change from 99 to 00 in the year 2000, and that it will consider 2000 to be a leap year. (In fact, you discover that the processor at its heart is a now-obsolete IBM PC, running DOS, so you know you will need to reset its date on January 1, 2000; see "IBM/Intel/Windows PC BIOS Tick-Over Bug".) Your computer programs process the data output from this machine within a month of the time it was emitted by the machine. All archives of data from this machine are stored with a four-digit year that has been calculated by your program that receives the raw data.
The time period you need to worry about is only one month, so your window need only be for the previous year. If the two-digit year from the machine is ever greater than the present two-digit year, the data is from the previous century.
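The rule just stated can be sketched in Python as follows. The function name `one_year_window` is hypothetical, and the sketch assumes the instrument never reports a year later than the current one.

```python
def one_year_window(two_digit_year, current_year):
    """Expand a two-digit year that is at most one year behind current_year."""
    if not 0 <= two_digit_year <= 99:
        raise ValueError("expected a two-digit year between 0 and 99")
    century = current_year - current_year % 100
    year = century + two_digit_year
    if two_digit_year > current_year % 100:   # e.g. 99 seen during 2000
        year -= 100                           # the data is from the previous century
    return year


# Data stamped 99 that is processed in January 2000 belongs to 1999.
print(one_year_window(99, 2000))
print(one_year_window(0, 2000))
```

Passing the current year in as an argument, rather than reading the system clock inside the function, makes it possible to test the rollover before it happens.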
Example 3: 50-Year Sliding Window
Consider home mortgages. These typically last up to 30 years. Since you must deal both with mortgages made 30 years ago which are about to be paid off, and with mortgages made today which will not be paid off until 30 years hence, the total span is 60 years. None last for anywhere near 100 years. This makes it a good candidate for the most ordinary and generally useful form of windowing -- the 50-year sliding window, based on today's date.
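The article does not spell out the exact boundaries of the sliding window, so the Python sketch below assumes one common reading: the window runs from 50 years before the current year to 49 years after it. The function name `sliding_window` and the `years_back` parameter are hypothetical.

```python
import datetime


def sliding_window(two_digit_year, current_year=None, years_back=50):
    """Expand a two-digit year into the 100 years starting years_back ago."""
    if not 0 <= two_digit_year <= 99:
        raise ValueError("expected a two-digit year between 0 and 99")
    if current_year is None:
        current_year = datetime.date.today().year   # the window slides with the clock
    window_start = current_year - years_back
    year = window_start - window_start % 100 + two_digit_year
    if year < window_start:                         # too early: next century
        year += 100
    return year


# Seen from 1998: a mortgage written in '68 and one maturing in '28.
print(sliding_window(68, current_year=1998))
print(sliding_window(28, current_year=1998))
```

Because the window moves every year, the same two-digit value can eventually expand to a different century, which is one more reason to store the four-digit result rather than recompute it later.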
Warning About Windowing
Windowing is not free from risks! If you try to develop a single windowing rule, and apply it to all situations, it will be wrong some of the time. You must develop your windowing formula according to your knowledge of the data involved.
Be particularly careful of windowing more than once. Most real-world applications consist of chains of applications strung together by intermediate files, like so many beads. You can actually corrupt your data if you use different windows at different places in the chain. The two-digit year 55 might be understood to be 1955 at one point, but as 2055 at another.
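The conflict is easy to reproduce. In this Python sketch, two steps of a hypothetical processing chain apply fixed windows that start in different years; the function name `apply_window` and both starting years are invented for illustration.

```python
def apply_window(two_digit_year, window_start):
    """Expand a two-digit year into the 100 years beginning at window_start."""
    century = window_start - window_start % 100
    year = century + two_digit_year
    return year + 100 if year < window_start else year


# The same raw value, windowed at two different points in the chain.
first_step = apply_window(55, window_start=1950)
second_step = apply_window(55, window_start=1973)
print(first_step, second_step)
```

The two results differ by exactly 100 years, and nothing in either program flags the disagreement.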
Windowing, in particular, must be tested. This is an area where your inventiveness can get the best of you, and can cause errors. In the literature on Year 2000 Time Machine testing, a great many of the errors that were found and fixed involved doing windowing incorrectly.
See chapter 5 of IBM's The Year 2000 and 2-Digit Dates: A Guide for Planning and Implementation for more information about bridge programs.
|External Windowing: How Statistical Packages Handle Two-Digit Years|
Statistical and database packages may or may not include features for automatically
windowing two-digit years into four-digit years. We will call this "External Windowing".
Their use is discouraged, because they might
be applied across the board without regard to the nature of the data set itself.
Also, external windowing might conflict with any internal windowing formula that
somebody might have coded into the procedure itself.
SPSS version 8 for Windows (only) contains a feature for adjusting SPSS's external windowing to be something other than 1900-1999. Its use is discouraged, especially because it could break older SPSS programs that depended on the previous 1900-1999 window.
You will probably be unable to avoid having your windowing formulas conflict with these facilities, so you should arrange not to read any two-digit years at all in SAS. Instead, read them as regular numeric variables, and do the windowing calculations yourself, based on your knowledge of the data.
This page last updated 1998-10-08. Please send comments and reports of broken links to the author: Roger Deschner
|2001-12-21 ACCC Documentation||