I thought readers might like a peek at how I spent three hours of my life earlier in the week.The original task was to transfer a set of customer account numbers from a SQL SERVER box to a Teradata (a data warehouse) box in order to find all customer activities associated with those accounts before a certain cutoff date.
Within minutes, it became clear that the real task was getting Teradata to recognize a column of dates as dates. The column looks like this: 07/20/2010, 07/25/2010, ...
What's the problem? You say, of course, anyone can see these are dates. Well, Teradata disagrees - and until and unless Teradata is fully convinced I have offered it a column of dates, I couldn't proceed with my original task, which was to compare these dates to the cutoff date.
Teradata thought those were strings of characters, not dates. I tried the simplest solution first: cast(datecolumn as date). Teradata wouldn't budge, complaining that my datecolumn were "invalid dates".
From the manual, I confirmed that a valid way to use the cast function is cast('2010-07-20' as date). So, I needed a way to convert 07/20/2010 into '2010-07-20'.
I fumbled around a bit as I learned that Teradata does not support many classes of solutions I'm familar with, like regular expressions, an MDY type function (which produces a date given month, day and year inputs), a find-and-substitute function, etc. So I stopped being cute, and reluctantly did it the brute-force way, using substrings and concatenates.
First, I tested cast('2010-07-20' as date) and it created a Teradata date 07/20/2010. Yes, that looked exactly like my input data but human eyes deceive, if the database proclaimed it not to be a date, then it was not a date.
Next, I substituted '2010-07-20' with that brute-force substring-concatenate expression. To my surprise, it failed, complaining again of invalid dates. I fished out some samples of these dates. On visual inspection, they looked like dates. Smelled like dates.
Undeterred, I took out the cast-as-date function and applied the substring-concatenate expression to the full column of dates, and it ran without a hitch. Then, I put the cast-as-date function back in, which instantaneously bombed.
Now that the dumb way stumbled, I went back to being cute. Maybe I could trick Teradata by splitting into two steps, first creating a new data table with the substring-concatenate output, which seemed to have worked before, and then running the cast-as-date function.
Maybe not. As soon as I put the substring-concatenate expression together with two lines of code that generate data tables, it choked. The mystery deepened. The same code when used without the data-table generation code, succeeded in producing Teradata dates, and yet as soon as I wrapped it inside the data-table code, it stalled. The error officially had to do wtih a missing something between the date variable and the comma sign inside the substring function. It was a so-called "syntax error". Very troubling because the same code ran smoothly when the output was ported to a pop-up window but when the output was to be stored in a data table, the server apparently wanted a different syntax! In any case, I couldn't figure out what Teradata was grumbling about.
***
Teradata and I were not friends at the moment. What to do? Like a spurned lover, I sought out my other good friend, SQL SERVER. What if I converted the column of dates to dates first before the data got transferred into Teradata?
So I did that. After a laborious procedure, the data got moved into Teradata. Alas, the dates still showed up as strings of text. Doubled back to SQL SERVER: there, the dates were dates. This meant the program used to transfer data between the two platforms would not recognize those as dates.
My colleague suggested a Hail Mary. (Yes, now two "data scientists" were working on this glamorous problem.) Makes the dates "datetime". Datetime looks like this: 07/20/2010 00:00:00. The time component will be all zeroes since my data contains no information about time. It was a Hail Mary because we had really no rationale why the database would read datetime but not date. You just do random things when you run out of rational ideas.
It worked. It worked. It worked.
SQL Server converted the column of dates to datetime format. Teradata not only reads this correctly but reads this three times, once as a datetime, once as a date and once as a time.
And I skipped over the exasperating data transfer procedure. Importing the data into Teradata requires a special utility. Half the time, the utility will not launch properly, and when that happens, I issue a command-line instruction to reset the utility from the operating system. On this particular day, after the utility opened, it took five tries to establish a network connection. One setting must be switched from default before running the utility. Switching that setting always snips the network connection. So, it was another five tries to restore the connection. And only then could the data transfer get off the ground.
***
Three hours later, it worked. The "it" has morphed from finding customer activites to getting a database to see '07/20/2010' as a date not text.
The SQL SERVER box has closed up shop for the night. I still haven't found a single customer transaction. The project has a long way to go.
Every project one runs into situations like this. It's not an outlier. Welcome to the world of data science.
What quacks like a duck, but isn't a duck. Sounds like Teradata still needs a little work.
Posted by: Jon Peltier | 11/04/2011 at 08:55 AM
I spent a frustrating afternoon trying to figure out SAS's LAG function (which does not work as one would expect in case of conditional statements). You can read the basic problem (and workaround) in this lovely paper should you even have such problems: www.howles.com/saspapers/CC33.pdf.
I've found that understanding how different languages give help is useful. SAS tends to have a nice paper written at a user's group for whatever problem one might encounter, and a pile of strange actions due to PDV interpretation. R tends to have a listserve where the author of the function tells users to RTFM, and code snippets with other difficult to understand functions.
Posted by: Cody L. Custis | 11/04/2011 at 11:26 AM
I don't think the default string representation of a date value is always yyyy-mm-dd. There's a dateform setting that can be set to 'integerdate' which is yy/mm/dd. This is probably more likely in environments that are supporting older applications, since recent versions of Teradata default to yyyy-mm-dd.
You can also simply tell Teradata the date format in the CAST statement. This would have saved you some time. One example: http://www.tek-tips.com/viewthread.cfm?qid=1157920
The client display format for dates (mm/dd/yyyy) is a separate matter, as you noted.
Posted by: KH | 11/04/2011 at 12:14 PM
To be more explicit, if your string is '07/20/2010' in a field named "my_dt", your cast statement would be:
CAST(my_dt as DATE FORMAT 'MM/DD/YYYY')
No string manipulation should be required.
Posted by: KH | 11/04/2011 at 12:16 PM
I had a similar problem long long time ago. In this galaxy, though. The same trick you used seemed to fix things. After a while I started to notice some anomalies.
The fix? Use 12:00:00 instead.
Why? DTS.
I hate dates. Especially weeks. Leapweeks.
Posted by: MV | 11/04/2011 at 12:56 PM
"we had really no rationale why the database would read datetime but not date. You just do random things when you run out of rational ideas."
The academic subject of "Computer Science" is misnamed because as a field of study, it does not use the scientific method of empirical observation. And yet, much of real software development, which should not properly be called "Computer Science," relies heavily on empirical observation. Discuss.
Posted by: njnnja | 11/04/2011 at 01:21 PM
I hate it, too, when I have to go to the concatinate expressions to solve one of what I call the "massaging the data" issues. And yes, they always occur and they almost almost mess up my completion time estimates! Ugh! When a VP is breathing down your throat and wants a report by end of day, it is always a three hour hold up like this that makes you late for dinner, isn't it??
Posted by: Kim Terek | 11/05/2011 at 12:40 PM
Jon: while I have issues with Teradata, this post isn't meant to single out that vendor. Similar issues happen with every vendor in one way or another.
I think Teradata makes a product that hugely favors performance over feature set. For example, if it knows that all dates are in one format and one format only, it doesn't have to waste time dealing with transformations. (Cynics will say this just reallocates the time from data warehouse to the analyst.) Maybe people who know more about data warehouses can correct me if I'm mistaken.
KH: the cast function was the first thing attempted. It didn't work. Even the DBA suggested that we don't process dates within Teradata.
njnnja: maybe some day I'll post my thoughts about "computer science". Good point.
Kim: in this case, I wasn't late for dinner because the servers are reserved for production staff overnight :)
Posted by: Kaiser | 11/06/2011 at 10:56 AM
Did you consider using SQL Server's Integration Services tool? It comes "free" with SQL Server and its purpose in life is moving data in and out of the database.
Posted by: Colin Davies | 11/12/2011 at 12:39 PM
They say 99% of a soldier's life is boredom and 1% is terror. I guess the message here is not about dates, Teradata or SQLServer but that 99% of a data scientist's work is about data preparation and 1% is .... well, we never did get the 1%, maybe tomorrow.
Posted by: george | 01/18/2013 at 05:47 AM