Recently there was a request to create a function that would parse out the city, state and postal code information from a single string into separate columns for both US and Canadian addresses. There are obviously many different ways to approach this problem, and in my first few attempts I tried using the PATINDEX function with regular expression to first extract the postal code and then to identify the state and city in that order. Since using regular expressions in T-SQL does not quite work the same as using regular expression in managed code, I was finding it difficult to account for all the different variations without explicitly defining each and everyone of them first. The comments in the code below explain some of these variations.
I wanted to keep it simple avoid solutions involving CLR or using OLE functions. Yes, a CLR solution would probably be the best in terms of both performance and robustness, but I wanted to keep the solution within the scope of a single T-SQL function and I think the solution below works quite well.
Below you’ll find the complete code for the function along with comments and usage documentation. Please take note of the assumptions being made about the data in the code comments notes section. I did test this against a large data set with many different variations which are also mentioned in the code comments. If you come across any examples that don’t parse correctly with this code, please let me know in a comment below. Hope you enjoy.
SET ANSI_NULLS ON; GO SET QUOTED_IDENTIFIER ON; GO IF EXISTS (SELECT * FROM [sys].[objects] WHERE [object_id] = OBJECT_ID(N'[dbo].[ParseAddress]') AND [type] IN (N'FN', N'IF', N'TF', N'FS', N'FT')) BEGIN DROP FUNCTION [dbo].[ParseAddress]; END; GO /*========================================================================================================= Script: dbo.ParseAddress.sql Synopsis: Function to parse an address string containing city, state and zip code information. This should work for any United States or Canada address string. Tested using multiple variations including 5-digit zip codes, 5-digit zip code + 4-digit extension separated with 1+ space(s), hyphen, or no separation and Canadian 6-character postal codes with the same variations. This function will also discard non-alpha characters and multiple spaces used to separate city, state and zip code information. Usage: SELECT [Address] ,[City] = [dbo].[ParseAddress]([Address], N'City') ,[State] = [dbo].[ParseAddress]([Address], N'State') ,[Zip] = [dbo].[ParseAddress]([Address], N'Zip') Notes: The following assumptions are made about the data for parsing: - Zip code information is the last part of the string - Zip code will always contain 5+ characters. - State will always be two letters. - City will always end with a letter. =========================================================================================================== Revision History: Date Author Description ----------------------------------------------------------------------------------------------------------- ===========================================================================================================*/ CREATE FUNCTION [dbo].[ParseAddress] ( @String NVARCHAR(64), @Get NVARCHAR(64) ) RETURNS NVARCHAR(64) AS BEGIN DECLARE @Address AS NVARCHAR(64); DECLARE @City AS NVARCHAR(25); DECLARE @State AS NVARCHAR( 2); DECLARE @Zip AS NVARCHAR(10); DECLARE @Index AS TINYINT ; DECLARE @Char AS NCHAR( 1); DECLARE @Value AS NVARCHAR(64); -- Remove any leading or trailing white space SET @Address = LTRIM(RTRIM(@String)); -- Initialize string index SET @Index = 1; WHILE (@Index <= LEN(@Address)) BEGIN SET @Char = SUBSTRING(REVERSE(@Address), @Index, 1); IF (@Zip IS NULL OR LEN(@Zip) < 5) BEGIN -- Continue reading valid characters for @Zip WHILE (PATINDEX(N'[a-zA-Z0-9]', @Char) = 1) BEGIN SET @Zip = ISNULL(STUFF(@Zip, 1, 0, UPPER(@Char)), UPPER(@Char)); SET @Index = @Index + 1; SET @Char = SUBSTRING(REVERSE(@Address), @Index, 1); END; END; IF (@State IS NULL OR LEN(@State) <> 2) BEGIN -- Continue reading valid characters for @State WHILE (PATINDEX(N'[a-zA-Z]', @Char) = 1) BEGIN SET @State = ISNULL(STUFF(@State, 1, 0, UPPER(@Char)), UPPER(@Char)); SET @Index = @Index + 1; SET @Char = SUBSTRING(REVERSE(@Address), @Index, 1); END; END; -- The last character of city should be an alpha character IF (PATINDEX(N'[a-zA-Z]', @Char) = 1) BEGIN -- Just assign the rest of the string to the @City variable SET @City = SUBSTRING(@Address, 1, LEN(@Address) - @Index + 1); BREAK; END; SET @Index = @Index + 1; END; -- Removes double-spaces from the city name WHILE (CHARINDEX(SPACE(2), @City) > 0) BEGIN SET @City = REPLACE(@City, SPACE(2), SPACE(1)); END; -- Format US Postal Codes that have 4 digit extension by stuffing the hyphen in correct position IF (PATINDEX(N'[0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9]', @Zip) = 1) BEGIN SET @Zip = STUFF(@Zip, 6, 0, N'-'); END; -- Format Canadian Postal Codes by stuffing a space in correct position IF (PATINDEX(N'[a-zA-Z0-9][a-zA-Z0-9][a-zA-Z0-9][a-zA-Z0-9][a-zA-Z0-9][a-zA-Z0-9]', @Zip) = 1) BEGIN SET @Zip = STUFF(@Zip, 4, 0, N' '); END; IF (@Get = N'City') SET @Value = @City; IF (@Get = N'State') SET @Value = @State; IF (@Get = N'Zip') SET @Value = @Zip; RETURN @Value; END; GO