FileMaking 6: Text Parsing with FileMaker

Originally published in May 2006

Text Parsing With FileMaker

Pardon my extended absence from this column. Life just has a way of getting away from one at times, and the work load piles up. But hopefully I’m back to a normal schedule again.

As mentioned in the last column I wrote regarding the direction of this tutorial series, we’ll now cover some basic text parsing functions. For beginning FileMakers, parsing text is one of the most useful skills to have as well as one of the first that questions are asked about.

What Is Text Parsing

Text parsing is the process of applying a calculation to a string to pull out some other string within it that is useful or to determine if some condition is true regarding the string.

This comes up many times in FileMaker, or for that matter, any database application. I’m currently working on a project where data is received from an outside vendor as a text file, where each line is a record and each field in the record is a particular number of characters in the line (such as, characters 13 through 20 store the creation date of the record, 21 through 30 the product ID, etc.). This format is called a fixed format file because each field is stored in a fixed number of characters. Text parsing calculations allow me to extract the data into separate fields in my database.

Name Extraction

For example, proper database design dictates that information about people should have their different names (first, middle, last, etc.) separated into different fields: FirstName, MiddleName, LastName. However, perhaps you need to import data from another system where all three of these pieces of information were placed in a single field and you want to break them out to their individual components.

FileMaker includes a wide range of text functions that can be used to help with such chores. You can view a list of these text functions by creating a calculation field and choosing “Text Functions” from the View menu.

filemaking-1

However, a better way to become familiar with the available functions is to look through the online help documents that comes with FileMaker. While in FileMaker, press Command-Shift-/ to bring up the help system, and then click on Contents in the top left of the window, followed by Function Reference Category List, and then Text Functions. Click the link for each of the available text functions to become familiar with them.

filemaking-2

For text parsing, the most useful functions will be Left(), Right(), Middle(), Length(), PatternCount, and Position().

So let’s begin by establishing some rules on how one would extract individual names from a full name field. My own name is a good example of a simple full name, with a single word for each of the first, middle, and last names: “Charles Edward Ross.” It is made up of a single word for each name (a word in FileMaker is a string of characters separated by whitespace, such as a space, return, or tab character). We will begin by extracting just the first name. In this case, the first name can be extracted by getting all of the characters before the first space. So the first thing we need to figure out is where the first space occurs, which is done with the Position() function.

Position( FullName; " "; 1; 1 )

Assuming we have a text field called FullName, the above will return the number of the character that has the first occurrence of a space. In the case of my name, this will be 8.

Now that we know where the first space occurs, we want to get all of the characters before it. For this we use the Left() function, which returns characters in a string beginning from the left side for the number we specify. The number of characters we want is 1 less than the position of that first space, so our calculation becomes:

Left( FullName; Position( FullName; " "; 1; 1 ) - 1 )

If Charles Edward Ross is in FullName, the above will return Charles. If FullName contains Chuck Ed Ross, the above will return Chuck.

Now, FileMaker does have more powerful functions available to us that we could use. For instance, I could do the above with the following as well:

LeftWords( FullName; 1 )

However, this tutorial is not only about how to do things in FileMaker, but also how to think about what you’re trying to do. I want to show you how to think about the result, and using both Left() and Position() were more instructive toward this than using LeftWords().

Debugging Our Name Extraction

As we have our FirstName calculation, it will work so long as the first name is the first word. But what if it isn’t? What if the FullName field contains Mr. Charles Edward Ross? Our function will return Mr. instead of Charles. So let’s next account for the possible, but not necessary, presence of a title before the first name.

First, let’s consider which titles we want to account for. Off the top of my head, I can think of, Mr., Mrs., Ms., Miss, Dr., and Rev.. We want to check if the first word of FullName is any of these, and if so, get the second word. If it’s not present, get the first word.

We’ll begin by creating a variable to store the list of titles. Unfortunately, FileMaker doesn’t have a built-in list or array construct, but we can fake one by using a return separated string.

Let(
  {
    Titles = "Mr.¶Mrs.¶Ms.¶Miss¶Dr.¶Rev.";
    FirstWord = LeftWords( FullName; 1 );
    SecondWord = MiddleWords( FullName; 2; 1 )
  };
  
  Case(
    Position( Titles; FirstWord; 1; 1 ) > 0;
    SecondWord;
    FirstWord
  )
)

We begin this calculation by setting a number of variables, including a list of the possible titles that a full name could begin with. For ease of readability, we also set the FirstWord and SecondWord variables.

Then we use a Case() function to determine which word, the first or the second, we should use. The Position() function will return a 0 if the search string (the first word) is not in our searched string (the list of titles). Take a close look at the parameters of the Position() function by checking the documentation. We are looking for the first word of the full name being in our list of titles.

We can simplify this calculation a bit. First of all, our test expression in the Case() function doesn’t have to check for the value being greater than 0. FileMaker interprets any numerical value that is not 0 as being True, so the first thing to simplify the function is to remove the inequality expression:

Let(
  {
    Titles = "Mr.¶Mrs.¶Ms.¶Miss¶Dr.¶Rev.";
    FirstWord = LeftWords( FullName; 1 );
    SecondWord = MiddleWords( FullName; 2; 1 )
  };
  
  Case(
    Position( Titles; FirstWord; 1; 1 );
    SecondWord;
    FirstWord
  )
)

The second thing we can do is realize that using Position as a test for one string simply containing another string is probably going to be a common need. That implies that we could create a custom function for this purpose, called perhaps Contains(). We’ll create this custom function to take two parameters, a text parameter and a searchString parameter (see the article on creating custom functions). Here’s the definition of our custom function:

PatternCount( text; searchString; 1; 1)

After defining this, we have the following for our current FirstName calculation:

Let(
  {
    Titles = "Mr.¶Mrs.¶Ms.¶Miss¶Dr.¶Rev.";
    FirstWord = LeftWords( FullName; 1 );
    SecondWord = MiddleWords( FullName; 2; 1 )
  };
  
  Case(
    Contains( Titles; FirstWord );
    SecondWord;
    FirstWord
  )
)

Parsing Out the Last Name

Now that we can extract the first name with some reliability, let’s work on the last name. We’ll begin with a calculation similar to that of the first name calculation, accounting for a possible suffix. In this case, we’ll assume that any existing suffix appears after a comma, as in Charles Edward Ross, Jr. or Charles Edward Ross, Ph.D.. For the time being, we’ll also assume that there can only be one suffix.

Let(
  {
    LastWord = RightWords( FullName; 1 );
    SecondToLastWord = MiddleWords( FullName; WordCount( FullName ) - 1; 1)
  };
  
  Case(
    Contains( FullName; "," );
    LastWord;
    SecondToLastWord
  )
)

If FullName contains a comma, then the above will return the second to last word in FullName. If it does not have a comma, then the last word is returned.

Note the use of WordCount() to determine which word is the second-to-last word. As you would expect, WordCount() return the number of words in a string, so WordCount( "Charles Edward Ross, Jr." ) would return 4. Subtracting 1 from this gives us word 3, and using MiddleWords() to begin at word 3 and get 1 word returns "Ross".

Refining Last Name Extraction

Again, there’s an exception to the rules we’ve covered for last names: Not all last names are a single word. Some last names are two words, as in Mc Gowan (my wife’s maiden name) or von Neumann. Other last names might begin with de or O (without an apostrophe). Again, off the top of my head, I can think of Mc, Mac, O, De, and Von. I’m sure there are others, but that will suffice for now.

So here is our newly defined rule for last names. Assume that the full name has n words in it. If word n is not a listed suffix, and word n–1 is not a listed last name word, the last name is word n. If word n–1 is a listed last name word, the last name is words n–1 through n. If there is a suffix, shift the last name (and the tests for a separate last name word) to the left by one word.

Sounds complicated. Let’s see what it looks like. First let’s add a variable to hold the possible words that a last name can begin with:

Let(
  {
    LNamePrefixes = " Mc ¶ Mac ¶ O ¶ De ¶ Von ";
    LastWord = RightWords( FullName; 1 );
    SecondLastWord = MiddleWords( FullName; WordCount( FullName ) - 1; 1);
    ThirdLastWord = MiddleWords( FullName; WordCount( FullName ) - 2; 1 )
  };
  
  Case(
    Contains( FullName; ",");
    Case(
      Contains( LNamePrefixes; ThirdLastWord );
      ThirdLastWord & " " & SecondLastWord;
      SecondLastWord
    );
    Case(
      Contains( LNamePrefixes; SecondLastWord );
      SecondLastWord & " " & LastWord;
      LastWord
    )
  )
)

We’ve surrounded our last name prefixes with spaces because we only want to find them if they are separate words. We don’t want to catch the word before the final last name if it’s O’Grady, only if it’s O Grady.

Note that this isn’t a perfect extraction of first name and last name. There are many other exceptions we could account for, a suffix without a comma, or three-word last names (de la Hoya). But this should give you plenty to begin with for your own text parsing routines.

Email Verification

Text parsing isn’t always pulling out a piece of a string. Sometimes you need to know if a string satisfies a condition. A classic example is if the string is a valid e-mail address. You can’t verify that it’s valid in the sense that e-mail sent to the address will arrive at an e-mail server someplace, but you can verify that it looks like a valid address.

A valid e-mail address looks something like This email address is being protected from spambots. You need JavaScript enabled to view it.. So, it has some characters, followed by an @ symbol, followed by some more characters, a period, and two or three more (i.e., .com or .uk). There are some more restrictions, but these will suffice for our tutorial purposes.

First, an e-mail address is valid if it has the @ character in it only once, and if it’s not the first character.

( Position( EmailAddress, "@", 1, 1 ) > 1 )
  and ( PatternCount( EmailAddress, "@" ) = 1 )

For an e-mail address to be valid, the third-to-last or the fourth-to-last character must be a period. (For now, we’ll ignore .name and other newer domains.)

( Position( EmailAddress; "@"; 1; 1 ) > 1 )
  and ( PatternCount( EmailAddress; "@" ) = 1 )
  and ( ( Middle( EmailAddress; Length( EmailAddress ) - 3 ) = "."; 1 )
      or ( Middle( EmailAddress; Length( EmailAddress ) - 2 ) = "."; 1 ) )

The first line checks that the first @ symbol is not the first character. The second condition checks that there is only one @ symbol in the string. Parentheses enclose a compound third condition made up of two separate conditions, one checking that the fourth-to-last character is a period and the other checking that the third-to-last character is a period. If either of these is true, the third (compound) condition returns true. So an e-mail like This email address is being protected from spambots. You need JavaScript enabled to view it. will validate, but @chivalrysoftware.com, chuck@This email address is being protected from spambots. You need JavaScript enabled to view it. and chuck@chivalrysoftware.c will not validate.

Again, this isn’t a complete validation of e-mail addresses, but it should give you an idea of the thought process needed and the functions required to do a more complex validation.

• • •

I think that will do it for us this month. As an exercise in using the skills covered in this column, I would suggest writing the routine to extract the middle name from a full name, including the possibility that the middle name isn’t provided (i.e., return an empty string when FullName contains Charles Ross). Until next month, happy FileMaking!

Comments  

 
0 # Blanca Bustamante 2013-09-05 23:51
This is the easiest to understand explaination of parsing. Thank you!
Reply | Reply with quote | Quote
 

Add comment


Security code
Refresh

Search

Products