
Pulling text from large tagged document


dbarbee


I'm setting up a variable letter that we merge frequently. I'm provided a Microsoft Word document that is completely static. I plan on copying and pasting this into a FusionPro text box, which preserves the formatting.

 

There are elements within this Microsoft Word Document that I would like to use in other parts of the variable letter. I am attempting to create these variables as JavaScript Global Variables. Attached is my code (In JavaScript Globals) so far:

 

var frame = FindTextFrame('Letter');
var letter = frame.content.split('</para>');
var lines = [];

//Strips out tagged text formatting in letter, and places the raw text in the 'lines' array
for (var i=0; i<letter.length; i++){
   if (letter[i] != '')
       lines.push(Trim(RawTextFromTagged(letter[i])));
}

lines = lines.filter(String); // tidies up the array by dropping empty strings

//Assumes Date is in 3rd line of letter. Would like to make this more robust, i.e. find the line that matches "Wednesday, April 1, 2020";
var EventDate = lines[2];

//Looks for the Company Name, and returns the two lines following.
for (var i=0; i<lines.length; i++){
   if (lines[i] == 'Company Name'){
       var EventAddress = lines[i+1];
       var EventCity = lines[i+2];
       break;
       }
   }

//Looks for "Guest" and returns everything after the colon;
for (var i=0; i<lines.length; i++){
   if (lines[i].search('Guest') == 0){
       var PresenterLine = lines[i].split(':');
       var GuestSpeaker = PresenterLine[1];
       break;
       }
   }

//Looks for a phone number... not working. Needs to match "(###)###-####." because it's usually at end of sentence.
for (var i=0; i<lines.length; i++){
   var words = lines[i].replace(') ',')').split(' '); //Removes space after parentheses so phone number ends up as one word.
   for (var j=0; j<words.length; j++){
       if (words[j].match(/\(?[\d]{3}\)?[\d]{3}?[\d]{4}$\./)){
           var CompanyPhone = words[j];
           break;
           }
       }	
   }

 

The biggest issue is I'm having trouble matching the phone number. Most of the time, the phone number is in the format "(###)###-####" but can deviate slightly. It's always at the end of the sentence, so it will end up with a period at the end.

 

I'm also wondering if there is a better way of extracting the date from this letter. It will always be on its own line in the format: "Wednesday, April 1, 2020"

Edited by dbarbee

Sounds like an interesting project.

 

If you could post any kind of example, at the very least the Word document, but preferably your collected template, that would make it a lot easier to follow what you're doing and offer specific suggestions, not just for how to extract the data from the Word document, but also for how to apply the extracted values in other places in the job.


Attached is a version of the document in question.

Thanks, that helps me better envision what you're trying to accomplish.

 

It looks like you already have it mostly working. The way you're parsing the lines of text from the frame is pretty clever.

 

Though I have to add a bit of a disclaimer here, in that I always recommend against this kind of fuzzy logic to parse out already formatted or composed output. It's a bit like trying to unmake soup into its ingredients. You're always better off dealing with the source data as much as possible. Presumably the Word document was created via some kind of mail merge, based on some original "raw" data. If you can get your hands on that original data, that would make things much more straightforward. But I assume that you don't have access to that, which is why you're trying to do this extraction in the first place.

 

The other caveat I would add is that I (or someone else here in the community) can help you to figure out some JavaScript magic to parse the data you supplied in that one Word document, but it's hard to know exactly how well that parsing logic will work based on only one set of data, which is all you have provided. If you could supply a couple more examples of these Word documents (i.e. data records), it would give me (or anyone else) a better idea of how much variability we're dealing with in the data, and how robust the parsing code needs to be to handle various edge cases.

 

All that said, I would suggest a couple of things. First, if you put this parsing logic into OnRecordStart, then you can call FusionPro.Composition.AddVariable for each extracted variable, which will allow you to use those composition variables directly in text frames, without having to actually create global JavaScript variables and rules for each. (Alternatively, you could call FusionPro.Composition.AddTextReplacement to directly replace markers in text, without even needing to insert text variables.) Also, you can make just one pass through the lines to find what you need.
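
As a rough illustration of the single-pass idea (this is not the code from the attached template), the parsing can live in a plain function that returns an object of captured values. The helper name captureFromLines, the field names, and the sample lines are all made up for this sketch; in OnRecordStart each name/value pair would then be handed to FusionPro.Composition.AddVariable.

```javascript
// Hypothetical helper: scans the lines once and collects every value it recognizes.
function captureFromLines(lines) {
    var captured = {};
    var weekday = /^(Monday|Tuesday|Wednesday|Thursday|Friday|Saturday|Sunday)\b/;
    var phone = /\(*\d{3}[\D]*\d{3}[\D]*\d{4}/;

    for (var i = 0; i < lines.length; i++) {
        var line = lines[i];

        // First line that starts with a weekday name is taken as the event date.
        if (!("EventDate" in captured) && weekday.test(line))
            captured["EventDate"] = line;

        // The two lines after "Company Name" are the address and city.
        if (line == "Company Name" && i + 2 < lines.length) {
            captured["EventAddress"] = lines[i + 1];
            captured["EventCity"] = lines[i + 2];
        }

        // Everything after the first colon on a line starting with "Guest".
        if (line.indexOf("Guest") == 0 && line.indexOf(":") > -1)
            captured["GuestSpeaker"] = line.substring(line.indexOf(":") + 1).replace(/^\s+|\s+$/g, "");

        // First phone-like run of digits found anywhere in the letter.
        var m = line.match(phone);
        if (m && !("CompanyPhone" in captured))
            captured["CompanyPhone"] = m[0];
    }
    return captured;
}

// Made-up sample data standing in for the parsed text frame.
var sampleLines = [
    "Dear Customer,",
    "You are invited!",
    "Wednesday, April 1, 2020",
    "Company Name",
    "123 Main Street",
    "Springfield, IL 62704",
    "Guest Speaker: Jane Doe",
    "To RSVP, call us at (555)555-5555."
];
var capturedVars = captureFromLines(sampleLines);

// Inside FusionPro's OnRecordStart, each value would then be registered.
if (typeof FusionPro != "undefined") {
    for (var name in capturedVars)
        FusionPro.Composition.AddVariable(name, capturedVars[name]);
}
```

Keeping the parsing in one function also makes it easy to try against several sample letters before wiring it into the template.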

 

The attached template shows how to do this. Note that I've removed many of the rules in favor of simply calling FusionPro.Composition.AddVariable. I've also removed everything in the JavaScript Globals. Now, if you do need other logic to massage that data, then you'll need to either put that logic into OnRecordStart, like I've done for the "Specialist Name Only" field, or you'll need to move the first line of OnRecordStart (var capturedVars = {};) to the JavaScript Globals and then, in other rules, do something like this:

if (FusionPro.inValidation)
   Rule("OnRecordStart");

var val = capturedVars["Specialist Name Only"];
// do something with val...

 

 

As for finding the phone number, I got it to work with this:

line.match(/\(*\d{3}[\D]*\d{3}[\D]*\d{4}/);

Though when you say that the format "can deviate slightly," as noted above, this is where I would need to know a little more about those variations in order to write code to handle them.
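
For context, the pattern in the original post could never match for two reasons: it puts \. after the $ anchor, so it asks for a period after the end of the string, and it makes no allowance for the hyphen between the digit groups. The sketch below wraps the expression above in a small helper (findPhone is a hypothetical name, and the sample sentence is made up) to show that match returns an array, so the phone number itself is element 0, without the trailing period.

```javascript
// Hypothetical helper: returns the first phone-like substring in a line, or "" if none.
function findPhone(line) {
    var m = line.match(/\(*\d{3}[\D]*\d{3}[\D]*\d{4}/);
    return m ? m[0] : "";
}

var sentence = "Please call our office at (555) 555-5555."; // made-up sample sentence
var phone = findPhone(sentence);
```

Because the expression is run against the whole line, there is no need to split the line into words first.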

 

Parsing the date is a bit trickier. If you know it's always going to be a line starting with an English weekday name, then it seems pretty simple:

if (/^((Monday)|(Tuesday)|(Wednesday)|(Thursday)|(Friday)|(Saturday)|(Sunday))/.test(line))

If the line doesn't always start with a weekday name, that's trickier. But I would need to know all the possible formats to look for in order to code up something to handle those other cases. (See previous comment about not knowing about any other records of data other than the single one provided.)
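
If the date really is always on its own line in the "Wednesday, April 1, 2020" format, a stricter test is possible. The following is only a sketch under that assumption: it requires the whole line to be weekday, month name, day, and four-digit year, so a sentence that merely starts with a weekday name is skipped. The names dateLinePattern and findDateLine and the sample lines are made up for illustration.

```javascript
// Assumes the exact layout "Weekday, Month D, YYYY" with English names.
var dateLinePattern = /^(Monday|Tuesday|Wednesday|Thursday|Friday|Saturday|Sunday),\s+(January|February|March|April|May|June|July|August|September|October|November|December)\s+\d{1,2},\s+\d{4}$/;

// Hypothetical helper: returns the first line that is nothing but a date, or "".
function findDateLine(lines) {
    for (var i = 0; i < lines.length; i++) {
        if (dateLinePattern.test(lines[i]))
            return lines[i];
    }
    return "";
}

// Made-up sample lines; the first one starts with a weekday but is not a date.
var dateSample = [
    "Wednesday is our busiest day.",
    "Wednesday, April 1, 2020",
    "Company Name"
];
var eventDate = findDateLine(dateSample);
```

Any other date layout would need its own pattern, which is why more sample records matter here.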

 

All that is also in the attached template.

 

Hope this helps, and thanks again for sharing the template!

Invite - Clean - Dan-3.pdf

Edited by Dan Korn

This is brilliant, and I learned several new things. Thank you for this.

 

The deviations in the phone number are mostly minor. The three formats I've seen are below, but it's mostly presented exactly like the sample:

(555)555-5555

(555) 555-5555

555-555-5555


Glad I could help! I learned a couple things too.

I think the Regular Expression I came up with will handle those cases as well. Post back if you run into something that it doesn't match.
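
A quick way to check that claim is to run the expression over the three formats listed. The sketch below does that, and also shows an optional tighter variant, which is an assumption of this example and not something from the template: since [\D]* accepts any run of non-digits, the original expression would also match unrelated numbers separated by words, while the tighter one allows at most one space, dot, or hyphen between groups.

```javascript
var loosePhone = /\(*\d{3}[\D]*\d{3}[\D]*\d{4}/;             // expression from the post above
var tightPhone = /\(?\d{3}\)?[\s.-]?\d{3}[\s.-]?\d{4}/;      // hypothetical stricter variant

var formats = ["(555)555-5555", "(555) 555-5555", "555-555-5555"];

// Collect which of the listed formats each expression accepts.
var looseResults = [];
var tightResults = [];
for (var i = 0; i < formats.length; i++) {
    looseResults.push(loosePhone.test(formats[i]));
    tightResults.push(tightPhone.test(formats[i]));
}

// Made-up counterexample: three unrelated numbers in one sentence.
var noise = "123 apples 456 pears 7890";
var looseMatchesNoise = loosePhone.test(noise);
var tightMatchesNoise = tightPhone.test(noise);
```

Both expressions accept all three listed formats; only the loose one also accepts the made-up counterexample. Whether that matters depends on what else appears in the letters.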


One question: Is it possible to create resources or variables like this in OnJobStart? Thinking about the efficiency/speed in composition.

Variables are generally per-record. That's the whole idea of variable data.

 

You could create text replacements on a per-job basis.

 

Or you could set up that capturedVars object as a global, populate it in OnJobStart, then just do the last few lines in OnRecordStart to iterate through that object and call FusionPro.Composition.AddVariable for each property.
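
That last option might be laid out roughly as follows. This is a sketch only: the functions onJobStart and onRecordStart stand in for the bodies of the two callback rules, registerVariable is a hypothetical wrapper so the example can run outside FusionPro, and the captured values are hard-coded where the real job would parse the letter.

```javascript
// JavaScript Globals: shared by all callback rules.
var capturedVars = {};
var registered = {}; // stand-in log used only when FusionPro is not present

// Hypothetical wrapper around FusionPro.Composition.AddVariable.
function registerVariable(name, value) {
    if (typeof FusionPro != "undefined")
        FusionPro.Composition.AddVariable(name, value);
    else
        registered[name] = value;
}

// Body of OnJobStart: parse once per job (values hard-coded here for illustration).
function onJobStart() {
    capturedVars["EventDate"] = "Wednesday, April 1, 2020";
    capturedVars["CompanyPhone"] = "(555)555-5555";
}

// Body of OnRecordStart: only re-register the already-parsed values.
function onRecordStart() {
    for (var name in capturedVars)
        registerVariable(name, capturedVars[name]);
}

onJobStart();
onRecordStart();
```

The per-record work is then just a short loop over a handful of properties, with the parsing done once.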


Is there a way to reference variables created with FusionPro.Composition.AddVariable in other rules? In particular, I need to return a Graphic of the presenter.

I talked about this in my previous post:

Now, if you do need other logic to massage that data, then you'll need to either put that logic into OnRecordStart, like I've done for the "Specialist Name Only" field, or you'll need to move the first line of OnRecordStart (var capturedVars = {};) to the JavaScript Globals and then, in other rules, do something like this:

if (FusionPro.inValidation)
   Rule("OnRecordStart");

var val = capturedVars["Specialist Name Only"];
// do something with val...

Note that in OnRecordStart, you can also call FusionPro.Composition.AddGraphicVariable(), so you could do something in that loop like:

FusionPro.Composition.AddGraphicVariable("Specialist Photo", Resource(capturedVars["Specialist Name Only"]));

Or call another mapping if the resource names don't exactly correspond to the names in the data.
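
Such a mapping could be as simple as a lookup object. In this sketch every name is invented for illustration: the presenter names, the resource names, the photoResourceName helper, and the "Default_Photo" fallback would all need to match whatever actually exists in the job.

```javascript
// Hypothetical lookup from the name found in the letter to a resource name in the template.
var photoResourceByName = {
    "Jane Doe": "JaneDoe_Headshot",
    "John Smith": "JSmith_Photo"
};

// Returns the mapped resource name, or an assumed fallback when the name is unknown.
function photoResourceName(specialistName) {
    if (photoResourceByName.hasOwnProperty(specialistName))
        return photoResourceByName[specialistName];
    return "Default_Photo";
}

// In OnRecordStart the result would replace the direct lookup shown above:
// FusionPro.Composition.AddGraphicVariable("Specialist Photo",
//     Resource(photoResourceName(capturedVars["Specialist Name Only"])));
```

A fallback of some kind is worth having, since a misspelled or new presenter name would otherwise point at a resource that does not exist.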

Edited by Dan Korn
