Rearranging thousands of pages in multiple PDFs

marketconnections · March 26, 2012

Here's the Application

We get dozens of PDF files from our clients with anywhere from 15 to 5000 pages each.
We get corresponding data files which indicate page ranges within each file that are treated as mailable packages.
We have to pull these PDFs apart and re-arrange the pages into production streams and postal sortation order.
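
As a rough illustration of the rearrangement described above, the sketch below sorts packages into postal order and expands each page range into a flat list of pages. The field names (file, firstPage, lastPage, sortKey) and the sample values are hypothetical; the real data file layout is not shown in this thread.

```javascript
// Hypothetical package records, one per mailable package.
// Field names and values are assumptions, not the real data layout.
var packages = [
  { file: "batchA.pdf", firstPage: 1, lastPage: 2, sortKey: "90210-0003" },
  { file: "batchB.pdf", firstPage: 5, lastPage: 7, sortKey: "10001-0001" },
  { file: "batchA.pdf", firstPage: 3, lastPage: 3, sortKey: "60601-0002" }
];

// Sort a copy of the packages by sort key, then expand each
// page range into individual { file, page } entries.
function buildPageStream(pkgs) {
  var sorted = pkgs.slice().sort(function (a, b) {
    return a.sortKey.localeCompare(b.sortKey);
  });
  var stream = [];
  for (var i = 0; i < sorted.length; i++) {
    for (var page = sorted[i].firstPage; page <= sorted[i].lastPage; page++) {
      stream.push({ file: sorted[i].file, page: page });
    }
  }
  return stream;
}

var pageStream = buildPageStream(packages);
console.log(pageStream.map(function (e) {
  return e.file + ":" + e.page;
}).join(" "));
```

The source PDFs are never touched here; only the list of (file, page) references is reordered.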

How we currently do this:

Using 3rd party tools (PDFSAM and MacOS X Automator PDF Splitting) I turn each PDF file into thousands of enumerated individual files.

Various automated processes deal with figuring out production streams and sortation order.

I use Fusion Pro to generate final print output PDFs and VDX files from the thousands of reassembled PDF pages.

What I'd like to do:

Stop splitting into individual PDF files. The Splitting process is fraught with problems due to automation timing, can result in missing resources and can turn an optimized 50MB PDF into hundreds of MB of individual files.

What I've tried:

A Graphic Rule that basically does this

return PDF=CreateResource("OriginalBigPDF")

And dozens of Graphic Rules that do essentially this:

return Rule('PDF').Field('PageNo')

Of course, since a package can contain anywhere from 2 to 40 pages, I have 40 rules that do the above, each incrementing the value in Field('PageNo').

Problem:

Composition is very very slowww. Output files are enormous, and Fusion Pro Server chokes on files where the record count gets above 25. Using the Split File method, FPS can spit out a file with hundreds of records.

I assume the problem is that the OriginalBigPDF, which can be 30 to 50 MB in some cases, is being recast and embedded every record.

Question

Is there a way to optimize the process? Create resources that are stored only once, but from which on any given record I can extract a series of pages? I was thinking of creating Graphic Variables in OnJobStart, but I don't know how I'd call those variables in a graphic Rule that determines which page to extract based on data in that record. Or is pre-splitting the files the most efficient way to go?

esmith · March 26, 2012

I would create a template with two pages, one being an overflow page. Each page would have a full-page text frame with page 1 set to overflow to page 2, repeated as needed. A single text rule, similar to your current graphic rule, would extract the multiple pages per record needed and build a result with all of the required pages included using graphic tags. The result would be flowed into as many pages as needed to accommodate each record. See this thread for an example.
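
A minimal sketch of what such a text rule might look like. It uses the pagenumber and content properties that appear elsewhere in this thread; the CreateResource stub below is only a stand-in so the code runs outside FusionPro, and its placeholder string is not FusionPro's real graphic tag markup. The function name, the page arguments, and the use of a paragraph tag as separator are assumptions.

```javascript
// Stand-in for FusionPro's CreateResource so this runs outside FusionPro.
// The placeholder string is illustrative only, not real FusionPro markup.
function CreateResource(path) {
  return {
    pagenumber: 1,
    get content() {
      return "[" + path + " page " + this.pagenumber + "]";
    }
  };
}

// Hypothetical helper: collect one inline graphic per page of a package.
function buildPackageTags(pdfPath, firstPage, lastPage) {
  var r = CreateResource(pdfPath);
  var parts = [];
  for (var p = firstPage; p <= lastPage; p++) {
    r.pagenumber = p;      // select the page to extract
    parts.push(r.content); // tag text for that page
  }
  // Assumption: a paragraph break between graphics lets each one
  // flow into the next full-page frame.
  return parts.join("<p>");
}

var tags = buildPackageTags("OriginalBigPDF.pdf", 18, 20);
console.log(tags);
```

In a real text rule, the page range would come from the record's data fields rather than literal numbers.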

Depending on the contents of the client-provided PDFs and the output requirements, I might first consider trying to reduce the file sizes in Acrobat by downsampling, cropping, compressing, etc. prior to making them available as FP resources.

Not sure how large the resulting files would be, but I can't imagine they'd be any larger than what you already generate and the template itself would be much "tidier" which I believe would cut down on composition time.

marketconnections · March 26, 2012

Hi Eric,

I was considering using the overflow method for the sake of Template sanity (right now my template has 40 pages, turned on and off using OnRecordStart, depending on how many pages need to be composed for a particular record).

However, I don't think that'll help bring down the file size. I'm seeing some files compose output as large as 1GB based on 300 records that pull pages from 6 different 30 MB files. This is up substantially from when I was pulling in individual page files using CreateResource.

I don't want to optimize the client PDFs, since that often results in all kinds of strange artifacts. Plus, this means introducing additional processes into the workflow.

esmith · March 26, 2012

My guess is that there will be a direct correlation between file size of client PDFs and size of output file. As a rule, we have always placed client PDFs into InDesign and re-exported new PDFs using custom settings which downsample art for minimum digital press specs. This also allows us to catch issues with unembedded fonts, spot colors, transparencies, etc. Although it is an added workflow step, it reduces print errors caused by unseen issues.

Having said that, if you still believe the added prep time does not positively offset your composition/print/troubleshooting time, then your best bet might be to output records in smaller record chunks. Additionally, using a proprietary output format (e.g. JLYT for Indigo press) would cut down on RIP times over the generic PDF format.
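
To make the chunking idea concrete, here is a generic sketch that partitions a list of records into fixed-size chunks. It only illustrates the partitioning and is not how FusionPro itself performs chunked output. The chunk size of 100 is an arbitrary assumption, applied to the 300-record job mentioned above.

```javascript
// Split an array of records into consecutive chunks of at most chunkSize.
function chunkRecords(records, chunkSize) {
  var chunks = [];
  for (var i = 0; i < records.length; i += chunkSize) {
    chunks.push(records.slice(i, i + chunkSize));
  }
  return chunks;
}

// 300 placeholder records numbered 1..300.
var records = [];
for (var n = 1; n <= 300; n++) {
  records.push(n);
}

var chunks = chunkRecords(records, 100); // 100 is a hypothetical chunk size
console.log(chunks.length + " chunks, last one starts at record " + chunks[chunks.length - 1][0]);
```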

jurgmay · March 30, 2012

Sounds to me like the real issue is the manner in which you are receiving the files.

A 5000 page PDF that needs to be split and remerged does not seem like a very efficient process to begin with regardless of what tools you end up using to solve the problem. Can't you discuss the production issues with your client and have them review how they present the files in the first place?

Just a thought...

marketconnections · March 30, 2012

It would be very nice if we could train our customers, but there's no chance of that. The files come from a document management solution the client purchased, called Thunderhead. They come in batches of anywhere from letters to many thousands of letters in each of a few dozen files we get on a daily basis. The original spec called for investing in some horrible Pitney Bowes software called Streamweaver, that would pull apart the original postscript and rewrite it. I don't know if anyone has seen this software but it gave me night terrors when I evaluated it. I was very happy to be able to stick to Fusion Pro when I came up with the file split and reassembly solution. To be honest, it's nice that I don't have to worry about the content of the letters. The customer worries about that. I just need to take the letter data stream, postal sort it, print, lettershop and mail, so there's less to worry about on my end with constant changes to letter content.

I was simply hoping to make it more efficient by using the native ability of FP to pull arbitrary pages out from a PDF using PDF.pagenumber.

I did some controlled tests, with a single incoming PDF weighing in at 18MB, containing 3500+ pages. Each letter package was 2 PDF pages, to be duplexed. So there were just over 1700 letter packages in the file.

My old split file solution produced 3800 enumerated PDFs, totaling almost 1.8GB when split apart.

The data is sorted and then run through Fusion Pro where each record calls the appropriate PDFs in postal sort order and generates a new output file. Output is using 300 record chunks. Each of those chunk files is approx 320MB and takes around 30 seconds to produce.

Using the PDF.pagenumber method, I do not split the incoming file. I merely cast it as a PDF variable and call the page number based on the postal-sorted data to do the output reassembly. With that method, my chunks are almost 420MB each, and each chunk takes over 2 minutes to compose.
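
Putting the two sets of test figures side by side, as a quick worked comparison. The inputs are the approximate numbers quoted above; "over 2 minutes" is taken as 120 seconds, so the time ratio is a lower bound.

```javascript
// Approximate figures from the controlled test (300-record chunks).
var recordsPerChunk = 300;
var splitFiles = { chunkMB: 320, seconds: 30 };
var pageNumber = { chunkMB: 420, seconds: 120 }; // "over 2 minutes": lower bound

var sizeRatio = pageNumber.chunkMB / splitFiles.chunkMB;
var timeRatio = pageNumber.seconds / splitFiles.seconds;

console.log("MB per record, split files: " + (splitFiles.chunkMB / recordsPerChunk).toFixed(2));
console.log("MB per record, pagenumber:  " + (pageNumber.chunkMB / recordsPerChunk).toFixed(2));
console.log("Output size ratio: " + sizeRatio.toFixed(2) + "x");
console.log("Compose time ratio: at least " + timeRatio + "x");
```

So the unsplit approach produces roughly 30% more output per chunk and takes at least four times as long to compose.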

I'm sure this is due to the fact that I must cast the PDF into a variable on every record, despite the fact that the file only requires it once. If I could cast the PDF as a variable only once, in OnJobStart for instance, perhaps that would work better. But I don't know how to cast a graphic resource variable in OnJobStart such that I can retrieve it in a Graphic Rule to return PDF.pagenumber for the current record.

The above test is simplified of course. Normally, each output file requires me splitting and reassembling pages from several incoming PDFs, not just one.

jurgmay · March 30, 2012

This sounds similar to a project that we do here every day. The quantities are much smaller but I think we're doing similar things with FP in respect of pulling certain pages from a PDF file.

I'm not going to be able to get any code examples for a few days but what I can tell you is that I spent a lot of time learning JavaScript and really optimising the code that I'd written. You mention referencing variables in OnRecordStart but I'd definitely look at putting your code in the 'Globals' area as this code is run before anything else and allows you to store data in memory before composition begins.

The main speed improvement came from using an object and inserting the code in the 'Globals' section of the rule editor. On my project this will pre-load a lot of information into memory which is then accessed on a per-record basis. By using a JS Object the processing is sooooo much faster.
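
The actual code was never posted in this thread, so the following is only a guess at the general shape of the approach: an object declared once and used as a lookup table, then read by key on each record. All names (packageIndex, loadPackageIndex, lookupPackage) and the row layout are hypothetical.

```javascript
// Would be declared once, e.g. in JavaScript Globals (assumption).
var packageIndex = {};

// Hypothetical loader: index rows by package id so that each record
// can find its entry without scanning the whole list.
function loadPackageIndex(rows) {
  for (var i = 0; i < rows.length; i++) {
    packageIndex[rows[i].id] = {
      file: rows[i].file,
      firstPage: rows[i].firstPage,
      pageCount: rows[i].pageCount
    };
  }
}

// Per-record lookup: a single property access on the object.
function lookupPackage(id) {
  return Object.prototype.hasOwnProperty.call(packageIndex, id)
    ? packageIndex[id]
    : null;
}

loadPackageIndex([
  { id: "A001", file: "batchA.pdf", firstPage: 1, pageCount: 2 },
  { id: "A002", file: "batchA.pdf", firstPage: 3, pageCount: 4 }
]);

var hit = lookupPackage("A002");
console.log(hit.file + " starts at page " + hit.firstPage);
```

The gain comes from doing the expensive work once and paying only for a keyed lookup on each record.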

I'll see if I can get some code samples to you so you can understand what I mean but like I say, that'll be in a few days' time.

marketconnections · March 30, 2012

I'd be interested to see that code.

A: How do you cast a Graphic Resource into a variable in Globals, or OnJobStart for that matter?

B: How do you call that global Graphic Resource variable in a standard Graphic rule?

Something like:

Global: var TheGlobalPDF=pathtoTodaysPDFFile

Rules:

Graphic Rule: "CopyholeGraphic"

var thisPage=Field('PageNo')

return TheGlobalPDF.thisPage

So, on Record 1, "CopyholeGraphic" would pull Page 18 out of the Global Graphic Resource. On record 2, it would pull Page 2938, and so forth.

I've hunted but I can't figure a way to do this. You can "inject" Graphic Variables using AddGraphicVariable, and I assume you could do this in the Globals dialog. However, the resulting Variable can only be called, seemingly, in the Graphic Frame palette's Rule field. You can't call it in a Graphic Resource Rule -- or if you can, I'm unaware of what the syntax is. CreateResource() won't work. My simplified code shown above won't work.

Dan Korn · March 30, 2012

You want to do something like this in your Graphic rule:

var r = CreateResource(TheGlobalPDF);
r.pagenumber = Int(Field("PageNo"));
return r;

Or you could "inject" the variable in OnRecordStart:

var r = CreateResource(TheGlobalPDF);
r.pagenumber = Int(Field("PageNo"));
FusionPro.Composition.AddGraphicVariable("YourVariableName", r.content);

You can't inject variables in OnJobStart; you have to be in the context of a record. As for the JavaScript Globals, you can't really do much there other than define variables and functions. You could declare a function which calls FusionPro.Composition.AddGraphicVariable, but you would have to call the function from a per-record rule such as OnRecordStart.
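
A sketch of that last arrangement, with the function declared once and called per record. The first section consists purely of stand-ins so the code runs outside FusionPro; they are not FusionPro's implementations. The function name injectPage and the file path are hypothetical.

```javascript
// --- Stand-ins so the sketch runs outside FusionPro (assumptions) ---
var injected = {};
var FusionPro = {
  Composition: {
    AddGraphicVariable: function (name, value) { injected[name] = value; }
  }
};
function CreateResource(path) {
  return {
    pagenumber: 1,
    // Placeholder text only, not real FusionPro markup.
    get content() { return "[" + path + " page " + this.pagenumber + "]"; }
  };
}
function Int(value) { return parseInt(value, 10); }
var currentRecord = { PageNo: "18" };
function Field(name) { return currentRecord[name]; }

// --- Would be declared in JavaScript Globals ---
var TheGlobalPDF = "OriginalBigPDF.pdf"; // hypothetical path

function injectPage(variableName) {
  var r = CreateResource(TheGlobalPDF);
  r.pagenumber = Int(Field("PageNo"));
  FusionPro.Composition.AddGraphicVariable(variableName, r.content);
}

// --- Would be the body of OnRecordStart ---
injectPage("YourVariableName");
console.log(injected["YourVariableName"]);
```

Note that the resource is still created once per record here; the Globals declaration only avoids repeating the code, not the work.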

marketconnections · March 30, 2012

The first of your two examples is essentially how I'm doing it now, Dan. So, it doesn't sound like my issue with larger output files and slower processing can be fixed. Seems that the best option is to pre-split the large multipage PDFs into individual files and call them via standard Graphic Rules, as I've been doing?

marketconnections · April 2, 2012

I've done a number of further tests. There's something very wrong about the PDFs I'm getting from the client. They seem properly optimized, with thousands of pages fitting in 20 or 30 MB, and sometimes much less. But when I pull individual pages using pdf.pagenumber in FusionPro, the file sizes of the output files bloat up like crazy. If I take a PDF document that I create, with hundreds of pages, and assemble the output using the exact same methods, I get an output file that is only slightly larger than the original input PDF, which is to be expected.

Something about the way these documents are being generated is creating PDF code that is causing FP to generate 4 or 5 MB of data for every output page, regardless of whether the extracted page is blank, has text or has text and graphics.

Since this is not the case with other PDF files, I'm going to assume it's my client's fault.

Thanks for the input.
