Compressing images in PDF files

How can I most effectively compress scanned page images in PDF files without unduly degrading the visual quality?

I've been trying it with ImageMagick and a file that I want to compress, but so far without achieving much compression. The ImageMagick identify command, applied to one of the original pages, shows:

    PDF 595x842 595x842+0+0 16-bit Bilevel DirectClass 63.2KB 0.000u 0:00.009

I've been experimenting a little with the -compress, -density and -quality options of the convert command, but without as much progress as I would prefer. In most cases the output is larger than the input.
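For concreteness, the kind of invocation being tried would look roughly like this (file names are placeholders and the specific values are only examples):

    convert -density 150 input.pdf -compress JPEG -quality 75 output.pdf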

On Tue, Oct 16, 2012 at 10:00 AM, Jason White <jason@jasonjgw.net> wrote:
How can I most effectively compress scanned page images in PDF files without unduly degrading the visual quality?
I've been trying it with ImageMagick and a file that I want to compress, but so far without achieving much compression.
The ImageMagick identify command, applied to one of the original pages, shows: PDF 595x842 595x842+0+0 16-bit Bilevel DirectClass 63.2KB 0.000u 0:00.009
I've been experimenting a little with the -compress, -density and -quality options of the convert command, but without as much progress as I would prefer. In most cases the output is larger than the input.
I would think that the PDF composition utility would use Flate compression to store the graphic as a PDF stream object. So it should be compressed by default.

-Matt
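One quick way to check what encodings the embedded images actually use is pdfimages from poppler (the -list option needs a reasonably recent poppler build):

    pdfimages -list input.pdf

The "enc" column reports jpeg, ccitt, jbig2, or plain "image" (raw/Flate) for each embedded image.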

Matt Davis <mattdavis9@gmail.com> wrote:
I would think that the PDF composition utility would use Flate compression to store the graphic as a PDF stream object. So it should be compressed by default.
I am prepared to have the images resampled, however, which I understand ImageMagick can do. The image format can be different in the output file as well. Ensuring that zip compression is used on the data streams is important, but it isn't the only mechanism available.
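For reference, Ghostscript's pdfwrite device can do that resampling directly when rewriting a PDF. A minimal sketch, with the resolutions picked arbitrarily and file names as placeholders:

    gs -sDEVICE=pdfwrite -dNOPAUSE -dBATCH -dQUIET \
       -dDownsampleColorImages=true -dColorImageResolution=150 \
       -dDownsampleMonoImages=true -dMonoImageResolution=300 \
       -sOutputFile=resampled.pdf original.pdf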

Jason White <jason@jasonjgw.net> writes:
How can I most effectively compress scanned page images in PDF files without unduly degrading the visual quality?
Do you have access to the documents that built the PDF, i.e. the foo.tex and foo-1.jpg? If so, it should be easy -- just deal with the images before they enter the PDF: jpegoptim -m75, pngcrush, etc. I don't know how best to compress embedded vector images -- they're usually embedded as PDF (instead of EPS), but I guess you would do path simplification on the source in inkscape or whatever...
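For the raster images, that means something like this (file names are placeholders; pngcrush writes a new file rather than modifying in place):

    jpegoptim -m75 foo-1.jpg                  # lossy re-save, capping JPEG quality at 75
    pngcrush foo-2.png foo-2-crushed.png      # lossless PNG recompression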
I've been trying it with ImageMagick and a file that I want to compress, but so far without achieving much compression.
AFAICT imagemagick operates on PDFs by calling gs to do all the work. You could ask #imagemagick on freenode.
The ImageMagick identify command, applied to one of the original pages, shows: PDF 595x842 595x842+0+0 16-bit Bilevel DirectClass 63.2KB 0.000u 0:00.009
I've been experimenting a little with the -compress, -density and -quality options of the convert command, but without as much progress as I would prefer. In most cases the output is larger than the input.
I don't think imagemagick is the best tool for this. However, I did recently have success improving scanned receipts (which the scanner gave as JPEG-in-a-PDF) using pdfimages, then using imagemagick to reduce the image to a quarter of its size and convert it to a monochrome PNG. Don't forget +repage when you resize.
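A sketch of that sort of pipeline, assuming poppler's pdfimages and placeholder file names:

    pdfimages -j receipt.pdf page     # extract embedded images, keeping JPEGs as JPEG
    convert page-000.jpg -resize 25% +repage -monochrome receipt.png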

Trent W. Buck <trentbuck@gmail.com> wrote:
I don't think imagemagick is the best tool for this.
However, I did recently have success improving scanned receipts (which the scanner gave as JPEG-in-a-PDF) using pdfimages, then using imagemagick to reduce the image to a quarter of its size and convert it to a monochrome PNG. Don't forget +repage when you resize.
Thanks. That's very helpful.

On 16/10/12 13:08, Jason White wrote:
Trent W. Buck <trentbuck@gmail.com> wrote:
I don't think imagemagick is the best tool for this.
However, I did recently have success improving scanned receipts (which the scanner gave as JPEG-in-a-PDF) using pdfimages, then using imagemagick to reduce the image to a quarter of its size and convert it to a monochrome PNG. Don't forget +repage when you resize.
Hi Jason,

I have been using pdfsizeopt to resize PDF files. It uses jbig2 as the encoder, and the reduction in size varies. The last PDF went from 51MB to 13MB with little appreciable difference in quality.

http://code.google.com/p/pdfsizeopt/

Cheers,
Nic
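The invocation is along the lines of the following sketch (the exact script name and options vary between versions of pdfsizeopt, so treat this as an assumption):

    python pdfsizeopt.py input.pdf output.pdf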

Nic Baxter <nic@nicbaxter.com.au> wrote:
I have been using pdfsizeopt to resize PDF files. It uses jbig2 as the encoder, and the reduction in size varies. The last PDF went from 51MB to 13MB with little appreciable difference in quality.
Thanks for the reference. Simply using Ghostscript to rewrite the file took 600K off the size of the PDF document that I tried (3.2MB vs. 3.8MB, approximately).
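A minimal sketch of that kind of rewrite, assuming the stock pdfwrite device and placeholder file names:

    gs -sDEVICE=pdfwrite -dNOPAUSE -dBATCH -sOutputFile=rewritten.pdf original.pdf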

Jason White <jason@jasonjgw.net> writes:
Nic Baxter <nic@nicbaxter.com.au> wrote:
I have been using pdfsizeopt to resize PDF files. It uses jbig2 as the encoder, and the reduction in size varies. The last PDF went from 51MB to 13MB with little appreciable difference in quality.
Thanks for the reference.
Simply using Ghostscript to rewrite the file took 600K off the size of the PDF document that I tried (3.2MB vs. 3.8MB, approximately).
Another thing that is good to check when *creating* PDFs is whether you are using embedded or standard fonts. pdffonts lists which fonts the PDF uses, their types, and whether they're embedded. Using Times instead of Times New Roman can reduce the size by an order of magnitude for small text documents.
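For example:

    pdffonts document.pdf

The listing includes each font's name and type, plus an "emb" column indicating whether it is embedded.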

Jason White <jason@jasonjgw.net> wrote:
Simply using Ghostscript to rewrite the file took 600K off the size of the PDF document that I tried (3.2MB vs. 3.8MB, approximately).
Other levels of compression, with corresponding effects on image quality, are also possible. See http://www.peteryu.ca/tutorials/publishing/pdf_manipulation_tips
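The usual knob here is Ghostscript's -dPDFSETTINGS; a sketch with placeholder file names:

    gs -sDEVICE=pdfwrite -dPDFSETTINGS=/ebook -dNOPAUSE -dBATCH -sOutputFile=out.pdf in.pdf

/screen, /ebook, /printer and /prepress trade file size against image quality, in roughly that order.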

On Tue, 16 Oct 2012, Jason White wrote:
How can I most effectively compress scanned page images in PDF files without unduly degrading the visual quality?
I've been trying it with ImageMagick and a file that I want to compress, but so far without achieving much compression.
The ImageMagick identify command, applied to one of the original pages, shows: PDF 595x842 595x842+0+0 16-bit Bilevel DirectClass 63.2KB 0.000u 0:00.009
I've been experimenting a little with the -compress, -density and -quality options of the convert command, but without as much progress as I would prefer. In most cases the output is larger than the input.
It has been for-ever-and-a-half, but astro-ph (and other arXiv preprint servers) only ever accepted submissions of the order of ~1MB. This really sucks when you have 10 diagrams of 200,000 particles each.

I believe my way around it was to use jpeg2ps to generate PostScript images with embedded JPEGs, since the PostScript language mandates a JPEG decoder. But I'm not sure my memory and my poorly documented ("This program does... (author has been too lazy to update this)") shell scripts serve me correctly, as I had been thinking this was for PDFs and not PostScript images.

Seems it's not in Debian, but Google returns some promising links. Probably not DFSG free - a lot of LaTeX stuff isn't.

-- 
Tim Connors

On 25/10/12 18:33, Tim Connors wrote:
On Tue, 16 Oct 2012, Jason White wrote:
How can I most effectively compress scanned page images in PDF files without unduly degrading the visual quality?
I've been trying it with ImageMagick and a file that I want to compress, but so far without achieving much compression.
The ImageMagick identify command, applied to one of the original pages, shows: PDF 595x842 595x842+0+0 16-bit Bilevel DirectClass 63.2KB 0.000u 0:00.009
I've been experimenting a little with the -compress, -density and -quality options of the convert command, but without as much progress as I would prefer. In most cases the output is larger than the input.
It has been for-ever-and-a-half, but astro-ph (and other arXiv preprint servers) only ever accepted submissions of the order of ~1MB. This really sucks when you have 10 diagrams of 200,000 particles each. I believe my way around it was to use jpeg2ps to generate PostScript images with embedded JPEGs, since the PostScript language mandates a JPEG decoder.
But I'm not sure my memory and my poorly documented ("This program does... (author has been too lazy to update this)") shell scripts serve me correctly, as I had been thinking this was for PDFs and not PostScript images.
Seems it's not in Debian, but Google returns some promising links. Probably not DFSG free - a lot of LaTeX stuff isn't.
Source is at http://www.pdflib.com/download/free-software/jpeg2ps/ - it hasn't been altered since 2002. To go from there to PDF, just use Ghostscript's ps2pdf:

    ./jpeg2ps nesrin.jpg | ps2pdf - > nesrin.pdf

Interestingly, the PDF is a little smaller than the original JPEG.

JPEG is not always the best compression for scans of text - a GIF or PNG of a black/white image may be smaller and without edge artifacts. gif2ps is part of the giflib package. libpng is the core of a number of converters - see http://www.libpng.org/pub/png/pngapcv.html
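For a black/white scan, ImageMagick can also write CCITT Group 4 compression straight into a PDF, which tends to beat JPEG for bilevel text. A sketch with placeholder file names:

    convert scan.png -monochrome -compress Group4 scan.pdf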

Allan Duncan <amd2345@fastmail.com.au> wrote:
JPEG is not always the best compression for scans of text - a GIF or PNG of a black/white image may be smaller and without edge artifacts.
gif2ps is part of the giflib package.
Thanks, that's useful to know. Although I've solved my problem for now, it's likely to arise again from time to time, so the extra advice is helpful.
participants (6):
- Allan Duncan
- Jason White
- Matt Davis
- Nic Baxter
- Tim Connors
- trentbuck@gmail.com