How to convert a web page to a PDF using iText.Net and a WebBrowser object

by Heathesh 12. August 2010 06:09

Disclaimer: This is a hack. It uses the WebBrowser object to take screen captures of the page and uses those to populate the PDF. I did it in my spare time for fun, and I would not advise using it in a production environment without thorough testing and nerves of steel.

I've noticed a few websites that charge a fee to convert a web page to a PDF, so I was curious whether there was an open source alternative. I know there are libraries to create PDFs, but all my googling could not find one that would convert a web page or HTML to PDF. Having played around with the iText.Net (http://sourceforge.net/projects/itextdotnet/) libraries before, I wanted to see if there was a way to use them to accomplish this for free.

To begin with, you need to download the relevant libraries from http://sourceforge.net/projects/itextdotnet/. For convenience I've added the DLLs I used here (zipped, 2,816 KB download):

http://heathesh.com/ftp/itextnet.zip

I'm using Visual Studio 2008 for this and have not tried it in Visual Studio 2010. Start by creating a Console Application project. The code needs to run in a single-threaded apartment state, so using it from a web page or web service would require you to do more "hacking". You can see my previous post "Run a single-threaded apartment method with parameters that returns a value within a web service" on what is needed to achieve that.

Once your project has been created, add all the DLLs from the ZIP file as references to your project. Also add references to the following .Net DLLs:

System.Drawing
System.Windows.Forms

Now add the following usings to your console app Program.cs file:

//NOTE: Do not add a using for System.Drawing, as its class names (Image, Rectangle and so on) conflict with classes in the PDF libraries
using System.IO;
using System.Windows.Forms;
using com.lowagie.text;
using com.lowagie.text.pdf;


Next you need to add the following method to your code. The method loads a WebBrowser control, navigates to the specified URL and takes a screenshot of the web page. Passing -1 as the width or height means "use the page's full width or height".

        /// <summary>
        /// Generate the screen shot image for the specified URL
        /// </summary>
        /// <param name="url"></param>
        /// <param name="width"></param>
        /// <param name="height"></param>
        /// <returns></returns>
        public static System.Drawing.Bitmap generateScreenshotImage(string url, int width, int height)
        {
            // Load the webpage into a WebBrowser control
            using (WebBrowser webBrowser = new WebBrowser())
            {
                //disable the scroll bars and suppress script errors, then navigate to the url
                webBrowser.ScrollBarsEnabled = false;
                webBrowser.ScriptErrorsSuppressed = true;
                webBrowser.Navigate(url);

                //wait for the page to load
                while (webBrowser.ReadyState != WebBrowserReadyState.Complete) { Application.DoEvents(); }

                // Set the size of the WebBrowser control
                webBrowser.Width = width;
                webBrowser.Height = height;

                if (width == -1)
                    // Take a screenshot of the web page's full width
                    webBrowser.Width = webBrowser.Document.Body.ScrollRectangle.Width;

                if (height == -1)
                    // Take a screenshot of the web page's full height
                    webBrowser.Height = webBrowser.Document.Body.ScrollRectangle.Height;

                // Get a Bitmap representation of the webpage as it's rendered in the WebBrowser control
                System.Drawing.Bitmap bitmap = new System.Drawing.Bitmap(webBrowser.Width, webBrowser.Height);
                webBrowser.DrawToBitmap(bitmap, new System.Drawing.Rectangle(0, 0, webBrowser.Width, webBrowser.Height));
                return bitmap;
            }
        }
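
One thing to be aware of with the wait loop above: if the page never reaches the Complete state, the loop spins forever. Below is a minimal sketch of a wait-with-timeout helper that could stand in for it. The names Waiter and WaitUntil are made up for illustration and are not part of the WebBrowser API, and the 30 second timeout in the usage comment is just an assumed value.

```csharp
using System;
using System.Diagnostics;

public static class Waiter
{
    // Hypothetical helper (not from the original post): polls 'isDone',
    // calling 'pump' between checks, and gives up once 'timeout' has elapsed.
    // Returns true if the condition was met, false if the wait timed out.
    public static bool WaitUntil(Func<bool> isDone, Action pump, TimeSpan timeout)
    {
        Stopwatch stopwatch = Stopwatch.StartNew();
        while (!isDone())
        {
            if (stopwatch.Elapsed > timeout)
                return false;
            pump();
        }
        return true;
    }
}

// Possible usage inside generateScreenshotImage (assumed, untested against a live page):
//   bool loaded = Waiter.WaitUntil(
//       () => webBrowser.ReadyState == WebBrowserReadyState.Complete,
//       Application.DoEvents,
//       TimeSpan.FromSeconds(30));
//   if (!loaded) throw new TimeoutException("Page did not finish loading: " + url);
```

The pump argument is there because the WebBrowser control only makes progress while the message loop is being serviced, which is what Application.DoEvents does in the original loop.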

Okay, so we've now got a method to generate a screenshot of the web page. Next we need to be able to retrieve the bytes of the bitmap the above method returns, encoded as a PNG. So add the following method to your code:

        /// <summary>
        /// Gets the byte array for the specified bitmap
        /// </summary>
        /// <param name="bitmap"></param>
        /// <returns></returns>
        private static byte[] getBytesForBitmap(System.Drawing.Bitmap bitmap)
        {
            using (MemoryStream memoryStream = new MemoryStream())
            {
                bitmap.Save(memoryStream, System.Drawing.Imaging.ImageFormat.Png);
                //ToArray returns only the bytes written; GetBuffer would also include unused buffer space
                return memoryStream.ToArray();
            }
        }
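
A quick note on getting bytes out of a MemoryStream, since it matters when those bytes are handed to another library: ToArray returns a copy of exactly the bytes that were written, while GetBuffer returns the stream's whole internal buffer, which is usually larger and padded with zeros. The small sketch below shows the difference. The BufferDemo class and its method names are made up for illustration.

```csharp
using System;
using System.IO;

public static class BufferDemo
{
    // Length of the array returned by ToArray: exactly the bytes written.
    public static int ExactLength(byte[] data)
    {
        using (MemoryStream stream = new MemoryStream())
        {
            stream.Write(data, 0, data.Length);
            return stream.ToArray().Length;
        }
    }

    // Length of the array returned by GetBuffer: the full internal buffer,
    // which includes any capacity that has not been written to yet.
    public static int BufferLength(byte[] data)
    {
        using (MemoryStream stream = new MemoryStream())
        {
            stream.Write(data, 0, data.Length);
            return stream.GetBuffer().Length;
        }
    }

    public static void Main()
    {
        byte[] data = new byte[] { 1, 2, 3 };
        Console.WriteLine("ToArray length:   " + ExactLength(data));
        Console.WriteLine("GetBuffer length: " + BufferLength(data));
    }
}
```

For image or PDF data the trailing padding is at best wasted space, so ToArray is the safer choice unless you also track the stream's Length yourself.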

We're almost done. We now need to add methods to use the bitmap created above and create the actual PDF. For this I created and added the following four methods. The code comments should be self-explanatory:

        /// <summary>
        /// Converts the specified url to a pdf file
        /// </summary>
        /// <param name="url"></param>
        /// <param name="fileName"></param>
        private static void convertUrlToPdf(string url, string fileName)
        {
            byte[] pdfBytes = convertWebPageToPdf(url);

            //create (or overwrite) the file and write the PDF bytes to it
            using (FileStream fileStream = new FileStream(fileName, FileMode.Create, FileAccess.Write))
            {
                //the using block flushes and closes the stream for us
                fileStream.Write(pdfBytes, 0, pdfBytes.Length);
            }
        }

        /// <summary>
        /// Converts the specified url to a PDF byte array
        /// </summary>
        /// <param name="url"></param>
        /// <returns></returns>
        private static byte[] convertWebPageToPdf(string url)
        {
            // step 1: creation of a document-object
            Document document = new Document();

            // step 2:
            // we create a writer that listens to the document
            // and directs a PDF-stream to a memory stream
            MemoryStream memoryStream = new MemoryStream();
            PdfWriter.getInstance(document, memoryStream);

            // step 3: we open the document
            document.open();

            // step 4: get the screen shot of the web page and add it to the document
            using (System.Drawing.Bitmap screenshot = generateScreenshotImage(url, 1020, -1))
            {
                //if there is more than one page, split the image otherwise just add the image
                if (screenshot.Height > 1500)
                    separatePages(document, screenshot);
                else
                    addImage(document, screenshot);
            }

            // step 5: we close the document
            document.close();
           
            //return the byte[] of the pdf (ToArray returns only the bytes written, without unused buffer space)
            return memoryStream.ToArray();
        }

        /// <summary>
        /// Add the image to the PDF document
        /// </summary>
        /// <param name="document"></param>
        /// <param name="screenshot"></param>
        private static void addImage(Document document, System.Drawing.Bitmap screenshot)
        {
            Image png = Image.getInstance(getBytesForBitmap(screenshot));
            png.scalePercent(50);
            document.add(png);
        }

        /// <summary>
        /// Separates the pages of the bitmap into the PDF document
        /// </summary>
        /// <param name="document"></param>
        /// <param name="screenshot"></param>
        private static void separatePages(Document document, System.Drawing.Bitmap screenshot)
        {
            int remainder = screenshot.Height % 1500;
            int pages = screenshot.Height / 1500 + (remainder > 0 ? 1 : 0);
            int y = 0;
            int height = 1500;

            for (int i = 0; i < pages; i++)
            {
                //if this is the last page, and we have a remainder, we need to adjust the height accordingly
                if (i == pages - 1 && remainder > 0)
                    height = screenshot.Height - y;

                //use the screenshot's own width rather than a hard-coded 1020
                using (System.Drawing.Bitmap pageBitmap = screenshot.Clone(new System.Drawing.Rectangle(0, y, screenshot.Width, height), System.Drawing.Imaging.PixelFormat.DontCare))
                {
                    //add the image
                    addImage(document, pageBitmap);

                    //increment the height counter to move to the next page
                    y += 1500;
                }
            }
        }
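
To make the page-splitting arithmetic easier to follow, here is a small standalone sketch that only computes the slices, without touching any bitmaps or the PDF library. The PageSlicer class and Slices method are hypothetical names used for illustration; the 3200 pixel height is an arbitrary example value.

```csharp
using System;
using System.Collections.Generic;

public static class PageSlicer
{
    // Returns one { y, height } pair per slice needed to cover 'totalHeight'
    // pixels in steps of at most 'pageHeight' pixels. Only the last slice
    // can be shorter than 'pageHeight'.
    public static List<int[]> Slices(int totalHeight, int pageHeight)
    {
        List<int[]> slices = new List<int[]>();
        for (int y = 0; y < totalHeight; y += pageHeight)
        {
            int height = Math.Min(pageHeight, totalHeight - y);
            slices.Add(new int[] { y, height });
        }
        return slices;
    }

    public static void Main()
    {
        foreach (int[] slice in Slices(3200, 1500))
            Console.WriteLine("y=" + slice[0] + " height=" + slice[1]);
        // prints:
        // y=0 height=1500
        // y=1500 height=1500
        // y=3000 height=200
    }
}
```

As for why 1020 by 1500 at 50 percent works: assuming the Document defaults to an A4 page (595 by 842 points) with 36 point margins, the usable area is 523 by 770 points, and a 1020 by 1500 pixel slice scaled to 50 percent comes to 510 by 750, which just fits. That is an inference from the numbers rather than something verified against the library, so treat it as a starting point if you change the sizes.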

Okay, so we've now got everything we need. In our Main method we simply need to call the convertUrlToPdf method with a URL and PDF file name and it will generate a PDF of the website for us. There is just one catch: the WebBrowser control can only run in a single-threaded apartment state, so we need to decorate our Main method with the STAThread attribute:

        [STAThread] //run in single-threaded apartment state
        static void Main(string[] args)
        {
            convertUrlToPdf("http://iservice.co.za", "iservice.pdf");
        }


That's basically it. Run the code and you'll find iservice.pdf in your output folder (bin\Debug, depending on your configuration mode).

Happy PDFing!


Development | .Net | Visual Studio 2008 | VS2008 | PDF

Run a single-threaded apartment method with parameters that returns a value within a web service

by Heathesh 21. July 2010 20:22

The need arose for me to run a method that used the WebBrowser object within a web service. The problem I had is that you cannot use a WebBrowser object unless it runs in a single-threaded apartment. Within a Windows Forms or console application it's easy enough to set the apartment state by decorating your Main method with the [STAThread] attribute or using:

//set the current thread's apartment state
Thread.CurrentThread.SetApartmentState(ApartmentState.STA);


That does not work within a web service. My method also needed to accept parameters and to return a value. The method I needed to run looked like this:

public static byte[] GenerateScreenshot(string url, CustomImageFormat imageFormat, int width, int height)

This method was contained within a manager class I had created called GraphicsManager. As you can see it returns a byte[] and needs to accept four parameters. To call this method from the web service I first created a static private variable (make sure you create static variables) in my web service class for each of the parameters as well as the byte array like so:

        /// <summary>
        /// The returned image
        /// </summary>
        private static byte[] _returnedImage = null;
        /// <summary>
        /// The url
        /// </summary>
        private static string _url = string.Empty;
        /// <summary>
        /// The custom image format
        /// </summary>
        private static CustomImageFormat _imageFormat = CustomImageFormat.Gif;
        /// <summary>
        /// The width
        /// </summary>
        private static int _width = -1;
        /// <summary>
        /// The height
        /// </summary>
        private static int _height = -1;


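I set default values for the variables simply out of habit. Next I created a static method to execute my manager method on the single-threaded apartment thread. Notice that the method is static and that I decorated it with the [STAThread] attribute. (Strictly speaking, the attribute only has an effect on an application's entry point, so here it serves as documentation. It is the SetApartmentState call in the web method further down that actually puts the thread into a single-threaded apartment.)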

        /// <summary>
        /// Static method to run under Apartment State
        /// </summary>
        [STAThread]
        private static void generateScreenShot()
        {
            _returnedImage = GraphicsManager.GenerateScreenshot(_url, _imageFormat, _width, _height);
        }


As you can see, I'm using the private variables I set up above in the call to GraphicsManager.GenerateScreenshot. The last thing I needed to do was to create the WebMethod on my web service itself. I did this like so:

        /// <summary>
        /// Generates a portion of a screen shot image of the URL specified with the width and height specified and returns the byte stream of it in the specified format
        /// </summary>
        /// <param name="url">String of the URL to thumb nail</param>
        /// <param name="imageFormat">CustomImageFormat enum of the image format to return</param>
        /// <param name="width">Int of the width of the image to be returned</param>
        /// <param name="height">Int of the height of the image to be returned</param>
        /// <returns>Byte array of the thumb nail in the CustomImageFormat specified</returns>
        [WebMethod(Description = @"Generates a portion of a screen shot image of the URL specified with the width and height specified and returns the byte stream of it in the specified format.")]
        public byte[] GenerateWebsiteScreenshot(string url, CustomImageFormat imageFormat, int width, int height)
        {
            //set the private static fields to the values passed in
            _url = url;
            _imageFormat = imageFormat;
            _width = width;
            _height = height;

            Thread apartmentThread = new Thread(new ThreadStart(generateScreenShot));

            //set the apartment state to single-threaded apartment
            apartmentThread.SetApartmentState(ApartmentState.STA);
            apartmentThread.Start();

            //block the calling thread so as to wait for the method to complete processing
            apartmentThread.Join();

            //return the private static field that was populated on the apartment state thread
            return _returnedImage;
        }
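
One caveat with passing values through static fields is that two requests arriving at the same time share those fields and can overwrite each other's values. A possible alternative, sketched below, is to capture the parameters and the result in a lambda so each call has its own state. The StaRunner class and its Run method are hypothetical names, not part of the .NET Framework, and the sketch assumes Windows, since STA threads are a COM concept.

```csharp
using System;
using System.Threading;

public static class StaRunner
{
    // Hypothetical helper: runs 'work' on a new single-threaded apartment
    // thread, waits for it to finish, and returns its result. Any exception
    // thrown by 'work' is captured and rethrown on the calling thread.
    public static T Run<T>(Func<T> work)
    {
        T result = default(T);
        Exception error = null;

        Thread thread = new Thread(() =>
        {
            try { result = work(); }
            catch (Exception ex) { error = ex; }
        });

        // Must be set before the thread starts (Windows only).
        thread.SetApartmentState(ApartmentState.STA);
        thread.Start();

        // Block the caller until the work has completed.
        thread.Join();

        if (error != null)
            throw new InvalidOperationException("The STA worker failed.", error);
        return result;
    }
}

// Possible usage in the web method (assumed, using the names from this post):
//   return StaRunner.Run(() =>
//       GraphicsManager.GenerateScreenshot(url, imageFormat, width, height));
```

Because the lambda closes over the method's own parameters, no static fields are needed and concurrent calls no longer interfere with one another.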


That was it. My web service method could now run and utilize the web browser object with no problems...

Happy threading!


.Net | Web Services
