June 28, 2009

Web Service for Document Conversion – an Odyssey

Filed under: .Net,PHP,Programming,T3city — pj @ 2:02 pm

A couple of years ago, I needed a way to convert Microsoft Word documents to Pdf from a C# program. The application I was working on processed hundreds of documents and was run by the system scheduler every day at around 3am, so manual conversion was not an option. I wasn’t in control of the source documents, so I had to accept the documents the way they were given to me. I needed to do additional processing on the documents, so I wanted to convert them into a universal format. I already had a good library for reading Pdf files. After researching my options, I settled on using OpenOffice to do the conversion. OpenOffice has a pretty good Word filter, the ability to create Pdfs and an automation interface accessible to all .Net languages, including C#, so it was a good fit. I know there are commercial solutions and ways to automate Microsoft Office, but the OpenOffice solution was free and fairly easy to use.

Recently, I upgraded my development system from OpenOffice 2.x to OpenOffice 3.1. I can’t remember now the main reason I upgraded, but I was looking forward to being able to add the ability to convert docx to pdf (Office Open XML support was added in OpenOffice 3.x). I figured the upgrade might require some minor changes to my document conversion code, but it turned out not to be so simple.

Conversion Server Demo

The first problem was that OpenOffice no longer showed up under “Add Reference” in Visual Studio. After some Google searches, I finally stumbled across a solution. OK, wow, digging through install cab files for DLLs, hacking the registry and manipulating the PATH is ugly, but it worked – at least once. Almost as soon as my test program ran successfully, Visual Studio locked up. The only way to get control back was to kill the Visual Studio process, then restart it. Every time I opened the solution file, after just a few seconds, Visual Studio would lock up again. Removing every trace of the OpenOffice DLL references would prevent the lockups, but without OpenOffice, I could not convert documents. I eventually figured out that the problem was that the library project that used OpenOffice was also included in my web site project. Because the web site uses the library, too, the OpenOffice reference is automatically copied to the web site project. Normally, this would be no big deal – just a couple of small DLL files in the web site bin folder that would not be used – but in this case, Visual Studio could not successfully build the web site project as long as the OpenOffice DLLs were referenced.

Now that using OpenOffice was no longer as simple as “Add Reference”, it was time to take a look at a way to decouple my application from the OpenOffice DLLs. The API my application needs is pretty simple: given the bytes of a file in Doc (or .docx) format, return the bytes of the file in Pdf format. Also, there are security risks whenever you open a file in a complicated piece of software like OpenOffice. These two facts pointed me to turning the Pdf conversion into a web service. The client application and web service can be partitioned across computers, so, if I wanted to, I could run the Pdf conversion service on a virtual machine with its own install of OpenOffice and tight security and resource control. The web service API didn’t need fancy XML: clients could post the Doc file via a standard HTTP MIME encoded form. The result would be the PDF byte stream returned via HTTP. This would make the service easy to develop, test and use from clients on many platforms.

I typically develop my large web based applications in C# and Asp.Net. However, for small services like document conversion, I often use PHP. Simple PHP applications can be just a file or two and I can move a PHP based service over to any web server that supports PHP. So, my first attempt at a document conversion service was written in PHP and used the OpenOffice COM API. Using my earlier OpenOffice API code as a guide and some PHP sample code I found on the web, I was able to quickly create a command line PHP application that could perform Pdf conversions. With this proof of concept working, I coded up the PHP web service. But the web service wouldn’t work. I could see in Task Manager that the OpenOffice executables were launched, but any code running under the web server that tried to access the OpenOffice COM objects would freeze up.

I spent a lot of time trying different security settings for various files and DCOM. I even tried creating a dedicated user and configuring DCOM to run OpenOffice as the dedicated user when the OpenOffice COM objects were activated. I changed Windows event log settings to log all security violations. Nothing would work and I couldn’t see anything useful in Event Viewer. I tried coding up a simple Asp.Net application and got the same results under the .Net OpenOffice API. Under both PHP/COM and C#/.Net, OpenOffice could be automated by a command line program but would not run under IIS. Right now, I develop on Vista and deploy to Windows Server 2008, so one might think the problem was trying to use IIS 7, but OpenOffice 2 didn’t have similar problems.

Finally, I gave up trying to run OpenOffice directly under IIS. The problem was either some type of security issue or that OpenOffice was trying to pop up registration dialogs. In Windows, the simple way to run a program under a given user account is to run the program as a Windows Service. It turns out that it is pretty easy to create a Windows Service in C#. The basic idea I came up with was:

  1. Install OpenOffice.
  2. Create a Windows user account dedicated to Pdf conversion.
  3. Login as the Pdf conversion user and make sure that OpenOffice opens without prompting for registration, crash feedback, etc. (I suspect that the reason OpenOffice 3 won’t run properly under IIS is that it is prompting the user to register).
  4. Create a Windows Service with a simple API to convert documents.
  5. Install the Windows Service and configure it to run as the Pdf conversion user.
  6. Access the Windows Service from my document conversion Web Service.

Here is how the API calls flow: Application -> Web Service -> Windows Service
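Step 5 of the list above (installing the service so it runs as the dedicated user) is not shown in this post. A minimal sketch of one way to do it with the standard .Net installer classes follows. The class name ConversionServiceInstaller is hypothetical, and the sketch assumes the assembly is installed with InstallUtil.exe and references System.Configuration.Install.dll and System.ServiceProcess.dll:

```csharp
using System.ComponentModel;
using System.Configuration.Install;
using System.ServiceProcess;

// Hypothetical installer class; InstallUtil.exe looks for classes marked RunInstaller(true).
[RunInstaller(true)]
public class ConversionServiceInstaller : Installer
{
    public ConversionServiceInstaller()
    {
        ServiceProcessInstaller processInstaller = new ServiceProcessInstaller();

        // Run under a specific user account rather than LocalSystem.
        // With Username and Password left null, the installer prompts for
        // the credentials at install time, so they are not stored in code.
        processInstaller.Account  = ServiceAccount.User;
        processInstaller.Username = null;
        processInstaller.Password = null;

        ServiceInstaller serviceInstaller = new ServiceInstaller();

        // Must match the ServiceName set in the ServiceBase-derived class.
        serviceInstaller.ServiceName = "Teztech Document Conversion";
        serviceInstaller.StartType   = ServiceStartMode.Automatic;

        Installers.Add(processInstaller);
        Installers.Add(serviceInstaller);
    }
}
```

The account can also be changed afterwards from the Services control panel, so treat this as one option rather than the only way to wire it up.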

As I mentioned earlier, the API between the application and the web service was a standard HTTP POST. Between the Web Service and the Windows Service, I originally planned to use a simple, proprietary TCP/IP protocol. As long as you aren’t trying to get too fancy coding thread pools and asynchronous I/O, at least in C/C++, it’s pretty easy to open a TCP socket and communicate between client and server with a line oriented protocol (like HTTP, POP3, SMTP, etc.). But, since both the Web Service and the Windows Service were going to run on Windows (at least for now), I decided to take a look at .Net Remoting. The .Net Remoting solution was easy to code and use so I ended up sticking with it.

Below is some code to show you how everything fits together:

The C# interface IConverter is compiled into a DLL. The DLL is referenced by the client (Web Service) and server (Windows Service).

    public class ConverterEndpoints
    {
        public static int    Port = 7047;
        public static string Uri  = "DocumentToPdf";
    }

    public interface IConverter
    {
        void Convert(string srcDocument, string destPdf);
    }

The original version of this interface passed the files as simple byte[] variables. Theoretically, the files never need to be written to disk (input and output is via HTTP). But, I found that I could easily pass small byte[] variables across .Net Remoting but large byte[] variables failed. Since the Web Service and the Windows Service run on the same machine and the file has to be written to disk for OpenOffice to do the conversion, I settled for creating temporary files and passing the names of the files between the Web Service and Windows Service.

The class ConverterEndpoints holds a couple of constants that the client and server use to set up the TCP port and service name. The code below demonstrates how ConverterEndpoints is used in the Windows Service. I’ve included code for the entire class so you can see how easy it is to program a Windows Service and access the Windows Event Log in C# – no 3rd party libraries are required.

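    using System.Diagnostics;
    using System.Runtime.Remoting;
    using System.Runtime.Remoting.Channels;
    using System.Runtime.Remoting.Channels.Tcp;
    using System.ServiceProcess;

    class WindowsService : ServiceBase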
    {
        protected EventLog _log;
        protected TcpChannel _channel;

        public WindowsService()
        {
            ServiceName = "Teztech Document Conversion";
            EventLog.Source = "Teztech Document Conversion";
            EventLog.Log = "Application";
            
            // These Flags set whether or not to handle that specific
            //  type of event. Set to true if you need it, false otherwise.
            CanHandlePowerEvent = false;
            CanHandleSessionChangeEvent = false;
            CanPauseAndContinue = true;
            CanShutdown = true;
            CanStop = true;

            if (!EventLog.SourceExists("Teztech Document Conversion"))
                EventLog.CreateEventSource("Teztech Document Conversion", "Application");

            _log = new EventLog();
            _log.Source = "Teztech Document Conversion";

            _channel = new TcpChannel(ConvertInterface.ConverterEndpoints.Port);
            ChannelServices.RegisterChannel(_channel, false);

            RemotingConfiguration.RegisterWellKnownServiceType(typeof(DocumentToPdf), ConvertInterface.ConverterEndpoints.Uri, WellKnownObjectMode.Singleton);
        }

        static void Main()
        {
            ServiceBase.Run(new WindowsService());
        }

        /// <summary>
        /// Dispose of objects that need it here.
        /// </summary>
        /// <param name="disposing">Whether or not disposing is going on.</param>
        protected override void Dispose(bool disposing)
        {
            base.Dispose(disposing);
        }

        protected override void OnStart(string[] args)
        {
            _log.WriteEntry("Service Starting");

            _channel.StartListening(null);

            base.OnStart(args);

            _log.WriteEntry("Service Running");
        }

        protected override void OnStop()
        {
            _log.WriteEntry("Service Stopping");

            _channel.StopListening(null);

            base.OnStop();
        }

        protected override void OnPause()
        {
            _channel.StopListening(null);
            base.OnPause();
        }

        protected override void OnContinue()
        {
            _channel.StartListening(null);
            base.OnContinue();
        }
    }
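The DocumentToPdf class registered above is not listed in this post. As a rough sketch of the shape such a class needs for .Net Remoting (not the actual implementation), it has to derive from MarshalByRefObject and implement IConverter. The namespace ConversionSketch, the lock, the lifetime override and the ConvertWithOpenOffice placeholder are all assumptions added for illustration; the real OpenOffice automation code would go where the placeholder throws.

```csharp
using System;
using System.IO;

namespace ConversionSketch
{
    // Repeated from the shared DLL so this sketch compiles on its own.
    public interface IConverter
    {
        void Convert(string srcDocument, string destPdf);
    }

    // Objects exposed through Remoting by reference must derive from MarshalByRefObject.
    public class DocumentToPdf : MarshalByRefObject, IConverter
    {
        // Assumption: conversions are serialized, because a Singleton object can be
        // called from several Remoting threads at the same time.
        private static readonly object _conversionLock = new object();

        public void Convert(string srcDocument, string destPdf)
        {
            if (!File.Exists(srcDocument))
                throw new FileNotFoundException("Source document not found.", srcDocument);

            lock (_conversionLock)
            {
                ConvertWithOpenOffice(srcDocument, destPdf);
            }

            if (!File.Exists(destPdf))
                throw new IOException("Conversion did not produce " + destPdf);
        }

        // Returning null gives the object an unlimited lease, so the singleton
        // is not discarded after a period without calls.
        public override object InitializeLifetimeService()
        {
            return null;
        }

        // Hypothetical placeholder for the OpenOffice automation code.
        protected virtual void ConvertWithOpenOffice(string srcDocument, string destPdf)
        {
            throw new NotImplementedException("OpenOffice automation code goes here.");
        }
    }
}
```

Exceptions thrown inside Convert travel back across the Remoting channel to the caller, so the Web Service can report them in its error response.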

Below is the code behind file for the Web Service client. As you can see, .Net makes this type of Web Service easy to create.

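using System;
using System.CodeDom.Compiler;
using System.Collections.Generic;
using System.IO;
using System.Runtime.Remoting.Channels;
using System.Runtime.Remoting.Channels.Tcp;
using System.Web;

public partial class Pdf : System.Web.UI.Page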
{
    protected TcpChannel _channel;

    protected void Page_Load(object sender, EventArgs e)
    {
        try
        {
            if (Request["Username"] == null || Request["Password"] == null)
                throw new ApplicationException("Username and Password are required");

            HttpPostedFile postedFile = Request.Files["InputFile"];
            if (postedFile == null || postedFile.FileName == null || postedFile.FileName == "")
                throw new ApplicationException("InputFile was not supplied or file name is empty.");

            string inputFileExtension = Path.GetExtension(postedFile.FileName);
            if (inputFileExtension.Length < 2)
                throw new ApplicationException("InputFile does not have a valid file name extension.");
            inputFileExtension = inputFileExtension.Substring(1);

            string outputFileName = Request["OutputFileName"];
            if (outputFileName == null || outputFileName == "")
                outputFileName = Path.GetFileNameWithoutExtension(postedFile.FileName) + ".pdf";

            DbRequest.CheckAuthorization(Request["Username"], Request["Password"]);

            if (_channel == null)
            {
                Dictionary<string, string> channelProperties = new Dictionary<string,string>();
                channelProperties["name"] = "";
                _channel = new TcpChannel(channelProperties, null, null);
                ChannelServices.RegisterChannel(_channel, false);
            }

            string url = string.Format("tcp://localhost:{0}/{1}", ConvertInterface.ConverterEndpoints.Port, ConvertInterface.ConverterEndpoints.Uri);

            ConvertInterface.IConverter converter = (ConvertInterface.IConverter)Activator.GetObject(typeof(ConvertInterface.IConverter), url);

            string path = MapPath(".");
            path = Path.Combine(Path.GetDirectoryName(path), "Temp");

            using (TempFileCollection tempFiles = new TempFileCollection(path))
            {
                string inputFile  = tempFiles.AddExtension(inputFileExtension);
                string outputFile = tempFiles.AddExtension("pdf");

                postedFile.SaveAs(inputFile);
                converter.Convert(inputFile, outputFile);

                Response.Clear();
                Response.ContentType = "application/pdf";
                // Use the requested (or derived) output file name rather than a fixed one.
                Response.AddHeader("content-disposition", string.Format("inline; filename=\"{0}\"", outputFileName));

                byte[] pdfBytes = File.ReadAllBytes(outputFile);

                Response.OutputStream.Write(pdfBytes, 0, pdfBytes.Length);
            }
        }        
        catch (Exception ex)
        {
            Response.Clear();
            // HTML-encode the message so text from the exception cannot inject markup.
            Response.Write(string.Format("<HTML><HEAD></HEAD><BODY><H1>Conversion Error</H1><P>{0}</P></BODY></HTML>", Server.HtmlEncode(ex.Message)));
            Response.StatusCode = 500;
            Response.StatusDescription = ex.Message;
        }
    }
}

For completeness, I have included the code for the actual Web Service’s .aspx file below. As you can see, it’s just standard .aspx boilerplate – all HTTP output is generated in the code behind file’s Page_Load event (which will be invoked when the Web Service client posts to the page). This is quick and dirty, but it works.

<%@ Page Language="C#" AutoEventWireup="true" CodeFile="Pdf.aspx.cs" Inherits="Pdf" EnableViewState="false" %>

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">

<html xmlns="http://www.w3.org/1999/xhtml" >
<head runat="server">
    <title>Untitled Page</title>
</head>
<body>
    <form id="form1" runat="server">
    <div>
    
    </div>
    </form>
</body>
</html>

And here is the code for a little PHP application I used to test my conversion Web Service:

<?php $ErrorMessage = ''; ?>

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<title>
File Conversion Server Demo
</title>
</head>
<body>

<h1>File Conversion Server Demo</h1>

<form name="Demo" method="POST" enctype="multipart/form-data" action="/demo.pdf">

    <?php if($ErrorMessage): ?>
    <p><font size="5" color="#008000"><?php echo $ErrorMessage; ?></font></p>
    <?php endif; ?>
    
    <table>
        <tr>
            <td><b>Username:</b></td>
            <td><input name="Username" value=""></input></td>
        </tr>
        <tr>
            <td><b>Password:</b></td>
            <td><input name="Password" value=""></input></td>
        </tr>
        <tr>
            <td><b>Input File:</b></td>
            <td><input type="file" name="InputFile"></input> </td>
        </tr>
    </table>
    
    <input type="submit" value="Submit"></input> 
    
</form>

</body>
</html>

If you look at the PHP form’s action attribute, you’ll see I used /demo.pdf as the action. Using the very nice IIS7 URL Rewrite Module, I map all *.pdf URLs to my conversion Web Service. OK, this might be a little overkill, but this way the URLs look nice and all end in .pdf (which is appropriate).

The final piece of the puzzle is a small wrapper class I created to access the conversion Web Service from within my C# application:

    public class ConvertClient
    {
        public ConvertClient(string convertHost, string username, string password)
        {
            _ConvertHost = convertHost;
            _Username    = username;
            _Password    = password;
        }

        protected string _ConvertHost;
        public string ConvertHost { get { return _ConvertHost; } set { _ConvertHost = value; } }

        protected string _Username;
        public string Username { get { return _Username; } set { _Username = value; } }

        protected string _Password;
        public string Password { get { return _Password; } set { _Password = value; } }

        protected string _Response;
        public string Response { get { return _Response; } set { _Response = value; } }

        protected int _MaxConvertAttempts = 2;
        public int MaxConvertAttempts { get { return _MaxConvertAttempts; } set { _MaxConvertAttempts = value; } }

        protected string _ResponseCode;
        public string ResponseCode { get { return _ResponseCode; } set { _ResponseCode = value; } }

        protected static string _szBoundary    = "SEPARATORSTRINGTEZTECHDOTCOM1";
        protected static string _szBoundary2   = "\r\n--SEPARATORSTRINGTEZTECHDOTCOM1\r\n";
        protected static string _szBoundary3   = "\r\n--SEPARATORSTRINGTEZTECHDOTCOM1--";
        protected static string _szFileSizeHdr = "Content-Disposition: form-data; name=\"MAX_FILE_SIZE\"\r\n\r\n";
        protected static string _szFileHdrFmt  = "Content-Disposition: form-data; name=\"InputFile\"; filename=\"{0}\"\r\nContent-Type: application/octet-stream\r\n\r\n";

        public byte[] ConvertToPdf(byte[] inputFileBytes, string inputFileName, string outputFileName)
        {
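            // Escape the values so characters such as '&', '#' or spaces cannot break the URL.
            string url = string.Format("http://{0}/{1}?Username={2}&Password={3}", ConvertHost,
                Uri.EscapeDataString(outputFileName), Uri.EscapeDataString(Username), Uri.EscapeDataString(Password));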

            // Calculate upload data size

            string szFileSizeData = string.Format("{0}", inputFileBytes.Length + 50000);
            string szFileHdr      = string.Format(_szFileHdrFmt, inputFileName);

            ASCIIEncoding ascii = new ASCIIEncoding(); // At this time, file names must be ascii
            List<byte> header = new List<byte>(); 
            List<byte> footer = new List<byte>(); 

            header.AddRange(ascii.GetBytes(_szBoundary2));      // MAX_FILE_SIZE field
            header.AddRange(ascii.GetBytes(_szFileSizeHdr));
            header.AddRange(ascii.GetBytes(szFileSizeData));
            header.AddRange(ascii.GetBytes(_szBoundary2));      // InputFile field
            header.AddRange(ascii.GetBytes(szFileHdr));

            footer.AddRange(ascii.GetBytes(_szBoundary3));
            
            int cbContent = header.Count + inputFileBytes.Length + footer.Count;

            int attempts = 0;

            while (true)
            {
                try
                {
                    attempts++;

                    HttpWebRequest webRequest = (HttpWebRequest)WebRequest.Create(url);
                    webRequest.Method        = "POST";
                    webRequest.ContentType   = string.Format("multipart/form-data; boundary={0}", _szBoundary);
                    webRequest.ContentLength = cbContent;

                    using (Stream request = webRequest.GetRequestStream())
                    {
                        request.Write(header.ToArray(), 0, header.Count);
                        request.Write(inputFileBytes, 0, inputFileBytes.Length);
                        request.Write(footer.ToArray(), 0, footer.Count);
                    }

                    using (HttpWebResponse webResponse = (HttpWebResponse)webRequest.GetResponse())
                    {
                        using (BinaryReader reader = new BinaryReader(webResponse.GetResponseStream()))
                        {
                            byte[] pdfBytes = reader.ReadBytes((int)webResponse.ContentLength);
                            return pdfBytes;
                        }
                    }
                }
                catch
                {
                    if (attempts >= MaxConvertAttempts)
                        throw;
                }
            }
        }
    }

I didn’t see an easy way to build a MIME document in .Net, so I’m building up the MIME document (used in the HTTP Post) from scratch.
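The same hand-built approach can also be arranged as a small helper that assembles the whole body in a MemoryStream. This is only a sketch of the technique, not the code used in ConvertClient: the class MultipartSketch, the method BuildBody and the boundary value are hypothetical names, and it handles a single file field with an ASCII file name.

```csharp
using System;
using System.IO;
using System.Text;

public static class MultipartSketch
{
    // Builds a multipart/form-data body containing one file field.
    // The caller sends it with Content-Type "multipart/form-data; boundary=<boundary>".
    public static byte[] BuildBody(string boundary, string fieldName, string fileName, byte[] fileBytes)
    {
        // Each part starts with "--" + boundary, followed by its headers and a blank line.
        string header =
            "--" + boundary + "\r\n" +
            "Content-Disposition: form-data; name=\"" + fieldName + "\"; filename=\"" + fileName + "\"\r\n" +
            "Content-Type: application/octet-stream\r\n\r\n";

        // The body ends with the boundary followed by a trailing "--".
        string footer = "\r\n--" + boundary + "--\r\n";

        byte[] headerBytes = Encoding.ASCII.GetBytes(header);
        byte[] footerBytes = Encoding.ASCII.GetBytes(footer);

        using (MemoryStream body = new MemoryStream())
        {
            body.Write(headerBytes, 0, headerBytes.Length);
            body.Write(fileBytes, 0, fileBytes.Length);
            body.Write(footerBytes, 0, footerBytes.Length);
            return body.ToArray();
        }
    }

    public static void Main()
    {
        byte[] fileBytes = new byte[] { 1, 2, 3 };
        byte[] body = BuildBody("EXAMPLEBOUNDARY", "InputFile", "sample.doc", fileBytes);
        Console.WriteLine("Body is {0} bytes long.", body.Length);
    }
}
```

Building the body up front makes ContentLength simply the array length, at the cost of holding a second copy of the file in memory.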

To debug the Windows Service, as shown in the code above, I added code to log to the Windows Event Viewer. In this case, I started with working document conversion code, so I didn’t end up needing to run the service under a debugger. But, one of the nice things about services running as a dedicated process (unlike Apache modules, ISAPI modules and Control Panel applets) is that, with just a simple change (adding a Main function), you can run the service as a regular command line application and attach the Visual Studio debugger to the running process.
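One common way to make that change is to have Main check Environment.UserInteractive and call the service’s own start and stop methods when the program is launched from a command prompt. The sketch below shows the pattern with a hypothetical DebuggableService class; it is an illustration of the idea, not the Main function from my service.

```csharp
using System;
using System.ServiceProcess;

// Hypothetical service used only to show the console/service switch.
class DebuggableService : ServiceBase
{
    protected override void OnStart(string[] args)
    {
        // Real start-up work (for example, opening the Remoting channel) goes here.
    }

    protected override void OnStop()
    {
        // Real shutdown work goes here.
    }

    static void Main(string[] args)
    {
        DebuggableService service = new DebuggableService();

        if (Environment.UserInteractive)
        {
            // Started from a command prompt or from Visual Studio:
            // run the same start/stop code without the Service Control Manager.
            service.OnStart(args);
            Console.WriteLine("Running as a console application. Press Enter to stop.");
            Console.ReadLine();
            service.OnStop();
        }
        else
        {
            // Started by the Service Control Manager.
            ServiceBase.Run(service);
        }
    }
}
```

When run this way the process uses the account of whoever launched it, so any behavior that depends on the dedicated conversion user still has to be checked with the installed service.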

To troubleshoot the Web Service, my first step was to disable “friendly errors”. With IIS7, you have to do it on both the web browser client and in IIS7. Debugging is just a matter of attaching the Visual Studio debugger to the IIS7 worker process (as usual for any web application).

All our production Windows servers use the 64 bit edition of Windows Server 2008. Right now, I happen to do most of my development on a 32 bit edition of Windows Vista. Both platforms run IIS7 and usually, the 32 vs. 64 bit difference doesn’t cause any problems. However, the Windows version of OpenOffice only comes in a 32 bit version. This version runs fine on 64 bit Windows, but when programming OpenOffice via .Net, you need a 32 bit application, so you have to set the Platform target in the project’s Build properties to x86.
