Convert Word doc and docx format to PDF in .NET Core without Microsoft.Office.Interop

Problem description:

I need to display Word .doc and .docx files in a browser. There is no real client-side way to do this, and for legal reasons these documents can't be shared with Google Docs or Microsoft Office 365.

Browsers can't display Word documents, but they can display PDF, so I want to convert these documents to PDF on the server and then display that.

I know this can be done using Microsoft.Office.Interop.Word, but my application is .NET Core and does not have access to Office interop. It could be running on Azure, but it could also be running in a Docker container on anything else.

There appear to be lots of similar questions, but most ask about full-framework .NET or assume that the server runs Windows, so none of the answers are of any use to me.

How do I convert .doc and .docx files to .pdf without access to Microsoft.Office.Interop.Word?

This was such a PITA, no wonder all the 3rd party solutions are charging $500 per developer.

The good news is that the Open XML SDK recently added support for .NET Standard, so it looks like you're in luck with the .docx format.

The bad news is that, at the moment, there isn't a lot of choice for PDF generation libraries on .NET Core. Since it doesn't look like you want to pay for one, and you can't legally use a 3rd party service, we have little choice except to roll our own.

The main problem is getting the Word document content transformed to PDF. One popular way is to read the Docx into HTML and export that to PDF. It was hard to find, but there is a .NET Core version of the OpenXMLSDK-PowerTools that supports transforming Docx to HTML. The pull request is "about to be accepted"; you can get it from here:

https://github.com/OfficeDev/Open-Xml-PowerTools/tree/abfbaac510d0d60e2f492503c60ef897247716cf

Now that we can extract the document content to HTML, we need to convert it to PDF. There are a few libraries that convert HTML to PDF; for example, DinkToPdf is a cross-platform wrapper around the WebKit HTML to PDF library libwkhtmltox.

I thought DinkToPdf was better than https://code.msdn.microsoft.com/How-to-export-HTML-to-PDF-c5afd0ce

Docx to HTML

Let's put this all together. Download the OpenXMLSDK-PowerTools .NET Core project and build it (just the OpenXMLPowerTools.Core and the OpenXMLPowerTools.Core.Example projects; ignore the other project). Set OpenXMLPowerTools.Core.Example as the StartUp project. Run the console project:

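using System;
using System.IO;
using System.IO.Packaging;
using System.Xml.Linq;
using DocumentFormat.OpenXml.Packaging;
using OpenXmlPowerTools;

static void Main(string[] args)
{
    // Open the .docx package and convert its content to an XHTML element tree
    using (var source = Package.Open(@"test.docx"))
    using (var document = WordprocessingDocument.Open(source))
    {
        HtmlConverterSettings settings = new HtmlConverterSettings();
        XElement html = HtmlConverter.ConvertToHtml(document, settings);

        Console.WriteLine(html.ToString());
        // Save the HTML so it can be passed to the HTML to PDF step later
        File.WriteAllText("test.html", html.ToString());
    }
    Console.ReadLine();
}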

Make sure that test.docx is a valid Word document with some text in it, otherwise you might get an error:

The specified package is invalid. The main part is missing.
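One cheap guard against this error is to check up front that the file is a ZIP package containing a main document part. The sketch below uses only System.IO.Compression. The helper name LooksLikeDocx is hypothetical, and it assumes the main part sits at the conventional path word/document.xml, which is where Word puts it but which the format does not strictly require.

```csharp
using System.IO;
using System.IO.Compression;

public static class DocxCheck
{
    // Hypothetical helper: returns true when the stream is a ZIP archive that
    // contains word/document.xml (assumed conventional main part location).
    public static bool LooksLikeDocx(Stream stream)
    {
        try
        {
            // leaveOpen: true so the caller can rewind and reuse the stream
            using (var zip = new ZipArchive(stream, ZipArchiveMode.Read, leaveOpen: true))
            {
                return zip.GetEntry("word/document.xml") != null;
            }
        }
        catch (InvalidDataException)
        {
            // Not a ZIP archive at all, for example an empty file or a binary .doc
            return false;
        }
    }
}
```

If the check fails you can show the user a clear message instead of letting the converter throw. Remember to reset the stream position to 0 before handing the same stream to WordprocessingDocument.Open.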

If you run the project you will see that the HTML looks almost exactly like the content in the Word document:

However, if you try a Word document with pictures or links, you will notice they are missing or broken.

This CodeProject article addresses these issues: https://www.codeproject.com/Articles/1162184/Csharp-Docx-to-HTML-to-Docx

I had to change the static Uri FixUri(string brokenUri) method to return a Uri, and I added user-friendly error messages.

static void Main(string[] args)
{
    var fileInfo = new FileInfo(@"c:\temp\MyDocWithImages.docx");
    string fullFilePath = fileInfo.FullName;
    string htmlText = string.Empty;
    try
    {
        htmlText = ParseDOCX(fileInfo);
    }
    catch (OpenXmlPackageException e)
    {
        if (e.ToString().Contains("Invalid Hyperlink"))
        {
            using (FileStream fs = new FileStream(fullFilePath,FileMode.OpenOrCreate, FileAccess.ReadWrite))
            {
                UriFixer.FixInvalidUri(fs, brokenUri => FixUri(brokenUri));
            }
            htmlText = ParseDOCX(fileInfo);
        }
    }

    var writer = File.CreateText("test1.html");
    writer.WriteLine(htmlText.ToString());
    writer.Dispose();
}

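public static Uri FixUri(string brokenUri)
{
    // new Uri(" ") and new Uri("user@example.com") both throw UriFormatException,
    // so fall back to a placeholder absolute URI when the link cannot be repaired.
    var placeholder = new Uri("http://broken-link/");
    if (brokenUri.Contains("mailto:"))
    {
        // Keep only the address after "mailto:" and rebuild a well-formed mailto URI
        int start = brokenUri.IndexOf("mailto:") + "mailto:".Length;
        string address = brokenUri.Substring(start).Trim();
        return Uri.TryCreate("mailto:" + address, UriKind.Absolute, out Uri fixedUri)
            ? fixedUri
            : placeholder;
    }
    return placeholder;
}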

public static string ParseDOCX(FileInfo fileInfo)
{
    try
    {
        byte[] byteArray = File.ReadAllBytes(fileInfo.FullName);
        using (MemoryStream memoryStream = new MemoryStream())
        {
            memoryStream.Write(byteArray, 0, byteArray.Length);
            using (WordprocessingDocument wDoc =
                                        WordprocessingDocument.Open(memoryStream, true))
            {
                int imageCounter = 0;
                var pageTitle = fileInfo.FullName;
                var part = wDoc.CoreFilePropertiesPart;
                if (part != null)
                    pageTitle = (string)part.GetXDocument()
                                            .Descendants(DC.title)
                                            .FirstOrDefault() ?? fileInfo.FullName;

                WmlToHtmlConverterSettings settings = new WmlToHtmlConverterSettings()
                {
                    AdditionalCss = "body { margin: 1cm auto; max-width: 20cm; padding: 0; }",
                    PageTitle = pageTitle,
                    FabricateCssClasses = true,
                    CssClassPrefix = "pt-",
                    RestrictToSupportedLanguages = false,
                    RestrictToSupportedNumberingFormats = false,
                    ImageHandler = imageInfo =>
                    {
                        ++imageCounter;
                        string extension = imageInfo.ContentType.Split('/')[1].ToLower();
                        ImageFormat imageFormat = null;
                        if (extension == "png") imageFormat = ImageFormat.Png;
                        else if (extension == "gif") imageFormat = ImageFormat.Gif;
                        else if (extension == "bmp") imageFormat = ImageFormat.Bmp;
                        else if (extension == "jpeg") imageFormat = ImageFormat.Jpeg;
                        else if (extension == "tiff")
                        {
                            extension = "gif";
                            imageFormat = ImageFormat.Gif;
                        }
                        else if (extension == "x-wmf")
                        {
                            extension = "wmf";
                            imageFormat = ImageFormat.Wmf;
                        }

                        if (imageFormat == null) return null;

                        string base64 = null;
                        try
                        {
                            using (MemoryStream ms = new MemoryStream())
                            {
                                imageInfo.Bitmap.Save(ms, imageFormat);
                                var ba = ms.ToArray();
                                base64 = System.Convert.ToBase64String(ba);
                            }
                        }
                        catch (System.Runtime.InteropServices.ExternalException)
                        { return null; }

                        ImageFormat format = imageInfo.Bitmap.RawFormat;
                        ImageCodecInfo codec = ImageCodecInfo.GetImageDecoders()
                                                    .First(c => c.FormatID == format.Guid);
                        string mimeType = codec.MimeType;

                        string imageSource =
                                string.Format("data:{0};base64,{1}", mimeType, base64);

                        XElement img = new XElement(Xhtml.img,
                                new XAttribute(NoNamespace.src, imageSource),
                                imageInfo.ImgStyleAttribute,
                                imageInfo.AltText != null ?
                                    new XAttribute(NoNamespace.alt, imageInfo.AltText) : null);
                        return img;
                    }
                };

                XElement htmlElement = WmlToHtmlConverter.ConvertToHtml(wDoc, settings);
                var html = new XDocument(new XDocumentType("html", null, null, null),
                                                                            htmlElement);
                var htmlString = html.ToString(SaveOptions.DisableFormatting);
                return htmlString;
            }
        }
    }
    catch
    {
        return "The file is either open, please close it or contains corrupt data";
    }
}

You may need the System.Drawing.Common NuGet package to use ImageFormat.

Now we can get images:

If you only want to show Word .docx files in a web browser, it's better not to convert the HTML to PDF, as that will significantly increase bandwidth. You could store the HTML in a file system, in the cloud, or in a database using a VPP technology.


HTML to PDF

The next thing we need to do is pass the HTML to DinkToPdf. Download the DinkToPdf (90 MB) solution and build it. It will take a while for all the packages to be restored and for the solution to compile.

Important:

The DinkToPdf library requires the libwkhtmltox.so and libwkhtmltox.dll files in the root of your project if you want to run on Linux and Windows. There's also a libwkhtmltox.dylib file for Mac if you need it.

These native libraries are in the v0.12.4 folder. Depending on whether your PC is 32-bit or 64-bit, copy the 3 files from the matching subfolder to the DinkToPdf-master\DinkToPfd.TestConsoleApp\bin\Debug\netcoreapp1.1 folder.

Important 2:

Make sure that you have libgdiplus installed in your Docker image or on your Linux machine. The libwkhtmltox.so library depends on it.

Set DinkToPfd.TestConsoleApp as the StartUp project and change the Program.cs file to read the htmlContent from the HTML file saved with Open-Xml-PowerTools instead of the Lorem Ipsum text.

var doc = new HtmlToPdfDocument()
{
    GlobalSettings = {
        ColorMode = ColorMode.Color,
        Orientation = Orientation.Landscape,
        PaperSize = PaperKind.A4,
    },
    Objects = {
        new ObjectSettings() {
            PagesCount = true,
            HtmlContent = File.ReadAllText(@"C:\TFS\Sandbox\Open-Xml-PowerTools-abfbaac510d0d60e2f492503c60ef897247716cf\ToolsTest\test1.html"),
            WebSettings = { DefaultEncoding = "utf-8" },
            HeaderSettings = { FontSize = 9, Right = "Page [page] of [toPage]", Line = true },
            FooterSettings = { FontSize = 9, Right = "Page [page] of [toPage]" }
        }
    }
};
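The HtmlToPdfDocument above only describes the job; a converter still has to run it. Here is a minimal sketch, assuming the DinkToPdf NuGet package is referenced and the native libwkhtmltox library is next to the executable. The file names test1.html and test1.pdf are placeholders, not paths from the original project.

```csharp
using System.IO;
using DinkToPdf;

public static class HtmlToPdfExample
{
    public static void Main()
    {
        // BasicConverter is meant for single-threaded use; DinkToPdf also ships
        // SynchronizedConverter for multi-threaded apps such as web servers.
        var converter = new BasicConverter(new PdfTools());

        var doc = new HtmlToPdfDocument()
        {
            GlobalSettings = {
                ColorMode = ColorMode.Color,
                Orientation = Orientation.Landscape,
                PaperSize = PaperKind.A4,
            },
            Objects = {
                new ObjectSettings() {
                    // Placeholder path: the HTML produced by the Docx to HTML step
                    HtmlContent = File.ReadAllText("test1.html"),
                    WebSettings = { DefaultEncoding = "utf-8" },
                }
            }
        };

        // With GlobalSettings.Out left empty, the PDF is returned as a byte array
        byte[] pdf = converter.Convert(doc);
        File.WriteAllBytes("test1.pdf", pdf);
    }
}
```

Returning the bytes rather than writing through GlobalSettings.Out is convenient in a web application, where you would typically stream the array back to the browser with an application/pdf content type.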

The result of the Docx vs the PDF is quite impressive, and I doubt many people would pick out many differences (especially if they never see the original):

P.S. I realise you wanted to convert both .doc and .docx to PDF. I'd suggest making a service yourself to convert .doc to .docx using a specific non-server Windows/Microsoft technology. The .doc format is binary and is not intended for server-side automation of Office.
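If you do end up with a separate service for .doc, something has to decide which uploads go where. A rough sketch of that routing step, assuming you only need a first-pass check on the file header: legacy binary .doc files are OLE compound files and .docx files are ZIP packages, so their leading bytes differ. The WordFormatSniffer name is hypothetical, and the check is deliberately loose, since other OLE files (such as .xls) and any other ZIP archive will match as well.

```csharp
public enum WordFormat { Unknown, Doc, Docx }

public static class WordFormatSniffer
{
    // Header of an OLE compound file, the container used by binary .doc files
    private static readonly byte[] OleMagic =
        { 0xD0, 0xCF, 0x11, 0xE0, 0xA1, 0xB1, 0x1A, 0xE1 };

    // ZIP local file header, the container used by .docx packages
    private static readonly byte[] ZipMagic = { 0x50, 0x4B, 0x03, 0x04 };

    // Classifies the content by its leading bytes only; it does not validate the file
    public static WordFormat Detect(byte[] content)
    {
        if (StartsWith(content, OleMagic)) return WordFormat.Doc;
        if (StartsWith(content, ZipMagic)) return WordFormat.Docx;
        return WordFormat.Unknown;
    }

    private static bool StartsWith(byte[] content, byte[] prefix)
    {
        if (content == null || content.Length < prefix.Length) return false;
        for (int i = 0; i < prefix.Length; i++)
        {
            if (content[i] != prefix[i]) return false;
        }
        return true;
    }
}
```

Files detected as Docx can go straight into the Docx to HTML step above, while Doc files would be handed to whatever .doc to .docx conversion service you build.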