如何使用iText将带图像和超链接的HTML转换为PDF?

问题描述:

我正在尝试使用 ASP中的iTextSharp将 HTML 转换为 PDF 。 NET 同时使用 MVC 的网络应用程序网络表单< img> < a> 元素包含绝对和相对网址,一些< img> 元素是 BASE64 强> 。 SO和Google搜索结果的典型答案使用泛型 HTML PDF 代码 XMLWorkerHelper 看起来像这样:

I'm trying to convert HTML to PDF using iTextSharp in an ASP.NET web application that uses both MVC, and web forms. The <img> and <a> elements have absolute and relative URLs, and some of the <img> elements are base64. Typical answers here at SO and Google search results use generic HTML to PDF code with XMLWorkerHelper that looks something like this:

using (var stringReader = new StringReader(xHtml))
{
    using (Document document = new Document())
    {
        PdfWriter writer = PdfWriter.GetInstance(document, stream);
        document.Open();
        XMLWorkerHelper.GetInstance().ParseXHtml(
            writer, document, stringReader
        );
    }
}

所以使用示例 HTML 像这样:

<div>
    <h3>HTML Works, but Broken in Converted PDF</h3>
    <div>Relative local <img>: <img src='./../content/images/kuujinbo_320-30.gif' /></div>
    <div>
        Base64 <img>:
        <img src='data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAAAUAAAAFCAYAAACNbyblAAAAHElEQVQI12P4//8/w38GIAXDIBKE0DHxgljNBAAO9TXL0Y4OHwAAAABJRU5ErkJggg==' />
    </div>
    <div><a href='/somePage.html'>Relative local hyperlink, broken in PDF</a></div>
<div>

生成的PDF:(1)缺少所有图片,(2)所有包含相对网址的超链接都已损坏并使用文件URI方案 file /// XXX ... )而不是指向正确的网站。

The resulting PDF: (1) is missing all images, and (2) all hyperlink(s) with relative URLs are broken and use a file URI scheme (file///XXX...) instead of pointing to the correct web site.

SO的一些答案以及谷歌搜索的其他答案建议用绝对URL替换相对URL,对于一次性案例,完全可以接受。但是,使用硬编码字符串全局替换所有< img src> < a href> 属性对于这个问题,不可接受,所以请不要发布这样的答案,因为它会被贬低。

Some answers here at SO and others from Google search recommend replacing relative URLs with absolute URLs, which is perfectly acceptable for one-off cases. However, globally replacing all <img src> and <a href> attributes with a hard-coded string is unacceptable for this question, so please do not post an answer like that, because it will accordingly be downvoted.

我正在寻找一种适用于许多不同的Web应用程序的解决方案,这些应用程序位于测试,开发和生产环境中。

Am looking for a solution that works for many different web applications residing in test, development, and production environments.

开箱即用 XMLWorker 只能理解绝对URI ,因此所描述的问题是预期行为。解析器无法自动推导 URI方案或路径,而无需其他信息。

Out of the box XMLWorker only understands absolute URIs, so the described issues are expected behavior. The parser can't automagically deduce URI schemes or paths without some additional information.

实施 ILinkProvider 修复了断开的超链接问题,并实现了 IImageProvider 修复了损坏的图像问题。由于两种实现都必须执行 URI解析,这是第一步。下面的帮助器类可以做到这一点,并且还尝试使web( ASP.NET )上下文调用(示例跟随)尽可能简单:

Implementing an ILinkProvider fixes the broken hyperlink problem, and implementing an IImageProvider fixes the broken image problem. Since both implementations must perform URI resolution, that's the first step. The following helper class does that, and also tries to make web (ASP.NET) context calls (examples follow) as simple as possible:

// resolve URIs for LinkProvider & ImageProvider
public class UriHelper
{
    /* IsLocal; when running in web context:
     * [1] give LinkProvider http[s] scheme; see CreateBase(string baseUri)
     * [2] give ImageProvider relative path starting with '/' - see:
     *     Join(string relativeUri)
     */
    public bool IsLocal { get; set; }
    public HttpContext HttpContext { get; private set; }
    public Uri BaseUri { get; private set; }

    public UriHelper(string baseUri) : this(baseUri, true) {}
    public UriHelper(string baseUri, bool isLocal)
    {
        IsLocal = isLocal;
        HttpContext = HttpContext.Current;
        BaseUri = CreateBase(baseUri);
    }

    /* get URI for IImageProvider to instantiate iTextSharp.text.Image for 
     * each <img> element in the HTML.
     */
    public string Combine(string relativeUri)
    {
        /* when running in a web context, the HTML is coming from a MVC view 
         * or web form, so convert the incoming URI to a **local** path
         */
        if (HttpContext != null && !BaseUri.IsAbsoluteUri && IsLocal)
        {
            return HttpContext.Server.MapPath(
                // Combine() checks directory traversal exploits
                VirtualPathUtility.Combine(BaseUri.ToString(), relativeUri)
            );
        }
        return BaseUri.Scheme == Uri.UriSchemeFile 
            ? Path.Combine(BaseUri.LocalPath, relativeUri)
            // for this example we're assuming URI.Scheme is http[s]
            : new Uri(BaseUri, relativeUri).AbsoluteUri;
    }

    private Uri CreateBase(string baseUri)
    {
        if (HttpContext != null)
        {   // running on a web server; need to update original value  
            var req = HttpContext.Request;
            baseUri = IsLocal
                // IImageProvider; absolute virtual path (starts with '/')
                // used to convert to local file system path. see:
                // Combine(string relativeUri)
                ? req.ApplicationPath
                // ILinkProvider; absolute http[s] URI scheme
                : req.Url.GetLeftPart(UriPartial.Authority)
                    + HttpContext.Request.ApplicationPath;
        }

        Uri uri;
        if (Uri.TryCreate(baseUri, UriKind.RelativeOrAbsolute, out uri)) return uri;

        throw new InvalidOperationException("cannot create a valid BaseUri");
    }
}

实施 ILinkProvider 非常简单,因为 UriHelper 给出了基本URI。我们只需要正确的URI方案(文件 http [s] ):

Implementing ILinkProvider is pretty simple now that UriHelper gives the base URI. We just need the correct URI scheme (file or http[s]):

// make hyperlinks with relative URLs absolute
public class LinkProvider : ILinkProvider
{
    // rfc1738 - file URI scheme section 3.10
    public const char SEPARATOR = '/';
    public string BaseUrl { get; private set; }

    public LinkProvider(UriHelper uriHelper)
    {
        var uri = uriHelper.BaseUri;
        /* simplified implementation that only takes into account:
         * Uri.UriSchemeFile || Uri.UriSchemeHttp || Uri.UriSchemeHttps
         */
        BaseUrl = uri.Scheme == Uri.UriSchemeFile
            // need trailing separator or file paths break
            ? uri.AbsoluteUri.TrimEnd(SEPARATOR) + SEPARATOR
            // assumes Uri.UriSchemeHttp || Uri.UriSchemeHttps
            : BaseUrl = uri.AbsoluteUri;
    }

    public string GetLinkRoot()
    {
        return BaseUrl;
    }
}

IImageProvider 只有要求实现单个方法, Retrieve(string src),但 Store(string src,Image img )很简单 - 注意那里的内联注释和 GetImageRootPath()

IImageProvider only requires implementing a single method, Retrieve(string src), but Store(string src, Image img) is easy - note inline comments there and for GetImageRootPath():

// handle <img> elements in HTML  
public class ImageProvider : IImageProvider
{
    private UriHelper _uriHelper;
    // see Store(string src, Image img)
    private Dictionary<string, Image> _imageCache = 
        new Dictionary<string, Image>();

    public virtual float ScalePercent { get; set; }
    public virtual Regex Base64 { get; set; }

    public ImageProvider(UriHelper uriHelper) : this(uriHelper, 67f) { }
    //              hard-coded based on general past experience ^^^
    // but call the overload to supply your own
    public ImageProvider(UriHelper uriHelper, float scalePercent)
    {
        _uriHelper = uriHelper;
        ScalePercent = scalePercent;
        Base64 = new Regex( // rfc2045, section 6.8 (alphabet/padding)
            @"^data:image/[^;]+;base64,(?<data>[a-z0-9+/]+={0,2})$",
            RegexOptions.Compiled | RegexOptions.IgnoreCase
        );
    }

    public virtual Image ScaleImage(Image img)
    {
        img.ScalePercent(ScalePercent);
        return img;
    }

    public virtual Image Retrieve(string src)
    {
        if (_imageCache.ContainsKey(src)) return _imageCache[src];

        try
        {
            if (Regex.IsMatch(src, "^https?://", RegexOptions.IgnoreCase))
            {
                return ScaleImage(Image.GetInstance(src));
            }

            Match match;
            if ((match = Base64.Match(src)).Length > 0)
            {
                return ScaleImage(Image.GetInstance(
                    Convert.FromBase64String(match.Groups["data"].Value)
                ));
            }

            var imgPath = _uriHelper.Combine(src);
            return ScaleImage(Image.GetInstance(imgPath));
        }
        // not implemented to keep the SO answer (relatively) short
        catch (BadElementException ex) { return null; }
        catch (IOException ex) { return null; }
        catch (Exception ex) { return null; }
    }

    /*
     * always called after Retrieve(string src):
     * [1] cache any duplicate <img> in the HTML source so the image bytes
     *     are only written to the PDF **once**, which reduces the 
     *     resulting file size.
     * [2] the cache can also **potentially** save network IO if you're
     *     running the parser in a loop, since Image.GetInstance() creates
     *     a WebRequest when an image resides on a remote server. couldn't
     *     find a CachePolicy in the source code
     */
    public virtual void Store(string src, Image img)
    {
        if (!_imageCache.ContainsKey(src)) _imageCache.Add(src, img);
    }

    /* XMLWorker documentation for ImageProvider recommends implementing
     * GetImageRootPath():
     * 
     * http://demo.itextsupport.com/xmlworker/itextdoc/flatsite.html#itextdoc-menu-10
     * 
     * but a quick run through the debugger never hits the breakpoint, so 
     * not sure if I'm missing something, or something has changed internally 
     * with XMLWorker....
     */
    public virtual string GetImageRootPath() { return null; }
    public virtual void Reset() { }
}

基于 XML Worker文档可以直接挂钩 ILinkProvider IImageProvider 上面的一个简单的解析器类:

Based on the XML Worker documentation it's pretty straightforward to hook the implementations of ILinkProvider and IImageProvider above into a simple parser class:

/* a simple parser that uses XMLWorker and XMLParser to handle converting 
 * (most) images and hyperlinks internally
 */
public class SimpleParser
{
    public virtual ILinkProvider LinkProvider { get; set; }
    public virtual IImageProvider ImageProvider { get; set; }

    public virtual HtmlPipelineContext HtmlPipelineContext { get; set; }
    public virtual ITagProcessorFactory TagProcessorFactory { get; set; }
    public virtual ICSSResolver CssResolver { get; set; }

    /* overloads simplfied to keep SO answer (relatively) short. if needed
     * set LinkProvider/ImageProvider after instantiating SimpleParser()
     * to override the defaults (e.g. ImageProvider.ScalePercent)
     */
    public SimpleParser() : this(null) { }
    public SimpleParser(string baseUri)
    {
        LinkProvider = new LinkProvider(new UriHelper(baseUri, false));
        ImageProvider = new ImageProvider(new UriHelper(baseUri, true));

        HtmlPipelineContext = new HtmlPipelineContext(null);

        // another story altogether, and not implemented for simplicity 
        TagProcessorFactory = Tags.GetHtmlTagProcessorFactory();
        CssResolver = XMLWorkerHelper.GetInstance().GetDefaultCssResolver(true);
    }

    /*
     * when sending XHR via any of the popular JavaScript frameworks,
     * <img> tags are **NOT** always closed, which results in the 
     * infamous iTextSharp.tool.xml.exceptions.RuntimeWorkerException:
     * 'Invalid nested tag a found, expected closing tag img.' a simple
     * workaround.
     */
    public virtual string SimpleAjaxImgFix(string xHtml)
    {
        return Regex.Replace(
            xHtml,
            "(?<image><img[^>]+)(?<=[^/])>",
            new MatchEvaluator(match => match.Groups["image"].Value + " />"),
            RegexOptions.IgnoreCase | RegexOptions.Multiline
        );
    }

    public virtual void Parse(Stream stream, string xHtml)
    {
        xHtml = SimpleAjaxImgFix(xHtml);

        using (var stringReader = new StringReader(xHtml))
        {
            using (Document document = new Document())
            {
                PdfWriter writer = PdfWriter.GetInstance(document, stream);
                document.Open();

                HtmlPipelineContext
                    .SetTagFactory(Tags.GetHtmlTagProcessorFactory())
                    .SetLinkProvider(LinkProvider)
                    .SetImageProvider(ImageProvider)
                ;
                var pdfWriterPipeline = new PdfWriterPipeline(document, writer);
                var htmlPipeline = new HtmlPipeline(HtmlPipelineContext, pdfWriterPipeline);
                var cssResolverPipeline = new CssResolverPipeline(CssResolver, htmlPipeline);

                XMLWorker worker = new XMLWorker(cssResolverPipeline, true);
                XMLParser parser = new XMLParser(worker);
                parser.Parse(stringReader);
            }
        }
    }
}

As内联注释, SimpleAjaxImgFix(字符串xHtml)专门处理 可能发送未关闭的< img> 标签的XHR,有效 HTML ,但无效 XML 打破 XMLWorker 。一个简单的解释&有关如何使用XHR和iTextSharp接收PDF或其他二进制数据的实现可在此处找到

As commented inline, SimpleAjaxImgFix(string xHtml) specifically handles XHR that may send unclosed <img> tags, which is valid HTML, but invalid XML that will break XMLWorker . A simple explanation & implementation of how to receive a PDF or other binary data with XHR and iTextSharp can be found here.

正则表达式用于 SimpleAjaxImgFix(字符串xHtml)所以任何使用(复制/粘贴?)代码的人都不需要添加另一个 nuget 包,但是 HTML 解析器,如 HtmlAgilityPack 使用,因为它转为:

A Regex was used in SimpleAjaxImgFix(string xHtml) so that anyone using (copy/paste?) the code doesn't need to add another nuget package, but a HTML parser like HtmlAgilityPack should be used, since it's turns this:

<div><img src='a.gif'><br><hr></div>

进入:

<div><img src='a.gif' /><br /><hr /></div>

只有几行代码:

var hDocument = new HtmlDocument()
{
    OptionWriteEmptyNodes = true,
    OptionAutoCloseOnEnd = true
};
hDocument.LoadHtml("<div><img src='a.gif'><br><hr></div>");
var closedTags  = hDocument.DocumentNode.WriteTo();

另外值得注意的是 - 使用 SimpleParser.Parse()上面作为通用蓝图,以额外实现自定义 ICSSResolver ITagProcessorFactory 在文档中解释

Also of note - use SimpleParser.Parse() above as a general blueprint to additionally implement a custom ICSSResolver or ITagProcessorFactory, which is explained in the documentation.

现在问题中描述的问题应该是已搞定。从 MVC行动方法调用:

Now the issues described in the question should be taken care of. Called from a MVC Action Method:

[HttpPost]  // some browsers have URL length limits
[ValidateInput(false)] // or throws HttpRequestValidationException
public ActionResult Index(string xHtml)
{
    Response.ContentType = "application/pdf";
    Response.AppendHeader(
        "Content-Disposition", "attachment; filename=test.pdf"
    );
    var simpleParser = new SimpleParser();
    simpleParser.Parse(Response.OutputStream, xHtml);

    return new EmptyResult();
}

或来自网络表格 HTML >服务器控制

or from a Web Form that gets HTML from a server control:

Response.ContentType = "application/pdf";
Response.AppendHeader("Content-Disposition", "attachment; filename=test.pdf");
using (var stringWriter = new StringWriter())
{
    using (var htmlWriter = new HtmlTextWriter(stringWriter))
    {
        ConvertControlToPdf.RenderControl(htmlWriter);
    }
    var simpleParser = new SimpleParser();
    simpleParser.Parse(Response.OutputStream, stringWriter.ToString());
}
Response.End();

或文件系统上包含超链接和图像的简单HTML文件:

or a simple HTML file with hyperlinks and images on the file system:

<h1>HTML Page 00 on Local File System</h1>
<div>
    <div>
        Relative &lt;img&gt;: <img src='Images/alt-gravatar.png' />
    </div>
    <div>
        Hyperlink to file system HTML page: 
        <a href='file-system-html-01.html'>Page 01</a>
    </div>
</div>

或来自远程网站的HTML:

or HTML from a remote web site:

<div>
    <div>
        <img width="200" alt="Wikipedia Logo"
             src="portal/wikipedia.org/assets/img/Wikipedia-logo-v2.png">
    </div>
    <div lang="en">
        <a href="https://en.wikipedia.org/">English</a>
    </div>
    <div lang="en">
        <a href="wiki/IText">iText</a>
    </div>
</div>

从控制台运行的两个 HTML 片段以上app:

Above two HTML snippets run from a console app:

var filePaths = Path.Combine(basePath, "file-system-html-00.html");
var htmlFile = File.ReadAllText(filePaths);
var remoteUrl = Path.Combine(basePath, "wikipedia.html");
var htmlRemote = File.ReadAllText(remoteUrl);
var outputFile = Path.Combine(basePath, "filePaths.pdf");
var outputRemote = Path.Combine(basePath, "remoteUrl.pdf");

using (var stream = new FileStream(outputFile, FileMode.Create))
{
    var simpleParser = new SimpleParser(basePath);
    simpleParser.Parse(stream, htmlFile);
}
using (var stream = new FileStream(outputRemote, FileMode.Create))
{
    var simpleParser = new SimpleParser("https://wikipedia.org");
    simpleParser.Parse(stream, htmlRemote);
}

相当长的答案,但看看SO标记的问题 html pdf itextsharp ,截至撰写本文时(2016-02-23)有 776个结果,总共标记为4,063个 itextsharp - 19%

Quite a long answer, but taking a look at questions here at SO tagged html, pdf, and itextsharp, as of this writing (2016-02-23) there are 776 results against 4,063 total tagged itextsharp - that's 19%.