C# HTML Diff Algorithm


I have finally launched my first Codeplex project, which is very exciting :) I was inspired by writeboard.com to find a way of implementing an HTML difference viewer in an internal application I was developing. Essentially, I was looking for a way to take two blocks of HTML and compare them so that the differences are highlighted. This is extremely useful for CMS-type systems where WYSIWYG/Textile/Wiki markup is used to populate content. Most web systems where content is authored dynamically track a history of that content over time, and when a few people collaborate this history is critically important. What makes it really useful is the ability to see what has changed between versions. This post focuses on a project I have launched to do exactly that: track the difference between two versions of HTML markup.

The application I was building was developed on ASP.NET MVC (C#), so naturally I was looking for some C# code I could use to implement the difference algorithm. In searching, I could not find any libraries that were worth using. I did come across one or two command line utilities, but nothing spectacular. I widened my search to other languages and came across a neat implementation in Ruby. The algorithm was developed by Nathan Herald, who generously made the code available to everyone under the MIT license.

So, I had the algorithm I was looking for, but I didn’t speak Ruby! This was an excellent opportunity to roll up my sleeves and learn some, so I fired up my browser, downloaded the Windows one-click installer and got a simple environment up and running. After toying with the code for a bit and scratching my head at one or two alien Ruby constructs, I got the gist of how things worked. I fired up Visual Studio, created a new project and began porting the algorithm. I must admit that the process was relatively painless and I had something working in a few hours. It took another hour or two to iron out the bugs I picked up but, in a relatively short space of time, I had the C# diff library that I was originally looking for! Below is a demo of how it is used, followed by one or two screenshots demonstrating the functionality when rendered in your browser.

            string oldText = @"<p>This is some sample text to demonstrate the capability of the <strong>HTML diff tool</strong>.</p>
                                <p>It is based on the Ruby implementation found <a href='http://github.com/myobie/htmldiff'>here</a>. Note how the link has no tooltip</p>
                                <table cellpadding='0' cellspacing='0'>
                                <tr><td>Some sample text</td><td>Some sample value</td></tr>
                                <tr><td>Data 1 (this row will be removed)</td><td>Data 2</td></tr>
                                </table>";

            string newText = @"<p>This is some sample text to demonstrate the awesome capabilities of the <strong>HTML diff tool</strong>.</p><br/><br/>Extra spacing here that was not here before.
                                <p>It is based on the Ruby implementation found <a title='Cool tooltip' href='http://github.com/myobie/htmldiff'>here</a>. Note how the link has a tooltip now and the HTML diff algorithm has preserved formatting.</p>
                                <table cellpadding='0' cellspacing='0'>
                                <tr><td>Some sample <strong>bold text</strong></td><td>Some sample value</td></tr>
                                </table>";

            HtmlDiff diffHelper = new HtmlDiff(oldText, newText);
            string diffOutput = diffHelper.Build();

Using the sample web application provided with the project on Codeplex, the code above renders the following:

[Screenshot: Old HTML]

[Screenshot: Updated HTML]

[Screenshot: HTML diff output]

You can see that the algorithm as originally developed takes care of the nasty HTML parsing to figure out how to highlight the differences. The changes are marked up using “ins” and “del” tags. You can easily style these tags as I have done. The CSS below is responsible for rendering the differences as per the example.

ins {
	background-color: #cfc;
	text-decoration: none;
}

del {
	color: #999;
	background-color:#FEC8C8;
}
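These rules only need to be present on the page that displays the diff output; the two input fragments do not have to carry them. When the output is shown inside an existing site, adding the rules to the site stylesheet is enough. For a standalone page, the following is a minimal sketch of one way to bundle the styles with the markup returned by `Build()`. `DiffPage` and `Wrap` are hypothetical names invented for this example and are not part of the library.

```csharp
using System;
using System.Text;

// Hypothetical helper (not part of the HtmlDiff library): wraps diff markup
// in a minimal HTML page that carries the ins/del styles shown above.
public static class DiffPage
{
    private const string Styles =
        "ins { background-color: #cfc; text-decoration: none; }\n" +
        "del { color: #999; background-color: #FEC8C8; }\n";

    public static string Wrap(string diffOutput)
    {
        if (diffOutput == null) throw new ArgumentNullException("diffOutput");

        StringBuilder page = new StringBuilder();
        page.Append("<!DOCTYPE html>\n<html>\n<head>\n<style type=\"text/css\">\n");
        page.Append(Styles);
        page.Append("</style>\n</head>\n<body>\n");
        // The diff output is already HTML, so it is inserted as-is (not encoded).
        page.Append(diffOutput);
        page.Append("\n</body>\n</html>\n");
        return page.ToString();
    }
}
```

With the demo code earlier in the post, this would be called as `string page = DiffPage.Wrap(diffHelper.Build());`.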

I hope you find the library useful. I wish I had more time to add tests and more documentation to the Codeplex project, but for now I think the implementation is reasonably solid and easy to follow. If you spot any bugs, let me know and I’ll try to attend to them. Given that I am not responsible for the original Ruby implementation, it might be a bit tricky to solve some of the fundamental issues with the algorithm, but I will certainly have a crack at it since I have quite a good understanding of how it works after porting it.

Link to C# implementation: http://htmldiff.codeplex.com
Link to Ruby implementation: http://github.com/myobie/htmldiff


  1. #1 by Tim on December 28, 2009 - 9:48 pm

    Very nice, and just what I was looking for!

    Great work.

  2. #2 by Jason on January 22, 2010 - 7:35 pm

    Hi Rohland: Thanks for the great library. Is there any way that your solution could be ported to .Net 2 (ie without Linq)?

  3. #3 by Rohland on January 23, 2010 - 12:58 pm

    Jason :

    Hi Rohland: Thanks for the great library. Is there any way that your solution could be ported to .Net 2 (ie without Linq)?

    It shouldn’t take too long. Not much of what is there really relies on Linq. If you download the source code you should be able to convert it pretty easily. For the cases where anonymous delegates are used, I would suggest replacing the anonymous delegate calls with custom delegates.
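To illustrate the kind of conversion described in this reply: a LINQ call such as `words.Where(w => w.StartsWith("<"))` can be rewritten against the .NET 2.0 `List<T>` API by passing a named method as a `Predicate<T>`. This is a sketch only. `Net2Example`, `IsTag` and `TagsOnly` are made up for the example and are not taken from the library's source.

```csharp
using System;
using System.Collections.Generic;

public static class Net2Example
{
    // Named method standing in for a lambda such as
    // w => w.StartsWith("<") && w.EndsWith(">").
    private static bool IsTag(string word)
    {
        return word.StartsWith("<", StringComparison.Ordinal)
            && word.EndsWith(">", StringComparison.Ordinal);
    }

    public static List<string> TagsOnly(List<string> words)
    {
        // List<T>.FindAll takes a Predicate<T> and is available in .NET 2.0,
        // so no reference to System.Linq is needed.
        return words.FindAll(new Predicate<string>(IsTag));
    }
}
```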

  4. #4 by Yannick Desjardins on March 26, 2010 - 5:55 pm

    Hi, have you made any more work on this subject? Is it solid enough for commercial integration? E-Mail me, I would like to discuss licensing this code.

    Thanks.

  5. #5 by Rohland on April 2, 2010 - 8:46 am

    There have been a few minor enhancements to the project hosted on Codeplex. In terms of licensing, feel free to use it in your commercial application as per the MIT license included with the download. Good luck!

  6. #6 by shailesh on May 14, 2010 - 8:43 am

    Hi can you help. Your code is working fine. But it is taking too much time for lengthy files.

    Thanks
    Shailesh

  7. #7 by Len D'Alberti on June 2, 2010 - 5:33 pm

    hi Rohland – other than the fact that I need to implement this in Java, it’s exactly what I was looking for.

    any ideas/hints on how to go about creating a Java implementation?

    -Len

  8. #8 by admax on June 2, 2010 - 7:11 pm

    If first input is “text” and second input is “text” then difference will be

    text

    Is this a bug?

  9. #9 by Alan Guégan on June 3, 2010 - 11:36 am

    This project is brilliant. The only problem for me is that it does not group modifications as larger groups of text. For human readability, the “smaller differences found” option is sometimes not the best one :-)
    I can’t figure what modification could be done to improve the algorithm, unfortunately…

  10. #10 by Alan Guégan on June 4, 2010 - 2:49 pm

    Finally i figured out how to group modifications (operations). If you are interested…

  11. #11 by Rohland on June 5, 2010 - 12:51 pm

    Alan Guégan :

    Finally i figured out how to group modifications (operations). If you are interested…

    Sounds interesting. Did you make any changes to the original source code? Perhaps you could submit a patch with an overloaded function with some kind of flag to set whether modifications are grouped.

  12. #12 by Rohland on June 5, 2010 - 12:52 pm

    shailesh :

    Hi can you help. Your code is working fine. But it is taking too much time for lengthy files.

    Thanks
    Shailesh

    I haven’t done much in the way of performance optimisation. I’ll look into it when I get a chance.

  13. #13 by tats on June 7, 2010 - 11:04 am

    Hi, gr8 program.
    Can you please help me out, i have tried it but it doesn’t highlight tag difference on text difference it works. For example if its
    text1=”word”
    text2=”word
    It does not highlight the difference. otherwise it works fine.
    Thanks

  14. #14 by Rohland on June 7, 2010 - 11:11 am

    tats :

    Hi, gr8 program.
    Can you please help me out, i have tried it but it doesn’t highlight tag difference on text difference it works. For example if its
    text1=”word”
    text2=”word
    It does not highlight the difference. otherwise it works fine.
    Thanks

    Hmm, that should work. Please can you send the inputs that you are using (i.e. the two strings you are comparing). If it is simply a presentational change it should highlight the text in orange. Have you applied the relevant style sheet classes?

  15. #15 by tats on June 7, 2010 - 5:27 pm

    Sorry for the delay, I tried it, gives color highlight on text change but not on same text changed to italic or bold.
    here is a sample code-

    oldText = @”Who Can? Individual research projects can be undertaken.”;

    newText = @”Who Can? Individual research projects can be undertaken.”;

  16. #16 by tats on June 7, 2010 - 5:33 pm

    exact code –

    oldText = @"<div style='padding-left: 12px; padding-right: 12px'><strong>Who Can?</strong> <br /><br />Individual research projects can be undertaken.";

    newText = @"<div style='padding-left: 12px; padding-right: 12px'><strong>Who Can?</strong> <br /><br />Individual <span style='font-style: italic'>research</span> projects <span style='font-weight: bold'>can</span> be undertaken.";

  17. #17 by Rohland on June 7, 2010 - 6:25 pm

    tats :

    exact code –

    oldText = @"<div style='padding-left: 12px; padding-right: 12px'><strong>Who Can?</strong> <br /><br />Individual research projects can be undertaken.";

    newText = @"<div style='padding-left: 12px; padding-right: 12px'><strong>Who Can?</strong> <br /><br />Individual <span style='font-style: italic'>research</span> projects <span style='font-weight: bold'>can</span> be undertaken.";

    Unfortunately, this scenario is not supported right now. As it stands, it can only detect style differences if the styles are implemented using tags such as i,b,strong,u etc… In time I may implement a feature to detect style changes based on the inline style info, although this could be complicated due to CSS inheritance.

  18. #18 by tats on June 7, 2010 - 6:34 pm

    True.. I understand. Actually input appears like this because of the RichTextBox control i have used and create the html on its own. I mean, its little out of control.
    Thanks anyway, your code is really helpful.
    Thanks for your instant reply.
    Just in case you update this library, plz let me know. Thanks a lot :)

  19. #19 by tats on June 12, 2010 - 5:43 pm

    Hi, I was wondering if i add one more array item to your
    string[] specialCaseOpeningTags = new string[] {….., “\\s]+” }
    and specialCaseClosingTags = “”
    It works, but sometime it doesn’t gives correct result. Do you see any mistake in this?

  20. #20 by tats on June 12, 2010 - 5:45 pm

    string[] specialCaseOpeningTags = new string[] {….., "<span[\\:bold|:italic|:underline\\>\\s]+" }
    and specialCaseClosingTags = "<span>"

  21. #21 by Rohland on June 13, 2010 - 1:37 pm

    tats :

    string[] specialCaseOpeningTags = new string[] {….., "<span[\\:bold|:italic|:underline\\>\\s]+" }
    and specialCaseClosingTags = "<span>"

    Give this a try: <span[^<]+(italic|bold|underline)[^<]+>
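For anyone wanting to check what the suggested pattern accepts before adding it to `specialCaseOpeningTags`, it can be tried in isolation with .NET's `Regex` class. The sketch below assumes nothing about how the library applies the array entries; `SpanPatternCheck` and `IsStyledSpan` are names invented for the example.

```csharp
using System;
using System.Text.RegularExpressions;

public static class SpanPatternCheck
{
    // The pattern suggested above, written as a verbatim string.
    // Matching is case-sensitive unless RegexOptions.IgnoreCase is added.
    private static readonly Regex StyledSpan =
        new Regex(@"<span[^<]+(italic|bold|underline)[^<]+>");

    // True when the text contains a span opening tag that mentions
    // italic, bold or underline somewhere in its attributes.
    public static bool IsStyledSpan(string tag)
    {
        return StyledSpan.IsMatch(tag);
    }
}
```

Called on `<span style='font-style: italic'>` or `<span style='font-weight: bold'>`, `IsStyledSpan` returns true; called on a bare `<span>` or on `<span style='left:auto'>`, it returns false. Note that `[^<]+` also matches `>`, so the pattern is not strictly confined to a single tag.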

  22. #22 by tats on June 14, 2010 - 9:46 am

    It worked :)

    I’ll try on different types html content now..
    Thanks a lot !

  23. #23 by tats on June 16, 2010 - 9:24 am

    hmm… there is a problem,
    If we have only <span> or <span with some attributes other than (italic|bold|underline), it goes wrong. Also, it should check at least one attribute matching from (italic|bold|underline).

    The input,output values are as below -
    oldtext = "<div style='padding-left: 12px; padding-right: 12px'><span style='left:auto'><span style='font-weight: bold'>Who Can?</span> <br /><br />Individual research projects can be undertaken.</span>"

    newtext = "<div style='padding-left: 12px; padding-right: 12px'>Who Can? <br /><br />Individual <i>research</i> projects <span style='text-decoration: underline'>can</span> be."

    Result = "<div style='padding-left: 12px; padding-right: 12px'><span style='font-weight: bold'><ins class='mod'>Who Can?</ins> <br /><br />Individual <i><ins class='mod'>research</ins></i> projects <span style='text-decoration: underline'><ins class='mod'>can</ins></span> <del class='diffmod'>be undertaken.</del></span><ins class='diffmod'>be.</ins>"

  24. #24 by Jon on July 6, 2010 - 12:05 pm

    This flipping rules! I needed to compare the difference between two asp.net pages and display it in a sensible way. One nice easy class (which I have ported over to vb.net), and it just works in a couple of lines of code…

    All I need now is to add image diff, but that is definitely for another day!

  25. #25 by Alok on July 25, 2010 - 3:28 am

    I am comparing the following two, and it seems output has a bug:

    File 1:

    Table text unchanged
    Table text before
    Table text before

    Row will be deleted

    File 2:

    Table text unchanged
    Table text after
    Table text after

    Output:

    Table text unchanged
    Table text beforeafter
    Table text before


    Row will be deletedafter

    Why is there an “after” after “Row will be deleted”? It should be before!

  26. #26 by Alok on July 25, 2010 - 3:30 am

    I like your CSS example. How do I incorporate that into the output html file? Are the two input files supposed to carry that?

  27. #27 by tats on August 24, 2010 - 7:28 am

    Can you please help, it is not closing one tag. So it is highlighting everything which comes after that.

    Old Text = On this website. This is a commercial company.

    New Text = <span style='style:italic'>On this website endtext.</span> This is a commercial company.

    Output = <span style='style:italic'><ins class='mod'>On this <del class='diffmod'>website.</del><ins class='diffmod'>website endtext.</ins></span> This is a commercial company.

    Thanks

  28. #28 by tats on August 24, 2010 - 7:29 am

    Can you please help, it is not closing one <ins> tag. So it is highlighting everything which comes after that.

    Old Text = On this website. This is a commercial company.

    New Text = <span style='style:italic'>On this website endtext.</span> This is a commercial company.

    Output = <span style='style:italic'><ins class='mod'>On this <del class='diffmod'>website.</del><ins class='diffmod'>website endtext.</ins></span> This is a commercial company.

    Thanks

  29. #29 by tats on August 24, 2010 - 2:01 pm

    Hi,

    Please ignore the above two queries.

    Can you please help, it is not closing one <ins> tag. So it is highlighting everything which comes after that.

    Old Text = On this website. This is a commercial company.

    New Text = <i>On this website. New line added.</i> This is a commercial company.

    Output = <i><ins class=’mod’>On this website.<ins class=’diffins’> New line added.</ins></i> This is a commercial company.

    Thanks
