C# HTML Diff Algorithm

Posted in Development

I have finally launched my first CodePlex project, very exciting :) I was inspired by writeboard.com to find a way of implementing an HTML difference viewer in an internal application I was developing. Essentially, I wanted to take two blocks of HTML and compare them in a way that highlights the differences. This is extremely useful for CMS-type systems where WYSIWYG/Textile/Wiki markup is used to populate content. Most web systems where content is authored dynamically track a history of that content over time, and when a few people collaborate this history is critically important. What makes it really useful is the ability to see what has changed between versions. This post focuses on a project I have launched to do exactly that: show the difference between two versions of HTML markup.

The application I was building was developed on ASP.NET MVC (C#), so naturally I was looking for some C# code I could use to implement the difference algorithm. In searching, I could not find any libraries that were worth using. I did come across one or two command line utilities but nothing spectacular. I widened my search to other languages and came across a neat implementation in Ruby. The algorithm was developed by Nathan Herald, who generously made the code available to everyone under the MIT license.

So, I had the algorithm I was looking for, but I didn’t speak Ruby! This was an excellent opportunity to roll up my sleeves and learn some Ruby, so I fired up my browser, downloaded the Windows one-click installer and got a simple environment up and running. After toying with the code for a bit and scratching my head at one or two alien Ruby constructs, I got the gist of how things worked. I fired up Visual Studio, created a new project and began porting the algorithm. I must admit that the process was relatively painless and I got something working in a few hours. It took another hour or two to iron out some bugs I picked up, but in a relatively short space of time I had the C# diff library that I was originally looking for! Below is a demo of how it is used, followed by screenshots demonstrating the functionality when rendered in your browser.

            string oldText = @"<p>This is some sample text to demonstrate the capability of the <strong>HTML diff tool</strong>.</p>
                                <p>It is based on the Ruby implementation found <a href='http://github.com/myobie/htmldiff'>here</a>. Note how the link has no tooltip</p>
                                <table cellpadding='0' cellspacing='0'>
                                <tr><td>Some sample text</td><td>Some sample value</td></tr>
                                <tr><td>Data 1 (this row will be removed)</td><td>Data 2</td></tr>
                                </table>";

            string newText = @"<p>This is some sample text to demonstrate the awesome capabilities of the <strong>HTML diff tool</strong>.</p><br/><br/>Extra spacing here that was not here before.
                                <p>It is based on the Ruby implementation found <a title='Cool tooltip' href='http://github.com/myobie/htmldiff'>here</a>. Note how the link has a tooltip now and the HTML diff algorithm has preserved formatting.</p>
                                <table cellpadding='0' cellspacing='0'>
                                <tr><td>Some sample <strong>bold text</strong></td><td>Some sample value</td></tr>
                                </table>";

            HtmlDiff diffHelper = new HtmlDiff(oldText, newText);
            string diffOutput = diffHelper.Build();

Using the sample web application provided with the project on CodePlex, the following is rendered based on the code above:

[Screenshot: Old HTML]

[Screenshot: Updated HTML]

[Screenshot: HTML diff output]

You can see that the algorithm as originally developed takes care of the nasty HTML parsing to figure out how to highlight the differences. The changes are marked up using “ins” and “del” tags. You can easily style these tags as I have done. The CSS below is responsible for rendering the differences as per the example.

ins {
	background-color: #cfc;
	text-decoration: none;
}

del {
	color: #999;
	background-color:#FEC8C8;
}
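
The stylesheet only has to be present on the page that displays the diff output; the two input strings do not need to carry it. As a minimal sketch of one way to do that, the helper below wraps a diff string in a standalone HTML page with the CSS above embedded. The `DiffPage` class and its `Wrap` method are hypothetical names used for illustration and are not part of the library.

```csharp
using System.Text;

// Hypothetical helper (not part of the HtmlDiff library): wraps diff
// output in a minimal HTML page that embeds the ins/del styles.
public static class DiffPage
{
    // Same rules as the stylesheet shown in the post.
    private const string DiffCss =
        "ins { background-color: #cfc; text-decoration: none; }\n" +
        "del { color: #999; background-color: #FEC8C8; }";

    public static string Wrap(string diffOutput)
    {
        var page = new StringBuilder();
        page.AppendLine("<!DOCTYPE html>");
        page.AppendLine("<html><head><meta charset=\"utf-8\">");
        page.AppendLine("<style>");
        page.AppendLine(DiffCss);
        page.AppendLine("</style></head><body>");
        page.AppendLine(diffOutput); // the string returned by Build()
        page.AppendLine("</body></html>");
        return page.ToString();
    }
}
```

With the demo code above, something like `System.IO.File.WriteAllText("diff.html", DiffPage.Wrap(diffOutput));` would then produce a file that can be opened directly in a browser.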

I hope you find the library useful. I wish I had more time to add tests and more documentation to the Codeplex project, but for now I think the implementation is reasonably solid and easy to follow. If you spot any bugs, let me know and I’ll try and attend to them. Given that I am not responsible for the original implementation as developed in Ruby, it might be a bit tricky to solve some of the fundamental issues with the algorithm but I will certainly have a crack at it since I have quite a good understanding of how it works after porting it.

Link to C# implementation: http://htmldiff.codeplex.com
Link to Ruby implementation: http://github.com/myobie/htmldiff


44 Responses

  1. Tim says:

    Very nice, and just what I was looking for!

    Great work.

  2. Jason says:

    Hi Rohland: Thanks for the great library. Is there any way that your solution could be ported to .Net 2 (ie without Linq)?

  3. Rohland says:

    Jason :

    Hi Rohland: Thanks for the great library. Is there any way that your solution could be ported to .Net 2 (ie without Linq)?

    It shouldn’t take too long. Not much of what is there really relies on Linq. If you download the source code you should be able to convert it pretty easily. For the cases where anonymous delegates are used, I would suggest replacing the anonymous delegate calls with custom delegates.
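
As a rough illustration of the kind of replacement described above, the sketch below swaps a LINQ-style `Any` call for a plain loop driven by a custom delegate, which compiles without `System.Linq` or `Func<>`. The names `WordPredicate`, `Net2Helpers` and `IsTag` are hypothetical and are not taken from the library's source.

```csharp
using System.Collections.Generic;

// Custom delegate standing in for Func<string, bool>, which .NET 2.0 lacks.
public delegate bool WordPredicate(string word);

// Hypothetical helpers for illustration only.
public static class Net2Helpers
{
    // Loop-based stand-in for LINQ's Enumerable.Any(predicate).
    public static bool Any(IEnumerable<string> words, WordPredicate predicate)
    {
        foreach (string word in words)
        {
            if (predicate(word))
            {
                return true;
            }
        }
        return false;
    }

    // A named method that can be passed wherever a WordPredicate is expected.
    public static bool IsTag(string word)
    {
        return word.Length > 1
            && word[0] == '<'
            && word[word.Length - 1] == '>';
    }
}
```

A call such as `words.Any(w => ...)` would then become `Net2Helpers.Any(words, Net2Helpers.IsTag)`.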

  4. Hi, have you done any more work on this subject? Is it solid enough for commercial integration? E-mail me, I would like to discuss licensing this code.

    Thanks.

  5. Rohland says:

    There have been a few minor enhancements to the project hosted on Codeplex. In terms of licensing, feel free to use it in your commercial application as per the MIT license included with the download. Good luck!

  6. shailesh says:

    Hi can you help. Your code is working fine. But it is taking too much time for lengthy files.

    Thanks
    Shailesh

  7. hi Rohland – other than the fact that I need to implement this in Java, it’s exactly what I was looking for.

    any ideas/hints on how to go about creating a Java implementation?

    -Len

  8. admax says:

    If first input is “text” and second input is “text” then difference will be

    text

    Is this a bug?

  9. Alan Guégan says:

    This project is brilliant. The only problem for me is that it does not group modifications into larger groups of text. For human readability, the “smaller differences found” option is sometimes not the best one :-)
    I can’t figure out what modification could be made to improve the algorithm, unfortunately…

  10. Alan Guégan says:

    Finally i figured out how to group modifications (operations). If you are interested…

  11. Rohland says:

    Alan Guégan :

    Finally i figured out how to group modifications (operations). If you are interested…

    Sounds interesting. Did you make any changes to the original source code? Perhaps you could submit a patch with an overloaded function with some kind of flag to set whether modifications are grouped.

  12. Rohland says:

    shailesh :

    Hi can you help. Your code is working fine. But it is taking too much time for lengthy files.

    Thanks
    Shailesh

    I haven’t done much in the way of performance optimisation. I’ll look into it when I get a chance.

  13. tats says:

    Hi, gr8 program.
    Can you please help me out, i have tried it but it doesn’t highlight tag difference on text difference it works. For example if its
    text1=”word”
    text2=”word
    It does not highlight the difference. otherwise it works fine.
    Thanks

  14. Rohland says:

    tats :

    Hi, gr8 program.
    Can you please help me out, i have tried it but it doesn’t highlight tag difference on text difference it works. For example if its
    text1=”word”
    text2=”word
    It does not highlight the difference. otherwise it works fine.
    Thanks

    Hmm, that should work. Please can you send the inputs that you are using (i.e. the two strings you are comparing). If it is simply a presentational change it should highlight the text in orange. Have you applied the relevant style sheet classes?

  15. tats says:

    Sorry for the delay, I tried it, gives color highlight on text change but not on same text changed to italic or bold.
    here is a sample code-

    oldText = @”Who Can? Individual research projects can be undertaken.”;

    newText = @”Who Can? Individual research projects can be undertaken.”;

  16. tats says:

    exact code –

    oldText = @"<div style='padding-left: 12px; padding-right: 12px'><strong>Who Can?</strong> <br /><br />Individual research projects can be undertaken.";

    newText = @"<div style='padding-left: 12px; padding-right: 12px'><strong>Who Can?</strong> <br /><br />Individual <span style='font-style: italic'>research</span> projects <span style='font-weight: bold'>can</span> be undertaken.";

  17. Rohland says:

    tats :

    exact code –

    oldText = @"<div style='padding-left: 12px; padding-right: 12px'><strong>Who Can?</strong> <br /><br />Individual research projects can be undertaken.";

    newText = @"<div style='padding-left: 12px; padding-right: 12px'><strong>Who Can?</strong> <br /><br />Individual <span style='font-style: italic'>research</span> projects <span style='font-weight: bold'>can</span> be undertaken.";

    Unfortunately, this scenario is not supported right now. As it stands, it can only detect style differences if the styles are implemented using tags such as i,b,strong,u etc… In time I may implement a feature to detect style changes based on the inline style info, although this could be complicated due to CSS inheritance.

  18. tats says:

    True.. I understand. Actually the input looks like this because of the RichTextBox control I have used, which creates the HTML on its own. I mean, it's a little out of my control.
    Thanks anyway, your code is really helpful.
    Thanks for your instant reply.
    Just in case you update this library, plz let me know. Thanks a lot :)

  19. tats says:

    Hi, I was wondering if i add one more array item to your
    string[] specialCaseOpeningTags = new string[] {….., “\\s]+” }
    and specialCaseClosingTags = “”
    It works, but sometime it doesn’t gives correct result. Do you see any mistake in this?

  20. tats says:

    string[] specialCaseOpeningTags = new string[] {….., "<span[\\:bold|:italic|:underline\\>\\s]+" }
    and specialCaseClosingTags = "<span>"

  21. Rohland says:

    tats :

    string[] specialCaseOpeningTags = new string[] {….., "<span[\\:bold|:italic|:underline\\>\\s]+" }
    and specialCaseClosingTags = "<span>"

    Give this a try:<span[^<]+(italic|bold|underline)[^<]+>
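
As a quick check of the pattern suggested above, the sketch below runs it against a few opening tags. Note that `[^<]` also matches `>`, so the pattern can run past the end of a tag. `StricterPattern`, which excludes `>` as well, is a hypothetical variant shown only for comparison; it is not something the library ships. The class and method names are likewise assumptions made for this example.

```csharp
using System.Text.RegularExpressions;

// Hypothetical test harness for the span pattern; not part of the library.
public static class SpanPatternCheck
{
    // The pattern exactly as suggested in the reply above.
    public const string SuggestedPattern =
        @"<span[^<]+(italic|bold|underline)[^<]+>";

    // Assumed stricter variant: [^<>] keeps the match inside a single tag.
    public const string StricterPattern =
        @"<span[^<>]+(italic|bold|underline)[^<>]*>";

    public static bool Matches(string pattern, string html)
    {
        return Regex.IsMatch(html, pattern);
    }
}
```

Under these assumptions, `<span style='font-style: italic'>` matches both patterns, and `<span style='left:auto'>` and a bare `<span>` match neither. A fragment such as `<span class='x'>a bold > b</span>` matches only the suggested pattern, because the keyword and the `>` are found in the text after the tag.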

  22. tats says:

    It worked :)

    I’ll try on different types html content now..
    Thanks a lot !

  23. tats says:

    hmm… there is a problem,
    If we have only <span> or <span with some attributes other than (italic|bold|underline), it goes wrong. Also, it should check at least one attribute matching from (italic|bold|underline).

    The input,output values are as below -
    oldtext = "<div style='padding-left: 12px; padding-right: 12px'><span style='left:auto'><span style='font-weight: bold'>Who Can?</span> <br /><br />Individual research projects can be undertaken.</span>"

    newtext = "<div style='padding-left: 12px; padding-right: 12px'>Who Can? <br /><br />Individual <i>research</i> projects <span style='text-decoration: underline'>can</span> be."

    Result = "<div style='padding-left: 12px; padding-right: 12px'><span style='font-weight: bold'><ins class='mod'>Who Can?</ins> <br /><br />Individual <i><ins class='mod'>research</ins></i> projects <span style='text-decoration: underline'><ins class='mod'>can</ins></span> <del class='diffmod'>be undertaken.</del></span><ins class='diffmod'>be.</ins>"

  24. Jon says:

    This flipping rules! I needed to compare the difference between two asp.net pages and display it in a sensible way. One nice easy class (which I have ported over to vb.net), and it just works in a couple of lines of code…

    All I need now is to add image diff, but that is definitely for another day!

  25. Alok says:

    I am comparing the following two, and it seems output has a bug:

    File 1:

    Table text unchanged
    Table text before
    Table text before

    Row will be deleted

    File 2:

    Table text unchanged
    Table text after
    Table text after

    Output:

    Table text unchanged
    Table text beforeafter
    Table text before


    Row will be deletedafter

    Why is there an “after” after “Row will be deleted”? It should be before!

  26. Alok says:

    I like your CSS example. How do I incorporate that into the output html file? Are the two input files supposed to carry that?

  27. tats says:

    Can you please help, it is not closing one tag. So it is highlighting everything which comes after that.

    Old Text = On this website. This is a commercial company.

    New Text = <span style='style:italic'>On this website endtext.</span> This is a commercial company.

    Output = <span style='style:italic'><ins class='mod'>On this <del class='diffmod'>website.</del><ins class='diffmod'>website endtext.</ins></span> This is a commercial company.

    Thanks

  28. tats says:

    Can you please help, it is not closing one <ins> tag. So it is highlighting everything which comes after that.

    Old Text = On this website. This is a commercial company.

    New Text = <span style='style:italic'>On this website endtext.</span> This is a commercial company.

    Output = <span style='style:italic'><ins class='mod'>On this <del class='diffmod'>website.</del><ins class='diffmod'>website endtext.</ins></span> This is a commercial company.

    Thanks

  29. tats says:

    Hi,

    Please leave above two queries

    Can you please help, it is not closing one <ins> tag. So it is highlighting everything which comes after that.

    Old Text = On this website. This is a commercial company.

    New Text = <i>On this website. New line added.</i> This is a commercial company.

    Output = <i><ins class='mod'>On this website.<ins class='diffins'> New line added.</ins></i> This is a commercial company.

    Thanks

  30. tats says:

    How to highlight url link change if only href="" value is changed? :(

  31. sdafasdf says:

    Rohland :

    Jason :
    Hi Rohland: Thanks for the great library. Is there any way that your solution could be ported to .Net 2 (ie without Linq)?

    It shouldn’t take too long. Not much of what is there really relies on Linq. If you download the source code you should be able to convert it pretty easily. For the cases where anonymous delegates are used, I would suggest replacing the anonymous delegate calls with custom delegates.

  32. Josh says:

    I have a table with 4 rows and 3 columns:

    A1 B1 C1
    A2 B2 C2
    A3 B3 C3
    A4 B4 C4

    When you delete row 4 and add a column D, the cell in the new column that is located immediately before the deleted row gets placed in the deleted row.

    A1 B1 C1 D1
    A2 B2 C2 D2
    A3 B3 C3
    A4 B4 C4D3

    Any ideas on how to fix this?

  33. GP says:

    Hi Rohland,

    I was wondering if you’ve continued your work on this library. The version on CodePlex is dated Sat Oct 31 2009.

    If you have worked on it I would appreciate if you could either send me an updated version or publish it.

    Thanks.

  34. Rohland says:

    I’ve migrated the repository to GitHub. Please log any issues there.

    https://github.com/Rohland/htmldiff.net

    Cheers.

  35. roboasimo says:

    Issue: Project failed to preview the expected output.
    observe the following scenario:

    Base file:
    Abc

    Modified file:
    abc

    Output:
    Abc10pt">abc

    10pt"> should not be there.

    This appears in every place where modified text is attached to tags.

  36. VikasGoel says:

    VikasGoel :
    Hi,
    I am facing problem in 2 cases.
    Ex: We have a text string as
    “This is our line.”
    and now I add “new word” in upper line.
    Ex: new line is “This is our line new word.” Then it is not giving the correct result. It first creates an overline on “line” and then creates “line new word”.

    Second problem with comma(,). if i add any word before comma(,) word then it delete first and then create a new word.
    Pls suggest me.
    Thanks

  37. Rohland says:

    The issue with punctuation after words has been resolved. Please log any further issues here: https://github.com/Rohland/htmldiff.net/issues

  38. Hi Rohland,

    Could you please have a look at the issue I have posted there? https://github.com/Rohland/htmldiff.net/issues/5

    Any help is much appreciated
    Thank you in advance!

    Best wishes,
    Dmitry

  39. Code2 says:

    Thank you for this project.
    I’m wondering if it’s possible to separate the old and the new content in the diff result, like side-by-side?

  40. kD says:

    Do you have any support for determining the differences between option tags? i.e dropdowns?

    Any points would be welcome. If i come up with a solution i’ll contribute it back to you

  41. Ash W says:

    Legend!
    A gift that you have provided the source code for anyone to modify it and a decent foundation too.

    So weird that MS hadn’t provided a decent solution in .NET

    Thanks again

    A

  42. For anyone looking for a web service to determine diffs, check out http://imnosy.com

    I dev’d it over a few months, and even played around with this C# base (but ultimately, couldn’t get it working on an Ubuntu stack)

  43. Awesome man!
    That’s exactly what I needed. It worked at first try. I’ve loved it!
    Keep up the good work! :)

  44. Franky says:

    Hi!
    Pretty cool lib, but has some issues if a few words are replaced with different ones.
    Eg “this is a test” and “this was some test” yields substitutions “is” -> “was” and “a” -> “some” instead of “is a” -> “was some”.
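
One possible way to get the grouped result described in the comment above is a post-processing pass that merges two substitutions separated only by whitespace. The sketch below assumes a simplified, hypothetical edit model (`Edit`, `EditKind`, `EditGrouping`). The library's own operation types are not shown in this post and may look quite different.

```csharp
using System.Collections.Generic;

// Hypothetical edit model for illustration; not the library's own types.
public enum EditKind { Equal, Replace }

public sealed class Edit
{
    public EditKind Kind { get; private set; }
    public string Old { get; private set; }
    public string New { get; private set; }

    public Edit(EditKind kind, string oldText, string newText)
    {
        Kind = kind;
        Old = oldText;
        New = newText;
    }
}

public static class EditGrouping
{
    // Merges "Replace, whitespace-only Equal, Replace" into one Replace.
    public static List<Edit> Group(IEnumerable<Edit> edits)
    {
        var result = new List<Edit>();
        foreach (Edit edit in edits)
        {
            int n = result.Count;
            bool canMerge = edit.Kind == EditKind.Replace
                && n >= 2
                && result[n - 1].Kind == EditKind.Equal
                && string.IsNullOrWhiteSpace(result[n - 1].Old)
                && result[n - 2].Kind == EditKind.Replace;

            if (canMerge)
            {
                Edit gap = result[n - 1];
                Edit previous = result[n - 2];
                result.RemoveRange(n - 2, 2);
                // Fold the previous replace, the gap and this replace together.
                result.Add(new Edit(EditKind.Replace,
                    previous.Old + gap.Old + edit.Old,
                    previous.New + gap.New + edit.New));
            }
            else
            {
                result.Add(edit);
            }
        }
        return result;
    }
}
```

For the example in the comment, the edits Equal "this ", Replace "is" -> "was", Equal " ", Replace "a" -> "some", Equal " test" would collapse to Equal "this ", Replace "is a" -> "was some", Equal " test". Whether merging across whitespace is always desirable is a judgment call, so a flag to switch it on or off seems sensible.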
