Posts Tagged ‘html’

C# HTML Diff Algorithm

I have finally launched my first Codeplex project, very exciting :) I was inspired by writeboard.com to find some way of implementing an HTML difference viewer in an internal application I was developing. Essentially, I was looking for a way to take two blocks of HTML and compare them in a way that highlights what the differences are. This is extremely useful for CMS type systems where WYSIWYG/Textile/Wiki markup is used to populate content. In most web systems where content is authored dynamically, a history of the content is tracked over time. When collaborating with a few people, this feature is critically important. What makes it extremely useful is the capability to detect what has changed between versions. This post focuses on a project I have launched to do exactly that – track the difference between two versions of HTML markup.

The application I was building was developed on ASP .NET MVC (C#) so naturally I was looking for some C# code I could use to implement the difference algorithm. In searching, I could not find any libraries that were worth implementing. I did come across one or two command line utilities but nothing spectacular. I widened my search to other languages and came across a neat implementation in Ruby. The algorithm was developed by Nathan Herald who generously made the code available to everyone via the common MIT license.

So, I had the algorithm I was looking for, but I didn’t speak Ruby! This was an excellent opportunity to roll up my sleeves and learn some Ruby so I fired up my browser, downloaded the Windows one-click installer and got a simple environment up and running. After toying with code for a bit, scratching my head at one or two alien Ruby constructs I got the gist of how things worked. I fired up Visual Studio, created a new project and began the process of porting the algorithm. I must admit that the process was relatively painless and I got something working in a few hours. It took about another hour or two to iron out some bugs I picked up but essentially, in a relatively short space of time, I had the C# diff library that I was originally looking for! Below is a demo of how it is used, followed by one or two screenshots demonstrating the functionality when rendered to your browser.

            string oldText = @"<p>This is some sample text to demonstrate the capability of the <strong>HTML diff tool</strong>.</p>
                                <p>It is based on the Ruby implementation found <a href='http://github.com/myobie/htmldiff'>here</a>. Note how the link has no tooltip</p>
                                <table cellpadding='0' cellspacing='0'>
                                <tr><td>Some sample text</td><td>Some sample value</td></tr>
                                <tr><td>Data 1 (this row will be removed)</td><td>Data 2</td></tr>
                                </table>";

            string newText = @"<p>This is some sample text to demonstrate the awesome capabilities of the <strong>HTML diff tool</strong>.</p><br/><br/>Extra spacing here that was not here before.
                                <p>It is based on the Ruby implementation found <a title='Cool tooltip' href='http://github.com/myobie/htmldiff'>here</a>. Note how the link has a tooltip now and the HTML diff algorithm has preserved formatting.</p>
                                <table cellpadding='0' cellspacing='0'>
                                <tr><td>Some sample <strong>bold text</strong></td><td>Some sample value</td></tr>
                                </table>";

            HtmlDiff diffHelper = new HtmlDiff(oldText, newText);
            string diffOutput = diffHelper.Build();

Using the sample web application provided with the project in Codeplex, the following is rendered based on the code above:

Old HTML

Old HTML

Updated HTML

Updated HTML

HTML diff output

HTML diff output

You can see that the algorithm as originally developed takes care of the nasty HTML parsing to figure out how to highlight the differences. The changes are marked up using “ins” and “del” tags. You can easily style these tags as I have done. The CSS below is responsible for rendering the differences as per the example.

ins {
	background-color: #cfc;
	text-decoration: none;
}

del {
	color: #999;
	background-color:#FEC8C8;
}

I hope you find the library useful. I wish I had more time to add tests and more documentation to the Codeplex project, but for now I think the implementation is reasonably solid and easy to follow. If you spot any bugs, let me know and I’ll try and attend to them. Given that I am not responsible for the original implementation as developed in Ruby, it might be a bit tricky to solve some of the fundamental issues with the algorithm but I will certainly have a crack at it since I have quite a good understanding of how it works after porting it.

Link to C# implementation: http://htmldiff.codeplex.com
Link to Ruby implementation: http://github.com/myobie/htmldiff