libtidy Introduction

libtidy is the library version of HTML Tidy. In fact, Tidy is libtidy; the console application is a very simple C application that links against libtidy. It’s what powers Tidy, mod-tidy, and countless other applications that perform tidying.

Please note that this content is adapted from the original on SourceForge.

Design factors

  • libtidy is easy to integrate. Because of the near universal adoption of C linkage, a C interface may be called from a great number of programming languages.

  • libtidy was designed to use opaque types in the public interface. The application just passes a handle (effectively an integer) around, which minimizes the need to transform data types between languages. As a result, it’s straightforward to write very thin library wrappers for C++, Pascal, and COM/ATL.

  • libtidy eats its own dogfood. HTML Tidy links directly to libtidy.

  • libtidy is thread-safe and re-entrant. Because HTML Tidy has many uses, from content validation and content scraping to conversion to XHTML, it was important to make libtidy run reasonably well within server applications as well as on the client side.

  • libtidy uses adaptable I/O. As part of the larger integration strategy it was decided to fully abstract all I/O. This means a (relatively) clean separation between character encoding processing and shovelling bytes back and forth. Internally, the library reads from sources and writes to sinks. This abstraction is used for both markup and configuration “files”. Concrete implementations are provided for file and memory I/O, and new sources and sinks may be provided via the public interface.
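
The opaque-type point in the list above can be illustrated with a minimal sketch in plain C. This is not libtidy’s code: `CounterHandle`, `counter_create()`, `counter_increment()`, and `counter_release()` are hypothetical names chosen only to show the pattern. The caller sees a handle whose layout is hidden, and all access goes through functions.

```c
#include <assert.h>
#include <stdlib.h>

/* --- What a public header would expose (hypothetical names) --- */
typedef struct Counter_ *CounterHandle;   /* layout is hidden from callers */

CounterHandle counter_create(void);
int           counter_increment(CounterHandle h);
void          counter_release(CounterHandle h);

/* --- What stays private to the library's implementation file --- */
struct Counter_ {
    int value;
};

/* Allocates a zeroed object; returns NULL if allocation fails. */
CounterHandle counter_create(void)
{
    return calloc(1, sizeof(struct Counter_));
}

/* Increments the hidden counter and returns its new value. */
int counter_increment(CounterHandle h)
{
    assert(h != NULL);
    return ++h->value;
}

/* Frees the object; passing NULL is harmless. */
void counter_release(CounterHandle h)
{
    free(h);
}
```

Because callers never touch the struct’s fields, a wrapper in another language only needs to store the handle and forward calls, which is the property the design factor above relies on.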

Implement libtidy

Once you’ve built libtidy following the README instructions for cmake in HTML Tidy’s repository, you can get started using libtidy. cmake will have built both the console application and the library for you.

Perhaps the easiest way to understand how to use libtidy is to see a simple program that uses it. Such a program follows in the next section, and you can also study console/tidy.c.

Before we look at the code, it’s important to understand that API functions that return an integer almost universally adhere to the following convention:

0 == Success

Good to go.

1 == Warnings, but no errors

Check the error buffer or track error messages for details.

2 == Errors (and maybe warnings)

When errors are present, Tidy will not produce output by default. You can force output with the TidyForceOutput option. As with warnings, check the error buffer or track error messages for details.

<0 == Severe error

Usually the value equals -errno. See errno.h.

Also, by default, warning and error messages are sent to stderr. You can redirect diagnostic output using either tidySetErrorFile() or tidySetErrorBuffer(). See tidy.h for details.
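
The return-code convention above lends itself to a small helper. The following is a minimal sketch and not part of libtidy: `StatusKind`, `classify_status()`, and `describe_status()` are hypothetical names. Decoding negative values with strerror() assumes the value really is -errno, which the convention only says is usually the case.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical classification of an integer status that follows the
   0 / 1 / 2 / negative convention described above. */
typedef enum {
    STATUS_OK,        /* 0: success */
    STATUS_WARNINGS,  /* 1: warnings, but no errors */
    STATUS_ERRORS,    /* 2: errors (and maybe warnings) */
    STATUS_SEVERE     /* <0: severe error, usually -errno */
} StatusKind;

static StatusKind classify_status(int rc)
{
    if (rc < 0)  return STATUS_SEVERE;
    if (rc == 0) return STATUS_OK;
    if (rc == 1) return STATUS_WARNINGS;
    return STATUS_ERRORS;  /* 2 by the convention; larger values treated alike */
}

/* Writes a one-line description of rc into buf and returns buf.
   For severe errors, -rc is decoded as an errno value (an assumption). */
static const char *describe_status(int rc, char *buf, size_t size)
{
    assert(buf != NULL && size > 0);
    switch (classify_status(rc)) {
    case STATUS_OK:
        snprintf(buf, size, "success");
        break;
    case STATUS_WARNINGS:
        snprintf(buf, size, "warnings, but no errors");
        break;
    case STATUS_ERRORS:
        snprintf(buf, size, "errors (and maybe warnings)");
        break;
    case STATUS_SEVERE:
        snprintf(buf, size, "severe error %d: %s", rc, strerror(-rc));
        break;
    }
    return buf;
}
```

A caller could pass the result of any integer-returning API function to `describe_status()` when logging, while still branching on the raw value as the sample program does.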

Sample Program

#include <tidy.h>
#include <tidybuffio.h>
#include <stdio.h>
#include <errno.h>

int main(int argc, char **argv )
{
  const char* input = "<title>Foo</title><p>Foo!";
  TidyBuffer output = {0};
  TidyBuffer errbuf = {0};
  int rc = -1;
  Bool ok;

  TidyDoc tdoc = tidyCreate();                     // Initialize "document"
  printf( "Tidying:\t%s\n", input );

  ok = tidyOptSetBool( tdoc, TidyXhtmlOut, yes );  // Convert to XHTML
  if ( ok )
    rc = tidySetErrorBuffer( tdoc, &errbuf );      // Capture diagnostics
  if ( rc >= 0 )
    rc = tidyParseString( tdoc, input );           // Parse the input
  if ( rc >= 0 )
    rc = tidyCleanAndRepair( tdoc );               // Tidy it up!
  if ( rc >= 0 )
    rc = tidyRunDiagnostics( tdoc );               // Kvetch
  if ( rc > 1 )                                    // If error, force output.
    rc = ( tidyOptSetBool(tdoc, TidyForceOutput, yes) ? rc : -1 );
  if ( rc >= 0 )
    rc = tidySaveBuffer( tdoc, &output );          // Pretty Print

  if ( rc >= 0 )
  {
    if ( rc > 0 )
      printf( "\nDiagnostics:\n\n%s", errbuf.bp );
    printf( "\nAnd here is the result:\n\n%s", output.bp );
  }
  else
    printf( "A severe error (%d) occurred.\n", rc );

  tidyBufFree( &output );
  tidyBufFree( &errbuf );
  tidyRelease( tdoc );
  return rc;
}

Application Notes

Of course, there are functions to parse and save both markup and configuration files. For the adventurous, it is possible to create new input sources and output sinks. For example, a URL source could pull the markup from a given URL.
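
To give a feel for what a callback-based source and sink look like, here is a minimal, self-contained sketch. It does not use libtidy’s declarations: `ByteSource`, `ByteSink`, `pump()`, and the state structs are hypothetical names, and the real types and registration functions should be taken from tidy.h. The idea is only that a source hands out bytes one at a time and a sink accepts them, so the code in between never needs to know where the bytes come from or go to.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical source: returns the next byte, or -1 at end of input. */
typedef struct {
    void *data;
    int  (*get_byte)(void *data);
} ByteSource;

/* Hypothetical sink: accepts one byte at a time. */
typedef struct {
    void *data;
    void (*put_byte)(void *data, unsigned char b);
} ByteSink;

/* A source backed by a NUL-terminated string. */
typedef struct { const char *text; size_t pos; } StringSourceState;

static int string_get_byte(void *data)
{
    StringSourceState *s = data;
    unsigned char c = (unsigned char)s->text[s->pos];
    if (c == '\0')
        return -1;              /* end of input */
    s->pos++;
    return c;
}

/* A sink backed by a small fixed-size buffer. */
typedef struct { char buf[64]; size_t len; } MemorySinkState;

static void memory_put_byte(void *data, unsigned char b)
{
    MemorySinkState *m = data;
    if (m->len + 1 < sizeof m->buf) {   /* keep room for the terminator */
        m->buf[m->len++] = (char)b;
        m->buf[m->len] = '\0';
    }                                   /* bytes past capacity are dropped */
}

/* Copies every byte from src to dst and returns the number of bytes read. */
static size_t pump(ByteSource *src, ByteSink *dst)
{
    size_t n = 0;
    int c;
    assert(src != NULL && dst != NULL);
    while ((c = src->get_byte(src->data)) != -1) {
        dst->put_byte(dst->data, (unsigned char)c);
        n++;
    }
    return n;
}
```

A URL source would follow the same shape: its state would hold the connection or a downloaded buffer, and its get-byte callback would return bytes from it.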

It is also worth remembering that an application may instantiate any number of document and buffer objects. They are fairly cheap to initialize and destroy (just memory allocation and zeroing, really), so they may be created and destroyed locally, as needed. It is equally fine to keep them around for a while to hold state. For example, a server app might keep a global document as a master configuration. As documents are parsed, they can copy their configuration data from the master instance. See tidyOptCopyConfig(). If the master copy is initialized at startup and only read afterwards, no synchronization is necessary.

API Docs

Autogenerated API documentation is available. Because it is generated by doxygen from the comments in the code, it is only as good as those comments. Where the documentation falls short, consult the source code directly.