libtidy
Introduction
libtidy is the library version of HTML Tidy. In fact, Tidy is libtidy; the console application is a very simple C application that links against libtidy. It’s what powers Tidy, mod-tidy, and countless other applications that perform tidying.
Please note that this content is adapted from the original on SourceForge.
Design factors
- libtidy is easy to integrate. Because of the near-universal adoption of C linkage, a C interface may be called from a great number of programming languages.
- libtidy was designed to use opaque types in the public interface. This allows the application to just pass an integer around, and it minimizes the need to transform data types in different languages. As a result, it’s straightforward to write very thin library wrappers for C++, Pascal, and COM/ATL.
- libtidy eats its own dogfood. HTML Tidy links directly to libtidy.
- libtidy is thread-safe and re-entrant. Because there are many uses for HTML Tidy (content validation, content scraping, conversion to XHTML), it was important to make libtidy run reasonably well within server applications as well as on the client side.
- libtidy uses adaptable I/O. As part of the larger integration strategy it was decided to fully abstract all I/O. This means a (relatively) clean separation between character encoding processing and shovelling bytes back and forth. Internally, the library reads from sources and writes to sinks. This abstraction is used for both markup and configuration “files”. Concrete implementations are provided for file and memory I/O, and new sources and sinks may be provided via the public interface.
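The opaque-type point above may be easier to picture with a small, generic C sketch. This is not libtidy's actual implementation: `ExampleDoc` and the `example_doc_*` functions are hypothetical names invented for illustration. The idea is only that callers hold a handle whose layout they never see, so a wrapper in another language needs to pass around nothing more than a pointer-sized value.

```c
#include <stdlib.h>

/* Public side (what a header would expose): an incomplete type.
   Callers can hold an ExampleDoc* but cannot look inside it.
   ExampleDoc is a hypothetical name, not a libtidy type. */
typedef struct ExampleDoc ExampleDoc;

/* Private side (what the library's .c file would contain). */
struct ExampleDoc {
    int option_count;   /* stand-in for real document state */
};

/* Allocate a zero-initialized document; returns NULL on failure. */
ExampleDoc *example_doc_create(void)
{
    return calloc(1, sizeof(ExampleDoc));
}

/* Record that an option was set; returns 0 on success, -1 on a bad handle. */
int example_doc_set_option(ExampleDoc *doc)
{
    if (doc == NULL)
        return -1;
    doc->option_count++;
    return 0;
}

/* Query state through an accessor instead of touching fields directly. */
int example_doc_option_count(const ExampleDoc *doc)
{
    return doc ? doc->option_count : -1;
}

/* Destroy the document; passing NULL is harmless. */
void example_doc_release(ExampleDoc *doc)
{
    free(doc);
}
```

Because every operation goes through a function that takes the handle, the private struct can change between releases without breaking callers or language bindings.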
Implement libtidy
Once you’ve built libtidy following the README instructions for cmake in HTML Tidy’s repository, you can get started using libtidy. cmake will have built both the console application and the library for you.
Perhaps the easiest way to understand how to use libtidy is to see a simple program that implements it. Such a program follows in the next section, and don’t forget that you can also study console/tidy.c.
Before we look at the code, it’s important to understand that API functions that return an integer almost universally adhere to the following convention:
- 0 == Success
  Good to go.
- 1 == Warnings, but no errors
  Check the error buffer or track error messages for details.
- 2 == Errors (and maybe warnings)
  By default, Tidy will not produce output. You can force output with the TidyForceOutput option. As with warnings, check the error buffer or track error messages for details.
- <0 == Severe error
  Usually the value equals -errno. See errno.h.
Also, by default, warning and error messages are sent to stderr. You can redirect diagnostic output using either tidySetErrorFile() or tidySetErrorBuffer(). See tidy.h for details.
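The integer return-code convention described earlier in this section lends itself to a small helper that turns a raw status into something easier to branch on. The sketch below is an assumption-laden illustration, not part of libtidy: `ExampleStatus`, `example_classify()`, and `example_describe()` are hypothetical names, and values greater than 2 are treated as errors only because the convention does not say otherwise.

```c
/* Hypothetical status categories mirroring the documented convention. */
typedef enum {
    EXAMPLE_STATUS_OK,        /* rc == 0 */
    EXAMPLE_STATUS_WARNINGS,  /* rc == 1 */
    EXAMPLE_STATUS_ERRORS,    /* rc == 2 (assumed: anything above 1) */
    EXAMPLE_STATUS_SEVERE     /* rc <  0, usually -errno */
} ExampleStatus;

/* Map an integer return code onto one of the four categories. */
ExampleStatus example_classify(int rc)
{
    if (rc < 0)
        return EXAMPLE_STATUS_SEVERE;
    if (rc == 0)
        return EXAMPLE_STATUS_OK;
    if (rc == 1)
        return EXAMPLE_STATUS_WARNINGS;
    return EXAMPLE_STATUS_ERRORS;
}

/* Short human-readable label for logging. */
const char *example_describe(int rc)
{
    switch (example_classify(rc)) {
    case EXAMPLE_STATUS_OK:       return "success";
    case EXAMPLE_STATUS_WARNINGS: return "warnings, but no errors";
    case EXAMPLE_STATUS_ERRORS:   return "errors (and maybe warnings)";
    default:                      return "severe error";
    }
}
```

A caller could log `example_describe(rc)` after each API call and stop processing only when `example_classify(rc)` yields `EXAMPLE_STATUS_SEVERE`.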
Sample Program
#include <tidy.h>
#include <tidybuffio.h>
#include <stdio.h>
#include <errno.h>

int main(int argc, char **argv )
{
  const char* input = "<title>Foo</title><p>Foo!";
  TidyBuffer output = {0};
  TidyBuffer errbuf = {0};
  int rc = -1;
  Bool ok;

  TidyDoc tdoc = tidyCreate();                     // Initialize "document"
  printf( "Tidying:\t%s\n", input );

  ok = tidyOptSetBool( tdoc, TidyXhtmlOut, yes );  // Convert to XHTML
  if ( ok )
    rc = tidySetErrorBuffer( tdoc, &errbuf );      // Capture diagnostics
  if ( rc >= 0 )
    rc = tidyParseString( tdoc, input );           // Parse the input
  if ( rc >= 0 )
    rc = tidyCleanAndRepair( tdoc );               // Tidy it up!
  if ( rc >= 0 )
    rc = tidyRunDiagnostics( tdoc );               // Kvetch
  if ( rc > 1 )                                    // If error, force output.
    rc = ( tidyOptSetBool(tdoc, TidyForceOutput, yes) ? rc : -1 );
  if ( rc >= 0 )
    rc = tidySaveBuffer( tdoc, &output );          // Pretty Print

  if ( rc >= 0 )
  {
    if ( rc > 0 )
      printf( "\nDiagnostics:\n\n%s", errbuf.bp );
    printf( "\nAnd here is the result:\n\n%s", output.bp );
  }
  else
    printf( "A severe error (%d) occurred.\n", rc );

  tidyBufFree( &output );
  tidyBufFree( &errbuf );
  tidyRelease( tdoc );
  return rc;
}
Application Notes
Of course, there are functions to parse and save both markup and configuration files. For the adventurous, it is possible to create new input sources and output sinks. For example, a URL source could pull the markup from a given URL.
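To make the idea of a pluggable input source more concrete, here is a generic sketch of a callback-based byte source backed by a memory block. It does not use libtidy's own types: `ExampleSource`, `MemState`, and the `example_*` functions are hypothetical names chosen for illustration. libtidy's real hooks are declared in tidy.h (to my understanding as `TidyInputSource`, set up with `tidyInitSource()` and consumed by `tidyParseSource()`), so check that header for the exact signatures before adapting this.

```c
#include <stddef.h>

/* Hypothetical callback-based byte source (not a libtidy type). */
typedef struct {
    void *data;                                       /* implementation state */
    int  (*get_byte)(void *data);                     /* next byte, or -1 at end */
    void (*unget_byte)(void *data, unsigned char b);  /* push one byte back */
    int  (*at_eof)(void *data);                       /* non-zero when exhausted */
} ExampleSource;

/* State for a source that reads from a block of memory. */
typedef struct {
    const unsigned char *bytes;
    size_t size;
    size_t pos;
} MemState;

static int mem_get_byte(void *data)
{
    MemState *m = data;
    return m->pos < m->size ? m->bytes[m->pos++] : -1;
}

static void mem_unget_byte(void *data, unsigned char b)
{
    MemState *m = data;
    (void)b;              /* the buffer is unchanged, so just step back */
    if (m->pos > 0)
        m->pos--;
}

static int mem_at_eof(void *data)
{
    const MemState *m = data;
    return m->pos >= m->size;
}

/* Bind the memory callbacks to caller-owned state. */
ExampleSource example_mem_source(MemState *state, const void *bytes, size_t size)
{
    ExampleSource src = { state, mem_get_byte, mem_unget_byte, mem_at_eof };
    state->bytes = bytes;
    state->size = size;
    state->pos = 0;
    return src;
}

/* A consumer that only talks to the interface: counts the remaining bytes. */
size_t example_count_bytes(ExampleSource *src)
{
    size_t n = 0;
    while (!src->at_eof(src->data)) {
        if (src->get_byte(src->data) < 0)
            break;
        n++;
    }
    return n;
}
```

A URL source would follow the same shape: its state would hold a connection or a downloaded buffer, and the consumer would not need to change.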
It is also worth remembering that an application may instantiate
any number of document and buffer objects. They are fairly
cheap to initialize and destroy (just memory allocation and zeroing,
really), so they may be created and destroyed locally, as needed.
There is no problem keeping them around a while for keeping state.
For example, a server app might keep a global document as a master
configuration. As documents are parsed, they can copy their
configuration data from the master instance. See tidyOptCopyConfig().
If the master copy is initialized at startup, no synchronization is
necessary.