rdfscrape.pl

rdfscrape.pl is a command-line Perl script that creates RDF/XML metadata files for autodiscovery and indexing. It is not an XSLT processor. It's a big, ugly Perl script that uses every stupid trick it can to produce semi-useful RDF metadata for a file. Most of the RDF/XML is formatted according to the Dublin Core Metadata Initiative's specifications.

For example, rdfscrape.pl produced the following record of this web page:


<?xml version="1.0" encoding="UTF-8"?>
<rdf:RDF
    xmlns:admin="http://webns.net/mvcb/"
    xmlns:dc="http://purl.org/dc/elements/1.1/"
    xmlns:dcterms="http://purl.org/dc/terms/"
    xmlns:foaf="http://xmlns.com/foaf/0.1/"
    xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">

<rdf:Description rdf:about="">
    <admin:generatorAgent rdf:resource="http://purl.org/rdfscrape/pl/0.92"/>
    <dc:date><dcterms:W3CDTF rdf:value="2004-03-09T06:42:31Z"/></dc:date>
</rdf:Description>

<foaf:Document rdf:about="http://www.bauser.com/websnob/rdf/rdfscrape.html" xml:lang="en-US">
    <dcterms:conformsTo rdf:resource="http://www.w3.org/TR/REC-html40/loose.dtd"/>
    <dcterms:created><dcterms:W3CDTF rdf:value="2004-02-01"/></dcterms:created>
    <dc:creator rdf:resource="mailto:michael@bauser.com"/>
    <dc:format><dcterms:IMT rdf:value="text/html"/></dc:format>
    <dc:language><dcterms:RFC1766 rdf:value="en-US"/></dc:language>
    <dcterms:modified><dcterms:W3CDTF rdf:value="2004-03-08T20:39:52Z"/></dcterms:modified>
    <foaf:name>RDF Snob: rdfscrape.pl</foaf:name>
    <dc:subject>RDF, XML, Perl</dc:subject>
    <dc:title>RDF Snob: rdfscrape.pl</dc:title>
    <dc:type><dcterms:DCMIType rdf:value="Text"/></dc:type>
</foaf:Document>

<foaf:Person>
    <foaf:mbox rdf:resource="mailto:michael@bauser.com"/>
    <foaf:made rdf:resource="http://www.bauser.com/websnob/rdf/rdfscrape.html"/>
</foaf:Person>

</rdf:RDF>

That's a pretty good description, considering this page contains only one Dublin Core meta element, DCTERMS.created. The rest of the information was extracted from other HTML elements and/or basic filesystem information. Most HTML (and XHTML) files will contain some metadata that rdfscrape.pl can use.
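As a rough illustration of this kind of scraping, here is a minimal Python sketch. It is not the Perl code rdfscrape.pl uses, and the class name MetaScraper, the function scrape, and the sample page are all hypothetical. It collects the page title, the lang attribute of the html element, and any meta elements whose names start with DC. or DCTERMS.

```python
from html.parser import HTMLParser


class MetaScraper(HTMLParser):
    """Hypothetical helper: gathers a few Dublin Core style properties."""

    def __init__(self):
        super().__init__()
        self.found = {}
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)  # HTMLParser lowercases tag and attribute names
        if tag == "html" and attrs.get("lang"):
            self.found["dc:language"] = attrs["lang"]
        elif tag == "title":
            self._in_title = True
        elif tag == "meta":
            name = attrs.get("name") or ""
            prefix, _, term = name.partition(".")
            # Only keep explicit Dublin Core meta elements such as DCTERMS.created
            if prefix.upper() in ("DC", "DCTERMS") and term and attrs.get("content"):
                self.found[f"{prefix.lower()}:{term}"] = attrs["content"]

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        # Text inside <title> becomes the dc:title value
        if self._in_title:
            self.found["dc:title"] = self.found.get("dc:title", "") + data


def scrape(html):
    parser = MetaScraper()
    parser.feed(html)
    parser.close()
    return parser.found


# Made-up sample page, loosely modeled on the record shown above
sample = """<html lang="en-US"><head>
<title>RDF Snob: rdfscrape.pl</title>
<meta name="DCTERMS.created" content="2004-02-01">
</head><body></body></html>"""
result = scrape(sample)
```

Here only one property comes from an explicit Dublin Core meta element. The title and language come from ordinary HTML markup, which is the general idea described above.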

Besides a web page to scrape for metadata, all rdfscrape.pl requires is perl5 and a few Perl modules that are included in standard perl5 installations. Windows users can even use ActivePerl.

A ZIP file of the current version of rdfscrape can be downloaded from this link.

Version History

Version 0.98 (28 June 2004)

Enhancements: Started validating DCMIType encodings against the specification, and including dc:type properties (in addition to rdf:type properties) in the output. Improved descriptions of HDML/WML.
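A DCMIType check of this sort could look like the following Python sketch. It is an illustration only, not the Perl code in rdfscrape.pl. The term list is an assumption based on the twelve terms of the DCMI Type Vocabulary, and normalize_dcmitype is a hypothetical helper name.

```python
# Terms of the DCMI Type Vocabulary (assumed list; verify against the DCMI spec).
DCMI_TYPES = (
    "Collection", "Dataset", "Event", "Image", "InteractiveResource",
    "MovingImage", "PhysicalObject", "Service", "Software", "Sound",
    "StillImage", "Text",
)

# Lookup table from lowercase spelling to canonical spelling
_CANONICAL = {term.lower(): term for term in DCMI_TYPES}


def normalize_dcmitype(value):
    """Return the canonical DCMIType term for value, or None if it is not one."""
    return _CANONICAL.get(value.strip().lower())
```

A value such as "text" would be normalized to "Text", while an unrecognized value such as "Webpage" would be rejected rather than written out with a DCMIType encoding.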

Internals: Moved all of the print statements to the end of the script (which will make it easier to create versions of the script using alternate serialization formats, if I ever need to do that). Renamed the variables again and finally declared them properly enough for 'use strict'. Got rid of the @misc array entirely, which greatly simplifies HTML parsing.

Download it.

Version 0.97 (27 June 2004)

New Features: Added support for some Creative Commons vocabulary terms. Improved descriptions of HDML/WML.

Bug fixes: Replaced the foaf:Person subroutine with a foaf:Agent subroutine, because it's more accurate (I can't guarantee that the object address of an author belongs to a single person) and more flexible (I can create foaf:Agent nodes for the objects of all DCMI agent terms). Fixed a case-smashing problem in the processing of link elements.

Internals: Merged sys_info() into MAIN routine. Merged print_separate_rdf() and print_single_rdf(). Added separate subroutines for processing some of the trickier DCMI elements. Changed a lot of variable names to be less stupid, and properly declared most of them.

Download it.

Version 0.96 (16 June 2004)

Enhancements: Added support for one of my least favorite meta values, http-equiv="Refresh". Added support for newly-approved DCMI elements DCTERMS.license and DCTERMS.rightsHolder. Added support for link elements with multiple relationships. Improved support of pages using old DCMI recommendations for HTML.
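For link elements with multiple relationships, the rel attribute holds a space-separated list of link types. A minimal Python sketch of splitting it, offered as an illustration and not as the script's Perl code (link_relations is a hypothetical name):

```python
def link_relations(rel, href):
    """Return one (relationship, href) pair per link type listed in rel.

    Case is preserved, so a value such as schema.DC is not altered.
    """
    return [(token, href) for token in rel.split()]


# Made-up example: one link element that carries two relationships
pairs = link_relations("author copyright", "about.html")
```

Each pair can then be handled as if it had come from its own link element.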

Bug fixes: Fixed a typographical error in the dcmi_serialize() subroutine that caused rdfscrape.pl to ignore DC.isReferencedBy. Fixed conversion of Description.

New features: Added routine for parsing HDML and a very basic routine for SGML files.

Removed feature: Got rid of the horrible routine for describing Perl resources.

Download it.

Version 0.95 (6 June 2004)

Almost ready for prime time!

Enhancements: Merged the four main serializing subroutines into two (one DCMI and one non-DCMI), which reduces memory load and makes it easier to add subroutines for new media types.

New features: Added a subroutine for scraping WML (Wireless Markup Language) files -- DCMI metadata can be added to WML using meta elements, just like HTML. Added basic support for amk.ca's Review schema. Added support for relative URIs in link elements.
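Resolving a relative URI from a link element means combining it with the address of the page it appears in. The Python sketch below shows the idea using the standard library's urljoin. It is not the Perl code in rdfscrape.pl, and the helper name and the relative paths in the example are made up.

```python
from urllib.parse import urljoin


def resolve_link(page_uri, href):
    """Resolve a link element's href against the URI of the page containing it."""
    return urljoin(page_uri, href)


page = "http://www.bauser.com/websnob/rdf/rdfscrape.html"
# Hypothetical hrefs: a relative path and an already-absolute mailto URI
up_one = resolve_link(page, "../index.html")
mail = resolve_link(page, "mailto:michael@bauser.com")
```

The relative path resolves to http://www.bauser.com/websnob/index.html, while the mailto URI is already absolute and comes back unchanged.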

Bug fixes: Corrected checking of RFC1766 language codes in HTML attributes. Corrected serialization of meta elements with DCTERMS.URI scheme.
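A syntax check for RFC1766 language codes can be sketched as follows, in Python. This is an illustration, not the check rdfscrape.pl performs. It assumes the RFC 1766 grammar, in which a tag is a primary tag of one to eight letters followed by any number of hyphen-separated subtags of one to eight letters. It tests the shape of the tag only, not whether the code is registered.

```python
import re

# RFC 1766: Language-Tag = Primary-tag *( "-" Subtag ), each part 1 to 8 letters
_RFC1766 = re.compile(r"[A-Za-z]{1,8}(?:-[A-Za-z]{1,8})*")


def is_rfc1766_tag(value):
    """Return True if value has the syntactic shape of an RFC 1766 language tag."""
    return _RFC1766.fullmatch(value) is not None
```

With this check, "en-US" passes while "en_US" is rejected, because the underscore is not a valid separator.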

Feature change: --base command line option renamed to --root, to reduce confusion with the xml:base attribute. (Maybe I should add a switch for xml:base -- does anybody use that?)

Feature removed: Took out version 0.94's content-negotiation hack, because I think it was interfering with the assignment of urirefs.

Download it.

Version 0.94 (18 May 2004)

New feature: Cheap hack to recognize HTML/XHTML content-negotiation using Apache's MultiViews option -- if rdfscrape.pl processes an HTML file and an XHTML file with the same BASE HREF content, it adds an rdf:Alt container to the output RDF/XML. (This is the best idea I have so far for describing content-negotiation; anybody got a better one?)

Download it.

Version 0.93 (11 March 2004)

Bug fix: Using --vocab f with multiple output files was creating badly-formed XML. Fixed that.

New feature: Added a half-assed routine to describe Perl files by reading their embedded POD data. Use at your own risk.

Internals: Straightened out the resource/handler routine.

Download it.

Version 0.92 (7 March 2004)

Improved FOAF support: Creates foaf:Person records based on link rel="author". Adds foaf:logo elements based on link rel="icon".

Enhancements: More selective about utilizing certain meta elements and improved anti-redundancy checks for several others.

Minor additions: Recognizes link rel="copyright", link rel="contents", and meta name="netinsert".

Download it.

Version 0.91 (15 February 2004)

Two new features: Parses Dublin Core link elements, in addition to meta. Uses the URI from the page's DTD declaration as the value of a dcterms:conformsTo element.
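The DTD-based dcterms:conformsTo value can be illustrated with a short Python sketch. This is not the script's Perl code. The regular expression, the function names, and the assumption that the DOCTYPE carries a double-quoted PUBLIC identifier followed by a double-quoted system identifier are all mine.

```python
import re
from xml.sax.saxutils import quoteattr

# Captures the system identifier (the DTD URI) that follows a PUBLIC identifier
_DOCTYPE = re.compile(
    r'<!DOCTYPE\s+\w+\s+PUBLIC\s+"[^"]*"\s+"([^"]+)"', re.IGNORECASE
)


def dtd_uri(html):
    """Return the DTD URI from the DOCTYPE declaration, or None if there is none."""
    match = _DOCTYPE.search(html)
    return match.group(1) if match else None


def conforms_to_element(html):
    """Build a dcterms:conformsTo element from the DTD URI, if one is present."""
    uri = dtd_uri(html)
    if uri is None:
        return None
    # quoteattr escapes the value and wraps it in quotes
    return f"<dcterms:conformsTo rdf:resource={quoteattr(uri)}/>"


doctype = ('<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" '
           '"http://www.w3.org/TR/REC-html40/loose.dtd">')
element = conforms_to_element(doctype)
```

A DOCTYPE that omits the system identifier yields no element at all, since there is no URI to point at.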

Enhancements: Better handling of language attributes and geographic meta elements.

Bug fixes: Better checking that declared properties are valid Dublin Core properties. Better error-checking of input data (especially language codes). Removed some unnecessary stringification and redundant function calls. Removed some questionable handling of an obscure link value.

Download it.

Version 0.9 (1 February 2004)

The public test version of rdfscrape.pl implements most of the Dublin Core Metadata scheme, and includes provisional support for the FOAF, MusicBrainz, and WOT XML vocabularies.

Download it.
