24th May 2005

Semantic Bank Robbery

I’ve stolen the following chunk from the Semantic Bank.  Partly I just wanted to see how the "urns" would turn out here in TypePad land… if it’s all Greek to TypePad, then I guess that will make them Grecian urns.  Okay… here goes the paste part:

quote…

Came across Clay Shirky’s talk at the O’Reilly Emerging Technology conference entitled "Ontology is Overrated: Links, Tags and Post-hoc Metadata". It’s worth listening to.

Just like me, Shirky is a Lakoff-ian (excuse the neologism): categories are embodied, an expression of humanity, not abstract metaphysical entities (Plato’s ideas) that we aim to obtain.

I wrote about this already.

Still, Shirky misses one important point: ontologies are not overrated, they are just contracts, a (more or less explicit) agreement between different parties. Language is a contract as well. So are categories. So is metadata. So are APIs, protocols, plug shapes and their voltage, meters… you name it! Many make the mistake of associating an ‘ontology’ with Plato’s metaphysical ideas; I think Shirky is one of them.

The semantic web is a bad name for an attempt to make data interoperability scale at a web level. ‘Ontology’ is a bad name for describing relationships between symbols. That’s all there is, really.

Now, you use tags to categorize things for yourself, but instead of using a ‘controlled vocabulary’, taxonomy or ontology (depending on what field you come from, you will like to call them differently… which also is a metaproof of the point, but let’s move on), you invent your own.

People have been doing this forever. I mention Borges’s essays about this in another post.

Now, the real breakthru of folksonomical-based systems like del.icio.us or flickr is not the lack of structure or committee-based design in the ontological space, but the idea that if two people use the same term, it’s more probable that they meant the same thing than that they meant different things.

That’s the secret sauce: it’s unlikely that a farmer would use del.icio.us to bookmark a page on how to grow apples, so "apple" in that sociological context means Apple Computers, not fruit. What happens if it doesn’t? Who cares!

This is the point where librarians exit the room screaming and I’m left there staring at the wall, thinking about how to enable ontologies to emerge out of the power law foam, but without librarians puking on it and without people telling me to stop thinking like a librarian!

The problem is rather simple, really: words are not unique identifiers for concepts. Everybody knows this very well: synonyms exist in every language. So, all you need to start is to create unique identifiers for your tags, but if you don’t do it well enough, it doesn’t scale globally.

Let’s start with a tag that I use a lot in my bookmarks: semweb.
In my use, this string (contraction of "semantic web") refers to things
that are related to the "semantic web". So, in order to promote the
exchange of this tag, I create an identifier (URI) for it:

urn:tag:3f7d0330e767ddab5b2826371e2d21ff/c2Vtd2ViCg==

Now, let me decode the above:

urn:tag:[MD5 hash of my email address]/[base64 encoding of the tag]

There: this is a unique identifier that is connected to my email address,
and therefore reasonably unique, because domain names are kept unique by
registration authorities and mail protocols don’t allow two distinct
accounts to share the same name. Also, the email address is hashed, to
avoid abuse (by spammers, for example): only its uniqueness is
required, the rest of the information can be discarded. The base64
encoding of the tag is there to ‘obfuscate’ its originating string, yet
base64 (unlike hashing) is a lossless algorithm, which means that even
if we lose information about the label connected to that identifier we
can reconstruct it later. This is redundant, but enables better digital
preservation in the long term.
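The scheme above is small enough to sketch in a few lines of Python. This is only an illustration of the `urn:tag:[MD5 hash]/[base64]` layout described in the text, not the code behind the original post; the function names and the `me@example.org` address used below are made up for the example.

```python
import base64
import hashlib


def make_tag_urn(email: str, tag: str) -> str:
    """Build urn:tag:[MD5 hash of email]/[base64 encoding of tag]."""
    # Only the uniqueness of the address matters, so it is hashed.
    owner = hashlib.md5(email.encode("utf-8")).hexdigest()
    # base64 is reversible, so the label can always be recovered later.
    encoded = base64.b64encode(tag.encode("utf-8")).decode("ascii")
    return f"urn:tag:{owner}/{encoded}"


def decode_tag_urn(urn: str) -> tuple[str, str]:
    """Split a tag URN into (owner hash, original label)."""
    prefix = "urn:tag:"
    if not urn.startswith(prefix):
        raise ValueError(f"not a tag URN: {urn!r}")
    # The hex hash never contains "/", so the first "/" is the separator
    # even though base64 output may itself contain "/".
    owner, _, encoded = urn[len(prefix):].partition("/")
    label = base64.b64decode(encoded).decode("utf-8")
    return owner, label
```

Decoding the URN quoted above gives the label `"semweb\n"`, with a trailing newline, so that identifier was apparently built from the string plus a line break; `make_tag_urn("me@example.org", "semweb")` without the newline ends in `c2Vtd2Vi` instead.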

But why numeric obfuscation of the tag? Well, humans can’t avoid
parsing textual information even in URIs. This is bad because it might
produce unwanted side effects. For example, two different ethnic
groups in conflict might not like to use a URI that was created based
on the textual representation of that concept in the rival language.
This is unlikely to promote the reuse of that URI. Sure, we could have
used an incremental counter for those tags and forgotten about reuse (we
will see why in a moment), but that requires the different systems that
you might use to be kept synchronized. It’s way easier to assume that
you are unlikely to use the same string for different meanings in the
same context.

So, now that we have an identifier, we can start making statements about it:

    <urn:tag:3f7d0330e767ddab5b2826371e2d21ff/c2Vtd2ViCg==>
        a tags:Tag ;
        rdfs:label "semweb"@en .

Great, now we know this "thing" is a tag and has a label "semweb" in the English language.

Now, let’s say that a friend of mine uses "semweb" as well,
but having a different email address, the identifier of his tag will be
different, even if he uses the same string. Well, nobody ever said that
inferencing on RDF statements should always follow description logics:
if we have two statements that share the same literal, then we can say
they are folksonomically "colliding". So, we now have a model that
says, after the inferencing:

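    <urn:tag:3f7d0330e767ddab5b2826371e2d21ff/c2Vtd2ViCg==>
        a tags:Tag ;
        rdfs:label "semweb"@en .

    <urn:tag:[MD5 hash of my friend’s email address]/[base64 encoding of "semweb"]>
        a tags:Tag ;
        rdfs:label "semweb"@en .

    <urn:tag:3f7d0330e767ddab5b2826371e2d21ff/c2Vtd2ViCg==>
        tags:collidesWith
        <urn:tag:[MD5 hash of my friend’s email address]/[base64 encoding of "semweb"]> .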

Note how a syntactic collision does not automatically imply a semantic one!
It’s easy to tell whether two tokens are syntactically the same or not,
but it’s a lot harder to understand whether or not they refer to the
same ‘concept’, or even if such a thing is even remotely possible,
given how deeply subjective semantic meaning is.

But as folksonomical systems do, we can assume that, for linguistic
efficiency reasons and unless otherwise noted, two colliding unique tags
are meant to reference the same semantic notion.
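The collision rule is not a description-logic inference, just a lookup on shared label literals, so it can be sketched without any RDF library. In this rough illustration, triples are plain Python tuples and the subject identifiers are shortened, made-up stand-ins for real tag URNs; `tags:collidesWith` is the property name used in the text.

```python
from collections import defaultdict
from itertools import combinations

LABEL = "rdfs:label"
COLLIDES = "tags:collidesWith"


def infer_collisions(triples):
    """Return tags:collidesWith triples for tags sharing a label literal."""
    by_label = defaultdict(set)
    for subject, predicate, obj in triples:
        if predicate == LABEL:
            by_label[obj].add(subject)

    inferred = set()
    for subjects in by_label.values():
        # Every pair of distinct tags with the same literal collides.
        for a, b in combinations(sorted(subjects), 2):
            inferred.add((a, COLLIDES, b))
    return inferred


if __name__ == "__main__":
    model = [
        ("urn:tag:mine/semweb", LABEL, '"semweb"@en'),
        ("urn:tag:friend/semweb", LABEL, '"semweb"@en'),
        ("urn:tag:mine/rdf", LABEL, '"RDF"@en'),
    ]
    for triple in infer_collisions(model):
        print(triple)
```

The inferred statement only records the syntactic collision; treating the two tags as the same notion is the default assumption layered on top of it.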

At this point, we just cloned a folksonomy with the semantic web,
but we have just increased (a lot!) the complexity. Where is the gain,
I hear you asking?

Well, let’s go back to the ‘apple’ example. Say I use "apple" to
mean "apple computers" and my friend, a Beatles fan, means "apple
records". So, we have

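    <urn:tag:3f7d0330e767ddab5b2826371e2d21ff/[base64 encoding of "apple"]>
        a tags:Tag ;
        rdfs:label "apple"@en .

    <urn:tag:[MD5 hash of my friend’s email address]/[base64 encoding of "apple"]>
        a tags:Tag ;
        rdfs:label "apple"@en .

    <urn:tag:3f7d0330e767ddab5b2826371e2d21ff/[base64 encoding of "apple"]>
        tags:collidesWith
        <urn:tag:[MD5 hash of my friend’s email address]/[base64 encoding of "apple"]> .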

Just like the above. But then my friend, who also merges my tags with his,
notes that my use of the "apple" tag is semantically different than
his, so he "disambiguates", by adding the following statement into his
model (don’t worry, not by hand, a UI will guide him):

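    <urn:tag:[MD5 hash of my friend’s email address]/[base64 encoding of "apple"]>
        owl:differentFrom
        <urn:tag:3f7d0330e767ddab5b2826371e2d21ff/[base64 encoding of "apple"]> .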

With this statement in place, whenever the system re-inferences over the
statements, it can understand that, for my friend!, my "apple" tag and
his "apple" tag mean different things, so his system will not
cluster my data tagged with that tag in the same category as his.

Now librarians can breathe again :-)

But there are other benefits: say my friend also used "semantic_web"
along with "semweb", because he’s a messier tagger or simply because he
never realized it. I can realize that for him and produce the following
statement:

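    <urn:tag:[MD5 hash of my friend’s email address]/[base64 encoding of "semantic_web"]>
        owl:sameAs
        <urn:tag:[MD5 hash of my friend’s email address]/[base64 encoding of "semweb"]> .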

So, ironically, using an ontology, and without reducing functionality, we
solved the two biggest problems that current folksonomies have:

  1. syntactic collisions can be differentiated
  2. syntactic differences can be equated

But there is a lot more!

Now that all tags are uniquely identified and can be discriminated,
we can also make statements about their own relationships! So, for
example, if I have a tag "RDF" and "semweb" I might want to link them as

"RDF" -(technology of)-> "semweb"

and we could use the same URI creation process not only for the tags (the
nodes) but also for the link (the arc between the nodes):

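    <urn:tag:3f7d0330e767ddab5b2826371e2d21ff/[base64 encoding of "technology of"]>
        a tag:Link ;
        rdfs:label "technology of"@en .

    <urn:tag:3f7d0330e767ddab5b2826371e2d21ff/[base64 encoding of "RDF"]>
        <urn:tag:3f7d0330e767ddab5b2826371e2d21ff/[base64 encoding of "technology of"]>
        <urn:tag:3f7d0330e767ddab5b2826371e2d21ff/c2Vtd2ViCg==> .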

Note that the lack of readability of that statement is a feature, not a bug:
these instructions are meant to be processed by machines, not by
humans.
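As a small illustration of why the unreadable form costs nothing, the sketch below builds a link statement out of three tag URNs and then recovers the readable arc from it. The helper names and the `me@example.org` address are invented for the example; only the `urn:tag:[MD5 hash]/[base64]` layout comes from the text.

```python
import base64
import hashlib


def tag_urn(email, label):
    """urn:tag:[MD5 hash of email]/[base64 encoding of label]."""
    owner = hashlib.md5(email.encode("utf-8")).hexdigest()
    encoded = base64.b64encode(label.encode("utf-8")).decode("ascii")
    return f"urn:tag:{owner}/{encoded}"


def link_statement(email, source, link, target):
    """Return a (subject, predicate, object) triple made only of tag URNs."""
    return tuple(tag_urn(email, label) for label in (source, link, target))


def readable(statement):
    """Recover the labels from a URN-only statement, for display to humans."""
    labels = []
    for urn in statement:
        # Everything after the first "/" is the base64-encoded label.
        encoded = urn.split("/", 1)[1]
        labels.append(base64.b64decode(encoded).decode("utf-8"))
    subject, predicate, obj = labels
    return f'"{subject}" -({predicate})-> "{obj}"'


if __name__ == "__main__":
    statement = link_statement("me@example.org", "RDF", "technology of", "semweb")
    for part in statement:
        print(part)
    print(readable(statement))
```

The machine works on the opaque triple, while a UI can always call something like `readable` to show the person what the statement says.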

Now, imagine a system where the disambiguation of two tags yields a
return on your metadata investment that you consider worth it (if you
own an iPod you know what I mean by return on your metadata
investment!). The ability to share this information across systems
(say between flickr and del.icio.us, or even between your own blog,
your email client or your calendar system!) will very likely
revolutionize the way we do things. And being able to pick statements
from the people, organizations and entities that you like
will allow you to disagree, to avoid massification, to avoid feeling
locked into a platonic semantic cage.

There is nothing in semantic web technologies that states that
ontologies cannot be created by individuals for their own benefit and
shared and mapped according to their individual or group tastes. There
is nothing that states that the only way to make data interoperate is
thru uber conceptual models (CIDOC CRM) or thru common denominator sets
(Dublin Core).

And, to prove the point, we have built a system on this :-)

Stay tuned.

Many thanks to David Huynh and Prof. David Karger for the invaluable support and help in the discussion of the folksology concept as laid out in this note.

                                           

…end quote

This entry was posted on Tuesday, May 24th, 2005 at 11:35 and is filed under Tools and Technology, Gadgets and Gizmos.
