treehouse comments

When I was posting the article on Gary Wilson on my Miscellaneous page, I noticed a tag in the New York Times's HTML for the piece. The article mentions someone picking up a burger at McDonald's, and that word is followed by this tag (invisible unless you're looking at the source code):

<org idsrc="NYSE" value="MCD,MCJ,MCW"/>

What's going on here? Is this a way for major advertisers to track how many mentions they get in the Times? If so, that's fucked up, yo! I can't think of any innocent reason for doing this.
- tom moody 4-10-2002 5:32 pm


Tom, your boundless skepticism gives me hope.

Anyway, this is harmless. This tag is one of many in the flavor of XML called NewsML (News Markup Language).

XML is the generic technology. The idea is to make a markup language where, unlike HTML (HyperText Markup Language), which mostly provides layout information, the tags provide semantic information. The effort is aimed at making information more understandable by machines.

Under the general XML umbrella, each industry (or interest group, or whatever) can agree upon their own standards. Maybe there would be AutoML (I don't know if it exists) where the tags would be things like:

tiptronic<part type="transmission" price="$1000" id="#356f94fa94"/>

Another computer (or your webservices enabled browser of the future) could scan the document, and where a human encounters the word "tiptronic," the machine would instead parse the XML part tag and extract the relevant information. The idea is just that computers always need information presented in a rigorously defined way, and XML allows groups of people who want to work together (or who want their computers to work together) to build their own rigorously defined languages for categorizing words.
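To make that concrete, here is a minimal sketch of a program doing the "parse the part tag" step, using Python's standard XML library. AutoML is still made up, and the `<doc>` wrapper, the sample sentence, and the `extract_parts` name are assumptions added only so the fragment parses on its own.

```python
import xml.etree.ElementTree as ET

# Hypothetical AutoML fragment. The <doc> wrapper and the sentence around
# the tag are invented so the snippet is a complete XML document.
SNIPPET = (
    '<doc>It comes with a tiptronic'
    '<part type="transmission" price="$1000" id="#356f94fa94"/>'
    ' gearbox.</doc>'
)

def extract_parts(xml_text):
    """Return one dict of attributes for every <part> tag in the text."""
    root = ET.fromstring(xml_text)
    return [dict(part.attrib) for part in root.iter("part")]

parts = extract_parts(SNIPPET)
for part in parts:
    print(part["type"], part["price"], part["id"])
```

The program never has to guess what "tiptronic" means. It only reads the attributes the tag spells out.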

What you see in the Times article is just a corporate stock price NewsML tag. Notice it is connected to the word McDonald's. To a human it's obvious they mean the burger company and not the guy who had a farm. But computers are not good at this. They can't pluck meaning very well from context. They need these things spelled out for them. So the tag is just saying that this is an organization (org) that is listed on the New York Stock Exchange (NYSE) under these ids: MCD,MCJ,MCW (I don't really understand the multiple entries here, but anyway....)
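A rough sketch of how a program might read that tag, again with Python's standard XML library. The surrounding sentence and the `listed_organizations` helper are assumptions for illustration. Only the org tag itself comes from the Times page.

```python
import xml.etree.ElementTree as ET

# The <org .../> tag is the one from the article; the sentence and the
# <p> wrapper around it are invented for this example.
SNIPPET = '''<p>He picked up a burger at McDonald's<org idsrc="NYSE" value="MCD,MCJ,MCW"/> on the way.</p>'''

def listed_organizations(xml_text):
    """Return (exchange, [symbols]) for every <org> tag in the text."""
    root = ET.fromstring(xml_text)
    found = []
    for org in root.iter("org"):
        # The value attribute is a comma-separated list of ids.
        symbols = [s.strip() for s in org.get("value", "").split(",") if s.strip()]
        found.append((org.get("idsrc"), symbols))
    return found

orgs = listed_organizations(SNIPPET)
print(orgs)
```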

Your browser is made to understand HTML. Part of the HTML spec is that browsers, when rendering pages, should ignore any tags that aren't in the spec. So this tag is just ignored. No funny business is going on.
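Here is a toy illustration of that "ignore what you don't know" behavior, built on Python's `html.parser`. It is not how a real browser works. `KNOWN_TAGS` is a tiny stand-in for the real HTML tag list, and `ToyRenderer` and `render` are hypothetical names.

```python
from html.parser import HTMLParser

# Tiny stand-in for the set of tags a browser actually knows about.
KNOWN_TAGS = {"p", "b", "i", "a"}

class ToyRenderer(HTMLParser):
    """Collects visible text and notes any tags it does not recognize."""

    def __init__(self):
        super().__init__()
        self.text = []
        self.ignored = []

    def handle_starttag(self, tag, attrs):
        # Self-closing tags like <org .../> also end up here.
        if tag not in KNOWN_TAGS:
            self.ignored.append(tag)  # skip it, draw nothing

    def handle_data(self, data):
        self.text.append(data)

def render(html):
    renderer = ToyRenderer()
    renderer.feed(html)
    renderer.close()
    return "".join(renderer.text), renderer.ignored

page = '''<p>a burger at McDonald's<org idsrc="NYSE" value="MCD,MCJ,MCW"/> today</p>'''
visible, skipped = render(page)
print(visible)
print(skipped)
```

The reader sees the sentence exactly as if the tag were not there.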

But your instincts are correct in general. If the Times had embedded, say, a 1 pixel by 1 pixel white .gif (that you wouldn't see; this is called a "web bug"), and the .gif in question was being called from a New York Times server, then every time someone loaded your page (with the copied NYT article) a sub-request would be made to the Times server, which would see the URL of your page. This is a widely used way to track people.
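A small sketch of both halves of the trick: the invisible image tag, and what the server learns when a browser fetches it. All URLs here are made-up example.com addresses. The requests are represented as simple (path, headers) pairs rather than a real server log, and it assumes the browser sends the usual Referer header.

```python
def web_bug_html(tracker_url):
    """Build the 1x1 image tag that would sit unseen in a page."""
    return f'<img src="{tracker_url}" width="1" height="1" alt="">'

def pages_that_loaded_bug(requests, bug_path):
    """Given (path, headers) pairs seen by the tracking server, return the
    URLs of the pages whose visitors fetched the bug."""
    return [
        headers["Referer"]
        for path, headers in requests
        if path == bug_path and "Referer" in headers
    ]

# Hypothetical requests as the tracking server might record them.
seen = [
    ("/bug.gif", {"Referer": "http://blog.example.com/copied-article"}),
    ("/index.html", {}),
    ("/bug.gif", {"Referer": "http://news.example.com/original"}),
]

tag = web_bug_html("http://tracker.example.com/bug.gif")
pages = pages_that_loaded_bug(seen, "/bug.gif")
print(tag)
print(pages)
```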

The same thing is even more pervasive in email. If your email client is set to display HTML, then every time you open an email that contains a picture (which might be a web bug that you can't even see) your email client is giving away the fact that you are reading the message. A clever email tracker would give each .gif a different file name (like 1.gif, 2.gif, 199923239323.gif, etc...) so that when I open the email they know it (because they check their server logs to see when the .gif with my unique number on it was accessed). And if I forward the message to anyone they can track that too!
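The bookkeeping on the tracker's side could be as simple as the sketch below. The function names and addresses are hypothetical, and it uses random file names instead of the sequential numbers above (an assumption on my part, since random names are harder to guess).

```python
import secrets

def assign_pixels(recipients):
    """Give every recipient their own .gif file name."""
    return {f"{secrets.token_hex(8)}.gif": address for address in recipients}

def who_opened(pixel_map, requested_files):
    """Match file names from the server log back to recipients."""
    return [pixel_map[name] for name in requested_files if name in pixel_map]

pixel_map = assign_pixels(["jim@example.com", "tom@example.com"])

# Pretend the server log shows a request for jim's pixel and an unrelated file.
jims_file = next(name for name, who in pixel_map.items() if who == "jim@example.com")
opened = who_opened(pixel_map, [jims_file, "logo.gif"])
print(opened)
```

A forwarded copy carries the same file name, so extra requests for one recipient's pixel hint that the message has traveled.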

Pretty tricky. This is the main reason I encourage all people to set their email clients to display plaintext and not HTML - unless you have an advanced email client that can display HTML, but specifically NOT download anything from the internet at large. In that case it replaces images with squares of blank color, but otherwise keeps HTML formatting.
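For the curious, a minimal sketch of what such a client might do before showing a message, using Python's `html.parser`. The class and function names are hypothetical, and it swaps remote images for a text placeholder rather than a blank square. Real mail clients are surely more thorough than this.

```python
from html.parser import HTMLParser

REMOTE_PREFIXES = ("http://", "https://", "//")

class RemoteImageBlocker(HTMLParser):
    """Re-emits the HTML but replaces remote <img> tags with a placeholder."""

    def __init__(self):
        super().__init__(convert_charrefs=False)
        self.out = []
        self.blocked = []

    def handle_starttag(self, tag, attrs):
        src = dict(attrs).get("src") or ""
        if tag == "img" and src.lower().startswith(REMOTE_PREFIXES):
            self.blocked.append(src)            # never fetched
            self.out.append("[image blocked]")
        else:
            self.out.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        # Self-closing tags: emit once, without a separate end tag.
        self.handle_starttag(tag, attrs)

    def handle_endtag(self, tag):
        self.out.append(f"</{tag}>")

    def handle_data(self, data):
        self.out.append(data)

    def handle_entityref(self, name):
        self.out.append(f"&{name};")

    def handle_charref(self, name):
        self.out.append(f"&#{name};")

def sanitize(html):
    blocker = RemoteImageBlocker()
    blocker.feed(html)
    blocker.close()
    return "".join(blocker.out), blocker.blocked

message = '<p>Hi <b>there</b><img src="http://tracker.example.com/1.gif" width="1" height="1"></p>'
safe_html, blocked = sanitize(message)
print(safe_html)
print(blocked)
```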
- jim 4-10-2002 6:32 pm

Thanks--it still sounds fishy to me, though. All this tracking and countertracking reminds me of the nanobots in Neal Stephenson's The Diamond Age. The self-replicating spybots and anti-spybots floating in the atmosphere eventually get so thick that children start developing asthma.
- tom moody 4-10-2002 7:03 pm

i've got one for the nytimes mixed message department. they front an article about publishers and authors upset over amazon's selling of used books but have an ad for half.com on their homepage.
- dave 4-10-2002 7:43 pm

I saw that piece (but not the ad). It's Napster all over again: the copyright conglomerates are crying rip-off, but the authors are linking to Amazon on their home pages because they want to get their words out there.
- tom moody 4-10-2002 7:48 pm
