Sunday, June 28, 2009

Description of the project

Hi everyone!

As I promised, here is a description of a project I’m currently developing.

So what prevents us from making computers understand natural language?

The most common problem is the ambiguity of words in our language: the same word can mean different things in different situations (let’s call them contexts).

Examples include “bass” (a low sound or a fish), “country” (the countryside or the state), “head” (the body part or the head of a department), and so on. See Ambiguity (Wikipedia).

Word Sense Disambiguation (WSD) is one of the tasks of Computational Linguistics (CL) and Natural Language Processing (NLP): determining the meaning of a word in a particular usage. The classic example is “bass”.

“That bass swims fast” – here “bass” most likely refers to the fish.
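To make this concrete, here is a minimal baseline sketch using NLTK’s off-the-shelf simplified Lesk implementation (a standard gloss-overlap method, not the cross-lingual approach described later in this post). It assumes NLTK is installed and the WordNet data has been downloaded.

```python
# A baseline WSD sketch with NLTK's simplified Lesk (gloss overlap with the context).
# Assumes: pip install nltk, then nltk.download('wordnet').
from nltk.wsd import lesk
from nltk.corpus import wordnet as wn

sentence = "that bass swims fast".split()

# Lesk picks the synset whose dictionary gloss overlaps most with the context words;
# on such a short sentence the choice can easily be wrong, which is exactly the kind
# of case where richer knowledge sources help.
sense = lesk(sentence, "bass", pos=wn.NOUN)
if sense is not None:
    print(sense.name(), "-", sense.definition())

# For reference, all noun senses of "bass" that WordNet knows about:
for s in wn.synsets("bass", pos=wn.NOUN):
    print(s.name(), "-", s.definition())
```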

To build a successful WSD application, your program needs some knowledge about language, much like people have.

Such knowledge can come from sources like these:

1. An excellent knowledge base (semantic network, ontology, taxonomy, …) with a very rich set of interconnections between nodes, like “who does what” or “which object has which properties”, derivations between words like “to shoot” – “a shot”, and others. Ideally, almost all such interconnections should be fuzzy: each one should carry a number that defines its strength. From this it’s clear that WordNet is not enough. It isn’t fuzzy and it has a poor set of interconnections. It also has uneven density of synsets (how tightly senses are packed), which disrupts semantic-similarity algorithms that rely on path length between synsets, i.e. on counting the number of edges between them (see the sketch after this list).

2. A broad corpus of texts that is big enough, includes texts from different domains and, of course, is semantically tagged. What “semantically tagged” means is not precisely determined, since senses are not discrete, distinct things the way IDs are; maybe the tagging should be fuzzy too. But the point is that creating such a corpus requires so much effort and is so expensive that, as far as I know, no such corpus exists today. We have some small ones, like SemCor, which in my opinion is definitely not enough for machine-learning approaches.
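As a side note on both points above, here is a small sketch (assuming NLTK with the ‘wordnet’ and ‘semcor’ data packages downloaded) of the path-length similarity mentioned in point 1 and of how little sense-tagged data point 2 is talking about. Picking a “fish” sense of “bass” by its gloss is an illustrative shortcut, not part of the project.

```python
# Illustrates path-based synset similarity and the size of the SemCor corpus.
# Assumes NLTK plus nltk.download('wordnet') and nltk.download('semcor').
from nltk.corpus import wordnet as wn
from nltk.corpus import semcor

# path_similarity = 1 / (1 + edges on the shortest path between synsets in the
# hypernym hierarchy). Because some regions of WordNet are subdivided much more
# finely than others, intuitively similar pairs can end up with different scores.
music_bass = wn.synset("bass.n.01")                         # a musical sense of "bass"
fish_bass = next(s for s in wn.synsets("bass", pos=wn.NOUN)
                 if "fish" in s.definition())               # a fish-related sense, picked by gloss
trout = wn.synsets("trout", pos=wn.NOUN)[0]                 # first listed trout sense, for comparison

print("fish bass  vs trout:", fish_bass.path_similarity(trout))
print("music bass vs trout:", music_bass.path_similarity(trout))

# SemCor: one of the few sense-tagged corpora, and small for supervised learning.
print(len(semcor.sents()), "sentences in SemCor")
```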

Ok, back to the project.

I’m currently developing a project that uses a “backstage” language to understand the meaning of words in the initial language. Each language has its own map of overlapping synonymy and homonymy, and this map differs from language to language.

What I mean is that it’s hard to tell whether “bark” refers to the bark of a tree or to a kind of ship if you don’t have at least one of the two sources mentioned above. But you can use other languages (I use Russian), where the bark of a tree and the bark as a ship are written as different words, so you can investigate the contexts of these words in the “backstage” language and, using various well-known approaches, determine the sense of “bark” at that particular place in the initial language.
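To illustrate the idea only, here is a toy sketch with made-up translation and context dictionaries; every word list and name in it is an assumption for illustration, not the project’s actual data, which would presumably come from corpus statistics in the backstage language.

```python
# A toy sketch of the "backstage language" idea (hypothetical data, not the
# author's implementation). Each sense of an ambiguous English word corresponds
# to a distinct Russian word; we score senses by how many context words typically
# co-occur with that Russian word. The co-occurrence sets below are made up.

# English "bark" maps to different Russian words depending on the sense.
SENSE_TRANSLATIONS = {
    "bark_tree": "кора",    # bark of a tree
    "bark_ship": "барка",   # a kind of ship
}

# Pretend these context words were mined from a Russian corpus and translated
# back to English; in a real system they would come from corpus statistics.
BACKSTAGE_CONTEXTS = {
    "кора": {"tree", "trunk", "forest", "wood", "leaves"},
    "барка": {"river", "sail", "cargo", "deck", "crew"},
}

def disambiguate(word_senses, context_words):
    """Pick the sense whose backstage-language context overlaps the input most."""
    scores = {}
    for sense, ru_word in word_senses.items():
        overlap = BACKSTAGE_CONTEXTS[ru_word] & set(context_words)
        scores[sense] = len(overlap)
    return max(scores, key=scores.get), scores

sentence = "the old bark sailed down the river with a heavy cargo".split()
best, scores = disambiguate(SENSE_TRANSLATIONS, sentence)
print(best, scores)   # bark_ship wins for this sentence (overlap: river, cargo)
```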

It should also be noted that the more languages you have, the better quality you can get from this approach.

Stay tuned, I’ll update this post with an appropriate example to show how the approach works.

Later I’ll publish some results I’ve achieved so far.

1 comment:

  1. (reposting this, hopefully blogspot doesn't eat it this time :)

    Hi! Check http://www.lexvo.org/linkeddata/resources.html and that whole site. They have loaded several language-oriented datasets, giving Web identifiers and machine readable blabla for names for languages but also a few (linked) dictionaries. I don't know how wide the coverage is.

    For the multi-lingual wordnets, most are closed. The Dutch one is (informally?) open and in RDF, it's called Cornetto. See http://ckan.net/package/cornetto

    Also http://omegawiki.org tries to cover all languages within a single framework. And Wiktionary has got better recently.

    I was thinking Microsoft's ngram service might also be useful for you but looking at http://research.microsoft.com/en-us/collaboration/focus/cs/web-ngram.aspx it is en-US only for now.

    For SKOS datasets (roughly RDF for thesauri and similar) see http://www.w3.org/2001/sw/wiki/SKOS/Datasets ... there are a few that are multi-lingual to some extent.
