Sunday, June 28, 2009

Description of the project

Hi everyone!

As I promised, here is a description of a project I’m currently developing.

So what prevents us from making computers understand natural language?

The most common problem is the ambiguity of words in our language: some words mean different things in different situations (let’s call them contexts).

Examples include “bass” (a low sound vs. a fish), “country” (the countryside vs. the state), “head” (the body part vs. the head of a department), and so on. See Ambiguity (Wikipedia).

Word Sense Disambiguation (WSD) is one of the tasks of Computational Linguistics (CL) and Natural Language Processing (NLP): determining the meaning of a word in a particular usage. The classic example is “bass”.

In “That bass swims fast”, “bass” probably refers to a species of fish.
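(Just to show what a WSD system outputs, here is a tiny example using NLTK’s simplified Lesk algorithm. This is a classic dictionary-based method, not the approach I describe below, and it assumes NLTK with the wordnet corpus downloaded via nltk.download('wordnet').)

from nltk.wsd import lesk

# Simplified Lesk picks the WordNet sense whose dictionary gloss
# overlaps most with the context words; on a context this short
# it can easily guess wrong.
sense = lesk('That bass swims fast'.split(), 'bass')
print(sense, '->', sense.definition())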

For a WSD application to be successful, it needs some knowledge about language, like the knowledge people have.

Here are some sources from which you could get such knowledge:

1. An excellent knowledge base (semantic network, ontology, taxonomy, …) with a very rich set of interconnections between nodes, like “who does what” or “which object has which properties”, plus derivations between words like “to shoot” – “a shot”, and others. Ideally, almost all such interconnections should be fuzzy; I mean each one must carry a number that defines its strength. With this in mind it’s clear that WordNet is not enough: it’s not fuzzy, and it has a poor set of interconnections. It also has uneven density of synsets (tightness between senses), which disrupts semantic similarity algorithms that rely on path length between synsets, counting the number of edges between them (a small sketch of this metric follows after the list).

2. A broad corpus of texts, one that is big enough, includes texts from different domains and, of course, is semantically tagged. What “semantically tagged” means is not precisely defined, since senses are not discrete, distinct things the way IDs are; maybe the tagging should be fuzzy too. But the point is that creating such a corpus takes so much effort and is so expensive that, as far as I know, no such corpus exists yet. We have some small ones, like SemCor (a peek at its tags also follows below), which in my opinion is definitely not enough for machine learning approaches.
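Here is a minimal sketch of the path-length metric I criticized in point 1, using NLTK’s WordNet interface (NLTK is just my illustration here; it assumes the wordnet corpus has been downloaded):

from nltk.corpus import wordnet as wn

# path_similarity = 1 / (shortest path length in edges + 1),
# so the score depends directly on how densely a given region
# of WordNet is populated with synsets.
dog = wn.synset('dog.n.01')
cat = wn.synset('cat.n.01')
print(dog.path_similarity(cat))  # 0.2

# Senses of an ambiguous word live in regions of very different
# density, which skews such comparisons:
for sense in wn.synsets('bass'):
    print(sense.name(), '-', sense.definition())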
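And here is a quick peek at what SemCor’s semantic tags look like through NLTK (again purely illustrative; it assumes nltk.download('semcor')):

from nltk.corpus import semcor

# Each chunk of a sense-tagged sentence is a small tree whose
# label is a WordNet lemma, i.e. a concrete sense annotation.
first_sentence = semcor.tagged_sents(tag='sem')[0]
for chunk in first_sentence:
    print(chunk)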

Ok, back to the project.

I’m currently developing a project which uses a “backstage” language to understand the meaning of words in the initial language. Each language has its own map of overlapping synonymy and homonymy, and this map differs from language to language.

I mean it’s hard to tell whether “bark” refers to the bark of a tree or to a kind of ship if you don’t have at least one of the two sources I mentioned above. But you can use other languages (I use Russian) where the bark of a tree and bark as a ship are written as different words, so you can investigate the contexts of those words in the “backstage” language and, by various well-known approaches, find the sense of the word “bark” in that particular place in the initial language.

It should also be noted that the more languages you have, the better the quality you can get from this approach.
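To make the idea a bit more concrete, here is a toy sketch in Python. Everything in it (the sense inventory, the co-occurrence counts, the function names) is a hypothetical illustration, not the actual project code:

from collections import Counter

# Hypothetical sense inventory: each English sense of "bark"
# maps to a distinct Russian word in the "backstage" language.
SENSES = {"bark": {"кора": "tree bark", "барк": "sailing ship"}}

# Toy co-occurrence statistics gathered from a Russian corpus:
# (candidate word, context word) -> count.
COOCCUR = {
    ("кора", "дерево"): 120,   # "kora" next to "tree"
    ("кора", "корабль"): 2,
    ("барк", "дерево"): 3,
    ("барк", "корабль"): 95,   # "bark" (the ship) next to "ship"
}

def disambiguate(word, ru_context):
    # Vote for the sense whose Russian translation co-occurs
    # most often with the (already translated) context words.
    scores = Counter()
    for ru_word, sense in SENSES[word].items():
        for ctx in ru_context:
            scores[sense] += COOCCUR.get((ru_word, ctx), 0)
    return scores.most_common(1)[0][0]

print(disambiguate("bark", ["дерево"]))   # -> tree bark

The real system, of course, would derive such counts from a large corpus in the backstage language rather than hardcoding them.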

Stay tuned, I’ll update this post with a proper example describing how the approach works.

Later I’ll publish some results I’ve achieved so far.

Saturday, June 27, 2009

Introduction...

Hello world!

My name is Ivan Akcheurov.

I’m a software engineer.

I live in Ukraine now (Update: since July 2010 I live in the Amsterdam area of the Netherlands).

I started my career with the development of a Machine Translation engine.

Since that time, Word Sense Disambiguation and Machine Translation have been my passion.

Of course, adjacent NLP/CL fields like IR, Text Mining and Knowledge Extraction are also within the scope of my interests.

Now I’m developing a project that finds the most appropriate translation for a given text.

The final goal is to get the senses of the words in the initial sentence.

But everything starts small.

So my project deals with only a few words for now!

It knows nothing about syntax. The main thing it uses is unstructured data taken from the internet: several gigabytes of posts from thousands of blogs. The app takes a statistical approach, which I will review in later posts. In short, it finds similar phrases in other languages and decides which translation is better for a particular word from the initial sentence. At the moment it’s based on co-occurrences, but I plan to enhance this approach :)
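As a rough illustration of the co-occurrence part (the corpus, window size and candidate words below are made up for the example, not taken from the project):

from collections import Counter

def cooccurrence_counts(sentences, window=5):
    # Count unordered word pairs that appear within `window`
    # tokens of each other.
    counts = Counter()
    for tokens in sentences:
        for i, word in enumerate(tokens):
            for other in tokens[i + 1 : i + 1 + window]:
                counts[frozenset((word, other))] += 1
    return counts

# Two toy Russian sentences standing in for gigabytes of blog posts:
corpus = [
    "кора дерева шершавая".split(),         # "the tree's bark is rough"
    "старый барк вышел из гавани".split(),  # "the old barque left the harbour"
]
counts = cooccurrence_counts(corpus)

def score(candidate, context, counts):
    # How strongly a candidate translation is supported by the
    # translations of the surrounding words.
    return sum(counts[frozenset((candidate, c))] for c in context)

print(score("кора", ["дерева"], counts))  # 1
print(score("барк", ["дерева"], counts))  # 0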

As soon as I have a description of the results, some comparisons and probably an evaluation, I’ll post them as well.

Stay tuned :)