Computers::The 300 degrees of human knowledge
Posted: Thu Jun 1, 01:46 PM
In my class on the theory behind Information Retrieval (a.k.a. search engines) and Natural Language Processing, one topic that came up was Latent Semantic Analysis (LSA) and Latent Semantic Indexing (LSI). As usual, there's a paper reference at the end.
In a very gross summary, LSA involves making a huge matrix with documents along one edge and words along the other. If a word is in a given document you put a 1, and a 0 otherwise (or a count of the words/phrases, etc.; there’s lots of room for creativity). Then, through some mathemagical algorithms, you compute a Singular Value Decomposition (SVD) of the matrix and keep only its largest components, which squashes the huge matrix down to a lower dimension. The SVD is a sort of summary of the characteristics and relationships in the big matrix you put in. With it in hand, you can start scoring documents according to how much of each characteristic they have (read: you can get ranked search results).
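For the curious, here's a rough sketch of that pipeline in Python with NumPy. The four-document toy corpus and the names (`rank_documents`, `row`, and so on) are made up by me for illustration and are not from the paper. The sketch assumes plain word counts as matrix entries and cosine similarity as the score, which is just one of the many possible choices.

```python
import numpy as np

# Toy corpus, made up for illustration (not from the paper).
docs = [
    "cat dog pet",
    "dog pet animal",
    "car engine road",
    "road car truck",
]

vocab = sorted({w for d in docs for w in d.split()})
row = {w: i for i, w in enumerate(vocab)}

# Word-by-document matrix: one row per word, one column per document,
# each entry is how many times the word appears in the document.
A = np.zeros((len(vocab), len(docs)))
for j, d in enumerate(docs):
    for w in d.split():
        A[row[w], j] += 1

# SVD: A = U @ diag(s) @ Vt, with the singular values in s sorted largest first.
U, s, Vt = np.linalg.svd(A, full_matrices=False)


def rank_documents(query, k=2):
    """Return (doc_index, cosine_score) pairs, best match first,
    comparing the query and the documents in the top-k SVD dimensions."""
    q = np.zeros(len(vocab))
    for w in query.split():
        if w in row:
            q[row[w]] += 1
    q_k = U[:, :k].T @ q                # query projected onto the top-k dimensions
    d_k = (s[:k, None] * Vt[:k, :]).T   # documents in the same k dimensions
    norms = np.linalg.norm(d_k, axis=1) * np.linalg.norm(q_k)
    scores = d_k @ q_k / np.maximum(norms, 1e-12)
    order = np.argsort(-scores)
    return [(int(j), float(scores[j])) for j in order]


ranking = rank_documents("cat", k=2)
```

In this toy setup, document 1 never mentions "cat", but it shares "dog" and "pet" with document 0, so with `k=2` it should score about as high as document 0, while the two car documents score near zero. That's the "latent" part doing its thing.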
Anyways, long story short, one of the parameters in LSI is how many dimensions of the SVD you want to keep: say, the top 10, 100, or 300, and so on. And here’s where the post title starts to come in: the performance of LSI gets better or worse depending on the number of dimensions used, and for really large data sets the best-performing number seems to be about 300.
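To get a feel for why the number of dimensions matters, here's a small sketch (again Python with NumPy, again with a toy corpus and a helper name, `word_similarity`, that I made up). It compares two words using only the top `k` dimensions of the SVD. With only four tiny documents there are at most 4 dimensions to choose from, so this shows the effect in miniature and says nothing about the 300 figure.

```python
import numpy as np

# Toy corpus, made up for illustration (not from the paper).
docs = [
    "cat dog pet",
    "dog pet animal",
    "car engine road",
    "road car truck",
]

vocab = sorted({w for d in docs for w in d.split()})
row = {w: i for i, w in enumerate(vocab)}

# Word-by-document count matrix.
A = np.zeros((len(vocab), len(docs)))
for j, d in enumerate(docs):
    for w in d.split():
        A[row[w], j] += 1

U, s, Vt = np.linalg.svd(A, full_matrices=False)


def word_similarity(w1, w2, k):
    """Cosine similarity of two words using only the top-k SVD dimensions."""
    W = U[:, :k] * s[:k]          # each row is a word's k-dimensional vector
    a, b = W[row[w1]], W[row[w2]]
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))


# "cat" and "animal" never appear in the same document.
sim_k2 = word_similarity("cat", "animal", k=2)  # few dimensions
sim_k4 = word_similarity("cat", "animal", k=4)  # all dimensions kept
```

With `k=2` the two words end up pointing in essentially the same direction (similarity close to 1), because both hang out with "dog" and "pet". With all 4 dimensions kept, nothing has been squashed, and the similarity drops back to about 0, the same as comparing the raw counts. Too few dimensions would lump everything together and too many gives you back the original matrix, so the useful setting sits somewhere in between.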
Okay, vague handwaving theory time is done.
So, what’s interesting about this? Well, the authors of the paper fed their implementation of LSI all the text of “Grolier’s Academic American Encyclopedia”, which is designed for young students. Then they gave the system a synonym test often given to foreign students for admission into universities in the US, the Test of English as a Foreign Language (TOEFL), using 80 retired questions, and compared its score to historical statistics collected by the test creators.
The result? Adjusted for guessing, LSI got 52.5% of the problems correct. And foreign students? 52.7%. Ooooo, that’s… kinda spooky, isn’t it? At least for identifying the similarity relationship between words, LSI seems to do as well as an average foreign university student, and the authors note that the score is enough to get into typical universities in the US.
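A quick note on what "adjusted for guessing" means, as a sketch. I'm assuming the usual correction formula, (correct - chance) / (1 - chance), and I'm assuming each question has four answer choices, so that chance is 25%. Both are my assumptions here, so check the paper for the exact setup. The function names are mine.

```python
def adjust_for_guessing(raw, n_choices=4):
    """Rescale a raw fraction correct so that pure guessing scores 0 and perfect scores 1.
    n_choices=4 is an assumption about the test format."""
    chance = 1.0 / n_choices
    return (raw - chance) / (1.0 - chance)


def raw_from_adjusted(adjusted, n_choices=4):
    """Invert the correction: recover the raw fraction correct from the adjusted one."""
    chance = 1.0 / n_choices
    return adjusted * (1.0 - chance) + chance


# Adjusted scores quoted above.
lsi_raw = raw_from_adjusted(0.525)
student_raw = raw_from_adjusted(0.527)

# Implied number of questions right out of the 80 on the test.
lsi_questions = lsi_raw * 80
student_questions = student_raw * 80
```

Under those assumptions, both adjusted scores work out to a raw score of a bit over 64%, or roughly 51 to 52 of the 80 questions.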
Further into the paper linked at the end, there’s a whole section where they compare LSI to children learning English, which is also pretty interesting to see.
So, does human knowledge only encompass 300 dimensions? I have no idea. After all, if you sorta look closely at how everything works, I find it hard to believe that’s all there is to how humans learn about relationships between words and things. But then, it’s not like I have better alternatives; at best, just equally valid ones. So it’s just a spiffy thing to know about. =)
A Solution to Plato’s Problem: The Latent Semantic Analysis Theory of Acquisition, Induction and Representation of Knowledge. Landauer, T.K., Dumais, S.T. Retrieved June 01, 2006.
The Academiblog