For ECE-270A participants only
1/23/08

ECE-270A Homework #1: Knowledge base formation

Files given on CD:
1) hw1_wordlist10k.txt
2) text_0_data.txt
3) text_1_data.txt

The "hw1_wordlist10k.txt" file contains a list of 10,000 words and characters. The text data is restricted to these 10,000 words and characters, so every sentence in the text data can be indexed by this word list. The file looks as follows:

0 ,
1 the
2 .
3 of
4 a
...
9999 adverse

####### Example matlab code for getting word list #######
FileHandle = fopen('hw1_wordlist10k.txt');
LineCount = 1;
while true
    % get next line up to 'end of line' character
    ThisLine = fgetl(FileHandle);
    % At EOF (end of file) fgetl returns the number -1 instead of a
    % string, so quit the while loop
    if ~ischar(ThisLine)
        break;
    end
    % Individual words are blocks of characters separated by
    % white spaces. This splits each word into its own place in a cell array.
    TheseWords = regexp(ThisLine, '\s', 'split');
    % TheseWords is a cell array
    % TheseWords{1} holds the number
    % TheseWords{2} holds the word/character
    WordCell{LineCount} = TheseWords{2};
    LineCount = LineCount + 1; % count the current line
end
fclose(FileHandle);
% WordCell has 10,000 elements, one for each word
% The file numbering starts at 0, so word number n is stored in WordCell{n+1}
% you can display the word "the" as WordCell{2}
% or disp(sprintf('The word in the second position is %s',WordCell{2}))
#######

Each text data file (e.g. text_0_data.txt) contains ~250,000 individual sentences. Each sentence is on one line, which is ended by an "End of line" character. Note that the above matlab code will also work for getting each individual sentence from the data files. Every word/character is separated by a whitespace and is all lowercase. An example sentence may therefore look like:

but now , in 1986 , there's a whole new generation .

Notice the spaces between words and punctuation. As mentioned, each word and character, including punctuation, is separated by a space and has a corresponding match in the word list.
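
As a small illustration of the sentence format, the sketch below splits the example sentence into its tokens with the same regexp call used in the word list code. The names ExampleSentence and Tokens are made up for this illustration and are not part of the assignment files.

```matlab
% Example sentence copied from the handout text (not read from a data file)
ExampleSentence = 'but now , in 1986 , there''s a whole new generation .';

% Split on single whitespace characters; each word or punctuation mark
% becomes its own entry in the cell array Tokens
Tokens = regexp(ExampleSentence, '\s', 'split');

% Show each token with its position in the sentence
for i = 1:numel(Tokens)
    fprintf('%2d: %s\n', i, Tokens{i});
end
```

Punctuation marks come out as separate tokens, so the commas and the final period each get their own word list lookup.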
The regexp command splits the line into groups of characters separated by a space.

A couple of quick matlab notes:

1) Matlab is not the best tool for searching through text. You may want to look at languages like Perl if you want things to run fast on large text.

2) Matlab will work just fine, but you should plan on the text searching running for many hours unless you get creative with hashtables etc. in matlab. In matlab, you can use commands like:

% loop over i for the whole sentence
WordIndex = strmatch(TheseWords{i}, WordCell, 'exact');

to find the index of each matched word. You may want to look at the command strvcat to convert the words in WordCell into a character matrix, and use that in the strmatch command instead of WordCell. This will run much faster.

3) You DO NOT want to build sparse matrices on the fly. Each time you add a sparse element the matrix gets rebuilt. A full 10,000 x 10,000 matrix is only 800MB in double precision and 400MB in single precision (which is fine for counts). If you have the RAM, just make several full count matrices and convert them to sparse later. If not, you can be creative with cell arrays or a variety of other methods (like making a full matrix for each different knowledge base and running through the data in several passes).

Good luck and have fun.
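
Notes 2 and 3 can be combined in a minimal sketch like the one below. It is one possible approach, not the required solution. It assumes a MATLAB version that provides containers.Map (R2008b or later) as the hash table, and it assumes, purely for illustration, that a count matrix tallies how often one word directly follows another; the handout does not define the counts here. The toy WordCell and Sentences values and the names WordToIdx, Counts and SparseCounts are hypothetical stand-ins for the real data.

```matlab
% Toy stand-ins for the real word list and sentences (hypothetical)
WordCell  = {',', 'the', '.', 'of', 'a'};
Sentences = {'the a of the .', 'a , the a .'};

% Hash table from each word to its 1-based position in WordCell
% (assumes containers.Map is available, i.e. MATLAB R2008b or later)
NumWords  = numel(WordCell);
WordToIdx = containers.Map(WordCell, 1:NumWords);

% Preallocate a full single-precision count matrix instead of
% growing a sparse matrix one element at a time
Counts = zeros(NumWords, NumWords, 'single');

for s = 1:numel(Sentences)
    TheseWords = regexp(Sentences{s}, '\s', 'split');
    % Look up every word in the hash table; Idx is a numeric index vector
    Idx = cellfun(@(w) WordToIdx(w), TheseWords);
    % Assumed count for illustration: word Idx(k+1) directly follows word Idx(k)
    for k = 1:numel(Idx)-1
        Counts(Idx(k), Idx(k+1)) = Counts(Idx(k), Idx(k+1)) + 1;
    end
end

% Convert to sparse only once, after all counting is done
% (cast to double first, since sparse may not accept single input)
SparseCounts = sparse(double(Counts));
```

With the real 10,000-word list, the cast to double temporarily needs the 800MB mentioned above in addition to the single-precision matrix, so the conversion may have to be done in blocks on a machine with limited RAM.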