A PowerPoint Presentation Indexer

(Version 2.2)


One of the constant complaints I get about my two major courses is that there is no index to the presentation.  I do not understand why Microsoft has not managed to solve this problem after all these years; apparently providing base functionality is not as important as coming up with new interfaces that don't really add anything to the base functionality of the programs.  So I had to solve it on my own.

I took it as given that I was not going to try to create a self-contained solution.  I don't write in Visual Basic except under duress: the VBA documentation ranks as some of the very worst documentation produced, exceeded in poor quality only by Unix 'man' pages.  It is incoherent and disorganized, has no overview, has the worst cross-referencing system I've ever seen, and overall makes it impossible to learn how to do anything.  Fortunately, there are some people on the microsoft.public.office.developer.vba newsgroup who were able to show me the core algorithms.

So the result is that I now have a program that can generate an index for every single word in a PowerPoint presentation.  The problem is that this is rather useless, since it indexes words like "the", "an", "maybe", "can", "that", "these", and "those", among other useless words.  So the question was how to make a program that understood some of the real issues of indexing.

This is a work-in-progress.  As I continue working with it, I will add more facilities.

The core components of the system are the following

Indexing Philosophies

Normally, we index using "positive indexing".  That is, in Word, or FrameMaker, we add special indexing text marks, which are metadata marks.  These marks do not appear in the printable rendering but are used by the processor to generate index data.  This gives maximum flexibility.  In professional systems such as FrameMaker, the index entries themselves can contain markup data to control the font used in the index.

Unfortunately, Microsoft has not really understood their Office product line.  Publisher and PowerPoint can't generate indexes.  Publisher can't generate page references.  There is some delusional system in which they believe that no one could ever need the obvious facilities.  However, it is possible to "hijack" the PowerPoint Notes section for this purpose, and I have done so.

So the indexer supports both "positive indexing" and "negative indexing".  You choose which one you find most effective.  One nice thing about the "negative indexing" is that it can guide you to adding tags in the right places, and that is very convenient.

My style is to spend some time doing "negative indexing", dropping words which are clearly irrelevant, and when I'm finished, I use the words as a guideline to adding the "positive indexing" tags.  By using the tool to directly modify the presentation, I can index a 1200-slide presentation with less than a week's effort.

Content indexing and Text Transformations 

There are problems with using negative indexing.  For example, some words start sentences, while others appear in the middle of sentences.  In English this means that some words are capitalized while others are not.  But some words appear in all-upper-case or all-lower-case, and it would be an error to change them for purposes of indexing.  (One way to tell an incompetent editor is someone who insists on English rules of capitalization for non-English words, e.g., a section in a book called "Strcpy, Strcat and Strlen".  Those words are not English, and must not have bizarre rules applied that change their meaning.  A classic example of this is the C++ book by Stroustrup, where English capitalization rules are applied to C++ methods.)

An unfortunate side effect of using negative indexing is that I've had to go back in and reword some slides so that the indexing terms are consistent, particularly when phrases and permutations are required.  This has resulted in some redundancy in phrases.  For example, I can't write "This and that thing" if I've already indexed the phrase or permutation "This thing"; I had to rewrite it as "This thing and that thing" so the indexing was consistent.  But lacking a positive indexing mechanism, it was all I could do.  This is why I ultimately figured out how to add positive indexing to the presentation.  It was becoming too difficult to do negative indexing.

Some words which appear in titles should not be indexed in all-caps, so a title like REPLACING STRING CONTENTS should have index entries such as "Replacing", "String" and "Contents", but a heading such as "strcmp and LC_COLLATE" should not be indexed as "Strcmp" and "Lc_collate".  There are user-supplied rules which can ensure that these words get properly indexed.

I have not taken time to create locale-specific rules for some of these situations.  The rules below are all hard-coded in the program at this point.  Since I only know one other language, I don't feel qualified to extend this program to languages about which I know nothing.

Default transformation rules

First, I generated a set of heuristics for dealing with text transformations.  They will do about 90% or more of all cases, but they don't work everywhere.  The simple rules are

To handle the exceptions, a set of rules can be added explicitly by the user, by giving a "pattern" word and a "transformation rule".

Case transformation rules (user-supplied)

These rules override the default case transformation rules.

Pattern                Effect
all lower case         mark as being unchanged, indexed as all lower case
exact pattern match    normalize by converting to all lower case and capitalizing the first letter
exact pattern match    convert to all-upper-case
exact pattern match    convert to all-lower-case

Plural-to-singular transformation rules (user-supplied)

To handle plurals, I chose to drop them to singular form, and added rules where the user supplies the pattern word in plural form and the singular form is derived (for verb forms, it is actually the opposite; the plural form is derived from the singular form, because of the rules of English verb formation).

Pattern                    Effect
plural form ends in s      drop the s
plural form ends in es     drop the es
plural form ends in ies    replace -ies with -y

Verb tense transformations (user-supplied)

I handle past-to-present-tense transformations, and gerunds, by adding the following rules:

Pattern                    Effect
past tense ends in d       drop the d
past tense ends in ed      drop the ed
gerund form ends in ing    drop the ing

Note that the singular and plural forms of verbs can be handled by the -s/-es/-ies rules.  Note also that the gerund rule will drop a doubled letter at the end of the word, e.g., "dropping" becomes "drop".
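The suffix rules in the tables above can be sketched in a few lines of Python. This is only an illustration, not the Indexer's own code. The names *DROP_S, *DROP_ES, *DROP_D and *DROP_ED appear elsewhere in this document; `*IES_TO_Y` and `*DROP_ING` are hypothetical names invented here for the -ies and gerund rules, and `apply_rule` is a made-up helper.

```python
# Sketch of the suffix transformations described above.
# "*IES_TO_Y" and "*DROP_ING" are hypothetical rule names used only here.
SIMPLE_SUFFIXES = {
    "*DROP_S": "s",
    "*DROP_ES": "es",
    "*DROP_D": "d",
    "*DROP_ED": "ed",
}


def apply_rule(word, rule):
    """Return the form of `word` that would be indexed under `rule`."""
    if rule in SIMPLE_SUFFIXES:
        suffix = SIMPLE_SUFFIXES[rule]
        return word[:-len(suffix)] if word.endswith(suffix) else word
    if rule == "*IES_TO_Y":
        return word[:-3] + "y" if word.endswith("ies") else word
    if rule == "*DROP_ING":
        if not word.endswith("ing"):
            return word
        stem = word[:-3]
        # Collapse a doubled final letter: "dropping" -> "dropp" -> "drop".
        if len(stem) >= 2 and stem[-1] == stem[-2]:
            stem = stem[:-1]
        return stem
    raise ValueError(f"unknown rule: {rule}")
```

Because each rule is attached to a specific pattern word chosen by the user, a rule is only ever applied where it gives a sensible result; applying the gerund rule blindly to a word like "calling" would not.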

Non-indexed word detection (hardwired)

Pattern              Effect
've                  word is dropped
'll                  word is dropped
't                   word is dropped
'd                   word is dropped
I'                   word is dropped
1-character words    word is dropped
numbers              word is dropped

Computer numbers (hardwired, with user-supplied override)

My presentations are all about programming, so a "number" has a fairly broad description.  It is not just a sequence of digits.  An ordinary decimal integer can end in u or U, or l or L.  A number can also be floating-point or hexadecimal, and a hex number can have many different representations, such as 0x3be, 3beh, 0x3BEL, or even just plain 3BE.  Floating-point numbers can have an exponential form and can end in f.  The recognition of a number is hardwired into the program.

The problem with hexadecimal numbers is that they can also spell English words, so I had to provide an "exception" mechanism.  Typical exceptions might be the word 'bad' or 'face'.  In addition, because of the suffix rules, the Access Control List acronym, ACL, looks like a long hexadecimal constant for the numeric value 172.  Indexing function keys, which have names like F1 and F10, would also be impossible unless they could be treated as exceptions.  So the rule is that after a hex number is identified, it is compared to a user-supplied list of exceptions, and if it matches the exception, it is treated as a word.
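The recognize-then-check-exceptions rule might look roughly like the following Python sketch. The regular expression is an approximation written for this example, not the pattern hardwired into the program, and comparing exceptions case-insensitively is an assumption. `classify` is a hypothetical helper name.

```python
import re

# Approximate pattern for "computer numbers"; an assumption for illustration.
_NUMBER = re.compile(
    r"""^(?:
        0[xX][0-9a-fA-F]+[uUlL]*                                # 0x3be, 0x3BEL
      | [0-9a-fA-F]+[hH]                                        # 3beh
      | [0-9a-fA-F]+[uUlL]*                                     # 3BE, 42u, ACL, F10
      | (?:[0-9]+\.[0-9]*|\.[0-9]+)(?:[eE][+-]?[0-9]+)?[fFlL]?  # 1.5, .5e3f
      | [0-9]+[eE][+-]?[0-9]+[fFlL]?                            # 1e10, 2e-3f
    )$""",
    re.VERBOSE,
)


def classify(token, exceptions=()):
    """Return "number" or "word" for a token.

    `exceptions` holds lower-case strings that look like numbers but
    should be indexed as words (e.g. "bad", "face", "acl", "f1").
    """
    if _NUMBER.match(token) and token.lower() not in exceptions:
        return "number"
    return "word"
```

For example, `classify("ACL")` reports a number, while `classify("ACL", {"acl"})` reports a word, which is the behavior the exception list is meant to provide.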

Phrase and Permutation rules (user-supplied)

Sometimes what needs to be indexed is not a single word, but a phrase, such as "virtual memory" or "memory descriptor list".  These can be indexed in one of two forms: just index the phrase, or index the phrase and permutations of it.  For example, it might be desirable to index "virtual memory" as both "virtual memory" and "memory, virtual".  To handle this case, the user can supply rules to index phrases or index permutations of phrases.  For simplicity, the permutation is limited to 2-word and 3-word phrases.
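As a rough illustration of permutation, the Python sketch below produces the entries for a 2-word or 3-word phrase. The 2-word case follows the "memory, virtual" example above. How the program orders a 3-word phrase is not stated here, so the rotation scheme used for three words is purely an assumption, and `permuted_entries` is a hypothetical name.

```python
def permuted_entries(phrase):
    """Return the phrase plus its rotated "tail, head" index entries.

    Assumption: a 3-word phrase is rotated one word at a time; the text
    only shows the 2-word form ("memory, virtual").
    """
    words = phrase.split()
    if not 2 <= len(words) <= 3:
        raise ValueError("permutation is limited to 2-word and 3-word phrases")
    entries = [" ".join(words)]
    for i in range(1, len(words)):
        leading = " ".join(words[i:])
        trailing = " ".join(words[:i])
        entries.append(f"{leading}, {trailing}")
    return entries
```

Under this assumption, "memory descriptor list" would also be indexed as "descriptor list, memory" and "list, memory descriptor".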

Hyphenated words (user-supplied)

Sometimes it is desirable to index hyphenated words.  The user can supply rules to indicate hyphenated word indexing.  In the absence of such a rule, the individual words are indexed.

Automatic Rule Generation

When a rule for certain transformations (such as dropping suffix letters) is provided, and it is typed in all lower case, an automatic rule that matches the capitalized version of the word is generated.  This means we typically have to supply only half the rules actually required.  So if a rule of the form

texts=*DROP_S

is supplied, an automatic rule

Texts=*DROP_S

will also be generated.

This automatic generation also applies to hyphenated phrases.
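The automatic generation just described can be sketched as follows in Python. This is a guess at the logic, not the program's code: it assumes the rules that qualify are exactly those whose names begin with *DROP, and `expand_rule` is a hypothetical helper.

```python
def expand_rule(word, rule):
    """Return the (word, rule) pairs that result from one supplied rule.

    Assumption: only rules whose names start with "*DROP" get an automatic
    Capitalized twin, and only when the pattern word has no upper-case letters.
    """
    rules = [(word, rule)]
    all_lower = word[:1].islower() and not any(c.isupper() for c in word)
    if rule.startswith("*DROP") and all_lower:
        rules.append((word[0].upper() + word[1:], rule))
    return rules
```

So `expand_rule("texts", "*DROP_S")` yields both the `texts` and `Texts` rules, while a pattern that already starts with a capital letter yields only itself.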

Heuristic approach

These rules are often heuristic.  The result is an index that may be a bit bulky, and may have some odd capitalizations here and there.  This is the downside of the negative indexing approach.  But it is infinitely better than having no index at all.

The PowerPoint Indexer

Opening the Presentation

Whenever the program runs, the File menu has an option Open PowerPoint File.  This will open the file and read its contents into the program.  The text discovered will be displayed in the lower window.

If the PowerPoint file is called Presentation1.ppt, the loading of the index will read in two other files, if they exist.  Presentation1.nul is the list of "Null words", the words which will not be indexed.  Presentation1.rule holds a list of transformation rules, such as rules to drop the "s" from a plural word so that there is only one form indexed.  If these do not exist, you can create them yourself after creating a list of null words or rules.

If you use the Open button or the File > Open PowerPoint File... menu item, the File Open dialog gives you the option of selecting the source of indexing text: Text means to use the slide contents; Notes means to look for specific indexing marks in the Notes section; and Both (the default) means to use both.  When a file is opened using the Recent Files menu it is always opened with Both.  To exercise specific control, the indexing mode can be turned on or off using Notes rules, and this is the recommended practice.

If you want the slide titles in the table of contents, check the Include Slide Titles option before reading the input file (the TOC is generated during file read).  The slide title will appear for any slide that does not have a *HEADINGn= declaration in its notes.

Generating the Index

The index will be generated based on the words that are discovered.  Click the Index button or use the File > Write Index File menu item to write the index file out.  The "index file" is actually three HTML files.  The main file is a "frame set" file, which by default is named after the input file name.  So by default, the index file would be called Presentation1.htm.  Each of the frames of this file references another file.  The left panel is the quick-index key, and will be called Presentation1-ix.htm.  The right panel displays the contents, and is called Presentation1-ct.htm.  Whatever name is chosen for the main file will be suffixed with -ix and -ct for index and contents.
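The naming convention for the three output files can be expressed as a small Python sketch. The function name `index_file_names` is hypothetical; only the -ix and -ct suffixes come from the description above.

```python
from pathlib import PurePath


def index_file_names(main_file):
    """Return (frame set, quick-index, contents) file names for a main file."""
    main = PurePath(main_file)
    return (
        str(main),
        str(main.with_name(f"{main.stem}-ix{main.suffix}")),
        str(main.with_name(f"{main.stem}-ct{main.suffix}")),
    )
```

Given `"Presentation1.htm"`, this returns `Presentation1.htm`, `Presentation1-ix.htm` and `Presentation1-ct.htm`.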

Analyzing the Index

The bulk of the work is in creating the various rules that will help create a usable index.  A skip-word is added by selecting the word in the "Words" list and clicking the → button to transfer a word from the Words Found list to the Words Skipped list.  The ← button will transfer a word back.  A transformation rule is created by selecting the desired transform in the Rule box (a dropdown selects the standard rules) and clicking the ← button, at which point a dialog will ask what form of the word should be detected to apply the rule, as described in the Transformation Rules section.

A Tour of the User Interface

The file management section

These represent the major operations: Open, Generate, Index

The Open button opens a .ppt file and reads all the lines.  This can also be accomplished by using the File>Open PowerPoint File... menu item.  When either of these options is used, you are given the choice of what parts of the file will be used for indexing.  The choices will be to use the text of the presentation, the special tags in the Notes section, or both.  The default is Both.

The Generate button analyzes the lines of the .ppt file and locates all the words.  It will not be enabled unless a presentation file has been loaded.

The Index button writes out an index based on the generated word table. This can also be invoked by the File>Write HTML Output File... menu item.  This button will not be enabled until a Generate has completed.

The Limit Slides check box allows you to limit the input to a range of slides as specified in the edit controls, which will be enabled if this option is selected.  This allows reading in a few slides of the presentation to work with just a particular section (or in my case, allowed me to quickly read a few slides to debug pieces of the program).  Note, however, that adding null words or generating rules when the entire context has not been read in can be misleading; you may eliminate words that are actually meaningful in other parts of the presentation.  The Limit Slides option in a multi-presentation indexing task (see *INCLUDE) is of questionable value, because it limits the slides only in the top-level presentation.  The settings are ignored in the included presentations.

The Stop icon is enabled whenever there is background processing.  Note that clicking the Stop button may not cause the program to stop immediately, because there may already be queued messages that must be processed; it merely stops the thread from queuing up more information to be processed.  Once the information has been processed in the background, you may have to wait for the foreground to absorb the data.

The words found in presentation display

After the Generate is performed, this shows the words and phrases found in the presentation.  Words and phrases found as a consequence of elements in the Notes section have a slightly different background.

The upper ListBox display shows the set of words that will appear in the index.  The (n) value indicates how many instances of the word appear in the index.  Note that this represents the number of instances of the word; if the only appearance of the word is where it appears six times on a single slide, it will have a count of (6).  However, in the output index the slide number will appear only once.
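The difference between the (n) instance count and the slide numbers that reach the index can be shown with a short Python sketch. The data layout (a list of word/slide pairs) and the name `summarize` are assumptions made for this example.

```python
from collections import defaultdict


def summarize(occurrences):
    """Map each word to (instance count, sorted unique slide numbers).

    `occurrences` is an iterable of (word, slide_number) pairs, one pair
    per appearance of the word.
    """
    counts = defaultdict(int)
    slides = defaultdict(set)
    for word, slide in occurrences:
        counts[word] += 1
        slides[word].add(slide)
    return {word: (counts[word], sorted(slides[word])) for word in counts}
```

A word that appears six times on slide 7 and nowhere else is summarized as a count of 6 with the single slide reference 7.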

If the word appears "too frequently", it may be useless as an indexing term.  The Overuse control sets the number of instances that count as "overuse".  In this example, it is set to 10, so the words Flag and Flags are shown in boldface and in a highlight color.  This makes it easy to find words that may simply be omitted because they would require more searching than would pay off (for example, in my one course, the word Address appears 262 times!)

To expedite the human analysis of the index, there are Fast Index buttons, which include buttons to move to the top and bottom of the ListBox, and the keys A-Z to index into the list.

The number of items in the list is shown in the top right, in this case, 5057 elements in the list.

The lower ListBox shows hyphenated phrases that have been found.  This is useful if compound hyphenation phrases will be added by a *HYPHEN transformation rule.

Although this picture does not show it, there is now a typein box at the top of this display.  This allows you to type in an initial phrase to quickly find it in the list.  However, for performance reasons, even if the Sync to PowerPoint option is selected, there is no attempt to synchronize on every keystroke.  The presentation will not be synchronized with the selection found by the typein until you move to some other control.  When the edit control loses focus, the presentation will be synchronized if the Sync to PowerPoint option has been selected.  The most common way this is done is to simply type the Tab key (therefore it is easy to use Shift+Tab to return to the typein window).

The PowerPoint Text Display

The text of the PowerPoint display is shown at the bottom of the window.  The number at the top left indicates how many lines there are in this display.  In this case, it is 13384.  Note that some lines may be empty, and there are multiple lines for each slide.

When a word is selected in the Words Found In Presentation display, the lines which contain the selected word are highlighted.  The currently-selected line is shown in a slightly darker color.  The selected word is also shown in boldface.

Double-clicking on a line in the PowerPoint Text Display will bring up the presentation in PowerPoint and position you on the slide.  In practice, the Slide column may tell you the slide number is 53, but the actual slide number you see might be different.  The reason is that since you started the indexing process you might have inserted or deleted some slides, so the slide number has changed.  However, the Indexer actually interfaces to your presentation in terms of the slide ID, an internal number which is unique to each slide.  It is assigned when the slide is created and is never reused.  So to keep the expected behavior, which is not to go to slide #53, but to go to the slide which contains the text shown, I use the slide ID, which is independent of the slide number.  Note that if you drag slides around, they retain their slide IDs, so will be found even if moved within the same presentation.
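Why the slide ID is the safer key can be seen in a small Python model. It does not call PowerPoint at all; the presentation is modeled as a plain list of slide IDs in their current order, and `current_slide_number` is a hypothetical helper. (In the PowerPoint object model the corresponding pieces are the SlideID property and the Slides.FindBySlideID method.)

```python
def current_slide_number(slide_ids_in_order, slide_id):
    """Return the 1-based position of the slide with `slide_id`, or None.

    `slide_ids_in_order` models the presentation as it is now, which may
    differ from its order when the text was read in.
    """
    try:
        return slide_ids_in_order.index(slide_id) + 1
    except ValueError:
        return None  # the slide has been deleted since indexing began


# When indexing began, the slide with ID 258 was slide 3.
before = [256, 257, 258]
# A new slide (ID 300) was later inserted ahead of it.
after = [256, 257, 300, 258]
```

Looking up ID 258 in `after` gives slide 4, the slide that actually holds the text, whereas jumping to the remembered slide number 3 would land on the newly inserted slide.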

In a two-monitor system this is highly effective because the PowerPoint Indexer can be on one screen and PowerPoint on another.  Here's a sample of the synchronization. In a single-monitor system, the "Floating button box" feature will be more convenient to use; see View > Floating Buttons.

In this example, the user has clicked on the #ifdef (10) line.  This creates a list of the 10 instances that appear in the presentation.  Then in the lower window, the text lines are all highlighted.  The Slide Locator window displays a list of these slides.  The buttons below the Slide Locator allow the user to move forward and backward in this list, and the corresponding line in the lower window will be highlighted in a slightly more emphatic color.  Double-clicking on the line (which is on slide 1534) will bring up the slide in PowerPoint.  In some cases, it is actually convenient to automatically couple the display in PowerPoint to each advance, and this can be done by checking the Sync to PowerPoint check box in the Slide Locator.

The Slide Locator

The Slide Locator provides a list of the slide numbers that contain the selected word.  These are the slide numbers that would appear in the finished index.

In addition, the four buttons at the bottom allow you to scan through the presentation to find all the instances of the word in the presentation.  The Beginning and End buttons move you to the first and last instances of the word.  The Next and Prev buttons move to the next or previous line that contains the selected word.  Note that this is not the next or previous slide, but the next or previous line.  Thus the Next button will move from the first to the second to the third instance of a word on a slide.

When the Sync to PowerPoint option is checked, then each time a new slide is selected, the slide will appear in PowerPoint.  This saves having to double-click each line in the text display window.

A progress bar is shown below the buttons to show the progress through the list of candidate lines.

 

Scanning for slides of interest

There are four sets of "scan" buttons.  These will move to the first slide of interest, previous slide of interest, next slide of interest, and last slide of interest.  Depending on the current state, any of these buttons, or all of these buttons, might be disabled.

One set is under the Slide Locator window.  It is described in the Slide Locator description.

One set is designated Errors.  If an error is found while reading the presentation, the line containing it will be selected.  If the Sync to PowerPoint option is checked, the presentation will be activated and the appropriate slide will be displayed.  If you hover the mouse over the line that displays the error, an explanation of the error will pop up.

One set is designated Notes.  If there are *TODO= directives in the presentation, these buttons will move from one of them to another.

One set is designated Contents-indexed.  The philosophy I use is to read in the presentation, drop out all the words I want skipped, and then use the resulting word list as a guideline to what I want to index.  What I do is find a slide, put a *SKIPCONTENTS=1 directive into it, and then index the terms I want using *PHRASE=.  Ultimately, I will have a *SKIPCONTENTS= in every slide, which is how I know that I've properly indexed the presentation.  I needed a way to find the few remaining slides that did not have *SKIPCONTENTS=1 in them.  The count of the number of lines that are contents-indexed appears in the list.
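Finding the slides that still lack the directive can be modeled in Python as below. The exact syntax of the Notes section is not spelled out here, so the sketch assumes one directive per line; the dictionary layout and the name `slides_not_contents_indexed` are also assumptions.

```python
def slides_not_contents_indexed(notes_by_slide):
    """Return the slide numbers whose notes lack a *SKIPCONTENTS=1 line.

    `notes_by_slide` maps slide number -> notes text.  Assumption: each
    directive sits on its own line in the notes.
    """
    remaining = []
    for slide, notes in sorted(notes_by_slide.items()):
        directives = {line.strip() for line in notes.splitlines()}
        if "*SKIPCONTENTS=1" not in directives:
            remaining.append(slide)
    return remaining
```

When the returned list is empty, every slide has been visited and indexed by hand.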

The Floating Button Box

In a single-monitor system, it is very inconvenient to have to keep swapping the program and PowerPoint back and forth to scan for slides of interest.  To simplify this, a "floating button box" allows for easy perusal of the presentation.  This is selected from View > Floating Buttons or by using the Launch Floating Buttons button.

The floating button box looks like this:

This window is "always on top" and therefore will always be available.   Unlike the main display, which has four buttons for each category it supports, this little control has only four buttons and you select the category from the dropdown list.  In the case where there is an error or other annotation, the text of the error or annotation is in another little floating window.  The position of the text adjusts if the button box is near a window border so the entire string is visible on the monitor that holds the button box, although I made no effort to deal with a button box that itself is split across monitors.

The "Hide Indexer" button stops the indexer main screen from popping up every time a control is activated in the floating button box.  This is visually very annoying if the button box is used in a single-monitor system.  However, in a dual-monitor system, it is still convenient to use the button box on the primary monitor, but there is no need to hide the indexer application, so I made this an explicit user action to hide it.

The Skip Words List

The Words to Skip list contains the list of all character sequences that will not appear in the index.

The number on the top right of the control indicates the number of words in the skip list.

The * appears at the top left to indicate that the contents of the list have been modified, but have not been saved.  They can be saved using the Save button.  This will save the contents to the current .nul file.  If there is no current .nul file, you will be prompted for a filename.

You can also use the File>Save Null Words menu item, which does the same thing as the button, or the File>Save Null Words As... menu item to change the name of the .nul file.

Words can be added manually, although this is uncommon, by typing the word into the top left edit control, and clicking the Add button.  A word can be removed entirely from the list by selecting the word (or words) to be deleted, and clicking the Remove button.

The entire contents of the control can be removed by clicking the Remove All button.  Note that saving the file will destroy the contents of the existing file.

Transferring words between the two lists

The two arrows between the Words Found list and the Skip Words list will transfer words back and forth between the two lists. 

Because it is so common to want to transfer words to the null-word list, a double-click of a word will do this.

You can select more than one term in the null-word list and click the right arrow to transfer them all back across.

Permutation/Phrase creation

When the *PHRASE or *PERMUTE option is selected in the rules dropdown, the button between the index terms and the rules is given a caption to indicate which was selected.  When the button is clicked, a dialog pops up that allows you to type the specification of a phrase to be indexed or a permutation rule to be applied.

Note that although it also appears below the arrow button, this is redundant, because the arrow will never be enabled.

The Transformation Rules Display

This displays the selected words that will have transformations applied.

The Rule dropdown allows you to select which rule is applied.  In addition to the preset rules, a specific replacement rule string can be typed in.

The * at the top left corner of the list indicates that the rule list has been modified and has not yet been saved.  The rules can be saved by clicking the Save button.  Alternatively, the File>Save Rules File menu item will save it as the default file name. If no default file name is given, you will be prompted for a new file name.  The File>Save Rules File As... will always prompt for a name.

A rules file can be loaded by using the File>Open Rules File menu item.  This will delete all the current rules, and replace them with the contents of the file.

The upper arrow will use the selected word from the Words Found list to create a rule in the Transformation Rules list.  The rule that applies is the rule shown below the arrow (this is a duplication of the word in the Rules box, but is conveniently placed for visibility).  The operation will prompt for the desired Word appearance.

The lower arrow will use the selected hyphenated phrase from the Hyphenated Words list to create a rule in the Transformation Rules list.  The operation will prompt for the desired Word appearance.

Rules are added to the end of the list in the order they are created.  However, clicking on either of the headers will sort the list by either Word or Rule.  The rules file will be written out in whatever order they are currently sorted.

Clicking on the column heading will sort the column alphabetically.  The Rule column will sub-sort the Word column alphabetically.

The small box is the delete-element box.  If an entry is selected in the list box, the box changes appearance, and clicking it will delete the rule.

Note that there is currently no "undo" capability when rules are deleted.  Once they are deleted, they are gone.

To create a replacement rule, where an indexed word is replaced by another string entirely, select the desired word in the word list, then type the desired replacement word into the Rule window.  The text of the rule will appear below the arrow.  The result of this will create a rule of the form
debugport=/debugport

These rules are always an exact match basis; there is no automatic capitalized version created.

The Notes Editor

It is tedious to have to retype things in the Speaker's Notes section when the computer already has most of the information.  The Notes Editor section allows you to easily add new entries to the Notes rules.  These entries will always be appended to the Notes section, so you will have to change the order by manual editing of the PowerPoint presentation if the order matters.

Whenever a word or phrase is clicked in the "Words found in presentation" box, it is transferred to the edit control of the Append to Notes area.

For the controls in this area to be enabled, there must at least be a currently-selected slide.  See the Sync to PowerPoint option to make this automatic.  Otherwise double-click on a line of the text display to select that slide.

 

The Transformation Rule Prompt

When a transfer is done, a set of canonical representations is chosen.  The selected representation determines the matching rules applied to the input word to determine if the transformation should take place.

In accordance with the discussion of case sensitivity of the rules (see below), a word which appears entirely in lowercase will automatically generate a Capitalized replacement rule.  This is indicated in this display by showing both rules.  Not all transformations allow the creation of a Capitalized rule, and for those transformations, the alternate will not be displayed.

 The first selection is the word as it appears in the Words Found list. 

The second selection is the word as it is capitalized. 

The third selection is the word as it is entirely in uppercase.

The fourth selection is the word as it appears entirely in lowercase.  If the rule allows an all-lower-case word to have a Capitalized alternate, this is shown.

The last entry allows you to specify any other casification or pattern you want.

Phrase and Permutation Prompts

When it is desirable to index a permutation, the *PERMUTE option can be selected, and this enables the button.  When the button is clicked, a dialog box is shown.  The phrase to be permuted can be typed in.

The phrase must be typed in as it appears in the document.  The lines in the edit control show how it will be indexed.  The match is always case-insensitive, and upper-case letters cannot be typed into the control.

The OK button will not be enabled until at least two words are typed.  It will be disabled if more than three components (hyphenated phrases count as a single "component") are typed in.

When a phrase is to be indexed without permutations, the *PHRASE option can be selected.  The button will bring up the dialog shown.  The phrase to be indexed can be typed in. 

The phrase must be typed in as it appears in the document.  The match is always case-insensitive, and upper-case letters cannot be typed into the control.

The OK button will not be enabled until at least two words are typed.  The phrase can contain as many words as needed.
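The enabling conditions for the OK button in the two dialogs can be summarized in a Python sketch. Splitting on whitespace is an assumption about how components are counted; it has the effect that a hyphenated phrase counts as one component, as described above. `ok_enabled` is a hypothetical name.

```python
def ok_enabled(text, permute):
    """Decide whether OK should be enabled for the typed phrase.

    permute=True models the *PERMUTE dialog (2 or 3 components);
    permute=False models the *PHRASE dialog (2 or more components).
    """
    components = text.split()  # "copy-on-write" stays a single component
    if permute:
        return 2 <= len(components) <= 3
    return len(components) >= 2
```

For example, "copy-on-write page" is accepted by both dialogs, while a four-word phrase is accepted only for *PHRASE.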

The Save Modified Presentations dialog

Because it is possible to have multiple presentations indexed, multiple presentations can be modified.  If a presentation is modified, when an attempt is made to close out the current indexing task (or exit the program), you will be prompted to save any changed presentations.  Note that if you do not save them here, you will have the option of saving them from PowerPoint, but that is often less convenient because PowerPoint will prompt itself.

The Save modified presentations dialog lists all the presentations that have changed.

The Save button will cause all the selected presentations to be saved.  The Don't Save button will not bother to save the presentations but continue the operation that caused this dialog to pop up (such as opening a new indexing task or exiting the application).  The Cancel button will cancel the operation, returning you to the point where you started the operation that requested the saving.

To simplify setting or clearing options, the Select All and Unselect All buttons will check or uncheck all the names that appear in the list box.

The dialog can be resized to show longer names or more names.

Note: there seems to be some strangeness about the PowerPoint automation interface; it erroneously returns TRUE to indicate that a presentation has changed even though the presentation has already been saved explicitly in PowerPoint.  I have no idea why it does this, so sometimes this prompt is spurious.

Transformation Rules

The rules that I've created are directed towards English-language presentations.  However, the set is extensible, and for any language for which there can be a consistent set of transformations from past to present forms or plural to singular forms, the set can be easily extended.

The rules file is a text file, whose syntax is as shown in the Example rule column.

The rules are normally case-sensitive, and are applied when a word is found.  In the absence of a transformation rule, the hardwired transformation rules discussed below are applied.  However, to limit the number of rules typed for common cases, the *DROP rules have a specific generalization: if the first letter is lowercase and there are no uppercase letters in the string, an alternative form with an uppercase letter is generated.  This handles the situation in many languages similar to English where the first word of a sentence is capitalized.  It would not be necessary for a language like German where nouns are uniformly capitalized.
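The capitalization shortcut described above can be sketched in a few lines.  This is only an illustration in Python, not the indexer's own code (which is an MFC project); the function name expand_rule_patterns is made up for the example, and it assumes plain ASCII words.

```python
def expand_rule_patterns(pattern):
    """Return the word patterns that one rule line would match.

    Hypothetical helper that mirrors the described shortcut: a pattern
    that starts with a lowercase letter and contains no uppercase
    letters also gets a variant with its first letter uppercased.
    Any other pattern matches only itself.
    """
    first = pattern[:1]
    if first.islower() and not any(ch.isupper() for ch in pattern):
        return [pattern, first.upper() + pattern[1:]]
    return [pattern]
```

For example, expand_rule_patterns("iced") yields both "iced" and "Iced", while expand_rule_patterns("Iced") yields only "Iced".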

Rule Example rule input result notes
*DROP_D iced=*DROP_D iced
Iced
ice
Ice
If the first character is a lower-case letter, and there are no other uppercase letters in the word, a secondary rule is created that has the first letter uppercased.
Iced=*DROP_D iced
Iced
iced
Ice
If the first character is an upper-case letter, no additional rule is generated.
*DROP_ES suffixes=*DROP_ES suffixes
Suffixes
suffix
Suffix
If the first character is a lower-case letter, and there are no other uppercase letters in the word, a secondary rule is created that has the first letter uppercased.
Suffixes=*DROP_ES suffixes
Suffixes
suffixes
Suffix
If the first character is an upper-case letter, no additional rule is generated.
*DROP_ED worked=*DROP_ED worked
Worked
work
Work
If the first character is a lower-case letter, and there are no other uppercase letters in the word, a secondary rule is created that has the first letter uppercased.
Worked=*DROP_ED worked
Worked
worked
Work
If the first character is an upper-case letter, no additional rule is generated.
*DROP_ING working=*DROP_ING working
Working
work
Work
If the first character is a lower-case letter, and there are no other uppercase letters in the word, a secondary rule is created that has the first letter uppercased
dropping=*DROP_ING dropping
Dropping
drop
Drop
If the result of applying this rule is to create a word whose two last letters are the same, the duplicate letter is dropped.
Working=*DROP_ING working
Working
working
Work
If the first character is an upper-case letter, no additional rule is generated
*DROP_LY normally=*DROP_LY normally
Normally
normal
Normal
If the result of applying this rule is to create a word whose two last letters are the same, the duplicate letter is dropped.
Normally=*DROP_LY normally
Normally
normally
Normal
If the first character is an upper-case letter, no additional rule is generated
*DROP_S things=*DROP_S things
Things
thing
Thing
If the first character is a lower-case letter, and there are no other uppercase letters in the word, a secondary rule is created that has the first letter uppercased.
Things=*DROP_S things
Things
things
Thing
If the first character is an upper-case letter, no additional rule is generated.
*HEXWORD add=*HEXWORD add
Add
ADD
add
Add
(word is dropped)
Allows the word, which might be misinterpreted as a hex value, to be considered as a word.  Note that if the pattern is all lowercase, it applies to both lowercase and capitalized versions.  To get the all-upper-case version, an all-upper-case rule must also be added.
Add=*HEXWORD add
Add
ADD
(word is dropped)
Add
(word is dropped)
If the pattern is capitalized, only the capitalized version will be recognized.  To get the all-upper-case version, an all-upper-case rule must also be added.
ADD=*HEXWORD add
Add
ADD
(word is dropped)
(word is dropped)
ADD
This form of rule only allows an all-upper-case version to be recognized.  To get both capitalized and all-lower-case recognized, an all-lower-case rule must be added.  To get capitalized-only to be recognized, a capitalized rule must be added.
*HYPHEN common-things=*HYPHEN common-things
Common-things
Common-things
Common-things
Does not treat hyphen as a delimiter between two words, but treats the hyphenated phrase as if it were a single word.  If the first character is a lower-case letter, and there are no other uppercase letters in the hyphenated phrases, a secondary rule is created that has the first letter uppercased.

Note that individual words of a hyphenated phrase are not indexed independently.

Common-things=*HYPHEN common-things
Common-things
common things
Common-things
If the first letter is upper-case, the rule only matches exactly; the first case, which is not an exact match, is now treated as two separate words
*IES_TO_Y tries=*IES_TO_Y tries
Tries
try
Try
If the first character is a lower-case letter, and there are no other uppercase letters in the word, a secondary rule is created that has the first letter uppercased
Tries=*IES_TO_Y tries
Tries
tries
Try
If the first character is an upper-case letter, no additional rule is generated
*NORMAL cat=*NORMAL cat Cat Note this is indistinguishable from the default behavior; it is more interesting in other cases
caT, cAt, cAT, Cat, CaT, CAt, CAT  caT, cAt, cAT, Cat, CaT, CAt, CAT The match must be exact.
cAt=*NORMAL cAt Cat This handles some weird cases that arise from programming.
cat, caT, cAT, Cat, CaT, CAt, CAT cat, caT, cAT, Cat, CaT, CAt, CAT  
CAT=*NORMAL CAT Cat Particularly useful for converting all-caps letters in titles and section headings to indexable entities.
*PERMUTE memory allocation=*PERMUTE memory allocation
Memory allocation
Memory Allocation
MEMORY ALLOCATION
Memory allocation
Allocation, memory
Rule must be all lower-case, but does a case-independent match. Allows multiword phrases to be recognized and indexed under multiple permutations. Note that only two-word and three-word permutations are currently supported.

Note that all words in the permutation will still be individually indexed, unless they are added to the null-word list.

Permutations may also be supplied in Notes Rules.

finite state machine=*PERMUTE finite state machine
Finite state machine
FINITE STATE MACHINE
Finite state machine
State machine, finite
Machine, finite state
*PHRASE memory allocation=*PHRASE memory allocation
Memory allocation
Memory Allocation
MEMORY ALLOCATION
Memory allocation Rule must be all lower-case, but does a case-independent match. Allows multiword phrases to be recognized and indexed.

Note that all words in the phrase will still be individually indexed, unless they are added to the null-word list

Phrases may also be supplied in Notes Rules.

*TOUPPER cat=*TOUPPER cat CAT An exact match is required
caT, cAt, cAT, Cat, CaT, CAt, CAT caT, cAt, cAT, Cat, CaT, CAt, CAT Strings which are not exact matches are ignored
cAt=*TOUPPER cAt CAT  
cat, caT, cAt, Cat, CaT, CAt, CAT cat, caT, cAt, Cat, CaT, CAt, CAT  
*TOLOWER CAT=*TOLOWER CAT cat  
cat, caT, cAt, cAT, Cat, CaT, CAt cat, caT, cAt, cAT, Cat, CaT, CAt  
cAt=*TOLOWER cAt cat  
cat, caT, cAT, Cat, CaT, CAt, CAT cat, caT, cAT, Cat, CaT, CAt, CAT  
*UNCHANGED strcpy=*UNCHANGED strcpy strcpy Requires exact match; the text is unchanged
replacement rule running=run running
Running
run
Running
Requires an exact match
Running=run running
Running
running
run
Note that the replacement is exact, there is no case change in the replacement
Running=Run running
Running
running
Run
 
default behavior this is what happens if no rule is found.  This behavior is hard-wired into the program. anything
Anything
Anything
Anything
A word which is not otherwise handled will have its first letter uppercased, providing all the remaining symbols are lower-case letters
anyThing anyThing Unchanged, has uppercase letter within it
word322 word322 Unchanged, has digits within it
some_program_name some_program_name Unchanged, has non-letter within it
www.flounder.com www.flounder.com Unchanged, has non-letter within it
’s file's
1980’s
file
1980
English possessive rule, and in odd cases, an English plural rule (formally, non-words which are plural should have ’s, for example, LPT’s is the official plural of LPT, although I think this convention is silly and I ignore it)
've
'd
't
I've
Should've
can't
not indexed

Note that if for some reason it was desirable to get the effect that things=>thing but Things should not be transformed, the *DROP_S rule would not work, but an exact replacement rule, things=thing, would accomplish the task.
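As a rough illustration of the suffix rules and the default behavior in the table above, here is a Python sketch.  It is not the indexer's implementation; the names SUFFIX_RULES, apply_suffix_rule, and default_form are invented for the example, and the sketch assumes the rule has already been matched to the word.

```python
# Suffix to remove and text to put in its place, per rule keyword.
SUFFIX_RULES = {
    "*DROP_D": ("d", ""),
    "*DROP_ES": ("es", ""),
    "*DROP_ED": ("ed", ""),
    "*DROP_ING": ("ing", ""),
    "*DROP_LY": ("ly", ""),
    "*DROP_S": ("s", ""),
    "*IES_TO_Y": ("ies", "y"),
}

# Rules whose notes say a doubled final letter is collapsed afterwards.
COLLAPSE_DOUBLE = {"*DROP_ING", "*DROP_LY"}


def apply_suffix_rule(rule, word):
    """Apply one suffix rule to a word, leaving its case alone."""
    suffix, replacement = SUFFIX_RULES[rule]
    if not word.endswith(suffix):
        return word
    stem = word[: -len(suffix)] + replacement
    if rule in COLLAPSE_DOUBLE and len(stem) >= 2 and stem[-1] == stem[-2]:
        stem = stem[:-1]  # "dropp" -> "drop"
    return stem


def default_form(word):
    """Default behavior when no rule applies (ASCII words assumed)."""
    rest = word[1:]
    if word[:1].isalpha() and rest.isalpha() and rest.islower():
        return word[0].upper() + rest
    return word  # inner uppercase, digits, or non-letters: unchanged
```

With these definitions, apply_suffix_rule("*DROP_ING", "dropping") gives "drop", and default_form("anyThing") leaves the word unchanged.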

Character Rules (1)

^K ('\x0B', chr$(11)) Converted to a space.  This is the shift-return character in multiline entries.
character drops The following characters are ignored, and are treated as if they are whitespace.  Their presence in a word splits the word into two pieces.  Note that this prevents the use of hyphenated words such as pre-computed, all-encompassing, and so on
    ~ ! $ % ^ & * ( ) + = [ ] { } | ; " ' < > ? , . … “ ” ‘ ’
character trims The following characters will be trimmed from the beginning or end of words
    . space ...
  The following characters will be trimmed from the end of words
    space

Integrity-preserving rules

At this point, the resulting word is tested to see if it starts with "http:", "https:", "d:" for any lowercase or uppercase letter d, or "www.".  If it does, it is treated as a single word.  Note that this does not permit ( or ) inside a URL, but only profoundly antisocial Web sites use those.  If it is not a URL, the following rules are applied.

Example phrase Effect
http://here/there.htm phrase is unchanged
me@company-name.com phrase is unchanged
https://secure.pages/secret.htm phrase is unchanged
m:\filepath\name.ext phrase is unchanged
\\name\dev phrase is unchanged
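The prefix test described above might look like the following Python sketch.  It covers only the four prefixes named in the text and treats "http", "https", and "www" as lowercase, which is an assumption.  The table also lists an e-mail address and a UNC path as unchanged; the text does not say which test catches those, so the sketch leaves them out.  The name is_kept_whole is invented for the example.

```python
import re

# "http:", "https:", a single drive letter followed by ":", or "www."
_WHOLE_WORD_PREFIX = re.compile(r"^(?:https?:|[A-Za-z]:|www\.)")


def is_kept_whole(token):
    """True if the token starts with one of the prefixes named in the text."""
    return bool(_WHOLE_WORD_PREFIX.match(token))
```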

Character Rules (2)

If the integrity rules were not applied, the following rules are applied:

Rule Example/effect
character drops The following characters are ignored and treated as if they are whitespace
  : / - \x96 \x97 \\
  \x96 is an en-dash, and \x97 is an em-dash
string length < 2 Single-character results are not indexed
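A small Python sketch of this second pass, under the assumption that \x96 and \x97 are the Windows-1252 codes for the Unicode en-dash (U+2013) and em-dash (U+2014).  The function name split_plain_token is hypothetical.

```python
import re

# Colon, slash, hyphen, en-dash, em-dash, and backslash act as whitespace.
_SEPARATORS = re.compile(r"[:/\-\u2013\u2014\\]")


def split_plain_token(token):
    """Split a non-URL token on the separator characters and
    discard pieces shorter than two characters."""
    pieces = _SEPARATORS.split(token)
    return [piece for piece in pieces if len(piece) >= 2]
```

For example, split_plain_token("read/write") gives the two words "read" and "write", and the single-character pieces of "a-b" are both discarded.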

Drop rules

The drop rules are applied after the above canonicalization.  This is important because otherwise there would be the need to create drop rules for singular and plural forms, various tenses, and so on.

The goal here was to deal with PowerPoint presentations of programs.  So the drop rules are somewhat ad hoc for this purpose. 

Rule Example Interpretation
numbers 123
123L
123l
123.
0.0
.123
123E+14
-123e-12
A sequence of digits, a decimal number, and a floating-point number are all rejected.  A decimal number that ends in lower-case or upper-case L is considered to be a number.
(currently, U/u suffixes are not supported)
hex numbers 0x Any string starting with 0x or 0X is not indexed
bc07 A string that looks like a hex number is not indexed (see note)
bc07L A hex number that ends with an upper-case or lower-case L.
0h A sequence of hex digits ending in h is not indexed
English words or computer terms that might be misinterpreted as hex numbers will be dropped, unless there is a *HEXWORD exception to them.  This includes words like "cafe", "face", "add", and "ACL".  The latter may come as a surprise, but it is the hex value AC followed by the Long constant indicator.
’t (suffix) hadn't Handles English contractions can't, won't, shouldn't, and so on
’ve (suffix) could've Handles English contractions should've, could've, and so on
I’ (prefix) I'm Handles English contractions on I, such as I'm, I'd, I'll
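The number and hex tests in the table can be approximated with two regular expressions.  This Python sketch is a guess at the behavior from the examples given, not the program's actual logic; is_dropped and its hexwords parameter are invented names, and the *HEXWORD case rules are simplified to an exact-match set.

```python
import re

# Integers, decimals, and floating-point forms, with an optional L suffix.
_NUMBER = re.compile(r"^[+-]?(?:\d+\.?\d*|\.\d+)(?:[eE][+-]?\d+)?[lL]?$")

# Anything starting with 0x/0X, hex digits with an optional L suffix,
# or hex digits followed by a trailing h.
_HEX = re.compile(r"^(?:0[xX].*|[0-9a-fA-F]+[lL]?|[0-9a-fA-F]+h)$")


def is_dropped(word, hexwords=frozenset()):
    """True if the word looks like a number or a hex value.

    Words listed in hexwords (standing in for *HEXWORD rules) are kept
    even though they look like hex values.
    """
    if _NUMBER.match(word):
        return True
    if word in hexwords:
        return False
    return bool(_HEX.match(word))
```

Note that under these rules "ACL" is dropped unless it is listed as an exception, while "100MB" survives because M is not a hex digit.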

Sorting rules

Although rules are matched case-sensitively, a case-sensitive sort would produce a poor alphabetical index: it would be ordered A..Za..z, so "thing" would land in an entirely different place than "Thing".  That is not what users expect, so the index-sorting algorithm is case-independent.

If the word starts with a number (but is not just a number, which would already have been dropped), it is sorted by its numeric value first, and then alphabetically by its suffix.  Thus the sequence that would sort alphabetically as

100MB
10ms
1GB
1ms
20KHz
2GB
33rpm
4.7MHz

will actually sort in increasing numerical order, as

1GB
1ms
2GB
4.7MHz
10ms
20KHz
33rpm
100MB
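One way to get this ordering is a sort key that separates the leading number from the suffix.  The Python sketch below is an illustration only; index_sort_key is a made-up name, and placing number-prefixed words ahead of ordinary words is an assumption, since the text does not say where they fall.

```python
import re

_LEADING_NUMBER = re.compile(r"^(\d+(?:\.\d+)?)(.*)$")


def index_sort_key(word):
    """Sort key: numeric prefix by value, everything else case-independently."""
    match = _LEADING_NUMBER.match(word)
    if match:
        # Assumption: number-prefixed words come before ordinary words.
        return (0, float(match.group(1)), match.group(2).lower())
    return (1, 0.0, word.lower())


words = ["100MB", "10ms", "1GB", "1ms", "20KHz", "2GB", "33rpm", "4.7MHz"]
ordered = sorted(words, key=index_sort_key)
```

Here ordered comes out in the increasing numerical order shown above, and ordinary words such as "thing" and "Thing" compare equal apart from case.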

Notes Rules

These rules appear in the Slide Notes of a slide.  They may appear with other lines; if you are using the notes, you can put these at the end.  The actual notes contents are ignored, unless a line starts with a * and has one of the keywords shown in the table, followed by an equal sign.  There cannot be any spaces between the * and the keyword, or between the keyword and the =.

By using this, you can put these indexing marks after your Speaker Notes, if you use them, and therefore they will not interfere with your use of notes.  By requiring the use of the * as the first character in the line (it must not be preceded by a space or any other character), it is unlikely that your notes will "accidentally" contain something that looks like one of these directives.

Note the default values are established at the time when the presentation is opened using the settings *CONTENTS=1 and *NOTES=1, unless you override the setting explicitly in the File Open dialog.
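The line format described above is simple enough to recognize with a few string operations.  This Python sketch is illustrative only: parse_notes_line and NOTES_KEYWORDS are invented names, the keyword list is taken from the table in this section, and requiring upper-case keywords is an assumption.

```python
NOTES_KEYWORDS = {
    "ALIAS", "CONTENTS", "HEADING1", "HEADING2", "HEADING3", "INCLUDE",
    "NOTES", "PHRASE", "PERMUTE", "SKIPCONTENTS", "TODO", "USECONTENTS",
}


def parse_notes_line(line):
    """Return (keyword, value) for a directive line, or None for ordinary notes."""
    if not line.startswith("*"):
        return None  # the * must be the very first character
    keyword, equals, value = line[1:].partition("=")
    if not equals or keyword not in NOTES_KEYWORDS:
        return None  # also rejects spaces around the keyword
    return keyword, value
```

Ordinary speaker notes fall through as None, so they can be mixed freely with directive lines.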

Keyword Example Appears in index as Notes
*ALIAS=name *ALIAS=book [book] 1, 2, 7-9 Reports the current file as the specified alias in the index
*CONTENTS=0|1 *CONTENTS=0   Turns off content indexing for the current slide and all subsequent slides
*CONTENTS=1   Turns on contents indexing for the current slide and all subsequent slides
*HEADING1=text *HEADING1=Introduction Table of contents 1st-level entry These will be processed even if *USECONTENTS=0
*HEADING2=text *HEADING2=Detail Table of contents 2nd-level entry
*HEADING3=text *HEADING3=Subdetail Table of contents 3rd-level entry
*INCLUDE=filename *INCLUDE=something.ppt [something] 1, 2, 7-9 Indexes the specified file
*INCLUDE=filename>alias *INCLUDE=c:\presentations\something.ppt>main [main] 1, 2, 7-9 Uses the alias name for display of the index information
*NOTES=0|1 *NOTES=0   Turns off notes processing starting at this point in the notes and for all subsequent slides
This does not affect *TODO, *HEADING1 and *HEADING2 elements
*NOTES=1   Turns on notes processing starting at this point in the notes and for all subsequent slides
*PHRASE=term *PHRASE=Any old phrase Any old phrase The term is indexed exactly as shown; there is no case transformation.  This is similar to the *PHRASE rule.
*PHRASE=Thing, Large Thing
    Large
Any phrase separated by commas indicates a multilevel index item
*PHRASE=Cat,, Little Gray, nice Cat, Little Gray
    nice
A double comma causes the comma to be treated as an ordinary character, not as a level separator
*PERMUTE=phrase *PERMUTE=Conditional compilation Compilation
    conditional
Conditional compilation
The terms are normalized according to the rules of permutation.  Note that case normalization rules apply as for the *PERMUTE keyword.

As with the *PERMUTE rule, only two-word and three-word phrases are handled; permutations will be case-transformed according to the same rules used for the *PERMUTE rule.

*SKIPCONTENTS=0|1 *SKIPCONTENTS=1   Does not index the contents of this slide only.  Upon the completion of the slide, the contents flag is reset to the value decided by the most recent *CONTENTS= or set by the File Open dialog, whatever is prevailing.
*SKIPCONTENTS=0   Accepted but has no meaning
*TODO=text *TODO=Check consistency of 'Threads' Note added to display This will be processed even if *USENOTES=0
*USECONTENTS=0|1 *USECONTENTS=1   Forces indexing the contents of this slide only.  Upon completion of the slide, the contents flag is reset to the value decided by the most recent *CONTENTS= or set by the File Open dialog, whatever is prevailing.
*USECONTENTS=0   Accepted but has no meaning

These rules can be added directly from the Indexer by the Append to Notes controls in the Notes Editor section.
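The comma handling of *PHRASE= in the table above (single commas separate index levels, a double comma is a literal comma) can be sketched as follows.  This is a Python illustration with an invented function name, split_index_levels; trimming the spaces around each level is an assumption based on how the examples are displayed.

```python
def split_index_levels(term):
    """Split a *PHRASE= term into index levels.

    A single comma separates levels; a double comma stands for one
    literal comma inside a level.
    """
    placeholder = "\x00"  # a character that will not occur in notes text
    protected = term.replace(",,", placeholder)
    return [
        part.strip().replace(placeholder, ",")
        for part in protected.split(",")
    ]
```

For example, "Cat,, Little Gray, nice" becomes the two levels "Cat, Little Gray" and "nice".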

When a line is displayed in the table that was added by a Notes Rule, it is highlighted in a different color.
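The entries that a *PERMUTE rule generates, as listed in the Transformation Rules table, follow a simple rotation pattern.  The Python sketch below reproduces the documented examples; it is not the indexer's code, and permute_phrase is a hypothetical name.

```python
def permute_phrase(phrase):
    """Return the index entries for a two-word or three-word phrase.

    Each entry starts at a different word; the words that were skipped
    are appended after a comma.  Only the first letter is uppercased.
    """
    words = phrase.lower().split()
    if not 2 <= len(words) <= 3:
        raise ValueError("only two-word and three-word phrases are supported")
    entries = []
    for start in range(len(words)):
        entry = " ".join(words[start:])
        if start:
            entry += ", " + " ".join(words[:start])
        entries.append(entry[0].upper() + entry[1:])
    return entries
```

So permute_phrase("finite state machine") produces "Finite state machine", "State machine, finite", and "Machine, finite state".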

Multi-file indexing: *INCLUDE=

When multiple PowerPoint presentations exist, it is nice to be able to index all of them together.  This allows for multiple presentations to be indexed with a single index, and also encourages uniform indexing across presentations even if they are indexed separately.

I thought of several approaches to this, but finally settled on the notion of using a directive, *INCLUDE=filename, which would appear in the Notes section.  This means that to index a collection of presentations, it may be necessary to create a "master" PowerPoint presentation which has exactly one slide, which contains a set of *INCLUDE directives to the other presentations.

An attempt to create a circular *INCLUDE dependency will be detected, and the *INCLUDE that would result in circular inclusion will be ignored.  It will be flagged as an error line.
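A common way to detect a circular inclusion is to keep track of the chain of files currently being processed.  The Python sketch below shows that idea under simplifying assumptions: the includes_of dictionary stands in for reading the *INCLUDE directives out of each presentation, and the names index_order and includes_of are invented for the example.

```python
def index_order(master, includes_of):
    """Walk the include tree from the master file.

    Returns (visited, errors): the files in the order they are
    processed, and the (including_file, included_file) pairs that were
    ignored because they would have closed a cycle.
    """
    visited, errors = [], []

    def visit(name, chain):
        chain = chain + (name,)  # files currently open, outermost first
        visited.append(name)
        for child in includes_of.get(name, []):
            if child in chain:
                errors.append((name, child))  # flag it and skip it
                continue
            visit(child, chain)

    visit(master, ())
    return visited, errors
```

If lab.ppt tried to include master.ppt while master.ppt was including lab.ppt, that one directive would be reported and skipped, and everything else would still be indexed.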

Specifying file names

A filename may be specified as an absolute path, a relative path, or with no path.

Path type Example File name actually used
Absolute path name c:\path\whatever.ppt c:\path\whatever.ppt
  \path\whatever.ppt \path\whatever.ppt
  \\servername\sharename\path\whatever.ppt \\servername\sharename\path\whatever.ppt
Relative path name ..\whatever.ppt A name based on the directory in which the master presentation is found; if we started in
c:\mypresentations\someevent\master.ppt
then the resulting path would be
c:\mypresentations\someevent\..\whatever.ppt

If, for some reason, the path of the master presentation is empty, the current working directory is substituted. 

Plain file name whatever.ppt
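The path handling in the table can be sketched in Python using the standard ntpath module, which applies Windows path rules on any platform.  The function name resolve_include and its parameters are invented for the example.  The table does not say what is done with a plain file name; this sketch treats it like a relative path, purely as an assumption.

```python
import ntpath


def resolve_include(name, master_path, cwd):
    """Return the file name that would actually be opened.

    name        -- the file name from the *INCLUDE= directive
    master_path -- full path of the presentation containing the directive
    cwd         -- current working directory, used if master_path has no directory
    """
    # Drive-letter, rooted, and UNC names are used exactly as written.
    if name.startswith("\\") or name[1:2] == ":":
        return name
    base = ntpath.dirname(master_path) or cwd
    return ntpath.join(base, name)
```

Note that the ".." is not collapsed, which matches the resulting path shown in the table.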

The main Display

The main display has several interesting items displayed when there are multiple files.

Note in the Slide Locator, the alias names are displayed before the sequences of related slide numbers.  In the bottom display, the alias names are listed in the left column of the display.  The File Finder control is active, and will reflect what file is selected by displaying its alias name, but also, selecting an alias name in this box will position you at the first slide for that file.

The File Finder

This lists all the files which have been indexed.  It is disabled if only one file has been indexed (no *INCLUDE directives).  However, when multiple presentations are being indexed, selecting a name here (the alias name) scrolls the index window to the first line indexed for that presentation.

Aliases

File names can be very long.  When the page numbers are shown in the index, they will be preceded by the "alias name".  By default, the alias name is the filename part of the file (excludes the path and the .ppt).  However, even a long file name will be inconvenient to have in the index.  Therefore, a file can have a "short" name, an alias.

The alias can be supplied by two different methods.  One can be when the *INCLUDE= directive is given.  The syntax is

*INCLUDE=filename>alias

and any reference to the filename that is needed in the index uses the alias.

The other method is to supply, in a file, the directive

*ALIAS=aliasname

An error will be generated if there is an attempt to assign more than one alias to a file, and only the first alias will be used.
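The alias rules above can be summarized in a short Python sketch.  The helper names parse_include and assign_alias are hypothetical, and whether spaces around the > are trimmed is not stated in the text, so the sketch leaves them alone.

```python
import ntpath


def parse_include(value):
    """Split the text after *INCLUDE= into (filename, alias).

    Without an explicit alias, the default is the file name with the
    path and the extension removed.
    """
    filename, _, alias = value.partition(">")
    if not alias:
        alias = ntpath.splitext(ntpath.basename(filename))[0]
    return filename, alias


def assign_alias(aliases, filename, alias):
    """Record an alias for a file; only the first one is kept.

    Returns False (an error to report) if the file already has an alias.
    """
    if filename in aliases:
        return False
    aliases[filename] = alias
    return True
```

For example, parse_include("XP Driver Course 14.ppt>text") gives the alias "text", while a bare c:\presentations\something.ppt falls back to "something".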

The "file" column listed in the bottom list control is the "alias name".  However, hovering the mouse over the alias name will display the complete file name as a flyover help window.

The progress dialog

With the notion of nested files, a single progress bar no longer captures what is happening. Therefore, I created a modeless progress dialog that shows the progress through nested files.  This shows that the one-and-only slide of the master PowerPoint index presentation has been read, but we are only partway through the detailed study of one of the included files.  If the included file contains another included file, then a third progress bar would be shown showing the nested progress.

The index

The index is decorated with the name of the file, or its alias, to indicate which presentation the words are found in.  For example, one of my presentations comes with a text and supplemental lab notes.  Both are indexed by doing

*INCLUDE=XP Driver Course 14.ppt>text
*INCLUDE=XP Driver Course Lab 14.ppt>lab

An excerpt from the output is

STATUS_ACCESS_VIOLATION [text] 1250

STATUS_BUFFER_OVERFLOW [lab] 47, [text] 363

STATUS_BUFFER_TOO_SMALL [lab] 47, [text] 363, 364

STATUS_CANCELLED [text] 746, 1621, 1622

STATUS_DATA_ERROR [lab] 47, [text] 363

STATUS_DATA_LATE_ERROR [lab] 47, [text] 363

STATUS_DATA_OVERRUN [lab] 47, [text] 363

STATUS_DATATYPE_MISALIGNMENT [text] 1251

STATUS_DELETE_PENDING [text] 1191, 1213

STATUS_DEVICE_CONFIGURATION_ERROR [lab] 47, [text] 363

STATUS_INSUFFICIENT_RESOURCES [lab] 47, [text] 363, 572-1185

STATUS_INVALID_BUFFER_SIZE [lab] 47, [text] 363
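The shape of these index lines can be reproduced with a small formatter.  The Python sketch below is inferred from the sample output only: runs of three or more consecutive slides are shown as a range (as in "7-9"), while two consecutive slides are listed separately (as in "363, 364").  That threshold, and the names format_pages and format_entry, are assumptions.

```python
def format_pages(pages):
    """Format slide numbers, collapsing runs of three or more into ranges."""
    pages = sorted(set(pages))
    parts, i = [], 0
    while i < len(pages):
        j = i
        while j + 1 < len(pages) and pages[j + 1] == pages[j] + 1:
            j += 1  # extend the run of consecutive slide numbers
        if j - i >= 2:
            parts.append(f"{pages[i]}-{pages[j]}")
        else:
            parts.extend(str(page) for page in pages[i:j + 1])
        i = j + 1
    return ", ".join(parts)


def format_entry(word, refs):
    """Format one index line from (alias, pages) pairs, in the order given."""
    groups = [f"[{alias}] {format_pages(pages)}" for alias, pages in refs]
    return word + " " + ", ".join(groups)
```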

The Table of Contents

The Table Of Contents (TOC) now contains a large dividing bar showing the split where each file is included.  The TOC order is the order-of-inclusion, so if you are creating a "master indexing file" to name all your presentations, the order in which you place the *INCLUDE directives is the order they will appear in the TOC.  However, if a file has no *HEADINGn directives, it will not appear in the TOC at all.

Table of Contents

  lab       text

text
1 Introduction
15 Device Drivers
21 NT Overview
46 Application Level I/O
  ...

lab
2 WinDbg
22 Lab setup
24 Lab Resources
26 Lab 1
  ...

File name/extension summary

Name.Extension Type Contents
name.ppt Input PowerPoint presentation
name.nul Input/Output The list of null-words (words which are not indexed)
name.rule Input/Output The list of transformation and substitution rules
name.htm Output Main index file
name_ix.htm Output Left frame file, contains alphabetical index
name_ct.htm Output Right frame file, contains actual index text
name_toc.htm Output Right frame file, contains table of contents
name_hdr.htm Output Header frame
name_ftr.htm Output Footer frame, when present

About the indexer

See a sample index!

Here's a rather poor-quality screen shot of the kind of index produced.  But click on it to go to a live index!

 

(Single file version) Source Code Download: Get the complete source.  The source is a Visual Studio .NET 2003 MFC project.  Build it from scratch.  Note that this is for the original version, which did not support many of the features here including indexing multiple presentations.

Want to learn more about the inner workings of the program?  Go here to see an article on the implementation!

(Single file version) Executable Download: Not a programmer?  Don't have Visual Studio?  Trust me enough?  This is the .msi file that will install the executable on your machine.
(Multifile version) Source Code Download: Get the complete source.  The source is a Visual Studio .NET 2003 MFC project.  Build it from scratch.  This is the version described by this page.  However, it is still officially a "beta" release.  Please report any problems you have with it.
(Multifile version) Executable Download:  Not a programmer?  Don't have Visual Studio?  Trust me enough?  This is the .msi file that will install the new executable on your machine.  Note that if you have the Single File version already installed, you will have to uninstall it before installing this version.  If you have version 2.0, you will have to uninstall it to install version 2.2.


Future work

I already have ideas for future work.  These include

Changes

For a complete change log based on detailed point release versions, check here.

Date Changes
27-Sep-07 Version 1.2 released
12-Oct-07 Version 2 released.  New features include
  • Double-clicking on a line in the bottom display will position you in PowerPoint on that slide number
  • Using the Slide Locator can automatically track to the slide number
  • The VBA script is now gone, and the program can directly read a PowerPoint presentation
2-Nov-07 Added *HEADING1 and *HEADING2 keywords and table-of-contents output
5-Nov-07 Added *TODO and error displays
7-Nov-07 Added multilevel indexing
6-Jan-08 Added *HEADING3
Added *INCLUDE
Added floating button box (makes it easier to use on single-monitor systems)
Added header and footer to frame set of generated index (footer is optional and only present in multi-presentation indexing tasks)
Added file-finder dropdown
Added flyover help for alias-to-filename, error messages, and lengthy word list, skip list, and hyphenation list
List boxes have flyover help for long lines
Better error detection for out-of-range numeric values, non-numeric text in numeric values, and omitted text
Added quick-search capability to the word list
29-Dec-09 Cleaned up several bugs
Added ability to include all slide titles in Table-Of-Contents
Added "Read HTML" button to launch browser or refresh existing presentation
30-Dec-09 Version 2.2 released


The views expressed in these essays are those of the author, and in no way represent, nor are they endorsed by, Microsoft.

Send mail to newcomer@flounder.com with questions or comments about this web site.
Copyright © 2007 FlounderCraft Ltd./The Joseph M. Newcomer Co. All Rights Reserved
Last modified: May 14, 2011