The solution I settled on was a compromise: a blacklist. This lets us use almost any source document, but it requires us to build the list of ignored words ourselves. To do this, I used a large document to ‘teach’ the blacklist, adding words to it as they appeared.
Here is the version at the time of posting:
“0,1,2,3,4,5,6,7,8,9,-, a, about, also, an, and, any, are, as, be, been, but, by, come, comes, do, for, from, go, goes, have, he, how, however, i, in, is, it, its, may, of, on, or, our, out, so, such, than, that, the, then, these, they, this, thus, to, too, us, use, was, way, we, what, which, who, with,(,),.,:,;,?,[,],^”
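To make the idea concrete, here is a minimal sketch of how a blacklist like this might be applied. It is not the code behind this post: the function name `count_words`, the regex tokenizer, and the shortened `BLACKLIST` set are all assumptions made for illustration.

```python
import re
from collections import Counter

# Hypothetical, shortened blacklist for illustration; the full list is above.
BLACKLIST = {"a", "an", "and", "of", "the", "to", "(", ")", ".", ":", ";", "?"}

def count_words(text, blacklist=BLACKLIST):
    """Count the tokens in text that are not on the blacklist."""
    # Assumed tokenizer: lowercase the text, then treat each run of word
    # characters, or each single punctuation mark, as one token.
    tokens = re.findall(r"\w+|[^\w\s]", text.lower())
    return Counter(t for t in tokens if t not in blacklist)

counts = count_words("The cat and the hat (of the cat).")
print(counts.most_common())
```

‘Teaching’ the list then amounts to running a large document through a function like this, looking at the most common tokens that survive, and adding the uninteresting ones to the blacklist.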