Teaching Legal Professionals How To Do Research

Searching Dirty to Find What's Hidden

 

Genie Tyburski, Web Manager, The Virtual Chase

 

27 August 2007. Several weeks ago, a legal advocacy group issued a press release describing its efforts on behalf of teenage girls who had been abused in a detention center. It referred readers to a redacted document on its Web site for more information.

As the mother of a teenage girl, I took an interest. I displayed the redacted PDF document and then examined its security. When I discovered that I could still read the girls' names, I informed the group, which quickly corrected the flawed document.

But what if my motives were not those of a curious and outraged parent?

Stories about improperly redacted documents appear frequently in the news and legal literature. Often, those who discover the redacted information expose it. But the motives of researchers run the gamut from mild curiosity to winning at all costs. Thus, while exposure might not be desirable, use of the information without the creator's knowledge or consent could be worse.

As was the case in this example, such findings often involve serendipity. But luck isn't always a factor. Strategy plays a major role in certain types of research; for instance, competitive intelligence. It behooves companies to learn about these techniques in order to protect their confidential information.

Private - Keep Out!

When researchers want to know something about a company, one of the first places they check is its Web site. They read what the company wants them to know. Then, if they want to dig deeper while still using the company itself as a source, they check two things: the Internet Archive and the Web site's robots exclusion file (robots.txt).

The former archives previous versions of the site. As I relate in an earlier article, these sometimes shed light on information the company might not want to reveal.
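For example, you can list the Wayback Machine's captures of a site by entering a URL in the form below, where www.example.com is a placeholder for the domain you are researching:

http://web.archive.org/web/*/www.example.com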

Because of improved security at Web sites, robots exclusion files generally are not as helpful as they used to be. But researchers still check them, and so should you.

The files contain commands that instruct search engines about areas of the site they should not index. Any legitimate search engine will obey these commands.

To work correctly, the file must appear in the root directory of the Web site and bear the filename robots.txt. Therefore, to find it, you enter: http://www.domain.com/robots.txt.

Robots exclusion files are easy to read. The one on The Virtual Chase looks, in part, like this:

user-agent: *
disallow: /_private/
disallow: /cas/
disallow: /cir/
disallow: /data/

The user-agent is the targeted crawler (search engine). The asterisk is a wildcard matching all crawlers. Each string following the disallow command is a subdirectory. Consequently, this abbreviated set of commands tells all search engines not to crawl the subdirectories _private, cas, cir and data. A researcher, of course, may still choose whether to attempt entry.

It's like placing a Keep Out sign on a door. If the door isn't locked, someone may walk through it.
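If you check robots exclusion files often, you can script the lookup. The sketch below is illustrative only; it assumes Python 3's standard library and uses www.example.com as a placeholder for the domain you are researching. It fetches the file from the root directory and prints each disallow directive.

import urllib.request

# Placeholder domain; substitute the site you are researching.
url = "http://www.example.com/robots.txt"

# Fetch the robots exclusion file from the site's root directory.
with urllib.request.urlopen(url, timeout=10) as response:
    robots = response.read().decode("utf-8", errors="replace")

# Print only the directives that name off-limits areas of the site.
for line in robots.splitlines():
    if line.strip().lower().startswith("disallow:"):
        print(line.strip())

Each printed line names a directory the site owner asked crawlers to skip; whether to look further is, as noted above, the researcher's choice.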

Careless Clues

As I explain in the above-referenced article on the Internet Archive, a prospective client approached a group of my firm's lawyers about launching a new business in an industry with an unsavory reputation. One of the conditions for considering representation was that the woman not have prior dealings in the industry. She claimed she did not.

Research on the prospective client's business Web site in the Internet Archive, however, uncovered circumstantial evidence of several connections. Through telephone research and public records, we were able to verify not only that she was working in the industry, but that she was the subject of a then-active federal criminal investigation.

Clues about information you would rather researchers not discover often come from the company itself. In a recent and widely publicized example, Google inadvertently released information about its finances and future product plans in a PowerPoint presentation.

Searching for Microsoft Office files is, in fact, an expert research strategy because the metadata often reveals information the producer did not intend to share. You may tack on a qualifier or use a search engine's advanced search page to limit results to specific file types, such as Word documents (doc), PowerPoint presentations (ppt) or Excel spreadsheets (xls).

At Google, the qualifier is filetype: whereas at Yahoo it is originurlextension:. Enter the file extension immediately after the colon (no spaces). Check each search engine's help documentation for the appropriate qualifier, or consult a Web site, such as Search Engine Showdown, which tracks and informs about such commands.
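To illustrate, the queries below pair a hypothetical company name, acme widgets, with a file-type restriction; substitute the organization you are actually researching:

At Google:  "acme widgets" filetype:ppt
At Yahoo:   "acme widgets" originurlextension:xls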

Searching certain phrases sometimes produces intriguing results. Try the phrases below individually to discover the potential for this technique when coupled with a company, organization or agency name:

  • "not for public dissemination"

  • "not for public release"

  • "official use only" (variations include FOUO and U//FOUO)

  • "company confidential"

  • "internal use only"

You might find additional ideas for searching dirty in the Google Hacking Database.

Copyright Underground

Book search engines, such as Amazon.com's Search-Inside-This-Book, Google Book Search and the Text Archive at the Internet Archive, are becoming increasingly valuable in research. If you uncover even a snippet of relevant information online, it may save you valuable research time offline.

One of my recent success stories involves finding an entire chapter on the target company in a book published just a few months prior to the research. Of course, I was unable to read the book online. I had to purchase it. But the tools helped me find what I might have missed without them.

However, this is not the underground to which I refer. By using these tools, you are not skirting the process for rewarding those who wrote and published the book.

The underground, while eminently real, is not so much a place as it is a mindset - one that sets information free. The result is a mixed bag of commercial products, including books, music, digital artwork, movies and software, that have been copied or reverse engineered.

Try the search strategy below, replacing the phrase harry potter with the keywords of your choice:

"index of" "last modified size description" "parent directory" "harry potter"

The portion of the search statement preceding "harry potter" is a strategy for finding vulnerable Web sites or servers. In a nutshell, it commands the search engine to return matches to directory listings instead of single Web pages. If a Web site is properly secured, the search engine will be unable to provide this information.

To some extent, you can monitor the availability of files that comprise unauthorized copies of products by setting up what Tara Calishain calls information traps. Tara's excellent book on Information Trapping provides many examples of ways to monitor new information.

One possibility is to use the above search strategy for best-selling or popular products, and then set up a Google Alert for new matches to each query.

While you should monitor hits at other search engines besides Google, doing so requires more work. First, test and perfect the query so that you are retrieving useful results. Set the search engine preferences to retrieve 100 items per page. Then copy the URL when the search results display. Paste it into a page monitor, such as Website-Watcher or TrackEngine. The tracking software or Web service will monitor changes in the first 100 search results. You may opt to have it send the changes to you by e-mail.
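If you prefer to build your own monitor rather than rely on these services, the sketch below shows the same idea in miniature. It is illustrative only: it assumes Python 3's standard library, uses a placeholder results URL, and simply reports whether the saved results page has changed since the last run.

import hashlib
import pathlib
import urllib.request

# Placeholder: paste the URL you copied from the search results page here.
RESULTS_URL = "http://www.example.com/search?q=%22harry+potter%22"
CACHE_FILE = pathlib.Path("results_fingerprint.txt")

# Download the current results page.
with urllib.request.urlopen(RESULTS_URL, timeout=10) as response:
    page = response.read()

# Fingerprint the page so a change is easy to detect.
fingerprint = hashlib.sha256(page).hexdigest()

# Compare against the fingerprint saved by the previous run, if any.
previous = CACHE_FILE.read_text().strip() if CACHE_FILE.exists() else None
if previous != fingerprint:
    print("The search results have changed since the last check.")
else:
    print("No change in the search results.")

CACHE_FILE.write_text(fingerprint)

Scheduled to run daily, this approximates what the commercial page monitors do. A practical version would compare the result titles or links rather than hash the whole page, since advertisements and timestamps change on every load.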

Companies and other organizations that want to protect proprietary or confidential information should conduct this type of research with regularity. You can expedite some of the search process with information traps. But considering the stakes, regular thorough searching is a worthwhile investment.

 
 


 
 

