27 August 2007.
Several weeks ago, a legal advocacy group issued a press release describing its efforts on behalf of teenage girls who had been abused in a detention center. It referred readers to a redacted document on its Web site for more information.
As the mother of a teenage girl, I took an interest. I opened the redacted PDF document and examined its security. When I found that I could recover the names of the girls, I informed the group, which quickly corrected the flawed document. But what if my motives had not been those of a curious and outraged parent?
Stories about improperly redacted documents appear frequently in the news and in the legal literature. Often, those who discover the hidden information expose it. But researchers' motives run the gamut from mild curiosity to winning at all costs. Thus, while public exposure might not be desirable, quiet use of the information without the creator's knowledge or consent could be worse.
As was the case in this example, such findings often involve serendipity. But luck isn't always a factor. Strategy plays a major role in certain types of research, such as competitive intelligence. It behooves companies to learn about these techniques in order to protect their confidential information.
Private - Keep Out!
When researchers want to know something about a company, one of the first places they check is its Web site. They read what the company wants them to know. Then, if they want to dig deeper while still using the company itself as a source, they check two things: the Internet Archive and the site's robots exclusion file (robots.txt).
The former archives previous versions of the site. As
I relate in an earlier
article, these sometimes shed light on information
the company might not want to reveal.
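If you prefer to script this check, the short Python sketch below asks the Internet Archive whether it holds an archived copy of a site near a given date. It assumes the Wayback Machine's availability service at archive.org/wayback/available; the domain shown is only a placeholder.

import json
import urllib.parse
import urllib.request

def closest_snapshot(domain, timestamp="20000101"):
    # Ask the Wayback Machine for the archived copy closest to the timestamp.
    query = urllib.parse.urlencode({"url": domain, "timestamp": timestamp})
    with urllib.request.urlopen("https://archive.org/wayback/available?" + query) as resp:
        data = json.loads(resp.read().decode("utf-8"))
    snapshot = data.get("archived_snapshots", {}).get("closest")
    return snapshot["url"] if snapshot and snapshot.get("available") else None

print(closest_snapshot("www.example.com"))  # placeholder domain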
Because of improved security at Web sites, robots exclusion files
generally are not as helpful as they used to be. But
researchers still check them, and so should you.
The files contain commands that instruct search
engines about areas of the site they should not index.
Any legitimate search engine will obey these commands.
To work correctly, the file must appear in the root directory of the Web site and must be named robots.txt. Therefore, to find it, you enter:
http://www.domain.com/robots.txt.
Robots exclusion files are easy to read. The one on The Virtual Chase looks, in part, like this:
user-agent: *
disallow: /_private/
disallow: /cas/
disallow: /cir/
disallow: /data/
The user-agent is the targeted crawler (search engine). The asterisk is a wildcard that matches any crawler. Each character string following the disallow command is a subdirectory. Consequently, this abbreviated set of commands tells all search engines not to crawl the subdirectories labeled _private, cas, cir and data. A researcher, of course, can choose whether to attempt entry.
It's like placing a Keep Out sign on a door. If the
door isn't locked, someone may walk through it.
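Checking the file by hand is trivial, but if you review many sites, a short script helps. Here is a minimal Python sketch that fetches a site's robots exclusion file and lists the directories it asks crawlers to skip; the domain is a placeholder.

import urllib.request

def disallowed_paths(domain):
    # Fetch http://<domain>/robots.txt and collect the Disallow entries.
    with urllib.request.urlopen("http://" + domain + "/robots.txt") as resp:
        lines = resp.read().decode("utf-8", errors="replace").splitlines()
    paths = []
    for line in lines:
        field, _, value = line.partition(":")
        if field.strip().lower() == "disallow" and value.strip():
            paths.append(value.strip())
    return paths

for path in disallowed_paths("www.example.com"):  # placeholder domain
    print(path)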
Careless Clues
As I explain in the above-referenced article on the
Internet Archive, a prospective client approached
a group of my firm's lawyers about launching a new
business in an industry with an unsavory reputation. One
of the conditions for considering representation was
that the woman not have prior dealings in the industry.
She claimed she did not.
Research at the client's business Web site in the
Internet Archive, however, uncovered circumstantial
evidence of several connections. Through telephone
research and public records, we were able to verify not only that she was working in the industry, but also that she was the subject of a then-active federal criminal investigation.
Clues about information you would rather researchers
not discover often come from the company itself. In a
recent and widely
publicized example, Google inadvertently released
information about its finances and future product plans
in a PowerPoint presentation.
Searching for Microsoft Office files is, in fact, an expert research strategy because the metadata in such files often reveals information the producer did not intend to share. You may tack on a qualifier or use a search engine's advanced search page to limit results to specific file types, such as Word documents (doc), PowerPoint presentations (ppt) or Excel spreadsheets (xls).
At Google, the qualifier is
filetype: whereas at Yahoo it is
originurlextension:.
Enter the file extension immediately after the colon (no
spaces). Check each search engine's help documentation
for the appropriate qualifier, or consult a Web site,
such as
Search Engine Showdown, which tracks and reports on such commands.
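To make this concrete, here is a small Python sketch that assembles file-type-restricted queries for a company name, one per Office file type. It assumes Google's filetype: and site: operators as described above; the company name and domain are placeholders.

import urllib.parse

def filetype_queries(company, domain=None):
    # Build one Google search URL per Office file type (doc, ppt, xls).
    urls = []
    for ext in ("doc", "ppt", "xls"):
        terms = '"' + company + '" filetype:' + ext
        if domain:
            terms += " site:" + domain
        urls.append("https://www.google.com/search?" + urllib.parse.urlencode({"q": terms}))
    return urls

for url in filetype_queries("Example Corp", "example.com"):  # placeholders
    print(url)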
Searching certain phrases
sometimes produces intriguing results. Try the phrases
below
individually to discover the potential for this technique
when coupled with a company, organization or agency name:
You might find additional ideas
for searching dirty in the
Google Hacking Database.
Copyright Underground
Book search engines, such as
Amazon.com's
Search-Inside-This-Book,
Google Book Search and the
Text Archive at the Internet Archive, are
becoming increasingly valuable in research. If you
uncover even a snippet of relevant information online,
it may save you valuable research time offline.
One of my recent success stories involved finding an entire chapter on the target company in a book published just a few months before I did the research. Of course, I could not read the book online; I had to purchase it. But the tools helped me find what I might otherwise have missed. This, however, is not the underground to which I refer. By using these tools, you are not circumventing the system that rewards those who wrote and published the book.
The underground, while eminently real, is not so much a
place as it is a mindset - one that sets information
free. The result is a mixed bag of commercial products,
including books, music, digital artwork, movies and
software, that have been copied or reverse engineered.
Try the search strategy below.
Replace the phrase, Harry Potter, with the keywords
of your choice:
"index of" "last modified
size description" "parent directory" "harry potter"
The portion of the search statement preceding "harry potter" is a strategy for finding vulnerable Web sites or servers. In a nutshell, it tells the search engine to return matches for directory listings rather than individual Web pages. If a Web site is properly secured, the search engine will be unable to provide this information.
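If you run this search often, it may help to assemble the statement programmatically. The short Python sketch below substitutes your keywords into the fixed phrases shown above; nothing else about the query changes.

def directory_listing_query(keywords):
    # Combine the fixed phrases that target open directory listings with the
    # keyword phrase of your choice.
    fixed = '"index of" "last modified size description" "parent directory"'
    return fixed + ' "' + keywords.lower() + '"'

print(directory_listing_query("Harry Potter"))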
To some extent, you can monitor the availability of unauthorized copies of your products by setting up what Tara Calishain calls information traps. Her excellent book, Information Trapping, provides many examples of ways to monitor new information.
One possibility is to use the above search strategy for
best-selling or popular products, and then set up a
Google Alert for new matches to each query.
While you should monitor hits at other search engines
besides Google, doing so requires more work. First, test
and perfect the query so that you are retrieving useful
results. Set the search engine preferences to retrieve
100 items per page. Then copy the URL when the search
results display. Paste it into a page monitor, such as
Website-Watcher or
TrackEngine.
The tracking software or Web service will monitor
changes in the first 100 search results. You may opt to
have it send the changes to you by e-mail.
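If you prefer not to rely on a commercial monitor, the rough Python sketch below illustrates the same idea: it stores a hash of the saved results page and reports when the page changes between runs. The URL is a placeholder for the results URL you copied. Note that some search engines block automated requests, so an authorized API or a permitted source may be necessary, and any change to the page, however trivial, will trigger a report.

import hashlib
import pathlib
import urllib.request

STATE = pathlib.Path("monitor_state.txt")  # stores the last hash seen

def page_changed(url):
    # Fetch the page, hash its raw bytes, and compare with the previous run.
    with urllib.request.urlopen(url) as resp:
        digest = hashlib.sha256(resp.read()).hexdigest()
    previous = STATE.read_text().strip() if STATE.exists() else None
    STATE.write_text(digest)
    return digest != previous

saved_search = "https://www.example.com/search-results"  # placeholder results URL
print("changed" if page_changed(saved_search) else "no change")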
Companies and other organizations that want to protect
proprietary or confidential information should conduct
this type of research with regularity. You can expedite
some of the search process with information traps. But considering
the stakes, regular, thorough searching is a worthwhile investment.