[QueueNews] Document & Media Exploitation
QueueNews
queuenews at acmqueue.com
Mon Dec 17 08:00:02 PST 2007
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
. . . . . . . . . . . . . . . . . . . . . . . .
Queue E-Mail Newsletter
for the Week of Dec/17/2007
. . . . . . . . . . . . . . . . . . . . . . . .
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
-------------------------------------------------
Latest Articles:
Document & Media Exploitation
The DOMEX challenge is to turn digital bits into actionable
intelligence.
http://acmqueue.com/rd.php?c.512
(scroll down to read an excerpt from this article)
Only Code Has Value?
Even the best-written code can't reveal why it's doing what it's
doing.
http://acmqueue.com/rd.php?c.511
New article on ACM Queue:
Document & Media Exploitation
http://acmqueue.com/rd.php?c.512
The DOMEX challenge is to turn digital bits into actionable
intelligence.
by Simson L. Garfinkel, Ph.D.
From the Going Multimedia issue, vol. 5, no. 7 - November/December 2007
article excerpt:
A computer used by Al Qaeda ends up in the hands of a Wall Street
Journal reporter. A laptop from Iran is discovered that contains
details of that country's nuclear weapons program. Photographs and
videos are downloaded from terrorist Web sites.
As evidenced by these and countless other cases, digital documents and
storage devices hold the key to many ongoing military and criminal
investigations. The most straightforward approach to using these media
and documents is to explore them with ordinary tools - open the Word
files with Microsoft Word, view the Web pages with Internet Explorer,
and so on.
Although this straightforward approach is easy to understand, it can
miss a lot. Deleted and invisible files can be made visible using basic
forensic tools. Programs called carvers can locate information
that isn't even a complete file and turn it into a form that can be
readily processed. Detailed examination of e-mail headers and log files
can reveal where a computer was used and which other computers it came
into contact with. Linguistic tools can discover multiple documents that
refer to the same individuals, even though names in the different
documents have different spellings and are in different human
languages.
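As a rough illustration of what a carver does, here is a minimal Python
sketch of header/footer carving, the simplest form of the technique. It
scans a raw byte buffer for JPEG start and end markers and returns
whatever lies between them, without consulting any file system. The
function name carve_jpegs and the max_size limit are assumptions made
for this example; they do not come from the article or from any
particular forensic tool, and a real carver would also validate the
internal structure of each candidate.

```python
JPEG_SOI = b"\xff\xd8\xff"  # start-of-image marker plus first byte of the next marker
JPEG_EOI = b"\xff\xd9"      # end-of-image marker


def carve_jpegs(raw: bytes, max_size: int = 10_000_000) -> list[bytes]:
    """Return byte runs that begin with a JPEG header and end with a JPEG footer.

    Hypothetical helper: it ignores file-system metadata entirely and
    looks only at the raw bytes, which is what lets carving recover
    data that no longer belongs to any file.
    """
    carved = []
    pos = 0
    while True:
        start = raw.find(JPEG_SOI, pos)
        if start == -1:
            break
        # Look for the footer within max_size bytes of the header.
        end = raw.find(JPEG_EOI, start + len(JPEG_SOI), start + max_size)
        if end == -1:
            pos = start + 1  # no footer nearby; try the next header
            continue
        carved.append(raw[start:end + len(JPEG_EOI)])
        pos = end + len(JPEG_EOI)
    return carved
```

Because the sketch trusts the markers alone, it can return fragments
that are not valid images; that trade-off is acceptable when the goal
is to surface anything that might be useful.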
Data-mining techniques such as cross-drive analysis can reconstruct
social networks - automatically determining, for example, if the
computer's previous user was in contact with known terrorists. This sort
of advanced analysis is the stuff of DOMEX, the little-known
intelligence practice of document and media exploitation.
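The idea behind cross-drive analysis can be sketched in a few lines of
Python. This example assumes that e-mail addresses serve as the shared
identifiers and that each drive image fits in memory as a byte string;
both are simplifications chosen for illustration. The names
extract_emails and cross_drive_correlate and the regular expression are
hypothetical, not taken from the article.

```python
import re
from collections import defaultdict

# Deliberately simple pattern for e-mail-like strings in raw bytes (an assumption).
EMAIL_RE = re.compile(rb"[A-Za-z0-9._%+-]+@[A-Za-z0-9-]+(?:\.[A-Za-z0-9-]+)+")


def extract_emails(raw: bytes) -> set[str]:
    """Return the lowercased e-mail-like strings found anywhere in the raw bytes."""
    return {m.decode("ascii").lower() for m in EMAIL_RE.findall(raw)}


def cross_drive_correlate(images: dict[str, bytes],
                          min_drives: int = 2) -> dict[str, set[str]]:
    """Map each address to the drives it appears on, keeping shared ones only."""
    seen = defaultdict(set)
    for drive_id, raw in images.items():
        for addr in extract_emails(raw):
            seen[addr].add(drive_id)
    return {addr: drives for addr, drives in seen.items()
            if len(drives) >= min_drives}


images = {
    "drive_a": b"\x00\x01 alice@example.org xx bob@example.com",
    "drive_b": b"junk Bob@Example.com\x00",
    "drive_c": b"carol@example.net",
}
shared = cross_drive_correlate(images)
```

Here shared links drive_a and drive_b through bob@example.com. An
address that appears on several otherwise unrelated drives is a
candidate edge in a social network, which an analyst would still need
to confirm by other means.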
The U.S. intelligence community defines DOMEX as "the processing,
translation, analysis, and dissemination of collected hard-copy
documents and electronic media, which are under the U.S. government's
physical control and are not publicly available."^1 That definition
goes on to exclude "the handling of documents and media during the
collection, initial review, and inventory process." DOMEX is not about
being a digital librarian; it's about being a digital detective.
Although very little has been disclosed about the
government's DOMEX activities, in recent years academic researchers -
particularly those concerned with electronic privacy - have learned a
great deal about the general process of electronic document and media
exploitation. My interest in DOMEX started while studying data left on
hard drives and memory sticks after files had been deleted or the media
had been "formatted." I built a system to automatically copy the data
off the hard drives, store it on a server, and search for confidential
information. In the process I built a rudimentary DOMEX system. Other
recent academic research in the fields of computer forensics, data
recovery, machine translation, and data mining is also directly
applicable to DOMEX.
This article introduces electronic document and media exploitation from
that academic perspective. It presents a model for performing this kind
of exploitation and discusses some of the relevant academic research.
Properly done, DOMEX goes far beyond recovering documents from hard
drives and storing them in searchable archives. Understanding this
engineering problem gives insight that will be useful for designing any
system that works with large amounts of unstructured, heterogeneous
data.
Why "Exploitation?"
When researchers say that their work is centered on information or
document "exploitation," eyebrows are invariably raised. The word
exploitation is provocative, attracting unwarranted attention to a
process that could just as easily be classified as "computer forensics"
or even "data recovery." But, in fact, the word is apropos.
The words exploit and exploitation imply using something in a manner
that's "unfair or selfish."^2 And it's true. People who are in the
business of document and media exploitation really do seek to make
unfair use of computer documents and electronic storage devices. Fair,
after all, means following the rules. The "rules" of a computer system
are the APIs, the data-storage standards, the file permissions, and
other interfaces that were intended to be used by the file's creator.
When a file in the computer's electronic trash is deleted by "emptying
the trash," the rules say that the file's contents should no longer be
accessible. The "undelete" command that is part of every forensic
toolkit takes advantage of the fact that computer systems generally do
not overwrite the contents of deleted files. This is a common problem
in computer systems, affecting not only deleted files in file systems
but also deleted paragraphs in word processors and even unallocated
pages in virtual memory systems.
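A toy model makes it easier to see why undelete works. The Python
sketch below is a deliberately simplified, hypothetical file system
(the ToyDisk class and its methods are invented for this example and
do not model any real on-disk format): deleting a file removes only
its directory entry and returns its blocks to the free list, so the
contents stay readable to anyone who scans unallocated space.

```python
class ToyDisk:
    """A toy file system: an array of blocks plus a name-to-blocks directory."""

    def __init__(self, num_blocks: int, block_size: int = 16):
        self.block_size = block_size
        self.blocks = [bytes(block_size)] * num_blocks
        self.directory = {}                  # file name -> list of block indexes
        self.free = list(range(num_blocks))  # indexes of unallocated blocks

    def write(self, name: str, data: bytes) -> None:
        indexes = []
        for off in range(0, len(data), self.block_size):
            idx = self.free.pop(0)
            chunk = data[off:off + self.block_size]
            self.blocks[idx] = chunk.ljust(self.block_size, b"\x00")
            indexes.append(idx)
        self.directory[name] = indexes

    def delete(self, name: str) -> None:
        # Only the directory entry goes away; the block contents are untouched.
        self.free.extend(self.directory.pop(name))

    def scan_unallocated(self) -> bytes:
        """Return the raw contents of every free block, in disk order."""
        return b"".join(self.blocks[i] for i in sorted(self.free))


disk = ToyDisk(num_blocks=8)
disk.write("memo.txt", b"meet at the usual place")
disk.delete("memo.txt")
leftover = disk.scan_unallocated()
```

After the delete, "memo.txt" is gone from the directory, yet leftover
still contains the memo's text. The data disappears only when a later
write happens to reuse those blocks.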
Computer forensic practitioners working for police departments and
litigation support firms also make their living by recovering
intentionally deleted data, but even these processes follow rules -
though those involved in exploitation might choose to ignore them. The
goal of computer forensics is to assist in some kind of investigation,
which usually begins because a crime was committed and, hopefully, ends
with the perpetrator being convicted in a court of law. With conviction
as a goal, forensic practitioners must be concerned with the
evidentiary integrity and chain of custody - and they need to limit
their search to information that is relevant to that investigation. In
many cases the evidence will have been obtained under a search warrant
or discovery procedure, the terms of which may limit the forensic
examiner's actions or even which kinds of files may be examined.
Evidence obtained by breaking the rules may even be suppressed.
For example, in the case of U.S. v. Carey, an investigator executing a
narcotics warrant discovered files with a JPG extension that contained
child pornography. Carey was indicted and convicted of possession of
child pornography, but the appellate court reversed the ruling and
remanded the case to the trial court, arguing that "the seizure of
evidence was beyond the scope of the warrant."^3 The evidence should
have been suppressed.
Unlike the investigators in the Carey case, those engaged in
document and media exploitation are not bound by any rules other than
laws of physics and nature. The goal of information exploitation is to
get and use the data - the ends justify the means. It's OK if these
results aren't good enough for a conviction. Exploitation rarely seeks
to prove or disprove the details of a case; instead, it seeks to make
the fullest use of all the data that has been obtained. The standard of
success is the usefulness of the result, not the reliability of the
process.
If you find the preceding paragraph alarming, remember that DOMEX is
about exploiting data, not people. "Exploitation" is precisely the
attitude that you want when you take a crashed hard drive to a
data-recovery firm. If you've just lost the only copy of a 400-page
manuscript, it's probably OK with you if the firm is able to recover
the first 200 pages of the September 20 version and the last 180 pages
of the August 19 version. Although a good defense attorney might be
able to suppress a document that was made by stitching together those
two halves, you probably don't care about that if you are the author
and the alternative is rewriting the 400 pages from memory. Likewise,
if you are using some kind of desktop search system to index the files
on your hard drive, you don't mind if the product makes a mistake or
two and shows you files that you aren't "allowed" to see - just as long
as you find what you're searching for.
Read the rest of this article at acmqueue.com
http://acmqueue.com/rd.php?c.512
See all the latest articles and audio interviews with Queue's RSS Feeds
http://acmqueue.com/rd.php?s.48.7
To unsubscribe from this newsletter, send an email to
queuenews-request at acmqueue.com
with the word 'unsubscribe' in the subject line.
Change your email address
http://www.acmqueue.com/mailman/options/queuenews
Subscribe to Queue in print
http://www.acmqueue.com/click.php?id=30
About Queue
http://acmqueue.com/rd.php?s.31.5
Contact Us
http://acmqueue.com/rd.php?s.32.10
Privacy policy
http://acmqueue.com/rd.php?s.27.9
For advertising information, contact advertising at acmqueue.com
© 2007 ACM, Inc. All rights reserved.