Stern Center for Research Computing

New York University • Leonard Stern School of Business

Hadoop

Hadoop is an open-source project for running data-intensive applications in a distributed computing environment.

A simplistic description of Hadoop's capability is that it takes the Linux grep, sort, and uniq filters and makes them work in a distributed environment across thousands of computers and petabytes of data. More generally, Hadoop and related projects extend grid and cluster computing concepts beyond strictly scientific computing applications into text data mining and analysis.
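
As a rough single-machine illustration of that analogy, the sketch below chains the three filters in the shell. The function name count_matches and the sample ticker symbols are hypothetical and are not part of the cluster setup. Loosely speaking, grep plays the role of the map step (selecting records), sort resembles the shuffle that brings identical keys together, and uniq -c acts like the reduce step that counts each group.

```shell
# Single-machine analogue of a distributed filter / group / count job.
# count_matches and the sample data are hypothetical, for illustration only.

count_matches() {
    # $1 = pattern to keep; input records arrive on stdin, one per line.
    # grep selects records, sort groups equal records, uniq -c counts them,
    # and awk reorders each line to "record count".
    grep -- "$1" | sort | uniq -c | awk '{print $2, $1}'
}

# Count the symbols that contain the letter M.
printf '%s\n' IBM AAPL IBM MSFT AAPL IBM | count_matches M
# prints:
# IBM 3
# MSFT 1
```

On a Hadoop cluster the same three stages would run in parallel over blocks of a much larger file spread across the nodes.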

Stern Research Computing is piloting a small Hadoop cluster. If the pilot is successful, we will seek funds to enlarge it.
Currently there are 6 processing nodes with about 16 cores and about 1 TB of disk.
In addition to Hadoop and MapReduce, Hive and Mahout (0.5) are also installed.

Why is This of Interest for Research?

Hadoop can dramatically increase our ability to carry out high-end business research projects that have been difficult up to now. Examples include building machine learning applications, crawling the web for information about business topics, extracting data from public documents such as SEC filings, and building customized search engines for research use. A specific example is the processing of the large amounts of data that we receive every night from the Options Price Reporting Authority (OPRA). Our current approach parses the data using C and the Linux grep command, then sorts it. All of this could be done directly in the Hadoop environment, and would presumably scale linearly with the number of nodes.

Getting Started

To use Hadoop, please follow these instructions.

Running Hadoop, Hive and Mahout at the Stern Center for Research Computing

First, you must have your Stern userid enabled for Hadoop, and a Hadoop user created for your userid. To do that, please contact the Stern IT helpdesk and create a ticket for research computing.

Second, log in to rnd and issue the command

. /etc/profile.d/hadoop.sh

for the bash shell, or

source /etc/profile.d/hadoop.csh

for the tcsh or csh shell.

At that point, the hadoop, hive and mahout binaries are in your path, and all of the environment variables have been set.

You should be able to run hadoop map-reduce jobs, hive and mahout.
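
One way to confirm that the profile script took effect is to check that the three commands can be found. The snippet below is a minimal sketch for the bash shell; the helper name missing_tools is hypothetical, and it only tests whether each command is on the PATH, not whether the cluster itself is reachable.

```shell
# Print the name of every listed tool that cannot be found on the PATH.
# missing_tools is a hypothetical helper, not something installed on rnd.
missing_tools() {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
    done
}

missing=$(missing_tools hadoop hive mahout)
if [ -z "$missing" ]; then
    echo "hadoop, hive and mahout are all on the PATH"
else
    # Unquoted on purpose so the names are joined onto one line.
    echo "not found on the PATH:" $missing
fi
```

If any name is reported as not found, source the profile script again in the same shell session before retrying.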

To start, type

hadoop fs -lsr hdfs:///

and you should get a recursive listing of all the files in the Hadoop file system (HDFS).
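
Beyond listing files, a typical first session might copy a small file into HDFS and read it back. The following is a sketch only: the directory name demo and the file sample.txt are hypothetical, and it assumes that your HDFS home directory already exists and that sample.txt is in your current local directory. By default the script only prints the commands it would run; set DRY_RUN=0 to execute them on rnd after sourcing the profile script.

```shell
# Hypothetical first session with the HDFS shell.
# DRY_RUN=1 (the default here) prints each command instead of running it.
DRY_RUN=${DRY_RUN:-1}

run() {
    if [ "$DRY_RUN" = 1 ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# Relative HDFS paths are resolved against your HDFS home directory.
run hadoop fs -mkdir demo                       # create a directory in HDFS
run hadoop fs -put sample.txt demo/sample.txt   # copy a local file into it
run hadoop fs -ls demo                          # list the directory
run hadoop fs -cat demo/sample.txt              # print the file from HDFS
```

The run wrapper is a design choice for safety: it lets you review the exact commands before anything is written to the shared file system.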

Typing

hive

starts the Hive command line environment, and typing

mahout

gives you access to the Mahout command line.