Top Level Namespace

Includes:
FileUtils, Wukong::Streamer

Defined Under Namespace

Modules: Enumerable, HadoopBinning, HashLike, Monkeyshines, Size, WordCount, Wukong

Classes: Array, BadRecord, BigDecimal, Bignum, Blob, Boolean, Class, Csv, Date, DateTime, EpochTime, FalseClass, FilePath, Fixnum, Flag, Float, Hash, IPAddress, Integer, Json, Mapper, NilClass, Numeric, Object, Pathname, PeriodicMonitor, Regex, String, Subdir, Summer, Symbol, Text, Time, TrueClass, TypedStruct, URI, WcMapper, WcReducer, Yaml

Constant Summary

OPTIONS =
{}
OUTPUT_LINE_FMT =
"%-71s\t%15d\t%15s"
NEWLINE_LENGTH =

KLUDGE

$/.length
USAGE =
%Q{
# h1. wulign -- format a tab-separated file as aligned columns
#
# wulign will intelligently reformat a tab-separated file into a tab-separated,
# space aligned file that is still suitable for further processing. For example,
# given the log-file input
#
#     # cat tag_usage.tsv
#     2009-07-21T21:39:40 day     65536   3.15479 68750   1171316
#     2009-07-21T21:39:45 doing   65536   1.04533 26230   1053956
#     2009-07-21T21:41:53 hapaxlegomenon  65536   0.87574e-05     23707   10051141
#     2009-07-21T21:44:00 concert 500     0.29290 13367   9733414
#     2009-07-21T21:44:29 world   65536   1.09110 32850   200916
#     2009-07-21T21:44:39 world+series    65536   0.49380 9929    7972025
#     2009-07-21T21:44:54 iranelection    65536   2.91775 14592   136342
#
# wulign will reformat it to read
#
#     # cat tag_usage.tsv | wu-lign
#     2009-07-21T21:39:40 day                   65536   3.154791234 68750    1171316
#     2009-07-21T21:39:45 doing                 65536   1.045330000 26230    1053956
#     2009-07-21T21:41:53 hapaxlegomenon        65536   0.000008757 23707   10051141
#     2009-07-21T21:44:00 concert                 500   0.292900000 13367    9733414
#     2009-07-21T21:44:29 world                 65536   1.091100000 32850     200916
#     2009-07-21T21:44:39 world+series          65536   0.493800000  9929    7972025
#     2009-07-21T21:44:54 iranelection          65536   2.917750000 14592     136342
#
# The fields are still tab-delimited by exactly one tab -- only spaces are used to
# pad out fields. You can still use cuttab and friends to manipulate columns.
#
# h2. Command-line arguments
#
# You can give sprintf-style positional arguments on the command line that will be
# applied to the corresponding columns. (Blank args are used for placeholding and
# auto-formatting is still applied).  So with the example above,
#
#     cat foo | wulign  '' '' '' '%8.4e'
#
# will format the fourth column with "%8.4e", while the first three columns and
# fifth-and-higher columns are formatted as usual.
#
#     ...
#     2009-07-21T21:39:45 doing           65536   1.0453e+00      26230    1053956
#     2009-07-21T21:41:53 hapaxlegomenon  65536   8.7574e-06      23707   10051141
#     2009-07-21T21:44:00 concert           500   2.9290e-01      13367    9733414
#     ....
#
# h2. How it works
#
# Wu-lign takes the first 500ish lines, splits into fields on TAB characters,
# and tries to guess the format (int, float, or string) for each. It builds a
# consensus of the width and type for corresponding columns in the chunk.  If a
# column has mixed numeric and string formats it degrades to :mixed, which is
# basically treated as :string. If a column has mixed :float and :int elements all
# of them are formatted as float.
#
# h2. Notes
#
# * Header rows: the first line is used for width alignment but not for type detection.
#   This means that an initial row of text headers will inform column spacing
#   but still allow a column of floats (say) to be properly aligned as floats.
#
# * It requires a unanimous vote. One screwy line can coerce the whole mess to
#   :mixed; width formatting will still be applied, though.
#
# * It won't set columns wider than 100 chars -- this allows for the occasional
#   super-wide column without completely breaking your screen.
#
# * For :float values, wulign tries to guess at the right number of significant
#   digits to the left and right of the decimal point.
#
# * wulign parses only plain-jane 'TSV files': no quoting or escaping; every tab
#   delimits a field, every newline a record.
#
# wulign isn't intended to be smart, or correct, or reliable -- only to be
# useful for previewing and organizing tab-formatted files. In general
# wulign(foo).split("\t").map(&:strip) *should* give output semantically
# equivalent to its input. (That is, the only changes should be insertion of
# spaces and re-formatting of numerics.) But still -- reserve its use for human
# inspection only.
#
}
FORMAT_GUESSING_LINES =

How many initial lines to use to guess formatting. Lines after this are simply reformatted according to the consensus of the initial FORMAT_GUESSING_LINES.

500
MAX_MAX_WIDTH =

Widest column width to set, in characters.

100
INT_RE =
/\A[\d,]+\z/
FLOAT_RE =
/\A([\d,]+)(?:\.(\d+))?(?:e-?\d+)?\z/
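
To make the two patterns concrete, here is a minimal sketch of which field values they accept. The `classify` helper is hypothetical (it is not part of wu-lign); it simply tries INT_RE first and FLOAT_RE second, copied from the constants above.

```ruby
# Copies of the two constants documented above.
INT_RE   = /\A[\d,]+\z/
FLOAT_RE = /\A([\d,]+)(?:\.(\d+))?(?:e-?\d+)?\z/

# Hypothetical helper for illustration: classify one field,
# testing the integer pattern before the float pattern.
def classify(field)
  if    field =~ INT_RE   then :int
  elsif field =~ FLOAT_RE then :float
  else                         :str
  end
end

samples = ['65536', '1,171,316', '3.15479', '0.87574e-05',
           '8.7574e+00', '-1.5', 'world']
samples.each { |s| puts "%-12s %s" % [s, classify(s)] }
```

As written, the patterns accept digits with thousands separators and a lowercase exponent with an optional minus sign. A leading sign or an explicit `e+` exponent does not match either pattern, so those values fall through to the string case in this sketch.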
LINES =

# Logging

MB = 1024*1024
LOG_INTERVAL = 100_000
$start = Time.now; $iter = 0; $size = 0
def log_line
  elapsed = (Time.now - $start).to_f
  $stderr.puts("%5d s\t%10.1f l/s\t%5dk<\t%5dk>\t%5d MB\t%9.1f MB/s\t%11d b/l"%[ elapsed, $iter/elapsed, $iter/1000, LINES.count/1000, $size/MB, ($size/MB)/elapsed, $size/$iter ])
end

Set.new
Log =
Wukong.logger
EMR_CONFIG_DIR =
'~/.wukong'

Instance Method Summary

Instance Method Details

- (Object) consensus_type(val, alltype, is_first)



# File 'bin/wu-lign', line 112

def consensus_type val, alltype, is_first
  return :mixed if alltype == :mixed
  type = get_type(val) or return
  case
  when alltype.nil?                  then type
  when is_first && (alltype == :str) then type
  when alltype == type               then type
  when ( ((alltype==:float) && (type == :int)) || ((alltype == :int) && (type == :float)) )
    :float
  else :mixed
  end
end
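
The transitions are easier to see with a few direct calls. The sketch below copies `consensus_type` from the listing above and uses a behavior-equivalent `get_type`. It only exercises single steps; how wu-lign itself drives the method across rows, and when it passes `is_first`, is not shown in this page and is not assumed here.

```ruby
INT_RE   = /\A[\d,]+\z/
FLOAT_RE = /\A([\d,]+)(?:\.(\d+))?(?:e-?\d+)?\z/

# Behavior-equivalent copy of get_type: nil for blank, else :int, :float or :str.
def get_type(val)
  case
  when val == ''       then nil
  when val =~ INT_RE   then :int
  when val =~ FLOAT_RE then :float
  else                      :str
  end
end

# Copied from the listing above.
def consensus_type(val, alltype, is_first)
  return :mixed if alltype == :mixed
  type = get_type(val) or return
  case
  when alltype.nil?                  then type
  when is_first && (alltype == :str) then type
  when alltype == type               then type
  when ( ((alltype==:float) && (type == :int)) || ((alltype == :int) && (type == :float)) )
    :float
  else :mixed
  end
end

# Each entry: [value, consensus so far, is_first]
cases = [
  ['42',  nil,    false],  # nothing seen yet: adopt the value's type
  ['3.5', :int,   false],  # int and float widen to :float
  ['abc', :float, false],  # any other disagreement is :mixed
  ['42',  :mixed, false],  # :mixed is sticky
  ['42',  :str,   true ],  # with is_first, a :str consensus is replaced
  ['42',  :str,   false],  # without it, :str vs :int is :mixed
  ['',    :int,   false],  # blank value: the method returns nil
]
cases.each do |val, alltype, is_first|
  puts "%-6s %-8s %-6s => %s" %
    [val.inspect, alltype.inspect, is_first, consensus_type(val, alltype, is_first).inspect]
end
```

Note the last case: a blank field makes the method return nil, so a caller that assigns the result back unconditionally would lose the consensus built so far.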

- (Object) dump_header(row, maxw)



# File 'bin/wu-lign', line 172

def dump_header row, maxw
  puts row.zip(maxw).map{|col, width| "%-#{width}s" % col.to_s }.join("\t")
end

- (Object) dump_row(row, format)



# File 'bin/wu-lign', line 169

def dump_row row, format
  puts row.zip(format).map{|c,f| f.call(c) rescue c }.join("\t")
end
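
A short sketch of how the per-column callables and the `rescue` fallback interact. `render_row` is a hypothetical variant of `dump_row` that returns the line instead of printing it, and the three lambdas in `FORMATS` are illustrative only; the formatters wu-lign actually builds are not shown in this page.

```ruby
# Hypothetical variant of dump_row that returns the string rather than printing.
def render_row(row, format)
  row.zip(format).map { |c, f| f.call(c) rescue c }.join("\t")
end

# Illustrative per-column formatters (assumptions, not wu-lign's own).
FORMATS = [
  ->(c) { "%-14s" % c },           # left-aligned string
  ->(c) { "%7d"   % Integer(c) },  # right-aligned integer
  ->(c) { "%8.5f" % Float(c) },    # fixed-point float
]

puts render_row(['world', '65536', '1.09110'], FORMATS)

# A cell that cannot be converted, or a column with no formatter at all,
# falls back to the raw cell text because of the inline rescue.
puts render_row(['world', '65536', 'n/a', 'extra'], FORMATS)
```

The inline `rescue` means one malformed cell never aborts the run. The trade-off is that it also hides genuine formatter bugs, since any StandardError is silently replaced by the unformatted value.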

- (Object) f_width(str)



# File 'bin/wu-lign', line 125

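# Returns [digits before the decimal point, digits after it] for a numeric
# string. Note that the non-numeric case returns the integer 0, not an array.
def f_width str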
  str =~ FLOAT_RE or return 0
  [$1.length, $2 ? $2.length : 0]
end
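
As a rough illustration of how these width pairs could feed a float format, the sketch below copies `f_width` and adds a hypothetical `float_format` helper that takes the widest left and right parts in a column. The helper is an assumption made for this example; the way wu-lign combines the widths is not shown in this page.

```ruby
FLOAT_RE = /\A([\d,]+)(?:\.(\d+))?(?:e-?\d+)?\z/

# Copied from the listing above.
def f_width(str)
  str =~ FLOAT_RE or return 0
  [$1.length, $2 ? $2.length : 0]
end

# Hypothetical helper: derive one sprintf format for a whole column from the
# widest integer part and the widest fractional part seen.
def float_format(column)
  widths = column.map { |s| f_width(s) }.select { |w| w.is_a?(Array) }
  return nil if widths.empty?
  left  = widths.map(&:first).max
  right = widths.map(&:last).max
  "%#{left + 1 + right}.#{right}f"   # +1 for the decimal point
end

column = %w[3.15479 1.04533 0.29290 12.5]
fmt = float_format(column)
puts fmt
column.each { |s| puts fmt % Float(s) }
```

The `select` step is there because `f_width` returns a bare 0 for non-numeric input, which cannot be destructured like the two-element array.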

- (Object) format_output(file, size)



# File 'bin/hdp-du', line 70

def format_output file, size
  human_size = number_to_human_size(size) || ""
  file = file.gsub(%r{hdfs://[^/]+/}, '/') # kill off hdfs paths, otherwise leave it alone
  OUTPUT_LINE_FMT % [file, size.to_i, human_size]
end
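
The following sketch shows what one formatted line looks like. `format_line` is a hypothetical variant that takes the human-readable size as an argument so the example stays self-contained, and the `namenode:8020` host is made up for illustration.

```ruby
OUTPUT_LINE_FMT = "%-71s\t%15d\t%15s"

# Hypothetical variant of format_output: the human-readable size is passed in
# rather than computed, so no other helper is needed here.
def format_line(file, size, human_size)
  file = file.gsub(%r{hdfs://[^/]+/}, '/')   # strip the hdfs://host:port prefix
  OUTPUT_LINE_FMT % [file, size.to_i, human_size]
end

line = format_line('hdfs://namenode:8020/user/flip/data.tsv', '1234567', '1.2 MB')
puts line

# The line remains tab-separated: path (padded to 71), bytes, human size.
path, bytes, human = line.split("\t").map(&:strip)
puts [path, bytes, human].inspect
```

Padding with spaces while keeping exactly one tab between fields means the output can still be split on tabs by downstream tools.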

- (Object) get_type(val)



# File 'bin/wu-lign', line 104

def get_type val
  case
  when val == ''       then nil
  when val =~ INT_RE   then :int
  when val =~ FLOAT_RE then :float
  else                      :str
  end
end

- (Object) number_to_human_size(size, precision = 1)

Formats the bytes in size into a more understandable representation (e.g., giving it 1500 yields 1.5 KB). This method is useful for reporting file sizes to users. It returns nil if size cannot be converted into a number. You can change the default precision of 1 using the precision parameter.

Examples

number_to_human_size(123)           # => 123 Bytes
number_to_human_size(1234)          # => 1.2 KB
number_to_human_size(12345)         # => 12.1 KB
number_to_human_size(1234567)       # => 1.2 MB
number_to_human_size(1234567890)    # => 1.1 GB
number_to_human_size(1234567890123) # => 1.1 TB
number_to_human_size(1234567, 2)    # => 1.18 MB
number_to_human_size(483989, 0)     # => 473 KB


# File 'bin/hdp-du', line 55

def number_to_human_size(size, precision=1)
  size = Kernel.Float(size)
  case
  when size.to_i == 1;    "1 Byte"
  when size < 1.kilobyte; "%d Bytes" % size
  when size < 1.megabyte; "%.#{precision}f KB"  % (size / 1.0.kilobyte)
  when size < 1.gigabyte; "%.#{precision}f MB"  % (size / 1.0.megabyte)
  when size < 1.terabyte; "%.#{precision}f GB"  % (size / 1.0.gigabyte)
  else                    "%.#{precision}f TB"  % (size / 1.0.terabyte)
  end #.sub(/([0-9]\.\d*?)0+ /, '\1 ' ).sub(/\. /,' ')
rescue
  nil
end
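
The listing relies on `1.kilobyte`, `1.megabyte` and so on, which are not part of core Ruby (ActiveSupport provides methods with these names). As a minimal sketch under that caveat, the same logic can be written with plain constants. The name `human_size` and the constants below are assumptions made for this example, not part of hdp-du.

```ruby
# Binary size units as plain constants (stand-ins for 1.kilobyte etc.).
KILOBYTE = 1024
MEGABYTE = 1024 * KILOBYTE
GIGABYTE = 1024 * MEGABYTE
TERABYTE = 1024 * GIGABYTE

# Hypothetical stand-alone version of number_to_human_size.
def human_size(size, precision = 1)
  size = Kernel.Float(size)   # raises for input that is not numeric
  case
  when size.to_i == 1  then "1 Byte"
  when size < KILOBYTE then "%d Bytes" % size
  when size < MEGABYTE then "%.#{precision}f KB" % (size / KILOBYTE)
  when size < GIGABYTE then "%.#{precision}f MB" % (size / MEGABYTE)
  when size < TERABYTE then "%.#{precision}f GB" % (size / GIGABYTE)
  else                      "%.#{precision}f TB" % (size / TERABYTE)
  end
rescue ArgumentError, TypeError
  nil                         # mirror the original: nil when conversion fails
end

[123, 1234, 1234567, 483989].each { |n| puts "#{n} -> #{human_size(n)}" }
puts human_size(483989, 0)
puts human_size('not a number').inspect
```

Rescuing only ArgumentError and TypeError is slightly narrower than the bare `rescue` in the listing; both are what `Kernel.Float` raises for unconvertible input.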

- (Object) prepare_command

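Builds the hadoop dfs command line to run: -dus when OPTIONS[:summary] is set, otherwise -du, applied to the paths in ARGV (or '.' when none are given).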



# File 'bin/hdp-du', line 17

def prepare_command
  dfs_cmd  = OPTIONS[:summary] ? 'dus' : 'du'
  dfs_args = ((!ARGV[0]) || ARGV[0]=='') ? '.' : "'#{ARGV.join("' '")}'"
  %Q{ hadoop dfs -#{dfs_cmd} #{dfs_args} }
end
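
To show the strings this produces without touching the ARGV and OPTIONS globals, here is a sketch of the same logic as a function of its inputs. `build_du_command` and its `summary:` keyword are hypothetical names introduced for this example.

```ruby
# Hypothetical, parameterised variant of prepare_command: the paths and the
# summary flag are passed in instead of being read from ARGV and OPTIONS.
def build_du_command(paths, summary: false)
  dfs_cmd  = summary ? 'dus' : 'du'
  dfs_args = (!paths[0] || paths[0] == '') ? '.' : "'#{paths.join("' '")}'"
  %Q{ hadoop dfs -#{dfs_cmd} #{dfs_args} }
end

puts build_du_command([]).inspect
puts build_du_command(['/data/a', '/data/b'], summary: true).inspect
```

Each path is wrapped in single quotes, so a path that itself contains a single quote would break the generated command. Ruby's standard `Shellwords.escape` is one way to harden this if such paths are expected.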