Top Level Namespace
- Includes: FileUtils, Wukong::Streamer
Defined Under Namespace
Modules: Enumerable, HadoopBinning, HashLike, Monkeyshines, Size, WordCount, Wukong
Classes: Array, BadRecord, BigDecimal, Bignum, Blob, Boolean, Class, Csv, Date, DateTime, EpochTime, FalseClass, FilePath, Fixnum, Flag, Float, Hash, IPAddress, Integer, Json, Mapper, NilClass, Numeric, Object, Pathname, PeriodicMonitor, Regex, String, Subdir, Summer, Symbol, Text, Time, TrueClass, TypedStruct, URI, WcMapper, WcReducer, Yaml
Constant Summary
- OPTIONS =
{}
- OUTPUT_LINE_FMT =
"%-71s\t%15d\t%15s"
- NEWLINE_LENGTH =
KLUDGE
$/.length
- USAGE =
%Q{
# h1. wulign -- format a tab-separated file as aligned columns
#
# wulign will intelligently reformat a tab-separated file into a tab-separated,
# space aligned file that is still suitable for further processing. For example,
# given the log-file input
#
#   # cat tag_usage.tsv
#   2009-07-21T21:39:40 day 65536 3.15479 68750 1171316
#   2009-07-21T21:39:45 doing 65536 1.04533 26230 1053956
#   2009-07-21T21:41:53 hapaxlegomenon 65536 0.87574e-05 23707 10051141
#   2009-07-21T21:44:00 concert 500 0.29290 13367 9733414
#   2009-07-21T21:44:29 world 65536 1.09110 32850 200916
#   2009-07-21T21:44:39 world+series 65536 0.49380 9929 7972025
#   2009-07-21T21:44:54 iranelection 65536 2.91775 14592 136342
#
# wulign will reformat it to read
#
#   # cat tag_usage.tsv | wu-lign
#   2009-07-21T21:39:40  day             65536  3.154791234  68750   1171316
#   2009-07-21T21:39:45  doing           65536  1.045330000  26230   1053956
#   2009-07-21T21:41:53  hapaxlegomenon  65536  0.000008757  23707  10051141
#   2009-07-21T21:44:00  concert           500  0.292900000  13367   9733414
#   2009-07-21T21:44:29  world           65536  1.091100000  32850    200916
#   2009-07-21T21:44:39  world+series    65536  0.493800000   9929   7972025
#   2009-07-21T21:44:54  iranelection    65536  2.917750000  14592    136342
#
# The fields are still tab-delimited by exactly one tab -- only spaces are used to
# pad out fields. You can still use cuttab and friends to manipulate columns.
#
# h2. Command-line arguments
#
# You can give sprintf-style positional arguments on the command line that will be
# applied to the corresponding columns. (Blank args are used for placeholding and
# auto-formatting is still applied). So with the example above,
#
#   cat foo | wulign '' '' '' '%8.4e'
#
# will format the fourth column with "%8.4e", while the first three columns and
# fifth-and-higher columns are formatted as usual.
#
#   ...
#   2009-07-21T21:39:45  doing           65536  1.0453e+00  26230   1053956
#   2009-07-21T21:41:53  hapaxlegomenon  65536  8.7574e-06  23707  10051141
#   2009-07-21T21:44:00  concert           500  2.9290e-01  13367   9733414
#   ....
#
# h2. How it works
#
# Wu-lign takes the first 500ish lines, splits into fields on TAB characters,
# and tries to guess the format (int, float, or string) for each. It builds a
# consensus of the width and type for corresponding columns in the chunk. If a
# column has mixed numeric and string formats it degrades to :mixed, which is
# basically treated as :string. If a column has mixed :float and :int elements all
# of them are formatted as float.
#
# h2. Notes
#
# * Header rows: the first line is used for width alignment but not for type detection.
#   This means that an initial row of text headers will inform column spacing
#   but still allow a column of floats (say) to be properly aligned as floats.
#
# * It requires a unanimous vote. One screwy line can coerce the whole mess to
#   :mixed; width formatting will still be applied, though.
#
# * It won't set columns wider than 100 chars -- this allows for the occasional
#   super-wide column without completely breaking your screen.
#
# * For :float values, wulign tries to guess at the right number of significant
#   digits to the left and right of the decimal point.
#
# * wulign parses only plain-jane 'TSV files': no quoting or escaping; every tab
#   delimits a field, every newline a record.
#
# wulign isn't intended to be smart, or correct, or reliable -- only to be
# useful for previewing and organizing tab-formatted files. In general
# wulign(foo).split("\t").map(&:strip) *should* give output semantically
# equivalent to its input. (That is, the only changes should be insertion of
# spaces and re-formatting of numerics.) But still -- reserve its use for human
# inspection only.
#
}
- FORMAT_GUESSING_LINES =
How many initial lines to use to guess formatting. Lines after this are simply reformatted according to the consensus of the initial FORMAT_GUESSING_LINES.
500
- MAX_MAX_WIDTH =
Widest column to set.
100
- INT_RE =
/\A[\d,]+\z/
- FLOAT_RE =
/\A([\d,]+)(?:\.(\d+))?(?:e-?\d+)?\z/
- LINES =
# Logging
MB = 1024*1024
LOG_INTERVAL = 100_000
$start = Time.now; $iter = 0; $size = 0
def log_line
  elapsed = (Time.now - $start).to_f
  $stderr.puts("%5d s\t%10.1f l/s\t%5dk<\t%5dk>\t%5d MB\t%9.1f MB/s\t%11d b/l"%[
    elapsed, $iter/elapsed, $iter/1000, LINES.count/1000,
    $size/MB, ($size/MB)/elapsed, $size/$iter ])
end
Set.new
- Log =
Wukong.logger
- EMR_CONFIG_DIR =
'~/.wukong'
Instance Method Summary
- - (Object) consensus_type(val, alltype, is_first)
- - (Object) dump_header(row, maxw)
- - (Object) dump_row(row, format)
- - (Object) f_width(str)
- - (Object) format_output(file, size)
- - (Object) get_type(val)
- - (Object) number_to_human_size(size, precision = 1)
Formats the bytes in size into a more understandable representation (e.g., giving it 1500 yields 1.5 KB).
- - (Object) prepare_command
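Builds the hadoop dfs command line: -dus when OPTIONS[:summary] is set, otherwise -du, over the paths given in ARGV (default '.').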
Instance Method Details
- (Object) consensus_type(val, alltype, is_first)
# File 'bin/wu-lign', line 112
def consensus_type val, alltype, is_first
  return :mixed if alltype == :mixed
  type = get_type(val) or return
  case
  when alltype.nil? then type
  when is_first && (alltype == :str) then type
  when alltype == type then type
  when ( ((alltype==:float) && (type == :int)) || ((alltype == :int) && (type == :float)) )
    :float
  else :mixed
  end
end
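To see how these branches combine, the sketch below copies `get_type` and `consensus_type` from this page (with the regexes from the constant summary) and calls them with hand-picked arguments. The sample values are made up for illustration, and how the real script chains these calls from row to row is not shown here, so each call is made in isolation.

```ruby
# Regexes and methods copied from the listings on this page.
INT_RE   = /\A[\d,]+\z/
FLOAT_RE = /\A([\d,]+)(?:\.(\d+))?(?:e-?\d+)?\z/

def get_type(val)
  case
  when val == ''       then nil
  when val =~ INT_RE   then :int
  when val =~ FLOAT_RE then :float
  else                      :str
  end
end

def consensus_type(val, alltype, is_first)
  return :mixed if alltype == :mixed
  type = get_type(val) or return   # blank cells yield nil
  case
  when alltype.nil?                  then type
  when is_first && (alltype == :str) then type
  when alltype == type               then type
  when ((alltype == :float) && (type == :int)) ||
       ((alltype == :int) && (type == :float))
    :float
  else :mixed
  end
end

p consensus_type('12',  nil,    false)  # => :int    (no consensus yet)
p consensus_type('3.5', :int,   false)  # => :float  (int and float merge to float)
p consensus_type('abc', :int,   false)  # => :mixed  (string in a numeric column)
p consensus_type('42',  :str,   true)   # => :int    (a :str consensus is overridden when is_first)
p consensus_type('42',  :mixed, false)  # => :mixed  (mixed is sticky)
p consensus_type('',    :int,   false)  # => nil     (blank cell: no type)
```

Note that a blank cell returns nil rather than the running consensus, so a caller presumably has to keep the previous value itself.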
- (Object) dump_header(row, maxw)
# File 'bin/wu-lign', line 172
def dump_header row, maxw
  puts row.zip(maxw).map{|col, width| "%-#{width}s" % col.to_s }.join("\t")
end
- (Object) dump_row(row, format)
# File 'bin/wu-lign', line 169
def dump_row row, format
  puts row.zip(format).map{|c,f| f.call(c) rescue c }.join("\t")
end
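The `rescue c` modifier means a cell that its formatter cannot handle is emitted unchanged instead of aborting the run. A minimal sketch of that behavior follows; `format_row` is a hypothetical variant that returns the string rather than printing it, and the two lambdas are illustrative formatters, not the ones the script builds.

```ruby
# Hypothetical variant of dump_row that returns the line instead of calling puts.
def format_row(row, format)
  row.zip(format).map { |c, f| f.call(c) rescue c }.join("\t")
end

# Illustrative per-column formatters (assumptions, not taken from the script).
formats = [
  ->(c) { "%-6s" % c },          # left-justified string, width 6
  ->(c) { "%8.3f" % Float(c) },  # float, width 8, 3 decimals; raises on non-numeric input
]

p format_row(%w[day 3.15479 extra], formats)
# => "day   \t   3.155\textra"   (third cell has no formatter, so it passes through)

p format_row(%w[doing n/a], formats)
# => "doing \tn/a"               (Float("n/a") raises, so the raw cell is kept)
```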
- (Object) f_width(str)
# File 'bin/wu-lign', line 125
def f_width str
  str =~ FLOAT_RE or return 0
  [$1.length, $2 ? $2.length : 0]
end
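`f_width` returns a two-element array (digits before the decimal point, digits after it) for anything matching `FLOAT_RE`, and the integer 0 otherwise. The sketch below copies the method and shows a few return values. The column-wide maximum at the end is only an assumption about how a caller might use the pairs; the script's own aggregation is not shown on this page.

```ruby
# Regex and method copied from this page.
FLOAT_RE = /\A([\d,]+)(?:\.(\d+))?(?:e-?\d+)?\z/

def f_width(str)
  str =~ FLOAT_RE or return 0
  [$1.length, $2 ? $2.length : 0]
end

p f_width('3.15479')      # => [1, 5]
p f_width('65536')        # => [5, 0]
p f_width('1,234.5')      # => [5, 1]  (commas count toward the left width)
p f_width('0.87574e-05')  # => [1, 5]  (the exponent is ignored)
p f_width('day')          # => 0       (not numeric)

# Assumed usage: widest left and right parts over a column, skipping non-numeric cells.
column = %w[3.15479 12.5 0.87574e-05 day]
widths = column.map { |v| f_width(v) }.reject { |w| w == 0 }
left   = widths.map(&:first).max
right  = widths.map(&:last).max
p [left, right]           # => [2, 5]
```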
- (Object) format_output(file, size)
# File 'bin/hdp-du', line 70
def format_output file, size
  human_size = number_to_human_size(size) || ""
  file = file.gsub(%r{hdfs://[^/]+/}, '/') # kill off hdfs paths, otherwise leave it alone
  OUTPUT_LINE_FMT % [file, size.to_i, human_size]
end
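As a rough illustration of the line this produces, the sketch below applies the same `gsub` and `OUTPUT_LINE_FMT` to one sample entry. `format_line` is a hypothetical stand-in that takes the human-readable size as an argument so the example stays self-contained; the host name and path are invented.

```ruby
OUTPUT_LINE_FMT = "%-71s\t%15d\t%15s"

# Hypothetical stand-in for format_output: the human-readable size is passed in
# rather than computed by number_to_human_size.
def format_line(file, size, human_size)
  file = file.gsub(%r{hdfs://[^/]+/}, '/')   # strip the hdfs://host:port prefix
  OUTPUT_LINE_FMT % [file, size.to_i, human_size]
end

line   = format_line('hdfs://namenode.example.com:8020/user/foo/part-00000', '1234567', '1.2 MB')
fields = line.split("\t")

p fields.map(&:strip)    # => ["/user/foo/part-00000", "1234567", "1.2 MB"]
p fields.map(&:length)   # => [71, 15, 15]
```

The path is left-justified in 71 characters and both size columns are right-justified in 15, so the output lines up as long as paths stay under 71 characters.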
- (Object) get_type(val)
# File 'bin/wu-lign', line 104
def get_type val
  case
  when val == '' then type = nil
  when val =~ INT_RE then type = :int
  when val =~ FLOAT_RE then type = :float
  else type = :str
  end
end
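Because the classification rests entirely on `INT_RE` and `FLOAT_RE`, a few inputs are typed in ways that may be surprising. The sketch below copies the method (dropping the unused `type =` assignments, which do not affect the return value) and runs it over some edge cases chosen for illustration.

```ruby
# Regexes copied from the constant summary on this page.
INT_RE   = /\A[\d,]+\z/
FLOAT_RE = /\A([\d,]+)(?:\.(\d+))?(?:e-?\d+)?\z/

def get_type(val)
  case
  when val == ''       then nil
  when val =~ INT_RE   then :int
  when val =~ FLOAT_RE then :float
  else                      :str
  end
end

p get_type('1,234')        # => :int    (thousands separators are accepted)
p get_type('0.87574e-05')  # => :float
p get_type('1e5')          # => :float  (exponent without a fractional part)
p get_type('-3')           # => :str    (no leading sign is allowed)
p get_type('1.5E3')        # => :str    (only a lowercase e is recognized)
p get_type('1e+5')         # => :str    (an explicit + in the exponent is not matched)
p get_type(',')            # => :int    (a lone comma satisfies INT_RE)
p get_type('')             # => nil
```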
- (Object) number_to_human_size(size, precision = 1)
Formats the bytes in size into a more understandable representation (e.g., giving it 1500 yields 1.5 KB). This method is useful for reporting file sizes to users. It returns nil if size cannot be converted into a number. The precision parameter sets the number of decimal places (default 1).
Examples
number_to_human_size(123) # => 123 Bytes
number_to_human_size(1234) # => 1.2 KB
number_to_human_size(12345) # => 12.1 KB
number_to_human_size(1234567) # => 1.2 MB
number_to_human_size(1234567890) # => 1.1 GB
number_to_human_size(1234567890123) # => 1.1 TB
number_to_human_size(1234567, 2) # => 1.18 MB
number_to_human_size(483989, 0) # => 473 KB
# File 'bin/hdp-du', line 55
def number_to_human_size(size, precision=1)
  size = Kernel.Float(size)
  case
  when size.to_i == 1; "1 Byte"
  when size < 1.kilobyte; "%d Bytes" % size
  when size < 1.megabyte; "%.#{precision}f KB" % (size / 1.0.kilobyte)
  when size < 1.gigabyte; "%.#{precision}f MB" % (size / 1.0.megabyte)
  when size < 1.terabyte; "%.#{precision}f GB" % (size / 1.0.gigabyte)
  else "%.#{precision}f TB" % (size / 1.0.terabyte)
  end #.sub(/([0-9]\.\d*?)0+ /, '\1 ' ).sub(/\. /,' ')
rescue
  nil
end
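The listing relies on `1.kilobyte` and similar numeric extensions, which plain Ruby does not provide. A minimal standalone sketch of the same thresholds is given below, assuming binary units (1 KB = 1024 bytes); `human_size` and the constant names are hypothetical, chosen to avoid clashing with the original method.

```ruby
# Assumed binary units; the original uses 1.kilobyte etc. from a numeric extension.
KILOBYTE = 1024.0
MEGABYTE = KILOBYTE * 1024
GIGABYTE = MEGABYTE * 1024
TERABYTE = GIGABYTE * 1024

# Hypothetical standalone equivalent of number_to_human_size.
def human_size(size, precision = 1)
  size = Float(size)
  case
  when size.to_i == 1  then "1 Byte"
  when size < KILOBYTE then "%d Bytes" % size
  when size < MEGABYTE then "%.#{precision}f KB" % (size / KILOBYTE)
  when size < GIGABYTE then "%.#{precision}f MB" % (size / MEGABYTE)
  when size < TERABYTE then "%.#{precision}f GB" % (size / GIGABYTE)
  else                      "%.#{precision}f TB" % (size / TERABYTE)
  end
rescue ArgumentError, TypeError
  nil   # input that Float() cannot convert
end

p human_size(123)         # => "123 Bytes"
p human_size(1234)        # => "1.2 KB"
p human_size(1234567, 2)  # => "1.18 MB"
p human_size(483989, 0)   # => "473 KB"
p human_size('abc')       # => nil
```

Unlike the bare `rescue` in the listing, this sketch rescues only the two errors `Float()` raises for bad input, so unrelated failures are not silently turned into nil.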
- (Object) prepare_command
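Builds the hadoop dfs command line: -dus when OPTIONS[:summary] is set, otherwise -du. The paths in ARGV are each wrapped in single quotes; with no arguments (or an empty first argument) the path defaults to '.'.
# File 'bin/hdp-du', line 17
def prepare_command
  dfs_cmd = OPTIONS[:summary] ? 'dus' : 'du'
  dfs_args = ((!ARGV[0]) || ARGV[0]=='') ? '.' : "'#{ARGV.join("' '")}'"
  %Q{ hadoop dfs -#{dfs_cmd} #{dfs_args} }
end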
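For a quick look at the strings this yields, the sketch below restates the logic with the paths and the summary flag passed as arguments instead of read from `ARGV` and `OPTIONS`. `build_du_command` and its `summary:` keyword are hypothetical names used only for this illustration.

```ruby
# Hypothetical parameterized version of prepare_command.
def build_du_command(paths, summary: false)
  dfs_cmd  = summary ? 'dus' : 'du'
  dfs_args = paths.first.to_s.empty? ? '.' : "'#{paths.join("' '")}'"
  %Q{ hadoop dfs -#{dfs_cmd} #{dfs_args} }
end

p build_du_command([], summary: true)
# => " hadoop dfs -dus . "

p build_du_command(['/data/a', '/data/b'])
# => " hadoop dfs -du '/data/a' '/data/b' "
```

The single-quote wrapping protects spaces in a path but not a path that itself contains a single quote; Ruby's standard `Shellwords.escape` would be one way to handle that case if it matters.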