Unicode is a computing industry standard for the consistent encoding, representation, and handling of text expressed in most of the world's writing systems. The latest version contains a repertoire of 136,755 characters covering 139 modern and historic scripts, as well as multiple symbol sets.

Java, current web browsers, and all relatively modern operating systems, represent text in Unicode.

Our server has a file located at /usr/share/unicode/UnicodeData.txt, also available via HTTP at "http://jeff.cis.cabrillo.edu/datasets/UnicodeData.txt" or "shorturl.at/abiKR" or "https://mega.nz/#!stglgJhS!SGdGCZD7Kzrn39yRl6ym3hI4ywX_lXhcHbmwS7nNHlA". This is one of the data files for the Unicode Character Database, namely for Version 10.0.0 of the Unicode Standard. Each line of this file contains up to 15 semicolon-delimited facts about a single Unicode character:

The character's code point, i.e. integer value, expressed in hexadecimal

The character's name

followed by up to 13 more facts about the character that are irrelevant to this assignment

Here is a preview of 10 lines from this file, namely the section describing characters Y through b:

0059;LATIN CAPITAL LETTER Y;Lu;0;L;;;;;N;;;;0079;
005A;LATIN CAPITAL LETTER Z;Lu;0;L;;;;;N;;;;007A;
005B;LEFT SQUARE BRACKET;Ps;0;ON;;;;;Y;OPENING SQUARE BRACKET;;;;
005C;REVERSE SOLIDUS;Po;0;ON;;;;;N;BACKSLASH;;;;
005D;RIGHT SQUARE BRACKET;Pe;0;ON;;;;;Y;CLOSING SQUARE BRACKET;;;;
005E;CIRCUMFLEX ACCENT;Sk;0;ON;;;;;N;SPACING CIRCUMFLEX;;;;
005F;LOW LINE;Pc;0;ON;;;;;N;SPACING UNDERSCORE;;;;
0060;GRAVE ACCENT;Sk;0;ON;;;;;N;SPACING GRAVE;;;;
0061;LATIN SMALL LETTER A;Ll;0;L;;;;;N;;;0041;;0041
0062;LATIN SMALL LETTER B;Ll;0;L;;;;;N;;;0042;;0042

Assignment

You shall write a program that accepts one command-line argument and prints information on each Unicode character described in file /usr/share/unicode/UnicodeData.txt whose name contains the argument string, case-insensitive. Characters shall be printed one per line, in ascending order of their names, in the following format:

name: character

(That is, the name of the character, followed by a colon and a space, followed by the character itself.)

Examples

An executable named cs12j_unicode_search_sorted on our server replicates the expected behavior of your program. It and your program should generate the following output given the same command-line arguments:

Command-line argument: smiling

Output:

BLACK SMILING FACE: ?
GRINNING CAT FACE WITH SMILING EYES: ??
GRINNING FACE WITH SMILING EYES: ??
KISSING FACE WITH SMILING EYES: ??
SLIGHTLY SMILING FACE: ??
SMILING CAT FACE WITH HEART-SHAPED EYES: ??
SMILING CAT FACE WITH OPEN MOUTH: ??
SMILING FACE WITH HALO: ??
SMILING FACE WITH HEART-SHAPED EYES: ??
SMILING FACE WITH HORNS: ??
SMILING FACE WITH OPEN MOUTH: ??
SMILING FACE WITH OPEN MOUTH AND COLD SWEAT: ??
SMILING FACE WITH OPEN MOUTH AND SMILING EYES: ??
SMILING FACE WITH OPEN MOUTH AND TIGHTLY-CLOSED EYES: ??
SMILING FACE WITH SMILING EYES: ??
SMILING FACE WITH SMILING EYES AND HAND COVERING MOUTH: ??
SMILING FACE WITH SUNGLASSES: ??
WHITE SMILING FACE: ?

Command-line argument: water

Output:

ALCHEMICAL SYMBOL FOR WATER: ??
CIRCLED IDEOGRAPH WATER: ?
CJK RADICAL WATER ONE: ?
CJK RADICAL WATER TWO: ?
HEXAGRAM FOR THE ABYSMAL WATER: ?
KANGXI RADICAL WATER: ?
NON-POTABLE WATER SYMBOL: ??
PARENTHESIZED IDEOGRAPH WATER: ?
POTABLE WATER SYMBOL: ??
TRIGRAM FOR WATER: ?
WATER BUFFALO: ??
WATER CLOSET: ??
WATER POLO: ??
WATER WAVE: ??
WATERMELON: ??

Command-line argument: 'alchemical sym' (note the single quotes)

Output (first 10 of 116 total lines):

ALCHEMICAL SYMBOL FOR AIR: ??
ALCHEMICAL SYMBOL FOR ALEMBIC: ??
ALCHEMICAL SYMBOL FOR ALKALI: ??
ALCHEMICAL SYMBOL FOR ALKALI-2: ??
ALCHEMICAL SYMBOL FOR ALUM: ??
ALCHEMICAL SYMBOL FOR AMALGAM: ??
ALCHEMICAL SYMBOL FOR ANTIMONY ORE: ??
ALCHEMICAL SYMBOL FOR AQUA REGIA: ??
ALCHEMICAL SYMBOL FOR AQUA REGIA-2: ??
ALCHEMICAL SYMBOL FOR AQUA VITAE: ??

Tips

Method java.lang.Integer.parseInt can convert strings representing binary, octal, decimal, and hexadecimal integers to their corresponding int values. Simply pass a second argument (the radix) of 2, 8, 10, or 16, respectively; the one-argument form of Integer.parseInt assumes decimal.
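As a small illustration of the radix argument, the sketch below parses the same kind of hexadecimal code point field that appears at the start of each data line. The class name RadixDemo and the helper parseCodePoint are hypothetical names chosen for this example only.

```java
public class RadixDemo {
    // Hypothetical helper: converts a hexadecimal field such as "1F4A9" to an int.
    public static int parseCodePoint(String hexField) {
        return Integer.parseInt(hexField, 16);
    }

    public static void main(String[] args) {
        System.out.println(Integer.parseInt("101", 2)); // binary string
        System.out.println(Integer.parseInt("17", 8));  // octal string
        System.out.println(Integer.parseInt("42"));     // decimal, one-argument form
        System.out.println(parseCodePoint("0059"));     // hexadecimal code point field
    }
}
```

Note that parseInt throws NumberFormatException when the string is not a valid number in the given radix, so malformed lines may need separate handling.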

Method split in java.lang.String is a convenient way of separating the various semicolon-delimited parts of each line from the file, e.g. someLine.split(";").
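One possible way to apply split to a data line is sketched below. The class RecordSplitDemo and the helper codeAndName are hypothetical; the limit argument of 3 is a design choice for this sketch, so that only the first two fields are separated and the remaining facts stay in one unsplit piece.

```java
public class RecordSplitDemo {
    // Hypothetical helper: returns { codePointHex, name } from one data line.
    public static String[] codeAndName(String line) {
        // With a limit of 3, the result has at most three elements:
        // field 0 (code point), field 1 (name), and everything after.
        String[] fields = line.split(";", 3);
        return new String[] { fields[0], fields[1] };
    }

    public static void main(String[] args) {
        String sample = "0059;LATIN CAPITAL LETTER Y;Lu;0;L;;;;;N;;;;0079;";
        String[] parts = codeAndName(sample);
        System.out.println(parts[0]); // code point in hexadecimal
        System.out.println(parts[1]); // character name
    }
}
```

The argument to split is a regular expression, but a semicolon has no special meaning in one, so it can be passed as-is.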

java.lang.String has several non-static methods that can identify whether one string can be found within another.
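For instance, a case-insensitive substring test can be sketched with String.contains after converting both strings to the same case. The class NameMatchDemo and the helper nameContains are hypothetical, and using Locale.ROOT is an assumption made here to keep the case conversion independent of the machine's locale.

```java
import java.util.Locale;

public class NameMatchDemo {
    // Hypothetical helper: true if name contains query, ignoring case.
    public static boolean nameContains(String name, String query) {
        String upperName = name.toUpperCase(Locale.ROOT);
        String upperQuery = query.toUpperCase(Locale.ROOT);
        return upperName.contains(upperQuery);
    }

    public static void main(String[] args) {
        System.out.println(nameContains("WHITE SMILING FACE", "smiling"));
        System.out.println(nameContains("LOW LINE", "water"));
    }
}
```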

Remember that java.util.ArrayList has methods like sort, get, and indexOf.
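A brief sketch of sorting names with ArrayList.sort follows. The class SortDemo and the helper sortedCopy are hypothetical; the sketch assumes that "ascending order of their names" means the natural String ordering, which agrees with the sample output for the argument water.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

public class SortDemo {
    // Hypothetical helper: returns a new list with the names in ascending order.
    public static List<String> sortedCopy(List<String> names) {
        ArrayList<String> copy = new ArrayList<>(names);
        copy.sort(Comparator.naturalOrder()); // natural (lexicographic) String order
        return copy;
    }

    public static void main(String[] args) {
        List<String> names = Arrays.asList("WATERMELON", "WATER WAVE", "WATER BUFFALO");
        for (String name : sortedCopy(names)) {
            System.out.println(name);
        }
    }
}
```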

Many of the characters described in the data file exceed char's range, and thus fall under the category of "supplementary characters" in Java. As such, make sure to use printf or String.format to print the character and use the int type to represent the code point in question, e.g.:

System.out.printf("%c %c %c", 0x1f4a9, 128169, Integer.parseInt("1F4A9", 16)); // same character 3x
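To show why an int is needed, the sketch below formats one character inside char's range and one supplementary character in the required "name: character" format. The class CodePointFormatDemo and the helper formatEntry are hypothetical names for this example.

```java
public class CodePointFormatDemo {
    // Hypothetical helper: builds "name: character" from a name and an int code point.
    public static String formatEntry(String name, int codePoint) {
        // %c accepts an int argument and treats it as a Unicode code point.
        return String.format("%s: %c", name, codePoint);
    }

    public static void main(String[] args) {
        System.out.println(formatEntry("LATIN CAPITAL LETTER Y", 0x59));
        System.out.println(formatEntry("PILE OF POO", 0x1F4A9));
        // A supplementary character occupies two char values in a Java String.
        System.out.println(Character.charCount(0x59));
        System.out.println(Character.charCount(0x1F4A9));
    }
}
```

Casting a supplementary code point to char would silently discard its upper bits, which is why the code point stays an int all the way to the format call.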

Depending on your operating system, terminal emulator, and settings thereof, you may not always be able to see all characters rendered appropriately, or they may look different than on this webpage, etc. Don't worry too much about that, as long as your output looks right for other characters.
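The tips above can be combined into a minimal sketch of the whole program. This is one possible approach, not the reference implementation: the class name UnicodeSearchSketch and the helper search are hypothetical, the file is assumed to be readable as UTF-8, and entries are sorted by name alone (sorting the formatted "name: character" strings instead would misorder names such as ALKALI and ALKALI-2, because '-' sorts before ':').

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Locale;

public class UnicodeSearchSketch {
    // Hypothetical helper: returns "name: character" entries, sorted by name,
    // for every data line whose name contains query (case-insensitive).
    public static List<String> search(List<String> lines, String query) {
        String needle = query.toUpperCase(Locale.ROOT);
        List<String[]> matches = new ArrayList<>(); // each element: { name, formatted entry }
        for (String line : lines) {
            String[] fields = line.split(";", 3);
            if (fields.length < 2) {
                continue; // skip blank or malformed lines
            }
            String name = fields[1];
            if (name.toUpperCase(Locale.ROOT).contains(needle)) {
                int codePoint = Integer.parseInt(fields[0], 16);
                matches.add(new String[] { name, String.format("%s: %c", name, codePoint) });
            }
        }
        matches.sort(Comparator.comparing((String[] entry) -> entry[0]));
        List<String> result = new ArrayList<>();
        for (String[] entry : matches) {
            result.add(entry[1]);
        }
        return result;
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("usage: java UnicodeSearchSketch <search string>");
            return;
        }
        List<String> lines = Files.readAllLines(
                Paths.get("/usr/share/unicode/UnicodeData.txt"), StandardCharsets.UTF_8);
        for (String entry : search(lines, args[0])) {
            System.out.println(entry);
        }
    }
}
```

Keeping the search logic in a method that takes a list of lines makes it possible to try it on a few sample lines without access to the server's file.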
