When importing data through a data connector, you may sometimes find a column named something like `col_1` when you expected a name of `col`. In this blog post we will cover why this happens, preventative measures to take before importing, and how to fix the issue after it has happened.

# Why does this happen?

First, let's dig into why this issue happens in the first place. The best way to understand what is happening is through a demonstration, so let's run a data connector import and walk through what happens with the data.

In this example we will use the TD Toolbelt command line to import files using the [Amazon S3 Import Connector](https://docs.treasuredata.com/display/public/INT/Amazon+S3+Import+Integration+v1). We prepared the following yml file to import files on S3. Note that the first letter of each column name is capitalized in the column configuration.

`load_from_s3_blog.yml`

```yaml
in:
  type: s3
  access_key_id: XXXXXXX
  secret_access_key: XXXXXXXXX
  bucket: kohkitest
  path_prefix: test_folder/testfile.csv
  parser:
    charset: UTF-8
    newline: LF
    type: csv
    delimiter: ","
    quote: "\""
    escape: "\""
    trim_if_not_quoted: false
    skip_header_lines: 1
    allow_extra_columns: false
    allow_optional_columns: false
    columns:
    - {name: Id, type: long}
    - {name: Name, type: string}
    - {name: Company_id, type: long}
out: {mode: append}
exec: {}
filters:
- from_value: {mode: upload_time}
  to_column: {name: time}
  type: add_time
```

The target file, `testfile.csv`, placed on S3, is shown below.

```csv
id,name,company_id
1,Taro,111111
2,Hanako,22222
```

The following table is created after importing with this .yml file.

![](/assets/yml_import.66767d37b5d91a62c69e0bdb90de2b744443c1892f780941116cf5b617ef5386.9e716762.png)

Looking at the Schema tab, the values for **COLUMN** and **QUERY AS** are different: **COLUMN** is set to the column name originally defined in the .yml file, while **QUERY AS** is that same name in all lowercase letters.

![image](/assets/schema_1.9a43cc30d49ae7cf13d47e88c2f06d91fb9edfc98b108d3f52333a285315bc4e.9e716762.png)

This happens because Treasure Data allows only lowercase letters, numbers, and underscores (`[a-z0-9_]`) in column names. The column name as defined in the .yml file is retained as **COLUMN**, and a lowercased alias of it, **QUERY AS**, becomes the column name on the table. It is this alias that you reference in SQL. You can find more details about this in the [Naming requirements and restrictions in Treasure Data](https://docs.treasuredata.com/display/PD/Naming+Requirements+and+Restrictions+for+Treasure+Data+Entities#NamingRequirementsandRestrictionsforTreasureDataEntities-Columns) documentation. The important snippet is highlighted below:

> only lower case letters, numbers, the _ (underscore)
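To make the distinction concrete: in SQL you always write the lowercase alias, never the capitalized name from the .yml. A minimal sketch, assuming the data above was imported into a table named `hatena_blog_test` (the table name used later in this post):

```sql
-- Reference columns by their lowercase alias (QUERY AS),
-- not by the capitalized names defined in the .yml (COLUMN)
select id, name, company_id
from hatena_blog_test
```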
Next, let's change the .yml file to align the column name definitions in `columns` to lowercase.

`load_from_s3_blog.yml`

```yml
in:
  type: s3
  access_key_id: XXXXXXX
  secret_access_key: XXXXXXXXX
  bucket: kohkitest
  path_prefix: test_folder/testfile.csv
  parser:
    charset: UTF-8
    newline: LF
    type: csv
    delimiter: ","
    quote: "\""
    escape: "\""
    trim_if_not_quoted: false
    skip_header_lines: 1
    allow_extra_columns: false
    allow_optional_columns: false
    columns:
    - {name: id, type: long}
    - {name: name, type: string}
    - {name: company_id, type: long}
out: {mode: append}
exec: {}
filters:
- from_value: {mode: upload_time}
  to_column: {name: time}
  type: add_time
```

After performing the import again in this way, the destination table looked like this:

![image](/assets/import_2.a1c75c23f7f94fdb0e282e9fd09dd091c4e9ac09c15be84560fbb012f5aad222.9e716762.png)

Checking the Schema tab, a new column named `name_1` has been created as column #6, and the imported data landed in that column.

![image](/assets/schema_2.dd56c79c2d714df183b4d59f6d0c543f00d8fb0509446379fd432d29690a4a7f.9e716762.png)

The reason for this behavior is that the combination of the **COLUMN** and **QUERY AS** values explained earlier, that is, the combination of the actual column name and its alias, differs from that of the existing column. The column is therefore judged to be a separate column, and consequently `name_1` is created.

# How to solve this

There are two ways to solve this problem. The first is to prevent it from happening before you import the data, by using the `rename` filter in your .yml file. The second is to fix the issue after the data import, by running a SQL query to rebuild the table.

## Solution: Prevention with the `rename` filter

To prevent this from happening, it is safest to use the [rename](https://docs.treasuredata.com/display/PD/rename+Filter+Function) filter function. This function automatically renames the actual column names defined in the .yml according to the specified rules. It is generally a good idea to use this function to rename column name definitions to lowercase. Below is an example of how to rewrite the .yml file to include the rename filter function.

```yml
in:
  type: s3
  access_key_id: XXXXXXX
  secret_access_key: XXXXXXXXX
  bucket: kohkitest
  path_prefix: test_folder/testfile.csv
  parser:
    charset: UTF-8
    newline: LF
    type: csv
    delimiter: ","
    quote: "\""
    escape: "\""
    trim_if_not_quoted: false
    skip_header_lines: 1
    allow_extra_columns: false
    allow_optional_columns: false
    columns:
    - {name: Id, type: long}
    - {name: Name, type: string}
    - {name: Company_id, type: long}
filters:
- type: rename
  rules:
  - rule: upper_to_lower
- from_value: {mode: upload_time}
  to_column: {name: time}
  type: add_time
out: {mode: append}
exec: {}
```

If you import data using the above configuration, you will see that both **COLUMN** and **QUERY AS** are lowercase, as intended, with no extra columns added.

![image](/assets/schema_3.768ce10757192f3479a6f37073ebb2bbbafdd82a90c76b5d2140d272bfee78cf.9e716762.png)

By utilizing the rename filter in this way, you can rename column names based on your settings.
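`upper_to_lower` is not the only rule the rename filter accepts. If your source headers also contain characters outside `[a-z0-9_]` (spaces, hyphens, and so on), a `character_types` rule can normalize those as well. Below is a sketch of such a `filters` block; the rule parameters shown (`pass_types`, `pass_characters`, `replace`) are our reading of the rename filter documentation linked above, so verify them against the docs before relying on this:

```yml
filters:
- type: rename
  rules:
  # First, lowercase all letters
  - rule: upper_to_lower
  # Then replace every character that is not a-z, 0-9, or _ with an underscore
  - rule: character_types
    pass_types: ["a-z", "0-9"]
    pass_characters: "_"
    replace: "_"
- from_value: {mode: upload_time}
  to_column: {name: time}
  type: add_time
```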
## Solution: Post data import SQL query

If a column like `col_1` already exists, you can rebuild the table by executing a SQL query against the table in which the extra column occurred, storing the data again with **COLUMN** and **QUERY AS** all in lowercase. In this example, the following query rebuilds the table so that all data is stored in the lowercase columns.

```sql
drop table if exists hatena_blog_test2;

create table if not exists hatena_blog_test2 as
select time, id, name, company_id
from hatena_blog_test
where id is not null
union all
select time, id_1 as id, name_1 as name, company_id_1 as company_id
from hatena_blog_test
where id_1 is not null
```

After creation, the table will look like this, with both **COLUMN** and **QUERY AS** aligned in lowercase. The table is now complete.

![image](/assets/schema_4.8620fdb937ac82ece182f397071050d662167792f9caabeeae6f02ff8fdaf168.9e716762.png)

### Appendix: Changing column names from Sources in the GUI

When importing from Sources on the Treasure Data console, the Schema Settings in the **Data Settings** section define the actual column names that would otherwise be set in the .yml file. You can change a column name to any name by clicking it.

![image](/assets/edit_column_name_from_ui.65c5463cfc60cef396cd7c5e959671825d788adb029dfb74e82aa13846980f29.9e716762.png)

For jobs imported from Sources, if column names contain uppercase letters or other disallowed characters, the rename filter is applied automatically during the import process; the user does not have to specify it.

![image](/assets/job_activities.e441242f57733490c2529be888d913ca1a5063410f2c8c4ca896a7fbde0bc88e.9e716762.png)

# Summary

That's it. We've covered why this happens, how to prevent it, and how to clean up after you've already encountered the issue. We strongly suggest liberal use of the rename filter, as an ounce of prevention is worth a pound of cure.