Solutions for Upgrading Apache DolphinScheduler from Version 1.3.4 to 3.1.2

by William GuoAugust 8th, 2024

Too Long; Didn't Read

Apache DolphinScheduler is being upgraded from 1.3.4 to 3.1.2. The official upgrade documentation provides the upgrade scripts. For minor version updates, executing the script suffices, but upgrading across multiple major versions can still lead to various issues.

featured image - Solutions for Upgrading Apache DolphinScheduler from Version 1.3.4 to 3.1.2

Due to the need to promote the upgrade of Apache DolphinScheduler at work, preliminary research that I have done shows significant improvements from version 1.3.4 to 3.1.2, with substantial enhancements in performance and functionality. So, an upgrade is recommended.

The official upgrade documentation provides the upgrade scripts. For minor version updates, executing the script suffices, but upgrading across multiple major versions can still lead to various issues. The following is a summary of these issues.

Old Version: 1.3.4 New Version: 3.1.2

Issue Collection

Resource Center Error

After the upgrade, using the resource center results in an error: IllegalArgumentException: Failed to specify server's Kerberos principal name.

The resource center uses HDFS with Kerberos authentication enabled.

Solution:

Edit dolphinscheduler/api-server/conf/hdfs-site.xml and add the following content:

<property>
    <name>dfs.namenode.kerberos.principal.pattern</name>
    <value>*</value>
</property>

2. Task Instance Log Loss

After the upgrade, viewing task instance logs results in an error indicating that the logs cannot be found. By checking the error message and the new version’s directory structure and log paths, it is found that the log path has changed.

Before upgrade: log path is under /logs/
After the upgrade: the log path is under /worker-server/logs/

Solution: Execute SQL to modify the log path

update t_ds_task_instance set log_path=replace(log_path,'/logs/','/worker-server/logs/');

Then copy the original log files to the new log path:

cp -r {old_dolphinscheduler_directory}/logs/[1-9]* {new_dolphinscheduler_directory}/worker-server/logs/*

3. Error Creating Workflow After Upgrade

The error message indicates that the initial values of the primary keys in t_ds_process_definition_log and t_ds_process_definition are inconsistent. Synchronizing these values resolves the issue.

Solution: Execute the following SQL

-- Get the auto-increment value of the primary key
select AUTO_INCREMENT FROM information_schema.TABLES WHERE TABLE_SCHEMA = 'dolphinscheduler' AND TABLE_NAME = 't_ds_process_definition' limit 1;

-- Use the result of the above SQL to execute the following
alter table dolphinscheduler_bak1.t_ds_process_definition_log auto_increment = {max_id};

4. Empty Task Instance List After Upgrade

By checking the SQL query in dolphinscheduler-dao/src/main/resources/org/apache/dolphinscheduler/dao/mapper/TaskInstanceMapper.xml, it is found that the SQL for querying task instance lists references the t_ds_task_definition_log table, but the join condition define.code=instance.task_code does not match.

Considering the condition define.project_code = #{projectCode}, the join with t_ds_task_definition_log is mainly for filtering by projectCode. Modify the SQL as follows:

Solution:

select
    <include refid="baseSqlV2">
        <property name="alias" value="instance"/>
    </include>
    ,
    process.name as process_instance_name
    from t_ds_task_instance instance
    -- left join t_ds_task_definition_log define 
    -- on define.code = instance.task_code and 
    -- define.version = instance.task_definition_version
    join t_ds_process_instance process
    on process.id = instance.process_instance_id
    join t_ds_process_definition define
    on define.code = process.process_definition_code
    where define.project_code = #{projectCode}
    <if test="startTime != null">
        and instance.start_time <![CDATA[ >= ]] > #{startTime}
    </if>

By directly joining with t_ds_process_definition, which also contains the project_code field for filtering, the query will return the correct data.

5. Null Pointer Exception During Upgrade Script Execution

(1) Analyze the log and locate the issue at line 517 in UpgradeDao.java

Original Code:

513 if (TASK_TYPE_SUB_PROCESS.equals(taskType)) {
514                       JsonNode jsonNodeDefinitionId = param.get("processDefinitionId");
515                       if (jsonNodeDefinitionId != null) {
516                           param.put("processDefinitionCode",
517                                  processDefinitionMap.get(jsonNodeDefinitionId.asInt()).getCode());
518                            param.remove("processDefinitionId");
519                        }
520                    }

The issue is that processDefinitionMap.get(jsonNodeDefinitionId.asInt()) returns null. Add a null check, and if null, log the relevant information for later review.

Solution:

if (jsonNodeDefinitionId != null) {
    if (processDefinitionMap.get(jsonNodeDefinitionId.asInt()) != null) {
        param.put("processDefinitionCode",processDefinitionMap.get(jsonNodeDefinitionId.asInt()).getCode());
        param.remove("processDefinitionId");
    } else {
        logger.error("*******************error");
        logger.error("*******************param:" + param);
        logger.error("*******************jsonNodeDefinitionId:" + jsonNodeDefinitionId);
    }
}

(2) Analyze the log and locate the issue at line 675 in UpgradeDao.java

Original Code:

669 if (mapEntry.isPresent()) {
670                            Map.Entry<Long, Map<String, Long>> processCodeTaskNameCodeEntry = mapEntry.get();
671                            dependItem.put("definitionCode", processCodeTaskNameCodeEntry.getKey());
672                            String depTasks = dependItem.get("depTasks").asText();
673                            long taskCode =
674                                    "ALL".equals(depTasks) || processCodeTaskNameCodeEntry.getValue() == null ? 0L
675                                            : processCodeTaskNameCodeEntry.getValue().get(depTasks);
676                            dependItem.put("depTaskCode", taskCode);
677                        }

The issue is that processCodeTaskNameCodeEntry.getValue().get(depTasks) returns null. Modify the logic to check for null before assigning the value and log the relevant information.

Solution:

long taskCode =0;
                            if (processCodeTaskNameCodeEntry.getValue() != null
                                    &&processCodeTaskNameCodeEntry.getValue().get(depTasks)!=null){
                                taskCode =processCodeTaskNameCodeEntry.getValue().get(depTasks);
                            }else{
                                logger.error("******************** depTasks:"+depTasks);
                                logger.error("******************** taskCode not in "+JSONUtils.toJsonString(processCodeTaskNameCodeEntry));
                            }
                            dependItem.put("depTaskCode", taskCode);

6. Login Failure After LDAP Integration, Uncertain Email Field Name

Configure LDAP integration in api-server/conf/application.yaml

security:
  authentication:
    # Authentication types (supported types: PASSWORD,LDAP)
    type: LDAP
    # IF you set type `LDAP`, below config will be effective
    ldap:
      # ldap server config
      urls: xxx
      base-dn: xxx
      username: xxx
      password: xxx
      user:
        # admin userId when you use LDAP login
        admin: xxx
        identity-attribute: xxx
        email-attribute: xxx
        # action when ldap user is not exist (supported types: CREATE,DENY)
        not-exist-action: CREATE

To successfully integrate LDAP, the fields urls, base-dn, username, password, identity, and email must be correctly filled. If the email field name is unknown, leave it empty initially.

After starting the service, attempt to log in with an LDAP user.

Solution: The LDAP authentication code is located in dolphinscheduler-api/src/main/java/org/apache/dolphinscheduler/api/security/impl/ldap/LdapService.java in the ldapLogin() method

ctx = new InitialLdapContext(searchEnv, null);
SearchControls sc = new SearchControls();
sc.setReturningAttributes(new String[]{ldapEmailAttribute});
sc.setSearchScope(SearchControls.SUBTREE_SCOPE);
EqualsFilter filter = new EqualsFilter(ldapUserIdentifyingAttribute, userId);
NamingEnumeration<SearchResult> results = ctx.search(ldapBaseDn, filter.toString(), sc);
if (results.hasMore()) {
    // get the users DN (distinguishedName) from the result
    SearchResult result = results.next();
    NamingEnumeration<? extends Attribute> attrs = result.getAttributes().getAll();
    while (attrs.hasMore()) {
        // Open another connection to the LDAP server with the found DN and the password
        searchEnv.put(Context.SECURITY_PRINCIPAL, result.getNameInNamespace());
        searchEnv.put(Context.SECURITY_CREDENTIALS, userPwd);
        try {
            new InitialDirContext(searchEnv);
        } catch (Exception e) {
            logger.warn("invalid ldap credentials or ldap search error", e);
            return null;
        }
        Attribute attr = attrs.next();
        if (attr.getID().equals(ldapEmailAttribute)) {
            return (String) attr.get();
        }
    }
}

The third line filters based on the specified email attribute field. Comment out this line temporarily:

// sc.setReturningAttributes(new String[]{ldapEmailAttribute});

Rerun the code; the tenth line will return all fields:

NamingEnumeration<? extends Attribute> attrs = result.getAttributes().getAll();

By printing or debugging, identify the correct email field and update the configuration file accordingly. Then, uncomment the code and restart the service to integrate LDAP login successfully.

7. Resource File Authorization Not Effective for Regular Users by Admin

After multiple tests, it is found that regular users can only see resource files owned by themselves. Authorization by the admin does not make the files visible.

Solution:

In the listAuthorizedResource method of ResourcePermissionCheckServiceImpl.java file located at dolphinscheduler-api/src/main/java/org/apache/dolphinscheduler/api/permission/, modify the return collection to relationResources:

@Override
        public Set<Integer> listAuthorizedResource(int userId, Logger logger) {
            List<Resource> relationResources;
            if (userId == 0) {
                relationResources = new ArrayList<>();
            } else {
                // query resource relation
                List<Integer> resIds = resourceUserMapper.queryResourcesIdListByUserIdAndPerm(userId, 0);
                relationResources = CollectionUtils.isEmpty(resIds) ? new ArrayList<>() : resourceMapper.queryResourceListById(resIds);
            }
            List<Resource> ownResourceList = resourceMapper.queryResourceListAuthored(userId, -1);
            relationResources.addAll(ownResourceList);
            return relationResources.stream().map(Resource::getId).collect(toSet()); // Fixed the issue that the resource file authorization was invalid
//            return ownResourceList.stream().map(Resource::getId).collect(toSet());
        }

Check the new version’s change log and find that this bug has been fixed in version 3.1.3: GitHub PR #13318

8. Kerberos Expiry Issues

Kerberos is configured with a ticket expiration time, causing the HDFS resources in the resource center to become inaccessible after a period. The best solution is to add logic for periodic credential renewal.

Solution:

Add the following method in CommonUtils.javalocated at dolphinscheduler-service/src/main/java/org/apache/dolphinscheduler/service/utils/:

/**
 * Periodically update credentials
 */
private static void startCheckKeytabTgtAndReloginJob() {
    // Periodically update credentials daily
    Executors.newScheduledThreadPool(1).scheduleWithFixedDelay(() -> {
        try {
            UserGroupInformation.getLoginUser().checkTGTAndReloginFromKeytab();
            logger.warn("Check Kerberos Tgt And Relogin From Keytab Finish.");
        } catch (IOException e) {
            logger.error("Check Kerberos Tgt And Relogin From Keytab Error", e);
        }
    }, 0, 1, TimeUnit.DAYS);
    logger.info("Start Check Keytab TGT And Relogin Job Success.");
}

Then call this method in the loadKerberosConfmethod before returning true:

public static boolean loadKerberosConf(String javaSecurityKrb5Conf, String loginUserKeytabUsername,
                                       String loginUserKeytabPath, Configuration configuration) throws IOException {
    if (CommonUtils.getKerberosStartupState()) {
        System.setProperty(Constants.JAVA_SECURITY_KRB5_CONF, StringUtils.defaultIfBlank(javaSecurityKrb5Conf,
                PropertyUtils.getString(Constants.JAVA_SECURITY_KRB5_CONF_PATH)));
        configuration.set(Constants.HADOOP_SECURITY_AUTHENTICATION, Constants.KERBEROS);
        UserGroupInformation.setConfiguration(configuration);
        UserGroupInformation.loginUserFromKeytab(
                StringUtils.defaultIfBlank(loginUserKeytabUsername,
                        PropertyUtils.getString(Constants.LOGIN_USER_KEY_TAB_USERNAME)),
                StringUtils.defaultIfBlank(loginUserKeytabPath,
                        PropertyUtils.getString(Constants.LOGIN_USER_KEY_TAB_PATH)));
        startCheckKeytabTgtAndReloginJob();  // Call here
        return true;
    }
    return false;
}

This article primarily records the issues encountered during the upgrade process and aims to help the community that facing similar challenges.

L O A D I N G
. . . comments & more!