  • CS50 Week 6: Problem Set, DNA
    Programming/CS50 2023. 7. 21. 15:07

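    This post walks through a solution to DNA, the Week 6 Problem Set from Harvard's CS50 course.
    The assignment is to identify a person from the results of a DNA analysis.
    The solution uses Python's pandas library to handle the CSV data.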

    Task

    $ python dna.py databases/large.csv sequences/5.txt
    Lavender
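
To make the expected output concrete: for each STR listed in the database header, the program has to find the longest run of back-to-back repeats in the sequence, then look for the person whose counts all match. The snippet below is a minimal sketch of that counting step only. The `longest_run` helper and the sample string are made up for illustration and are not part of the CS50 distribution code.

```python
def longest_run(sequence: str, str_unit: str) -> int:
    """Return the largest n such that str_unit repeated n times occurs in sequence."""
    n = 0
    # Keep extending the repeated pattern until it no longer appears anywhere.
    while str_unit * (n + 1) in sequence:
        n += 1
    return n


# Hypothetical sequence: AGATC three times in a row, then TTTT, then one more AGATC.
sample = "AGATC" * 3 + "TTTT" + "AGATC"

counts = {unit: longest_run(sample, unit) for unit in ("AGATC", "TTTT", "AATG")}
print(counts)  # {'AGATC': 3, 'TTTT': 1, 'AATG': 0}
```

The isolated fourth AGATC does not raise the count, because only consecutive repeats form a run.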

    Code

    I had used pandas before in a class at school, so I used it again here.

    import sys
    import pandas as pd
    
    def main():
        # Check command-line usage: expect a database CSV and a sequence file
        if len(sys.argv) != 3:
            print("Usage: python dna.py data.csv sequence.txt")
            sys.exit(1)
    
        # Read the database into a DataFrame indexed by name, so every column is an STR
        DNA_DB = pd.read_csv(sys.argv[1], index_col="name")
    
        # Read the DNA sequence file into a string
        with open(sys.argv[2], 'r') as file:
            DNA_sequence = file.read()
    
        # Find the longest run of each STR (one per database column) in the DNA sequence
        subsequences = DNA_DB.columns
        profile = [longest_match(DNA_sequence, subsequence) for subsequence in subsequences]
    
        # Keep the rows whose STR counts equal the profile and print the first matching name
        match = DNA_DB.loc[DNA_DB.apply(lambda row: row.tolist() == profile, axis=1)].index
        if len(match) == 0:
            print("No match")
        else:
            print(match[0])
    
    
    
    def longest_match(sequence, subsequence):
        """Returns length of longest run of subsequence in sequence."""
    
        # Initialize variables
        longest_run = 0
        subsequence_length = len(subsequence)
        sequence_length = len(sequence)
    
        # Check each character in sequence for most consecutive runs of subsequence
        for i in range(sequence_length):
    
            # Initialize count of consecutive runs
            count = 0
    
            # Check for a subsequence match in a "substring" (a subset of characters) within sequence
            # If a match, move substring to next potential match in sequence
            # Continue moving substring and checking for matches until out of consecutive matches
            while True:
    
                # Adjust substring start and end
                start = i + count * subsequence_length
                end = start + subsequence_length
    
                # If there is a match in the substring
                if sequence[start:end] == subsequence:
                    count += 1
    
                # If there is no match in the substring
                else:
                    break
    
            # Update most consecutive matches found
            longest_run = max(longest_run, count)
    
        # After checking for runs at each character in sequence, return longest run found
        return longest_run
    
    if __name__ == "__main__":
        main()
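
For comparison, the matching step can also be done without pandas. The sketch below swaps in the standard library's `csv.DictReader` in place of the DataFrame approach above. The `find_match` helper and the three sample rows are hypothetical, chosen only to mirror the layout of the CS50 database files (a `name` column followed by one column per STR).

```python
import csv
import io

# Hypothetical miniature database in the same layout as the CS50 CSV files.
SAMPLE_CSV = """name,AGATC,AATG,TATC
Alice,2,8,3
Bob,4,1,5
Charlie,3,2,5
"""


def find_match(csv_file, profile):
    """Return the first name whose STR counts all equal profile, or None."""
    reader = csv.DictReader(csv_file)
    # Every column after "name" is an STR.
    str_names = reader.fieldnames[1:]
    for row in reader:
        # DictReader yields strings, so convert each count before comparing.
        if all(int(row[s]) == profile[s] for s in str_names):
            return row["name"]
    return None


# In the real program the profile would come from longest_match on each STR.
profile = {"AGATC": 4, "AATG": 1, "TATC": 5}
print(find_match(io.StringIO(SAMPLE_CSV), profile) or "No match")  # Bob
```

The trade-off is mostly about convenience: pandas parses the counts as integers automatically and allows the whole comparison in one expression, while the `csv` version needs an explicit `int` conversion but has no third-party dependency.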
